How to improve the performance of Drill query with many in-lists

Document created by Hao Zhu Employee on Feb 7, 2016
Version 1Show Document
  • View in full screen mode

Author: Hao Zhu

 

Original Publication Date: August 25, 2015

 

Environment :

Drill 1.1

Goal

How to improve the performance of Drill query with many in-lists.

Root Cause

For Drill query like: select * from table where col in (1,2,3,...);

The in-list evaluation is done in sequential. It means if you increase the number of in-list, the performance of that query could degrade linearly.

 

However if the number of in-list elements reaches 20(including 20),  Drill can optimize this query to use a in-memory hash table to store the in-list elements, and then do a table join instead. This optimization can increase the performance a lot.

 

See the differences in below 2 plans.

 

a. No optimization

 

explain plan for

select count(1) from h1_passwords where cast(col2 as int) in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19);

+------+------+

| text | json |

+------+------+

| 00-00 Screen

00-01 StreamAgg(group=[{}], EXPR$0=[COUNT()])

00-02 Project($f0=[1])

00-03 SelectionVectorRemover

00-04 Filter(condition=[OR(=(CAST($0):INTEGER, 1), =(CAST($0):INTEGER, 2), =(CAST($0):INTEGER, 3), =(CAST($0):INTEGER, 4), =(CAST($0):INTEGER, 5), =(CAST($0):INTEGER, 6), =(CAST($0):INTEGER, 7), =(CAST($0):INTEGER, 8), =(CAST($0):INTEGER, 9), =(CAST($0):INTEGER, 10), =(CAST($0):INTEGER, 11), =(CAST($0):INTEGER, 12), =(CAST($0):INTEGER, 13), =(CAST($0):INTEGER, 14), =(CAST($0):INTEGER, 15), =(CAST($0):INTEGER, 16), =(CAST($0):INTEGER, 17), =(CAST($0):INTEGER, 18), =(CAST($0):INTEGER, 19))])

00-05 Scan(groupscan=[HiveScan [table=Table(dbName:default, tableName:h1_passwords), inputSplits=[maprfs:///user/hive/warehouse/h1_passwords/passwd:0+1680], columns=[`col2`], partitions= null]])

b. With optimization

 

explain plan for

select count(1) from h1_passwords where cast(col2 as int) in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);

+------+------+

| text | json |

+------+------+

| 00-00 Screen

00-01 StreamAgg(group=[{}], EXPR$0=[COUNT()])

00-02 Project($f0=[1])

00-03 Project(f6=[$1], ROW_VALUE=[$0])

00-04 MergeJoin(condition=[=($1, $0)], joinType=[inner])
00-06 SelectionVectorRemover
00-08 Sort(sort0=[$0], dir0=[ASC])
00-10 HashAgg(group=[{0}])
00-12 Values

00-05 SelectionVectorRemover

00-07 Sort(sort0=[$0], dir0=[ASC])

00-09 Project(f6=[CAST($0):INTEGER])

00-11 Scan(groupscan=[HiveScan [table=Table(dbName:default, tableName:h1_passwords), inputSplits=[maprfs:///user/hive/warehouse/h1_passwords/passwd:0+1680], columns=[`col2`], partitions= null]])

 

Solution:

Add duplicate in-list columns to make the number of in-list reach 20 to take advantage of this optimization.

 

For example, change the where condition from:

 

"where col1 in (1,2,3)"

To:

"where col1 in (1,2,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)"

 

 

Improvement JIRA DRILL-3710 is filed to make number "20" configurable.

Attachments

    Outcomes