AnsweredAssumed Answered

slow queries on apache drill

Question asked by brett_ on Jan 17, 2017
Latest reply on Jan 17, 2017 by brett_

Hi, I am benchmarking several file formats and comparing drill and hive queries to see if we want to go forward with leveraging MapR Drill for ad-hoc queries. I'm running into something that doesn't make sense: drill "SELECT * FROM" queries taking a long time, 30 - 60 seconds. We have 12 drillbits running on a MapR development cluster.

 

Here is our hive configuration plugin:

{
"type": "hive",
"enabled": true,
"configProps": {
"hive.metastore.uris": "thrift://hivemetastore01:9083",
"javax.jdo.option.ConnectionURL": "jdbc:mysql://hm01/test_hive?createDatabaseIfNotExist=false",
"hive.metastore.warehouse.dir": "/tmp/drill_hive_wh",
"fs.default.name": "maprfs:///",
"hive.metastore.sasl.enabled": "false",
"hive.metastore.cache-ttl-seconds": "120",
"hive.metastore.cache-expire-after": "write"
}
}

To launch the drill shell, we are running:

 

 sqlline -u jdbc:drill:zk=zookeeper01:5181,zookeeper02:5181,zookeeper03:5181 -n mapr

We are running the hive shell from a client/edge node.

 

Below, you can see the shell, number of rows, format, and query times for a "SELECT * FROM" and "SELECT * FROM" run <2 seconds immediately after the first query. These tables are partitioned.

 

 

SHELL

 

 

 

hive

TABLE

 

 

 

table_parquettest

ROWS

 

 

 

26000

FORMAT

 

 

 

parquet

"select * from"

 

 

 

1.104

"select * from", consecutive query, <2s. delay

 

 

0.087

hivetable_avrotest26000avro0.1690.09
drilltable_parquettest26000parquet64.56960.325
drilltable_avrotest26000avro72.76234.279

 

I'm curious if there is a Drill performance optimization we are missing in the configuration plugin, MapR system or drill shell. Also, in Drill, we are prefacing the test by setting the schema to hive (USE hive;).

 

Here is the output from "hive> show create table table_parquettest;"

CREATE TABLE `table_parquettest`(
`col1` date COMMENT '',
`col2` string COMMENT '',
`col3` string COMMENT '',
`col4` string COMMENT '',
`col5` string COMMENT '',
`col6` string COMMENT '',
`col7` string COMMENT '',
`col8` string COMMENT '',
`col9` string COMMENT '',
`col10` string COMMENT '',
`col11` string COMMENT '',
`col12` string COMMENT '',
`col13` string COMMENT '',
`col14` string COMMENT '')
PARTITIONED BY (
`xyz` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'maprfs:/user/hive/warehouse/table_parquettest'
TBLPROPERTIES (
'transient_lastDdlTime'='1484676643')

 

 

And same for the Avro table...

CREATE TABLE `table_avrotest`(
`col1` date COMMENT '', 
`col2` string COMMENT '', 
`col3` string COMMENT '', 
`col4` string COMMENT '', 
`col5` string COMMENT '', 
`col6` string COMMENT '', 
`col7` string COMMENT '', 
`col8` string COMMENT '', 
`col9` string COMMENT '', 
`col10` string COMMENT '', 
`col11` string COMMENT '', 
`col12` string COMMENT '', 
`col13` string COMMENT '', 
`col14` string COMMENT '')

PARTITIONED BY (
`xyz` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
'maprfs:/user/hive/warehouse/table_avrotest'
TBLPROPERTIES (
'transient_lastDdlTime'='1482261503')

 

Other notes: drill-env.sh memory settings are on default:

 

export DRILL_HEAP=${DRILL_HEAP:-"4G"}

 

# Maximum amount of direct memory to allocate to the Drillbit in the format

# supported by -XX:MaxDirectMemorySize. Default is 8G.

 

 export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"8G"}

 

# Value for the JVM -XX:MaxPermSize option for the Drillbit. Default is 512M.

 

 export DRILLBIT_MAX_PERM=${DRILLBIT_MAX_PERM:-"512M"}

 

 

Any feedback/questions/comments will be much appreciated, thanks.

Outcomes