AnsweredAssumed Answered

DRILL vs HIVE performance on Parquet data

Question asked by AmarnathVibhute on Oct 3, 2016
Latest reply on Oct 10, 2016 by AmarnathVibhute

Hello,

I have a 3 node cluster with Drill bit installed on all 3 nodes. On this cluster I have a Hive managed table with some complex data types(ARRAY<STRUCT<) and data is stored in Parquet.


I am experiencing a strange performance behaviour, when I query this Parquet data using Hive query & using DRILL query. My understanding was DRILL query should result much faster than Hive query. So I built a DRILL view on top of parquet data to FLATTEN all data from complex data. When I query on Hive managed table using HQL it results in seconds but when I use FLATTENED DRILL view to query the count takes 22+ minutes to count 17,47,156 records.

 

Below are the queries used to count records :

Hive query - select count(col1) from hive_table;

DRILL query - select count(col1) from dfs.tmp.drill_view;

 

Note:
1. 'col1' is not part of complex data type.
2. To create DRILL view I have used kvgen() and Flatten() functions.

Outcomes