I have a 3 node cluster with Drill bit installed on all 3 nodes. On this cluster I have a Hive managed table with some complex data types(ARRAY<STRUCT<) and data is stored in Parquet.
I am experiencing a strange performance behaviour, when I query this Parquet data using Hive query & using DRILL query. My understanding was DRILL query should result much faster than Hive query. So I built a DRILL view on top of parquet data to FLATTEN all data from complex data. When I query on Hive managed table using HQL it results in seconds but when I use FLATTENED DRILL view to query the count takes 22+ minutes to count 17,47,156 records.
Below are the queries used to count records :
Hive query - select count(col1) from hive_table;
DRILL query - select count(col1) from dfs.tmp.drill_view;
1. 'col1' is not part of complex data type.
2. To create DRILL view I have used kvgen() and Flatten() functions.