Utilizing Drill Parquet Predicate Push-Down

Question asked by john.humphreys on Jun 7, 2017
Latest reply on Jun 15, 2017 by aalvarez

I am having our test cluster upgraded so we can move from Drill 1.6 to Drill 1.10, mainly because Drill 1.9 introduced predicate push-down for Parquet.

Drill 1.9 introduces the Parquet filter pushdown option. Parquet filter pushdown is a performance optimization that prunes extraneous data from a Parquet file to reduce the amount of data that Drill scans and reads when a query on a Parquet file contains a filter expression. Pruning data reduces the I/O, CPU, and network overhead to optimize Drill’s performance.

Reference: Parquet Filter Pushdown - Apache Drill 

I'm hoping that it makes some key queries fast enough to remove the need for other, faster query technologies.

I have a couple of questions in this regard:

  1. Do the Parquet min/max indices get maintained when Spark writes a Parquet file?  If not, I understand I can recreate my files with Drill CTAS.
  2. The documentation says, "The query planner can typically prune more data when the tables in the Parquet file are sorted by row groups."  What does "sorted by row groups" mean?  I assumed that sorting my entire file by one column (say, host-name if I'm storing metrics) would suffice, given that my look-ups tend to include the host-name in a WHERE clause.
  3. If the min/max indices are built into the row groups, then the row-group size must be pretty important.  Can I confirm that the row-group size is based on store.parquet.block-size?  Also, does the row-group size have anything to do with the coalesce(#) you use when writing a Parquet file from Spark?  I'm unsure whether row groups are related to the number of sub-files in the Parquet file.
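
To make questions 2 and 3 concrete, here is a toy model of how I understand min/max pruning to work. It is a simplified sketch in plain Python, not Drill's or Parquet's actual implementation: the helper names (build_row_groups, groups_to_scan), the host names, and the fixed rows-per-group size are all made up for illustration, and real row groups are sized in bytes rather than rows.

```python
from typing import List, Tuple

def build_row_groups(values: List[str], rows_per_group: int) -> List[Tuple[str, str]]:
    """Split values into fixed-size row groups and record (min, max) for each.

    Hypothetical helper: models row-group statistics only, no real Parquet I/O.
    """
    stats = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        stats.append((min(chunk), max(chunk)))
    return stats

def groups_to_scan(stats: List[Tuple[str, str]], target: str) -> int:
    """Count row groups whose [min, max] range could contain the target value."""
    return sum(1 for lo, hi in stats if lo <= target <= hi)

# Assumed data: 8 hosts with 100 metric rows each, interleaved in arrival order.
hosts = [f"host-{i:02d}" for i in range(8)]
unsorted_rows = [h for _ in range(100) for h in hosts]
sorted_rows = sorted(unsorted_rows)

unsorted_stats = build_row_groups(unsorted_rows, rows_per_group=100)
sorted_stats = build_row_groups(sorted_rows, rows_per_group=100)

# Equivalent of: WHERE host = 'host-03'
print(groups_to_scan(unsorted_stats, "host-03"), "of", len(unsorted_stats))  # 8 of 8
print(groups_to_scan(sorted_stats, "host-03"), "of", len(sorted_stats))      # 1 of 8
```

In this model the unsorted data gives every row group the same min/max range (host-00 to host-07), so nothing can be skipped, while the sorted data lets the filter rule out all but one row group. If real pruning behaves the same way, that would explain why both the sort order and the row-group size matter.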