I am in the process of having our test cluster upgraded so we can move from Drill 1.6 to Drill 1.10, mainly because Drill 1.9 introduced predicate pushdown for Parquet files.
> Drill 1.9 introduces the Parquet filter pushdown option. Parquet filter pushdown is a performance optimization that prunes extraneous data from a Parquet file to reduce the amount of data that Drill scans and reads when a query on a Parquet file contains a filter expression. Pruning data reduces the I/O, CPU, and network overhead to optimize Drill’s performance.
Reference: Parquet Filter Pushdown - Apache Drill
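To check my mental model of what the pruning actually does, here is a toy sketch of the row-group skip decision as I understand it. This is plain Scala, not Drill's code, and the host names and statistics are made up:

```scala
object PruningSketch extends App {
  // Toy model of Parquet min/max pruning: a row group can be skipped when an
  // equality-filter value falls outside the [min, max] range that the Parquet
  // footer records for that column in that row group.
  case class ColumnStats(min: String, max: String)

  def canSkipRowGroup(stats: ColumnStats, filterValue: String): Boolean =
    filterValue < stats.min || filterValue > stats.max

  // Two hypothetical row groups from a file sorted by host_name.
  val rg1 = ColumnStats(min = "web-001", max = "web-020")
  val rg2 = ColumnStats(min = "web-021", max = "web-080")

  // Filter: WHERE host_name = 'web-042'
  println(canSkipRowGroup(rg1, "web-042")) // true  -> row group pruned, never read
  println(canSkipRowGroup(rg2, "web-042")) // false -> row group must be scanned
}
```

If that picture is right, the win depends entirely on the row groups having narrow, mostly non-overlapping min/max ranges for the filter column, which is what my questions below are about.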
I'm hoping filter pushdown makes some key queries fast enough that we can avoid pulling in a separate, faster query technology.
I have a few questions in this regard:
- Do the Parquet min/max statistics get written when Spark writes a Parquet file? If not, I understand I can recreate my files with Drill CTAS. (I've been trying to verify this myself; see the footer-inspection sketch after this list.)
- The documentation says, "The query planner can typically prune more data when the tables in the Parquet file are sorted by row groups." What does "sorted by row groups" mean? I assumed that globally sorting my entire file by a column (say, host-name, since I'm storing metrics) would suffice, given that nearly all of my lookups include the host-name in a WHERE clause. (That's what the Spark write sketch after this list does.)
- If the min/max statistics are stored per row group, then the row-group size must be pretty important. Can I confirm that, for Drill CTAS, the row-group size is based on `store.parquet.block-size`? Also, does the row-group size have anything to do with the `coalesce(n)` you use when writing a Parquet file from Spark? I'm unsure whether row groups are related to the number of part files inside the Parquet output. (My current understanding is in the Spark write sketch below.)
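For the first question, I've been trying to verify what Spark writes by reading the footers directly with parquet-mr. This is a rough sketch: the path is made up, and I'm on the parquet-mr version bundled with Spark 2.x, where the two-argument `readFooter` still exists (it is deprecated in newer releases):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

object InspectFooter extends App {
  // Hypothetical part file written by Spark.
  val file = new Path("/data/metrics_sorted/part-00000.parquet")
  val footer = ParquetFileReader.readFooter(new Configuration(), file)

  // One BlockMetaData per row group; each column chunk carries its own stats.
  footer.getBlocks.asScala.zipWithIndex.foreach { case (block, i) =>
    println(s"row group $i: rows=${block.getRowCount} bytes=${block.getTotalByteSize}")
    block.getColumns.asScala.foreach { col =>
      val stats = col.getStatistics
      if (stats != null && !stats.isEmpty)
        println(s"  ${col.getPath}: min=${stats.genericGetMin} max=${stats.genericGetMax}")
    }
  }
}
```

If the host-name column shows sensible min/max values per row group, I'm assuming Drill's planner can use them without a CTAS rewrite, but I'd like confirmation.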
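And for the second and third questions, this is roughly how I'm writing the files from Spark today (the paths, app name, and `host_name` column are from my own hypothetical metrics schema):

```scala
import org.apache.spark.sql.SparkSession

object WriteSortedMetrics extends App {
  val spark = SparkSession.builder().appName("write-sorted-metrics").getOrCreate()

  // Spark's Parquet writer takes its target row-group size from the Hadoop conf
  // key parquet.block.size (128 MB here). I'm assuming this is the Spark-side
  // counterpart of Drill's store.parquet.block-size, but that is part of my question.
  spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)

  val metrics = spark.read.parquet("/data/metrics_raw")

  // My reading of "sorted": a global sort on the filter column, so each row
  // group ends up covering a narrow, mostly non-overlapping host_name range.
  metrics
    .sort("host_name")
    .write
    .parquet("/data/metrics_sorted")

  // As far as I can tell, coalesce(n) only changes how many part files get
  // written; each part file can still contain several row groups of up to
  // parquet.block.size bytes each.
  metrics
    .coalesce(8)
    .write
    .parquet("/data/metrics_fewer_files")
}
```

Is that separation of "number of part files" from "number of row groups" the right mental model, or does the partition/file count interact with row-group boundaries in a way I'm missing?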