Sorry, this question was originally confusing; it was part of another thread and someone moved it here for me, so the original context was lost.
- Does coalescing the Parquet files written by Spark (down to, say, 5) make them faster for Drill to read? By default, Spark writes 200 snappy.parquet part files per DataFrame; coalescing reduces the number of part files, which I assume makes Drill queries faster, but I'm not sure.
- When writing Parquet from Spark, does sorting the data by the main column that will be queried help Drill? I'm not sure whether Parquet has indices or metadata, maintained by Spark, that Drill is capable of using.
- Does having Spark partition the Parquet data by a column before saving it help Drill's query speed?
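For reference, the three write patterns I'm asking about look roughly like this in PySpark (a sketch only: `df`, the column name `main_col`, and the output paths are placeholders, not real names from my job):

```python
# 1. Coalesce to 5 part files instead of the default 200.
df.coalesce(5).write.parquet("/data/out_coalesced")

# 2. Sort by the column Drill will mostly filter on before writing.
df.sort("main_col").write.parquet("/data/out_sorted")

# 3. Partition the output into directories by a column's values.
df.write.partitionBy("main_col").parquet("/data/out_partitioned")
```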