
Drill - Block size vs # file parts

Question asked by john.humphreys on Jun 6, 2017
Latest reply on Jul 5, 2017 by john.humphreys

Hey,

 

I understand that to optimize Drill queries, the Parquet file block size should match the block size of the underlying file system.
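
To make that concrete, this is roughly how I'd line the two up when writing from Spark (the 256 MB figure and the use of the Hadoop configuration here are just illustrative assumptions, not my actual job settings):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-block-size").getOrCreate()

// Set the Parquet row-group ("block") size and the target file-system block
// size on the Hadoop configuration the Parquet writer picks up, so a row
// group fits inside a single file-system block. 256 MB is only an example.
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 256 * 1024 * 1024)
spark.sparkContext.hadoopConfiguration.set("dfs.blocksize", (256 * 1024 * 1024).toString)
```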

 

I'm not sure how many parts the Parquet files should optimally be broken into, though.  If I save a file in Spark and coalesce to 1,440 partitions (one sub-file per minute of a day), my performance is far worse than if I coalesce to, say, 40 (which ends up being roughly 1 GB per sub-file) — see the sketch below.
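
Here is a sketch of the two write paths I'm comparing (the input path, output paths, and DataFrame are placeholders, not my real pipeline):

```scala
// Hypothetical input: one day of data already readable by Spark.
val df = spark.read.parquet("/data/raw/some_day")

// Variant 1: one sub-file per minute of the day — Drill queries run far slower.
df.coalesce(1440).write.mode("overwrite").parquet("/data/parquet/by_minute")

// Variant 2: ~40 sub-files of roughly 1 GB each — Drill queries run much faster.
df.coalesce(40).write.mode("overwrite").parquet("/data/parquet/coarse")
```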

 

Is there a general target number of files I should aim for when coalescing Parquet files that will be queried with Drill on my cluster (e.g., one per node)?

 

Thanks,

 

-John
