
Drill speed vs parquet file size?

Question asked by john.humphreys on Jun 22, 2017
Latest reply on Jul 1, 2017 by smopsy

Hey everyone,

 

I'm running two queries against parquet files in Drill.  

 

The first file literally only has data for 1 host (1,440 rows). The second file has data for 40,000 hosts (57,600,000 rows). For some reason, both queries take about the same amount of time to return results (~2 seconds).

 

The files:

  • Have 325 metric columns, 1 time-stamp column, and 1 host-name column.
  • Have 1440 points per host.

The queries:

select host, metric_name_1
from dfs.`/nmr/eis/sysm/pmp/work/dev/maprdb/one_day_one_server.parquet`
where host='hostname1';

select host, metric_name_1
from dfs.`/nmr/eis/sysm/pmp/work/dev/maprdb/test-metrics-24-hour.parquet`
where host='hostname31000';
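
For what it's worth, I've been trying to sanity-check where the time goes by looking at the physical plan to see how much of each file the scan actually touches. A minimal sketch, assuming stock Drill EXPLAIN syntax (nothing MapR-specific):

explain plan for
select host, metric_name_1
from dfs.`/nmr/eis/sysm/pmp/work/dev/maprdb/test-metrics-24-hour.parquet`
where host='hostname31000';

The parquet group-scan entry in the output lists the files and columns being read, which should at least hint at whether the big file is being scanned in full.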

 

I have repeatedly seen this in MapR forums:

Whether you are creating the parquet files using Drill or through Hive/Spark etc., it is recommended to set the parquet block size to match the MFS chunk size for optimal performance.  The default MFS chunk size is 256 MB. To determine the MFS chunk size for file /a/b/f, run the following command: […]

but given that one file is only tens of MB and the other is many GB, I don't see how that recommendation can explain the identical query times.
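
In case I do end up rewriting the files from Drill, my understanding is that the setting in question is the parquet block-size session option. A sketch of what I would try, assuming the option name `store.parquet.block-size` and an illustrative CTAS target (dfs.tmp.`metrics_256mb_blocks` is just a placeholder name):

alter session set `store.parquet.block-size` = 268435456;  -- 256 MB, to match the default MFS chunk size

create table dfs.tmp.`metrics_256mb_blocks` as
select *
from dfs.`/nmr/eis/sysm/pmp/work/dev/maprdb/test-metrics-24-hour.parquet`;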

 

Though just to confirm: does the quote mean that the minimum read would be 256 MB, even for a small (say 1 MB) file?

 

Thank you!

-John Humphreys
