
Spark Data Frame Save As Parquet - Too Many Files?

Question asked by john.humphreys on May 31, 2017
Latest reply on Jun 1, 2017 by john.humphreys

I'm trying to generate a substantial test data set in parquet to see the query speeds I can get from Drill.


My parquet file seems to contain a whole ton of very tiny sub-files, though, and I believe I read that this is bad for Drill performance. For example, here's what I see when interrogating the generated parquet file: there are around 15,631 sub-files, each very tiny (~8 KB), and the total size is just 56 MB, so it seems insane to have nearly 16,000 sub-files.



  • Will this many small files slow down Drill?  It seems pretty slow right now.
  • What size should my parquet file-parts be, and how can I make Spark write them at that size?
  • I think I read that gzip is bad and Snappy is better.  Is Snappy the best compression codec for Drill over parquet?
  • Are there any special things I can do to my parquet file to make it faster with Drill?  I am reading through the Apache Drill Best Practices from the MapR Drill Team document, but any specific suggestions are still appreciated.
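On the gzip-vs-Snappy point: the `.gz.parquet` file names in the listing below indicate the files were written with Spark's default gzip codec. The codec can be switched before writing via a Spark SQL config property. A minimal sketch against the Spark 1.x SQLContext API used in the code further down (assuming a live `sqlContext`):

import org.apache.spark.sql.SQLContext

// Ask Spark to write parquet with Snappy instead of the default gzip codec
// (Spark 1.x property name; must be set before the save call).
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")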


Interrogation and Code For Reference


cd test-metrics-2.parquet

ls -lah


-rwxr-xr-x 1 glpmpbd pmpbatch 7.7K May 31 12:43 part-r-15622-b34a6764-0aef-46fa-afeb-c5780e021e22.gz.parquet
-rwxr-xr-x 1 glpmpbd pmpbatch 376 May 31 12:43 part-r-15623-b34a6764-0aef-46fa-afeb-c5780e021e22.gz.parquet
-rwxr-xr-x 1 glpmpbd pmpbatch 7.7K May 31 12:43 part-r-15624-b34a6764-0aef-46fa-afeb-c5780e021e22.gz.parquet
-rwxr-xr-x 1 glpmpbd pmpbatch 0 May 31 12:46 _SUCCESS
glpmpbd@psclxd00018(test-metrics-2.parquet)$ ls -lah | wc -l

glpmpbd@psclxd00018(test-metrics-2.parquet)$ du -h .


I've copied the code below, but I don't think it's needed.  Basically I'm:

  1. Generating a 40,000 element list (host names), a 325 element list (metric names), and a 10 element list (epoch times).
  2. Turning each into a data frame.
  3. Doing a Cartesian join between them to generate a 130 million record data set.
  4. Saving it as parquet using data_frame_instance.saveAsParquetFile()


import sqlContext.implicits._
import org.apache.spark.sql.functions.{col, concat, lit}
import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;

val blockSize = 1024 * 1024 * 256      // 256 MB
sc.hadoopConfiguration.setInt( "dfs.blocksize", blockSize )
sc.hadoopConfiguration.setInt( "parquet.block.size", blockSize )

def epochFinderInSeconds(hour : Integer, minute : Integer, second : Integer) : Long = {
    val c : Calendar = Calendar.getInstance()
    c.setTime(new Date())
    c.set(c.get(Calendar.YEAR), c.get(Calendar.MONTH), c.get(Calendar.DAY_OF_MONTH), hour, minute, second)
    c.getTimeInMillis() / 1000
}

def getEpochTimeOfToday(hours : Integer, minutes : Integer, seconds : Integer) : Long = {
    epochFinderInSeconds(hours, minutes, seconds)
}

def randomWithRange(min : Integer, max : Integer) : Double = {
    val range : Integer = (max - min) + 1
    (Math.random() * range) + min
}
// The list contents were truncated when posted; reconstructed here from the
// description above (40,000 host names, 325 metric names).
val hostsRDD = sc.parallelize(1 to 40000)
val metricsRDD = sc.parallelize(1 to 325)

val hostNamesDF = hostsRDD.toDF().select(concat(lit("hostname"), col("_1")).as("host"))
val metricNamesDF = metricsRDD.toDF().select(concat(lit(""), col("_1")).as("metric"))

val metricVsHostDF = hostNamesDF.join(metricNamesDF)

//Check - this prints 325 which is correct as there are 325 metrics per host.
//metricVsHostDF.where(col("host") === "hostname1").count

val startOfDay = epochFinderInSeconds(0,0,0);
val endOfDay = epochFinderInSeconds(0,10,0);

val times = scala.collection.mutable.ListBuffer.empty[Long]
for (theTime <- startOfDay until endOfDay by 60) {
     times += theTime
}
val timesDF = sc.parallelize(times).toDF().select(col("_1").as("timestamp"))

val timeVsHostVsMetricDF = metricVsHostDF.join(timesDF)
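The final save from step 4 isn't shown above. Since Spark writes one part-file per partition, one way to cut the ~15,631 part-files down is to coalesce to a small partition count before saving. This is a hedged sketch, not the original code: the partition count of 8 and the output path are illustrative choices, picked so that each part-file lands near the 56 MB total / ~8 MB-per-file mark rather than anything from the post.

// One part-file per partition: shrink the partition count so each
// part-file is a reasonable size, then save as in step 4.
// The count (8) and path are illustrative, not from the original post.
val outputDF = timeVsHostVsMetricDF.coalesce(8)
outputDF.saveAsParquetFile("test-metrics-2.parquet")

An alternative is repartition(n), which does a full shuffle and balances the partitions evenly, at the cost of moving all the data; coalesce avoids the shuffle but can leave partitions uneven.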