
Small files problem in MapR

Question asked by jid1 on Apr 14, 2015
Latest reply on Apr 14, 2015 by jid1

We have a folder containing a large number of small files (~400K) that we process with Spark. Since we are aware of the small-files problem, we have a job that merges these files into much larger blocks.
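The merge job's basic logic can be sketched as follows. This is a minimal local-filesystem illustration, not the actual Spark job: it concatenates small files into fewer "block" files up to a target size. `merge_small_files` and `target_block_bytes` are hypothetical names; a real job would apply the same idea over MapR-FS/HDFS paths.

```python
import os

def merge_small_files(src_dir, dst_dir, target_block_bytes=128 * 1024 * 1024):
    """Concatenate many small files into fewer large 'block' files.

    Local-filesystem sketch of a small-file merge job; the real job
    would do the same over distributed-FS paths. The default target
    size (128 MB) is an assumed tuning knob.
    """
    os.makedirs(dst_dir, exist_ok=True)
    block_idx, written, out = 0, 0, None
    for name in sorted(os.listdir(src_dir)):
        path = os.path.join(src_dir, name)
        if not os.path.isfile(path):
            continue
        # Start a new output block when none is open or the current one is full.
        if out is None or written >= target_block_bytes:
            if out:
                out.close()
            out = open(os.path.join(dst_dir, f"block-{block_idx:05d}"), "wb")
            block_idx += 1
            written = 0
        with open(path, "rb") as f:
            data = f.read()
        out.write(data)
        written += len(data)
    if out:
        out.close()
    return block_idx  # number of merged block files produced
```

A real implementation would also need to preserve record boundaries (or use a container format such as SequenceFile or Avro) rather than blindly concatenating bytes.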

We have noticed that the performance of a console 'ls' is much better than that of 'listFiles(...)', so we benchmarked 'listFiles(...)': for the first 500-600 files it takes roughly one second per 100 paths, after which it slows to roughly 2 seconds per 100 paths.

Is there a way to speed this process up (either through code or configuration)? We are not expecting orders of magnitude, but a 2x-3x improvement would be nice.
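One common speed-up is to split the single large listing into per-subdirectory listings issued concurrently, so the latency of each metadata call overlaps. The sketch below is a local-filesystem analogue, assuming the files are spread across subdirectories; against a distributed FS each worker would call the filesystem client's per-directory listing instead. `list_files_parallel` and `workers` are hypothetical names.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def list_files_parallel(root, workers=8):
    """List files under `root`, scanning subdirectories in parallel.

    Sketch of splitting one big recursive listing into concurrent
    per-directory listings (one level deep, for brevity). `workers`
    is an assumed tuning knob.
    """
    subdirs, files = [], []
    # os.scandir returns entries with cached type info, avoiding a
    # separate stat per path (analogous to one listStatus call rather
    # than per-file status lookups).
    with os.scandir(root) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.path)
            elif entry.is_file(follow_symlinks=False):
                files.append(entry.path)

    def scan(d):
        with os.scandir(d) as it:
            return [e.path for e in it if e.is_file(follow_symlinks=False)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in pool.map(scan, subdirs):
            files.extend(batch)
    return files
```

On the Hadoop API side, processing the `RemoteIterator` returned by `listFiles(...)` incrementally, rather than materializing all paths first, also lets downstream work start before the listing completes.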

The metrics are the following (fps = files listed per second):

>hdfs dfs -ls   ==> 3250 fps

>NFS ls         ==> 1577 fps

>fs.listFiles() ==>  555 fps
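For comparing listing strategies like the three above, a tiny timing harness is enough. This is a generic sketch (`listing_rate` is a hypothetical name); `list_fn` is any callable returning a sequence of paths, e.g. a wrapper around `fs.listFiles(...)` or a local `os.listdir`.

```python
import time

def listing_rate(list_fn, *args):
    """Return (path_count, files_per_second) for one listing call.

    Timing harness in the spirit of the fps numbers above; pass any
    callable that returns a sequence of paths.
    """
    t0 = time.perf_counter()
    paths = list(list_fn(*args))
    elapsed = time.perf_counter() - t0
    return len(paths), len(paths) / elapsed if elapsed > 0 else float("inf")
```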