
Jobs with a large number of reducers - each map task generates a file per reducer

Question asked by akarjp on Jan 9, 2013
Latest reply on Jan 11, 2013 by gera
MapR version: 2.0.1.15869.GA / M3

When running a job with a large number of reducers, we are seeing that each map task generates a file per reducer. So if the job has 10k mappers and 6k reducers, each map task generates 6k files, i.e. roughly 60 million intermediate files spread across the cluster's local volumes. As a result, we're getting an alarm:

Number of Files on Volume mapr.machine001.domain.com.local.mapred has exceeded threshold 20000000
 
Eventually, there are so many files that "hadoop fs -ls" on any of those map output directories hangs. For example:
hadoop fs -ls /var/mapr/local/machine001.domain.com/mapred/taskTracker/output/job_201301091546_0002/
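For reference, a quicker way to gauge how bad it has gotten, without enumerating every entry the way -ls does, might be hadoop fs -count (a sketch; we have not verified how it behaves against MapR local volumes):

# Prints DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME for the job's
# map output directory, so we can watch the file count grow.
hadoop fs -count /var/mapr/local/machine001.domain.com/mapred/taskTracker/output/job_201301091546_0002/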
 
Is this expected behavior? If MapR can deal with that many files, should we disable our alarms?
 
My understanding of how this works in Apache Hadoop is that each map task generates a single map output file (after merging its various spills), with one partition inside it per reducer. Is there an option in MapR to switch to that behavior? Or should we deal with this problem in some other way?
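In the meantime, the only stopgap we can see is shrinking the fan-out ourselves. A minimal sketch using the stock Hadoop 1.x properties (this assumes our driver goes through ToolRunner so -D options are honored; the jar, class, and paths below are placeholders):

# Cutting reducers from 6,000 to 600 shrinks each map's fan-out tenfold;
# compressing map output shrinks each intermediate file.
hadoop jar our-job.jar com.example.OurJob \
    -D mapred.reduce.tasks=600 \
    -D mapred.compress.map.output=true \
    /path/to/input /path/to/output

Even so, this only softens the file-count growth; it does not change the one-file-per-reducer behavior itself, which is why we are asking whether MapR has a switch for the merged-output layout.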
