Roadmap for YARN / MRv2 support?

Question asked by heathbar on Apr 17, 2013
Latest reply on Nov 7, 2013 by mandoskippy
I would love to use MRv2 with MapR... Any estimates when this might be ready for testing?

One of the key features I need from MRv2 is proper resource-based scheduling (for now, RAM alone would be fine). Without it, I have to continuously monitor for jams caused by too many high-memory mappers running on one node, which gets jobs killed. It's easy for me to predict how much RAM a mapper needs to run to completion, but not easy to ask Hadoop to deliver that resource guarantee.

Hadoop 0.20.2 doesn't seem to support this, even with the CapacityScheduler (resource scheduling based on physical memory seems to have been removed, though it still appears in the documentation).

My current workaround is to wait for all jobs in other flows to complete, manually edit mapred-site.xml to lower the number of map slots, restart all the task trackers, run a stage of the pipeline, and then repeat. Because the whole cluster is configured for the resource needs of a single stage of a single flow of jobs, I can only run one flow at a time, which is a large waste of resources.
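To make the arithmetic behind that workaround concrete, here is a rough sketch of how a per-stage map slot count could be derived from a mapper's predicted RAM. The function name, stage names, and all memory figures are hypothetical, and it assumes the resulting number is what gets written into the map slot setting in mapred-site.xml (in stock Hadoop 0.20 that would be `mapred.tasktracker.map.tasks.maximum`).

```python
def max_map_slots(node_ram_mb: int, reserved_mb: int, mapper_ram_mb: int) -> int:
    """Largest number of concurrent mappers that fit in a node's usable RAM."""
    if mapper_ram_mb <= 0:
        raise ValueError("mapper_ram_mb must be positive")
    usable_mb = node_ram_mb - reserved_mb
    # Integer division: a mapper that only partly fits does not get a slot.
    return max(usable_mb // mapper_ram_mb, 0)


# Hypothetical cluster: 48 GB nodes with 8 GB held back for the OS and daemons.
NODE_RAM_MB = 48 * 1024
RESERVED_MB = 8 * 1024

# Hypothetical pipeline stages and the RAM each mapper is predicted to need.
stage_mapper_ram_mb = {"parse": 1024, "join": 4096, "aggregate": 8192}

slots_per_stage = {
    stage: max_map_slots(NODE_RAM_MB, RESERVED_MB, ram_mb)
    for stage, ram_mb in stage_mapper_ram_mb.items()
}
print(slots_per_stage)
```

Because the slot count is a single cluster-wide setting, only one of these values can be in effect at a time, which is why each stage needs its own edit-and-restart cycle.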

The lack of good resource-based scheduling is one of the major design flaws that MRv2 (YARN) fixes. There are many other features that would be "nice to have" (like the REST API).
