
Sharing common YARN job failure scenario

Question asked by davidtucker on May 6, 2015
Latest reply on Apr 20, 2016 by vgonzalez
After returning to a MapR cluster that was left running but idle for an extended period, it is common to see the following failure when launching a YARN job:

    15/05/06 09:26:03 INFO mapreduce.Job:  map 0% reduce 0%
    15/05/06 09:26:03 INFO mapreduce.Job: Job job_1429643074260_0008 failed with state FAILED due to: Application application_1429643074260_0008 failed 2 times due to AM Container for appattempt_1429643074260_0008_000002 exited with  exitCode: -1000 due to: Application application_1429643074260_0008 initialization failed (exitCode=20) with output: main : command provided 0
    main : user is mapr
    main : requested yarn user is mapr
    Failed to create directory /tmp/hadoop-mapr/nm-local-dir/usercache/mapr - No such file or directory

The clue is the "No such file or directory" warning.  The /tmp/hadoop-mapr directory does not exist, and yet the mapr user would have no trouble creating it by hand -- so this is a missing directory, not a permissions problem.
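A quick way to confirm the diagnosis on the affected NodeManager host (the paths below are taken from the error message above; run the commands as the mapr user):

```shell
# If tmpwatch has cleaned up /tmp, the directory is simply gone:
ls -ld /tmp/hadoop-mapr 2>/dev/null || echo "/tmp/hadoop-mapr is missing"

# mkdir -p succeeds with no permission error, confirming the failure is a
# missing directory rather than a permissions problem:
mkdir -p /tmp/hadoop-mapr/nm-local-dir/usercache/mapr
```

If the mkdir fails here with "Permission denied", the problem is something else entirely and this post does not apply.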

This is often a side effect of the tmpwatch utility, which runs daily on CentOS systems to clean up files under /tmp that have not been accessed recently.  The NodeManager service creates the top-level hadoop.tmp.dir (which defaults to /tmp/hadoop-${user.name}) only when the service starts, and will not recreate it when it launches a job -- which is why a restart clears the error.

The quick solution is to restart the NodeManager service.  The long-term solution is to disable tmpwatch, or to update /etc/cron.daily/tmpwatch to protect the temporary hadoop directories.  The correct syntax for tmpwatch should be something like:

    /usr/sbin/tmpwatch "$flags" -x /tmp/.X11-unix -x /tmp/.XIM-unix \
            -x /tmp/.font-unix -x /tmp/.ICE-unix -x /tmp/.Test-unix \
            -X '/tmp/hadoop-*'  -X '/tmp/hsperfdata_*' 10d /tmp   
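tmpwatch's -X option takes a shell-style glob that is matched against the full path of each candidate entry.  A quick, cluster-free way to sanity-check that the globs above cover the directories we want to keep (plain POSIX shell; the sample paths are just illustrations):

```shell
# Simulate tmpwatch's -X exclusion with shell pattern matching to confirm
# which paths the globs would protect:
for path in /tmp/hadoop-mapr /tmp/hsperfdata_12345 /tmp/scratch.txt; do
    case "$path" in
        /tmp/hadoop-*|/tmp/hsperfdata_*)
            echo "protected: $path" ;;
        *)
            echo "eligible for cleanup: $path" ;;
    esac
done
```

Note that /tmp/hsperfdata_* is worth excluding as well: those directories hold JVM performance data, and removing them while daemons are running can confuse JVM monitoring tools.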

Regards!