
Job failed for pi example program

Question asked by asadali on Jul 5, 2017
Latest reply on Jul 6, 2017 by satz

Let me first apologize for making multiple threads.

 

Background: A previous 10-node cluster running Ubuntu 16.04 was having issues, so I decided to create a new cluster on Ubuntu 14.04 with the newest version of MapR. This cluster has 3 nodes.

 

Here is a screenshot of the MapR Control System:

 

One thing that jumps out is the cluster utilization. There is also a clock skew alarm, which I don't think is the source of the problem.
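
In case the clock skew matters after all, here is roughly how I check time sync on each node (standard ntp/date tools, nothing MapR-specific, and assuming ntpd is installed):

ntpq -p    # list the NTP peers the node syncs against and their offsets
date       # compare wall-clock time across the three nodes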

 

Here is what my yarn-site.xml looks like; I copied it from the old cluster:

 

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>10240</value>
</property>

<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>2048</value>
</property>

<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>

<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>

<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>4</value>
</property>
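
To double-check that the ResourceManager actually picked these values up, I can query the cluster totals from the RM REST API (this is the standard Hadoop 2.x metrics endpoint; hadoop2:8088 is the ResourceManager address shown in the job output below):

curl http://hadoop2:8088/ws/v1/cluster/metrics

The JSON it returns includes totalMB and totalVirtualCores, which should line up with the values above multiplied by the number of NodeManagers.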

 

Finally, here is a small snapshot of the output from running

hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar pi 2 50

 

17/07/05 17:15:59 INFO impl.YarnClientImpl: Submitted application application_1499300101841_0001
17/07/05 17:15:59 INFO mapreduce.Job: The url to track the job: http://hadoop2:8088/proxy/application_1499300101841_0001/
17/07/05 17:15:59 INFO mapreduce.Job: Running job: job_1499300101841_0001
17/07/05 17:16:05 INFO mapreduce.Job: Job job_1499300101841_0001 running in uber mode : false
17/07/05 17:16:05 INFO mapreduce.Job: map 0% reduce 0%
17/07/05 17:16:19 INFO mapreduce.Job: Task Id : attempt_1499300101841_0001_m_000001_0, Status : FAILED
17/07/05 17:16:20 INFO mapreduce.Job: Task Id : attempt_1499300101841_0001_m_000000_0, Status : FAILED
17/07/05 17:16:33 INFO mapreduce.Job: Task Id : attempt_1499300101841_0001_m_000001_1, Status : FAILED
17/07/05 17:16:34 INFO mapreduce.Job: Task Id : attempt_1499300101841_0001_m_000000_1, Status : FAILED
17/07/05 17:16:39 INFO mapreduce.Job: map 50% reduce 0%
17/07/05 17:16:47 INFO mapreduce.Job: Task Id : attempt_1499300101841_0001_m_000001_2, Status : FAILED
17/07/05 17:17:03 INFO mapreduce.Job: map 100% reduce 100%
17/07/05 17:17:03 INFO mapreduce.Job: Job job_1499300101841_0001 failed with state FAILED due to: Task failed task_1499300101841_0001_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0

 

When I used to get errors, there was usually some complaint in the output as well, such as memory issues. This time I just get a task failure with no reason given, so I'm not sure where I went wrong. Some help would be appreciated!

 

Here are some things I have tried to diagnose the problem.

I checked the syslog in

/opt/mapr/hadoop/hadoop-2.7.0/logs/userlogs/application_1499300101841_0001/container_e20_1499300101841_0001_01_000001

 

and found things like

 

2017-07-05 17:16:38,204 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=<memory:24576, vCores:21, disks:2.99>
2017-07-05 17:16:38,204 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 2
2017-07-05 17:16:39,207 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_e20_1499300101841_0001_01_000010
2017-07-05 17:16:39,208 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=<memory:26624, vCores:22, disks:3.49>
2017-07-05 17:16:39,208 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 2
2017-07-05 17:16:39,208 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:1 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:1 AssignedReds:0 CompletedMaps:1 CompletedReds:0 ContAlloc:10 ContRel:4 HostLocal:2 RackLocal:0
2017-07-05 17:16:39,208 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1499300101841_0001_m_000000_2: Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

 

I also found a whole bunch of lines like these:

2017-07-05 17:16:19,084 ERROR [CommitterEvent Processor #1] com.mapr.fs.MapRFileSystem: Failed to delete path maprfs:/user/hadoop/QuasiMonteCarlo_1499300157826_59858442/out/_temporary/1/_temporary/attempt_1499300101841_0001_m_000001_0, error: No such file or directory (2)
2017-07-05 17:16:19,085 WARN [CommitterEvent Processor #1] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Could not delete maprfs:/user/hadoop/QuasiMonteCarlo_1499300157826_59858442/out/_temporary/1/_temporary/attempt_1499300101841_0001_m_000001_0
2017-07-05 17:16:19,087 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1499300101841_0001_m_000001_0 TaskAttempt Transitioned from FAIL_TASK_CLEANUP to FAILED
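
To go through all of the container logs in one place, I can also pull them with the yarn logs command (this assumes log aggregation is enabled; the application id is the one from the output above), and check whether the job's temporary output directory actually exists on MapR-FS:

yarn logs -applicationId application_1499300101841_0001
hadoop fs -ls /user/hadoop/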

 

 

Still not sure what is going on. I think I've messed up my yarn-site.xml, but if it were a memory problem I would expect it to be mentioned somewhere in the logs.
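
In case it is a container-sizing mismatch after all, these are the other standard Hadoop 2.x sizing properties I still need to double-check (the values below are only illustrative, not what is actually set on this cluster). The mapreduce.* ones live in mapred-site.xml, the last one in yarn-site.xml:

<!-- illustrative values only, not verified on this cluster -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>10240</value>
</property>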

 

UPDATE: Using classic MapReduce works and runs without errors. It's only YARN that isn't happy.
