Spark on Yarn job fails when launching container

Document created by Hao Zhu Employee on Feb 17, 2016

Author: Hao Zhu

Original Publication Date: November 26, 2014

 

Symptom:

When running Spark jobs on Yarn:

bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster lib/spark-examples*.jar 10

the job fails with the following error message in the resource manager log:

INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application application_1417043690539_0013 failed 2 times due to AM Container for appattempt_1417043690539_0013_000002 exited with exitCode: 1 due to: Exception from container-launch:

org.apache.hadoop.util.Shell$ExitCodeException:

  at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)

  at org.apache.hadoop.util.Shell.run(Shell.java:418)

  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)

  at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:295)

  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:314)

  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)

  at java.util.concurrent.FutureTask.run(FutureTask.java:262)

  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

  at java.lang.Thread.run(Thread.java:744)

 

main : command provided 1

main : user is root

main : requested yarn user is root

 

Container exited with a non-zero exit code 1

.Failing this attempt.. Failing the application.

Root Cause:

To find the root cause, follow the troubleshooting path below. It is especially useful when Yarn log aggregation is not enabled (by default, yarn.log-aggregation-enable=false).

1. From the resource manager log, find which node manager reported the failure.

By default, both the resource manager log and the node manager log are located in /opt/mapr/hadoop/hadoop-2.4.1/logs. In this case, the failed attempt ran on node "yarn-fcs-2":

INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Application attempt appattempt_1417043690539_0013_000002 released container
container_1417043690539_0013_02_000001 on node: host: yarn-fcs-2:42846
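
The lookup in step 1 can be sketched as a small shell function. This is only an illustration: find_attempt_node is a hypothetical helper name, and it assumes each log record sits on a single line in the log file (the excerpt above is wrapped for display).

```shell
# Hypothetical helper (not part of Hadoop): print the resource manager log
# records that show a container being released for one application attempt.
#   $1     = application attempt ID
#   $2...  = one or more resource manager log files
find_attempt_node() {
    attempt="$1"
    shift
    # Keep only records that mention the attempt and the release event;
    # the host name appears after "on node: host:" in each matching record.
    cat "$@" | grep -F "$attempt" | grep -F "released container"
}

# Example call (the log file name is an assumption, adjust it to your node):
# find_attempt_node appattempt_1417043690539_0013_000002 \
#     /opt/mapr/hadoop/hadoop-2.4.1/logs/<resource manager log file>
```

The function returns a non-zero status when no record matches, which makes it easy to tell that the attempt was not found in the given files.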

2. In the node manager log, find which container failed and what error message was logged.

In this case, the container is container_1417043690539_0013_02_000001 and the error message in the node manager log is:

WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor:

Exception from container-launch with container ID: container_1417043690539_0013_02_000001 and exit code: 1

org.apache.hadoop.util.Shell$ExitCodeException:

  at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)

  at org.apache.hadoop.util.Shell.run(Shell.java:418)

  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)

  at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:295)

  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:314)

  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)

  at java.util.concurrent.FutureTask.run(FutureTask.java:262)

  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

  at java.lang.Thread.run(Thread.java:744)

The error message above shows that the failure happened while launching the container on this node. However, it does not give the reason.

3. Check the container's stdout and stderr for the reason of the failure.

The container log location is determined by the parameter "yarn.nodemanager.log-dirs" in yarn-site.xml. By default, it is set to ${yarn.log.dir}/userlogs, which means $HADOOP_YARN_HOME/logs/userlogs. In this case, the container log is located here:

/opt/mapr/hadoop/hadoop-2.4.1/logs/userlogs/application_1417043690539_0013/container_1417043690539_0013_02_000001

The reason for the failure is:

# cat stderr
Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher
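
Step 3 can be sketched as a shell function that builds the container log path from its parts and prints whatever the container wrote. The name show_container_logs is hypothetical, and the sketch assumes the directory layout shown above (log directory, then application ID, then container ID).

```shell
# Hypothetical helper: print the non-empty stdout and stderr files of one
# container from the node manager's local log directory.
#   $1 = value of yarn.nodemanager.log-dirs (a single directory)
#   $2 = application ID
#   $3 = container ID
show_container_logs() {
    dir="$1/$2/$3"
    for f in stdout stderr; do
        # -s is true only when the file exists and is not empty
        if [ -s "$dir/$f" ]; then
            echo "== $f =="
            cat "$dir/$f"
        fi
    done
}

# Example call, using the values from this article:
# show_container_logs /opt/mapr/hadoop/hadoop-2.4.1/logs/userlogs \
#     application_1417043690539_0013 container_1417043690539_0013_02_000001
```

This has to be run on the node that hosted the container, because without log aggregation the files stay on that node's local disk.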

4. Find out which jar file contains the missing class.

A search shows that the class is in this jar file:

/opt/mapr/spark/spark-1.1.0-bin-2.4.1-mapr-1408/lib/spark-assembly-1.1.0-hadoop2.4.1-mapr-1408.jar

This jar file contains the classes needed to run Spark on Yarn, so it must be on the container classpath.
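
One way to do the search in step 4 is to list the contents of every jar under a directory and look for the class file. The sketch below assumes the unzip command is installed (jar tf from a JDK would work similarly); find_jar_with_class is a hypothetical helper name.

```shell
# Hypothetical helper: print every jar file under a directory that contains
# the given class.
#   $1 = directory to search
#   $2 = fully qualified class name
find_jar_with_class() {
    # org.example.Main -> org/example/Main.class (the entry name inside a jar)
    entry="$(printf '%s' "$2" | tr '.' '/').class"
    find "$1" -type f -name '*.jar' | while IFS= read -r jar; do
        # unzip -l lists the archive entries without extracting them
        if unzip -l "$jar" 2>/dev/null | grep -qF "$entry"; then
            echo "$jar"
        fi
    done
}

# Example call, using the class from this article:
# find_jar_with_class /opt/mapr/spark org.apache.spark.deploy.yarn.ExecutorLauncher
```

Searching the Spark installation directory first keeps the scan short; widening the directory also works but takes longer on nodes with many jars.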

Solution:

1. Add the path of spark-assembly-<version>.jar to the property "yarn.application.classpath" in yarn-site.xml.

<property>

    <name>yarn.application.classpath</name>

    <value>/opt/mapr/spark/spark-1.1.0-bin-2.4.1-mapr-1408/lib/spark-assembly-1.1.0-hadoop2.4.1-mapr-1408.jar,/opt/mapr/hadoop/hadoop-2.4.1/etc/hadoop,/opt/mapr/hadoop/hadoop-2.4.1/etc/hadoop,/opt/mapr/hadoop/hadoop-2.4.1/etc/hadoop,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/common/lib/*,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/common/*,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/hdfs,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/hdfs/lib/*,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/hdfs/*,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/yarn/lib/*,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/yarn/*,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/mapreduce/lib/*,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/mapreduce/*,/contrib/capacity-scheduler/*.jar,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/yarn/*,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/yarn/lib/*

    </value>

</property>

2. Restart the resource manager and the node managers. After that, the job runs successfully.
