How to troubleshoot and resolve issues with starting TaskTracker

Document created by jbubier Employee on Feb 7, 2016
Version 1Show Document
  • View in full screen mode

Author: Jonathan Bubier

 

Original Publication Date: November 12, 2014

 

When MapR's warden service attempts to start the TaskTracker it may fail to initialize properly.  This can be due to an issue with warden when it is initializing the service or with the service itself when it is started that prevents it from coming up successfully.  This is typically indicated by alarms as well as failed service status in the MapR Control System (MCS) and an inability to run map-reduce tasks on the node. 

There are a number of reasons this can occur and below are some of the most common issues. 

 

 

1. Issues with JAVA_HOME

 

Before the warden service can start the TaskTracker the hadoop script verifies that the environment variable JAVA_HOME is properly set to determine the location of Java.  If JAVA_HOME is not set and the 'java' binary cannot be found from known locations the TaskTracker cannot be started and warden will print a corresponding error in /opt/mapr/logs/warden.log.  Verify the current setting of JAVA_HOME in /opt/mapr/conf/env.sh and set it manually if a custom install location was done for the JDK. 

 

 

2. Incorrect permissions on task-controller binary

 

If warden is able to start the TaskTracker but the TaskTracker repeatedly fails to start properly check the TaskTracker log under /opt/mapr/hadoop/hadoop-0.20.2/logs/.  This is typically named hadoop-<user>-tasktracker-<hostname>.log where <user> is the MapR user name and <hostname> is the node's fully qualified domain name.  During the startup sequence an error similar to the one below may be seen.

 

2014-11-12 15:46:06,261 ERROR mapred.TaskTracker [main]: Can not start TaskTracker because java.io.IOException: Cannot run program "/opt/mapr/hadoop/hadoop-0.20.2/bin/../bin/Linux-amd64-64/bin/task-controller": java.io.IOException: error=13, Permission denied

  at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)

  at org.apache.hadoop.util.Shell.runCommand(Shell.java:267)

  at org.apache.hadoop.util.Shell.run(Shell.java:249)

  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:442)

  at org.apache.hadoop.mapred.LinuxTaskController.setup(LinuxTaskController.java:142)

  at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:2268)

  at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:5444)

Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied

  at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)

  at java.lang.ProcessImpl.start(ProcessImpl.java:65)

  at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)

  ... 6 more

Verify the permissions on the task-controller binary under /opt/mapr/hadoop/hadoop-0.20.2/bin/Linux-amd64-64/bin/.  It should have all permissions for the root user, read and execute for the 'mapr' group and read for all other users, with the setuid bit set.  Ex:

 

# ls -l /opt/mapr/hadoop/hadoop-0.20.2/bin/Linux-amd64-64/bin/task-controller 
-rwsr-xr-- 1 root mapr 101311 Jun 12 09:45 /opt/mapr/hadoop/hadoop-0.20.2/bin/Linux-amd64-64/bin/task-controller

 

If these permissions do not match they can be reset by running '/opt/mapr/server/configure.sh -R' or by resetting them manually using the chmod command.

 

3. Incorrect configuration of mapreduce.taskcontroller.group

 

If the permissions on the task-controller binary are as expected and the TaskTracker fails to startup properly an error similar to the one below may be seen.

 

2014-11-12 15:48:01,995 WARN mapred.LinuxTaskController [main]: Exit code from checking binary permissions is : 22

2014-11-12 15:48:01,995 INFO mapred.TaskController [main]: Reading task controller config from /opt/mapr/hadoop/hadoop-0.20.2/conf/taskcontroller.cfg

2014-11-12 15:48:01,995 INFO mapred.TaskController [main]: number of groups = 1

2014-11-12 15:48:01,995 INFO mapred.TaskController [main]: The configured tasktracker group 0 is different from the group of the executable 2147483632

2014-11-12 15:48:01,995 INFO mapred.TaskController [main]: Invalid permissions on task-controller binary.

2014-11-12 15:48:01,996 ERROR mapred.TaskTracker [main]: Can not start TaskTracker because java.io.IOException: Task controller setup failed because of invalidpermissions/ownership with exit code 22

  at org.apache.hadoop.mapred.LinuxTaskController.setup(LinuxTaskController.java:148)

  at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:2268)

  at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:5444)

Caused by: org.apache.hadoop.util.Shell$ExitCodeException:

  at org.apache.hadoop.util.Shell.runCommand(Shell.java:322)

  at org.apache.hadoop.util.Shell.run(Shell.java:249)

  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:442)

  at org.apache.hadoop.mapred.LinuxTaskController.setup(LinuxTaskController.java:142)

  ... 2 more

This indicates that the group specified by 'mapreduce.tasktracker.group' does not match the current group assigned to the task-controller binary.  This configuration variable is set in /opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml using the following property:

 

<property> 
     <name>mapreduce.tasktracker.group</name>
     <value>mapr</value> <description>Expert: Group to which TaskTracker belongs. If
     LinuxTaskController is configured via mapreduce.tasktracker.taskcontroller,
     the group owner of the task-controller binary '$HADOOP_HOME/bin/platform/bin/task-controller'
     should be same as this group. </description>
</property>

This setting is then propagated to /opt/mapr/hadoop/hadoop-0.20.2/conf/taskcontroller.cfg and the TaskTracker verifies the value of mapreduce.tasktracker.group matches the group assigned to the binary.  If it does not match the TaskTracker fails to start.  Verify the desired value is set in mapred-site.xml, typically the 'mapr' group and restart the TaskTracker to resolve the issue.

 

4. Issues with creating local mapred volume

 

When the TaskTracker starts it executes the /opt/mapr/server/createTTVolume.sh script to create a local volume dedicated for map-reduce.  This map-reduce volume typically named mapr.<hostname>.local.mapred is used for holding intermediate output from map tasks and spill files.  If there is a problem executing the script and it does not complete successfully the TaskTracker will fail to start and an error similar to the below will be seen in the TaskTracker log.

 

2014-11-13 09:48:00,978 INFO mapred.TaskTracker [main]: Checking for local volume. If volume is not present command will create and mount it. Command invoked is : /opt/mapr/server/createTTVolume.sh host1.domain.prv /var/mapr/local/host1.domain.prv/mapred /var/mapr/local/host1.domain.prv/mapred/taskTracker

^@^@^@^@2014-11-13 09:52:06,107 ERROR mapred.TaskTracker [main]: Failed to create and mount local mapreduce volume at /var/mapr/local/host1.domain.prv/mapred. Please see logs at /opt/mapr/logs/createTTVolume.log

2014-11-13 09:52:06,107 ERROR mapred.TaskTracker [main]: Command ran /opt/mapr/server/createTTVolume.sh host1.domain.prv /var/mapr/local/host1.domain.prv/mapred /var/mapr/local/host1.domain.prv/mapred/taskTracker

2014-11-13 09:52:06,107 ERROR mapred.TaskTracker [main]: Command output

2014-11-13 09:52:06,109 ERROR mapred.TaskTracker [main]: Can not start TaskTracker because org.apache.hadoop.util.Shell$ExitCodeException:

  at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)

  at org.apache.hadoop.util.Shell.run(Shell.java:418)

  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)

  at org.apache.hadoop.mapred.TaskTracker.createTTVolume(TaskTracker.java:1921)

  at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:978)

  at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:2226)

  at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:5361)

 

2014-11-13 09:52:06,113 INFO mapred.TaskTracker [Thread-2]: SHUTDOWN_MSG:

When the TaskTracker fails to start due to an issue with the createTTVolume.sh script as seen above the next step is to review the script output.  The log file for the script is /opt/mapr/logs/createTTVolume.<uid>.log where <uid> is the user ID of the 'mapr' user.  Review the log file for messages at the time the script is called based on the TaskTracker log.  Below is sample output from the above TaskTracker startup failure.

 

2014-11-13 09:48:00 INFO This script was called with the arguments: host1.domain.prv /var/mapr/local/host1.domain.prv/mapred /var/mapr/local/host1.domain.prv/mapred/taskTracker

2014-11-13 09:48:01 INFO Checking if MapRFS is online

2014-11-13 09:48:01 DEBUG Will launch command "hadoop fs -stat /" with a command attempt timeout of 60 seconds a maximum of 1000 attempts and a sleep time of 1 seconds between failed command attempts

2014-11-13 09:48:01 DEBUG Launching "hadoop fs -stat /"

2014-11-13 09:48:03 DEBUG Command attempt 1 completed successfully in 2 seconds

2014-11-13 09:48:03 DEBUG Command completed successfully after 1 attempts and after 2 seconds

2014-11-13 09:48:03 DEBUG Will launch command "hadoop fs -stat /var/mapr/local/host1.domain.prv" with a command attempt timeout of 60 seconds a maximum of 1000 attempts and a sleep time of 1 seconds between failed command attempts

2014-11-13 09:48:03 DEBUG Launching "hadoop fs -stat /var/mapr/local/host1.domain.prv"

2014-11-13 09:48:05 DEBUG Command attempt 1 completed successfully in 2 seconds

2014-11-13 09:48:05 DEBUG Command completed successfully after 1 attempts and after 2 seconds

2014-11-13 09:48:05 INFO MapRFS is online. Checking whether MFS on this node is online

2014-11-13 09:48:05 DEBUG Will launch command "/opt/mapr/server/mrconfig -p 5660 info fsstate" with a command attempt timeout of 60 seconds a maximum of 60 attempts and a sleep time of 3 seconds between failed command attempts

2014-11-13 09:48:05 DEBUG Launching "/opt/mapr/server/mrconfig -p 5660 info fsstate"

2014-11-13 09:48:06 DEBUG Command attempt 1 failed with return code 1 after 1 seconds, sleeping for 3 seconds

2014-11-13 09:48:09 DEBUG Launching "/opt/mapr/server/mrconfig -p 5660 info fsstate"

2014-11-13 09:48:10 DEBUG Command attempt 2 failed with return code 1 after 1 seconds, sleeping for 3 seconds

2014-11-13 09:48:13 DEBUG Launching "/opt/mapr/server/mrconfig -p 5660 info fsstate"

2014-11-13 09:48:14 DEBUG Command attempt 3 failed with return code 1 after 1 seconds, sleeping for 3 seconds

...

2014-11-13 09:52:02 DEBUG Launching "/opt/mapr/server/mrconfig -p 5660 info fsstate"

2014-11-13 09:52:03 DEBUG Command attempt 60 failed with return code 1 after 1 seconds, sleeping for 3 seconds

2014-11-13 09:52:06 FATAL Command did not complete successfully after 60 attempts and after 241 seconds.

2014-11-13 09:52:06 INFO The command run was:

/opt/mapr/server/mrconfig -p 5660 info fsstate

 

2014-11-13 09:52:06 INFO The output of the last failed command attempt:

2014-11-13 09:52:02,0941 ERROR Global mrconfig.cc:1237 PrintServerState: rpc failed 104.

2014-11-13 09:52:02,0943 ERROR Global mrconfig.cc:3927 GetServerState failed Connection reset by peer.104.

The output indicates that the createTTVolume.sh script launched '/opt/mapr/server/mrconfig -p 5660 info fsstate' to get the current state of MFS on the node and the command failed for 60 attempts.  This could indicate a problem with the running MFS process on the node.  After 60 failed attempts the script exits and the TaskTracker restarts.  There are numerous other reasons that the script will fail so it is important to review the createTTVolume.log to get more detail about the failure and determine the steps to resolve it. 

Attachments

    Outcomes