Tasktracker fails to start after the tasktracker node is rebooted

Question asked by hui_hu on Jun 28, 2013
Latest reply on Jun 30, 2013 by hui_hu
Hi, I'm using MapR 2.1.3 to create a cluster with data and compute separated. The problem is that after I run 'sudo reboot' on a tasktracker node (the hostname doesn't change), the mapr-warden service starts and keeps running, but the tasktracker daemon fails to start. If I manually run 'sudo service mapr-warden restart' on that node after it has booted, the tasktracker starts correctly.

$ rpm -qa | grep mapr
<pre>
mapr-fileserver-2.1.3.19871.GA-1
mapr-core-2.1.3.19871.GA-1
mapr-tasktracker-2.1.3.19871.GA-1
</pre>

$ ll /opt/mapr/roles/
<pre>
total 0
-rwxr-xr-x 1 root root 0 May  9 02:24 fileserver
-rwxr-xr-x 1 root root 0 May  9 02:24 tasktracker
</pre>

/opt/mapr/hadoop/hadoop-0.20.2/logs/hadoop-mapr-tasktracker-<hostname>.log :
<pre>
2013-06-28 23:59:44,361 INFO org.apache.hadoop.mapred.TaskTracker: Checking for local volume. If volume is not present command will create and mount it. Command invoked is : /opt/mapr//server/createTTVolume.sh [hostname] /var/mapr/local/[hostname]/mapred/ /var/mapr/local/[hostname]/mapred/taskTracker/
2013-06-28 23:59:48,585 ERROR org.apache.hadoop.mapred.TaskTracker: <b>Failed to create and mount local mapreduce volume at /var/mapr/local/[hostname]/mapred/ </b>. Please see logs at /opt/mapr//logs/createTTVolume.log
2013-06-28 23:59:48,585 ERROR org.apache.hadoop.mapred.TaskTracker: Command ran /opt/mapr//server/createTTVolume.sh [hostname] /var/mapr/local/[hostname]/mapred/ /var/mapr/local/[hostname]/mapred/taskTracker/
2013-06-28 23:59:48,585 ERROR org.apache.hadoop.mapred.TaskTracker: Command output
2013-06-28 23:59:48,586 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start TaskTracker because org.apache.hadoop.util.Shell$ExitCodeException:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:322)
        at org.apache.hadoop.util.Shell.run(Shell.java:249)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:442)
        at org.apache.hadoop.mapred.TaskTracker.createTTVolume(TaskTracker.java:1879)
        at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:961)
        at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:2176)
        at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:5309)

2013-06-28 23:59:48,587 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down TaskTracker at [hostname]
************************************************************/
</pre>

/opt/mapr/logs/createTTVolume.501.log
<pre>
2013-06-28 23:59:46 INFO MapRFS is online. Checking whether MFS on this node is online
2013-06-28 23:59:46 DEBUG Will launch command "/opt/mapr//server/mrconfig -p 5660 info fsstate" with a command attempt timeout of 60 seconds a maximum of 3 attempts and a maximum cumulative timeout of 60 seconds
2013-06-28 23:59:46 DEBUG Launching "/opt/mapr//server/mrconfig -p 5660 info fsstate"
2013-06-28 23:59:46 DEBUG Command attempt 1 failed with return code 1 after 0 seconds
2013-06-28 23:59:46 DEBUG Launching "/opt/mapr//server/mrconfig -p 5660 info fsstate"
2013-06-28 23:59:47 DEBUG Command attempt 2 failed with return code 1 after 1 seconds
2013-06-28 23:59:47 DEBUG Launching "/opt/mapr//server/mrconfig -p 5660 info fsstate"
2013-06-28 23:59:48 DEBUG Command attempt 3 failed with return code 1 after 1 seconds
2013-06-28 23:59:48 <b>FATAL Command did not complete successfully after 3 attempts and after 2 seconds.</b>
2013-06-28 23:59:48 INFO The command run was:
/opt/mapr//server/mrconfig -p 5660 info fsstate

2013-06-28 23:59:48 INFO The output of the last failed command attempt:
All SPs are loaded.
</pre>

/opt/mapr/logs/createsystemvolumes.log
<pre>
---- Fri Jun 28 23:59:24 UTC 2013 --- ALL OK
mkdir: cannot create directory /var/mapr/local/[hostname]: File exists
ERROR (22) -  FileServer [hostname]:5660 has not heartbeated with CLDB for 77
2013-06-28 23:59:28.745 [hostname] createsystemvolumes.sh(4788) Install CreateLocalVolumeDirectories:213 local volume 'logs' already exists
2013-06-28 23:59:28.748 [hostname] createsystemvolumes.sh(4788) Install CreateLocalVolumeDirectories:226 'logs' volume is created
2013-06-28 19:46:40
mkdir: cannot create directory /var/mapr/local/[hostname]/logs/hbase: File exists
mkdir: cannot create directory /var/mapr/local/[hostname]/logs/maprcli: File exists
mkdir: cannot create directory /var/mapr/local/[hostname]/logs/mapred: File exists
mkdir: cannot create directory /var/mapr/local/[hostname]/logs/mapred/jobtracker: File exists
mkdir: cannot create directory /var/mapr/local/[hostname]/logs/mapred/tasktracker: File exists
ERROR (10003) -  Volume create mapr.[hostname].local.metrics failed, Volume exists
2013-06-28 23:59:51.996 [hostname] createsystemvolumes.sh(4788) Install CreateLocalVolumeDirectories:266 local volume 'metrics' already exists
</pre>

/opt/mapr/logs/warden.log
<pre>
2013-06-28 23:59:43,397 INFO  com.mapr.warden.service.baseservice.Service [main-EventThread]: ZK Connect state:State:CONNECTED Timeout:30000 sessionid:0x3f8c5125fa014d local:/10.141.73.149:42327 remoteserver:hostname/10.141.73.96:5181 lastZxid:4294972119 xid:24 sent:26 recv:31 queuedpkts:0 pendingresp:0 queuedevents:0
2013-06-28 23:59:44,469 INFO  com.mapr.warden.service.baseservice.Service$ServiceRun [tasktracker_monitor]: starting tasktracker, logging to /opt/mapr/hadoop/hadoop-0.20.2/bin/../logs/hadoop-mapr-tasktracker-[hostname].out

2013-06-28 23:59:54,475 INFO  com.mapr.job.mngmnt.hadoop.metrics.WardenRequestBuilder [tasktracker_monitor]: [e_SERV_RUN, hostName, ma_host, ma_process]
2013-06-28 23:59:54,475 INFO  com.mapr.job.mngmnt.hadoop.metrics.WardenRequestBuilder [tasktracker_monitor]: []
2013-06-28 23:59:54,728 <b>ERROR com.mapr.warden.service.baseservice.Service$ServiceMonitorRun run [tasktracker_monitor]: Monitor command: [/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh, status, tasktracker]cannot determine if service: tasktracker is running. Number of retrials exceeded. Closing Zookeeper</b>
2013-06-28 23:59:54,728 INFO  com.mapr.warden.service.baseservice.Service [tasktracker_monitor]: 35 about to close zk for service: tasktracker
2013-06-28 23:59:54,734 INFO  com.mapr.warden.service.baseservice.Service [main-EventThread]: Process path: /services/tasktracker/[hostname]. Event state: SyncConnected. Event type: NodeDeleted
2013-06-28 23:59:54,734 INFO  com.mapr.warden.service.baseservice.Service [main-EventThread]: ZK Connect state:State:CLOSED sessionid:0x23f8c521c8e0140 local:/10.141.73.149:44423 remoteserver:hostname/10.141.73.111:5181 lastZxid:4294972121 xid:22 sent:25 recv:29 queuedpkts:0 pendingresp:0 queuedevents:1
2013-06-28 23:59:54,734 INFO  com.mapr.warden.service.baseservice.Service [main-EventThread]: ZK is closed for service: tasktracker
2013-06-28 23:59:54,740 INFO  com.mapr.job.mngmnt.hadoop.metrics.WardenRequestBuilder [tasktracker_monitor]: [e_SERV_FAIL, hostName, ma_host, ma_process]
2013-06-28 23:59:54,740 INFO  com.mapr.job.mngmnt.hadoop.metrics.WardenRequestBuilder [tasktracker_monitor]: []
2013-06-28 23:59:54,740 INFO  com.mapr.warden.service.baseservice.Service [tasktracker_monitor]: Alarm raising command: [/opt/mapr/bin/maprcli, alarm, raise, -alarm, NODE_ALARM_SERVICE_TT_DOWN, -entity, [hostname], -description, Can not determine if service: tasktracker is running. Check logs at: /opt/mapr/hadoop/hadoop-0.20.2/logs]
</pre>
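In createTTVolume.501.log above, the fsstate check gives up after 3 attempts within about 2 seconds, while a manual warden restart later on succeeds. One possible stopgap, assuming the check simply needs more time after boot (this is a guess, not a confirmed cause), is to script the manual restart behind a longer retry loop. The sketch below is only an illustration: `retry_until_ok` is a hypothetical helper name, and the attempt count and delay in the example are arbitrary.

```shell
#!/bin/sh
# Hypothetical helper (not part of MapR): run a command repeatedly until it
# exits with status 0 or the attempts are used up.
# Usage: retry_until_ok MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_until_ok() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Success: stop retrying and report it to the caller.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempt=$((attempt + 1))
        # Sleep only if another attempt will follow.
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example use after boot. The two commands are the ones quoted in this post;
# 30 attempts and a 10 second delay are assumptions, not recommended values:
#   retry_until_ok 30 10 /opt/mapr/server/mrconfig -p 5660 info fsstate \
#       && sudo service mapr-warden restart
```

This only automates the manual workaround; it does not explain why the check returns code 1 right after reboot.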
