
TaskTracker shuts down and comes back up... randomly

Question asked by tomek_cejner on Mar 5, 2012
Latest reply on Mar 5, 2012 by nabeel
Hello everyone,
I have set up a small cluster with 4 machines (they are virtual machines in my "private cloud").
I installed everything without much trouble, with CLDB and ZooKeeper on the first node,
and TaskTrackers and FileServers on all of them.

My cluster is still idle, but I am observing strange behavior: the TaskTracker services randomly shut down
and then recover.


    2012-03-06 07:25:51,874 INFO org.apache.hadoop.mapred.TaskTracker: Checking for local volume. If volume is not present command will create and mount it. Command invoked is : /opt/mapr//server/createTTVolume.sh ctovm1652.dev.internal.com /var/mapr/local/ctovm1652.dev.internal.com/mapred/ /var/mapr/local/ctovm1652.dev.internal.com/mapred/taskTracker/
    2012-03-06 07:37:39,261 ERROR org.apache.hadoop.mapred.TaskTracker: Failed to create and mount local mapreduce volume at /var/mapr/local/ctovm1652.dev.internal.com/mapred/. Please see logs at /opt/mapr//logs/createTTVolume.log
    2012-03-06 07:37:39,261 ERROR org.apache.hadoop.mapred.TaskTracker: Command ran /opt/mapr//server/createTTVolume.sh ctovm1652.dev.internal.com /var/mapr/local/ctovm1652.dev.internal.com/mapred/ /var/mapr/local/ctovm1652.dev.internal.com/mapred/taskTracker/
    2012-03-06 07:37:39,261 ERROR org.apache.hadoop.mapred.TaskTracker: Command output
    2012-03-06 07:37:39,263 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start TaskTracker because org.apache.hadoop.util.Shell$ExitCodeException: Failed to create local mapred volume.
    Command used: /opt/mapr/bin/maprcli volume create -name "mapr.ctovm1652.dev.internal.com.local.mapred" -localvolumehost ctovm1652.dev.internal.com -localvolumeport 5660 -shufflevolume true -rereplicationtimeoutsec 300 -replication 1
    
            at org.apache.hadoop.util.Shell.runCommand(Shell.java:322)
            at org.apache.hadoop.util.Shell.run(Shell.java:249)
            at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:442)
            at org.apache.hadoop.mapred.TaskTracker.createTTVolume(TaskTracker.java:1889)
            at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:976)
            at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:2155)
            at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:5216)
    
    2012-03-06 07:37:39,263 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down TaskTracker at ctovm1652.dev.internal.com/127.0.0.1
    ************************************************************/

and createTTVolume.log says:

    ---- Tue Mar  6 07:25:53 CST 2012 --- mfs is up and has sent the first full container report.
    ERROR (22) -  Unable to map host: ctovm1652.dev.internal.com to non-local ipaddress while creating volume mapr.ctovm1652.dev.internal.com.local.mapred
    ERROR (22) -  Unable to map host: ctovm1652.dev.internal.com to non-local ipaddress while creating volume mapr.ctovm1652.dev.internal.com.local.mapred
    ERROR (22) -  Unable to map host: ctovm1652.dev.internal.com to non-local ipaddress while creating volume mapr.ctovm1652.dev.internal.com.local.mapred
    ERROR (22) -  Unable to map host: ctovm1652.dev.internal.com to non-local ipaddress while creating volume mapr.ctovm1652.dev.internal.com.local.mapred
    ERROR (22) -  Unable to map host: ctovm1652.dev.internal.com to non-local ipaddress while creating volume mapr.ctovm1652.dev.internal.com.local.mapred
    ERROR (22) -  Unable to map host: ctovm1652.dev.internal.com to non-local ipaddress while creating volume mapr.ctovm1652.dev.internal.com.local.mapred
    ERROR (22) -  Unable to map host: ctovm1652.dev.internal.com to non-local ipaddress while creating volume mapr.ctovm1652.dev.internal.com.local.mapred

Any clues what may be causing this behavior?

ctovm1652.dev.internal.com is the node I am currently looking at. The hostname comes from our infrastructure
and is always resolvable by the DNS server:

    $ host ctovm1652.dev.internal.com
    ctovm1652.dev.internal.com has address 10.14.52.152
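
Since the error complains about mapping the host to a non-local IP address, and the shutdown banner shows the TaskTracker as ctovm1652.dev.internal.com/127.0.0.1, I suppose it also makes sense to check how the name resolves locally on the node itself. This is just a sketch of checks using standard Linux tools; I have not confirmed that this is the cause:

    # how the node resolves its own name (a 127.x answer here would match
    # the 127.0.0.1 seen in the TaskTracker shutdown banner)
    $ hostname -f
    $ hostname -i

    # what the local resolver (including /etc/hosts) returns for the FQDN
    $ getent hosts ctovm1652.dev.internal.com

    # look for a loopback mapping of the hostname in /etc/hosts
    $ grep ctovm1652 /etc/hosts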

I did a simple check following hints from other posts:

    $ hadoop fs -ls /
    Found 2 items
    drwxrwxrwx   - root root          0 2012-03-05 08:12 /hbase
    drwxrwxrwx   - root root          1 2012-03-05 08:15 /var
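
Since the volume that fails to be created lives under /var/mapr/local/ctovm1652.dev.internal.com/mapred/, I could also look at that path directly (just a sketch of the check; I am not sure what should normally be there):

    $ hadoop fs -ls /var/mapr/local
    $ hadoop fs -ls /var/mapr/local/ctovm1652.dev.internal.com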



Running

    $ maprcli node list -json

returns a big JSON output with entries for all 4 nodes in my cluster; each entry has valid "ip" and "hostname" attributes (among others).
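
If the -columns option is available in this maprcli version (an assumption on my part), a narrower listing showing just the name/address mapping the cluster itself has recorded would be something like:

    $ maprcli node list -columns hostname,ip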

The only warning sign I see is in the web admin console:


    2:50:14 PM - Failed to SSH to host ctovm1652.dev.internal.com.
    Either the host is unreachable, or root access to the node via passwordless SSH is not set up.
    Please ensure that the node is reachable and passwordless SSH is configured, and try again.
    Alternatively, you may manually login to the node as root and run the following commands to list, add and remove disks.
    sudo /opt/mapr/bin/maprcli disk list -host 127.0.0.1
    sudo /opt/mapr/bin/maprcli disk add -host 127.0.0.1 -disk diskname
    sudo /opt/mapr/bin/maprcli disk remove -host 127.0.0.1 -disk diskname

I do have passwordless SSH set up, but only for a normal user account, not for root. Some answers suggest this message can be ignored (?)
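
For completeness, if root login over SSH is allowed in our environment (which I would still have to confirm), setting up passwordless root SSH from the node running the web console would presumably be something like:

    # sketch only, assuming root SSH login is permitted; run as root on the console node
    $ ssh-keygen -t rsa
    $ ssh-copy-id root@ctovm1652.dev.internal.com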


There is a chance that something in my environment is causing this (the machines are virtual), and I can consult our engineering team, but I need a clue about what to ask for.


