
Failed to Create Volume, Duplicate FileServer ServerIDs and NFS Server

Question asked by mshirley on Mar 13, 2013
Latest reply on Apr 19, 2013 by mshirley
We recently had an issue with a node that was re-imaged during testing.  All of the configuration was successful up to the point where it tried to start TaskTracker.  TaskTracker failed to start because the /var/mapr/local/(node name)/mapred volume didn't exist.
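
For anyone hitting the same symptom, the quickest way we know of to confirm the local volume really is missing is something like the following (volume name and path taken from the createTTVolume.log output below; substitute your own hostname):

    # does the node-local mapred volume exist at all?
    maprcli volume info -name mapr.nodename.local.mapred
    # is the directory visible under the cluster's /var/mapr/local tree?
    hadoop fs -ls /var/mapr/local/nodename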

Looking at /opt/mapr/logs/createTTVolume.log, we saw the following:

    ---- Wed Mar 13 20:12:14 GMT 2013 --- /var/mapr/local/nodename is up
    ---- Wed Mar 13 20:12:14 GMT 2013 --- mfs is up and has sent the first full container report.
    mapr.nodename.local.mapred: DIR missing, VOL missing
    mapr.nodename.local.mapred: creating VOL
    maprcli volume create -name mapr.nodename.local.mapred -path /var/mapr/local/nodename/mapred -replication 1 -localvolumehost nodename -localvolumeport 5660 -shufflevolume true
    ERROR (10003) -  Volume create mapr.nodename.com.local.mapred failed, Input/output error
    mapr.nodename.local.mapred: Failed to create volume

Checking our CLDB log /opt/mapr/logs/cldb.log, we noticed the following while it was trying to create the volumes:

    2013-03-13 18:07:43,519 ERROR CLDBServer [RPC-thread-1]: VolumeCreate: VolName: mapr.nodename.local.mapredCould not create root container. Aborting VolumeCreate
    2013-03-13 18:07:52,687 ERROR CLDBServer [RPC-thread-12]: VolumeLookup: VolName: mapr.nodename.local.mapred Volume not found
    2013-03-13 18:07:54,656 INFO CLDBServer [RPC-thread-9]: containerCreateRetry : volume mapr.nodename.local.mapredis local, and the local node is not hearbeating.

There were no disk space issues, the CLDB was up, and Warden was running.
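
The checks behind that statement were roughly the following; treat this as a sketch of the kind of checks involved, since the exact service script and tool names are assumed from a stock install:

    # Warden has its own init script; CLDB itself runs under Warden
    service mapr-warden status
    # ask the cluster which node is currently the CLDB master
    maprcli node cldbmaster
    # storage-pool free space as MFS sees it (not just the OS filesystem)
    /opt/mapr/server/mrconfig sp list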

We looked at the CLDB page in the web UI and noticed multiple rows with different ServerIDs but the same hostname.  The status of one of those entries was "INACTIVE", with a "Last Heartbeat" of 1363207248.  There were also two lines under "Active NFS Servers", one of which was the newly rebuilt node that used to be the NFS gateway but had since been switched over; its status was INACTIVE.
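
The same duplication is also visible from the command line.  Something along these lines (the column names are our assumption from the node list help text) prints one row per FileServer entry with its serverid, so two rows for one hostname stand out:

    maprcli node list -columns id,hostname,svc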

We attempted to remove the duplicate, inactive node using maprcli, and bad things happened.

First, we tried removing it using the hex hostid:

    [root@cldb mapr]# maprcli node remove -hostids 537f79b6d29aed87
    ERROR (10003) -  Error while trying to parse hostid 537f79b6d29aed87

Then we attempted to use the numeric serverid:

    [root@cldb mapr]# maprcli node remove -hostids 6016661453314649479
    ERROR (10009) -  Couldn't connect to the CLDB service

OOPS... it looks like something crashed the CLDB during this process.  The CLDB log showed the following:

    2013-03-13 20:22:59,288 ERROR CLDBServer [RPC-thread-8]: RPC: PROGRAMID: 2345 PROCEDUREID: 41 REQFROM: 1.1.1.1:57773 Exception during processing RPC null
    java.lang.NullPointerException
            at com.mapr.fs.cldb.topology.Topology.removeFileServer(Topology.java:2341)
            at com.mapr.fs.cldb.CLDBServer.fileServerRemove(CLDBServer.java:5533)
            at com.mapr.fs.cldb.CLDBServer.processRpc(CLDBServer.java:3251)
            at com.mapr.fs.cldb.CLDBServer.requestArrived(CLDBServer.java:2187)
            at com.mapr.fs.Rpc$RpcExecutor.run(Rpc.java:151)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
            at java.lang.Thread.run(Thread.java:722)

    2013-03-13 20:22:59,302 FATAL CLDB [RPC-thread-8]: CLDBShutdown: CldbError

After the CLDB restarted, the node that was INACTIVE with the year-long "Last Heartbeat" was no longer in the FileServers list, and the inactive NFS server line was gone as well.

We restarted Warden on the newly built server and the volumes were created properly:

    ---- Wed Mar 13 20:30:42 GMT 2013 --- /var/mapr/local/nodename is up
    ---- Wed Mar 13 20:31:12 GMT 2013 --- mfs is up and has sent the first full container report.
    mapr.nodename.local.mapred: DIR missing, VOL missing
    mapr.nodename.local.mapred: creating VOL
    maprcli volume create -name mapr.nodename.local.mapred -path /var/mapr/local/nodename/mapred -replication 1 -localvolumehost nodename -localvolumeport 5660 -shufflevolume true
    mapr.nodename.local.mapred: volume created and mounted
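
If it helps anyone else debugging the same thing, a quick follow-up check along these lines (names taken from the output above) confirms both halves of the fix:

    # the stale serverid and the inactive NFS entry should be gone from the node list
    maprcli node list -columns id,hostname,svc
    # the node-local mapred volume should now exist and be mounted
    maprcli volume info -name mapr.nodename.local.mapred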

Questions:

 1. What caused the CLDB to keep records for two separate serverids with the same hostname?  Was this something we did incorrectly when doing the rebuild?
 2. If we see multiple serverids like this again, will it prevent all new nodes from creating their /var/mapr/local volumes, or will it only impact the nodes that show up twice in the CLDB list?  We didn't test this.
 3. If we see this again with other nodes, what is the proper procedure for removing the erroneous entries without crashing the CLDB and without removing newly built nodes that share the same hostname?
