
Dropped all disks on CLDB node - cluster not happy

Question asked by akarjp on Nov 13, 2012
Latest reply on Nov 19, 2012 by mandoskippy
Hello all,

We were in the process of reconfiguring a development cluster (M3 2.0.1) to do some performance comparison testing. Due to limited resources we decided to drop 2 of the 3 disks that each node had installed, so on each node we went through the drop-all-disks, re-add-one-disk-to-MapR-FS dance. We had no problems until we dropped the disks on the one node that was running CLDB. This seemed to make the cluster very unhappy, and the CLDB could no longer be contacted. The steps I took to try and recover are roughly as follows (a rough sketch of the commands is below the list):

 1) Installed CLDB on another node
 2) Shut down warden on all nodes and ZooKeeper on the 3 ZooKeeper nodes
 3) Ran configure.sh on all nodes to point them at the new CLDB
 4) Started warden on all nodes and ZooKeeper on the 3 nodes
 5) Waited many hours
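For reference, here is roughly the shape of the commands involved, first the disk shuffle on each node and then the CLDB move. Treat this as a sketch only: the host names, disk device names, ports, and the package-install line are placeholders and assumptions, not a verbatim record of what we ran.

    # Disk dance on each data node (device names are just examples)
    maprcli disk remove -host <node> -disks /dev/sdb,/dev/sdc,/dev/sdd   # drop all disks from MapR-FS
    maprcli disk add -host <node> -disks /dev/sdb                        # re-add a single disk

    # Move CLDB: install the package on the new node (assuming an RPM-based install)
    yum install mapr-cldb

    # Stop services: warden on every node, ZooKeeper on the 3 ZK nodes
    service mapr-warden stop
    service mapr-zookeeper stop        # ZooKeeper nodes only

    # Re-run configure.sh on every node to point at the new CLDB (default ports 7222/5181)
    /opt/mapr/server/configure.sh -C newcldbnode:7222 -Z zk1:5181,zk2:5181,zk3:5181

    # Bring everything back up
    service mapr-zookeeper start       # ZooKeeper nodes only
    service mapr-warden start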

What I see in the cldb.log on the new node:

    2012-11-14 08:07:20,655 INFO  com.mapr.fs.cldb.CLDBServer [Lookup-thread-1]: RPC: PROGRAMID: 2345 PROCEDUREID: 5 from 10.2.xxx.125:1111 Rejecting rpc with status 30 as CLDB is not yet initialized.
    2012-11-14 08:07:21,091 INFO  com.mapr.fs.cldb.CLDBServer [Lookup-thread-1]: RPC: PROGRAMID: 2345 PROCEDUREID: 5 from 10.2.xxx.130:1111 Rejecting rpc with status 30 as CLDB is not yet initialized.
    2012-11-14 08:07:21,092 INFO  com.mapr.fs.cldb.CLDBServer [Lookup-thread-1]: RPC: PROGRAMID: 2345 PROCEDUREID: 5 from 10.2.xxx.130:1111 Rejecting rpc with status 30 as CLDB is not yet initialized.

What I see in the cldb.log on the original CLDB node:

    2012-11-14 08:07:26,856 INFO  com.mapr.fs.cldb.CLDBServer [Lookup-thread-1]: RPC: PROGRAMID: 2345 PROCEDUREID: 4 from 10.2.110.129:57427 Rejecting rpc with status 3 as CLDB is waiting for local kvstore to become master.
    2012-11-14 08:07:36,510 INFO  com.mapr.fs.cldb.CLDBServer [RPC-thread-4]: RPC: PROGRAMID: 2345 PROCEDUREID: 61 from 10.2.110.129:57465 Rejecting rpc with status 3 as CLDB is waiting for local kvstore to become master.
    2012-11-14 08:08:44,005 FATAL com.mapr.fs.cldb.CLDB shutdown [WaitForLocalKvstore Thread]: CLDBShutdown: CLDB had master lock and was waiting for its local mfs to become Master.Waited for 7 (minutes) but mfs did not become Master. Shutting down CLDB to release master lock.
    2012-11-14 08:08:44,007 INFO  com.mapr.fs.cldb.CLDBServer [WaitForLocalKvstore Thread]: Shutdown: Stopping CLDB
    2012-11-14 08:08:44,008 INFO  com.mapr.fs.cldb.CLDB [Thread-11]: CLDB ShutDown Hook called
    2012-11-14 08:08:44,008 INFO  com.mapr.fs.cldb.zookeeper.ZooKeeperClient [Thread-11]: Setting the clean cldbshutdown flag to true
    2012-11-14 08:08:44,023 INFO  com.mapr.fs.cldb.zookeeper.ZooKeeperClient [Thread-11]: Zookeeper Client: Closing client connection:
    2012-11-14 08:08:44,032 INFO  com.mapr.fs.cldb.CLDBServer [main-EventThread]: ZooKeeper event NodeDeleted on path /datacenter/controlnodes/cldb/active/CLDBMaster
    2012-11-14 08:08:44,032 INFO  com.mapr.fs.cldb.CLDBServer [main-EventThread]: ZooKeeper event of type: NodeDeleted on path /datacenter/controlnodes/cldb/active/CLDBMaster
    2012-11-14 08:08:44,032 INFO  com.mapr.fs.cldb.CLDB [Thread-11]: CLDB shutdown

Any pointers on getting CLDB working again? Note: this cluster saw very little use in the weeks leading up to the problem. We have some important data in the file system that we would like to preserve, but nothing that was modified recently. I can attach additional logs as needed.

Thanks...
