
CLDB Down; unable to restore from replica

Question asked by thealy on Oct 8, 2015
Latest reply on Feb 21, 2016 by mufeed
MapR BuildVersion: 3.1.0.23703.GA / M3

Cluster down after apparent power loss affecting CLDB and 2 of 3 CLDB replica machines.

Today I re-ran configure.sh on all nodes, using the original CLDB node after an attempt to use the remaining CLDB replica failed. For that attempt I followed the procedure support gave me in the past to switch to the remaining replica machine and bring it up as the CLDB, but it never finishes initializing. Some assorted log entries are below. Any suggestions on how to proceed would be very much appreciated.


cldb.log:

Some snippets from cldb.log (with dates removed for brevity)

09:00:20,585 INFO ClientCnxn [main-SendThread(hd17.sec.bnl.local:5181)]: Opening socket connection to server hd17.sec.bnl.local/192.168.4.17:5181. Will attempt to SASL-authenticate using Login Context section 'Client_simple'

09:00:20,586 INFO ClientCnxn [main-SendThread(hd17.sec.bnl.local:5181)]: Socket connection established to hd17.sec.bnl.local/192.168.4.17:5181, initiating session

09:00:20,612 INFO ClientCnxn [main-SendThread(hd17.sec.bnl.local:5181)]: Session establishment complete on server hd17.sec.bnl.local/192.168.4.17:5181, sessionid = 0x501981e71107c9, negotiated timeout = 30000

09:00:20,613 INFO CLDBServer [main-EventThread]: The CLDB received notification that a ZooKeeper event of type None occurred on path null

09:00:20,613 INFO CLDBServer [main-EventThread]: onZKConnect: The CLDB has successfully connected to the ZooKeeper server State:CONNECTED Timeout:30000 sessionid:0x501981e71107c9 local:/192.168.4.39:47549 remoteserver:hd17.sec.bnl.local/192.168.4.17:5181 lastZxid:0 xid:1 sent:2 recv:1 queuedpkts:0 pendingresp:0 queuedevents:0 in the ZooKeeper ensemble with connection string hd17.sec.bnl.local:5181,hd22.sec.bnl.local:5181,hd4.sec.bnl.local:5181

09:00:20,875 INFO CLDBServer [main-EventThread]: The CLDB received notification that a ZooKeeper event of type None occurred on path null

09:00:20,885 INFO CLDBServer [ZK-Connect]: Previous CLDB was not a clean shutdown waiting for 20000ms before attempting to become master

09:00:40,898 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: KvStore does not have epoch entry CLDB trying to wait until it is Ready
09:00:43,901 INFO ZooKeeperClient [ZK-Connect]: Waiting for local KvStoreContainer to become valid. KvStore ContainerInfo  Container ID:1 Master:192.168.4.46-232(186602747192494912) Servers:  192.168.4.46-232(186602747192494912) 192.168.4.20-232(660185602214392390) 192.168.4.57-232(4431655818004946071) 192.168.4.4-232(1300176099474568970) Inactive:  Unused:  Epoch:232 SizeMB:0 CLDB ServerID : 1097292582906516482

09:01:20,095 INFO CLDBServer [RPC-5]: Rejecting RPC 2345.40 from 192.168.4.4:53830 with status 30 as CLDB is not yet initialized.

09:02:20,109 INFO CLDBServer [RPC-8]: Rejecting RPC 2345.40 from 192.168.4.4:34566 with status 30 as CLDB is not yet initialized.

09:03:20,183 INFO CLDBServer [Lookup-3]: Rejecting RPC 2345.5 from 192.168.4.17:44717 with status 30 as CLDB is not yet initialized.

...continual repeats every ~1 min
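
To avoid pasting every repeat, here is a rough sketch of a script that condenses the entries above: it counts the rejected RPCs per client IP and pulls the master and server list for the KvStore container out of the "Waiting for local KvStoreContainer" line. The regexes are based only on the lines quoted in this post, and the function name summarize and the log path argument are my own placeholders, so treat it as an assumption that other CLDB builds format these lines the same way.

```python
import re
import sys
from collections import Counter

# Patterns derived only from the cldb.log lines quoted above (assumption:
# the format is stable within this build).
REJECT_RE = re.compile(
    r"Rejecting RPC (\S+) from (\d+\.\d+\.\d+\.\d+):\d+ with status (\d+)"
)
CONTAINER_RE = re.compile(
    r"Container ID:(\d+) Master:(\S+) Servers:\s+(.*?)\s+Inactive:"
)


def _ip(entry):
    """Return the IP part of an entry such as 192.168.4.46-232(1866...)."""
    return entry.split("-")[0]


def summarize(lines):
    """Count rejected RPCs per client IP and keep the last container info seen."""
    rejects = Counter()
    container = None
    for line in lines:
        m = REJECT_RE.search(line)
        if m:
            rejects[m.group(2)] += 1
            continue
        m = CONTAINER_RE.search(line)
        if m:
            container = {
                "id": int(m.group(1)),
                "master": _ip(m.group(2)),
                "servers": [_ip(s) for s in m.group(3).split()],
            }
    return rejects, container


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python summarize_cldb.py <path to cldb.log>
    with open(sys.argv[1], errors="replace") as fh:
        rejects, container = summarize(fh)
    print("KvStore container:", container)
    for ip, count in rejects.most_common():
        print(ip, count)
```

This only summarizes what the log already says (which nodes the CLDB lists for container 1, and which clients keep getting status 30); it does not talk to the cluster.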

tail hoststats.err

2015-10-07 10:52:53,5137 ERROR Client fs/client/fileclient/cc/client.cc:386 Thread: 140554700613408 Failed to initialize client for cluster cs.bnl.gov, error Read-only file system(30)

tail createTTVolume.21396.log

2015-10-07 10:54:52 DEBUG Command attempt 96 failed with return code 255 after 1 seconds, sleeping for 1 seconds
2015-10-07 10:54:53 DEBUG Launching "hadoop fs -stat /"

cldb.jsp

Container Location Database

CLDB mode : INITIALIZE

CLDB BuildVersion: 3.1.0.23703.GA
Master for CLDB volume ready: false

