
FATAL CLDB Exception - Again.

Question asked by Terry on Aug 9, 2017
Latest reply on Aug 9, 2017 by Terry

Using MapR Community Edition 5.2 on a Red Hat/CentOS cluster of ~40 systems.

We lost a disk in the CLDB node and the world ended. Having been through this drill several times a year, we shut down the CLDB, migrated to a node that had a copy of cid:1, set cldb.ignore.stale.zk=true, ran configure on all nodes (starting with the ZooKeepers), and brought things back up; the procedure is sketched below. Several waiting processes returned to work and all functions were back to normal. The bad disk in the former CLDB node was replaced; we removed the CLDB package, re-configured, and it restarted OK. Everything ran fine, including NFS, for ~6 hours. The first indication of failure was the NFS backup that runs after hours.
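For anyone following along, the failover procedure above looks roughly like this. The hostnames are placeholders and the exact paths/flags should be checked against your own install; this is a sketch of the steps, not a verbatim transcript:

<code>

# Find which nodes hold a replica of the CLDB volume (container 1)
/opt/mapr/bin/maprcli dump containerinfo -ids 1 -json

# On the failed CLDB node: stop warden so it stops trying to restart the dead CLDB
service mapr-warden stop

# On the replacement CLDB node (one holding a copy of cid:1):
# allow the CLDB to start even though ZooKeeper still has stale CLDB information
echo "cldb.ignore.stale.zk=true" >> /opt/mapr/conf/cldb.conf

# Re-run configure.sh on every node, ZooKeepers first, pointing at the new CLDB
/opt/mapr/server/configure.sh -C newcldb.example.com -Z zk1.example.com,zk2.example.com,zk3.example.com

# Restart ZooKeeper on the ZK nodes, then warden everywhere
service mapr-zookeeper restart
service mapr-warden start

# Confirm the replacement node has taken over as CLDB master
/opt/mapr/bin/maprcli node cldbmaster

</code>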

 

Lots of log digging seems to center on this from cldb.log (this is one of the attempted auto-restarts):

<code>

2017-08-09 04:01:50,344 FATAL CLDB [RPC-18]: CLDBShutdown: CldbError

...[Some INFO 'asked container role' noise removed here for brevity]

2017-08-09 04:01:50,345 FATAL CLDB [RPC-18]: CLDB Exception
java.lang.NullPointerException
at com.mapr.fs.cldb.util.Util.ipBelongsToServer(Util.java:204)
at com.mapr.fs.cldb.ErrorNotificationHandler.containerOnFileServerFail(ErrorNotificationHandler.java:200)
at com.mapr.fs.cldb.CLDBServer.containerOnFileServerFail(CLDBServer.java:8709)
at com.mapr.fs.cldb.CLDBServer.processContainerOpRpc(CLDBServer.java:4038)
at com.mapr.fs.cldb.CLDBServer.processRpc(CLDBServer.java:4223)
at com.mapr.fs.cldb.CLDBServer.requestArrived(CLDBServer.java:3101)
at com.mapr.fs.Rpc$RpcExecutor.run(Rpc.java:160)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)

...[Some INFO 'asked container role' noise removed here for brevity]

2017-08-09 04:01:50,358 ERROR CLDB [RPC-18]: Thread: Thread-6 ID: 21
2017-08-09 04:01:50,358 ERROR CLDB [RPC-18]: java.lang.Thread.sleep(Native Method)
2017-08-09 04:01:50,358 ERROR CLDB [RPC-18]: com.mapr.fs.cldb.WriteBackAtimeUpdater$IdleFlusher.run(WriteBackAtimeUpdater.java:71)
2017-08-09 04:01:50,358 ERROR CLDB [RPC-18]: Thread: main-SendThread(hd23.sec.bnl.local:5181) ID: 17
2017-08-09 04:01:50,358 ERROR CLDB [RPC-18]: sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
2017-08-09 04:01:50,358 ERROR CLDB [RPC-18]: sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:228)
2017-08-09 04:01:50,358 ERROR CLDB [RPC-18]: sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:81)
2017-08-09 04:01:50,358 ERROR CLDB [RPC-18]: sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
2017-08-09 04:01:50,358 ERROR CLDB [RPC-18]: sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
2017-08-09 04:01:50,358 ERROR CLDB [RPC-18]: org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:338)
2017-08-09 04:01:50,358 ERROR CLDB [RPC-18]: org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1070)
2017-08-09 04:01:50,358 ERROR CLDB [RPC-18]: Thread: HB-2 ID: 48
2017-08-09 04:01:50,358 ERROR CLDB [RPC-18]: sun.misc.Unsafe.park(Native Method)
2017-08-09 04:01:50,358 ERROR CLDB [RPC-18]: java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
2017-08-09 04:01:50,359 ERROR CLDB [RPC-18]: java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
2017-08-09 04:01:50,359 ERROR CLDB [RPC-18]: java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
2017-08-09 04:01:50,359 ERROR CLDB [RPC-18]: java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
2017-08-09 04:01:50,359 ERROR CLDB [RPC-18]: java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
2017-08-09 04:01:50,359 ERROR CLDB [RPC-18]: java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
2017-08-09 04:01:50,359 ERROR CLDB [RPC-18]: java.lang.Thread.run(Thread.java:722)

</code>

 

I have re-run configure on the ZooKeepers and CLDB, tried toggling the cldb.ignore.stale.zk=true setting, restarted warden, and so on; in other words, all my usual acts of desperation (a sketch of what I've tried is below). I can only think to migrate to another node holding a copy of cid:1 and try again, but I would like to understand the cause of the exception so I don't have to repeat this procedure so frequently.
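Concretely, the retries amount to something like the following (again with placeholder hostnames, and the exact flags may vary by version):

<code>

# Toggle the stale-ZK override on the CLDB node and bounce warden
vi /opt/mapr/conf/cldb.conf          # cldb.ignore.stale.zk=true (or comment it out)
service mapr-warden restart

# Re-run configure.sh on the ZooKeeper and CLDB nodes
/opt/mapr/server/configure.sh -C cldb.example.com -Z zk1.example.com,zk2.example.com,zk3.example.com

# Check that ZooKeeper has a quorum and that a CLDB master is registered
service mapr-zookeeper qstatus
/opt/mapr/bin/maprcli node cldbmaster

</code>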

Any suggestions?
