
CLDB Failure - AGAIN.

Question asked by Terry on Oct 4, 2016
Latest reply on Nov 30, 2016 by mufeed

Running 5.2; it ran for nearly 2 weeks under intensive data loading. A disk failed on the CLDB node, causing the whole cluster to go off-line. The restart after disk replacement got stuck, as ALWAYS seems to be the case, in the endless loop of:

 

INFO CLDBServer [RPC-3]: Rejecting RPC 2345.17 from 192.168.4.20:5660 with status 3 as CLDB is waiting for local kvstore to become master.

....See below

INFO CLDBServer [WaitForLocalKvstore Thread]: Shutdown: Stopping CLDB

 

 

Punctuated by dozens of these sequences:

2016-10-04 07:30:10,561 ERROR CLDB [WaitForLocalKvstore Thread]: Thread: DestroyJavaVM ID: 32
2016-10-04 07:30:10,561 ERROR CLDB [WaitForLocalKvstore Thread]: Thread: HB-1 ID: 38
2016-10-04 07:30:10,561 ERROR CLDB [WaitForLocalKvstore Thread]: sun.misc.Unsafe.park(Native Method)
2016-10-04 07:30:10,561 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
2016-10-04 07:30:10,561 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
2016-10-04 07:30:10,561 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
2016-10-04 07:30:10,561 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
2016-10-04 07:30:10,561 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
2016-10-04 07:30:10,561 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
2016-10-04 07:30:10,561 ERROR CLDB [WaitForLocalKvstore Thread]: java.lang.Thread.run(Thread.java:722)
2016-10-04 07:30:10,561 ERROR CLDB [WaitForLocalKvstore Thread]: Thread: RPC-7 ID: 50

 

 

But of course the local disks will not come on-line because the CLDB is down. Catch-22. Restarted warden - no help. Rebooted - no help. Posted here asking for help - no reply.

 

This is at least the 5th time this has happened to the cluster over the past 3+ years. If this goes like earlier attempts, I will try alternate CLDBs, which will fail. Then I'll reset it back to this CLDB, restart everything involved for 3-10 days, and then the same unit will come up magically with no trouble ever isolated.

 

Is this going to be the case forever? I'm losing hope. 
