
cldb problem after network power failure

Question asked by matt on Jul 2, 2012
Latest reply on Jul 3, 2012 by srivas
We have a version 1.2.3 cluster that is having some CLDB problems, which we believe are due to its network switch losing power during the storms on the US east coast on Friday. This morning (the Monday after the storm) we noticed that MapR-FS seemed to be completely offline, and our mapr-nfsserver logs indicated that the CLDB connections had been reset.

After a lot of investigation, we surmised that simply restarting our CLDB node might solve the problem. CLDB had gone down for reasons we couldn't determine, and a mapr-warden stop/start wasn't able to bring it back up. After a full node restart, CLDB still fails to start, logging "CLDB running with stale ZooKeeper information." in cldb.log.

The other information we have is that `maprcli dump cldbnodes -cluster <cluster-name> -zkconnect <zk-string> -json` reports three valid entries for the CLDB data.

We found another question ([about restoring cldb after a failure][1]) that seemed to indicate we can force CLDB to start by setting `cldb.ignore.stale.zk=true` in /opt/mapr/conf/cldb.conf. It is highly unlikely that anybody was writing any critical data to the cluster, but I thought I would sanity-check this solution before we attempt it.
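For reference, this is roughly what we plan to run on the CLDB node, assuming the property just needs to be appended to cldb.conf and picked up on a warden restart (the service name/path may differ on other installs):

```shell
# Back up the config before touching it
cp /opt/mapr/conf/cldb.conf /opt/mapr/conf/cldb.conf.bak

# Tell CLDB to ignore the stale ZooKeeper check
# (assumes the key is not already present in the file)
echo "cldb.ignore.stale.zk=true" >> /opt/mapr/conf/cldb.conf

# Restart warden on the CLDB node so the setting takes effect
service mapr-warden restart
```

Please correct me if the flag should be removed again after a successful start.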

Is there more troubleshooting we should be doing to resolve the problem? If we resort to telling CLDB to ignore the stale ZooKeeper data, are there steps we should take to check the health of the cluster afterward?
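For context, here is the tentative post-recovery checklist we were planning to run ourselves; exact flags may differ on 1.2.3, so please point out anything better:

```shell
# 1. Confirm services are up on every node
maprcli node list -columns hostname,svc

# 2. Confirm volumes are listed and mounted
maprcli volume list

# 3. Spot-check that the filesystem itself is readable
hadoop fs -ls /
```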

Thanks for the assistance.