Unable to start Zookeeper on one or more nodes due to "Unable to load database on disk" error

Document created by jbubier Employee on Feb 7, 2016

Author: Jonathan Bubier

 

Original Publication Date: November 10, 2014

 

After a restart of the Zookeeper service or a full node restart, the Zookeeper service may fail to start during initialization.  In that case the Zookeeper log (zookeeper.log) under /opt/mapr/zookeeper/zookeeper-<version>/logs/, where <version> is the installed Zookeeper version, may show an error sequence similar to the following:

 

Starting zookeeper ...
2014-11-10 15:32:48,111 [myid:] - INFO [main:QuorumPeerConfig@101] - Reading configuration from: /opt/mapr/zookeeper/zookeeper-3.4.5/conf/zoo.cfg
2014-11-10 15:32:48,116 [myid:] - INFO [main:QuorumPeerConfig@334] - Defaulting to majority quorums
2014-11-10 15:32:48,121 [myid:1] - INFO [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 3
2014-11-10 15:32:48,122 [myid:1] - INFO [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 24
2014-11-10 15:32:48,123 [myid:1] - INFO [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
2014-11-10 15:32:48,132 [myid:1] - INFO [main:QuorumPeerMain@127] - Starting quorum peer
2014-11-10 15:32:48,133 [myid:1] - INFO [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.
2014-11-10 15:32:48,154 [myid:1] - INFO [main:Login@293] - successfully logged in.
2014-11-10 15:32:48,157 [myid:1] - INFO [main:NIOServerCnxnFactory@94] - binding to port 0.0.0.0/0.0.0.0:5181
2014-11-10 15:32:48,168 [myid:1] - INFO [main:QuorumPeer@913] - tickTime set to 2000
2014-11-10 15:32:48,168 [myid:1] - INFO [main:QuorumPeer@933] - minSessionTimeout set to -1
2014-11-10 15:32:48,169 [myid:1] - INFO [main:QuorumPeer@944] - maxSessionTimeout set to -1
2014-11-10 15:32:48,169 [myid:1] - INFO [main:QuorumPeer@959] - initLimit set to 20
2014-11-10 15:32:48,179 [myid:1] - INFO [main:FileSnap@83] - Reading snapshot /opt/mapr/zkdata/version-2/snapshot.100000026
2014-11-10 15:32:48,206 [myid:1] - ERROR [main:FileTxnSnapLog@210] - Parent /services_config/hoststats missing for /services_config/hoststats/host1
2014-11-10 15:32:48,207 [myid:1] - ERROR [main:QuorumPeer@453] - Unable to load database on disk
java.io.IOException: Failed to process transaction type: 1 error: KeeperErrorCode = NoNode for /services_config/hoststats
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:153)
    at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:417)
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:409)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /services_config/hoststats
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:211)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:151)
    ... 6 more
2014-11-10 15:32:48,208 [myid:1] - ERROR [main:QuorumPeerMain@89] - Unexpected exception, exiting abnormally

 

The 'Unable to load database on disk' error indicates corruption in the snapshot file /opt/mapr/zkdata/version-2/snapshot.100000026.  Specifically, the parent of the znode /services_config/hoststats/host1 does not exist, so the snapshot cannot be loaded.  As a result, Zookeeper shuts down and cannot join the running Zookeeper quorum.  The missing parent znode is not always /services_config/hoststats; this problem has also been observed for the /services_config/fileserver/ parent znode and can occur for others.  The issue has been observed in multiple Zookeeper versions, and its root cause is under investigation.
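
To check a log excerpt for this failure without reading it line by line, something like the following may help. It is only a sketch: the function name diagnose and the regular expression are illustrative, and the pattern is based solely on the two ERROR lines shown in the excerpt above, so other Zookeeper versions may word them differently.

```python
import re

# Wording taken from the ERROR lines in the log excerpt above (assumed stable).
MISSING_PARENT = re.compile(r"Parent (\S+) missing for (\S+)")
LOAD_FAILURE = "Unable to load database on disk"


def diagnose(log_text):
    """Return (load_failed, [(parent, child), ...]) for a zookeeper.log excerpt."""
    load_failed = LOAD_FAILURE in log_text
    missing = MISSING_PARENT.findall(log_text)
    return load_failed, missing


sample = (
    "2014-11-10 15:32:48,206 [myid:1] - ERROR [main:FileTxnSnapLog@210] - "
    "Parent /services_config/hoststats missing for /services_config/hoststats/host1\n"
    "2014-11-10 15:32:48,207 [myid:1] - ERROR [main:QuorumPeer@453] - "
    "Unable to load database on disk\n"
)
print(diagnose(sample))
```

The list of (parent, child) pairs shows which parent znode the snapshot is missing, which is useful to record before contacting support.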

 

The resolution to this issue is as follows:
1. Shut down the Zookeeper service if it is running on the problematic node(s).
2. Move (but do not delete) the contents of /opt/mapr/zkdata/version-2/ to another location on the problematic node(s).
3. Restart the Zookeeper service.
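
Step 2 can be sketched as a small helper that moves every entry out of the data directory into a timestamped backup directory and deletes nothing. The function name move_zkdata_aside and the backup location in the commented usage line are hypothetical; stopping and restarting the service (steps 1 and 3) is left to the operator.

```python
import shutil
import time
from pathlib import Path


def move_zkdata_aside(data_dir, backup_root):
    """Move every entry in data_dir into a new timestamped directory.

    Nothing is deleted; data_dir itself stays in place but ends up empty.
    Returns the path of the backup directory that was created.
    """
    data_dir = Path(data_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup_dir = Path(backup_root) / ("version-2." + stamp)
    # Fail rather than mix files into an existing backup directory.
    backup_dir.mkdir(parents=True, exist_ok=False)
    for entry in sorted(data_dir.iterdir()):
        shutil.move(str(entry), str(backup_dir / entry.name))
    return backup_dir


# Hypothetical usage, run only after the Zookeeper service is stopped:
# move_zkdata_aside("/opt/mapr/zkdata/version-2", "/opt/mapr/zkdata-backup")
```

Keeping the moved files in a dated directory makes it easy to hand them to support later or to restore them if needed.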

 

If there is a healthy, running Zookeeper quorum and this problem affects only one node, these steps can safely be used to get the problematic node back into the quorum.  With the zkdata files moved out, the problematic node can get a snapshot from the functioning Zookeeper nodes and initialize without an issue.  After taking the above steps, monitor the Zookeeper log (zookeeper.log) and use the 'service mapr-zookeeper qstatus' command to confirm that the service starts up correctly.
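
For scripted checks, the status command can be wrapped as in the sketch below. The helper name run_status is hypothetical, and the code deliberately does not parse the command's output, since its exact format is not described here; it only returns the exit code and the text for a person to read.

```python
import subprocess

# Command quoted in the text above; its output format is not assumed.
QSTATUS_CMD = ("service", "mapr-zookeeper", "qstatus")


def run_status(cmd=QSTATUS_CMD, timeout=30):
    """Run a status command and return (exit_code, combined_output)."""
    result = subprocess.run(
        list(cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into the same text
        universal_newlines=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout


# Hypothetical usage on a Zookeeper node:
# code, output = run_status()
# print(code, output)
```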

 

If this problem affects multiple Zookeeper nodes and the Zookeeper quorum cannot be started as a result, please engage MapR Support, as the recovery process is more involved.  Please collect a support dump from each Zookeeper node in the cluster (the output of /opt/mapr/support/tools/mapr-support-dump.sh) and save the contents of /opt/mapr/zkdata/version-2/ from the problematic nodes.
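
One way to save the contents of the data directory for support is to pack them into a compressed archive, as in this sketch. The function name archive_zkdata and the archive path in the commented usage line are hypothetical; the archive should be written outside the directory being archived so it does not include itself.

```python
import tarfile
from pathlib import Path


def archive_zkdata(data_dir, archive_path):
    """Write a gzip-compressed tar of data_dir; data_dir is not modified."""
    data_dir = Path(data_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # Store entries under the directory's own name, e.g. "version-2/...".
        tar.add(str(data_dir), arcname=data_dir.name)
    return Path(archive_path)


# Hypothetical usage on a problematic node:
# archive_zkdata("/opt/mapr/zkdata/version-2", "/tmp/zkdata-version-2.tar.gz")
```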
