AnsweredAssumed Answered

zookeeper fails on "Only 0 followers, need 1"

Question asked by dimamah on Aug 12, 2013
Latest reply on Aug 14, 2013 by dimamah
A query i ran failed on MapRedTask. In the logs its clear that the problem was with the connection to the zookeeper : 
  
    13/08/13 03:44:03 INFO exec.MapRedTask: 2013-08-13 03:44:03,542 Stage-1 map = 16%,  reduce = 0%
    13/08/13 03:44:06 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 15000ms for sessionid 0x405e0eb6f80334, closing socket connection and attempting reconnect
    13/08/13 03:44:07 INFO zookeeper.ClientCnxn: Opening socket connection to server server-02/XXX.YYY.Z.82:5181
    13/08/13 03:44:07 INFO zookeeper.ClientCnxn: Socket connection established to server-02/XXX.YYY.Z.82:5181, initiating session
    13/08/13 03:44:18 INFO zookeeper.ClientCnxn: EventThread shut down
    13/08/13 03:44:18 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x405e0eb6f80334 has expired, closing socket connection
    2013-08-13 03:45:04,098 Stage-1 map = 16%,  reduce = 0%
    13/08/13 03:45:04 INFO exec.MapRedTask: 2013-08-13 03:45:04,098 Stage-1 map = 16%,  reduce = 0%
    2013-08-13 03:45:31,167 Stage-1 map = 100%,  reduce = 100%
    13/08/13 03:45:31 INFO exec.MapRedTask: 2013-08-13 03:45:31,167 Stage-1 map = 100%,  reduce = 100%
    Ended Job = job_201307140956_24940 with errors
    13/08/13 03:45:31 ERROR exec.MapRedTask: Ended Job = job_201307140956_24940 with errors
    FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
    13/08/13 03:45:31 ERROR ql.Driver: FAILED: Execution Error, return code 2 from  org.apache.hadoop.hive.ql.exec.MapRedTask



The zookeeper is running on two nodes (out of a 5 node cluster **maybe this is the problem?**) 

**In the Leaders log we see :** 

    2013-08-13 03:43:24,012 - INFO  [SessionTracker:ZooKeeperServer@314] - Expiring session 0x406dea36820191, timeout of 30000ms exceeded
    2013-08-13 03:43:52,581 - INFO  [SessionTracker:ZooKeeperServer@314] - Expiring session 0x3fdbf5f9d1137e, timeout of 30000ms exceeded
    2013-08-13 03:44:01,687 - INFO  [SessionTracker:ZooKeeperServer@314] - Expiring session 0x1406dea369a0039, timeout of 30000ms exceeded
    2013-08-13 03:43:46,105 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:Leader@390] - Shutdown called
    java.lang.Exception: shutdown Leader! reason: Only 0 followers, need 1
            at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:390)
            at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:367)
            at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:657)

    2013-08-13 03:44:21,755 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:QuorumPeer@620] - LOOKING
    2013-08-13 03:44:22,109 - WARN  [LearnerHandler-/XXX.YYY.Z.82:45686:LearnerHandler@479] - ******* GOODBYE /XXX.YYY.Z.82:45686 ********
    2013-08-13 03:44:22,439 - WARN  [LearnerHandler-/XXX.YYY.Z.82:45686:LearnerHandler@486] - Ignoring unexpected exception
    java.lang.InterruptedException
            at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1222)
            at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
            at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:269)
            at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:484)
    2013-08-13 03:44:23,121 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:FileSnap@82] - Reading snapshot /opt/mapr/zkdata/version-2/snapshot.2100000e68

    2013-08-13 03:44:23,121 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:FileSnap@82] - Reading snapshot /opt/mapr/zkdata/version-2/snapshot.2100000e68
    2013-08-13 03:44:23,988 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 0 (n.leader), 146028889781 (n.zxid), 4 (n.round), LOOKING (n.state), 0 (n.sid), LOOKING (my state)
    {repreats 4 times}
    2013-08-13 03:44:40,343 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 0 (n.leader), 146028889781 (n.zxid), 4 (n.round), LOOKING (n.state), 0 (n.sid), LOOKING (my state)
    2013-08-13 03:44:45,109 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:FastLeaderElection@663] - New election. My id =  1, Proposed zxid = 146028889780
    2013-08-13 03:44:45,109 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:FastLeaderElection@721] - Updating proposal
    2013-08-13 03:44:45,221 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 1 (n.leader), 146028889780 (n.zxid), 4 (n.round), LOOKING (n.state), 1 (n.sid), LOOKING (my state)
    2013-08-13 03:44:45,222 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:QuorumPeer@642] - FOLLOWING
    2013-08-13 03:44:45,369 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 0 (n.leader), 146028889781 (n.zxid), 4 (n.round), LOOKING (n.state), 1 (n.sid), FOLLOWING (my state)
    2013-08-13 03:44:45,370 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 0 (n.leader), 146028889781 (n.zxid), 4 (n.round), FOLLOWING (n.state), 1 (n.sid), FOLLOWING (my state)
    2013-08-13 03:44:47,272 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:Learner@72] - TCP NoDelay set to: true
    2013-08-13 03:44:47,274 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:ZooKeeperServer@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /opt/mapr/zkdata/version-2 snapdir /opt/mapr/zkdata/version-2
    2013-08-13 03:44:48,941 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:Learner@285] - Getting a snapshot from leader
    2013-08-13 03:44:48,952 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:Learner@315] - Setting leader epoch 23
    2013-08-13 03:44:51,156 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:FileTxnSnapLog@208] - Snapshotting: 2300000000
    2013-08-13 03:44:51,157 - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:Follower@116] - Got zxid 0x2300000001 expected 0x1

**The follower log :**

    2013-08-13 03:44:18,529 - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:Learner@335] - Missing session 0x3fdbf5f9d11a16 for validation
    2013-08-13 03:44:18,532 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:5181:NIOServerCnxn@634] - EndOfStreamException: Unable to read additional data from client sessionid 0x1406dea369a0038, likely client has closed socket
    2013-08-13 03:44:18,651 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:5181:NIOServerCnxn@1435] - Closed socket connection for client /XXX.YYY.Z.83:38785 which had sessionid 0x3fdbf5f9d11a15
    2013-08-13 03:44:18,833 - INFO  [CommitProcessor:0:NIOServerCnxn@1435] - Closed socket connection for client /XXX.YYY.Z.82:45748 which had sessionid 0x1405e0eb7300000
    2013-08-13 03:44:18,833 - ERROR [CommitProcessor:0:NIOServerCnxn@445] - Unexpected Exception:
    java.nio.channels.CancelledKeyException
            at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
            at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
            at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
            at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
            at org.apache.zookeeper.server.NIOServerCnxn.process(NIOServerCnxn.java:1545)
            at org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:115)
            at org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:87)
            at org.apache.zookeeper.server.DataTree.deleteNode(DataTree.java:574)
            at org.apache.zookeeper.server.DataTree.killSession(DataTree.java:814)
            at org.apache.zookeeper.server.DataTree.processTxn(DataTree.java:786)
            at org.apache.zookeeper.server.ZKDatabase.processTxn(ZKDatabase.java:299)
            at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:103)
            at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
    2013-08-13 03:44:21,759 - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:Follower@82] - Exception when following the leader
    java.io.EOFException
            at java.io.DataInputStream.readInt(DataInputStream.java:392)
            at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
            at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:84)
            at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
            at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:140)
            at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:78)
            at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
    2013-08-13 03:44:21,760 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:Follower@166] - shutdown called
    java.lang.Exception: shutdown Follower
            at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
            at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
    2013-08-13 03:44:21,760 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:NIOServerCnxn@1435] - Closed socket connection for client /XXX.YYY.Z.82:45766 which had sessionid 0x405e0eb6f8033b
    2013-08-13 03:44:22,352 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:FinalRequestProcessor@378] - shutdown of request processor complete
    2013-08-13 03:44:22,352 - INFO  [CommitProcessor:0:CommitProcessor@148] - CommitProcessor exited loop!
    2013-08-13 03:44:22,352 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:5181:NIOServerCnxn$Factory@251] - Accepted socket connection from /XXX.YYY.Z.82:53335
    2013-08-13 03:44:22,352 - INFO  [FollowerRequestProcessor:0:FollowerRequestProcessor@93] - FollowerRequestProcessor exited loop!
    2013-08-13 03:44:22,353 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:5181:NIOServerCnxn@639] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
    2013-08-13 03:44:22,464 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:5181:NIOServerCnxn@1435] - Closed socket connection for client /XXX.YYY.Z.82:53335 (no session established for client)
    2013-08-13 03:44:22,353 - INFO  [SyncThread:0:SyncRequestProcessor@151] - SyncRequestProcessor exited!
    2013-08-13 03:44:22,464 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:QuorumPeer@620] – LOOKING
    2013-08-13 03:44:29,093 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:FastLeaderElection@697] - Notification time out: 6400
    2013-08-13 03:44:29,093 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 0 (n.leader), 146028889781 (n.zxid), 4 (n.round), LOOKING (n.state), 0 (n.sid), LOOKING (my state)
    2013-08-13 03:44:45,454 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:QuorumPeer@654] - LEADING
    2013-08-13 03:44:46,187 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:Leader@54] - TCP NoDelay set to: true
    2013-08-13 03:44:46,276 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:5181:ZooKeeperServer@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /opt/mapr/zkdata/version-2 snapdir /opt/mapr/zkdata/version-2
    2013-08-13 03:44:48,897 - INFO  [LearnerHandler-/XXX.YYY.Z.83:34047:LearnerHandler@247] - Follower sid: 1 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@48c87b2
    2013-08-13 03:44:48,906 - WARN  [LearnerHandler-/XXX.YYY.Z.83:34047:LearnerHandler@326] - Sending snapshot last zxid of peer is 0x22000006b4  zxid of leader is 0x2300000000
    2013-08-13 03:44:48,970 - WARN  [LearnerHandler-/XXX.YYY.Z.83:34047:Leader@471] - Commiting zxid 0x2300000000 from /XXX.YYY.Z.82:2888 not first!
    2013-08-13 03:44:48,970 - WARN  [LearnerHandler-/XXX.YYY.Z.83:34047:Leader@473] - First is 0
    2013-08-13 03:44:48,971 - INFO  [LearnerHandler-/XXX.YYY.Z.83:34047:Leader@497] - Have quorum of supporters; starting up and setting last processed zxid: 150323855360


From what i understand the first thing that happened is `Shutdown called java.lang.Exception: shutdown Leader! reason: Only 0 followers, need 1` in the leader log but in the follower log there was no indication on any timeouts or problems prior to that time (03:43).

Any ideas why this happened or how can i investigate further?


**EDIT :** 
I forgot to mention that obviously the CLDB failed and restarted normally.

Outcomes