
zookeeper won't rejoin quorum

Question asked by pbuster on Oct 24, 2017
Latest reply on Oct 25, 2017 by deborah

We needed to relocate a ZooKeeper node to another rack. After running 'service mapr-zookeeper stop' and then 'service mapr-zookeeper start', the service restarts but does not rejoin the quorum. This is MapR 3.1, M3.

 

# /usr/local/zookeeper/bin/zkServer.sh qstatus
JMX enabled by default
Using config: /usr/local/zookeeper/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.

 

zoo.cfg is the same on all three nodes. This is node 0; node 2 is the current leader.

 

tickTime=2000
initLimit=20
syncLimit=10
dataDir=/opt/mapr/zkdata
clientPort=5181
autopurge.purgeInterval=24
superUser=root
readUser=anyone
mapr.cldbkeyfile.location=/opt/mapr/conf/cldb.key
authMech=SIMPLE-SECURITY
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
mapr.usemaprserverticket=true

maxClientCnxns=256
maxSessionTimeout=180000
server.0=10.18.1.9:2888:3888
server.1=10.18.1.19:2888:3888
server.2=10.18.2.9:2888:3888
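
One thing worth ruling out after a move is a mismatch between the node's id and the server.N entries. Below is a minimal sketch in Python that parses the server.N lines and checks an id against them. It assumes the standard ZooKeeper convention of a myid file inside dataDir (here that would be /opt/mapr/zkdata/myid); the function names parse_servers and myid_is_listed are made up for illustration and are not part of any MapR or ZooKeeper tool.

```python
import re

# Matches lines such as "server.0=10.18.1.9:2888:3888".
SERVER_LINE = re.compile(r"^server\.(\d+)=([^:\s]+):(\d+):(\d+)\s*$")


def parse_servers(cfg_text):
    """Return {server_id: (host, quorum_port, election_port)} from zoo.cfg text."""
    servers = {}
    for line in cfg_text.splitlines():
        match = SERVER_LINE.match(line.strip())
        if match:
            sid, host, quorum_port, election_port = match.groups()
            servers[int(sid)] = (host, int(quorum_port), int(election_port))
    return servers


def myid_is_listed(myid_text, servers):
    """True if the contents of a myid file name one of the server.N ids."""
    text = myid_text.strip()
    return text.isdigit() and int(text) in servers


# The three server lines from the zoo.cfg above.
SAMPLE = """\
server.0=10.18.1.9:2888:3888
server.1=10.18.1.19:2888:3888
server.2=10.18.2.9:2888:3888
"""

servers = parse_servers(SAMPLE)
print(servers[0])                      # ('10.18.1.9', 2888, 3888)
print(myid_is_listed("0\n", servers))  # True
```

On a real node the two inputs would come from reading zoo.cfg and the myid file instead of the inline sample.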

 

I tried these steps from another post:

 

1. Shut down the Zookeeper service if it is running on the problematic node(s)
2. Move out (but do not delete) the contents of /opt/mapr/zkdata/version-2/ on the problematic node(s)
3. Restart the Zookeeper service.
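
Step 2 above can be sketched roughly as follows. This is only an illustration in Python: the function name move_snapshots_aside and the timestamped backup folder layout are my own assumptions, not something from the original post or from MapR. It moves the files and deletes nothing.

```python
import os
import shutil
import time


def move_snapshots_aside(version_dir, backup_root):
    """Move every entry in version_dir into a new timestamped folder under backup_root.

    Nothing is deleted. Returns the path of the backup folder that was created.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup_dir = os.path.join(backup_root, "version-2." + stamp)
    os.makedirs(backup_dir)  # raises if the folder already exists
    for name in os.listdir(version_dir):
        shutil.move(os.path.join(version_dir, name),
                    os.path.join(backup_dir, name))
    return backup_dir
```

The intended order would be: 'service mapr-zookeeper stop', then a call such as move_snapshots_aside("/opt/mapr/zkdata/version-2", "/opt/mapr/zkdata.bak") where the backup location is a hypothetical path, then 'service mapr-zookeeper start'.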

 

The logs on the local server (node 0) repeat this:

 

2017-10-24 11:10:19,263 [myid:0] - INFO [QuorumPeer[myid=0]/0.0.0.0:5181:QuorumCnxManager@190] - Have smaller server identifier, so dropping the connection: (2, 0)
2017-10-24 11:10:19,264 [myid:0] - INFO [QuorumPeer[myid=0]/0.0.0.0:5181:QuorumCnxManager@190] - Have smaller server identifier, so dropping the connection: (1, 0)
2017-10-24 11:10:19,265 [myid:0] - INFO [QuorumPeer[myid=0]/0.0.0.0:5181:FastLeaderElection@774] - Notification time out: 25600
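
As I understand the "Have smaller server identifier" message, node 0 closes its own outgoing election connection to a peer with a higher id and then waits for that peer to connect back to it, so the higher-id nodes need to be able to reach node 0 on the election port (3888 in the config above). A minimal reachability sketch in Python, under that assumption; can_connect and check_ports are illustrative names, and a successful TCP connect only shows that something is listening, not that the quorum protocol is healthy.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def check_ports(host, ports=(2888, 3888, 5181), timeout=3.0):
    """Return {port: reachable} for the quorum, election, and client ports."""
    return {port: can_connect(host, port, timeout) for port in ports}
```

Run from node 1 or node 2, check_ports("10.18.1.9") would show whether node 0 is reachable from the new rack. Note that the quorum port (2888) is normally expected to answer only on the current leader, so a False for 2888 on a non-leader is not by itself a problem.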

 

The logs on the leader (node 2) repeat this:

 

2017-10-24 10:59:21,688 [myid:2] - WARN [SendWorker:0:QuorumCnxManager$SendWorker@679] - Interrupted while waiting for message on queue
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:831)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:62)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:667)
2017-10-24 10:59:21,688 [myid:2] - WARN [RecvWorker:0:QuorumCnxManager$RecvWorker@762] - Connection broken for id 0, my id = 2, error =
java.net.SocketException: Socket closed
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:747)

 

I'm finding search hits that suggest this is a ZooKeeper bug that requires a rolling restart of all the ZooKeeper nodes. Is that right?

 

thanks
