
HB Master Down on Web UI

Question asked by thealy on May 22, 2013
Latest reply on Jun 7, 2013 by thealy
Running v2.1.3 / M3 / hbase-0.94.5

I have 3 HBase Masters configured as recommended, co-located with ZooKeeper. I have a node alarm on one of the two inactive masters; the last line in its hbasemaster log shows "Terminating master". The Web UI shows "failed". The last few lines of this log are shown below as "Node with alarm (HD40)". If it matters, pressing "Start" has no visible effect in the log, nor does "maprcli node services -nodes hd40 -hbmaster restart".

On the other inactive master, there is no "Terminating master" message and no alarm. Its log also shows no activity for ~3 days, but HBase is working fine on the active master. The last lines of this log are shown below as "Node with No alarm (HD4)".

How can I get rid of the alarm, and why are the states of the two inactive masters different?

<code>
Node with alarm (HD40):
2013-05-20 15:57:44,836 INFO org.apache.hadoop.hbase.master.metrics.MasterMetrics: Initialized
2013-05-20 15:57:44,877 INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/master already exists and this is not a retry
2013-05-20 15:57:44,878 INFO org.apache.hadoop.hbase.master.ActiveMasterManager: Adding ZNode for /hbase/backup-masters/hd40.www.yyy.local,60000,1369079864567 in backup master directory
2013-05-20 15:57:44,887 INFO org.apache.hadoop.hbase.master.ActiveMasterManager: Another master is the active master, hd17.xxx.yyy.zzz,60000,1369066152343; waiting to become the next active master
Mon May 20 16:12:35 EDT 2013 Terminating master


Node with No alarm (HD4):
2013-05-20 14:38:02,626 INFO org.apache.hadoop.hbase.master.metrics.MasterMetrics: Initialized
2013-05-20 14:38:02,665 INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/master already exists and this is not a retry
2013-05-20 14:38:02,665 INFO org.apache.hadoop.hbase.master.ActiveMasterManager: Adding ZNode for /hbase/backup-masters/hd4.xxx.yyy.zzz,60000,1369075082313 in backup master directory
2013-05-20 14:38:02,672 INFO org.apache.hadoop.hbase.master.ActiveMasterManager: Another master is the active master, hd17.xxx.yyy.zzz,60000,1369066152343; waiting to become the next active master
2013-05-20 14:56:57,212 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x13ec33485750003, likely server has closed socket, closing socket connection and attempting reconnect
2013-05-20 14:56:57,758 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hd17.www.yyy.local/192.168.4.17:5181. Will not attempt to authenticate using SASL (unknown error)
2013-05-20 14:56:57,759 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hd17.www.yyy.local/192.168.4.17:5181, initiating session
2013-05-20 14:56:57,762 WARN org.apache.zookeeper.ClientCnxnSocket: Connected to an old server; r-o mode will be unavailable
2013-05-20 14:56:57,762 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server hd17.www.yyy.local/192.168.4.17:5181, sessionid = 0x13ec33485750003, negotiated timeout = 40000
</code>
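
To compare the two log tails more mechanically, here is a minimal Python sketch. It is only an illustration: `summarize_master_log` is a hypothetical helper, not part of MapR or HBase, and it assumes the log lines look like the excerpts above (log4j-style timestamped lines, plus a bare "Terminating master" line written when the process stops).

```python
import re

# Timestamped log4j-style line, as in the excerpts above:
# "2013-05-20 15:57:44,836 INFO org.apache...: message"
LOG_LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} (\w+) (\S+): (.*)$"
)

def summarize_master_log(lines):
    """Summarize the tail of a backup master's log (hypothetical helper)."""
    non_empty = [ln.strip() for ln in lines if ln.strip()]
    # The alarmed node's log ends with an untimestamped "Terminating master" line.
    terminated = bool(non_empty) and non_empty[-1].lower().endswith("terminating master")
    # Both backup masters log this message after registering under /hbase/backup-masters.
    waiting = any("waiting to become the next active master" in ln for ln in non_empty)
    # Collect the log level of every line that matches the timestamped format.
    levels = [m.group(1) for m in map(LOG_LINE.match, non_empty) if m]
    return {
        "terminated": terminated,
        "waiting_as_backup": waiting,
        "warnings": levels.count("WARN"),
    }

# Usage with the last two lines from the HD40 excerpt:
hd40_tail = [
    "2013-05-20 15:57:44,887 INFO org.apache.hadoop.hbase.master.ActiveMasterManager: "
    "Another master is the active master, hd17.xxx.yyy.zzz,60000,1369066152343; "
    "waiting to become the next active master",
    "Mon May 20 16:12:35 EDT 2013 Terminating master",
]
print(summarize_master_log(hd40_tail))
# prints {'terminated': True, 'waiting_as_backup': True, 'warnings': 0}
```

Run against the HD4 excerpt, the same function would report `terminated` as False, which matches the difference described above: both nodes registered as backup masters, but only HD40's master process exited.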
