HBase distributed log splitting is not progressing after region server restart

Document created by wade on Feb 27, 2016

Author: Nabeel Moidu, last modified by Jonathan Bubier on May 11, 2015

 

Original Publication Date: April 30, 2015

 

Environment

  • HBase 0.94.X

 

Symptom
After a region server restarts, the HBase master assigns that region server's hlog file to be split in distributed mode. The distributed log splitting then does not complete for many hours. The status can be viewed in the Tasks running section of the HBase master UI page, http://<hbase-master-node>:60010/master.jsp. The region server is not assigned any regions until the log splitting completes.

Root Cause
In distributed log splitting, multiple region servers split a single WAL file, as assigned by the Split Log Manager component of the HBase master. The Split Log Worker component on each region server picks up the tasks assigned to it by the master. While a region server performs its split, it also records its progress in ZooKeeper. If the recovery status of the WAL file and the ZooKeeper znode values become inconsistent at any point, the split gets stuck. See HBASE-3890 (https://issues.apache.org/jira/browse/HBASE-3890) and MapR Bug 7897 for more information.
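
To see what the split tasks look like in ZooKeeper while a split appears stuck, the task znodes can be listed directly. The following is a minimal sketch, not a supported tool. It assumes the split tasks live under /hbase/splitlog (the stock HBase 0.94 default when zookeeper.znode.parent is /hbase), that task znode names are URL-encoded WAL paths, and that the third-party kazoo client library is installed. The host name zk-node and port 5181 are placeholders to replace with the cluster's own ZooKeeper address. The znode data is printed raw rather than interpreted.

```python
from urllib.parse import unquote_plus


def list_split_tasks(zk, splitlog_path="/hbase/splitlog"):
    """Return {decoded_task_name: raw_znode_bytes} for each split task znode.

    `zk` is any client exposing get_children(path) and get(path) -> (data, stat),
    for example a started kazoo.client.KazooClient.
    """
    tasks = {}
    for child in zk.get_children(splitlog_path):
        data, _stat = zk.get(splitlog_path + "/" + child)
        # Assumption: task znode names are URL-encoded WAL file paths.
        tasks[unquote_plus(child)] = data
    return tasks


def main(zk_hosts="zk-node:5181"):
    # Placeholder address: substitute the cluster's ZooKeeper host and port.
    from kazoo.client import KazooClient  # third-party: pip install kazoo

    zk = KazooClient(hosts=zk_hosts)
    zk.start()
    try:
        for name, data in list_split_tasks(zk).items():
            print(name, data)
    finally:
        zk.stop()
```

Calling main("<zookeeper-host>:<port>") from a node that can reach ZooKeeper prints one line per outstanding split task. If the splitlog znode does not exist, kazoo raises NoNodeError. Comparing this listing with the Tasks running section of the master UI may help confirm that the two views disagree.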

 

Solution

As a workaround, turn off the distributed log splitting feature on the cluster. With this feature disabled, the HBase master splits the WAL file entirely by itself, which avoids the inconsistency described above. To do this, on all HBase master nodes, edit /opt/mapr/hbase/hbase-<version>/conf/hbase-site.xml and set the following property:

<property>
  <name>hbase.master.distributed.log.splitting</name>
  <value>false</value>
</property>
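
Because the property has to be set on every HBase master node, it can be worth checking each copy of hbase-site.xml before restarting. The following is a minimal sketch using only the Python standard library. The function names are illustrative, and the path passed in is whatever /opt/mapr/hbase/hbase-<version>/conf/hbase-site.xml resolves to on the node. If the property is defined more than once, the sketch reports the last definition in the file.

```python
import xml.etree.ElementTree as ET

PROPERTY_NAME = "hbase.master.distributed.log.splitting"


def get_property(site_xml_path, name):
    """Return the stripped <value> text of the named property, or None if absent."""
    root = ET.parse(site_xml_path).getroot()
    found = None
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            # Keep overwriting so the last definition in the file wins.
            found = value.strip() if value is not None else None
    return found


def distributed_splitting_disabled(site_xml_path):
    """True only if the property is present and explicitly set to false."""
    value = get_property(site_xml_path, PROPERTY_NAME)
    return value is not None and value.lower() == "false"
```

Running distributed_splitting_disabled(path) on each master node should return True once the edit above is in place. A False result means the property is missing or still set to true in that file.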

 

Restart the active HBase master on the cluster. The HBase master web page then shows the status of the hlog splitting task.
The downside to disabling distributed log splitting is that a single node may take longer to process a very large commit log and complete the recovery than multiple nodes working in parallel would.
