How to avoid skew on reducer for "Group-By" queries in Hive

Document created by Hao Zhu Employee on Feb 18, 2016
Version 1Show Document
  • View in full screen mode

Author: Hao Zhu

Original Publication Date: May 5, 2015

Env:

Hive 0.13

Symptom:

A "Group-By" query has heavy skew on one reducer. For example, even if we set the number of reducers to 100 using the below commands, one reducer takes hours to finish while other reducers only take seconds or minutes to finish.

MRv1:

set mapred.reduce.tasks=100;

MRv2:

set mapreduce.job.reduces=100;

By looking at the MR job statistics from JobTracker or ResourceManager web UI, "REDUCE_INPUT_RECORDS" are shown high on that reducer compared to other reducers.

Root Cause:

By default Hive puts the data with the same group-by keys to the same reducer. If the distinct value of the group-by columns has data skew, one reducer may get most of the shuffled data. As a result that reducer takes a much longer time to finish compared to other reducers.

Solution:

At session level, set the below configuration in Hive CLI(for HS1) or Beeline(for HS2). This configuration causes Hive to trigger an additional MapReduce job whose map output will randomly distribute to the reducer to avoid data skew.

set hive.groupby.skewindata=true;

After setting it, the reducers' statistics for "REDUCE_INPUT_RECORDS" should show data is more evenly distributed to each reducer. It is suggested to set this configuration for the query which has group-by data skew, rather than setting it globally in /opt/mapr/hive/hive-<version>/conf/hive-site.xml.

Attachments

    Outcomes