Author: Hao Zhu
Original Publication Date: May 5, 2015
A "Group-By" query has heavy skew on one reducer. For example, even if we set the number of reducers to 100 using the below commands, one reducer takes hours to finish while other reducers only take seconds or minutes to finish.
By looking at the MR job statistics from JobTracker or ResourceManager web UI, "REDUCE_INPUT_RECORDS" are shown high on that reducer compared to other reducers.
By default Hive puts the data with the same group-by keys to the same reducer. If the distinct value of the group-by columns has data skew, one reducer may get most of the shuffled data. As a result that reducer takes a much longer time to finish compared to other reducers.
At session level, set the below configuration in Hive CLI(for HS1) or Beeline(for HS2). This configuration causes Hive to trigger an additional MapReduce job whose map output will randomly distribute to the reducer to avoid data skew.
After setting it, the reducers' statistics for "REDUCE_INPUT_RECORDS" should show data is more evenly distributed to each reducer. It is suggested to set this configuration for the query which has group-by data skew, rather than setting it globally in /opt/mapr/hive/hive-<version>/conf/hive-site.xml.