See all drill best practice FAQs.
Drill users frequently ask the Drill team how to deal with data skew.
Suppose you are joining 2 tables on a join key, and the join key happens to have data skew:
SELECT * FROM T1 INNER JOIN T2 ON T1.a1 = T2.a2
Depending on the estimated row count, Drill may do a hash-distribute on both inputs of the join - columns a1 and a2. Suppose these columns have only 5 distinct values and you have a 30 node cluster. In that case, only 5 nodes will be used to process the join. Let’s suppose T1 is a large fact table and is already well distributed across all nodes, and T2 is a dimension table. In such cases, it may be better to broadcast the dimension (smaller) table and utilize the entire cluster to process the join. There is no way to force a broadcast join, but you can hint the optimizer by increasing the broadcast threshold.
Retrieving data ...