See all drill best practice FAQs.
Drill performs hash based aggregation by default where the group-by keys are stored in a hash table in each minor fragment (execution thread). The number of unique groups and the data type of each group-by key determines how much memory is consumed by the hash table.
If your query has a large number of group-by keys, there is a high likelihood that the groups are all unique and will occupy more memory compared to the scenario where grouping collapses the rows into fewer groups. As a best practice, it is useful to examine the group-by columns to see whether (a) any one of the columns is a unique key -- clearly grouping by such a column is not very meaningful, or (b) a combination of them produces all unique groups. If you are grouping by several columns, you can run a SELECT COUNT(DISTINCT col) on each column to get a sense of the number of potential combinations.
A second aspect is the data type of the columns. For instance, long varchar columns in the group-by occupy more space in the hash table compared to numeric data types. Poorly written queries may sometimes contain long varchar fields in the group-by which may not be meaningful to the use case.
Note that Drill currently supports spilling to disk for sort based aggregation and soon to come, hash based aggregation. This should address the out-of-memory issues for group-by queries.
Retrieving data ...