How can I control spilling to disk in Drill?
Hash aggregation and hash join are hash-based operations; streaming aggregation and merge join are sort-based operations. Both kinds of operation consume memory. However, currently the hash-based operations do not spill to disk when memory runs low, while the sort-based operations do.
The parameter "planner.memory.max_query_memory_per_node" sets the maximum estimate of memory for a query per node. If the estimate is too low, Drill re-plans the query, excluding the memory-constrained operators from consideration. By default, the parameter is set to 2 GB per node.
Normally, if you have enough memory, you should try to avoid spilling to disk by increasing planner.memory.max_query_memory_per_node to a larger value.
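As a quick sketch, you can inspect the current value of this option through Drill's sys.options system table (the exact column names vary between Drill versions, so SELECT * is the safest form):

```sql
-- Show the current setting of the per-node query memory limit
SELECT *
FROM sys.options
WHERE name = 'planner.memory.max_query_memory_per_node';
```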
Are there any good metrics or guidelines for setting max_query_memory_per_node? I know the default drill-env.sh ships with 8 GB of direct memory and 4 GB of heap... what would be a good, solid amount if, say, I had 24 GB of heap and 84 GB of direct memory?
Is there any documentation on how to initially set and tune this for an enterprise cluster?
This is a good question. I would suggest keeping this parameter at its default as long as there is no performance concern for individual queries.
Normally we increase this parameter at the session level for an individual large query that requires a lot of disk spilling. The value can be set as high as the total amount of direct memory.
It also depends on how many instances of that large query run concurrently; we do not want to give all of the direct memory to just one query.
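A minimal sketch of this session-level approach, using Drill's ALTER SESSION syntax (the 8 GB figure below is an illustrative value, not a recommendation; pick a value based on your direct-memory budget and expected concurrency):

```sql
-- Raise the per-node query memory limit for this session only
-- (value is in bytes; 8589934592 bytes = 8 GB, chosen here for illustration)
ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 8589934592;

-- ... run the large, spill-heavy query here ...

-- Return the option to its system/default value afterwards
ALTER SESSION RESET `planner.memory.max_query_memory_per_node`;
```

Because the change is scoped to the session, other concurrent queries on the cluster keep operating under the default limit.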