Partitioning the data appropriately is a key strategy to achieve interactive performance on big data. Drill provides a very sophisticated optimization to recognize and leverage partitions at the query execution time. Here are few things to consider while determining the partitioned lay out of data.
First, determine the access patterns from either your existing SQL queries or the expected queries. Columns frequently used in the WHERE clause are good candidates for partition keys. Note that you can specify multiple partition keys for a given table.
Next, determine the number of partitions.
- The total number of partitions for a given table = Number of distinct values for partition key1 * Number of distinct values for partition key2 *..* Number of distinct values for partition keyN
- Ensure the total number of partitions is in the 1000s (or less). Note the number of partitions impact the query planning time. With a larger number of partitions, you can do better partition pruning, but at the cost of increased planning time.
- To optimize the time spent in partition pruning, you might need to reduce the number of partition keys if the number of partitions is excessive.