Apache Drill Best Practices from the MapR Drill Team

Document created by zfong Employee on Mar 23, 2016Last modified by maprcommunity on Jun 6, 2017
Version 71Show Document
  • View in full screen mode

Apache drill.jpeg

Data Layout and Deployment

What are best practices for Drill deployment? 

How do I calculate the MEDIAN of a column using Drill window functions? 

 

Schema Considerations

Does Drill have a limit on # of columns?

Are there datatypes in Drill that should be favored vs avoided?

 

Storage Formats

What is the preferred storage format for Drill?

 

Parquet Best Practices

What is the preferred compression type when using Parquet with Drill?

What is the recommended parquet block size (when running on MapR-FS) for Drill?

Can Parquet files created by other tools (e.g., Hive, Spark) be read by Drill?

Should I enable compression in both MFS and Drill?

In the case of Parquet, does Drill prefer a larger number of small files or a smaller number of large files? How do I get the best mileage of Drill parallelism by controlling the layout?

How do I determine if the MFS chunk size is the same as the Parquet block size? 

 

Data Partitioning

How can I partition data in Drill?

How do I determine the right partitioning strategy?

Are there situations where partition pruning cannot be done in Drill?

How do I know if partition pruning has been applied to my Drill query?

At what point does the size of my data dictate that I should partition it? 

How do I avoid CTAS creating too many small files? 

 

Query Tuning and Performance

 

LIMIT 0 Queries

How can I speed up my LIMIT 0 queries on Drill?

How do I verify that my Limit 0 Drill queries are benefiting from the optimized code paths?

 

INFORMATION_SCHEMA Queries

Why are my Drill queries on INFORMATION_SCHEMA slow?

I have enabled metadata caching using REFRESH METADATA in Drill, but why are queries on INFORMATION_SCHEMA, which contains Hive tables is still slow?

 

Query Planning

Why does my Drill query against a Parquet table take a long time to plan?

How often should I refresh my metadata cache in Drill?

Why are my queries not utilizing the level of parallelism that I would expect?

Partition Pruning

How do you enable debug logging for the partition pruning phase in Drill?

Why is partition pruning not happening if my query has an IN list filter with more than 20 elements?

How should I query my data to take advantage of partition pruning if the data wasn't created with Drill CTAS? 

 

 

Join Queries

In Drill, how do I choose between the two different join types -- Broadcast Join vs Distributed Join?

 

Aggregation Queries

When is single-phase aggregation preferred over multi-phase aggregation in Drill?

 

Data Skew

How do I handle data skew during a join operation in Drill?

How do I detect data skew inherent in my Drill data?

How do I detect data skew during Drill query processing?

 

Queries Requiring Sorts

My sort operator in Drill is running slowly, how can I improve the performance of the sort operator?

 

IN List Filtering

How do I improve the performance of a Drill query with many in-lists elements?

 

Hive Queries

I am running a Drill query against a Hive parquet table and hitting an OOM exception. How can I avoid this?

How can I enable native parquet reader in Drill to optimize queries on Hive parquet tables?

 

Misc

Why am I unable to access JSON objects within Parquet files in Drill?

What information should I gather if my Drill query is hanging?

When reading a JSON file with many starting null values in a field, I get a DATA_READ_ERROR.  How do I address this ?

Which storage plugin should I use to query MapR-DB tables?

How to do I use the REST API to submit queries to Drill? 

Do we allow both the Drill and Hive JDBC drivers to be loaded together in a single executable? 

I noticed 2 JDBC drivers available for MapR Drill. Which one should I use? 

How can I use the CTAS statement to incrementally insert data?  

What are best practices for managing Drill query profiles? 

Do you have a simple example of how to write a Drill UDF and how to enable Dynamic UDFs? 

 

 

Resource Usage

Why is my Drillbit disconnecting from ZooKeeper even though the node where the Drillbit is running is alive and there is minimal CPU activity on the node

How do I decrease parallelism within a Drill query?

Why is my Apache Drill query running out of memory while performing a HashJoin?

How can I control spilling to disk in Drill?

How do I configure Drill to use MapR local volume for spilling space?

How much disk space is needed when CTAS partition by spills to disk? 

How do I determine the degree of parallelism being used for various operators in my query?

Why is my group-by query running out of memory? 

 

Security

Does Drill impersonation work with Hive SQL based authorization?

"Permission Denied" error reading Parquet metadata cache file

Why am I not able to see MapR-DB JSON documents created by another user using Drill?

How do configure Drill to use a different tmp location if Drill will not start due to noexec mount permissions on /tmp?

How can I use views to provide row level security using the current_user function? 

 

Drill Integration with BI Tools

Are there guidelines for optimizing the integration of Drill with various BI tools?

 

Drill Deployment

What are common configuration steps when deploying Drill on MapR CDP? 

 

Drill Configuration

Top 5 Items to Configure with Drill on MapR 5.x 

 

Related Resources

 

Get notified when new Apache Drill best practice FAQs are added to the community

9 people found this helpful

Attachments

    Outcomes