Drill Best Practices for BI and Analytical Tools

Blog post created by aengelbrecht on Jan 25, 2017

Below are a number of best practices to optimize the overall experience for analysts and other users of Business Intelligence (BI) and analytical tools with Drill.

 

Metadata with BI Tools and Drill

 

It is important to keep in mind that many BI and analytical tools were designed for relational databases, and as such they typically issue numerous metadata requests. Drill presents all available metadata in INFORMATION_SCHEMA, which contains information for all enabled Drill storage plugins. For more information on querying INFORMATION_SCHEMA, see

https://drill.apache.org/docs/querying-the-information-schema/
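
For example, the following queries list the available schemas and the tables within a particular schema. The schema name hive.default is illustrative and will differ per environment.

SELECT * FROM INFORMATION_SCHEMA.`SCHEMATA`;

SELECT TABLE_NAME, TABLE_TYPE FROM INFORMATION_SCHEMA.`TABLES` WHERE TABLE_SCHEMA = 'hive.default';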

 

Storage Plugin Configuration

 

Validating the configuration of all enabled Drill storage plugins is the first step toward optimizing performance with BI tools. Incorrectly configured storage plugins can cause various tools to fail or time out when querying metadata in INFORMATION_SCHEMA.

 

Start by disabling all storage plugins that are not being used; storage plugins (particularly HBase and Hive) that are enabled but incorrectly configured have been observed to cause significant delays on metadata queries. Then verify the configuration of ALL storage plugins that remain enabled. For more information on Drill storage plugin configuration, see

https://drill.apache.org/docs/connect-a-data-source/

 

A quick validation that all storage plugins are working correctly can be performed by accessing Drill from a CLI (such as SQLLine), or using Drill Explorer, and executing the following queries.

- SHOW SCHEMAS;

- SELECT * FROM INFORMATION_SCHEMA.`TABLES`;

 

Each of these queries should return a result relatively quickly. (Note: If the Drill cluster was just started, it may take a while to gather all the storage plugin information.)

 

Exclude Schemas

 

Most BI tools use either the JDBC or ODBC interface to connect to Drill, which allows for another metadata performance optimization. Both the MapR JDBC and ODBC drivers allow Drill schemas to be excluded, which prevents the BI tool from querying the metadata of the excluded schemas.

 

This is especially useful when working with Hive as a Drill data source, where not all Hive databases are used by the BI tool, and those databases may contain unsupported data types or an extremely large number of objects (thousands).

 

For the MapR JDBC driver, simply add an ExcludedSchemas property to the end of the JDBC URL, separating the schemas to exclude with commas (,).

Example

jdbc:drill:zk=drilldemo:5181/drill/drilldemo-drillbits;ExcludedSchemas=hive.default,sys;

 

For more information on the MapR JDBC driver configuration see

http://maprdocs.mapr.com/51/#Drill/JDBC-Driver-Drill.html

 

 

The same exclusion can be applied with the ODBC driver by setting ExcludedSchemas in the Advanced Properties of the ODBC driver configuration. For more information, see

https://drill.apache.org/docs/odbc-configuration-reference/
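
As a sketch, on Linux the exclusion is appended to the AdvancedProperties value of the DSN entry in odbc.ini. The surrounding property names follow the driver's configuration reference, and the schema list here is illustrative; consult the linked page for the full default property string.

AdvancedProperties=HandshakeTimeout=5;QueryTimeout=180;ExcludedSchemas=hive.default,sys;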

 

Metadata Caching

 

Drill provides metadata caching to speed up query planning and metadata requests, which in turn improves the integration experience with BI tools. Leveraging this feature is highly recommended for environments with large schemas and datasets.

 

Hive Metadata Caching

 

For environments with large Hive schemas, it is recommended to configure Drill's Hive metadata caching feature, which allows Drill to cache Hive metadata for faster access. This improves both metadata operations for BI tools and general query performance. For more information on configuring Hive metadata caching in Drill, see

https://drill.apache.org/docs/hive-metadata-caching/
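
As a sketch, the cache behavior is controlled through properties in the Hive storage plugin configuration (property names as documented on the linked page); the metastore host and TTL value below are illustrative.

{
  "type": "hive",
  "configProps": {
    "hive.metastore.uris": "thrift://<metastore-host>:9083",
    "hive.metastore.cache-ttl-seconds": "120",
    "hive.metastore.cache-expire-after": "access"
  }
}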

 

Parquet Metadata Caching

 

Parquet metadata caching can be enabled for Parquet data located on the file system. This can drastically improve query planning time for large datasets with many files. It may not directly help with all BI tool metadata operations, but it can improve overall query performance and the metadata requests of some BI tools. For more information on enabling Parquet metadata caching, see

https://drill.apache.org/docs/optimizing-parquet-metadata-reading/
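
The cache is generated with the REFRESH TABLE METADATA command, run once against the Parquet table or directory; the path below is illustrative.

REFRESH TABLE METADATA dfs.`/data/stock_quotes_parquet`;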

 

Drill Views

 

Drill views are very powerful: they enable complex data structures to be represented in a tabular format for easier consumption by BI tools, perform conversions, and provide security on data sources. Drill works with various data sources and can handle schema on the fly; however, many complex data sources and workspaces cannot be fully described in Drill's INFORMATION_SCHEMA. This means many BI tools will not be able to effectively access these data sources, such as datasets on the file system or HBase/MapR-DB tables. To give BI tools access to these data sources, create Drill views, which are then exposed through INFORMATION_SCHEMA and available to BI tools.

 

To optimize the integration of views with BI tools, it is best to fully describe the output of a view by naming all of its columns and specifying the data type for each column. This provides clear column information to BI tool metadata operations and improves their performance. Below is a simple example of creating a view that provides column name and data type information.

 

CREATE OR REPLACE VIEW dfs.views.stock_quotes AS
SELECT CAST(columns[0] AS VARCHAR(6)) AS symbol,
       CAST(columns[1] AS VARCHAR(20)) AS `name`,
       CAST(TO_DATE(columns[2], 'MM/dd/yyyy') AS DATE) AS `date`,
       CAST(columns[3] AS FLOAT) AS trade_price,
       CAST(columns[4] AS INT) AS trade_volume
FROM dfs.csv.`/stock_quotes`;
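
Once created, the view can be queried like any table, and its fully typed columns become visible to BI tools through INFORMATION_SCHEMA:

SELECT * FROM dfs.views.stock_quotes LIMIT 10;

SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.`COLUMNS` WHERE TABLE_NAME = 'stock_quotes';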

 

Optimize Data Sources for Reporting

 

Optimize data that is frequently accessed by BI tools for reporting or dashboards. Drill typically performs best when data is stored in Parquet format. Partitioning the data can also improve performance. For more information on Parquet best practices and data partitioning, see

https://community.mapr.com/docs/DOC-1497

 

In addition, consider creating aggregate tables for common reporting purposes where feasible. This can greatly reduce the work required for common reporting queries and free up resources for other Drill queries.
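
As a sketch, an aggregate table can be created with CTAS in Parquet format, optionally partitioned on a reporting column. The dfs.reports workspace and table name below are illustrative, and the query assumes the stock_quotes view from the earlier example.

CREATE TABLE dfs.reports.`daily_volume`
PARTITION BY (`date`)
AS SELECT `date`, symbol, SUM(trade_volume) AS total_volume
FROM dfs.views.stock_quotes
GROUP BY `date`, symbol;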
