In Case You Missed My Release Pitch
We just released Apache Drill 1.12 on MapR 6.0 as part of MEP 4.1 (MapR Expansion Pack). Continuing with the Drill 1.11 theme that I outlined in my previous post in late November, we have made further improvements in this release.
Here are the highlights:
- Exploratory queries (those not requiring any filters) on operational data in JSON tables in MapR-DB can now leverage secondary indexes to run faster.
- Exploratory queries on Parquet files in the MapR file system (MapR-XD) have improved by at least 2x.
- Several contributions from the open source community, including UDFs for facilitating network analysis and usability improvements.
- 140+ bug fixes that improve quality overall.
Data Exploration on Operational Data on JSON Tables in MapR-DB and Historical Data on Parquet in MapR-XD
One of the key features of MapR-DB and MapR-XD is that they allow data scientists to reuse the same data for advanced analytics, such as machine learning, AI, or predictive analytics, without the need to export the data. Critical to designing new algorithms is prototyping, where the focus is on exploring the data while running experiments. In MapR 6.0, we launched a new product called the MapR Data Science Refinery, an easy-to-deploy and scalable data science toolkit with native access to all platform assets and superior out-of-the-box security. To enable data exploration with Drill while prototyping algorithms, data scientists can use the same notebook in Apache Zeppelin to run in-place ad hoc SQL queries (as shown in Figure 1) and visualize the results.
From a technical standpoint, we enhanced the performance of exploratory queries on JSON tables in MapR-DB by:
- Enhancing the query planner to:
  - Leverage secondary indexes on queries lacking filters (i.e., without an explicit WHERE clause).
  - Use the sortedness of the data in the index to avoid costly sorting operations.
- Improving performance for queries that request a subset of the results (i.e., using a LIMIT clause) through index pushdown and a reduction in the amount of data scanned.
With this feature, Drill can leverage secondary indexes on JSON tables in MapR-DB to improve performance for exploratory queries (that require no filters) and highly selective queries (that have filters) that require sorting, aggregation, and joins.
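The effect of the planner changes can be illustrated with a toy model. The following is a minimal sketch in Python, not Drill's planner code: a sorted list of (key, row id) pairs stands in for a secondary index, the function names are hypothetical, and the four rows are made-up values that only borrow TPCH column names.

```python
from itertools import islice

# Toy primary table: row_id -> document (stands in for a JSON table).
TABLE = {
    1: {"o_orderkey": 1, "o_totalprice": 300.0},
    2: {"o_orderkey": 2, "o_totalprice": 100.0},
    3: {"o_orderkey": 3, "o_totalprice": 200.0},
    4: {"o_orderkey": 4, "o_totalprice": 50.0},
}

# Toy secondary index on o_totalprice: (key, row_id) pairs kept in key order.
PRICE_INDEX = sorted((doc["o_totalprice"], rid) for rid, doc in TABLE.items())


def order_by_full_scan(limit):
    """Baseline plan: read every row, sort all of them, then apply the LIMIT."""
    rows = sorted(TABLE.values(), key=lambda d: d["o_totalprice"])
    return rows[:limit], len(TABLE)  # (result rows, rows read)


def order_by_index(limit):
    """Index plan: walk the index in key order and stop after `limit` entries."""
    entries = list(islice(PRICE_INDEX, limit))
    return [TABLE[rid] for _, rid in entries], len(entries)


full_rows, full_read = order_by_full_scan(2)
index_rows, index_read = order_by_index(2)
```

Both plans return the same two rows for an ORDER BY o_totalprice LIMIT 2 style query, but the index plan performs no sort and reads only as many entries as the LIMIT asks for.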
Figure 1: Sample of modified TPCH exploratory SQL queries on JSON tables in MapR-DB that would benefit from the performance feature.
Parquet, the columnar file format, is considered a standard among our customers for historical analytics on the MapR Platform. To improve Drill's performance on Parquet, we conducted an investigation last year into the scanner itself that revealed the following:
- An opportunity to directly move the data from the file into direct memory via the heap. Since the data is moved in 4KB chunks, Drill could leverage CPU L1 cache and avoid touching the heap, which hurts performance.
- Vector processing with Java intrinsics (a native function implementation) could give a performance boost. Our tests showed that this improvement was on the order of 2x.
- Implicit column optimization: Implicit columns carry metadata about the rows in a batch that Drill processes from the Parquet file, such as the file path and name for each row. Testing revealed as much as 20% overhead in carrying this metadata, which was identical for many rows. We eliminated these duplicates and now represent them with a single value.
- Implicit data type optimization: Drill detects patterns in the Parquet metadata so that, even if the metadata declares a column as variable-length when the data is actually fixed-length, Drill overrides the declaration and treats the column as fixed-length to leverage JVM optimizations.
Our tests, shown below, indicate that we are able to get about a 2-4x improvement in scan performance on Parquet files. The performance gain is most pronounced for SELECT * queries that need to scan the entire table.
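The implicit column optimization described above can be pictured as collapsing a column of identical values into a single stored value. This is a minimal sketch of that idea in Python, not Drill's value vector code; the class name, function name, batch size, and file path are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class ConstantColumn:
    """A column whose value is the same for every row in the batch."""
    value: Any
    row_count: int

    def get(self, row: int) -> Any:
        if not 0 <= row < self.row_count:
            raise IndexError(row)
        return self.value


def build_implicit_column(values: List[Any]):
    """Collapse a column to one stored value when all rows agree."""
    if values and all(v == values[0] for v in values):
        return ConstantColumn(values[0], len(values))
    return list(values)  # mixed values: keep the per-row representation


# Every row in this batch came from the same (hypothetical) file.
paths = ["/data/lineitem/part-0.parquet"] * 4096
column = build_implicit_column(paths)
```

Readers of the column still ask for a value per row, but the batch stores the path once instead of 4096 times.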
Test Results of Running Exploratory Queries on Operational Data in MapR-DB JSON Tables
This was the test setup that we put together to see how the combination of the above performance optimizations would benefit the sample queries:
- Cluster setup:
  - 10 data nodes, each with 12 SSDs of 0.5TB each, 20 cores, and 256GB RAM
  - 10 Drill nodes (Drillbits) collocated with the data nodes
- MapR Converged Data Platform configuration:
  - MapR File System (MapR-XD): 4 instances per node, 1 CLDB
  - 4 storage pools per node
  - 2 RPC connections between MapR file system nodes
  - MapR-DB: 4GB tablet size for primary and index tables
- Drill configuration:
  - planner.memory.max_query_memory_per_node = 4GB
  - planner.width.max_per_node = 14, the default value, which is 70% of the 20 cores
- Data set:
  - TPCH with SF1000 (1TB)
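The planner.width.max_per_node value in this setup follows from simple arithmetic: 70% of 20 cores is 14. The sketch below reproduces that calculation in Python. The function name is hypothetical, and the rounding rule is an assumption, since the text only states the 70% figure.

```python
def default_width_per_node(cores: int, cpu_fraction: float = 0.70) -> int:
    """Per-node parallelism as a fraction of the cores (rounding is assumed)."""
    return round(cores * cpu_fraction)


NODES = 10  # Drillbits in the test cluster
width = default_width_per_node(20)  # 20 cores per node in this setup
cluster_width = NODES * width  # upper bound implied by the per-node setting
```

With 10 Drillbits, the per-node setting of 14 implies up to 140 parallel threads across the cluster for a given stage of a query, ignoring any other caps that may apply.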
The results are shown in Figure 2. All the queries show significant improvement in performance except for two queries that require retrieving the maximum values. Such a query would require scanning the entire index table to ensure that the maximum value was identified.
Figure 2: Performance test of a sample of modified TPCH exploratory SQL queries on JSON tables in MapR-DB.
Test Results of Running Exploratory Queries on Historical Data on Parquet Files in MapR-XD
This was the test setup that we put together to see how the combination of Parquet scanner optimizations would impact the performance of a SELECT * type of query:
- Cluster setup:
  - 10 data nodes, each with 23TB of HDD storage, 32 cores, and 256GB RAM
  - 10 Drill processing nodes (Drillbits) collocated with the data nodes
- MapR Converged Data Platform configuration:
  - MapR File System (MapR-XD): 1 instance per node, 1 CLDB
- Drill configuration:
  - planner.memory.max_query_memory_per_node = 8GB, 4 times the default of 2GB, to avoid any spill-to-disk scenarios
  - planner.width.max_per_node = 1; parallelization was reduced to measure the impact on a single scan thread more accurately
- Data set:
  - TPCH SF100, Parquet, Snappy-compressed
The results of the testing are shown in Figure 3. We measured the performance gain of an individual scan fragment across all fragments for multiple runs of a query and observed a 2x improvement. However, as predicted, the overall query performance gain (30% in Figure 3) depends on other factors such as filter complexity, aggregations, joins, or sorting operations.
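One way to see why a 2x scan speedup translates into a smaller end-to-end gain is Amdahl's law: only the share of the runtime spent scanning gets faster. The sketch below is illustrative only. The scan shares are hypothetical values, not measurements from this test, and the function name is made up.

```python
def overall_speedup(scan_fraction: float, scan_speedup: float) -> float:
    """Amdahl's law: only the scan share of the runtime gets faster."""
    if not 0.0 <= scan_fraction <= 1.0:
        raise ValueError("scan_fraction must be between 0 and 1")
    if scan_speedup <= 0:
        raise ValueError("scan_speedup must be positive")
    return 1.0 / ((1.0 - scan_fraction) + scan_fraction / scan_speedup)


# Hypothetical scan shares, each combined with a 2x scan speedup.
for share in (0.25, 0.50, 0.75, 1.00):
    print(f"scan share {share:.0%}: overall {overall_speedup(share, 2.0):.2f}x")
```

Under this model, a query that spends half of its time scanning speeds up by about 1.33x overall, and the full 2x only appears when scanning is the entire workload.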
Figure 3: Performance test of an exploratory query on Parquet in MapR-XD.
Wild Card Text Search Performance on Parquet Files
Unknown to many customers, Drill, much like standard SQL, has the ability to search text (see Figure 4) within a document as part of a filter. A "regular expression," specified as a grammar in the filter, can help detect text patterns. We introduced several improvements to this search in Drill 1.11 but tested them in this release. Prior to Drill 1.11, Drill used the Java Regular Expressions library for pattern matching. For each record, the library required the data to be copied from direct memory (the area of memory that Drill controls) into the heap (which is managed by the garbage collector and is not under Drill's control) before the regular expression was evaluated, which hurt performance.
To improve upon this feature, we did the following:
- First, we introduced character-based inspection for commonly seen patterns in direct memory itself.
- Second, we introduced optimizations to the ‘contains’ for-loop and minimized comparisons.
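The general idea behind these two changes can be sketched as follows: recognize simple wildcard patterns up front, then compare bytes in place rather than copying each value and running a general regular expression. This is a minimal Python illustration, not Drill's implementation. The function names are hypothetical, a memoryview over a bytearray stands in for direct memory, and escape characters are ignored.

```python
def classify(pattern: str):
    """Map a LIKE pattern to (kind, literal); anything else is 'complex'."""
    body = pattern.strip("%")
    if "%" in body or "_" in body:
        return "complex", pattern  # would need a general matcher
    starts, ends = pattern.startswith("%"), pattern.endswith("%")
    if starts and ends and len(pattern) >= 2:
        return "contains", body
    if ends:
        return "starts_with", body
    if starts:
        return "ends_with", body
    return "constant", body


def matches(kind: str, literal: bytes, value: memoryview) -> bool:
    """Compare bytes in place; slicing a memoryview does not copy the data."""
    n, m = len(literal), len(value)
    if kind == "constant":
        return value == literal
    if kind == "starts_with":
        return n <= m and value[:n] == literal
    if kind == "ends_with":
        return n <= m and value[m - n:] == literal
    if kind == "contains":
        # The 'contains' loop: slide a window of len(literal) over the value.
        return any(value[i:i + n] == literal for i in range(m - n + 1))
    raise ValueError(f"unsupported pattern kind: {kind}")


buffer = bytearray(b"special requests")  # stands in for direct memory
view = memoryview(buffer)
kind, literal = classify("%special%")
found = matches(kind, literal.encode(), view)
```

Patterns with wildcards in the middle fall into the "complex" bucket here, which mirrors the observation that queries with more % wildcards benefit less from the fast path.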
We carried out the tests on the same cluster used for the exploratory queries on MapR-DB JSON tables described above. The queries ran against the TPCH dataset with a scale factor of 1000, stored as Parquet. The results are shown in Figure 4. We see an increase in performance for regular expressions that had one wildcard per word. As more % wildcards (i.e., any text) were present in the query, a full text scan had to be done, which hurt performance. Improving this case is part of our roadmap for the next phase of this project.
Figure 4: Performance test of wildcard text search queries on Parquet in MapR-XD.
Community Contribution Highlights
I am happy to report that the Apache Drill community has ramped up its activity in the last several months. In September of last year, we organized a Drill Developer Day that attracted users and developers from around the Bay Area. I thought it would be worthwhile to highlight some of the contributions, as these are available in the current release as well. Note that we have not subjected these features to our internal testing and hence do not support them. But that should not deter you from trying them out and suggesting improvements through the dev and user mailing lists.
Here are the highlights:
- New plugins: Kafka storage plugin, OpenTSDB plugin.
- Graceful shutdown of Drillbits: Shut down Drillbits so they can be reused for something else without disrupting the service.
- A collection of networking functions that facilitate network analysis using Drill (DRILL-5834).
- Geometry functions, ST_AsGeoJSON and ST_AsJSON, that return GeoJSON and JSON representations.
- Filter pushdown for Parquet with multiple row groups to improve performance.
- IF NOT EXISTS support for CREATE TABLE and CREATE VIEW: Previously, if a table with the same name already existed, the statement would error out.
- Syntax highlighting and checks during storage plugin configuration: Users can get feedback about the storage plugin information they submit. Prior to Drill 1.12, all storage plugins were initialized for every query. This had two disadvantages: a performance hit and a query failure if any of the storage plugins failed to initialize due to incorrect information. In this release, we have introduced a feature that allows initialization of only the essential plugins required to run a query. Both the above features should improve the user experience overall.
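The on-demand plugin initialization mentioned in the last item can be pictured as a registry that builds a plugin only when a query first asks for it. This is a minimal sketch of that pattern in Python, not Drill's storage plugin registry; the class, the plugin names, and the failing factory are hypothetical.

```python
class LazyPluginRegistry:
    """Creates each plugin on first use instead of all of them up front."""

    def __init__(self, factories):
        self._factories = dict(factories)  # name -> zero-argument constructor
        self._plugins = {}                 # plugins initialized so far

    def get(self, name):
        if name not in self._plugins:
            self._plugins[name] = self._factories[name]()
        return self._plugins[name]

    def initialized(self):
        return sorted(self._plugins)


def broken_plugin():
    # Simulates a plugin whose configuration is wrong.
    raise RuntimeError("bad connection string")


registry = LazyPluginRegistry({"dfs": lambda: "dfs-plugin", "kafka": broken_plugin})
plugin = registry.get("dfs")  # a query that touches only "dfs"
```

A query that uses only the "dfs" plugin never triggers the misconfigured "kafka" plugin, so it neither pays that plugin's startup cost nor fails because of it.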
It's Your Turn to Try