
Apache Drill 1.11 on MapR 6.0: Release Highlights

Blog Post created by smopsy on Nov 29, 2017

In Case You Missed My Release Pitch

Apache Drill 1.11 on MapR 6.0 was released with new enterprise capabilities for a faster, more secure, and more robust interactive BI analytics experience. Drill now leverages the new secondary index technology on semi-structured operational data stored in MapR-DB, the database for the world’s most data-intensive applications, to speed up analytics, deliver insights, and drive better decisions. New security features ensure that sensitive data is protected as it is accessed, processed, and delivered to end users. Several enhancements improve the handling of analytic workloads running on the system, including spilling data-intensive queries to disk and managing queries through queues.

 

Some Introduction is a Good Thing

The current trend in big data analytics marketing dictates that every product release, such as this one, warrants that the product manager (yours truly!) make a big splash in a blog post, promoting the greatest features the team has shipped. Performance benchmark numbers must be published from carefully orchestrated setups that make the query engine look like the North Star. Self-service? Fast ETL? Sub-second response times faster than the speed of thought? No problem. They've got it all.

 

Big data marketing myths are far from enterprise reality. I am not going to do any of that. Instead, I want to talk about the Apache Drill release on MapR in the context of the problems we see our enterprise customers facing on their big data journey toward enterprise-wide analytics access. Democratization of access and self-service analytics continue to be the cornerstones of the current BI wave. From that perspective, there are three big challenges we see:

  1. The analytics market shift continues: Traditional warehousing solutions remain heavily optimized for smaller scales, ranging into the hundreds of terabytes, yet still strain the IT budget to scale. Many of these vendors, seeing the inevitable market shift, are still harvesting the market by sticking to their pricing. This, among other factors, is leading to increased displacement of these tools from use cases where they are overkill. Scheduled reports are a classic example, where a low-cost big data solution can be designed to deliver the same customer-relevant SLAs. Over the next several years, I expect query engines like Apache Drill to mature and displace more enterprise-critical use cases from these traditional systems, with the difference that they will do it at massive scale and low cost.
  2. Enterprise-grade needs: Customer expectations around the enterprise features that traditional solutions have provided are non-negotiable. These needs include reliability, performance, and security at scale, at both the query and data layers. I mention the data layer as well because there is no point building an MPP query engine without tight integration with an industry-grade data platform.
  3. Analytics on data as-it-happens: Analysts do not wish to wait for a slow and expensive ETL process to kick in to make the operational data available for analytics in the data mart or the warehouse. 

None of these problems are easy to solve, especially when you are designing an industry-grade solution at massive scale. 

 

MapR Analytics Approach

One of our core design beliefs is that a query engine can never be designed in isolation; it must be tightly integrated with the underlying data layer. It should come as no surprise that we continue to invest time and effort in strengthening the core data platform that we have built from the ground up. With reliability and performance at scale as our end goal, we have built an industry-grade distributed file system (MapR-XD), a NoSQL database for data-intensive operational applications (MapR-DB), and real-time streams at IoT scale (MapR-ES). To simplify operation, management, and deployment, these technologies are unified within a single software layer, yielding convergence (i.e., data can be streamed directly into the database and stored as files, all within a single system). Such a design removes the need to maintain these as separate systems and manage their combined complexity.

 

For several years now, MapR has contributed to Apache Drill, an open source MPP query engine project. What is truly unique about Apache Drill is its ability to discover the schema of the data on the fly. This capability matters because the amount of unstructured data stored in the enterprise continues to increase exponentially. In my opinion, most MPP query engines currently have a short-term focus on solving the structured data problem, reminiscent of the relational world. When the day comes, they will have to contend with querying all this unstructured data in place, without complete knowledge of its underlying schema, and in real time. From the MapR perspective, we believe that Apache Drill better prepares the world for this impending data explosion.
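Drill's schema discovery happens inside the engine as records are read, so this is purely a toy illustration (all names hypothetical) of what "schema on the fly" means: the schema is derived from the data at read time rather than declared up front, and it can evolve as new fields or types appear mid-stream.

```python
import json

def discover_schema(records):
    """Merge the field names and value types observed across JSON records.

    A toy sketch of schema-on-read: no table definition exists in advance;
    the schema is whatever the data turns out to contain.
    """
    schema = {}
    for rec in records:
        for field, value in rec.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in schema.items()}

raw = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "city": "SF"}',        # a new field appears mid-stream
    '{"name": "carol", "age": "unknown"}',  # the same field changes type
]
print(discover_schema(json.loads(r) for r in raw))
```

A relational engine would reject the second and third records outright; a schema-on-read engine simply widens its view of the data.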

 

So What Challenges Have We Solved?

To address the above challenges we see in the enterprise, the Apache Drill 1.11 integration on MapR has the following capabilities:

1. Operational analytics access: Apache Drill already had an integration with MapR-DB (a row-based NoSQL database), but queries containing filter conditions on columns that were not part of the primary key were understandably slow: without secondary indexes on those columns, every such query required a full table scan. In this release, we leverage the native secondary indexes introduced in MapR-DB to speed up highly selective queries by a factor of at least 10x.
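MapR-DB's secondary indexes are implemented inside the database, but the reason they help is generic. The toy sketch below (hypothetical table and column names) contrasts a full scan, which must touch every row, with a lookup on a sorted secondary index, which touches only the matching rows:

```python
import bisect

# Hypothetical table: primary key "id", with a filter on the
# non-primary-key column "status".
rows = [{"id": i, "status": "error" if i % 1000 == 0 else "ok"}
        for i in range(10_000)]

def full_scan(rows, status):
    """Without an index, every row must be examined."""
    return [r["id"] for r in rows if r["status"] == status]

# A secondary index: the indexed column's values kept sorted
# alongside the row ids they point back to.
index = sorted((r["status"], r["id"]) for r in rows)

def index_lookup(index, status):
    """With the index, jump straight to the contiguous run of matches."""
    lo = bisect.bisect_left(index, (status,))
    hi = bisect.bisect_right(index, (status, float("inf")))
    return [row_id for _, row_id in index[lo:hi]]

# Both return the same ids, but the lookup only touches matching entries.
assert sorted(full_scan(rows, "error")) == sorted(index_lookup(index, "error"))
```

The gap widens as the table grows and the filter grows more selective, which is exactly the regime where the release notes claim the 10x speedup.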

Conceptual diagram, showing how Apache Drill integration on MapR enables operational and historical data analytics

 

The performance benefits can be substantial at scale, at a fraction of the cost of a traditional warehousing system! Analysts no longer have to wait for operational data collected from transaction systems to be ETL-ed into historical data stored in columnar Parquet format. Besides, columnar formats are ill-suited for highly selective queries. We ran a simple test rig that confirmed these results, as shown below.

Results from a simple test verify that secondary indexes speed up Drill queries on MapR-DB JSON, especially for highly selective queries, i.e., those with a low selectivity (%) value.
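The selectivity terminology can trip people up, so to pin it down: throughout this post, selectivity (%) is the percentage of rows that survive the filter, and a "highly selective" query is one with a low percentage. A one-liner makes the convention explicit (numbers illustrative):

```python
def selectivity_pct(matching_rows: int, total_rows: int) -> float:
    """Filter selectivity as the percentage of rows that pass the filter.
    A "highly selective" query has a LOW value here."""
    return 100.0 * matching_rows / total_rows

# e.g., 1,000 matching rows out of 10 million:
print(selectivity_pct(1_000, 10_000_000))  # -> 0.01
```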

 

We also carried out a concurrency test with different numbers of concurrent users (each user is a query stream) sending simple queries into the same system on a TPC-DS dataset. The queries were fired in batches, with the next query executing as soon as a query slot became available. Across query types, filter selectivities, and available capacity, we observe that highly selective queries increase system throughput.

Simple concurrency test results showing query throughput and response times as the query type, selectivity, and number of users are varied.
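A back-of-the-envelope way to see why selective queries raise throughput in a closed test like this one is Little's law: with N user streams, each firing its next query as soon as the previous one returns, steady-state throughput is roughly N divided by the average response time. The timings below are made up for illustration and are not taken from our test results:

```python
def closed_system_throughput(users: int, avg_response_s: float) -> float:
    """Little's law for a closed loop: each of `users` streams issues its
    next query as soon as the previous one completes, so
    throughput (queries/sec) ~= users / average response time."""
    return users / avg_response_s

# Illustrative numbers only: a selective, index-assisted query vs. a scan.
print(closed_system_throughput(10, 0.5))  # selective query -> 20.0 q/s
print(closed_system_throughput(10, 5.0))  # full-scan query -> 2.0 q/s
```

Cutting the average response time by 10x (as a secondary index can for selective filters) raises the sustainable throughput by the same factor, until some other resource saturates.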

 

As stated earlier, our vision at MapR is to leverage Apache Drill as a unified SQL access layer across files, tables, and streams. We have made significant strides in enabling this capability on files and, with this release, are expanding it to tables. Our near-term focus is to introduce streaming analytics on the platform. For the curious, we are beginning to experiment with streaming SQL analytics in the open source community by building an experimental Kafka plugin. Even though KSQL from Confluent has a developer preview out, we are not convinced that its semantics and architectural considerations have been well thought out, and this warrants deeper discussion on our end.

 

2. Enterprise-grade query engine capabilities: Security and resource management are the primary concerns.

  a. Security: One of the enterprise security requirements we see for analytics is the protection of sensitive data as it is accessed, processed, and delivered to analysts. We delivered the first set of these features in Apache Drill 1.10 and have completed the rest in this release. In brief, the security features include multiple authentication mechanisms (PAM, SSL, Kerberos, MapR Security) and associated state-of-the-art encryption, all implemented through MapR SASL. ZooKeeper holds key information used by Drill for discovery and coordination of cluster nodes; in this release, that ZooKeeper information remains viewable but is protected through ACLs (Access Control Lists). In a future release, we are looking to secure the entire cluster with full authentication and encryption support on all paths leading in and out of ZooKeeper.

Authentication options on various communication paths within the Drill architecture

 

  b. Improved resource management: Apache Drill's initial architecture and design, much like Google's Dremel, was based on optimistic execution semantics that assumed ample memory and CPU cores were available to execute queries. Such an assumption is justified at Google scale, where clusters can run into thousands of nodes, or when queries operate on smaller workloads; if a query failed, it was perfectly okay to re-run it. In the enterprise, resources are limited and workloads vary widely, so the expectation is that the query engine be pessimistic, especially under resource contention. In this release, Apache Drill has introduced more pessimistic execution semantics.

 

Apache Drill now supports spilling to disk for two memory-intensive operators that appear in most BI/SQL queries: aggregation and sorting. Queries containing these operators will slow down rather than fail. We have tested the functionality on MapR-XD and found that queries that would normally fail under limited-memory conditions now complete, although, as expected, they do suffer a deterioration in performance. Over the next several releases, the open source community plans to extend this capability to other operators, such as join, so that queries as a whole can spill to disk when the need arises.
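By default, spills go to local disk; on MapR you would typically point them at MapR-XD instead. The snippet below reflects my reading of the Drill 1.11 boot options for the sort and hash-aggregate spill locations, set in drill-override.conf; treat it as a sketch and confirm the option names against the documentation for your release:

```hocon
drill.exec: {
  sort.external.spill: {
    fs: "maprfs:///",
    directories: [ "/drill/spill" ]
  },
  hashagg.spill: {
    fs: "maprfs:///",
    directories: [ "/drill/spill" ]
  }
}
```

Listing several directories spreads spill I/O across devices; the file system URI is what lets the spill land on MapR-XD rather than the local disk.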

 

We also support the ZooKeeper-based queueing feature to better manage the concurrency of the system: a setting determines the number of queries that can run concurrently in the Drill cluster at any point in time, while all other queries wait. This Drill cluster concurrency should not be confused with user concurrency, which is typically defined as the number of concurrent queries being sent by all users through some front-end client (e.g., Tableau, the Web UI, or the REST API); that number includes both queries being processed and queries waiting to be processed by the Drill cluster. The waiting time adds to the response time every user observes, and it depends in turn on how quickly the cluster works through the queues.

 

The feature first appeared in open source in November of last year, but we have since tested it and added new capabilities. The key idea behind queues is that you can tune the system to let light workload queries (typically interactive) run more frequently than heavy workload queries (typically reporting or batch data-intensive queries) that can burden the system, leading to longer response times or even failures.
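The queueing behavior is controlled through Drill system options. To the best of my recollection of the 1.11 option names (verify against the Drill documentation for your release; the values here are purely illustrative):

```sql
ALTER SYSTEM SET `exec.queue.enable` = true;            -- turn on query queueing
ALTER SYSTEM SET `exec.queue.small` = 5;                -- concurrent slots for light queries
ALTER SYSTEM SET `exec.queue.large` = 1;                -- concurrent slots for heavy queries
ALTER SYSTEM SET `exec.queue.threshold` = 30000000;     -- planner cost separating small from large
ALTER SYSTEM SET `exec.queue.timeout_millis` = 300000;  -- how long a query may wait in queue
```

With small = 5 and large = 1, at most five light and one heavy query execute at any moment, which is the 5:1 admission described next.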

For every 5 queries executing from the small queue, only 1 query from the large queue is allowed to execute. All other queries wait in FIFO order and are routed to the appropriate queue when a slot opens.
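The admission step described above can be sketched in a few lines. This is a toy model only: the slot counts and cost threshold are illustrative, and real Drill also frees slots as running queries finish:

```python
from collections import deque

THRESHOLD = 30_000_000        # illustrative planner-cost cutoff
SLOTS = {"small": 5, "large": 1}

def enqueue(costs):
    """Toy queue admission: route each query by its cost estimate, start it
    if a slot in its queue is free, otherwise park it in FIFO order."""
    running = {"small": [], "large": []}
    waiting = {"small": deque(), "large": deque()}
    for qid, cost in enumerate(costs):
        queue = "small" if cost < THRESHOLD else "large"
        if len(running[queue]) < SLOTS[queue]:
            running[queue].append(qid)
        else:
            waiting[queue].append(qid)
    return running, waiting

# Seven cheap queries and three expensive ones arrive at once.
running, waiting = enqueue([10] * 7 + [10**9] * 3)
print(running)   # 5 small + 1 large query admitted
print(waiting)   # the rest wait in FIFO order
```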

 

If you manage the spill-to-disk options well, the heavy workload queries will slow down but not fail. This slowdown, due to queuing and spilling, discourages users who might otherwise overload the system. Incidentally, the control on concurrency also sets an upper bound on the number of CPU cores that can be in use overall, as each query is subject to the same planner.width.max_per_query parameter. If these resources are sized appropriately, the system's response time for interactive workloads at a given user concurrency can improve, because the waiting time for such queries is reduced.
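The core bound alluded to above is simple arithmetic: at most (small slots + large slots) queries run at once, and each can fan out to at most planner.width.max_per_query parallel fragments. A sketch, with illustrative numbers:

```python
def max_cores_in_use(small_slots: int, large_slots: int, width_per_query: int) -> int:
    """Upper bound on cores busy at once: every concurrently running query
    can occupy at most `planner.width.max_per_query` parallel fragments,
    roughly one core each."""
    return (small_slots + large_slots) * width_per_query

# e.g., 5 small slots, 1 large slot, width capped at 40 fragments per query:
print(max_cores_in_use(5, 1, 40))  # -> 240
```

Sizing the queue slots and per-query width against the cluster's actual core count is what keeps interactive queries from queuing behind an oversubscribed CPU.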

Sample parameters of Drill queues, showing that 5 light workload queries ("Small") run for every heavy workload ("Large") query. 

 

As shown above, the Cost Threshold is a measure of workload size computed by Drill at query planning time. Since this parameter is an estimate, it needs tuning. To enable tuning during POCs and testing, we surface the estimate in the query profile itself, so you can trace how queries with well-understood workloads were queued against that parameter. For the curious, query profiles are JSON files that can themselves be queried by Drill. From our point of view, this manual queuing feature is an important step toward our future goal of making queue assignment automatic and dynamic through more intelligent, feedback-driven control.

To enable threshold parameter tuning, query profiles contain a new column for total cost and its associated queue.
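Since profiles are plain JSON, you can also inspect them directly with any JSON tooling. The field names below are hypothetical stand-ins for the cost and queue entries; check a real profile from your cluster's profile store for the exact keys:

```python
import json

# Hypothetical excerpt of a query profile; real profiles contain many more
# fields, and the exact key names may differ by release.
profile_text = '{"queryId": "example-query", "totalCost": 12000000.0, "queueName": "small"}'

profile = json.loads(profile_text)
print(f"cost={profile['totalCost']:.3g} queue={profile['queueName']}")
```

Comparing the recorded cost of a known-heavy query against the queue it landed in is exactly the feedback loop you need to tune the threshold.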

 

And One Last Thing

If you are an aspiring data scientist, learning SQL could be one thing worth doing now. We recently announced the launch of a new product called the MapR Data Science Refinery, a scalable data science offering that comes pre-packaged with a notebook, Apache Zeppelin. You can use the notebook as a SQL interface to retrieve data and visualize it.

Drill queries alongside a simple pie chart representation.

 

Now, It's Your Turn to Try

No release is ever complete without you giving it a try and seeing for yourself if it delivers the value you are looking for. With that, I invite you to give the new Apache Drill 1.11 on MapR 6.0 a try, and let me know your feedback. 

 

The MapR Converge Community pages for Apache Drill also got a face-lift. Check it out here

 

Happy Drill-ing!
