
Apache Drill has started to understand XML

Blog Post created by mpierre on Jan 25, 2017

A few months ago I created the first XML plugin for Apache Drill, mostly as a test to see if it was possible. The idea behind the plugin is simple: since Apache Drill already has great support for JSON, why not convert the XML documents to JSON and somehow magically feed the result into Drill's JSON reader for further processing and presentation?

I already had a SAX-based XML-to-JSON parser that I'd written for a demo at a prospect, and it kind of did the job. So, armed with the source code for Apache Drill, I set out to try my ideas.

One hour later I had the first implementation that hooked into Drill the way I wanted and that compiled. Here's the scoop: it worked the first time.

Extending Apache Drill is simple when reusing the base written by genius developers. I'm not really a developer myself so trust me on this: If I can do it, so can you.

 

My code was not 100% there, though: it produced lots of errors, for the simple reason that I had not thought about gearing the parser towards the JSON layout Apache Drill likes.

Since it wasn't really useful, I forgot about the project and kept on with my day-to-day work at MapR as a Systems Engineer.

 

The XML plugin: the second iteration

About a week ago I was asked to see if the Drill plugin could do some magic with some specific XML documents for a customer. Since writing the plugin I had worked quite a lot with Apache Spark and Spark XML, so I had some new ideas to bring into the code: for instance, keeping attributes under keys prefixed with an @ sign, and keeping the text inside a tag under a #value key. After looking at the code and seeing how Drill reacted to my generated JSON, I decided to rewrite the XML plugin to generate better JSON and hopefully support more XML documents. Now I can safely say that the investment was well worth the effort.
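To make the @ and #value convention concrete, here is a minimal sketch of a SAX-based XML-to-JSON conversion in Python. It is not the plugin's actual code (the plugin is written in Java and hooks into Drill's JSON reader); the class name XmlToJsonHandler and the function xml_to_json are made up for illustration, and turning repeated sibling tags into a list is my assumption about a reasonable layout, not something taken from the plugin.

```python
import json
import xml.sax


class XmlToJsonHandler(xml.sax.ContentHandler):
    """Hypothetical helper: builds a nested dict from SAX events."""

    def __init__(self):
        super().__init__()
        self.root = {}
        self._stack = [self.root]  # dicts of the currently open elements
        self._text = [[]]          # text fragments collected per open element

    def startElement(self, name, attrs):
        # Attributes are stored under keys prefixed with "@".
        node = {"@" + key: value for key, value in attrs.items()}
        parent = self._stack[-1]
        if name in parent:
            # Assumption: repeated sibling tags are collected into a list.
            existing = parent[name]
            if isinstance(existing, list):
                existing.append(node)
            else:
                parent[name] = [existing, node]
        else:
            parent[name] = node
        self._stack.append(node)
        self._text.append([])

    def characters(self, content):
        self._text[-1].append(content)

    def endElement(self, name):
        node = self._stack.pop()
        text = "".join(self._text.pop()).strip()
        if text:
            # Text inside a tag is stored under "#value".
            node["#value"] = text


def xml_to_json(xml_text):
    """Convert an XML string to a JSON string using the handler above."""
    handler = XmlToJsonHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return json.dumps(handler.root)
```

Calling xml_to_json on a small document such as <book id="1"><title>Drill</title></book> gives a JSON object where the id attribute shows up as "@id" and the title text as "#value". Note that this sketch simply concatenates all text found directly inside an element, so mixed content loses its position relative to child tags.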

 

What can it do?

Based on my own tests, it can query "almost" any XML file and get back a workable JSON document that Drill understands and can work with. So far I have tested with network data, POS data, some European Union data, Excel XML sheet data, database logs in XML, and even the Mondial data sample in XML form; all of them work and can be queried directly with Drill. There's a long way to go before it becomes a central piece of Apache Drill, and I am sure the Apache Drill engineers have more clever ways of solving many of the tasks I have dealt with in the code, but it shows some of the capabilities that may soon be part of Apache Drill.

 

How to get hold of it:

Since I made some small modifications to the JSONReader to be able to hook in my code, you need to run my Apache Drill version in order for it to work. The code resides on GitHub, in the DRILL-3878 branch of magpierre/drill (a mirror of Apache Drill): https://github.com/magpierre/drill/tree/DRILL-3878

 

Compile the project using mvn (you may have to bump up the memory with MAVEN_OPTS in order for the compilation to go through):

mvn clean package -DskipTests

 

Once successfully compiled, move the contrib/storage-xml/target/drill-xml-storage-1.7.0-SNAPSHOT.jar into the jars/3rdparty folder of the Apache Drill distribution you just built.

 

In order to configure XML support you just add:

    "xml": {
      "type": "xml",
      "extensions": [
        "xml"
      ],
      "keepPrefix": true
    }

 

to the formats section of your storage config for dfs, and off you go. If the plugin was registered successfully, updating the storage config will report success, and after that you can query XML documents.
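If you keep your storage config as a JSON file and prefer to script the change, the step can be sketched roughly like this in Python. The function name add_xml_format is made up, and the sample dfs config in the usage note is a trimmed-down stand-in, not a complete Drill config; only the "xml" entry itself comes from this post.

```python
import copy

# The format entry from this post, as a Python dict.
XML_FORMAT = {"type": "xml", "extensions": ["xml"], "keepPrefix": True}


def add_xml_format(dfs_config):
    """Return a copy of a dfs storage config with the xml format added.

    The input dict is left untouched, so the original config can be kept
    around in case the update is rejected.
    """
    updated = copy.deepcopy(dfs_config)
    # Create the "formats" section if it is missing, then add the xml entry.
    updated.setdefault("formats", {})["xml"] = copy.deepcopy(XML_FORMAT)
    return updated
```

For example, add_xml_format({"type": "file", "formats": {"json": {"type": "json"}}}) returns a config whose formats section contains both the existing json entry and the new xml entry. The result still has to be saved through the storage config page in Drill, as described above.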

 

Let me know how it goes. I will issue a pull request to Apache Drill once I figure out how to do it.

 

Regards,

Magnus
