PYETL: Demonstration of Streams, MapR-DB, Parquet with Python

Blog Post created by mandoskippy on May 23, 2017

Being a hacker of sorts, I wanted to throw together some tools that I could reuse over and over again, run with "decent" scale (handle different work loads) and have it be somewhat easy to reproduce and have other implement/improve on. 


To that end, I wanted a single tool that could read json off a Kafka or MaprR stream topic, and then given instructions for the topic, push that into a location that could be read easily by Apache Drill or other tools.  This includes outputs of MapR-DB.  Thus, PYETL was born. 


Now, Kakfa Connect is one of those tools that should be able to do what I want it to do, MapR does have a Kafka Connect package, but I haven't reverse engineered it yet, although I plan to.  I am guessing that Kafka Connect gives me some additional features (like having a Kafka Connect cluster be able to handle multiple topics to multiple locations).  My setup has all independent consumers. More on that in a bit. 


Another thing that Kafka Connect didn't have is a MapR-DB example, now, there is an HBASE sink, so maybe I can hack around on that, but I am not a Java expert, my scripting talents lie in Python, and thus, I wanted to demonstrate what I am trying to do in Python. Now, for performance reasons etc, perhaps Kafka Connect would be great, but I need to see some examples of it doing what my stuff is doing, and be easy to mange... the complexity of what Kafka Connect offers and what I am doing are vastly different, I pay a penalty for that in two areas.


1. Size (I use a Docker image that ends up 2 GB in size!, luckily, that is cached, so if I have 5 instances running on a node, it only takes up 2 GB).

2. Performance.  Using Python, even with C based libraries being wrapped (fast parquet, libhase, and librdkafka) I am sure Java performance is going to be better... that said, I haven't had issues... yet. 


So Python, did I use messy threading etc? Nope... I run a Mesos cluster based on Jim Scott talks on Zeta Architecture. Ted Dunning got me hooked on Mesos a few years ago, and I've been playing around merging MapR and Mesos for a while.  Since I am running Mesos and Marathon, I run multiple instances of my script for each topic I ETL.  So let's say I have a topic, web logs, with 6 partitions in it. I have each instance run the same config file (with the same group name) so it joins the group.  If I run 2 instances, and can handle all the data, Kafka/MapR streams distributes the two partitions across my two instances.  I can run up to 6 individual instances to handle the partitions for my topic. Marathon just lets me scale up or down for that task because the config is the same. (I use the HOSTNAME of the docker container as a unique identifier when I am writing json files or Parquet files).   Thus I can have multiple instances of my ETL running, managed by Marathon. This allows me some sense of scale, even with Python. 


So, json, parquet? MapR DB?  Yes, I have multiple outputs... I wanted to have some basic config files or ENV variables as how I configured things, and then be able to easily output to directory partitioned json or Parquet files for Apache Drill or Spark, while still being able to output with the same script to Mapr DB.  I allow the users to specify the partition fields, and even column family mappings in MapR DB. The row key can be specified from other fields, or you can add "RANDOMVALUE" (Fun fact, I was ETLing some web logs, and I ran two instances, one to JSON and one to MapRDB, and for some reason the MapRDB table was steadily losing the race when I ran select count(*) Apparently, my rowkey definition didn't have enough entropy, and I had collision effectively overwriting records...)


The multiple outputs are sure interesting when I do performance checking, especially as data sets grow!


I wouldn't consider this "production" ready yet, and would happily take improvements from the community.  I had some basic goals in mind here


1. Make this dockerizable so others could repeat exactly what I am doing to make it work

2. Make it fairly simple/easy to understand for Python folks

3. Make have an acceptable level of performance. 

4. Make it something I would WANT to use to quickly get data on streams into sane formats from JSON on a topic. 


It's something to play with, 




GitHub - JohnOmernik/pyetl: Python based ETL of JSON records from into multiple destinations