
Best practices when using compressed (lzo) files

Question asked by sjgx on Dec 8, 2016
Latest reply on Dec 9, 2016 by namato

I have a large amount of data (thousands of LZO-compressed JSON files, 19TB in total) sitting on my dfs. I'm trying to figure out the best way to process everything using my existing Hadoop code.

 

1) Since MapR compresses everything by default, can I just feed these files into my existing code?

 

hadoop jar MyCode.jar org.MyCode -libjars someJars -input /path/to/input.json.lzo -output /path/to/output

 

2) If not, should I just decompress these files and rewrite them, letting MapR handle the compression?

 

hadoop fs -cat /path/to/input.json.lzo  | lzop -d | hadoop fs -put - /path/to/new/input.json

 

3) Or maybe there's a smarter third option?

 

My end goal, ideally, is to have these converted from JSON to CSV, and I've been handling that via (2) above. I wrote a Python script to do the conversion and have been running

 

hadoop fs -cat /path/to/input.json.lzo  | lzop -d | python json2csv.py |  hadoop fs -put - /path/to/new/input.csv

 

which is taking FOREVER. Now I'm in the process of writing a Hadoop version of my Python script, but I wasn't sure whether I could actually feed it the lzo files.
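
For context, the json2csv.py in the pipeline above is roughly a stdin-to-stdout filter like the sketch below. This is a minimal, hypothetical version (the actual script isn't shown, and the field names are assumptions): it reads one JSON object per line from stdin and writes CSV rows to stdout.

```python
#!/usr/bin/env python
# Minimal sketch of a json2csv.py-style filter (hypothetical; the real
# script is not shown in the question). Reads newline-delimited JSON from
# stdin and writes CSV to stdout, so it can sit in the middle of the
# "hadoop fs -cat | lzop -d | ... | hadoop fs -put" pipeline above.
import csv
import json
import sys

# Assumed, illustrative column list; the real field names depend on the data.
FIELDS = ["id", "timestamp", "value"]

def main():
    writer = csv.writer(sys.stdout)
    writer.writerow(FIELDS)
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Missing keys become empty cells rather than raising KeyError.
        writer.writerow([record.get(field, "") for field in FIELDS])

if __name__ == "__main__":
    main()
```

Written as a plain filter like this, the same script could in principle be reused as a Hadoop Streaming mapper; the open question either way is whether the LZO input can be read directly.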
