I have a large amount of data (thousands of LZO-compressed JSON files, 19 TB in total) sitting on my DFS. I'm trying to figure out the best way to process everything with my existing Hadoop code.
1) Since MapR compresses everything by default, can I just feed these files into my existing code?
hadoop jar MyCode.jar org.MyCode -libjars someJars -input /path/to/input.json.lzo -output /path/to/output
2) If not, should I just decompress these files and rewrite them, letting MapR handle the compression?
hadoop fs -cat /path/to/input.json.lzo | lzop -d | hadoop fs -put - /path/to/new/input.json
3) Maybe there is a smarter third option?
My end goal, ideally, is to have these converted from JSON to CSV, and I've been handling it via (2) above. I wrote a Python script to do the conversion and have been running
hadoop fs -cat /path/to/input.json.lzo | lzop -d | python json2csv.py | hadoop fs -put - /path/to/new/input.csv
which is taking FOREVER. Now I'm in the process of writing a Hadoop version of my Python script, but I wasn't sure whether I could actually feed it the LZO files directly.
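For reference, here's a stripped-down sketch of the kind of filter I mean (the field names in FIELDS are placeholders, not my real columns). Since it just maps stdin to stdout, the same shape should drop in as a Hadoop Streaming mapper:

```python
#!/usr/bin/env python
"""Sketch of a json2csv.py filter: one JSON object per input line,
one CSV row per output line. FIELDS is a placeholder schema."""
import csv
import json
import sys

FIELDS = ["id", "timestamp", "value"]  # placeholder column names


def json_line_to_row(line, fields=FIELDS):
    """Parse one JSON line and pull out the columns, blank if missing."""
    obj = json.loads(line)
    return [obj.get(f, "") for f in fields]


def main(stdin=sys.stdin, stdout=sys.stdout):
    writer = csv.writer(stdout)
    for line in stdin:
        line = line.strip()
        if line:  # skip blank lines
            writer.writerow(json_line_to_row(line))


if __name__ == "__main__":
    main()
```

Locally this runs as `lzop -d < input.json.lzo | python json2csv.py`; in a Streaming job it would be the `-mapper` with zero reducers.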