Spark: MR job reduce-side optimizations -- issue with storing shuffle write data on disk

Question asked by bgajjela on Mar 24, 2016
MR job: need reduce-side optimizations with ORC format.


We have certain jobs that run daily; most of them are Spark jobs and Hive jobs.




We have an issue with the shuffle write data stored on disk (spark.local.dir): the shuffle write size shown on the application master web page does not add up to the amount of data actually written to MAPR_HOME/tmp. This fills up the home directory and causes failures. To avoid this, I am looking at the options below:


1) Point spark.local.dir to a larger mount point -- not preferred


2) Compress the intermediate data to reduce the storage it needs, using a compression codec such as bzip2 -- preferred


How can we achieve this?
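
For reference, here is a minimal sketch of what option 2 could look like as Spark settings. The property keys (spark.shuffle.compress, spark.shuffle.spill.compress, spark.io.compression.codec) are standard Spark configuration properties; the helper names shuffle_compression_conf and to_submit_args are hypothetical and only there for illustration. As far as I can tell, the built-in short codec names for spark.io.compression.codec in Spark 1.x are lz4, lzf and snappy, and bzip2 is not among them, so the sketch rejects it rather than passing it through.

```python
# Hypothetical helpers; only the spark.* property keys are real Spark settings.

# Short codec names assumed to be accepted by spark.io.compression.codec (Spark 1.x).
SUPPORTED_CODECS = ("lz4", "lzf", "snappy")


def shuffle_compression_conf(codec="lz4"):
    """Return Spark properties that compress shuffle output and spill files."""
    if codec not in SUPPORTED_CODECS:
        raise ValueError(
            "unsupported codec %r, expected one of %s" % (codec, SUPPORTED_CODECS)
        )
    return {
        # Compress map output files written during the shuffle.
        "spark.shuffle.compress": "true",
        # Compress data spilled to disk during shuffles.
        "spark.shuffle.spill.compress": "true",
        # Codec used for shuffle output, spills and other internal data.
        "spark.io.compression.codec": codec,
    }


def to_submit_args(conf):
    """Render a dict of properties as spark-submit '--conf key=value' arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", "%s=%s" % (key, conf[key])]
    return args


if __name__ == "__main__":
    # Print the flags so they can be pasted into a spark-submit command line.
    print(" ".join(to_submit_args(shuffle_compression_conf("snappy"))))
```

The same key/value pairs can go into spark-defaults.conf instead of the command line. Both compression flags are, to my knowledge, already enabled by default, so the codec choice is probably the only setting that changes anything here.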