Spark: MR JOB Reduce Side Optimizations -- issue with storing data for shuffle write on disk

Question asked by bgajjela on Mar 24, 2016
Latest reply on Oct 17, 2016 by bgajjela

MR job: need reduce-side optimizations with ORC format.

 

We have certain jobs that run on a daily basis; these are mostly Spark jobs and Hive jobs.

 

SPARK ISSUE:

 

We have an issue with the shuffle write data stored on disk (spark.local.dir): the size shown on the application master web page does not add up to the data actually being written to MAPR_HOME/tmp. This fills up the home directory and causes further problems. To avoid this, I am looking at the options below:

 

1) Point spark.local.dir to a bigger mount point. -- Not preferred

 

2) Compress the intermediate data to reduce the storage it needs, using a compression codec such as bzip2. -- Preferred

 

How can we achieve this?
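
For reference, here is a rough sketch of how the two options map onto Spark properties. The property keys (spark.shuffle.compress, spark.shuffle.spill.compress, spark.io.compression.codec, spark.local.dir) are Spark's own; the helper names shuffle_conf and to_submit_args are hypothetical and only there for illustration. I am assuming a codec short name such as lz4, since as far as I know bzip2 is not one of the names spark.io.compression.codec accepts. I am also assuming the job runs on YARN, where spark.local.dir may be overridden by the local directories configured by the cluster manager.

```python
# Sketch only: collect shuffle-related Spark properties and render them
# as spark-submit flags. Helper names are hypothetical; keys are Spark's.

def shuffle_conf(codec="lz4", local_dir=None):
    """Return Spark properties that affect shuffle data written to local disk."""
    conf = {
        # Option 2: compress map output files written during the shuffle.
        "spark.shuffle.compress": "true",
        # Option 2: compress data spilled to disk during shuffles.
        "spark.shuffle.spill.compress": "true",
        # Codec for internal data such as shuffle outputs
        # (assumption: a short name like lz4, lzf or snappy; not bzip2).
        "spark.io.compression.codec": codec,
    }
    if local_dir is not None:
        # Option 1: scratch directory for shuffle files. On YARN the
        # cluster manager's local dirs may take precedence over this.
        conf["spark.local.dir"] = local_dir
    return conf


def to_submit_args(conf):
    """Render properties as a flat list of spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args
```

Joining to_submit_args(shuffle_conf()) with spaces gives the --conf flags to pass to spark-submit; passing local_dir as well covers option 1. Whether this actually shrinks what lands in MAPR_HOME/tmp depends on how compressible the shuffle data is and on which directory the cluster manager really uses.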
