How to fix a Scalding (or Cascading) job that fails to write to the local volume in the reduce stage?


Author: Venkata Sowriraja, last modified by Hassan Shaik on April 21, 2015

Original Publication Date: March 25, 2015

 

Environment:

MapR 4.0.1
Scalding (or Cascading)

Symptom:

Application fails with one of the stack traces below.

Classic Mode (the task tries to write to the node's local volume in the reduce stage):

 

java.io.IOException: Create failed for file: /var/mapr/local/node1.centos1.mapr.sj.us/mapred/taskTracker/spill/job_201501091417_0003/attempt_201501091417_0003_r_000009_0/map_0.out, error: Remote I/O error (121)
  at com.mapr.fs.MapRClientImpl.create(MapRClientImpl.java:159)
  at com.mapr.fs.MapRFileSystem.create(MapRFileSystem.java:640)
  at com.mapr.fs.MapRFileSystem.create(MapRFileSystem.java:682)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:809)
  at org.apache.hadoop.mapred.IFile$Writer.<init>(IFile.java:95)
  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.createKVIterator(ReduceTask.java:2474)
  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.access$400(ReduceTask.java:611)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:423)
  at org.apache.hadoop.mapred.Child$4.run(Child.java:278)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1469)

YARN Mode (the task tries to write to the node's local volume in the reduce stage):

 

Caused by: java.io.IOException: Create failed for file: /var/mapr/local/nmkucs1/mapred/nodeManager/spill/job_1420575394426_0002/attempt_1420575394426_0002_r_000000_2/map_0.out.merged, error: No data available (61)
  at com.mapr.fs.MapRClientImpl.create(MapRClientImpl.java:166)
  at com.mapr.fs.MapRFileSystem.create(MapRFileSystem.java:669)
  at com.mapr.fs.MapRFileSystem.create(MapRFileSystem.java:711)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:809)
  at org.apache.hadoop.mapred.IFile$Writer.<init>(IFile.java:130)
  at org.apache.hadoop.mapreduce.task.reduce.DirectShuffleMergeManagerImpl.finalMerge(DirectShuffleMergeManagerImpl.java:734)
  at org.apache.hadoop.mapreduce.task.reduce.DirectShuffleMergeManagerImpl.close(DirectShuffleMergeManagerImpl.java:379)
  at org.apache.hadoop.mapreduce.task.reduce.DirectShuffle.run(DirectShuffle.java:152)
  ... 6 more

Root Cause:

The root cause is that Scalding (or Cascading) resolves these job properties on the client side at submission time instead of leaving them to be resolved at run time on each node. For example:

mapr.mapred.localvolume.mount.path = /var/mapr/local/nmkucs1/mapred/nodeManager/spill/

This property was expanded by Scalding (or Cascading) on the client, so the hostname of the submitting node (nmkucs1) is baked into the job configuration. Map and reduce tasks use the MapR local volume as scratch space; if the submitting node goes down, tasks running on other nodes can no longer reach that volume and fail with the java.io.IOException shown above. This should not happen: the properties should stay in their unresolved form and expand on each node at run time, as defined below.

 

mapr.mapred.localvolume.mount.path = ${mapr.localvolumes.path}/${mapr.host}/mapred
mapr.mapred.localvolume.root.dir.path = ${mapr.mapred.localvolume.mount.path}/${mapr.mapred.localvolume.root.dir.name}
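
To see why client-side expansion causes the problem, note how Hadoop's org.apache.hadoop.conf.Configuration treats ${...} variables: get() expands them against the values visible at the moment it is called, while getRaw() returns the unexpanded template. A minimal sketch of the difference (the property values here mirror the example above and are illustrative only):

import org.apache.hadoop.conf.Configuration;

public class LocalVolumeExpansionDemo {
    public static void main(String[] args) {
        // Start from an empty configuration so only our test values apply.
        Configuration conf = new Configuration(false);

        // Values as they would look on the submitting node (hostname nmkucs1).
        conf.set("mapr.localvolumes.path", "/var/mapr/local");
        conf.set("mapr.host", "nmkucs1");
        conf.set("mapr.mapred.localvolume.mount.path",
                 "${mapr.localvolumes.path}/${mapr.host}/mapred");

        // getRaw() preserves the template, so each node can expand it with
        // its own hostname at run time.
        System.out.println(conf.getRaw("mapr.mapred.localvolume.mount.path"));
        // -> ${mapr.localvolumes.path}/${mapr.host}/mapred

        // get() expands variables immediately. If the client calls get() and
        // ships the result in the job configuration, the submitting node's
        // hostname is baked in for every task.
        System.out.println(conf.get("mapr.mapred.localvolume.mount.path"));
        // -> /var/mapr/local/nmkucs1/mapred
    }
}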

Solutions:

Workaround

The workaround is to add the following properties to mapred-site.xml on all TaskTracker (classic mode) or NodeManager (YARN mode) nodes. Marking them <final>true</final> prevents the pre-resolved values shipped with the job from overriding the node-local templates:

<property>
  <name>mapr.mapred.localvolume.mount.path</name>
  <value>${mapr.localvolumes.path}/${mapr.host}/mapred</value>
  <final>true</final>
</property>

<property>
  <name>mapr.mapred.localvolume.root.dir.path</name>
  <value>${mapr.mapred.localvolume.mount.path}/${mapr.mapred.localvolume.root.dir.name}</value>
  <final>true</final>
</property>
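
To confirm that <final>true</final> has the intended effect, recall that Hadoop's Configuration ignores attempts by later-loaded resources to override a property marked final in an earlier one. A small sketch, assuming a node-local file and a stand-in for the job configuration are on the classpath (both file names are hypothetical):

import org.apache.hadoop.conf.Configuration;

public class FinalPropertyDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration(false);

        // node-mapred-site.xml: hypothetical node-local file containing the
        // properties above with <final>true</final>.
        conf.addResource("node-mapred-site.xml");

        // job-conf.xml: hypothetical stand-in for the job configuration,
        // which carries the client-side expanded path.
        conf.addResource("job-conf.xml");

        // The override from job-conf.xml is ignored because the node-local
        // property is final, so the ${...} template survives and each node
        // expands it locally.
        System.out.println(conf.getRaw("mapr.mapred.localvolume.mount.path"));
        // -> ${mapr.localvolumes.path}/${mapr.host}/mapred
    }
}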
