
All reduce jobs fail

Question asked by thealy on Apr 24, 2013
Latest reply on May 14, 2013 by thealy
V2.1.1 / M3

The applications in question ran many times without a problem after the upgrade to v2.1.1. After a data loss and recovery affecting only one subset of the jobs, all Reducer tasks fail with 'Child Error':

<pre>
java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:275)
Caused by: java.io.IOException: Task process exit with nonzero status of 134.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:262)
</pre>

Every node where a reducer task was attempted produces a core file, and the logs show:

<pre>
libprotobuf FATAL /usr/local/protobuf-2.4.1//include/google/protobuf/repeated_field.h:666] CHECK failed: (index) < (size()):
terminate called after throwing an instance of 'google::protobuf::FatalException'
  what():  CHECK failed: (index) < (size()):
</pre>
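
For context on how the two errors relate: exit status 134 is 128 + 6, meaning the child JVM was killed by SIGABRT. That matches the libprotobuf message above, since an uncaught google::protobuf::FatalException from a failed CHECK aborts the native code, and the TaskTracker then reports the nonzero exit status. A minimal sketch of the same class of failure (purely illustrative, not the Hadoop/MapR code; it assumes a protobuf 2.4.1 build with its CHECK macros enabled):

<pre>
// Illustrative only: reading past the end of a protobuf repeated field
// trips the same "CHECK failed: (index) < (size())" in repeated_field.h,
// which throws google::protobuf::FatalException; left uncaught, that calls
// terminate()/abort(), and the parent sees exit status 134 (128 + SIGABRT).
#include <google/protobuf/repeated_field.h>

int main() {
  google::protobuf::RepeatedField<int> values;
  values.Add(42);         // size() == 1
  return values.Get(3);   // index 3 >= size(): fatal CHECK, process aborts
}
</pre>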

All map-only jobs work fine, as does DistCp. I've been working on this for many days now without progress.

Thanks for any suggestions.
