All reduce jobs fail

Question asked by thealy on Apr 24, 2013
V2.1.1 / M3

The applications in question ran many times without a problem after the upgrade to v2.1.1. After a data loss and recovery affecting only one subset of the jobs, all Reducer tasks fail with 'Child Error':

java.lang.Throwable: Child Error at Caused by: Task process exit with nonzero status of 134. at

Every node where a reducer task was attempted produces a core file, and the task logs there show:

libprotobuf FATAL /usr/local/protobuf-2.4.1//include/google/protobuf/repeated_field.h:666] CHECK failed: (index) < (size()):
terminate called after throwing an instance of 'google::protobuf::FatalException'
  what():  CHECK failed: (index) < (size()):
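
For reference, the exit status of 134 seems consistent with an abort rather than an ordinary error exit: by the usual shell convention, a status above 128 means 128 plus the number of the signal that killed the process, and 134 - 128 = 6, which is SIGABRT on Linux. That would match the "terminate called" line and the core files. A minimal sketch of that decoding follows; the helper name describe_exit_status is hypothetical and is not part of Hadoop or MapR:

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status.

    Assumes the common convention that values above 128 mean
    128 + signal number; this is a convention, not a guarantee.
    """
    if status > 128:
        signum = status - 128
        try:
            # Map the signal number to its symbolic name on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited normally with code {status}"


print(describe_exit_status(134))
```

The signal name printed depends on the platform the snippet runs on, so it is best run on one of the affected nodes.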

All map-only jobs work fine, as does DISTCP. I've been working on this for many days now.

Thanks for any suggestions.