Is there some sort of limitation, or rule of thumb on the number of simultaneous jobs I should limit my cluster to?
The reason I ask is that if I go over 15 or 20 long-running jobs, I inevitably end up with all kinds of mysterious cluster errors that I don't normally get if I run the jobs serially. Most of the time this ends up being fatal: a majority of the tasktrackers will get blacklisted for a given job, while other long-running jobs continue to run successfully on the same tasktrackers.
I've tried de-tuning the cluster so that I always have a comfortable cushion of free memory, CPU, network, etc., but that doesn't seem to help.
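To be concrete about what I mean by "de-tuning": I've been lowering per-node slot counts and shuffle buffers in mapred-site.xml, along these lines (the values below are illustrative, not what I'd recommend):

```xml
<!-- mapred-site.xml: illustrative de-tuning values only -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>   <!-- fewer map slots per node than the core count would allow -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>   <!-- fewer reduce slots per node -->
</property>
<property>
  <name>mapred.job.shuffle.input.buffer.percent</name>
  <value>0.50</value> <!-- lowered from the 0.70 default to leave shuffle headroom -->
</property>
```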
We're currently running 3.0.2, but this behavior goes back to the 1.x releases. It seems to change slightly from version to version.
Is there a specific log I should look at to diagnose this problem, or can you recommend any further actions?
Typically I start to see things like:

- Out of memory in the shuffle
- Lost task tracker
- Task process exit with nonzero status of 65
2014-02-14 06:50:31,760 WARN org.apache.hadoop.mapred.ReduceTask: Failed to read map output of task attempt_201402131946_0418_m_000160_0 for reduce 385 file path class org.apache.hadoop.fs.FidInfo[ fid: 11792.16653433.852739014, ipaddrs:  ]/output/job_201402131946_0418/attempt_201402131946_0418_m_000160_0/output.00385 error java.lang.NullPointerException
2014-02-14 06:50:31,760 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201402131946_0418_r_000385_0 copy failed: attempt_201402131946_0418_m_000160_0 from c10-n007
2014-02-14 06:50:31,760 WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Failed to fetch map-output for attempt_201402131946_0418_m_000160_0 from c10-n007
2014-02-14 06:50:26,7743 ERROR Client fs/client/fileclient/cc/client.cc:1073 Thread: 139683775858432 OpenFid failed for file output/job_201402131946_0418/attempt_201402131946_0418_m_000078, LookupFid error Stale File handle(116)