TT nodes distributed cache failure

Question asked by thealy on Jan 25, 2013
Running V2.0.0 on a small (20-node) M3 cluster.

When I run a Map/Reduce job that uses several .jars loaded into the Distributed cache, several (~4) nodes have their map tasks fail with a ClassNotFoundException. All the other nodes proceed through the job normally and the job completes, but this is wasting 20-25% of my TT nodes.

Can anyone explain why some nodes might fail to read all the .jars from the Distributed cache?
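
To help narrow this down, here is a minimal sketch of a check that could be run on a failing node and on a healthy node to compare what actually landed on local disk. It assumes the cached .jars are ordinary zip archives sitting in some local directory. The function names are made up for illustration, and the directory to scan is whatever the TaskTracker's local cache directory is on a given cluster, which is not something this sketch knows.

```python
import zipfile
from pathlib import Path


def class_entry(class_name):
    """Convert 'com.example.Foo' to the jar entry 'com/example/Foo.class'."""
    return class_name.replace(".", "/") + ".class"


def jars_containing(jar_dir, class_name):
    """Return sorted file names of readable jars under jar_dir that hold class_name."""
    entry = class_entry(class_name)
    hits = []
    for jar in sorted(Path(jar_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(jar.name)
        except (zipfile.BadZipFile, OSError):
            # Unreadable or truncated jar: skip it here, broken_jars() reports it.
            continue
    return hits


def broken_jars(jar_dir):
    """Return sorted file names of jars under jar_dir that are corrupt or unreadable."""
    bad = []
    for jar in sorted(Path(jar_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                # testzip() returns the first entry with a bad CRC, or None if all pass.
                if zf.testzip() is not None:
                    bad.append(jar.name)
        except (zipfile.BadZipFile, OSError):
            bad.append(jar.name)
    return bad
```

Calling `jars_containing(cache_dir, "the.missing.ClassName")` and `broken_jars(cache_dir)` on both kinds of node should show whether the failing nodes are missing a jar entirely or hold a truncated copy. It does not explain why that happened, only which case applies.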

[Apologies for cross-posting to those of you on hadoop-users; I'm not sure whether this is MapR-specific or not]