My Spark jobs keep failing, and the errors are usually similar to the one shown below. Each node has around 128 GB of memory and around 4 cores, and I have set the executor memory to 4 GB with an extra 4 GB of overhead. For shuffle, I have set the memory fraction to 0.5. I mention all this to show that it does not seem to be a memory issue. However, I cannot figure out what the issue could be. It comes up in one stage or another: I have rerun the job multiple times, and the failure shows up at different points. You can assume we have an infrastructure of 200+ nodes with a decent configuration.
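For reference, the settings described above would look roughly like the sketch below. The property names are an assumption on my part: they are the standard Spark-on-YARN keys I believe correspond to these values, and the helper `to_submit_args` is only a hypothetical illustration of how they would be passed to `spark-submit`.

```python
# Assumed mapping of the values described above to Spark property names.
# The names are not quoted from the actual job; adjust them to your setup.
SETTINGS = {
    "spark.executor.memory": "4g",
    "spark.yarn.executor.memoryOverhead": "4096",  # overhead, in MiB
    "spark.shuffle.memoryFraction": "0.5",
}


def to_submit_args(settings):
    """Render a settings dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in settings.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Print the equivalent command-line prefix for inspection.
    print(" ".join(["spark-submit"] + to_submit_args(SETTINGS)))
```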
Job aborted due to stage failure: Task 0 in stage 2.0 failed 12 times, most recent failure: Lost task 0.11 in stage 2.0 (TID 27, lgpbd1107.sgp.ladr.com): java.io.FileNotFoundException: /tmp/hadoop-mapr/nm-local-dir/usercache/names/appcache/application_1485048538020_113554/3577094671485456431296_lock (No such file or directory)
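One thing that might help narrow this down is counting which hosts the lost tasks ran on, since failures concentrated on a few nodes would point to infrastructure rather than the application. Below is a minimal sketch for pulling the host out of messages like the one above. The regular expression assumes the "Lost task ... (TID n, host): exception" layout seen in this error, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumes the "Lost task T.A in stage S (TID n, host...): exception" layout
# of the message above; other Spark versions may format this differently.
LOST_TASK = re.compile(
    r"Lost task (?P<task>\d+)\.(?P<attempt>\d+) in stage (?P<stage>\d+\.\d+) "
    r"\(TID (?P<tid>\d+), (?P<host>[^,)]+)[^)]*\): (?P<exception>[\w.$]+)"
)


def parse_lost_task(message):
    """Return the task, attempt, stage, TID, host and exception, or None."""
    match = LOST_TASK.search(message)
    return match.groupdict() if match else None


def failures_by_host(messages):
    """Count how many of the given failure messages name each host."""
    hosts = Counter()
    for message in messages:
        info = parse_lost_task(message)
        if info:
            hosts[info["host"]] += 1
    return hosts
```

Feeding it the failure messages collected from several runs would show whether the same few hosts keep appearing or the failures are spread across the cluster.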
I am unable to figure out whether this issue is related to the application or to the infrastructure. Could someone please help?