We have a critical issue in our production Hadoop data lake.
The problem occurs intermittently and concerns job processing in YARN: in the ResourceManager console, several jobs show as "RUNNING" but are not actually making progress, and every newly submitted job stays in "ACCEPTED" status.
As soon as we kill the running jobs, the waiting jobs start immediately.
Looking at the ResourceManager, NodeManager and JobHistory Server logs, there is no relevant warning, error or other information from the moment processing stops.
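Concretely, the workaround we apply is the following (the application ID below is a placeholder, not a real ID from our cluster):

```shell
# List the applications that YARN still reports as RUNNING
yarn application -list -appStates RUNNING

# Kill one of the hung applications; the ACCEPTED jobs then start immediately
yarn application -kill application_1500000000000_0001
```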
I tried to investigate the issue on our test environment, which has 4 nodes with 16 GB of memory each, of which 5 GB is allocated to YARN.
I used this Hive command:
select distinct codic from images;
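For reference, the memory each NodeManager offers to YARN is set in yarn-site.xml; a fragment matching the 5 GB allocation described above would look like this (the exact value on our nodes is an assumption on my part, I would have to double-check):

```xml
<!-- yarn-site.xml: memory each NodeManager makes available to YARN containers -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>5120</value> <!-- 5 GB, matching the allocation described above -->
</property>
```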
I ran this command on the first server: the job ran normally and everything was OK.
I did the same thing on two other servers; all 3 jobs ran normally and everything was OK.
When I submitted the command a fourth time, I could see in the logs that the ApplicationMaster was allocated, but just after the application moved to "RUNNING" status, YARN froze: all jobs stopped making progress, and any newly submitted job stayed in "ACCEPTED" status.
To get processing going again, I have to kill one of the running applications. Right after the kill, everything works again; but as soon as I submit another application, the same issue comes back.
I tried the same steps on our post-production cluster and got the same result!
What puzzles me is that in the ResourceManager console I can see that memory is still available, and there are no errors or warnings in the logs. Normally, if YARN does not have enough memory to run an application, it should kill the application itself and report an error.
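While investigating, I also came across the CapacityScheduler setting that caps the share of queue resources ApplicationMasters may consume, `yarn.scheduler.capacity.maximum-am-resource-percent` in capacity-scheduler.xml. I have not ruled it out yet; for reference, the stock default looks like this (0.1 is the shipped default, not a value taken from our config):

```xml
<!-- capacity-scheduler.xml: fraction of cluster resources that can be used
     to run ApplicationMasters; jobs beyond this limit wait in ACCEPTED -->
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.1</value>
</property>
```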
Is this a bug in YARN? Any help or pointers on this issue would be really appreciated, thank you!