
Why does YARN crash?

Question asked by ANIKADOS on Sep 11, 2017
Latest reply on Sep 29, 2017 by ANIKADOS

During MapReduce processing, YARN crashed and job processing stopped. I managed to get processing going again by killing the first running job, but after a few minutes there was another crash, which I resolved by killing the second running job.

 

We are looking for the cause of this crash, which we have had several times before (one to two times a month).

 

In the ResourceManager logs, I find this message repeated from the beginning of the crash until the first job was killed, and again for some minutes before the second job was killed:

 

2017-08-25 03:51:58,815 WARN org.apache.hadoop.ipc.Server: Large response size 4739374 for call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from 10.135.8.101:38352 Call#33361 Retry#0
2017-08-25 03:53:39,255 WARN org.apache.hadoop.ipc.Server: Large response size 4739374 for call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from 10.135.8.101:38456 Call#33364 Retry#0
2017-08-25 03:55:19,700 WARN org.apache.hadoop.ipc.Server: Large response size 4739374 for call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from 10.135.8.101:38556 Call#33367 Retry#0
2017-08-25 03:57:00,262 WARN org.apache.hadoop.ipc.Server: Large response size 4739374 for call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from 10.135.8.101:38674 Call#33370 Retry#0
2017-08-25 03:58:40,687 WARN org.apache.hadoop.ipc.Server: Large response size 4739374 for call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from 10.135.8.101:38804 Call#33373 Retry#0
.
.
.
2017-08-25 11:02:44,086 WARN org.apache.hadoop.ipc.Server: Large response size 4751251 for call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from 10.135.8.101:39778 Call#34159 Retry#0
2017-08-25 11:02:47,933 WARN org.apache.hadoop.ipc.Server: Large response size 4751251 for call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from 10.135.8.101:39778 Call#34162 Retry#0
2017-08-25 11:03:06,800 WARN org.apache.hadoop.ipc.Server: Large response size 4751251 for call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from 10.135.8.101:39814 Call#34165 Retry#0

 

NB: We still get this warning from time to time. We are wondering whether it concerns the connection between the node manager (10.135.8.101) and the resource manager, or something else.
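For context, that WARN comes from the Hadoop IPC server and only means a single RPC response exceeded the server's large-response threshold (about 1 MB by default); it is advisory rather than an error. getApplications belongs to ApplicationClientProtocol, the client-facing API, so the caller at 10.135.8.101 is presumably something acting as a YARN client and polling the full application list (a monitoring script or Hive, for example), not the NodeManager heartbeat, which uses a different protocol. Below is a minimal sketch of such a caller, assuming a plain YarnClient; filtering to RUNNING applications is one way to keep the response small. The class and enum names are the standard YARN client API, but the idea that the 10.135.8.101 caller works this way is an assumption.

import java.util.EnumSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListRunningApps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();
        YarnClient client = YarnClient.createYarnClient();
        client.init(conf);
        client.start();
        // Requesting only RUNNING applications keeps the getApplications
        // response much smaller than fetching every application the RM tracks.
        for (ApplicationReport report :
                client.getApplications(EnumSet.of(YarnApplicationState.RUNNING))) {
            System.out.println(report.getApplicationId() + " " + report.getName());
        }
        client.stop();
    }
}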

 

The same goes for the NodeManager log over the same period:

2017-08-25 03:51:54,396 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 98201 for container-id container_e41_1500982512144_36679_01_000382: 1.4 GB of 10 GB physical memory used; 10.1 GB of 21 GB virtual memory used
2017-08-25 03:51:54,791 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 112912 for container-id container_e41_1500982512144_36679_01_000387: 2.3 GB of 10 GB physical memory used; 10.1 GB of 21 GB virtual memory used
2017-08-25 03:51:55,177 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 105848 for container-id container_e41_1500982512144_36627_01_001644: 619.4 MB of 10 GB physical memory used; 10.1 GB of 21 GB virtual memory used
2017-08-25 03:51:58,938 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 98201 for container-id container_e41_1500982512144_36679_01_000382: 1.4 GB of 10 GB physical memory used; 10.1 GB of 21 GB virtual memory used
.
.
.
2017-08-25 11:05:40,104 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 112912 for container-id container_e41_1500982512144_36679_01_000387: 1.1 GB of 10 GB physical memory used; 10.1 GB of 21 GB virtual memory used
2017-08-25 11:05:40,493 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 105848 for container-id container_e41_1500982512144_36627_01_001644: 648.4 MB of 10 GB physical memory used; 10.1 GB of 21 GB virtual memory used
2017-08-25 11:05:43,867 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 98201 for container-id container_e41_1500982512144_36679_01_000382: 1.1 GB of 10 GB physical memory used; 10.1 GB of 21 GB virtual memory used
2017-08-25 11:05:45,040 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 105848 for container-id container_e41_1500982512144_36627_01_001644: 648.4 MB of 10 GB physical memory used; 10.1 GB of 21 GB virtual memory used
2017-08-25 11:05:48,397 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_e41_1500982512144_36627_01_001644 transitioned from RUNNING to KILLING
2017-08-25 11:05:48,397 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Application application_1500982512144_36627 transitioned from RUNNING to FINISHING_CONTAINERS_WAIT
2017-08-25 11:05:48,397 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_e41_1500982512144_36627_01_001644


And also for the JobHistory Server:

2017-08-25 03:53:06,504 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move intermediate done files
2017-08-25 03:56:06,504 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move intermediate done files
2017-08-25 03:59:06,504 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move intermediate done files
2017-08-25 04:02:06,504 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move intermediate done files
2017-08-25 04:05:06,504 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move intermediate done files
2017-08-25 04:08:06,504 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move intermediate done files
2017-08-25 04:11:06,504 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move intermediate done files

.
.
.
2017-08-25 11:05:36,504 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory: History Cleaner started
2017-08-25 11:05:41,271 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory: History Cleaner complete
2017-08-25 11:06:04,214 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
2017-08-25 11:08:06,504 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move intermediate done files
2017-08-25 11:08:06,518 INFO org.apache.hadoop.mapreduce.jobhistory.JobSummary: jobId=job_1500982512144_36793,submitTime=1503647426340,launchTime=1503651960434,firstMapTaskLaunchTime=1503651982671,firstReduceTaskLaunchTime=0,finishTime=1503651985794,resourcesPerMap=5120,resourcesPerReduce=0,numMaps=1,numReduces=0,user=mapr,queue=default,status=SUCCEEDED,mapSlotSeconds=9,reduceSlotSeconds=0,jobName=SELECT `C_7361705f62736973`.`buk...20170825)(Stage-1)
2017-08-25 11:08:06,518 INFO org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager: Deleting JobSummary file: [maprfs:/var/mapr/cluster/yarn/rm/staging/history/done_intermediate/mapr/job_1500982512144_36793.summary]
2017-08-25 11:08:06,518 INFO org.apache.hadoop.mapreduce.jobhistory.JobSummary: jobId=job_1500982512144_36778,submitTime=1503642110785,launchTime=1503651960266,firstMapTaskLaunchTime=1503651969483,firstReduceTaskLaunchTime=0,finishTime=1503651976016,resourcesPerMap=5120,resourcesPerReduce=0,numMaps=1,numReduces=0,user=mapr,queue=default,status=SUCCEEDED,mapSlotSeconds=19,reduceSlotSeconds=0,jobName=SELECT `C_7361705f7662726b`.`vbe...20170825)(Stage-1)

 

Do you have any explanation or solution for this issue?
