AnsweredAssumed Answered

Reducers fail at LOCALIZING in YARN

Question asked by kokonihau on Jun 15, 2015
Latest reply on Jun 29, 2015 by kokonihau
Hi, 
Recently we upgrade to v4.1 using Hive 0.12. 
From time to time we see queries that get hung on the reducer stage  and then fail on :
`org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed
 java.io.IOException: java.lang.InterruptedException` 
after what seems to be `mapreduce.task.timeout`.  


The details : 
Running a simple query that usaully succeeds using 2 mappers and 1 reducer. 
The application master runs on `node-03`. 
The reducer tried to run 3 times. 
On the first 2 it fails on `Timed out after 1800 secs` 
The first 2 times were on the node `node-04`. 
Thirds time it runs on another node and succeeds in few seconds. 

The first attempt is between `12:10` until failing at `12:40`. 
 There are no logs for the attempt itself (**why?**) but it the application master log we can see :

    2015-06-15 12:10:49,884 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1434057691672_0970_01_000008 to attempt_1434057691672_0970_r_000000_0
    2015-06-15 12:10:49,884 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:0 AssignedReds:1 CompletedMaps:2 CompletedReds:0 ContAlloc:4 ContRel:0 HostLocal:2 RackLocal:1
    2015-06-15 12:10:49,895 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapred.Task: mapOutputFile class: org.apache.hadoop.mapred.MapRFsOutputFile
    2015-06-15 12:10:49,896 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved node-04 to /default-rack
    2015-06-15 12:10:49,896 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1434057691672_0970_r_000000_0 TaskAttempt Transitioned from UNASSIGNED to ASSIGNED
    2015-06-15 12:10:49,897 INFO [ContainerLauncher #6] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_LAUNCH for container container_1434057691672_0970_01_000008 taskAttempt attempt_1434057691672_0970_r_000000_0
    2015-06-15 12:10:49,897 INFO [ContainerLauncher #6] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Launching attempt_1434057691672_0970_r_000000_0
    2015-06-15 12:10:49,900 INFO [ContainerLauncher #6] org.apache.hadoop.mapreduce.v2.app.job.event.TaskAttemptContainerLaunchedEvent: Shuffle port returned by ContainerManager for attempt_1434057691672_0970_r_000000_0 : 13562
    2015-06-15 12:10:49,901 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: TaskAttempt: [attempt_1434057691672_0970_r_000000_0] using containerId: [container_1434057691672_0970_01_000008 on NM: [node-04:59902]
    2015-06-15 12:10:49,901 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1434057691672_0970_r_000000_0 TaskAttempt Transitioned from ASSIGNED to RUNNING
    2015-06-15 12:10:49,901 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator: ATTEMPT_START task_1434057691672_0970_r_000000
    2015-06-15 12:10:49,901 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1434057691672_0970_r_000000 Task Transitioned from SCHEDULED to RUNNING
    2015-06-15 12:10:50,886 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1434057691672_0970: ask=1 release= 0 newContainers=0 finishedContainers=0 resourcelimit=<memory:0, vCores:0, disks:0.0> knownNMs=5
    2015-06-15 12:40:57,331 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1434057691672_0970_r_000000_0: AttemptID:attempt_1434057691672_0970_r_000000_0 Timed out after 1800 secs


looking at the node manager logs from that time we see : 

    2015-06-15 12:10:49,935 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1434057691672_0970_01_000008 by user hadoop
    2015-06-15 12:10:49,935 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hadoop       IP=###AM_IP### OPERATION=Start Container Request       TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1434057691672_0970    CONTAINERID=container_1434057691672_0970_01_000008
    2015-06-15 12:10:49,935 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Adding container_1434057691672_0970_01_000008 to application application_1434057691672_0970
    2015-06-15 12:10:49,935 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1434057691672_0970_01_000008 transitioned from NEW to LOCALIZING
    2015-06-15 12:10:49,935 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_INIT for appId application_1434057691672_0970
    2015-06-15 12:10:49,935 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event APPLICATION_INIT for appId application_1434057691672_0970
    2015-06-15 12:10:49,935 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got APPLICATION_INIT for service mapr_direct_shuffle
    2015-06-15 12:10:49,935 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event APPLICATION_INIT for appId application_1434057691672_0970
    2015-06-15 12:10:49,935 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got APPLICATION_INIT for service mapreduce_shuffle
    2015-06-15 12:10:49,935 INFO org.apache.hadoop.mapred.ShuffleHandler: Added token for job_1434057691672_0970
    2015-06-15 12:10:51,284 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Unknown localizer with localizerId container_1434057691672_0970_01_000007 is sending heartbeat. Ordering it to DIE
    2015-06-15 12:10:51,308 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Unknown localizer with localizerId container_1434057691672_0970_01_000007 is sending heartbeat. Ordering it to DIE
    2015-06-15 12:10:51,333 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed
    java.io.IOException: java.lang.InterruptedException
            at org.apache.hadoop.util.Shell.runCommand(Shell.java:541)
            at org.apache.hadoop.util.Shell.run(Shell.java:455)
            at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
            at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:236)
            at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1098)
    2015-06-15 12:10:51,334 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: RESOURCE_FAILED sent to absent container container_1434057691672_0970_01_000007
    2015-06-15 12:10:51,949 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1434057691672_0970_01_000007
    2015-06-15 12:40:57,411 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1434057691672_0970_000001 (auth:SIMPLE)


This happens again between 12:40 and 13:10. 
Then the AM decides to run it on another node and it succeeds. 
Any ideas?

Outcomes