
YARN Resource Manager bogus node

Question asked by Terry on Nov 3, 2016
Latest reply on Nov 14, 2016 by cwarman

Hello, we are running MapR 5.2 and having numerous Spark jobs fail because they attempt to connect to a non-existent node. The IP address in the errors is 0.0.218.212. It shows up in several places: the ResourceManager (RM) log, the NodeManager log on the RM host, and the userlogs of the node acting as Application Master for the job when it crashes.


 /opt/mapr/hadoop/hadoop-2.7.0/logs/yarn-mapr-resourcemanager-hd19.sec.bnl.local.log:2016-11-03 12:15:49,632 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_e10_1478184432441_0018_01_000044 of capacity <memory:7168, vCores:2, disks:0.0> on host 0.0.218.212:43363, which has 2 containers, <memory:14336, vCores:4, disks:0.0> used and <memory:46853, vCores:2, disks:1.5> available after allocation


2016-10-25 17:20:16,959 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Updating node address : 0.0.218.212:37468
2016-10-25 17:21:52,986 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager started at /0.0.218.212:37468
2016-10-25 17:21:52,986 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager bound to 0.0.218.212/0.0.218.212:0
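For reference, since the "bound to" lines suggest the NodeManager is registering with whatever address the host resolves for itself, this is roughly how I have been checking each node. The property names are the standard Hadoop 2.7 ones and the yarn-site.xml path assumes the default MapR layout, so adjust as needed:

# How does this host resolve its own name? A stale /etc/hosts or DNS entry
# could end up as the address the NodeManager registers with the RM.
hostname -f
getent hosts "$(hostname -f)"
hostname -I

# Are any of the address-related NodeManager properties set explicitly?
grep -A1 -E 'yarn.nodemanager.(hostname|bind-host|address)' /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/yarn-site.xml

Nothing obviously wrong turned up on any node so far.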


maprcli node list shows no such entity. I need to find where this address is being picked up so that I can purge it. I even ran a recursive grep under /opt/mapr/ on all my nodes to see if there was a corrupted config file, but only found the string in the logs mentioned above.
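I have also been comparing the cluster as seen from both sides, roughly like this (the -columns names for maprcli are from memory, so plain maprcli node list works just as well):

# What the ResourceManager currently thinks the cluster looks like,
# including lost/unhealthy nodes -- the phantom address may be listed here.
yarn node -list -all

# What the MapR CLDB thinks the cluster looks like.
maprcli node list -columns hostname,ip

The phantom address does not appear in either listing.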


Can anyone tell me how I can find the phantom node and get rid of it?

Thanks
