ResourceManager fails to transition to Active mode with "InvalidResourceRequestException"

Document created by Hao Zhu on Feb 7, 2016

Author: Hao Zhu

 

Original Publication Date: September 3, 2015

 

Environment:

Hadoop 2.5.1

Apache Hadoop ResourceManager HA enabled.

Symptom:

ResourceManager fails to transition to Active mode with "InvalidResourceRequestException".

 

The following stack trace appears first in the RM log:

Caused by: org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=9216, maxMemory=8192         
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:228)
at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateResourceRequest(RMAppManager.java:385)
at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:345)
at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:309)
at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1104)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:508)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
... 13 more

The following stack trace then repeats in the RM log:

WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active 
at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:122)
at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805)
at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)

Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:301)
at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:120)
... 4 more

Caused by: org.apache.hadoop.service.ServiceStateException: RMActiveServices cannot enter state STARTED from state STOPPED
at org.apache.hadoop.service.ServiceStateModel.checkStateTransition(ServiceStateModel.java:129)
at org.apache.hadoop.service.ServiceStateModel.enterState(ServiceStateModel.java:111)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:190)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:911)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:951)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:948)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1566)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:948)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:292)
... 5 more
2015-09-03 13:59:23,581 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session

Root Cause:

This is caused by YARN-3493, which is fixed in Hadoop 2.6.1, 2.7.1, and 2.8.0.

 

This issue can happen if users lower the value of yarn.scheduler.maximum-allocation-mb and then restart ResourceManager.

 

On restart, ResourceManager tries to recover the applications left in RMStateStore. Recovery fails for any application that requests more memory than the new yarn.scheduler.maximum-allocation-mb, even if that application failed a long time ago.
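To make the failing condition concrete, here is a minimal shell sketch of the check described in the exception message (requested memory < 0, or requested memory > max configured). The function name is hypothetical and this is not the Hadoop implementation; it only restates the condition using the numbers from the log.

```shell
# Hypothetical helper restating the condition from the exception message:
# a request passes only when 0 <= requested memory <= configured maximum.
is_valid_memory_request() {
    requested_mb=$1
    max_mb=$2
    [ "$requested_mb" -ge 0 ] && [ "$requested_mb" -le "$max_mb" ]
}

# Values from the RM log above: requestedMemory=9216, maxMemory=8192.
if is_valid_memory_request 9216 8192; then
    echo "request accepted"
else
    echo "request rejected"
fi
```

With the values from the log, the request is rejected, which matches what happens to the stored application during recovery.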

 

Solution:

1. Identify the RMStateStore class.

MapR uses FileSystemRMStateStore by default, which means the RMStateStore resides on MFS.

Users may also choose ZKRMStateStore.

$ hadoop2 conf | grep yarn.resourcemanager.store.class
<property><name>yarn.resourcemanager.store.class</name><value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value><source>yarn-default.xml</source></property>
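If only the configured value is needed, the XML line can be trimmed with sed. The sketch below is an assumption-laden convenience, not part of the original procedure: property_value is a hypothetical helper name, and the sample line is the output shown in this step rather than live cluster output.

```shell
# Hypothetical helper: print the text between <value> and </value>
# for each input line that contains such a pair.
property_value() {
    sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Sample line copied from this step; on a live cluster the output of
# `hadoop2 conf | grep yarn.resourcemanager.store.class` would be piped in instead.
sample='<property><name>yarn.resourcemanager.store.class</name><value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value><source>yarn-default.xml</source></property>'
printf '%s\n' "$sample" | property_value
```

The same helper should work for the other properties queried in step 2, since their output has the same shape.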

2. Find the location of RMStateStore.

If RMStateStore is using FileSystemRMStateStore, the parent location is defined by yarn.resourcemanager.fs.state-store.uri.

$ hadoop2 conf | grep yarn.resourcemanager.fs.state-store.uri
<property><name>yarn.resourcemanager.fs.state-store.uri</name><value>/var/mapr/cluster/yarn/rm/system</value><source>yarn-default.xml</source></property>

The application directories are then located under:

/var/mapr/cluster/yarn/rm/system/FSRMStateRoot/RMAppRoot

If RMStateStore is using ZKRMStateStore, the parent znode is defined by yarn.resourcemanager.zk-state-store.parent-path.

$ hadoop2 conf | grep yarn.resourcemanager.zk-state-store.parent-path
<property><name>yarn.resourcemanager.zk-state-store.parent-path</name><value>/rmstore</value><source>yarn-default.xml</source></property>

The application znodes are then located under:

/rmstore/ZKRMStateRoot/RMAppRoot/
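The two cases in this step follow the same pattern: parent location, then a store-specific root, then RMAppRoot. A small sketch of that mapping, assuming the directory names shown above; app_root is a hypothetical helper, not a Hadoop or MapR command.

```shell
# Hypothetical helper: derive the application root from the store class
# and the configured parent location, using the layout shown in this step.
app_root() {
    store_class=$1
    parent=${2%/}   # drop one trailing slash, if present
    case "$store_class" in
        *FileSystemRMStateStore) printf '%s\n' "$parent/FSRMStateRoot/RMAppRoot" ;;
        *ZKRMStateStore)         printf '%s\n' "$parent/ZKRMStateRoot/RMAppRoot" ;;
        *) echo "unrecognized store class: $store_class" >&2; return 1 ;;
    esac
}

# Values taken from the defaults shown in steps 1 and 2.
app_root org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore /var/mapr/cluster/yarn/rm/system
app_root org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore /rmstore
```

The first call prints the MFS path and the second prints the znode path given above (without the trailing slash).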

3. Move or remove all the application directories in RMStateStore.

Impact of this step: the RM UI will be empty, but application information remains viewable in the HistoryServer UI. The RM will not recover any failed or running applications, so users need to resubmit them.

For example:

For FileSystemRMStateStore:

hadoop fs -mv /var/mapr/cluster/yarn/rm/system/FSRMStateRoot/RMAppRoot/* /backup_statestore/
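Moving several sources at once requires the destination to be an existing directory, so it may help to create the backup directory first. The sketch below only prints the commands for review rather than running them; plan_statestore_backup is a hypothetical helper, and the paths are the ones used in this article.

```shell
# Hypothetical dry-run helper: print the commands that would create the
# backup directory and move the application directories into it.
plan_statestore_backup() {
    src_root=${1%/}
    backup_dir=${2%/}
    printf 'hadoop fs -mkdir -p %s\n' "$backup_dir"
    printf 'hadoop fs -mv %s/* %s/\n' "$src_root" "$backup_dir"
}

plan_statestore_backup /var/mapr/cluster/yarn/rm/system/FSRMStateRoot/RMAppRoot /backup_statestore
```

Printing first gives a chance to confirm the source and destination before anything in the state store is touched.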

For ZKRMStateStore, remove the application znodes one by one in the ZooKeeper CLI, as below:

rmr /rmstore/ZKRMStateRoot/RMAppRoot/application_#############_####
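When there are many application znodes, the rmr lines can be generated from a list of application IDs (for instance, the children listed by `ls /rmstore/ZKRMStateRoot/RMAppRoot` in the ZooKeeper CLI). This is a sketch under assumptions: emit_rmr_commands is a hypothetical helper, and the two application IDs are made up for illustration.

```shell
# Hypothetical helper: read application IDs on stdin (one per line) and
# print one ZooKeeper CLI `rmr` command per ID. Lines that do not start
# with "application_" are skipped.
emit_rmr_commands() {
    root=${1%/}
    while IFS= read -r app_id; do
        case "$app_id" in
            application_*) printf 'rmr %s/%s\n' "$root" "$app_id" ;;
        esac
    done
}

# The two IDs below are invented examples, not taken from a real cluster.
printf '%s\n' application_1441300000000_0001 application_1441300000000_0002 |
    emit_rmr_commands /rmstore/ZKRMStateRoot/RMAppRoot
```

The printed lines can be reviewed and then pasted into the ZooKeeper CLI session.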

4. Restart ResourceManager.
