RM fails to start during recovery [ ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Failed to load state ]

Question asked by satz on May 30, 2017
Latest reply on May 30, 2017 by satz

Error Message:

2017-05-30 17:37:52,484 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning to active state
2017-05-30 17:37:52,502 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Recovery started
2017-05-30 17:37:52,515 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Loaded RM state version info 1.2
2017-05-30 17:37:53,082 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Failed to load state.
com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.
        at com.google.protobuf.InvalidProtocolBufferException.invalidEndTag(InvalidProtocolBufferException.java:94)
        at com.google.protobuf.CodedInputStream.checkLastTagWas(CodedInputStream.java:124)
        at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:143)
        at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:176)
        at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:188)
        at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:193)
        at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
        at org.apache.hadoop.yarn.proto.YarnServerResourceManagerRecoveryProtos$ApplicationStateDataProto.parseFrom(YarnServerResourceManagerRecoveryProtos.java:956)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.loadRMAppState(FileSystemRMStateStore.java:250)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.loadState(FileSystemRMStateStore.java:201)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:587)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1007)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1048)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1044)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1044)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1084)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1221)
2017-05-30 17:37:53,084 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.

Reason

You will generally see this message when the RM state store is corrupted (or contains corrupted application details). When the RM fails over or is restarted, it tries to recover the application states from the state store directory (maprfs:////var/mapr/cluster/yarn/rm/system/FSRMStateRoot/RMAppRoot).

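To narrow down which file is bad, it can help to copy RMAppRoot to local disk (for example with hadoop fs -copyToLocal) and list what the RM would read during recovery. The Java sketch below is a hypothetical helper, not part of Hadoop or MapR. It assumes the FileSystemRMStateStore layout of one application_* directory per application, containing an application_* file (the application record) and appattempt_* files (the attempts); treat that layout as an assumption and check it against your own copy.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical helper: lists the state files in a LOCAL copy of RMAppRoot.
 * ASSUMPTION: one application_* directory per application, holding an
 * application_* file (application record) and appattempt_* files (attempts).
 */
class ListRmAppState {

    static List<Path> listAppStateFiles(Path rmAppRoot) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> apps = Files.newDirectoryStream(rmAppRoot, "application_*")) {
            for (Path appDir : apps) {
                if (!Files.isDirectory(appDir)) {
                    continue; // stray files at the top level are not application directories
                }
                try (DirectoryStream<Path> entries = Files.newDirectoryStream(appDir)) {
                    for (Path entry : entries) {
                        String name = entry.getFileName().toString();
                        boolean stateFile = name.startsWith("application_") || name.startsWith("appattempt_");
                        if (stateFile && Files.isRegularFile(entry)) {
                            result.add(entry);
                        }
                    }
                }
            }
        }
        Collections.sort(result); // stable, readable output order
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : "RMAppRoot");
        for (Path file : listAppStateFiles(root)) {
            // size in bytes, then the path relative to RMAppRoot
            System.out.println(Files.size(file) + "\t" + root.relativize(file));
        }
    }
}
```

For example, java ListRmAppState /tmp/RMAppRoot prints the size and relative path of each state file, which gives you the list of candidates to inspect.
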
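If an application's stored details are corrupted, the RM cannot restore its state and fails to come up. In the trace above, the failure occurs in FileSystemRMStateStore.loadRMAppState while parsing an ApplicationStateDataProto (a stored application record), after which the RM logs "Failed to load/recover state".
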
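The exception text comes from protobuf's wire format: every field starts with a key that encodes a field number and a wire type, and wire type 4 (end group) must close a matching wire type 3 (start group). One way corruption produces this exact error is a damaged byte that decodes as an end-group key with no matching start group; protobuf-java then stops and reports "Protocol message end-group tag did not match expected tag." The Java sketch below is a hypothetical, schema-less framing check. It does not use the generated YarnServerResourceManagerRecoveryProtos classes, so it only inspects top-level keys, varints, lengths and group pairing; a file that passes can still fail to parse as ApplicationStateDataProto.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical, schema-less framing check for protobuf wire format.
 * It walks top-level keys only; length-delimited payloads (strings, nested
 * messages) are skipped, not decoded, because no .proto schema is used.
 */
class WireFormatCheck {

    /** Returns null if the framing looks valid, otherwise a description of the first problem found. */
    static String check(byte[] buf) {
        Deque<Long> openGroups = new ArrayDeque<>(); // field numbers of groups not yet closed
        int pos = 0;
        while (pos < buf.length) {
            int keyOffset = pos;
            long[] key = readVarint(buf, pos);
            if (key == null) return "truncated or malformed key at offset " + keyOffset;
            pos = (int) key[1];
            long fieldNumber = key[0] >>> 3;   // upper bits of the key
            int wireType = (int) (key[0] & 7); // lowest three bits of the key
            if (fieldNumber == 0) return "invalid field number 0 at offset " + keyOffset;
            switch (wireType) {
                case 0: { // varint
                    long[] v = readVarint(buf, pos);
                    if (v == null) return "truncated varint at offset " + pos;
                    pos = (int) v[1];
                    break;
                }
                case 1: pos += 8; break; // fixed 64-bit
                case 2: { // length-delimited: varint length, then that many bytes
                    long[] len = readVarint(buf, pos);
                    if (len == null) return "truncated length at offset " + pos;
                    if (len[0] < 0 || len[0] > buf.length - len[1]) return "length exceeds buffer at offset " + pos;
                    pos = (int) (len[1] + len[0]);
                    break;
                }
                case 3: openGroups.push(fieldNumber); break; // start group
                case 4: // end group: must close the most recently opened group with the same field number
                    if (openGroups.isEmpty() || openGroups.pop() != fieldNumber)
                        return "end-group tag did not match expected tag at offset " + keyOffset;
                    break;
                case 5: pos += 4; break; // fixed 32-bit
                default: return "invalid wire type " + wireType + " at offset " + keyOffset;
            }
            if (pos > buf.length) return "truncated fixed-width field at offset " + keyOffset;
        }
        return openGroups.isEmpty() ? null : "unterminated group at end of input";
    }

    /** Decodes a base-128 varint at pos; returns {value, nextPos}, or null if truncated or longer than 10 bytes. */
    private static long[] readVarint(byte[] buf, int pos) {
        long value = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            if (pos >= buf.length) return null;
            byte b = buf[pos++];
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return new long[] {value, pos};
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            String problem = check(Files.readAllBytes(Paths.get(arg)));
            System.out.println(arg + ": " + (problem == null ? "framing OK" : problem));
        }
    }
}
```

Since the stack trace shows the failure in ApplicationStateDataProto.parseFrom, the application_* record files are the ones to check first, for example java WireFormatCheck /tmp/RMAppRoot/application_*/application_*.
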
This is related to Apache JIRA YARN-5924: "Resource Manager fails to load state with InvalidProtocolBufferException".
