
Spark Streaming - ZooKeeper Session Timeout?

Question asked by john.humphreys on Jan 30, 2018
Latest reply on Feb 20, 2018 by Harikrishnan Cheneperth Kunhumveettil

My long-running Spark Streaming job periodically dies, apparently from a ZooKeeper session expiration.

 

The job also sometimes dies after a week due to a YARN ticket expiration (I know how to fix that one). Since the job sometimes survives long enough to hit that, I think the ZK timeout may be longer than a week and may not be directly tied to Spark.

 

Is there a configuration setting I can use to prevent this, or is there a function I can call to recover when it happens (for example, in response to a listener callback)?

 

I think the critical line in the log below is probably "Session expired for /services/resourcemanager/master".

 

[2018-01-30 00:23:14,134] WARN ZK Reset due to SessionExpiration for ZK: totlxp00001.nomura.com:5181,totlxp00002.nomura.com:5181,totlxp00003.nomura.com:5181 (com.mapr.util.zookeeper.ZKDataRetrieval)
[2018-01-30 00:23:22,564] ERROR ZK Session expired. Need to reset ZK completely for node: /services/resourcemanager/master (com.mapr.baseutils.zookeeper.ZKUtils)
[2018-01-30 00:23:22,564] ERROR Most likely SessionExpirationException. Need to reset ZK and call myself again (com.mapr.util.zookeeper.ZKDataRetrieval)
com.mapr.baseutils.zookeeper.ZKClosedException: ZK client was closed

...

Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /services/resourcemanager/master
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
    at com.mapr.baseutils.zookeeper.ZKUtils.getData(ZKUtils.java:46)
    ... 112 more
[2018-01-30 00:23:22,574] ERROR Unable to determine ResourceManager service address from Zookeeper at totlxp00001.nomura.com:5181,totlxp00002.nomura.com:5181,totlxp00003.nomura.com:5181 (org.apache.hadoop.yarn.client.MapRZKRMFinderUtils)
[2018-01-30 00:23:22,576] ERROR Failed to properly truncate all lineage (and checkpoint). (com.nomura.us.sysm.maas.digestion.AggregateWorkflow$)
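
To make the "fix it when it happens" idea concrete, here is a minimal sketch of one possible approach: retry the failing step after a short delay whenever a session-expired exception appears anywhere in the cause chain. Everything in it is hypothetical. The class RetryOnExpiry, its method names, and the matching on the class-name suffix "SessionExpiredException" are illustrative only, and it assumes (unverified) that the MapR client really does reset its ZK connection before the next call, as the "Need to reset ZK and call myself again" message suggests.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Spark, MapR, or ZooKeeper.
final class RetryOnExpiry {

    // Returns true if any exception in the cause chain has a class name
    // ending with the given suffix (matches nested classes such as
    // org.apache.zookeeper.KeeperException$SessionExpiredException).
    static boolean hasCause(Throwable t, String classNameSuffix) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c.getClass().getName().endsWith(classNameSuffix)) {
                return true;
            }
        }
        return false;
    }

    // Runs the action, retrying up to maxAttempts times when the failure
    // looks like a ZooKeeper session expiration. Any other exception is
    // rethrown immediately.
    static <T> T call(Callable<T> action, int maxAttempts, long delayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                if (!hasCause(e, "SessionExpiredException")) {
                    throw e; // unrelated failure: do not mask it
                }
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis); // give the client time to reconnect
                }
            }
        }
        throw last; // all attempts failed with session expiration
    }
}
```

The step that currently fails (in my case the lineage/checkpoint truncation) would be wrapped as the action passed to RetryOnExpiry.call. Whether a retry is enough, or the client has to be rebuilt first, is exactly the part I am unsure about.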
