My long-running Spark Streaming job periodically dies, apparently from a ZooKeeper session expiration.
The job also sometimes dies after about a week due to a YARN ticket expiration (I know how to fix that one). Since the ZK failure only happens occasionally, I suspect the ZK session timeout may be longer than a week and may not be directly tied to Spark.
Is there a configuration setting I can use to prevent this, or is there a function I can call to recover when it happens (e.g., by responding to a listener)?
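On the listener route: assuming the job ultimately talks to ZooKeeper through the plain Java client (the MapR classes in the trace below appear to wrap it), an expired session cannot be revived; the standard recovery is to watch for `KeeperState.Expired` and build a fresh client. A minimal sketch, with a class name and callback of my own invention:

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

// Hypothetical sketch: detect ZooKeeper session expiration via a Watcher.
// A session the server has expired cannot be resurrected; the only recovery
// is to construct a brand-new ZooKeeper client instance.
public class SessionExpirationWatcher implements Watcher {
    private final Runnable onExpired;  // wire your reconnect logic here

    public SessionExpirationWatcher(Runnable onExpired) {
        this.onExpired = onExpired;
    }

    @Override
    public void process(WatchedEvent event) {
        // Disconnected is transient and the client retries on its own;
        // Expired is terminal for this session, so hand off to the callback.
        if (event.getState() == Event.KeeperState.Expired) {
            onExpired.run();
        }
    }

    public static void main(String[] args) {
        // Simulate the event the client library delivers on expiration,
        // so the dispatch logic can be exercised without a live ZK server.
        boolean[] fired = {false};
        SessionExpirationWatcher w =
            new SessionExpirationWatcher(() -> fired[0] = true);
        w.process(new WatchedEvent(Event.EventType.None,
                                   Event.KeeperState.Expired, null));
        System.out.println("expired handled: " + fired[0]);
    }
}
```

On expiration the callback would typically rebuild the handle, e.g. `new ZooKeeper(connectString, sessionTimeoutMs, new SessionExpirationWatcher(...))`. As for configuration: the client's requested session timeout is clamped server-side by `minSessionTimeout`/`maxSessionTimeout` in `zoo.cfg`, so there is a knob, but only within the range the cluster's ZK servers allow.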
Note that I think the critical line here is probably `Session expired for /services/resourcemanager/master`:
```
[2018-01-30 00:23:14,134] WARN ZK Reset due to SessionExpiration for ZK: totlxp00001.nomura.com:5181,totlxp00002.nomura.com:5181,totlxp00003.nomura.com:5181 (com.mapr.util.zookeeper.ZKDataRetrieval)
[2018-01-30 00:23:22,564] ERROR ZK Session expired. Need to reset ZK completely for node: /services/resourcemanager/master (com.mapr.baseutils.zookeeper.ZKUtils)
[2018-01-30 00:23:22,564] ERROR Most likely SessionExpirationException. Need to reset ZK and call myself again (com.mapr.util.zookeeper.ZKDataRetrieval)
com.mapr.baseutils.zookeeper.ZKClosedException: ZK client was closed
Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /services/resourcemanager/master
... 112 more
[2018-01-30 00:23:22,574] ERROR Unable to determine ResourceManager service address from Zookeeper at totlxp00001.nomura.com:5181,totlxp00002.nomura.com:5181,totlxp00003.nomura.com:5181 (org.apache.hadoop.yarn.client.MapRZKRMFinderUtils)
[2018-01-30 00:23:22,576] ERROR Failed to properly truncate all lineage (and checkpoint). (com.nomura.us.sysm.maas.digestion.AggregateWorkflow$)
```