Occasionally (fairly rarely) our long-lived streaming job gets a memory error, despite basically doing the same thing for days and days on end. The problem is that, for some reason, it can take like 10 hours before spark notices it's in a bad state and does a re-attempt.
While we clearly want to fix this (memory issue), we also want a way to get to production now. Given how rare the error is, I've added watch-dog thread that notices when nothing has happened for an extended time period, and it kills the job.
After extensive reading online (and personal experience trying), it seems like Spark + YARN don't play well together in terms of return codes. I can exit with failure codes and yarn says SUCCEEDED and does not reattempt. I can let exceptions propagate and it may or may not say that it failed (it literally toggles). People online seem to have the same issues.
The only thing which seems to consistently make YARN report a failure is literally tracking the main thread and calling mainThread.stop() - which is awful; that method is literally deprecated because it is a bad idea.
Does anyone know a consistent way to exit a spark/YARN application and trigger an application re-attempt on purpose?