Node maintenance and running jobs

Question asked by Engel on Jan 29, 2018
Latest reply on Feb 22, 2018 by Engel

We are running a ten-node cluster on MapR 5.2.2 and regularly perform maintenance on one of the nodes. This maintenance sometimes requires rebooting the node.

The issue we are experiencing is that the /mapr mountpoint hangs if Warden is stopped while jobs are still running on that node. A normal reboot does not work at that point, and a forced reboot or a reset through the console is necessary to proceed.

The only way we have found so far to avoid this is to kill the jobs running on that node, but we'd rather not do that. An alternative is to stop the nodemanager and wait for all jobs to finish, but some of our jobs can take hours, so we'd rather not do that either.
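
For reference, the "wait for the node to finish" alternative could be scripted roughly as below. This is only a sketch under assumptions, not something we have in production: it assumes the ResourceManager exposes the standard YARN REST endpoint /ws/v1/cluster/nodes/{nodeid} with a numContainers field, and the ResourceManager URL, the node id and the function names are made up for illustration. It only waits for containers to drain; it does not relocate anything.

```python
import json
import time
import urllib.request


def running_containers(node_info):
    """Return the container count from a YARN 'node' REST payload."""
    return int(node_info["node"]["numContainers"])


def fetch_node_info(rm_url, node_id):
    """Query the ResourceManager for one node (assumed standard YARN REST API)."""
    url = f"{rm_url}/ws/v1/cluster/nodes/{node_id}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def wait_until_drained(fetch, poll_seconds=30, timeout_seconds=4 * 3600,
                       sleep=time.sleep):
    """Poll until the node reports zero containers.

    `fetch` is any zero-argument callable returning the node payload.
    Returns True once drained, False if the timeout passes first.
    """
    waited = 0
    while waited <= timeout_seconds:
        if running_containers(fetch()) == 0:
            return True
        sleep(poll_seconds)
        waited += poll_seconds
    return False


# Hypothetical usage (host names and ports are placeholders):
# drained = wait_until_drained(
#     lambda: fetch_node_info("http://rm-host:8088", "node07.example.com:8099"))
# if drained:
#     print("safe to stop Warden and reboot")
```

The timeout is there because, as said, some of our jobs run for hours, so in practice this would often just give up.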

Is there a way to relocate running jobs to another node, so that a node can be taken down without killing jobs or waiting a long time for them to finish?

(Reading the documentation, it looks like this also applies to MapR 6: some more maintenance (failover) options have been added, but they still do not appear to relocate jobs.)
