Procedure to replace failed disks when the server needs to be shut down

Question asked by edmond on Feb 16, 2016
Latest reply on Feb 16, 2016 by keysbotzum
I have a failed disk on my NFS gateway node (which also runs TaskTracker, Resource Manager, Node Manager, Webserver, etc). It is one of 4 nodes in the cluster. The server chassis has internal disk bays that can't be accessed while the server is powered on, so I will need to power off this machine to replace the disk.

The official disk failure documents seem to assume servers with hot-swap bays, where disks can be replaced without taking any other services offline.

What is the *safest* procedure to follow in this scenario on a live cluster? Should the cluster be shut down entirely? Maintaining data integrity is more important than uptime in this instance (I can shut down our application temporarily if needed). We currently have a replication factor of 2 on our main storage volumes (we are working on adding more hardware to support a replication factor of 3). We are currently running on the Community license.

This node was previously the CLDB server as well, which has since been relocated to a different node as a result of this issue: