
NFS gateway timeouts whenever a disk fails in the cluster

Question asked by edmond on May 11, 2018
Latest reply on May 21, 2018 by jbubier

We are experiencing an ongoing issue where client servers get timeouts to the NFS gateway every time a disk fails somewhere in the cluster. We have a MapR 5.0 cluster (community license), currently with 11 storage nodes, and are only using the fileserver features. The storage servers run either Ubuntu 14.04 or 16.04 LTS.

 

Whenever a hard disk fails on any server, which typically happens once every few months, MapR takes that storage pool offline and begins re-replicating the data as expected. However, our client/app servers then begin timing out. This effectively halts our operations, sometimes for several hours.

 

Timeout messages appear in dmesg on the clients:

 

[Mon Apr 30 01:06:51 2018] nfs: server pf-us1-storage3 not responding, still trying
[Mon Apr 30 01:06:51 2018] nfs: server pf-us1-storage3 not responding, still trying

 

Eventually, once the cluster has fully recovered and re-replication is complete, the timeouts stop and the clients log:

[Mon Apr 30 01:06:51 2018] nfs: server pf-us1-storage3 OK
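
To put numbers on how long each stall lasts, the client-side messages can be paired up per server. The following is only a minimal sketch, assuming the lines come from `dmesg -T` in exactly the format shown above; the names `outage_windows` and `parse_ts` are made up for illustration and are not part of any MapR or kernel tooling.

```python
import re
from datetime import datetime

# Matches kernel NFS client messages in the format quoted in this post.
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+nfs: server (?P<server>\S+) "
    r"(?P<state>not responding, still trying|OK)\s*$"
)


def parse_ts(text):
    # Timestamps look like "Mon Apr 30 01:06:51 2018" (assumes an English locale).
    return datetime.strptime(text, "%a %b %d %H:%M:%S %Y")


def outage_windows(lines):
    """Return (server, start, end) for each 'not responding' -> 'OK' pair."""
    started = {}   # server name -> time of the first timeout in the current outage
    windows = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # ignore unrelated kernel messages
        server, ts = m["server"], parse_ts(m["ts"])
        if m["state"] == "OK":
            start = started.pop(server, None)
            if start is not None:
                windows.append((server, start, ts))
        else:
            # Repeated timeouts belong to the same outage; keep the first one.
            started.setdefault(server, ts)
    return windows
```

Feeding it the saved output of `dmesg -T` from a client (for example `outage_windows(open("dmesg.txt"))`, where `dmesg.txt` is a hypothetical file name) gives one tuple per stall, which can then be lined up against the time of the disk failure and the end of re-replication.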

 

Unfortunately, our MapR logs have rotated out since the last time this happened, but as I recall, similar messages about client timeouts appear on the server side in nfsserver.log.

 

The clients are mounted with the options "rw,noatime,hard,nolock" as suggested in the MapR 5 docs.

 

Looking at our gateway server metrics tracked in Grafana, I don't see any significant spikes in system CPU or memory usage during a disk failure that would explain the performance problems.

 

Any thoughts on what could be causing this or where I should be looking? Since disk failures are inevitable, we'd like to get our cluster to a more stable state in which a single failed disk doesn't effectively take the entire system offline.

 

Thank you!
