
NFS Server Issues?

Question asked by mandoskippy on Mar 28, 2015
Latest reply on Mar 28, 2015 by mandoskippy
I have a cluster of 5 nodes running a test M7 setup. Each node runs an NFS server and mounts the cluster at the same location on the local node.

Thus, I get a setup that has all nodes with the same mount point to the cluster.

I have a process to update files across the cluster. It basically takes a tar file located in the cluster, copies it down locally, and gunzips/untars it. I ran it today and got this error on "some" of my nodes:

gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

This concerned me, so I ran the process again and it worked fine (no errors). I did nothing more than execute it again (i.e. no changes). Now, this cluster is not "active", so perhaps the network activity from the first cp made things work better for the second copy. Still, it concerns me that the copied file was corrupt based on what gzip said (remember, the tar/gunzip runs locally on a local fs). This means (in my simple mind) that the initial copy off the NFS-mounted cluster produced a corrupted (not whole) file with no errors, and it took gzip to identify the problem. You can see how this would concern me about using the MapR NFS mounts...
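To catch a short copy before tar/gzip ever sees it, the copy step could verify itself. Below is a minimal sketch in Python, assuming the update process can be scripted that way; the names `sha256_of` and `copy_verified` are hypothetical helpers, not anything MapR provides. It copies the file, compares size and SHA-256 against the source, and retries on a mismatch.

```python
import hashlib
import os
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_verified(src, dst, attempts=3):
    """Copy src to dst, then confirm size and checksum match.

    Returns the attempt number that succeeded; raises IOError if
    every attempt produced a mismatched copy.
    """
    for attempt in range(1, attempts + 1):
        shutil.copyfile(src, dst)
        same_size = os.path.getsize(src) == os.path.getsize(dst)
        if same_size and sha256_of(src) == sha256_of(dst):
            return attempt
    raise IOError(f"{dst} does not match {src} after {attempts} attempts")
```

One caveat: the check re-reads the source through the same NFS mount, so it may not catch every failure mode. Storing a checksum next to the tarball when it is created, and comparing against that, would probably be a stronger test.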

On one of the servers that had an error, I went to the nfsserver.log file and found the errors below, which occurred at the time of the file copy. Perhaps they point in the right direction. Either way, the copy seems to have failed "open", which could be really bad in a production environment.

nfsserver.log on the machine that tried to copy the tgz from maprfs via NFS:
2015-03-28 08:52:53,3569 ERROR nfsserver[11272] fs/nfsd/fileops.cc:350 192.168.0.98[0x93ff6151] Read retry 1 for  [nfsfh=0.2959309312.2049.50.5646582] fs=192.168.0.98:5660
2015-03-28 08:52:53,3570 ERROR nfsserver[11272] fs/nfsd/fileops.cc:350 192.168.0.98[0x92ff6151] Read retry 1 for  [nfsfh=0.2959309312.2049.50.5646582] fs=192.168.0.98:5660

I will try to capture the issues in future runs (including looking at the files that were copied down).
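For looking at the copied-down files afterwards, something like the following could flag a truncated archive without extracting it. This is only a sketch: `gzip_is_complete` is a hypothetical helper name, and it does roughly what `gzip -t` does by decompressing the whole stream and discarding the output.

```python
import gzip
import os
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the gzip stream at path decompresses to the end."""
    # An empty file is not a valid gzip archive, so treat it as incomplete.
    if os.path.getsize(path) == 0:
        return False
    try:
        with gzip.open(path, "rb") as stream:
            # Read and discard the data; a truncated file raises EOFError.
            while stream.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True
```

Running this on each node right after the copy, and keeping any file that fails alongside its size and timestamp, should make it easier to line a bad copy up with the "Read retry" entries in nfsserver.log.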
