
Input/output errors over NFS

Question asked by bittermelon on Dec 12, 2012
Latest reply on Dec 16, 2012 by bittermelon
We are running a small M3 cluster (2.1.0) for testing, right now with only four nodes and about 70 TB of disk space. The nodes are fairly new (two are brand new), each with one or two Xeon CPUs and 24 to 64 GB of RAM.

We are importing data using NFS, mounted locally on the NFS server node. This node also runs the CLDB master. ZK runs on two other nodes. The number of files is quite high: two volumes (of eight) have already triggered warnings about passing 20M files each, and they are only about halfway copied. I guess we will have to subdivide them.

We are getting a lot of strange input/output errors when accessing the cluster, both when reading and writing. They are intermittent and seemingly random. The file copying is the only thing loading the cluster. For instance:

Running rsync over SSH into the NFS mount fails after a while (the error messages vary; sometimes it is "close failed"):
<pre>
rsync: write failed on "/mnt/mapr/my.cluster.com/srv/00070000/001/588/3c": Input/output error (5)
rsync error: error in file IO (code 11) at receiver.c(322) [receiver=3.0.9]
rsync: connection unexpectedly closed (315 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(605) [generator=3.0.9]
</pre>

Even removing files breaks (the files are sequentially named, so only some fail):
<pre>
root@mapr005:/mnt/mapr/my.cluster.com/srv/00030000# rm -r 022.tmp
rm: cannot remove '022.tmp/252/b4': Input/output error
rm: cannot remove '022.tmp/252/85': Input/output error
rm: cannot remove '022.tmp/252/1b': Input/output error
rm: cannot remove '022.tmp/252/ee': Input/output error
rm: cannot remove '022.tmp/252/b7': Input/output error
rm: cannot remove '022.tmp/252/a0': Input/output error
rm: cannot remove '022.tmp/252/a8': Input/output error
rm: cannot remove '022.tmp/252/28': Input/output error
rm: cannot remove '022.tmp/252/42': Input/output error
rm: cannot remove '022.tmp/252/33': Input/output error
rm: cannot remove '022.tmp/252/7d': Input/output error
...
</pre>

Sometimes even a simple "ls" returns an I/O error. Wait a minute and try again, and it works. This is highly annoying and not really a good sign for putting this into production. I haven't seen anything fishy in the logs, but I don't really know where to look either. The only thing I have noticed is the warning "High FileServer Memory" on three of the nodes. The only advice I found about that in the docs was "restart Warden", which didn't help at all.
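To pin down when the errors happen, the "wait a minute and try again" workaround could be scripted so that each failure gets a timestamp that can later be matched against the cluster logs. This is only a rough sketch: `retry_on_eio` and its parameters are names made up for illustration, not anything MapR provides, and it assumes the failures surface as errno 5 (EIO), as in the rsync output above.

```python
import errno
import os
import time


def retry_on_eio(operation, attempts=5, delay_seconds=60.0, sleep=time.sleep):
    """Call operation(); on EIO, record the time, wait, and try again.

    Hypothetical helper for illustration. Returns (result, failures), where
    failures is a list of UNIX timestamps at which an Input/output error was
    seen. Any other error is re-raised immediately, and the EIO itself is
    re-raised if the last attempt also fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    failures = []
    for attempt in range(attempts):
        try:
            return operation(), failures
        except OSError as exc:
            if exc.errno != errno.EIO:
                raise  # not the intermittent error we are chasing
            failures.append(time.time())
            if attempt == attempts - 1:
                raise  # still failing after the final attempt
            sleep(delay_seconds)


def list_with_retry(path):
    """List a directory on the NFS mount, retrying on Input/output errors."""
    return retry_on_eio(lambda: os.listdir(path))
```

Calling `list_with_retry("/mnt/mapr/my.cluster.com/srv/00030000")` would return the directory entries plus the times of any EIO failures seen along the way. It does not fix anything, but the timestamps give something concrete to search for in the logs.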

Any ideas what might be going wrong here? I'll be happy to post logs if anyone can tell me which logs would be interesting.
