
Low Replication Alarm Causing Put Failures?

Question asked by snelson on Aug 4, 2014
Latest reply on Sep 22, 2014 by snelson
We ran out of disk space on our cluster. We've added more disks, and the cluster is rebalancing. We've also deleted some snapshots that were taking up a lot of space. The cluster is now back to healthy, except that we have low replication alarms on a couple of volumes. We're seeing errors when putting rows into an M7 table in one of those volumes. Is this caused by the low replication? The errors seem to occur only on certain regions. Will this resolve once replication is back to normal? Please advise. The stack trace of the exception is:

    java.lang.Exception: Error in inserting row 28
        at com.mapr.fs.jni.MapRPut.done( ~[na:na]

**Edit:** I see a bunch of these logs in cldb.log even though we have space on all disks on <host>:

    Resync for cid #### on server <host> failed due to lack of disk space

**Edit2:** Volumes have finished replicating, but we're still seeing Put failures periodically

**Edit3:** The frequency of Put failures has diminished drastically since yesterday. Still monitoring, but it appears the cluster has healed over time after adding the storage.

**Edit4:** We still receive Put failures occasionally. It appears that whenever I get a Put failure, MapR starts repairing that region, because if I Put the same row over and over, the failures eventually go away. Is there a way to repair all regions up front instead of letting Puts fail?
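In the meantime, since re-issuing the same Put eventually succeeds once the region repairs, we've been wrapping our puts in a simple retry loop. A minimal sketch of that workaround (this is our own helper, not a MapR API; `withRetries` and the attempt/backoff values are ours):

```java
import java.util.concurrent.Callable;

// Hypothetical retry helper (our own code, not a MapR API): re-issues an
// operation that can fail transiently, e.g. a table put, pausing between
// attempts to give the region time to repair.
public class RetryingPut {
    public static <T> T withRetries(Callable<T> op, int maxAttempts, long backoffMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;                // e.g. "Error in inserting row 28"
                Thread.sleep(backoffMs); // wait before re-issuing the Put
            }
        }
        throw last; // exhausted retries: surface the last failure
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a flaky put: fails twice, then succeeds.
        final int[] calls = {0};
        String result = withRetries(() -> {
            if (++calls[0] < 3) throw new Exception("Error in inserting row 28");
            return "ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints: ok after 3 attempts
    }
}
```

This only papers over the symptom, of course; it doesn't repair regions proactively, which is what we're asking about.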

**Edit5:** These exceptions have a nasty habit of bringing down our JVM. We had another disk-overflow incident last weekend, and we're seeing these exceptions again. Would somebody from MapR please look at the source code and/or file a bug?