
How Safe is GFSCK?

Question asked by dannyman on Jun 13, 2014
Latest reply on Jun 17, 2014 by junjun_olympia
We had an incident where several hosts lost power.  When they were brought online, several disks didn't come up fast enough.  I "added" the disks back in via the WebUI.  (NOTE: don't re-add disks via WebUI ... un-comment them from `/opt/mapr/conf/disktab` instead ...)

Now we have occasional errors at specific points across the filesystem.  So I started to follow the procedure in the gfsck manual: iterate through each node and each container, take the container offline, run local hadoop fsck ... Local fsck ran okay on all the nodes/SPs that were impacted.  No errors reported.

When I went to run fsck (in report errors mode) on our CLDB node the devs reported problems, so I aborted local fsck on an SP.  The SP was then left in an inconsistent state and could not be re-added.  The fix ended up being to un-comment the entry in `/opt/mapr/conf/disktab` (!?huh?!) and restart warden, and the SP was back online.  Yay!

At this point, my thinking is that in adding new containers back through the web UI on downed nodes, I killed some data, and that the errors we hit on various directories across the filesystem are due to pointers to missing data.  I'd like to "fix" that, and in our case regenerating lost data is no big problem.  I'd like to run gfsck.

The questions are:

 1. Is gfsck safe?  Ever since aborting local fsck left an SP in a state where we could not add it, we are wary ...
 2. Does gfsck require the filesystem to go offline?
 3. How long does gfsck take?  Let's say it takes us an hour to check one SP, and we have 20 nodes with 3 SPs each ... would gfsck need 60 hours to run?
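
To make the arithmetic behind question 3 concrete, here is a rough sketch in Python. It only multiplies out the numbers from the question (20 nodes, 3 SPs each, one hour per SP). Whether gfsck actually works through SPs one at a time is exactly what is being asked, so the `parallel_nodes` parameter is a hypothetical knob for comparing the two extremes, not a real gfsck option.

```python
def estimate_check_hours(nodes, sps_per_node, hours_per_sp, parallel_nodes=1):
    """Back-of-the-envelope wall-clock estimate for checking every SP.

    parallel_nodes is a hypothetical parameter: how many nodes are checked
    at the same time. It does not correspond to any known gfsck flag.
    """
    if parallel_nodes < 1:
        raise ValueError("parallel_nodes must be at least 1")
    # Number of rounds needed to cover all nodes (ceiling division).
    batches = -(-nodes // parallel_nodes)
    # Within a node, SPs are assumed to be checked one after another.
    return batches * sps_per_node * hours_per_sp


# Worst case from the question: everything strictly serial.
print(estimate_check_hours(20, 3, 1.0))                     # 60.0
# Best case under the (unverified) assumption that all nodes run at once.
print(estimate_check_hours(20, 3, 1.0, parallel_nodes=20))  # 3.0
```

So the 60-hour figure is the fully serial upper bound. If the per-node work can overlap, the estimate drops toward 3 hours.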

`maprcli dump volumeinfo` reports 4747 containers in the volume.  5 look like this:

                        "ContainerId":6614,
                        "Epoch":4,
                        "Master":"unknown ip (0)-0-VALID",
                        "ActiveServers":{
                                
                        },
                        "InactiveServers":{
                                "IP:Port":"xx.xx.xx.63:5660--2"
                        },
                        "UnusedServers":{
                                "IP:Port":[
                                        "xx.xx.xx.60:5660--4",
                                        "xx.xx.xx.59:5660--4",
                                        "xx.xx.xx.61:5660--4"
                                ]
                        },
                        "OwnedSizeMB":"15.03 GB",
                        "SharedSizeMB":"15.03 GB",
                        "LogicalSizeMB":"32.93 GB",
                        "TotalSizeMB":"15.03 GB",
                        "NumInodesInUse":6144,
                        "Mtime":"Tue May 20 19:16:19 PDT 2014",
                        "NameContainer":"false"
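
For picking the suspect containers out of a few thousand, a small filter along these lines might help. It is a sketch that assumes the container entries have already been parsed into a list of Python dicts with the same keys as the excerpt above (`ContainerId`, `Master`, `ActiveServers`). How those entries are nested inside the full `maprcli dump volumeinfo` output is not shown here, so extracting the list is left out, and the second sample entry is a made-up "healthy" container for contrast.

```python
def find_masterless_containers(containers):
    """Return the ContainerId of every entry that has no usable master.

    A container is flagged when its Master field starts with "unknown ip"
    or when its ActiveServers map is empty, as in the excerpt above.
    """
    suspects = []
    for c in containers:
        master = str(c.get("Master", ""))
        active = c.get("ActiveServers") or {}
        if master.startswith("unknown ip") or not active:
            suspects.append(c.get("ContainerId"))
    return suspects


sample = [
    {   # Trimmed copy of the entry from the post.
        "ContainerId": 6614,
        "Master": "unknown ip (0)-0-VALID",
        "ActiveServers": {},
        "InactiveServers": {"IP:Port": "xx.xx.xx.63:5660--2"},
    },
    {   # Hypothetical healthy entry; ID and field values are invented.
        "ContainerId": 9999,
        "Master": "xx.xx.xx.60:5660--5-VALID",
        "ActiveServers": {"IP:Port": ["xx.xx.xx.60:5660--5"]},
    },
]

print(find_masterless_containers(sample))  # [6614]
```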
