Standalone CLDB won't start after drive failure

Question asked by dougduncan on Nov 19, 2014
Latest reply on Nov 20, 2014 by dougduncan
Hi all,

I have a standalone MapR M3 node running on CentOS. The cluster is used for testing, so it doesn't have much data on it.

Last week we had a drive fail that was part of a RAID 0 array. Yes, I know that's a bad setup; it was put in place before I came on board.

Looking at `/opt/mapr/logs/faileddisk.log` I saw the following message:

    ############################ Disk Failure Report ###########################
    
    Disk                :    sdb
    Vendor              :    DELL
    Model Number        :    PERC 5/i
    Serial Number       :    600188b0397f450016c2af23244f8895
    Firmware Revision   :    1.00
    Size                :    1141899264
    Failure Reason      :    I/O error
    Time of Failure     :    Mon Nov 10 03:16:22 MST 2014
    Resolution          :
       Please refer to MapR's online documentation at http://www.mapr.com/doc on how to handle disk failures.
       In summary, run the following steps:
    
       a. If this appears to be a software failure, go to step b.
          Otherwise, physically remove the disk /dev/sdb.
          Optionally, replace it with a new disk.
    
       b. Run the command "maprcli disk remove -host 127.0.0.1 -disks /dev/sdb" to remove /dev/sdb from MapR-FS.
    
       c. In addition to /dev/sdb, the above command removes all the disks that belong to the same storage pool, from MapR-FS.
          Note down the names of all removed disks.
    
       d. Add all the above removed disks (exclude /dev/sdb) and the new disk to MapR-FS by running the command:
          "maprcli disk add -host 127.0.0.1 -disks <comma separated list of disks>"
          For example, If /dev/sdx is the new replaced disk, and /dev/sdy, /dev/sdz were removed in step c), the command would be:
                       "maprcli disk add -host 127.0.0.1 -disks /dev/sdx,/dev/sdy,/dev/sdz"
                       If there is no new disk, the command would just be:
                       "maprcli disk add -host 127.0.0.1 -disks /dev/sdy,/dev/sdz"

We replaced the drive and rebuilt the array as RAID 5 this morning, and I noticed that the log file changed the `Failure Reason` from `I/O error` to `Unknown Error`. I went ahead and followed the steps outlined above (the commands are sketched after the listing below). The `remove` failed with an error saying the drive wasn't found, but the `add` appeared to complete successfully. To verify this I ran a `list` command:

    # maprcli disk list -host 127.0.0.1
    mn        sn                                pst      fw    mt  n     dsu  dst     hn         vn    fs           dsa     st
    PERC_5/i  600188b0397f45000caec8be59575f24  running  1.00  1   sda1  77   500     127.0.0.1  DELL  ext4         423     0
    PERC_5/i  600188b0397f45000caec8be59575f24  running  1.00  0   sda2       68875   127.0.0.1  DELL  LVM2_member          0
    PERC_5/i  600188b0397f45001bfe929b3b731fe2  running  1.00  0   sdb   189  418176  127.0.0.1  DELL  MapR-FS      417987  0
                                                               1   dm-0       32836   127.0.0.1        ext4                 0
                                                               0   dm-1       18128   127.0.0.1        swap                 0
                                                               1   dm-2       17908   127.0.0.1        ext4                 0
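
For reference, this is roughly the sequence I ran, following the steps from the failure report above (the exact disk list for the `add` depends on what the `remove` actually drops from the storage pool; on this box only `/dev/sdb` was involved as far as I can tell):

    # step b from the report: remove the failed disk (this is where I got the "not found" error)
    maprcli disk remove -host 127.0.0.1 -disks /dev/sdb

    # step d from the report: re-add the replaced disk (plus anything else the remove dropped)
    maprcli disk add -host 127.0.0.1 -disks /dev/sdb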

I tried running a simple `hadoop fs -ls` command and got an error:

    2014-11-19 14:00:30,9784 ERROR Cidcache fs/client/fileclient/cc/cidcache.cc:1047 Thread: 139799603541760 Lookup of volume mapr.cluster.root failed, error Connection reset by peer(104), CLDB: 10.11.1.101:7222 trying another CLDB
    2014-11-19 14:00:30,9785 ERROR Client fs/client/fileclient/cc/client.cc:226 Thread: 139799603541760 Failed to initialize client for cluster my.cluster.com, error Connection reset by peer(104)
    ls: Could not create FileClient
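
Not sure how useful this is, but a quick way to confirm whether the CLDB is even listening on the port shown in that error (7222) is something like:

    # look for a listener on the CLDB port (7222, per the error above)
    netstat -tlnp | grep 7222

    # and check whether the CLDB process itself is still running
    ps -ef | grep [c]ldb

In my case the connection is reset by the peer, which lines up with the CLDB shutting itself down (see the cldb.log excerpt further down).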

If I look at the `cldbnodes` dump from ZooKeeper, I can see my node's IP address listed as both **valid** and **invalid**:

    # maprcli dump cldbnodes -zkconnect 10.11.1.101:5181 -json
    {
            "timestamp":1416428139951,
            "status":"OK",
            "total":2,
            "data":[
                    {
                            "valid":"10.11.1.101:5660-"
                    },
                    {
                            "invalid":"10.11.1.101:5660-"
                    }
            ]
    }

I see the following entries in my `/opt/mapr/logs/cldb.log` file:

    2014-11-19 12:04:34,185 INFO  com.mapr.fs.cldb.Containers [pool-1-thread-2]: Processing stale containers  on StoragePool 05d9e7ea570011db00546cd5910d8a9d from FileServer 10.11.1.101:5660-
    2014-11-19 12:04:34,186 INFO  com.mapr.fs.cldb.Containers [pool-1-thread-2]: FileServer 10.11.1.101:5660- reported stale star container 1 on StoragePool 05d9e7ea570011db00546cd5910d8a9d which cannot become master. Asking it to retry
    2014-11-19 12:04:34,489 INFO  com.mapr.fs.cldb.Containers [pool-1-thread-2]: Processing stale containers  on StoragePool 05d9e7ea570011db00546cd5910d8a9d from FileServer 10.11.1.101:5660-
    2014-11-19 12:04:34,490 INFO  com.mapr.fs.cldb.Containers [pool-1-thread-2]: FileServer 10.11.1.101:5660- reported stale star container 1 on StoragePool 05d9e7ea570011db00546cd5910d8a9d which cannot become master. Asking it to retry
    2014-11-19 12:04:34,637 FATAL com.mapr.fs.cldb.CLDB [WaitForLocalKvstore Thread]: CLDBShutdown: CLDB had master lock and was waiting for its local mfs to become Master.Waited for 7 (minutes) but mfs did not become Master. Shutting down CLDB to release master lock.
    2014-11-19 12:04:34,637 INFO  com.mapr.fs.cldb.CLDBServer [WaitForLocalKvstore Thread]: Shutdown: Stopping CLDB
    2014-11-19 12:04:34,638 INFO  com.mapr.fs.cldb.CLDB [Thread-9]: CLDB ShutDown Hook called
    2014-11-19 12:04:34,638 INFO  com.mapr.fs.cldb.zookeeper.ZooKeeperClient [Thread-9]: Zookeeper Client: Closing client connection:
    2014-11-19 12:04:34,646 INFO  com.mapr.fs.cldb.CLDB [Thread-9]: CLDB shutdown
    2014-11-19 12:04:34,646 INFO  com.mapr.fs.cldb.CLDBServer [main-EventThread]: ZooKeeper event NodeDeleted on path /datacenter/controlnodes/cldb/active/CLDBMaster

The first two lines (the "Processing stale containers" and "reported stale star container" messages) repeat numerous times before the FATAL shutdown.

My question to the experts here is: can I do anything to get the cluster working again without a complete rebuild? I'm not too worried about losing the data (it was on a RAID 0 array, after all, and is most likely already gone), since it's test data that should be easy to replace.

I've seen mention of adding the line `cldb.ignore.stale.zk=true` to `/opt/mapr/conf/cldb.conf` and restarting the CLDB, but that doesn't seem like the wisest starting point for resolving this issue.
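
For clarity, the change I've seen suggested would look roughly like this (I haven't tried it, and I'm treating it as a sketch of the suggestion rather than something I know is safe here):

    # append the suggested property to the CLDB config
    echo "cldb.ignore.stale.zk=true" >> /opt/mapr/conf/cldb.conf

    # then restart the MapR services on this node so the CLDB picks it up
    service mapr-warden stop
    service mapr-warden start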

Thanks in advance for any help you can provide.

Doug
