
Cluster down, disk failure(s)

Question asked by edmond on Feb 15, 2016
Latest reply on Feb 15, 2016 by edmond
Hello, I have an urgent issue. Our 4-node cluster went down this evening. I was alerted when the NFS share to our production server began timing out. Checking the logs on the primary/gateway server, it appears a storage disk (/dev/sdg) has failed on that node. Another storage disk on a different node had failed shortly before that, which has me concerned about our data: with a replication factor of 2, any container whose two copies happened to sit on those two disks would be left with no healthy replica.
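
Before touching anything, I plan to confirm the hardware failure from the OS side with smartctl (smartmontools). This is only a sketch of what I intend to run on the affected node, assuming the drive still responds; I have not run it yet:

    # planned check on the node reporting the /dev/sdg errors (not yet run)
    smartctl -H /dev/sdg    # overall SMART health assessment
    smartctl -a /dev/sdg    # full SMART attributes and the device error log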

The MapR web interface will not let me log in, and the /mapr/ directory is missing from the primary/gateway server. CLDB appears to be offline.
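
For what it's worth, these are the basic checks I intend to run on the primary/gateway node first; the service names come from the init scripts installed on our nodes, so please correct me if there is a better way:

    # is warden still running on the CLDB node?
    service mapr-warden status
    # is a CLDB java process alive at all?
    ps -ef | grep -i cldb | grep -v grep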

I put a call in to the MapR support line but was asked to come here, as we currently do not have a premium support contract. I didn't want to take any drastic action before speaking to support, lest I make things worse. Any help you can provide with steps for resolving the issue is much appreciated. I'm happy to forward additional logs, etc., where needed.

Log snippets below. It appears no /opt/mapr/logs/faileddisk.log file was generated. I have forwarded a more complete copy of our logs to support@mapr.com.

/opt/mapr/logs/mfs.err:

    2016-02-15 00:16:25,4526 Disk /dev/sdg GUID 4FD33C5C-056E-83B9-4695-0545A52A5600 hit IO Readv error Input/output error -5 at block 6190885 count 16

/opt/mapr/logs/cldb.log:

    2016-02-15 01:26:51,927 ERROR CLDB [WaitForLocalKvstore Thread]: sun.misc.Unsafe.park(Native Method)
    2016-02-15 01:26:51,927 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    2016-02-15 01:26:51,927 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
    2016-02-15 01:26:51,927 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    2016-02-15 01:26:51,927 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
    2016-02-15 01:26:51,927 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
    2016-02-15 01:26:51,927 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    2016-02-15 01:26:51,927 ERROR CLDB [WaitForLocalKvstore Thread]: java.lang.Thread.run(Thread.java:745)
    2016-02-15 01:26:51,927 INFO CLDBServer [WaitForLocalKvstore Thread]: Shutdown: Stopping CLDB
    2016-02-15 01:26:51,930 INFO CLDB [Thread-13]: CLDB ShutDown Hook called
    2016-02-15 01:26:51,932 INFO ZooKeeperClient [Thread-13]: Setting the clean cldbshutdown flag to true
    2016-02-15 01:26:51,938 INFO ZooKeeperClient [Thread-13]: Zookeeper Client: Closing client connection:
    2016-02-15 01:26:51,943 INFO CLDBServer [main-EventThread]: The CLDB received notification that a ZooKeeper event of type NodeDeleted occurred on path /datacenter/controlnodes/cldb/active/CLDBMaster
    2016-02-15 01:26:51,943 INFO ZooKeeper [Thread-13]: Session: 0x152e41ce87f0013 closed
    2016-02-15 01:26:51,943 INFO CLDB [Thread-13]: CLDB shutdown


dmesg:

    [9892772.359631] sd 6:0:4:0: [sdg] Unhandled sense code
    [9892772.359634] sd 6:0:4:0: [sdg] 
    [9892772.359637] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    [9892772.359639] sd 6:0:4:0: [sdg] 
    [9892772.359640] Sense Key : Medium Error [current]
    [9892772.359645] Info fld=0x5e77310
    [9892772.359647] sd 6:0:4:0: [sdg] 
    [9892772.359649] Add. Sense: Unrecovered read error
    [9892772.359652] sd 6:0:4:0: [sdg] CDB:
    [9892772.359654] Read(16): 88 00 00 00 00 00 05 e7 72 50 00 00 01 00 00 00
    [9892772.359665] end_request: critical medium error, dev sdg, sector 99054352
    [9892780.217948] mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
    [9892780.217977] sd 6:0:4:0: [sdg] Unhandled sense code
    [9892780.217980] sd 6:0:4:0: [sdg] 
    [9892780.217983] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    [9892780.217985] sd 6:0:4:0: [sdg] 
    [9892780.217986] Sense Key : Medium Error [current]
    [9892780.217991] Info fld=0x5e77310
    [9892780.217993] sd 6:0:4:0: [sdg] 
    [9892780.217995] Add. Sense: Unrecovered read error
    [9892780.217997] sd 6:0:4:0: [sdg] CDB:
    [9892780.217999] Read(16): 88 00 00 00 00 00 05 e7 72 50 00 00 01 00 00 00
    [9892780.218011] end_request: critical medium error, dev sdg, sector 99054352
    [9894638.984715] nfs: server localhost not responding, timed out

warden.log:
https://gist.github.com/edburnett/73050030b7247a11fdec

service list:

    $ maprcli service list
    ERROR (10009) -  Could not connect to CLDB and no Zookeeper connect string provided
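
Since the error mentions ZooKeeper, I assume the next thing to verify is whether the ZooKeeper quorum itself is healthy. My understanding is that something along these lines, run on each ZooKeeper node, would show that; I have not run it yet:

    # on each node running mapr-zookeeper (not yet run)
    service mapr-zookeeper status     # is the process up?
    service mapr-zookeeper qstatus    # quorum/leader status, if this subcommand is available on our version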


storage pool list:

    # /opt/mapr/server/mrconfig sp list
    ListSPs resp: status 0:8
    No. of SPs (8), totalsize 64056609 MB, totalfree 6502449 MB
    
    SP 0: name SP2, Online, size 9150944 MB, free 923481 MB, path /dev/sdh
    SP 1: name SP1, Offline, size 9538612 MB, free 0 MB, path /dev/sdf
    SP 2: name SP7, Online, size 9150944 MB, free 936176 MB, path /dev/sdk
    SP 3: name SP5, Online, size 9150944 MB, free 933838 MB, path /dev/sdo
    SP 4: name SP4, Online, size 9150944 MB, free 941109 MB, path /dev/sde
    SP 5: name SP8, Online, size 9150944 MB, free 923108 MB, path /dev/sdm
    SP 6: name SP3, Online, size 9150944 MB, free 924482 MB, path /dev/sdc
    SP 7: name SP6, Online, size 9150944 MB, free 920253 MB, path /dev/sdq

Thanks very much for your assistance.