
Too Many Disk Failures

Question asked by maprNewbie on Mar 8, 2017
Latest reply on Mar 28, 2017 by mufeed

Hello,

 

I've been using MapR for about a year now and have a user-level grasp of how things work. One thing that keeps happening is disk errors on random nodes. I've had four of them over the past ~8 months, two of them on a CLDB node, which seems to bring the cluster down.

 

First, the setup: 10 nodes on Ubuntu 16.04, running MapR version 5.something (I've forgotten the exact version). Is there a command that can tell me the MapR version? All I know is that it was the latest version six months ago and that it did not support 16.04 (oops). However, I am currently forced to use 16.04.

 

 

So I have two questions:

1) The first question is about recovery. The procedure I normally follow is to take the cluster offline, move the CLDB node, and then bring the cluster back online. I want to know whether I can shortcut this by taking the cluster offline, running disksetup on the failed CLDB node to bring that node's disk back up, and then bringing the cluster back online. I have three ZooKeepers, with the leader being the failed CLDB node. Would this work?

 

2) Is it possible to find out why the disk errors happen so frequently? I don't think the disks themselves are dying. The relevant log, I think, is mfs.log-3, which shows:

 

2017-03-07 23:47:19,8179 INFO Global instancemfs.cc:608 ShrinkSlabsAI 0x12c61c000
2017-03-07 23:47:19,8180 INFO Global instancemfs.cc:544 ShrinkInstanceSlabs 0x12c61c000
2017-03-07 23:47:19,8182 INFO Global instancemfs.cc:589 ShrinkInstanceSlabsDone 0x12c61c000
2017-03-07 23:47:19,8182 INFO Global instancemfs.cc:628 ShrinkSlabsAIDone 0x12c61c000
2017-03-08 00:14:19,2434 ERROR IOMgr iomgr.cc:706 IO failed /dev/sdb1 off 0x192 blocks 0x2 err -1
2017-03-08 00:14:19,2435 ERROR IOMgr lun.cc:877 Disk /dev/sdb1 hit IO error Operation not permitted -1, iotype 3
2017-03-08 00:14:19,2435 ERROR IOMgr lun.cc:879 Disk /dev/sdb1 hit IO Writev error -1
2017-03-08 00:14:19,2435 ERROR IOMgr iomgr.cc:2760 Missing disk for GUID 2DA203FB-600D-8EB6-C5BD-0E4D29895700
2017-03-08 00:14:19,2435 INFO IOMgr iomgr.cc:2774 Refresh disktab state: old state: 0 0, failed SPs: 0, failed disks: 1
2017-03-08 00:14:19,2435 ERROR IOMgr iomgr.cc:706 IO failed /dev/sdb1 off 0x3c0150 blocks 0x1 err -1
2017-03-08 00:14:19,2435 ERROR IOMgr lun.cc:877 Disk /dev/sdb1 hit IO error Operation not permitted -1, iotype 3
2017-03-08 00:14:19,2435 ERROR IOMgr lun.cc:879 Disk /dev/sdb1 hit IO Writev error -1
2017-03-08 00:14:19,2437 INFO IOMgr spinit.cc:1098 sp SP1:/dev/sdb1 offline with error err -1
2017-03-08 00:14:19,2437 ERROR IOMgr spinit.cc:1108 SP SP1:/dev/sdb1 offline with Unknown error -1. Error -1.
2017-03-08 00:14:19,2437 INFO IOMgr spinit.cc:1123 spname SP1:/dev/sdb1 sp 0x11eb829c0 offline wa 0x129fca018
2017-03-08 00:14:19,2438 INFO IOMgr spinit.cc:1137 SP SP1:/dev/sdb1 Containers Hidden
2017-03-08 00:14:19,2479 INFO IOMgr spinit.cc:1188 SP SP1:/dev/sdb1 Containers Refs Dropped
2017-03-08 00:14:19,2479 ERROR Replication containerinfo.cc:166 Update container vn for cid (1) failed to get logspace 19
2017-03-08 00:14:19,2479 ERROR Replication containerinfo.cc:375 Update container vn for cid (1) failed with error 19
2017-03-08 00:14:19,2495 ERROR Replication containerinfo.cc:166 Update container vn for cid (2242) failed to get logspace 19
2017-03-08 00:14:19,2495 ERROR Replication containerinfo.cc:375 Update container vn for cid (2242) failed with error 19
2017-03-08 00:14:19,2768 ERROR Replication containerinfo.cc:166 Update container vn for cid (2693) failed to get logspace 19
2017-03-08 00:14:19,2768 ERROR Replication containerinfo.cc:375 Update container vn for cid (2693) failed with error 19
2017-03-08 00:14:19,2787 ERROR Replication containerinfo.cc:166 Update container vn for cid (2789) failed to get logspace 19
2017-03-08 00:14:19,2787 ERROR Replication containerinfo.cc:375 Update container vn for cid (2789) failed with error 19
2017-03-08 00:14:19,2809 ERROR Replication containerinfo.cc:166 Update container vn for cid (2895) failed to get logspace 19
2017-03-08 00:14:19,2809 ERROR Replication containerinfo.cc:375 Update container vn for cid (2895) failed with error 19
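
A minimal sketch for summarizing an excerpt like the one above: it pulls out the first ERROR timestamp, counts ERROR lines per device, and collects the affected container IDs. The line layout (timestamp, level, module, source file:line, message) is inferred only from the lines pasted here, so the regular expressions and the `summarize` helper are assumptions, not a documented MapR log format.

```python
import re
from collections import Counter

# Layout inferred from the pasted excerpt (an assumption, not a documented format):
# "<date> <time>,<fraction> <LEVEL> <module> <file>:<line> <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(?P<frac>\d+) "
    r"(?P<level>[A-Z]+) (?P<module>\S+) (?P<src>\S+?):(?P<lineno>\d+) (?P<msg>.*)$"
)
DEVICE_RE = re.compile(r"/dev/\S+")
CID_RE = re.compile(r"cid \((\d+)\)")


def summarize(lines):
    """Return (first ERROR timestamp, ERROR count per device, sorted container IDs)."""
    errors_by_device = Counter()
    cids = set()
    first_error_ts = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("level") != "ERROR":
            continue  # skip INFO lines and anything that does not match the layout
        if first_error_ts is None:
            first_error_ts = m.group("ts")
        msg = m.group("msg")
        dev = DEVICE_RE.search(msg)
        if dev:
            errors_by_device[dev.group(0)] += 1
        cid = CID_RE.search(msg)
        if cid:
            cids.add(int(cid.group(1)))
    return first_error_ts, errors_by_device, sorted(cids)


# A few lines copied from the excerpt above.
SAMPLE = [
    "2017-03-07 23:47:19,8182 INFO Global instancemfs.cc:628 ShrinkSlabsAIDone 0x12c61c000",
    "2017-03-08 00:14:19,2434 ERROR IOMgr iomgr.cc:706 IO failed /dev/sdb1 off 0x192 blocks 0x2 err -1",
    "2017-03-08 00:14:19,2435 ERROR IOMgr lun.cc:877 Disk /dev/sdb1 hit IO error Operation not permitted -1, iotype 3",
    "2017-03-08 00:14:19,2435 ERROR IOMgr iomgr.cc:2760 Missing disk for GUID 2DA203FB-600D-8EB6-C5BD-0E4D29895700",
    "2017-03-08 00:14:19,2437 ERROR IOMgr spinit.cc:1108 SP SP1:/dev/sdb1 offline with Unknown error -1. Error -1.",
    "2017-03-08 00:14:19,2479 ERROR Replication containerinfo.cc:166 Update container vn for cid (1) failed to get logspace 19",
    "2017-03-08 00:14:19,2495 ERROR Replication containerinfo.cc:166 Update container vn for cid (2242) failed to get logspace 19",
]

first_ts, by_device, cids = summarize(SAMPLE)
print("first error:", first_ts)
print("errors per device:", dict(by_device))
print("affected container IDs:", cids)
```

Run against the full log files from each node that lost a disk (for example by passing `open(path)` to `summarize`), the first-error timestamps could be compared with the kernel log for the same moment to see whether the operating system reported anything for that device.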

 

The faileddisk.log file just shows "Unknown error 1", with nothing else in it.
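
If the numeric codes in these messages are ordinary Linux errno values (an assumption; the logs do not say so explicitly, although "Operation not permitted" is the standard text for errno 1), they can be decoded with Python's standard library. The `describe` helper below is hypothetical and only for illustration.

```python
import errno
import os


def describe(code):
    """Map a (possibly negated) errno value to its symbolic name and message."""
    code = abs(code)  # the log prints -1; errno values themselves are positive
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# -1 appears in the IOMgr lines, 19 in the Replication lines.
for value in (-1, 19):
    name, text = describe(value)
    print(value, name, text)
```

Under that assumption, 1 is EPERM and 19 is ENODEV. A -1 may also be just a generic failure return rather than a real errno, so this is a hint about where to look, not a diagnosis.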

 

 

Thanks!
