
CLDB Not Coming Up After Data Node Replica Disk Failure

Question asked by dmcnelis on Feb 27, 2017
Latest reply on Mar 2, 2017 by mufeed

Just had the following sequence happen on a test cluster running M3 v5.2 (Community Edition).

 

Within a short time frame, we had a drive failure on one of our MFS nodes that held a replica of the CLDB data.  Before we found out about that failure, the CLDB service died.  When we went to restart it, we started seeing a bunch of errors like:

Rejecting RPC 2345.211 from 10.93.0.201:5660 with status 3 as CLDB is waiting for local kvstore to become master.

At the time, we didn't realize we'd had the drive failure.  Running on EC2, we replaced the EBS volume (which was non-recoverable and had no snapshot) and restarted Warden on that node.  We still weren't able to get that node's MFS to load.  We also were not able to follow the disk remove/add directions in the documentation because the CLDB service was down.
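For reference, the documented remove/add procedure goes through maprcli, which needs a live CLDB, so it was a non-starter for us.  Assuming I have the syntax right, it would have been something like (host name below is a placeholder):

maprcli disk remove -host <failed-node> -disks /dev/xvdi
maprcli disk add -host <failed-node> -disks /dev/xvdi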

I tried to set up the new EBS volume by running disksetup -G; however, I received an error like this:

/dev/xvdi failed. Error 52, Invalid exchange. Failed to find the size of the disk

I was able to confirm that I could run mkfs.ext3 on that volume without any issues.  Thinking the GUID for the drive was the issue, I manually removed the original entry for it in the disktab file in /opt/mapr/conf and re-ran disksetup, but I am still receiving the same error.
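If it helps with diagnosis, these are the plain-Linux checks I'd expect to matter for a "failed to find the size of the disk" error (nothing MapR-specific here):

# can the kernel report a size for the device at all?
sudo blockdev --getsize64 /dev/xvdi
# is anything else (a stale mount, LVM, etc.) holding the device?
lsblk /dev/xvdi
sudo fuser -v /dev/xvdi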

 

For now I've manually removed the entry in disktab for that drive, and the MFS log on that machine now ends with these messages:

Resolving function 'maprhbase_RegisterFilters()'
tcmalloc: large alloc 4434534400 bytes == 0x6d26000 @  0xfcd113 0xfebcde 0x8f2e94
Loading /opt/mapr/server/permissions/libmapr_roles_refimpl.so
Resolving function 'getSecurityMembership()'
Resolving function 'cleanup()'
Scanning directory '/opt/mapr/server/filters'
Loading /opt/mapr/server/filters/libmaprhbase-filters.so
Resolving function 'maprhbase_RegisterFilters()'
tcmalloc: large alloc 4434534400 bytes == 0x846a000 @  0xfcd113 0xfebcde 0x8f2e94
Loading /opt/mapr/server/permissions/libmapr_roles_refimpl.so
Resolving function 'getSecurityMembership()'
Resolving function 'cleanup()'
Scanning directory '/opt/mapr/server/filters'
Loading /opt/mapr/server/filters/libmaprhbase-filters.so
Resolving function 'maprhbase_RegisterFilters()'
2017-02-27 16:07:37,3894 bind: error 98

(This is from when I try to manually start MFS using sudo service mapr-mfs restart.  Since Warden is unable to connect to CLDB, it never seems to get around to starting the MFS service, which CLDB seems to want to talk to before it will finish starting.)
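One clue: error 98 is EADDRINUSE ("address already in use"), which makes me suspect a half-dead mfs process still holding the MFS port (5660, the port in the RPC error above).  A quick way to check, in case that's the right track:

# anything still listening on the MFS port?
sudo ss -ltnp | grep 5660
# any leftover mfs processes from the earlier crash?
ps aux | grep '[m]fs'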

 

We really, really don't want to go through the steps from another ticket, which have you remove the ZooKeeper data and reformat your cluster drives.  But we're at a loss as to what we need to do to get the CLDB node back up and running.

 

I will note that no two machines share the same IP address (which was the answer to a previous question).  The cluster had wire-level security enabled; I've re-configured it with security disabled for now, because I initially thought there was a serverticket issue based on the original Warden logs from trying to restart the failed services.
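For completeness, the reconfigure step was done with configure.sh; from memory the invocation was roughly the following (treat the exact flags as approximate):

sudo /opt/mapr/server/configure.sh -unsecure -R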

 

I'd really appreciate any suggestions on what we can do to successfully get the CLDB service (and subsequently the rest of the cluster) back online.

 

**Edit/addition

We also see a number of messages like this in the CLDB log:

2017-02-28 14:32:57,4633 hdrlen bad, 363, received from 10.93.0.250:58497
2017-02-28 14:32:57,5022 hdrlen bad, 363, received from 10.93.0.250:55861
2017-02-28 14:32:57,6531 hdrlen bad, 363, received from 10.93.0.219:43008
2017-02-28 14:32:57,6798 hdrlen bad, 363, received from 10.93.0.210:50528
2017-02-28 14:32:57,6798 hdrlen bad, 363, received from 10.93.0.210:34755
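Given the security change above, one thing I plan to verify is that every node agrees the cluster is now unsecured, on the theory that these could be nodes still speaking secure RPC to an unsecured CLDB.  Assuming the standard config location:

# the secure= flag should match on every node
cat /opt/mapr/conf/mapr-clusters.conf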
