
M3 Unable to promote new CLDB server

Question asked by stormcrow on Feb 23, 2014
Latest reply on Feb 24, 2014 by mava
I am attempting to recover from a failed CLDB server using the instructions listed [here][1], but I am having trouble getting the new CLDB server to take over.

First, I found two replicas of container 1 (cid:1). The one I am using is on node12:

    [root@node12 logs]# /opt/mapr/server/mrconfig info dumpcontainers | egrep "cid:1 "
    cid:1 volid:1 sp:SP1:/dev/sda spid:33e3bbe9d4b9229700519696ed0a0adb prev:0 next:0 issnap:0 isclone:0 deleteinprog:0 fixedbyfsck:0 stale:1 querycldb:0 resyncinprog:0 shared:0 owned:0 logical:0 snapusage:0 snapusageupdated:0
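
For comparison, this is roughly how I checked the other nodes for cid:1 copies and their epochs (the host list and passwordless root ssh are just assumptions about my setup; adjust as needed):

    # Assumption: passwordless ssh as root to each MapR node; the host list is only an example.
    for h in node3 node4 node12 node13; do
        echo "== $h =="
        ssh root@$h '/opt/mapr/server/mrconfig info dumpcontainers' | egrep "cid:1 "
    done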

I installed mapr-cldb on node12 and configured all nodes to use it as the CLDB:

    /opt/mapr/server/configure.sh -R -C node12 -Z job,node3,node4 -N cluster
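
To double-check what configure.sh wrote, I looked at the cluster config it generates (as far as I know this is the standard location on this release):

    # The CLDB host list that configure.sh writes out; it should now name only node12,
    # e.g. something like "cluster node12:7222".
    cat /opt/mapr/conf/mapr-clusters.conf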

The CLDB on node12 fails to come up. 192.168.3.147 is the old CLDB server; 192.168.3.20 is the new one.

cldb.log on node12:

    2014-02-23 13:38:08,580 INFO ZooKeeperClient [ZK-Connect]: Storing KvStoreContainerInfo to ZooKeeper  Container ID:1 Servers:  Inactive:  192.168.3.20-253(1133312586010741983) 192.168.3.148-253(3188444061217710707) 192.168.3.147-253(5299330584556789244) Unused:  Epoch:253 SizeMB:0
    2014-02-23 13:38:08,667 INFO CLDBServer [ZK-Connect]: Starting thread to monitor waiting for local kvstore to become master
    2014-02-23 13:38:08,702 INFO VolumeMirror [main]: Initializing volume mirror thread ...
    2014-02-23 13:38:08,703 INFO VolumeMirror [main]: Spawned 2 VolumeMirror Threads
    2014-02-23 13:38:08,722 INFO HttpServer [main]: Creating listener for 0.0.0.0
    2014-02-23 13:38:08.741::INFO:  Logging to STDERR via org.mortbay.log.StdErrLog
    2014-02-23 13:38:08,798 INFO CLDB [main]: CLDBState: CLDB State change : WAIT_FOR_FILESERVERS
    2014-02-23 13:38:08,798 INFO CLDB [main]: CLDBInit: Exporting program 2346
    2014-02-23 13:38:08,798 INFO CLDB [main]: CLDBInit: Exporting program 2345
    2014-02-23 13:38:08,798 INFO CLDB [main]: CLDBInit: Starting HTTP Server
    2014-02-23 13:38:08,798 INFO HttpServer [main]: WebServer: Starting WebServer
    2014-02-23 13:38:08,800 INFO HttpServer [main]: Listener started on SelectChannelConnector@0.0.0.0:7221 port 7221
    2014-02-23 13:38:08,800 INFO HttpServer [main]: Starting Jetty WebServer
    2014-02-23 13:38:08.800::INFO:  jetty-6.1.14
    2014-02-23 13:38:23.299::INFO:  Started SelectChannelConnector@0.0.0.0:7221
    2014-02-23 13:45:08,668 FATAL CLDB [WaitForLocalKvstore Thread]: CLDBShutdown: CLDB had master lock and was waiting for its local mfs to become Master. Waited for 7 (minutes) but mfs did not become Master. Shutting down CLDB to release master lock.
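
In case it is relevant, here is roughly how I planned to dump what the CLDB stored for container 1 in ZooKeeper; I am not certain the znode path below is right for 2.1.x, so treat it as a guess:

    # Assumption: the container 1 (KvStoreContainerInfo) state lives under this znode;
    # the exact path and the zookeeper install dir may differ by release.
    /opt/mapr/zookeeper/zookeeper-*/bin/zkCli.sh -server node3:5181 \
        get /datacenter/controlnodes/cldb/epoch/1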

More worrying is mfs.log on node12:

    ******* Starting mfs server *******
    *** mfs mapr-version: 2.1.2.18401.GA ***
    2014-02-23 13:53:46,0722 INFO  mapfs.cc:874 x.x.0.0:0 FS : Using hostname node12.wopr.mynetwatchman.com, port: 5660, hostid 0xfba55cadca64cdf (1133312586010741983)
    2014-02-23 13:53:46,0722 INFO  mapfs.cc:876 x.x.0.0:0 Starting fileserver on :
    2014-02-23 13:53:46,0722 INFO  mapfs.cc:879 x.x.0.0:0   192.168.3.20:5660
    2014-02-23 13:53:46,0795 INFO  cachemgr.cc:2827 x.x.0.0:0 cachePercentagesIn: inode:6:log:6:meta:10:dir:40:small:15
    2014-02-23 13:53:46,0795 INFO  cachemgr.cc:2854 x.x.0.0:0 CacheSize 23225 MB, inode:6:log:6:meta:10:dir:42:small:15
    2014-02-23 13:53:52,0355 INFO  cachemgr.cc:2665 x.x.0.0:0 lru meta  (0), start      1, end 447109, blocks 447109 [3493M], dirtyquota 178843 [1397M]
    2014-02-23 13:53:52,0562 INFO  cachemgr.cc:2665 x.x.0.0:0 lru dir   (3), start 447110, end 1620770, blocks 1173661 [9169M], dirtyquota 469464 [3667M]
    2014-02-23 13:53:52,0635 INFO  cachemgr.cc:2665 x.x.0.0:0 lru small (1), start 1620771, end 2039934, blocks 419164 [3274M], dirtyquota 377247 [2947M]
    2014-02-23 13:53:52,0768 INFO  cachemgr.cc:2665 x.x.0.0:0 lru large (2), start 2039935, end 2794432, blocks 754498 [5894M], dirtyquota 377247 [2947M]
    2014-02-23 13:53:52,7583 INFO  cachemgr.cc:2705 x.x.0.0:0 lru inode (5), start      1, end 5707776, inodes 5707776 [1393M], dirtyquota 2283110 [ 557M]
    2014-02-23 13:53:52,8511 INFO  cachemgr.cc:2743 x.x.0.0:0 lru cluster (6), start      1, end 1173662, cluster 1173662 [109M]
    2014-02-23 13:53:52,8512 INFO  cachemgr.cc:2887 x.x.0.0:0 Total dcache: 2794432, icache 5707776, ccache 1173662. memUsed 25334824944
    2014-02-23 13:53:52,8514 INFO  cachemgr.cc:3206 x.x.0.0:0 CM: Wrote cache offsets in /opt/mapr/logs/mfs-cache.dat
    2014-02-23 13:53:52,8515 INFO  mapfs.cc:189 x.x.0.0:0 mfs using maxTotalRpcs 4096
    2014-02-23 13:53:52,8516 INFO  iodispatch.cc:93 x.x.0.0:0 using IO maxEvents: 5000
    2014-02-23 13:53:52,8581 INFO  iomgr.cc:279 x.x.0.0:0 maxSlowIOs 30, slowDiskTimeOut 60 s, maxOutstandingIOsPerDisk 100, MaxStoragePools 49,
    2014-02-23 13:53:52,8635 INFO  mapserver.cc:799 x.x.0.0:0 CLDB 1 has IP address 192.168.3.147:7222
    .
    .
    .
    2014-02-23 13:54:11,3974 INFO  iomgr.cc:2530 x.x.0.0:0 Refresh disktab state: old state: 0 0, failed SPs: 0, failed disks: 0
    2014-02-23 13:54:13,8745 ERROR  cldbha.cc:668 x.x.0.0:0 Got error Connection reset by peer (104) while trying to register with CLDB 192.168.3.147:7222

It looks like node12 is trying to talk to the broken, offline CLDB server at 192.168.3.147 even though it was configured to see itself as the only CLDB server. What am I missing?
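
My working theory is that the mfs on node12 still has the old CLDB list, so my next step was to check where it is actually connecting and then bounce warden (standard service name assumed):

    # See whether the fileserver still opens connections to the old CLDB (RPC port 7222).
    netstat -tn | grep 7222

    # If the local config is correct, restart warden so mfs and cldb re-read it.
    service mapr-warden stop
    service mapr-warden start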


  [1]: http://answers.mapr.com/questions/7105/m3-cldb-failure-missingdisk
