Container/Replication issues

Question asked by mandoskippy on May 14, 2015
Latest reply on May 14, 2015 by mandoskippy
We have a volume that is set to 2x replication, and it can't seem to correct itself out of this state. I focused in on one container, and the two nodes just can't seem to arrive at healthy replication.

The container is 16246. 192.168.3.24 appears VALID and 192.168.3.12 is in RESYNC, but you can see in the log from 192.168.3.12 that it's not resyncing. I'm not sure what the issue is.

CLDB Web
Info on container 16246

    16246  224808893  4  15.01 GB  192.168.3.24--4-VALID  192.168.3.24--4-VALID,192.168.3.12--4-RESYNC,    C

# 192.168.3.12

    2015-05-14 10:44:48,2342 ERROR  ctable.cc:337 container 16246 is stale, and hasn't been resynced yet
    2015-05-14 10:44:48,2343 ERROR  attr.cc:77 GetAttr 16246.176.1595314 : GetContainer failed 19
    2015-05-14 10:44:48,2343 ERROR  ctable.cc:337 container 16246 is stale, and hasn't been resynced yet
    2015-05-14 10:44:48,2344 ERROR  attr.cc:77 GetAttr 16246.163.1595222 : GetContainer failed 19
    2015-05-14 10:44:48,2344 ERROR  ctable.cc:337 container 16246 is stale, and hasn't been resynced yet
    2015-05-14 10:44:48,2345 ERROR  attr.cc:77 GetAttr 16246.163.1595222 : GetContainer failed 19
    2015-05-14 10:44:48,2345 ERROR  ctable.cc:337 container 16246 is stale, and hasn't been resynced yet
    2015-05-14 10:44:48,2346 ERROR  attr.cc:77 GetAttr 16246.163.1595222 : GetContainer failed 19
    2015-05-14 10:44:48,2346 ERROR  ctable.cc:337 container 16246 is stale, and hasn't been resynced yet
    2015-05-14 10:44:48,2347 ERROR  attr.cc:77 GetAttr 16246.163.1595222 : GetContainer failed 19
    2015-05-14 10:44:48,2347 ERROR  ctable.cc:337 container 16246 is stale, and hasn't been resynced yet
    2015-05-14 10:44:48,2348 ERROR  attr.cc:77 GetAttr 16246.163.1595222 : GetContainer failed 19
    2015-05-14 10:44:48,2348 ERROR  ctable.cc:337 container 16246 is stale, and hasn't been resynced yet
    2015-05-14 10:44:48,2348 ERROR  attr.cc:77 GetAttr 16246.163.1595222 : GetContainer failed 19
    2015-05-14 10:44:48,2349 ERROR  ctable.cc:337 container 16246 is stale, and hasn't been resynced yet
    2015-05-14 10:44:48,2349 ERROR  attr.cc:77 GetAttr 16246.163.1595222 : GetContainer failed 19
    2015-05-14 10:44:48,2350 ERROR  ctable.cc:337 container 16246 is stale, and hasn't been resynced yet
    2015-05-14 10:44:48,2350 ERROR  attr.cc:77 GetAttr 16246.163.1595222 : GetContainer failed 19
    2015-05-14 10:44:48,2351 ERROR  ctable.cc:337 container 16246 is stale, and hasn't been resynced yet
    2015-05-14 10:44:48,2351 ERROR  attr.cc:77 GetAttr 16246.163.1595222 : GetContainer failed 19
    2015-05-14 10:44:48,2352 ERROR  ctable.cc:337 container 16246 is stale, and hasn't been resynced yet
    2015-05-14 10:44:48,2352 ERROR  attr.cc:77 GetAttr 16246.163.1595222 : GetContainer failed 19
    2015-05-14 10:44:48,2353 ERROR  ctable.cc:337 container 16246 is stale, and hasn't been resynced yet
    2015-05-14 10:44:48,2353 ERROR  attr.cc:77 GetAttr 16246.163.1595222 : GetContainer failed 19
    2015-05-14 10:44:48,2353 ERROR  ctable.cc:337 container 16246 is stale, and hasn't been resynced yet

# 192.168.3.24

        14 logs]# cat mfs.log-3|grep 16246
        2015-05-14 09:50:45,3005 INFO  loadcidmap.cc:730 SP SP9:/dev/sdi Delete empty containers for cid-chain with cid 16246
        2015-05-14 09:50:45,3005 INFO  loadcidmap.cc:759 SP SP9:/dev/sdi Cidmap Loaded cid 16246 rootblk: 0x16680180
        2015-05-14 09:50:47,3537 INFO  containerrollback.cc:263 Marking queryCldb 1 for rw container 16246 : stale(1) fixedByFsck(0)
        2015-05-14 09:50:48,9895 INFO  loadcidmap.cc:1267 PreInit done for container 16246
        2015-05-14 09:50:55,8002 INFO  firstwrite.cc:647 Get sync VN for spid 2449051591169589005196848f041ae7, container 18651 - got hole from txn: 14399620-15448196 uniq 16246794251868897394 totalsize 77
        2015-05-14 09:50:56,0597 INFO  firstwrite.cc:647 Get sync VN for spid 90a7d99445fd8dbe0051a4ff0f0d529f, container 3038 - got hole from txn: 36970042-38018618 uniq 16246904066132824124 totalsize 77
        2015-05-14 09:50:56,1963 INFO  firstwrite.cc:647 Get sync VN for spid 90a7d99445fd8dbe0051a4ff0f0d529f, container 18154 - got hole from txn: 16246894-17295470 uniq 16886904594297219785 totalsize 77
        2015-05-14 09:50:56,3965 INFO  nodefailure.cc:1097 Container 16246, CLDB asked to become master, ifClean=1
        2015-05-14 09:50:56,4144 ERROR  containerinfo.cc:753 Loss of data in container 16246, write versions are 29345659:29345660
        2015-05-14 09:50:56,4144 ERROR  nodefailure.cc:1201 Container 16246, become master failed since we lost data
        2015-05-14 09:50:56,4144 INFO  nodefailure.cc:1874 Marking container 16246 stale
        2015-05-14 09:50:56,4144 INFO  nodefailure.cc:1411 Become master failed for container 16246 with err 61
        2015-05-14 09:50:56,7461 INFO  firstwrite.cc:647 Get sync VN for spid 32b6d5b2cce678bf00518fd5500b2592, container 23976 - got hole from txn: 12740577-13789153 uniq 9316246137732924386 totalsize 77
        2015-05-14 09:50:57,3620 INFO  firstwrite.cc:647 Get sync VN for spid 7f27a52574290fa4005190d3a904cbd7, container 4516 - got hole from txn: 22395848-23444424 uniq 13162461974905276843 totalsize 77
        2015-05-14 09:50:58,3882 INFO  nodefailure.cc:1796 Mark container stale for cid 16246.
        2015-05-14 09:52:00,5619 INFO  nodefailure.cc:1097 Container 16246, CLDB asked to become master, ifClean=0
        2015-05-14 09:52:00,5833 ERROR  containerinfo.cc:753 Loss of data in container 16246, write versions are 29345659:29345660
        2015-05-14 09:52:00,5834 INFO  nodefailure.cc:1207 Lost data on container 16246. CLDB asked me to become master even if I lost data
        2015-05-14 09:52:00,5834 INFO  nodefailure.cc:1402 Become master completed successfully for container 16246 at txn:29345660-29345660, write:29345659-29345660, snap:544-544
        2015-05-14 10:12:44,5833 INFO  loadcidmap.cc:730 SP SP9:/dev/sdi Delete empty containers for cid-chain with cid 16246
        2015-05-14 10:12:44,5833 INFO  loadcidmap.cc:759 SP SP9:/dev/sdi Cidmap Loaded cid 16246 rootblk: 0x16680180
        2015-05-14 10:12:44,6893 INFO  containerrollback.cc:263 Marking queryCldb 1 for rw container 16246 : stale(1) fixedByFsck(0)
        2015-05-14 10:12:46,3753 INFO  loadcidmap.cc:1267 PreInit done for container 16246
        2015-05-14 10:12:53,4360 INFO  firstwrite.cc:647 Get sync VN for spid 90a7d99445fd8dbe0051a4ff0f0d529f, container 3038 - got hole from txn: 36970042-38018618 uniq 16246904066132824124 totalsize 77
        2015-05-14 10:12:53,7376 INFO  nodefailure.cc:1097 Container 16246, CLDB asked to become master, ifClean=1
        2015-05-14 10:12:53,7403 INFO  firstwrite.cc:647 Get sync VN for spid 32b6d5b2cce678bf00518fd5500b2592, container 23976 - got hole from txn: 12740577-13789153 uniq 9316246137732924386 totalsize 77
        2015-05-14 10:12:53,7854 ERROR  containerinfo.cc:753 Loss of data in container 16246, write versions are 29345659:29345660
        2015-05-14 10:12:53,7857 ERROR  nodefailure.cc:1201 Container 16246, become master failed since we lost data
        2015-05-14 10:12:53,7857 INFO  nodefailure.cc:1874 Marking container 16246 stale
        2015-05-14 10:12:53,7857 INFO  nodefailure.cc:1411 Become master failed for container 16246 with err 61
        2015-05-14 10:12:53,7983 INFO  firstwrite.cc:647 Get sync VN for spid 90a7d99445fd8dbe0051a4ff0f0d529f, container 18154 - got hole from txn: 16246894-17295470 uniq 16886904594297219785 totalsize 77
        2015-05-14 10:12:54,0049 INFO  firstwrite.cc:647 Get sync VN for spid 7f27a52574290fa4005190d3a904cbd7, container 4516 - got hole from txn: 22395848-23444424 uniq 13162461974905276843 totalsize 77
        2015-05-14 10:12:54,2279 INFO  firstwrite.cc:647 Get sync VN for spid 2449051591169589005196848f041ae7, container 18651 - got hole from txn: 14399620-15448196 uniq 16246794251868897394 totalsize 77
        2015-05-14 10:12:55,7244 INFO  nodefailure.cc:1796 Mark container stale for cid 16246.
        2015-05-14 10:13:56,8966 INFO  nodefailure.cc:1097 Container 16246, CLDB asked to become master, ifClean=0
        2015-05-14 10:13:56,9036 ERROR  containerinfo.cc:753 Loss of data in container 16246, write versions are 29345659:29345660
        2015-05-14 10:13:56,9056 INFO  nodefailure.cc:1207 Lost data on container 16246. CLDB asked me to become master even if I lost data
        2015-05-14 10:13:56,9056 INFO  nodefailure.cc:1402 Become master completed successfully for container 16246 at txn:29345660-29345660, write:29345659-29345660, snap:544-544
        2015-05-14 10:14:14,9443 INFO  containerrestore.cc:565 CONTAINER_RESTORE_START -- from srcnode FSID 6832161662026751981, 192.168.3.16:5660,  srccid 23728, replicacid 23728, replicaSnapId 0 resyncVolumeSnapshots = 1 resynctype 2 resyncWAcount 0 onReplica 0 needsReplication 0 sessionId 1001316246 notifyReplModule 1 chainSeqNumber 811  dumpSnapshotInode 1 isFullMirror 0, needReconnect 0, onlyfastresync 1 isMirrorRestarted 0
        2015-05-14 10:14:15,0568 ERROR  containerrestore.cc:1747 D: Container resync failed to send doresync req srccid 23728 replicacid 23728 sessionId 1001316246 source node 192.168.3.16:5660 err:104
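
Note that a plain `grep 16246` also picks up lines for other containers (18651, 3038, 23728 and so on) whenever the digits happen to appear inside a longer number such as a `uniq` value, a txn range, or a `sessionId`. A minimal sketch of a tighter filter, assuming a grep that supports `-w` (GNU and BSD grep do); `container_lines` is a hypothetical helper name, not a MapR tool:

```shell
# Hypothetical helper: print only the log lines that mention the container ID
# as a whole token, so longer numbers that merely contain the same digits
# (uniq values, txn ranges, session IDs) are skipped.
container_lines() {
    cid="$1"    # container ID, e.g. 16246
    file="$2"   # log file, e.g. mfs.log-3
    grep -w -- "$cid" "$file"
}

# Example: container_lines 16246 mfs.log-3
```

With `-w`, entries such as `container 16246,`, `cid 16246.` and `GetAttr 16246.176.1595314` still match, because the ID is bounded by non-word characters on both sides.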
    


