
Container replication appears to be stuck

Question asked by dannyman on Feb 18, 2017
Latest reply on Mar 3, 2017 by mufeed

We got into a situation where NFS clients creating files on a volume errored out with "No space left on device".

 

The root cause was eventually found: a storage pool (SP) on one node was full. I moved the node to the /decommissioned topology, but the SP remained stuck at 100%. I ultimately trashed and re-created the SP, as the node was begging for a larger stripe anyway.
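
For reference, this is roughly how the node got moved to /decommissioned (a sketch rather than the exact commands I ran; <serverid> is a placeholder you would look up first):

$ maprcli node list -columns id,hostname
$ maprcli node move -serverids <serverid> -topology /decommissioned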

 

Now we seem to be stuck like so:

$ maprcli dump containerinfo -ids 2504 -json
{
"timestamp":1487442699876,
"timeofday":"2017-02-18 10:31:39.876 GMT-0800",
"status":"OK",
"total":1,
"data":[
{
"ContainerId":2504,
"Epoch":117,
"Master":"10.10.3.118:5660--117-VALID",
"ActiveServers":{
"IP:Port":[
"10.10.3.118:5660--117-VALID",
"10.10.12.142:5660--2-RESYNC",
"10.10.4.24:5660--2-RESYNC",
"10.10.3.127:5660--114-RESYNC"
]
},
"InactiveServers":{

},
"UnusedServers":{
"IP:Port":"10.10.12.134:5660--116" # The node that had the full SP, which no longer exists.
},
"OwnedSizeMB":"285.2 GB",
"SharedSizeMB":"14.12 GB",
"LogicalSizeMB":"16.19 GB",
"TotalSizeMB":"285.2 GB",
"NumInodesInUse":298687327,
"Mtime":"Sat Feb 18 10:31:38 PST 2017",
"NameContainer":"true",
"CreatorContainerId":0,
"CreatorVolumeUuid":"",
"UseActualCreatorId":false,
"VolumeName":"data",
"VolumeId":71616275,
"VolumeReplication":3,
"VolumeMounted":true,
"AccessTime":"Sat Feb 18 07:56:57 PST 2017"
}
]
}

 

Looking at mrconfig cntr resyncprogress, it seems like the resync is stuck in a weird loop. Here's output taken at an 11-hour interval:

 

0-23:24 djh@c16-03-18 ~$ /opt/mapr/server/mrconfig cntr resyncprogress
Resync Progress Info
--------------------
Cid: 2504, Snapshot Cid: 4279899545, Vol Id: 71616275, Location: Source, Peer Addr: 10.10.3.127:5660
ResyncType: Container Resync, Status: Resync In Progress, Total Inodes: 334589951, Resync Complete: 166563838

Cid: 2504, Snapshot Cid: 4279899034, Vol Id: 71616275, Location: Source, Peer Addr: 10.10.4.24:5660
ResyncType: Container Resync, Status: Resync In Progress, Total Inodes: 334589951, Resync Complete: 17219737

Cid: 2504, Snapshot Cid: 4279899034, Vol Id: 71616275, Location: Source, Peer Addr: 10.10.12.142:5660
ResyncType: Container Resync, Status: Resync In Progress, Total Inodes: 334589951, Resync Complete: 26431813

 

And:

 

0-10:31 djh@c16-03-18 ~$ /opt/mapr/server/mrconfig cntr resyncprogress
Resync Progress Info
--------------------
Cid: 2504, Snapshot Cid: 4279899559, Vol Id: 71616275, Location: Source, Peer Addr: 10.10.3.127:5660
ResyncType: Container Resync, Status: Resync In Progress, Total Inodes: 334589951, Resync Complete: 128126974

Cid: 2504, Snapshot Cid: 4279899034, Vol Id: 71616275, Location: Source, Peer Addr: 10.10.4.24:5660
ResyncType: Container Resync, Status: Resync In Progress, Total Inodes: 334589951, Resync Complete: 66296347

Cid: 2504, Snapshot Cid: 4279899034, Vol Id: 71616275, Location: Source, Peer Addr: 10.10.12.142:5660
ResyncType: Container Resync, Status: Resync In Progress, Total Inodes: 334589951, Resync Complete: 68238052

 

From what I can see, the resync to 3.127 (epoch 114?!) just keeps rolling over (the snapshot CID changes and the completed-inode count drops back), while the resyncs to the other nodes (epoch 2?!) are just infinitely slow.
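
To put a rough number on "infinitely slow" (assuming the two samples above really are about 11 hours apart), the epoch-2 destinations are progressing at a few million inodes per hour:

$ echo "(66296347 - 17219737) / 11" | bc        # 10.10.4.24: ~4.5M inodes/hour
$ echo "(334589951 - 66296347) / 4461510" | bc  # ~60 more hours at that rate
$ echo "(68238052 - 26431813) / 11" | bc        # 10.10.12.142: ~3.8M inodes/hour

So even the copies that are moving forward look to be days from completion, while 3.127 keeps resetting.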

 

I am also a little concerned about a potential SP space crunch on 3.127:

 

0-10:43 djh@c16-03-27 ~$ /opt/mapr/server/mrconfig sp list
ListSPs resp: status 0:4
No. of SPs (4), totalsize 17987745 MB, totalfree 11731512 MB

SP 0: name SP1, Online, size 4908952 MB, free 4647029 MB, path /dev/sdb2
SP 1: name SP2, Online, size 4908952 MB, free 605304 MB, path /dev/sdi2
SP 2: name SP4, Online, size 3260886 MB, free 2627979 MB, path /dev/sdf2
SP 3: name SP3, Online, size 4908952 MB, free 3851198 MB, path /dev/sdk2

 

I had tried putting the node into drain, but SP2 remains stubbornly stuck at around 90%. This is another node that could benefit from a wider stripe.
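
(Quick sanity check on that 90% figure from the sp list output above:)

$ echo "scale=1; 100 * (4908952 - 605304) / 4908952" | bc    # SP2: ~87.7% used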

 

Any idea what is going on here with replication? Why are the epochs mixed? I'm going to try bouncing the MFS on the container master node.
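
(By "bouncing" I mean roughly the following, run on the node currently hosting the container master; assuming the SysV-style mapr-warden service script and that a warden restart, which also restarts MFS on that node, is acceptable there:)

$ sudo service mapr-warden restart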

 

UPDATE 1: Initial results after bouncing warden on SP master node:

 

 

0-11:10 djh@c16-03-18 ~$ /opt/mapr/server/mrconfig cntr resyncprogress
Resync Progress Info
--------------------
Cid: 2504, Snapshot Cid: 4189879439, Vol Id: 71616275, Location: Source, Peer Addr: 10.10.3.127:5660
ResyncType: Container Resync, Status: Resync In Progress, Total Inodes: 334589951, Resync Complete: 6819838

Cid: 2504, Snapshot Cid: 4189879438, Vol Id: 71616275, Location: Source, Peer Addr: 10.10.4.24:5660
ResyncType: Container Resync, Status: Resync In Progress, Total Inodes: 334589951, Resync Complete: 1172602

Cid: 2504, Snapshot Cid: 4189879438, Vol Id: 71616275, Location: Source, Peer Addr: 10.10.12.142:5660
ResyncType: Container Resync, Status: Resync In Progress, Total Inodes: 334589951, Resync Complete: 1172602

 

Also:

"Master":"10.10.3.118:5660--118-VALID",
"ActiveServers":{
"IP:Port":[
"10.10.3.118:5660--118-VALID",
"10.10.3.127:5660--114-RESYNC",
"10.10.12.142:5660--2-RESYNC",
"10.10.4.24:5660--2-RESYNC"
]
},

 

I would still like to understand why we have different epochs here. My hunch is that 3.127 syncs so much faster because it is in the same topology as the master.
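
(To check that hunch, something like this should show which topology each node sits in; assuming the racktopo column is available in this release:)

$ maprcli node list -columns hostname,racktopo
$ maprcli node topo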

 

UPDATE 2: The epoch-114 copy finished syncing and is now marked inactive:

 

0-15:18 djh@c16-03-18 ~$ /opt/mapr/server/mrconfig cntr resyncprogress
Resync Progress Info
--------------------
Cid: 2504, Snapshot Cid: 4189879438, Vol Id: 71616275, Location: Source, Peer Addr: 10.10.4.24:5660
ResyncType: Container Resync, Status: Resync In Progress, Total Inodes: 334589951, Resync Complete: 23241139

Cid: 2504, Snapshot Cid: 4189879438, Vol Id: 71616275, Location: Source, Peer Addr: 10.10.12.142:5660
ResyncType: Container Resync, Status: Resync In Progress, Total Inodes: 334589951, Resync Complete: 22818814

 

"ContainerId":2504,
"Epoch":118,
"Master":"10.10.3.118:5660--118-VALID",
"ActiveServers":{
"IP:Port":[
"10.10.3.118:5660--118-VALID",
"10.10.12.142:5660--2-RESYNC",
"10.10.4.24:5660--2-RESYNC"
]
},
"InactiveServers":{
"IP:Port":"10.10.3.127:5660--114"
},
"UnusedServers":{
"IP:Port":"10.10.12.134:5660--116"
},

 

UPDATE 3: A few weeks have passed and now the replication is again going in a circle:

0-09:29 djh@c16-03-18 ~$ /opt/mapr/server/mrconfig cntr resyncprogress
Resync Progress Info
--------------------
Cid: 2504, Snapshot Cid: 4189879438, Vol Id: 71616275, Location: Source, Peer Addr: 10.10.4.24:5660
ResyncType: Container Resync, Status: Resync In Progress, Total Inodes: 334589951, Resync Complete: 81871118

Cid: 2504, Snapshot Cid: 4189879438, Vol Id: 71616275, Location: Source, Peer Addr: 10.10.12.142:5660
ResyncType: Container Resync, Status: Resync In Progress, Total Inodes: 334589951, Resync Complete: 187011070

 

0-09:32 djh@mapr-01 ~$ maprcli dump replicationmanagerinfo -volumename data -json
{
"timestamp":1488216765113,
"timeofday":"2017-02-27 09:32:45.113 GMT-0800",
"status":"OK",
"total":2,
"data":[
{
"VolumeName":"data",
"VolumeId":71616275,
"VolumeTopology":"/data",
"VolumeUsedSizeMB":100259266,
"VolumeReplication":3,
"VolumeMinReplication":2,
"MirrorThrottle":true,
"AccessTime":"Mon Feb 27 07:57:02 PST 2017",
"limitSpread":true,
"fixCreatorId":false
},
{
"ContainerId":2504,
"Epoch":118,
"Master":"10.10.3.118:5660--118-VALID",
"ActiveServers":{
"IP:Port":[
"10.10.3.118:5660--118-VALID",
"10.10.12.142:5660--2-RESYNC",
"10.10.4.24:5660--2-RESYNC"
]
},
"InactiveServers":{
"IP:Port":"10.10.3.127:5660--114"
},
"UnusedServers":{
"IP:Port":"10.10.12.134:5660--116"
},
"OwnedSizeMB":"395.2 GB",
"SharedSizeMB":"2.28 TB",
"LogicalSizeMB":"2.66 TB",
"TotalSizeMB":"395.2 GB",
"NumInodesInUse":164675012,
"Mtime":"Mon Feb 27 09:32:42 PST 2017",
"NameContainer":"true",
"CreatorContainerId":0,
"CreatorVolumeUuid":"",
"UseActualCreatorId":false
}
]
}

 

I will try restarting the CLDB and relevant MFS later today when the cluster is idle.
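
(The plan, roughly; hostnames are placeholders, and the -name/-action form of maprcli node services is assumed to be available here:)

$ maprcli node services -name cldb -action restart -nodes <cldb-node>
$ ssh <container-master-node> sudo service mapr-warden restart    # restarts MFS there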

 

UPDATE 4: Things are starting to look normal:

 

First, I took a look at resyncprogress, and it was going in a circle on one destination node, but the Snapshot Cid at least was incrementing:

 

0-09:29 djh@c16-03-18 ~$ /opt/mapr/server/mrconfig cntr resyncprogress
Resync Progress Info
--------------------
Cid: 2504, Snapshot Cid: 4189879438, Vol Id: 71616275, Location: Source, Peer Addr: 10.10.4.24:5660
ResyncType: Container Resync, Status: Resync In Progress, Total Inodes: 334589951, Resync Complete: 81871118

Cid: 2504, Snapshot Cid: 4189879438, Vol Id: 71616275, Location: Source, Peer Addr: 10.10.12.142:5660
ResyncType: Container Resync, Status: Resync In Progress, Total Inodes: 334589951, Resync Complete: 187011070

 

Later:

 

0-12:47 djh@c16-03-18 ~$ /opt/mapr/server/mrconfig cntr resyncprogress
Resync Progress Info
--------------------
Cid: 2504, Snapshot Cid: 4189879935, Vol Id: 71616275, Location: Source, Peer Addr: 10.10.4.24:5660
ResyncType: Container Resync, Status: Resync In Progress, Total Inodes: 334589951, Resync Complete: 90583383

 

Huh. Look at container info:

 

 "ActiveServers":{
"IP:Port":[
"10.10.3.118:5660--118-VALID",
"10.10.12.142:5660--118-VALID",
"10.10.4.24:5660--2-RESYNC",
"10.10.3.127:5660--114-RESYNC"
]
},

 

Two VALID copies is good news. I stopped and started MFS on the master node, and now we have:

 

 "Master":"10.10.12.142:5660--119-VALID",
"ActiveServers":{
"IP:Port":[
"10.10.12.142:5660--119-VALID",
"10.10.3.118:5660--118-RESYNC",
"10.10.3.127:5660--114-RESYNC"
]
},

 

No more Epoch 2. Things look normal to me now. If you don't see any more updates, you can assume we have it sorted. For whatever this is worth ...
