
Recovering from disk failure

Question asked by cleonn on Mar 20, 2013
Latest reply on Mar 22, 2013 by Ted Dunning
In our testing process we pulled two disks at once on two separate servers. With replication set to 2x at the time, we then found that 5 volumes raise a Volume Data Unavailable alarm. Running

<pre>maprcli dump volumeinfo -volumename OurVolume -json</pre>

shows that all but one container are VALID and located on ActiveServers. But on each of these 5 volumes there's one container which shows the following:

<pre>
  {
   "ContainerId":&lt;id&gt;,
   "Epoch":13,
   "Master":"unknown ip (0)-0-VALID",
   "ActiveServers":{
   
   },
   "InactiveServers":{
   
   },
   "UnusedServers":{
    "IP:Port":[
     "10.11.0.26:5660-10.21.0.26:5660--13",
     "10.11.0.22:5660-10.21.0.22:5660--13"
    ]
   },
   "OwnedSizeMB":"15.05 GB",
   "SharedSizeMB":"0 MB",
   "LogicalSizeMB":"15.44 GB",
   "TotalSizeMB":"15.05 GB",
   "NumInodesInUse":14996,
   "Mtime":"Tue Mar 19 21:14:11 CET 2013",
   "NameContainer":"false"
  },
</pre>
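
For reference, this is how I'm spotting the dead containers across the affected volumes. It's only a quick sketch: it assumes jq is installed, that the container entries sit under "data" exactly as in the output above, and the volume names besides OurVolume are placeholders:

<pre>
# List containers that have no active replicas, one volume at a time.
# Volume names other than OurVolume are placeholders for our other affected volumes.
for vol in OurVolume Volume2 Volume3 Volume4 Volume5; do
  echo "== $vol =="
  maprcli dump volumeinfo -volumename "$vol" -json \
    | jq '.data[] | select(.ContainerId != null and (.ActiveServers | length) == 0) | .ContainerId'
done
</pre>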

I'm guessing that data is lost because we managed to destroy the two storage pools that held these containers' data.

Running:

<pre>maprcli volume info -name OurVolume -json</pre>

on one of these volumes returns:

<pre>
{
"timestamp":1363867198556,
"status":"OK",
"total":1,
"data":[
  {
   "acl":{
    "Principal":"User srv",
    "Allowed actions":[
     "dump",
     "restore",
     "m",
     "d",
     "fc"
    ]
   },
   "creator":"srv",
   "aename":"srv",
   "aetype":0,
   "numreplicas":"2",
   "minreplicas":"2",
   "replicationtype":"high_throughput",
   "rackpath":"/",
   "readonly":"0",
   "mountdir":"/srv/00030000/03b",
   "volumename":"OurVolume",
   "mounted":1,
   "quota":"0",
   "advisoryquota":"0",
   "snapshotcount":"0",
   "logicalUsed":"228988",
   "used":"223867",
   "snapshotused":"0",
   "totalused":"223867",
   "scheduleid":0,
   "schedulename":"",
   "volumetype":0,
   "volumeid":242633580,
   "actualreplication":[
    4,
    0,
    95,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0
   ],
   "nameContainerSizeMB":22850,
   "needsGfsck":true,
   "maxinodesalarmthreshold":"0",
   "partlyOutOfTopology":0
  }
]
}
</pre>

The interesting part here seems to be the <i>"needsGfsck":true</i> flag.
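
I'm assuming that flag means I should run MapR's gfsck tool against the volume. If I'm reading the docs right it would be invoked roughly as below, but the path, the rwvolume= argument and the -r repair flag are my own reading and not something I've verified yet:

<pre>
# Rough sketch, run as root on a cluster node. The rwvolume= argument and the
# -r (repair) flag are my assumption from the gfsck docs, not verified here.
/opt/mapr/bin/gfsck rwvolume=OurVolume      # check-only pass, report inconsistencies
/opt/mapr/bin/gfsck rwvolume=OurVolume -r   # repair pass, if the report looks sane
</pre>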

My question is, how do I remove the Volume Data Unavailable alarm and how do I restore these containers? I can re-populate them from the source, so I'm not afraid of losing data on them.
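
For the alarm half of the question, my current plan (untested; the alarm name is my guess from what maprcli alarm list shows on our cluster) is to clear it per volume once the containers are dealt with:

<pre>
# Untested plan: clear the data-unavailable alarm on the volume after the containers are restored.
# The alarm name VOLUME_ALARM_DATA_UNAVAILABLE is my assumption; check "maprcli alarm list" first.
maprcli alarm clear -alarm VOLUME_ALARM_DATA_UNAVAILABLE -entity OurVolume
</pre>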
