
Cluster crash when a disk is full

Question asked by benoit on Oct 12, 2012
Latest reply on Oct 28, 2012 by nabeel
I ran a terasort bench on a small cluster. I wanted to generate 50GB of data on a cluster with 3*36GB of disk space. Replication factor is the default (3, min=2). Obviously I haven't enough disk space to store all the data. But I was expecting some kind of IOException, not a cluster crash's cascade.
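For reference, the space arithmetic makes the overrun easy to see: 50GB of data at replication factor 3 needs 150GB of raw storage, but the cluster only has 108GB. A minimal sketch of that check (the helper names are mine, not part of any MapR tooling):

```python
# Hypothetical capacity check mirroring the cluster in question:
# 3 disks x 36 GB, replication factor 3, 50 GB of input data.

def required_space_gb(data_gb, replication):
    """Raw storage needed once every block is replicated."""
    return data_gb * replication

def available_space_gb(disks, disk_gb):
    """Total raw capacity across all disks in the cluster."""
    return disks * disk_gb

needed = required_space_gb(50, 3)       # 150 GB of raw storage required
have = available_space_gb(3, 36)        # 108 GB of raw storage available
print(f"need {needed} GB, have {have} GB, overcommitted: {needed > have}")
```

So the job was guaranteed to exhaust the storage pools well before the terasort input was fully generated.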

MFS reported some errors (see below), and the CLDB started logging "Waiting for local KvStoreContainer to become valid" and "waiting for local mfs to register and become master".

At that point I am no longer able to log into the adminui (see adminui logs below).

Again, is there a way to recover from here?

Thanks,

mfs.log snapshot

`2012-10-12 16:15:15,9814 ERROR  ctable.cc:190 x.x.0.0:0 Container 2056 is on a storage pool that is not yet available, thus not ready yet.
2012-10-12 16:15:15,9815 INFO  dirtyinodes.cc:347 x.x.0.0:0 Cannot update write version number of container 2056 since the container cannot be modified now
2012-10-12 16:15:15,9815 ERROR  ctable.cc:190 x.x.0.0:0 Container 2058 is on a storage pool that is not yet available, thus not ready yet.
2012-10-12 16:15:15,9815 INFO  dirtyinodes.cc:347 x.x.0.0:0 Cannot update write version number of container 2058 since the container cannot be modified now
2012-10-12 16:15:15,9815 ERROR  ctable.cc:190 x.x.0.0:0 Container 2064 is on a storage pool that is not yet available, thus not ready yet.
2012-10-12 16:15:15,9815 INFO  dirtyinodes.cc:347 x.x.0.0:0 Cannot update write version number of container 2064 since the container cannot be modified now
2012-10-12 16:15:15,9815 ERROR  ctable.cc:190 x.x.0.0:0 Container 2062 is on a storage pool that is not yet available, thus not ready yet.
2012-10-12 16:15:15,9815 INFO  dirtyinodes.cc:347 x.x.0.0:0 Cannot update write version number of container 2062 since the container cannot be modified now
2012-10-12 16:15:16,4188 ERROR  ctable.cc:190 x.x.0.0:0 Container 1 is on a storage pool that is not yet available, thus not ready yet.
2012-10-12 16:15:16,4189 ERROR  replicateops.cc:1495 x.x.0.0:0 Failed to GetContainer (19) for replicated op (1052395) on container (1) of type (15) from 10.19.251.93:5660. Rejecting the op.
2012-10-12 16:15:16,4189 INFO  replicateops.cc:2087 x.x.0.0:0 Op (15) from 10.19.251.93:5660 with version (1052395) on container (1) failed on replica with error (111)
2012-10-12 16:15:16,4820 WARN  iomgr.cc:715 x.x.0.0:0 Slow IO on disk /dev/sdb. IO spent more than 60 seconds.epoch IO: 6484 6484, currentEpoch 6496
2012-10-12 16:15:16,4820 WARN  iomgr.cc:715 x.x.0.0:0 Slow IO on disk /dev/sdb. IO spent more than 60 seconds.epoch IO: 6484 6484, currentEpoch 6496
2012-10-12 16:15:16,4820 WARN  iomgr.cc:715 x.x.0.0:0 Slow IO on disk /dev/sdb. IO spent more than 60 seconds.epoch IO: 6484 6484, currentEpoch 6496
2012-10-12 16:15:16,4820 WARN  iomgr.cc:715 x.x.0.0:0 Slow IO on disk /dev/sdb. IO spent more than 60 seconds.epoch IO: 6484 6484, currentEpoch 6496
2012-10-12 16:15:16,4820 WARN  iomgr.cc:715 x.x.0.0:0 Slow IO on disk /dev/sdb. IO spent more than 60 seconds.epoch IO: 6484 6484, currentEpoch 6496
2012-10-12 16:15:16,4820 WARN  iomgr.cc:715 x.x.0.0:0 Slow IO on disk /dev/sdb. IO spent more than 60 seconds.epoch IO: 6484 6484, currentEpoch 6496
2012-10-12 16:15:16,4820 ERROR  iomgr.cc:723 x.x.0.0:0 Failing Slow disk /dev/sdb. 31 IOs spent 60 seconds or more for each IO.
2012-10-12 16:15:16,4821 WARN  iomgr.cc:715 x.x.0.0:0 Slow IO on disk /dev/sdb. IO spent more than 60 seconds.epoch IO: 6484 6484, currentEpoch 6496`

adminui log snapshot

`2012-10-12 15:47:54,040 INFO  com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils [pool-1-thread-25]: init
2012-10-12 15:47:54,042 ERROR com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils getDataForParticularCLDB [pool-1-thread-25]: CLDB Host: 10.19.251.92, CLDB IP: 7222 is READ_ONLY CLDB. Trying another one
2012-10-12 15:47:54,042 INFO  com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils [pool-1-thread-25]: Bad CLDB credentials removed: CLDB Host: 10.19.251.92, CLDB IP: 7222
2012-10-12 15:47:54,043 ERROR com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils getDataForParticularCLDB [pool-1-thread-25]: CLDB Host: 10.19.251.93, CLDB IP: 7222 is attempting to become a master. Retrying !
2012-10-12 15:48:29,049 INFO  com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils [pool-1-thread-25]: Bad CLDB credentials removed: CLDB Host: 10.19.251.93, CLDB IP: 7222
2012-10-12 15:48:29,049 ERROR com.mapr.cli.VolumeCommands sendRequest [pool-1-thread-25]: RPC Request to list volumes failed. No data returned
2012-10-12 15:49:24,076 INFO  com.mapr.adminuiapp.commands.CLDBCallable [pool-1-thread-25]: CLDBCallable: Parameters = [volume, list, -limit, 50, -start, 0, -columns, mt,n,p,on,qta,dsu,dlu,ssu,tsu,rp,t,src,msc,mst,mds,mdc,drf, -output, terse]`
