
mapr dropping hard disks?

Question asked by stormcrow on Feb 13, 2013
Latest reply on Feb 20, 2013 by aaron
Our cluster drives are offered up via a SAS HBA. We 'lost' a drive today, but the drive has no logged SMART errors, generated no messages in /var/log/messages, and is still visible to the OS; MapR has given us the spectacularly unhelpful failure reason of "error_str='Unknown error'". Is it possible for us to drill down to get more information on why MapR decided to reject this drive? Is there a way to get MapR to retry the drive without explicitly removing and re-adding it?
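For drilling down, the usual places to look are the kernel ring buffer, SMART (if the HBA passes it through), and MapR's own disk state. A hedged diagnostic sketch follows; the exact `maprcli` flags can vary by MapR version, and the guards let it degrade gracefully on a box missing any of these tools:

```shell
# Diagnostic sketch for a disk MapR has marked failed; assumes the node
# still sees /dev/sdd. Paths and names mirror the trace below.
DISK=/dev/sdd
HOST=$(head -1 /opt/mapr/hostname 2>/dev/null || hostname -f)

# 1. MapR's own view of the disk.
if command -v maprcli >/dev/null; then
    maprcli disk list -host "$HOST"
fi

# 2. Kernel ring buffer: SAS/HBA I/O errors often land here even when
#    /var/log/messages is quiet.
dmesg 2>/dev/null | grep -i "$(basename "$DISK")" | tail -n 20 || true

# 3. SMART health and error log, if the HBA passes SMART through.
if command -v smartctl >/dev/null; then
    smartctl -H -l error "$DISK"
fi

# 4. The per-disk failure report the handler script writes.
tail -n 40 /opt/mapr/logs/faileddisk.log 2>/dev/null || true
```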

    [root@job /opt/mapr/logs]# cat handle_disk_failure.log.15682.2013-02-13-09:57:35
    + LOG_FILE=/opt/mapr/logs/faileddisk.log
    + cluster=
    + diskList=
    + volList=
    + errorCode=0
    + '[' - = - ']'
    + case "$1" in
    + diskList=/dev/sdd
    + shift 2
    + '[' - = - ']'
    + case "$1" in
    + cluster=wopr
    + shift 2
    + '[' - = - ']'
    + case "$1" in
    + volList=mapr.job.cluster.example.com.local.mapred
    + shift 2
    + '[' - = - ']'
    + case "$1" in
    + errorCode=52
    + '[' 52 -lt 0 ']'
    + shift 2
    + '[' '' = - ']'
    + '[' wopr '!=' '' ']'
    + cluster='-cluster wopr'
    + '[' /dev/sdd '!=' '' ']'
    ++ basename /opt/mapr/server/handle_disk_failure.sh
    + curProcName=handle_disk_failure.sh
    + lock=/tmp/handle_disk_failure.sh.lck
    + var=(`echo ${diskList} | tr ',' ' '`)
    ++ echo /dev/sdd
    ++ tr , ' '
    + for disk in '${var[@]}'
    + [[ /dev/sdd == /dev/* ]]
    + disk=sdd
    ++ echo sdd
    ++ sed -e 's/\//_/g'
    + disk=sdd
    + flock -x 201
    + log_failure sdd
    + disk=sdd
    ++ date
    + time_of_failure='Wed Feb 13 09:57:35 EST 2013'
    + error_str=
    + resolution='Resolution          :
       Please refer to MapR'\''s online documentation at http://www.mapr.com/doc on how to handle disk failures.
       In summary, run the following steps:
    
       a. If this appears to be a software failure, go to step b.
          Otherwise, physically remove the disk /dev/sdd.
          Optionally, replace it with a new disk.
    
       b. Run the command "maprcli disk remove -host 127.0.0.1 -disks /dev/sdd" to remove /dev/sdd from MapR-FS.
    
       c. In addition to /dev/sdd, the above command removes all the disks that belong to the same storage pool, from MapR-FS.
          Note down the names of all removed disks.
    
       d. Add all the above removed disks (exclude /dev/sdd) and the new disk to MapR-FS by running the command:
          "maprcli disk add -host 127.0.0.1 -disks <comma separated list of disks>"
          For example, If /dev/sdx is the new replaced disk, and /dev/sdy, /dev/sdz were removed in step c), the command would be:
                       "maprcli disk add -host 127.0.0.1 -disks /dev/sdx,/dev/sdy,/dev/sdz"
                       If there is no new disk, the command would just be:
                       "maprcli disk add -host 127.0.0.1 -disks /dev/sdy,/dev/sdz"'
    + echo -e '############################ Disk Failure Report ###########################'
    + '[' -e /opt/mapr/logs/sdd.info ']'
    + cat /opt/mapr/logs/sdd.info
    + case "$errorCode" in
    + error_str='Unknown error'
    + echo -e 'Failure Reason      :    Unknown error'
    + echo -e 'Time of Failure     :    Wed Feb 13 09:57:35 EST 2013'
    ++ /usr/bin/id -u
    + '[' -e /dev/sdd -a 501 -eq 0 -a -e /usr/sbin/smartctl ']'
    + '[' mapr.job.cluster.example.com.local.mapred '!=' '' ']'
    + echo -e 'Lost Volumes        :    mapr.job.cluster.example.com.local.mapred'
    + echo -e 'Resolution          :
       Please refer to MapR'\''s online documentation at http://www.mapr.com/doc on how to handle disk failures.
       In summary, run the following steps:
    
       a. If this appears to be a software failure, go to step b.
          Otherwise, physically remove the disk /dev/sdd.
          Optionally, replace it with a new disk.
    
       b. Run the command "maprcli disk remove -host 127.0.0.1 -disks /dev/sdd" to remove /dev/sdd from MapR-FS.
    
       c. In addition to /dev/sdd, the above command removes all the disks that belong to the same storage pool, from MapR-FS.
          Note down the names of all removed disks.
    
       d. Add all the above removed disks (exclude /dev/sdd) and the new disk to MapR-FS by running the command:
          "maprcli disk add -host 127.0.0.1 -disks <comma separated list of disks>"
          For example, If /dev/sdx is the new replaced disk, and /dev/sdy, /dev/sdz were removed in step c), the command would be:
                       "maprcli disk add -host 127.0.0.1 -disks /dev/sdx,/dev/sdy,/dev/sdz"
                       If there is no new disk, the command would just be:
                       "maprcli disk add -host 127.0.0.1 -disks /dev/sdy,/dev/sdz"'
    + echo -e ''
    + touch /opt/mapr/logs/sdd.failed.info
    ++ head -1 /opt/mapr/hostname
    + hostname=job.cluster.example.com
    + '[' job.cluster.example.com = '' ']'
    + /opt/mapr/bin/maprcli node services -cluster wopr -nodes job.cluster.example.com -tasktracker stop
    + '[' mapr.job.cluster.example.com.local.mapred '!=' '' ']'
    + var=(`echo ${volList} | tr ',' ' '`)
    ++ echo mapr.job.cluster.example.com.local.mapred
    ++ tr , ' '
    + for vol in '${var[@]}'
    + /opt/mapr/bin/maprcli volume unmount -cluster wopr -force 1 -name mapr.job.cluster.example.com.local.mapred
    + /opt/mapr/bin/maprcli volume remove -cluster wopr -name mapr.job.cluster.example.com.local.mapred
    + /opt/mapr/bin/maprcli node services -cluster wopr -nodes job.cluster.example.com -tasktracker start
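The resolution text in the report boils down to a remove/re-add cycle on the whole storage pool. A dry-run sketch of that cycle; the pool-mate names /dev/sde and /dev/sdf are hypothetical placeholders, and the `echo` prefixes keep it from actually running (drop them to execute):

```shell
# Dry-run of the remove/re-add cycle from the failure report above.
# /dev/sde and /dev/sdf stand in for whatever pool-mates 'disk remove'
# ends up ejecting; substitute the real list it reports.
HOST=job.cluster.example.com

# Removing /dev/sdd also removes every disk in its storage pool.
echo maprcli disk remove -host "$HOST" -disks /dev/sdd

# Re-add the pool-mates, plus /dev/sdd itself if retrying the same disk
# (or a replacement disk instead).
echo maprcli disk add -host "$HOST" -disks /dev/sdd,/dev/sde,/dev/sdf
```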
