Friday, January 8, 2016

raid - Linux mdadm RAID5 data recovery with one drive failed, one drive failing



Improbably, I had two drives fail in the same RAID5 array within two weeks of one another, which means the array is dead. Yes, yes: hot spares, not being lazy about replacing the failed drive, I know. But let's move past that.




The data is somewhat backed up and not of critical importance, so I am not particularly panicked by this. I would still like to try to salvage what I can anyway.



It is a 4-device Software RAID5 set up with mdadm. The drives are as follows:



/dev/sde - device 0, healthy 
/dev/sdf - device 1, first failure, hard failure, totally dead
/dev/sdg - device 2, second failure, badblocks reports a few bad sectors
/dev/sdc - device 3, healthy



I think you can see where I'm going with this. Given that sdg has only a few bad sectors, I'd like to believe that most of the data is salvageable. When I reassemble the array with



mdadm --create /dev/md0 --assume-clean --level=5 --raid-devices=4 /dev/sde missing /dev/sdg /dev/sdc


I get no complaints and the device assembles and starts just fine in degraded mode. The problem occurs when I try to mount it. As soon as I run



mount -t ext4 /dev/md0 /mnt/raid



The bad blocks are detected at that point, /dev/sdg is failed out of the array, and with only /dev/sde and /dev/sdc still operational the array goes inactive and the mount fails.
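
In case it helps, this is roughly how I confirm what happened and clean up between attempts (just a sketch; the exact output varies, but the commands are standard):

dmesg | tail -n 30        # the read errors on /dev/sdg show up here
cat /proc/mdstat          # shows the array state and which members got dropped
mdadm --stop /dev/md0     # tear the array down before trying another --create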



Is there some way to prevent mdadm from failing the drive as soon as it detects a bad block? Some debug flag I can set? Something? I realize that some of the data will be corrupt, and some of the reads will fail.



I'm guessing what I am asking is impossible, although I don't see a theoretical reason why it needs to be; the RAID device could simply return an I/O error the way the drive itself does. But given that the only way to keep dd from failing on a normal hard drive's bad blocks is to use a different program, dd_rescue, I figure the same will end up being true with mdadm, except I doubt there is any such thing as "mdadm_rescue".



Still, I will ask anyway, and please enlighten me if I am wrong or if you can think of a way to pull some of the data out without the drive instantly crashing out of the array.


Answer



Offhand, try doing a disk dump of the dying drive to a healthy drive, and then add the healthy drive to the array in its place.
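
Something along these lines ought to do it with GNU ddrescue (just a sketch: /dev/sdX stands in for whatever the replacement drive turns out to be, and the mapfile path is arbitrary). Sectors ddrescue cannot read come out as zeros on the clone, so the corresponding files will be corrupt, but md never sees a read error and so never kicks the member:

ddrescue -f -n /dev/sdg /dev/sdX /root/sdg.map    # first pass: copy what reads cleanly, skip bad areas
ddrescue -f -r3 /dev/sdg /dev/sdX /root/sdg.map   # second pass: retry the bad areas a few times
mdadm --create /dev/md0 --assume-clean --level=5 --raid-devices=4 /dev/sde missing /dev/sdX /dev/sdc
mount -o ro -t ext4 /dev/md0 /mnt/raid            # mount read-only while copying data off

Mounting read-only at the end avoids writing into the zeroed holes while you pull data off.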

