linux - mdadm raid1 fails to resync

Saturday, June 6, 2015

I'm trying to solve a problem I'm having with an mdadm raid1.

I have an Ubuntu 9.04 server running on a two-drive software raid1 managed with mdadm. Yesterday one of the drives failed, so I replaced it with a brand-new drive of the same size. I removed the faulty drive, copied the partition from the remaining good drive to the new drive, and then added it to the raid. It re-synced and the system worked fine until the drive that hadn't failed was also marked as failed.

That left the raid running solely on the new drive, so I purchased another drive and repeated the procedure above. Now I had two brand-new drives and the raid was syncing. However, when I checked /proc/mdstat after a few minutes, the raid was no longer syncing.

mdadm --detail /dev/md1 shows: (sdb is the first new drive, and sdc is the second new drive)

root@dola:/home/jjaramillo# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90
  Creation Time : Sat Dec 20 00:42:05 2008
     Raid Level : raid1
     Array Size : 974711680 (929.56 GiB 998.10 GB)
  Used Dev Size : 974711680 (929.56 GiB 998.10 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Wed Jun  2 10:09:35 2010
          State : clean, degraded
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

           UUID : bba497c6:5029ba0b:bfa4f887:c0dc8f3d
         Events : 0.5395594

    Number   Major   Minor   RaidDevice State
       2       8       35        0      spare rebuilding   /dev/sdc3
       1       8       19        1      active sync   /dev/sdb3

I've tried removing and re-adding the drive a few times, but the same thing happens. The raid fails to resync. I've looked at /var/log/messages, and found the following:

Jun 2 07:57:36 dola kernel: [35708.917337] sd 5:0:0:0: [sdb] Unhandled sense code
Jun 2 07:57:36 dola kernel: [35708.917339] sd 5:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 2 07:57:36 dola kernel: [35708.917342] sd 5:0:0:0: [sdb] Sense Key : Medium Error [current] [descriptor]
Jun 2 07:57:36 dola kernel: [35708.917346] Descriptor sense data with sense descriptors (in hex):
Jun 2 07:57:36 dola kernel: [35708.917348] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Jun 2 07:57:36 dola kernel: [35708.917357] 00 43 9e 47
Jun 2 07:57:36 dola kernel: [35708.917360] sd 5:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed

So it looks like there's some kind of error on sdb (the first new drive). My question is: what would be the best approach to get the raid up and running again? I've thought about dd'ing /dev/md1 to a blank hard drive, then redoing the raid from scratch and loading the data back, but there could be an easier solution.

Any help would be appreciated.

Answer

You shouldn't attempt to prepare the new drive in any meaningful way unless your raid constituents are disk partitions rather than whole disks. In that case, create a partition on the new drive that is the same size as the one on the remaining active disk.
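
For the partition case, one common way to get a matching partition is to dump the partition table of the surviving disk with sfdisk and replay it onto the new disk. The following is only a sketch, assuming an MBR partition table and the device names from the question (sdb as the surviving disk, sdc as the blank replacement). The function name clone_partition_table and the DRY_RUN switch are made up for illustration. By default it only prints the command, since writing a partition table to the wrong disk is destructive.

```shell
#!/bin/sh
# Sketch: copy the partition layout from the surviving disk to the new disk.
# clone_partition_table and DRY_RUN are hypothetical names, not part of mdadm.

DRY_RUN=${DRY_RUN:-1}   # 1 = only print the command, anything else = run it

clone_partition_table() {
    src=$1   # disk that still holds the good array member, e.g. /dev/sdb
    dst=$2   # blank replacement disk, e.g. /dev/sdc

    # Refuse to overwrite the source disk with its own table.
    if [ "$src" = "$dst" ]; then
        echo "source and destination must differ" >&2
        return 1
    fi

    if [ "$DRY_RUN" = 1 ]; then
        # Show what would be executed so it can be reviewed first.
        echo "sfdisk -d $src | sfdisk $dst"
    else
        # Dump the source partition table and write it to the new disk.
        sfdisk -d "$src" | sfdisk "$dst"
    fi
}

clone_partition_table /dev/sdb /dev/sdc
```

Run it once as-is to review the printed command, then with DRY_RUN=0 (as root) after double-checking which device is which.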

You never need to touch the old drive at all -- it's assumed to be failed and unreliable.

The correct procedure is to remove the broken drive, add a new, empty drive, and then use mdadm to add that new drive to the array. You'd do it something like this:

mdadm --add /dev/md0 /dev/

The kernel will then sync the new drive into the array, copying the data from the one remaining good drive.
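
The whole sequence described above can be sketched as a short list of mdadm commands. This is a rough outline rather than a definitive procedure: /dev/md1 and /dev/sdc3 are taken from the question, /dev/sda3 is a hypothetical name for the member being retired, and print_replace_steps is a made-up helper that only prints the commands so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Sketch: print the mdadm steps for swapping one member of a RAID1 array.
# print_replace_steps is a hypothetical helper; it only echoes commands.

print_replace_steps() {
    array=$1    # md device, e.g. /dev/md1
    failed=$2   # member being retired (/dev/sda3 below is a hypothetical name)
    fresh=$3    # partition on the new disk, e.g. /dev/sdc3

    # Mark the old member as failed and detach it from the array.
    echo "mdadm --manage $array --fail $failed"
    echo "mdadm --manage $array --remove $failed"
    # Attach the new member; the kernel starts the rebuild on its own.
    echo "mdadm --manage $array --add $fresh"
    # Check rebuild progress.
    echo "cat /proc/mdstat"
}

print_replace_steps /dev/md1 /dev/sda3 /dev/sdc3
```

If the old drive has already been physically removed, the first two steps may not apply, and only the --add step and the progress check remain.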
