Sunday, November 23, 2014

linux - mdadm: drive replacement shows up as spare and refuses to sync



Prelude




I had the following devices in my /dev/md0 RAID 6: /dev/sd[abcdef]



The following drives were also present, unrelated to the RAID: /dev/sd[gh]



The following drives were part of a card reader that was connected, again, unrelated: /dev/sd[ijkl]



Analysis



sdf's SATA cable went bad (you could say it was unplugged while in use), and sdf was subsequently rejected from the /dev/md0 array. I replaced the cable and the drive came back, now as /dev/sdm. Please do not challenge my diagnosis; there is nothing wrong with the drive itself.
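
If you want to double-check that a re-enumerated drive really is the old array member, here is a rough sketch of the sanity checks (device names as above; smartctl comes from smartmontools):

smartctl -i /dev/sdm | grep -i serial     # compare the serial number against the old sdf
ls -l /dev/disk/by-id/ | grep sdm         # persistent by-id names pointing at the disk
mdadm --examine /dev/sdm                  # md superblock: array UUID, device role, event count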




mdadm --detail /dev/md0 showed sdf(F), i.e., that sdf was faulty. So I used mdadm --manage /dev/md0 --remove faulty to remove the faulty drive.
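
For reference, the equivalent step written out with an explicit device name instead of the keyword form (a sketch, not my exact session):

mdadm --manage /dev/md0 --fail /dev/sdf       # already marked (F) here, shown for completeness
mdadm --manage /dev/md0 --remove /dev/sdf     # remove the failed member from md0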



Now mdadm --detail /dev/md0 showed "removed" in the space where sdf used to be.




root@galaxy:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Wed Jul 30 13:17:25 2014
     Raid Level : raid6
     Array Size : 15627548672 (14903.59 GiB 16002.61 GB)
  Used Dev Size : 3906887168 (3725.90 GiB 4000.65 GB)
   Raid Devices : 6
  Total Devices : 5
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Mar 17 21:16:14 2015
          State : active, degraded
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : eclipse:0
           UUID : cc7dac66:f6ac1117:ca755769:0e59d5c5
         Events : 67205

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       32        1      active sync   /dev/sdc
       4       0        0        4      removed
       3       8       48        3      active sync   /dev/sdd
       4       8       64        4      active sync   /dev/sde
       5       8       16        5      active sync   /dev/sdb



For some reason the RaidDevice of the "removed" device now matches one that is active. Anyway, let's try adding the previous device (now known as /dev/sdm) back, since that was the original intent:




root@galaxy:~# mdadm --add /dev/md0 /dev/sdm
mdadm: added /dev/sdm
root@galaxy:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Wed Jul 30 13:17:25 2014
     Raid Level : raid6
     Array Size : 15627548672 (14903.59 GiB 16002.61 GB)
  Used Dev Size : 3906887168 (3725.90 GiB 4000.65 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Mar 17 21:19:30 2015
          State : active, degraded
 Active Devices : 5
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

           Name : eclipse:0
           UUID : cc7dac66:f6ac1117:ca755769:0e59d5c5
         Events : 67623

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       32        1      active sync   /dev/sdc
       4       0        0        4      removed
       3       8       48        3      active sync   /dev/sdd
       4       8       64        4      active sync   /dev/sde
       5       8       16        5      active sync   /dev/sdb

       6       8      192        -      spare   /dev/sdm


As you can see, the device shows up as a spare and refuses to sync with the rest of the array:




root@galaxy:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdm[6](S) sdb[5] sda[0] sde[4] sdd[3] sdc[1]
      15627548672 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/5] [UU_UUU]
      bitmap: 17/30 pages [68KB], 65536KB chunk

unused devices: <none>


I have also tried using mdadm --zero-superblock /dev/sdm before adding, with the same result.
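
Side note: since the array has an internal write-intent bitmap, a --re-add of the old member (before zeroing its superblock) can in principle resync only the blocks that changed while the disk was away. A sketch, not something I verified here:

# Sketch only: --re-add relies on the old superblock plus the bitmap, so it
# cannot work after mdadm --zero-superblock has wiped the metadata.
mdadm --manage /dev/md0 --re-add /dev/sdm
cat /proc/mdstat                           # with a bitmap, recovery may be much shorter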



The reason I am using RAID 6 is to provide high availability. I will not accept stopping /dev/md0 and re-assembling it with --assume-clean or similar workarounds. This needs to be resolved online; otherwise I don't see the point of using mdadm.


Answer




After hours of Googling and some extremely wise help from JyZyXEL in the #linux-raid Freenode channel, we have a solution! There was not a single interruption to the RAID array during this process - exactly what I needed and expected from mdadm.



For some (currently unknown) reason, the RAID state became frozen. The winning command to figure this out is cat /sys/block/md0/md/sync_action:




root@galaxy:~# cat /sys/block/md0/md/sync_action
frozen


Simply put, that is why it was not using the available spare. All that hair pulled out over something a single cat command could reveal!
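
For context, the kernel's md documentation lists the states this file can hold; frozen simply blocks any recovery from starting:

# sync_action values, roughly:
#   idle             - nothing running (writing "idle" also aborts/unfreezes)
#   frozen           - recovery is blocked until this state is cleared
#   resync / recover - reported while a resync or a rebuild is in progress
#   check / repair   - can be written to start a scrub of the array
cat /sys/block/md0/md/sync_action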




So, just unfreeze the array:




root@galaxy:~# echo idle > /sys/block/md0/md/sync_action


And you're away!





root@galaxy:~# cat /sys/block/md0/md/sync_action
recover
root@galaxy:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdm[6] sdb[5] sda[0] sde[4] sdd[3] sdc[1]
      15627548672 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/5] [UU_UUU]
      [>....................]  recovery =  0.0% (129664/3906887168) finish=4016.8min speed=16208K/sec
      bitmap: 17/30 pages [68KB], 65536KB chunk

unused devices: <none>

root@galaxy:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Wed Jul 30 13:17:25 2014
     Raid Level : raid6
     Array Size : 15627548672 (14903.59 GiB 16002.61 GB)
  Used Dev Size : 3906887168 (3725.90 GiB 4000.65 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Mar 17 22:05:30 2015
          State : active, degraded, recovering
 Active Devices : 5
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

 Rebuild Status : 0% complete

           Name : eclipse:0
           UUID : cc7dac66:f6ac1117:ca755769:0e59d5c5
         Events : 73562

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       32        1      active sync   /dev/sdc
       6       8      192        2      spare rebuilding   /dev/sdm
       3       8       48        3      active sync   /dev/sdd
       4       8       64        4      active sync   /dev/sde
       5       8       16        5      active sync   /dev/sdb


Bliss :-)
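
If you want to keep an eye on the rebuild from here, the usual suspects (a sketch; nothing array-specific about it):

watch -n 60 cat /proc/mdstat     # poll the recovery progress once a minute
mdadm --wait /dev/md0            # block until the recovery finishes
mdadm --detail /dev/md0          # all six members should then show "active sync"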

