Monday, January 9, 2017

raid - Why can't I recover a RAID1 disk which apparently has no errors?

For several years now we run a software RAID1 on a old Gentoo box with a custom Linux 2.6.31. The RAID consists of 2 harddisk with 4 partitions each. In the last years it happend about 3-4 times that a disk was thrown out of the array. But each time badblocks didn't report an error and i was able to reactivate the disk like this



mdadm /dev/md3 -r /dev/sda3

mdadm /dev/md3 -a /dev/sda3


This time the situation is different: mdadm reported 2 faulty partitions over the past 24 h, both on the same disk sda. Again I ran badblocks without success: It reported 0 bad blocks. If I try to add the faulty disk back to the array, it breaks with the same error each time:



Mar 25 23:09:10 xen0 kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Mar 25 23:09:10 xen0 kernel: ata1.00: irq_stat 0x40000001
Mar 25 23:09:10 xen0 kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Mar 25 23:09:10 xen0 kernel: res 51/04:00:38:df:f7/00:00:00:00:00/a7 Emask 0x1 (device error)
Mar 25 23:09:10 xen0 kernel: ata1.00: status: { DRDY ERR }

Mar 25 23:09:10 xen0 kernel: ata1.00: error: { ABRT }
Mar 25 23:09:10 xen0 kernel: ata1.00: configured for UDMA/133
Mar 25 23:09:10 xen0 kernel: ata1: EH complete
Mar 25 23:09:10 xen0 kernel: end_request: I/O error, dev sda, sector 18297870
Mar 25 23:09:10 xen0 kernel: md: super_written gets error=-5, uptodate=0
Mar 25 23:09:10 xen0 kernel: md: md3: recovery done.
Mar 25 23:09:10 xen0 kernel: RAID1 conf printout:
Mar 25 23:09:10 xen0 kernel: --- wd:1 rd:2
Mar 25 23:09:10 xen0 kernel: disk 0, wo:1, o:0, dev:sda3
Mar 25 23:09:10 xen0 kernel: disk 1, wo:0, o:1, dev:sdb3

Mar 25 23:09:10 xen0 kernel: RAID1 conf printout:
Mar 25 23:09:10 xen0 kernel: --- wd:1 rd:2
Mar 25 23:09:10 xen0 kernel: disk 1, wo:0, o:1, dev:sdb3


The sector is always the same: 18297870. If I check the output of smartctl -a /dev/sda it doesn't show any reallocated sectors, so iIthought, I could enforce reallocation with a method I found here.



hdparm --read-sector  18297870 /dev/sda
hdparm --write-sector 18297870 --yes-i-know-what-i-am-doing /dev/sda



But with the above commands the disk does not report any error at all - and thus does not reallocate the sector:



smartctl -a /dev/sda | grep -i reall
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0


I googled for the status: { DRDY ERR } message and some say this could be a kernel bug. But then I wonder why this just started to happen now. There was no change to the system over the past years.




There's one thing which still bugs me though: smartctl -a /dev/sda also reports some errors in the logs. It's always the same error:



Error 15 occurred at disk power-on lifetime: 39971 hours (1665 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 00 38 df f7 a7


Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ea 00 00 00 00 00 00 08 35d+12:17:45.571 FLUSH CACHE EXT
61 80 f0 0e 96 2d 00 08 35d+12:17:44.033 WRITE FPDMA QUEUED
61 80 e8 8e 95 2d 00 08 35d+12:17:44.033 WRITE FPDMA QUEUED
61 80 e0 0e 95 2d 00 08 35d+12:17:44.033 WRITE FPDMA QUEUED
61 80 d8 8e 94 2d 00 08 35d+12:17:44.033 WRITE FPDMA QUEUED



All this doesn't make sense to me: If the disk is really faulty, then why can i successfully read and write to the sector at which the RAID resync fails each time? And why doesn't the disk first try to reallocate the broken sector if it really detects an error?

No comments:

Post a Comment

linux - How to SSH to ec2 instance in VPC private subnet via NAT server

I have created a VPC in aws with a public subnet and a private subnet. The private subnet does not have direct access to external network. S...