Friday, July 27, 2018

raid - HP SmartArray P400: How to repair failed logical drive?



I have an HP server with a Smart Array P400 controller (incl. 256 MB cache/battery backup). One of its logical drives had a failed physical drive; I replaced that drive, but the logical drive does not rebuild.



This is how it looked when I detected the error:





~# /usr/sbin/hpacucli ctrl slot=0 show config
Smart Array P400 in Slot 0 (Embedded) (sn: XXXX)

array A (SATA, Unused Space: 0 MB)
logicaldrive 1 (698.6 GB, RAID 1, OK)
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 750 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 750 GB, OK)


array B (SATA, Unused Space: 0 MB)
logicaldrive 2 (2.7 TB, RAID 5, Failed)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 750 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SATA, 750 GB, OK)
physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SATA, 750 GB, OK)
physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SATA, 750 GB, Failed)
physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SATA, 750 GB, OK)

unassigned
physicaldrive 2I:1:8 (port 2I:box 1:bay 8, SATA, 750 GB, OK)

~#


I thought that I had drive 2I:1:8 configured as a spare for array A and array B, but it seems this was not the case :-(. I noticed the problem because of I/O errors on the host, even though only one physical drive of the RAID 5 has failed.
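Spotting a failed member in this listing by eye is easy to miss. As a minimal sketch, assuming the `show config` layout shown above (other hpacucli or firmware versions may print it differently), a small filter can list every logical or physical drive whose status is not OK. The function name `list_not_ok` is hypothetical and not part of hpacucli.

```shell
#!/bin/sh
# Hypothetical helper: read `hpacucli ctrl slot=0 show config` output on
# stdin and print each logicaldrive/physicaldrive line that does not carry
# an "OK" status field inside its parentheses.
list_not_ok() {
    awk '
        /^[ \t]*(logicaldrive|physicaldrive) / && !/, OK[,)]/ {
            sub(/^[ \t]+/, "")   # trim leading indentation, if any
            print
        }
    '
}

# Usage (run as root on the affected host):
#   /usr/sbin/hpacucli ctrl slot=0 show config | list_not_ok
```

Run from cron with a mail on non-empty output, something like this might have flagged the failed drive before I/O errors showed up on the host.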



Does anyone know why this could happen? The logical drive should go into "Degraded" mode but still be fully accessible from the host OS!?



I first tried to add the unassigned drive 2I:1:8 as a spare to logicaldrive 2, but this was not possible:





~# /usr/sbin/hpacucli ctrl slot=0 array B add spares=2I:1:8
Error: This operation is not supported with the current configuration.
Use the "show" command on devices to show additional details
about the configuration.
~#


Interestingly, it is possible to add the unassigned drive to the first array without problems. I thought maybe the controller put the array into "failed" state due to the missing spare and protects failed arrays from modification. So I tried to re-enable the logical drive (in order to add the spare afterwards):





~# /usr/sbin/hpacucli ctrl slot=0 ld 2 modify reenable
Warning: Any previously existing data on the logical drive may not
be valid or recoverable. Continue? (y/n) y

Error: This operation is not supported with the current configuration.
Use the "show" command on devices to show additional details
about the configuration.
~#



But as you can see, re-enabling the logical drive was not possible either.



I then replaced the failed drive by hot-swapping it with the unassigned drive. The status now looks like this:




~# /usr/sbin/hpacucli ctrl slot=0 show config
Smart Array P400 in Slot 0 (Embedded) (sn: XXXX)

array A (SATA, Unused Space: 0 MB)
logicaldrive 1 (698.6 GB, RAID 1, OK)

physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 750 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 750 GB, OK)

array B (SATA, Unused Space: 0 MB)
logicaldrive 2 (2.7 TB, RAID 5, Failed)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 750 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SATA, 750 GB, OK)
physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SATA, 750 GB, OK)
physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SATA, 750 GB, OK)
physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SATA, 750 GB, OK)

~#


The logical drive is still not accessible. Why is it not rebuilding?
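To see whether the status ever moves away from "Failed" after the swap, the status field of a single logical drive can be pulled out of the listing and checked repeatedly. This is only a sketch that assumes the `show config` line format shown above (size, RAID level, then status inside the parentheses); `ld_status` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `hpacucli ctrl slot=0 show config` output on
# stdin and print the status text of logical drive number $1.
ld_status() {
    awk -v n="$1" '
        $1 == "logicaldrive" && $2 == n {
            s = $0
            sub(/^[^(]*\(/, "", s)          # drop everything up to the opening "("
            sub(/\)[ \t]*$/, "", s)         # drop the closing ")"
            sub(/^[^,]*, [^,]*, /, "", s)   # drop the size and RAID-level fields
            print s
        }
    '
}

# Usage (run as root on the affected host):
#   /usr/sbin/hpacucli ctrl slot=0 show config | ld_status 2
```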



What can I do?



FYI, this is the configuration of my controller:





~# /usr/sbin/hpacucli ctrl slot=0 show
Smart Array P400 in Slot 0 (Embedded)
Bus Interface: PCI
Slot: 0
Serial Number: XXXX
Cache Serial Number: XXXX
RAID 6 (ADG) Status: Enabled
Controller Status: OK
Chassis Slot:
Hardware Revision: Rev E

Firmware Version: 5.22
Rebuild Priority: Medium
Expand Priority: Medium
Surface Scan Delay: 15 secs
Surface Analysis Inconsistency Notification: Disabled
Raid1 Write Buffering: Disabled
Post Prompt Timeout: 0 secs
Cache Board Present: True
Cache Status: OK
Accelerator Ratio: 25% Read / 75% Write

Drive Write Cache: Disabled
Total Cache Size: 256 MB
No-Battery Write Cache: Disabled
Cache Backup Power Source: Batteries
Battery/Capacitor Count: 1
Battery/Capacitor Status: OK
SATA NCQ Supported: True
~#



Thanks in advance for your help.


Answer



The answer is not pleasant. There's a high probability that your array is in a "waiting for rebuild" state: another disk in the RAID 5 set is failing, and that prevents the recovery from completing. This is why you should avoid RAID 5 these days, and SATA drives make such problems even more likely.

Try powering the system off (letting the drives spin down) and powering it back on. Follow the prompts at the BIOS array screen and choose the F2 option to "reenable all logical drives". This may kickstart the rebuild process.



Otherwise, it's a rebuild/recovery with new disks.
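Before giving up on the array, it may be worth looking at each member disk individually, since a second marginal drive can still be listed as "OK" in the summary. The following is a rough sketch, assuming the `show config` layout from the question; `drives_in_array` is a hypothetical helper that only extracts the drive IDs of one array so they can be passed to `hpacucli ctrl slot=0 pd <id> show` one at a time.

```shell
#!/bin/sh
# Hypothetical helper: read `hpacucli ctrl slot=0 show config` output on
# stdin and print the IDs of the physical drives listed under array $1.
drives_in_array() {
    awk -v want="$1" '
        /^[ \t]*array /     { in_array = ($2 == want); next }   # new array section
        /^[ \t]*unassigned/ { in_array = 0; next }              # unassigned section
        in_array && /^[ \t]*physicaldrive / { print $2 }
    '
}

# Usage (run as root on the affected host):
#   cfg=$(/usr/sbin/hpacucli ctrl slot=0 show config)
#   for pd in $(printf '%s\n' "$cfg" | drives_in_array B); do
#       /usr/sbin/hpacucli ctrl slot=0 pd "$pd" show
#   done
```

Which fields of the per-drive output point to a failing disk depends on the drive and firmware, so the sketch deliberately just prints the output for manual review.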

