raid - HP SmartArray P400: How to repair failed logical drive?

Friday, July 27, 2018

I have an HP server with a SmartArray P400 controller (incl. 256 MB cache/battery backup). One logicaldrive had a failed physicaldrive; I replaced the drive, but the logicaldrive does not rebuild.

This is how it looked when I detected the error:


~# /usr/sbin/hpacucli ctrl slot=0 show config
Smart Array P400 in Slot 0 (Embedded) (sn: XXXX)

  array A (SATA, Unused Space: 0 MB)
    logicaldrive 1 (698.6 GB, RAID 1, OK)
      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 750 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 750 GB, OK)


  array B (SATA, Unused Space: 0 MB)
    logicaldrive 2 (2.7 TB, RAID 5, Failed)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 750 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SATA, 750 GB, OK)
      physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SATA, 750 GB, OK)
      physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SATA, 750 GB, Failed)
      physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SATA, 750 GB, OK)

  unassigned
      physicaldrive 2I:1:8 (port 2I:box 1:bay 8, SATA, 750 GB, OK)

~#

I thought that I had drive 2I:1:8 configured as a spare for Array A and Array B, but it seems this was not the case :-(. I noticed the problem because of I/O errors on the host, even though only 1 physicaldrive of the RAID 5 has failed.

Does someone know why this could happen? With a single failed drive, the logicaldrive should go into "Degraded" mode but still be fully accessible from the host OS!?
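Since the failure was only noticed through I/O errors on the host, one option is to scan the `show config` output for any drive whose status is not OK. The following is a minimal sketch, assuming the output keeps the line layout shown above; the function name `find_unhealthy` and the trimmed `SAMPLE` text are illustrative, not part of any HP tool.

```python
import re

# Sample lines copied (and trimmed) from the `show config` output above.
SAMPLE = """\
  array A (SATA, Unused Space: 0 MB)
    logicaldrive 1 (698.6 GB, RAID 1, OK)
      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 750 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 750 GB, OK)

  array B (SATA, Unused Space: 0 MB)
    logicaldrive 2 (2.7 TB, RAID 5, Failed)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 750 GB, OK)
      physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SATA, 750 GB, Failed)
"""

# Matches lines such as "logicaldrive 2 (2.7 TB, RAID 5, Failed)".
# Assumption: the status is always the last comma-separated field.
DRIVE_LINE = re.compile(r"^\s*(logicaldrive|physicaldrive)\s+(\S+)\s+\((.*)\)\s*$")


def find_unhealthy(config_text):
    """Return (kind, id, status) for every drive whose status is not OK."""
    problems = []
    for line in config_text.splitlines():
        match = DRIVE_LINE.match(line)
        if not match:
            continue  # array headers, blank lines, prompts
        kind, drive_id, details = match.groups()
        status = details.split(",")[-1].strip()
        if status != "OK":
            problems.append((kind, drive_id, status))
    return problems


if __name__ == "__main__":
    for kind, drive_id, status in find_unhealthy(SAMPLE):
        print(f"{kind} {drive_id}: {status}")
```

In practice the text would come from the real command output rather than an embedded sample. Spare and unassigned drives that report OK are ignored by this check.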

I first tried to add the unassigned drive 2I:1:8 as a spare to logicaldrive 2, but this was not possible:



~# /usr/sbin/hpacucli ctrl slot=0 array B add spares=2I:1:8
    Error: This operation is not supported with the current configuration.
    Use the "show" command on devices to show additional details 
    about the configuration.
~#

Interestingly, it is possible to add the unassigned drive to the first array without problems. I thought maybe the controller put the array into the "Failed" state due to the missing spare and protects failed arrays from modification. So I tried to re-enable the logicaldrive (to add the spare afterwards):



~# /usr/sbin/hpacucli ctrl slot=0 ld 2 modify reenable
    Warning: Any previously existing data on the logical drive may not 
    be valid or recoverable. Continue? (y/n) y

    Error: This operation is not supported with the current configuration.
    Use the "show" command on devices to show additional details
    about the configuration.
~#

But as you can see, re-enabling the logicaldrive was not possible either.

I then replaced the failed drive by hot-swapping it with the unassigned drive. The status now looks like this:


~# /usr/sbin/hpacucli ctrl slot=0 show config
Smart Array P400 in Slot 0 (Embedded) (sn: XXXX)

  array A (SATA, Unused Space: 0 MB)
    logicaldrive 1 (698.6 GB, RAID 1, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 750 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 750 GB, OK)

  array B (SATA, Unused Space: 0 MB)
    logicaldrive 2 (2.7 TB, RAID 5, Failed)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 750 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SATA, 750 GB, OK)
      physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SATA, 750 GB, OK)
      physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SATA, 750 GB, OK)
      physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SATA, 750 GB, OK)

~#

The logical drive is still not accessible. Why is it not rebuilding?

What can I do?

FYI, this is the configuration of my controller:



~# /usr/sbin/hpacucli ctrl slot=0 show
 Smart Array P400 in Slot 0 (Embedded)
  Bus Interface: PCI
  Slot: 0
  Serial Number: XXXX
  Cache Serial Number: XXXX
  RAID 6 (ADG) Status: Enabled
  Controller Status: OK
  Chassis Slot:
  Hardware Revision: Rev E

  Firmware Version: 5.22
  Rebuild Priority: Medium
  Expand Priority: Medium
  Surface Scan Delay: 15 secs
  Surface Analysis Inconsistency Notification: Disabled
  Raid1 Write Buffering: Disabled
  Post Prompt Timeout: 0 secs
  Cache Board Present: True
  Cache Status: OK
  Accelerator Ratio: 25% Read / 75% Write

  Drive Write Cache: Disabled
  Total Cache Size: 256 MB
  No-Battery Write Cache: Disabled
  Cache Backup Power Source: Batteries
  Battery/Capacitor Count: 1
  Battery/Capacitor Status: OK
  SATA NCQ Supported: True
~#

Thanks for your help in advance.

Answer

The answer is not pleasant. There's a high probability that your array is in a "waiting for rebuild" state, where another failing disk in the RAID 5 array set is preventing the recovery from completing. This is why you should avoid RAID 5 these days. It doesn't help that these are SATA drives; the likelihood of problems is even higher.

Try powering the system off (letting the drives spin down) and powering it back on. Follow the prompts at the BIOS array screen and choose the F2 option to "reenable all logical drives". This may kickstart the rebuild process.

Otherwise, it's a rebuild/recovery with new disks.
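To illustrate why a second failing disk blocks a RAID 5 rebuild, here is a simplified sketch of XOR parity over one stripe. It is an illustration only, not the P400 firmware's actual implementation: real controllers rotate parity across the drives and work on much larger blocks, and the helper names `xor_blocks` and `rebuild_missing` are made up for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe):
    """Rebuild the single unreadable block (None) of one stripe.

    `stripe` holds the data blocks plus the parity block. XOR-ing all
    surviving blocks yields the missing one, but only if exactly one
    block is missing.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError(f"expected exactly one missing block, found {len(missing)}")
    survivors = [block for block in stripe if block is not None]
    return xor_blocks(survivors)


# Five "drives": four data blocks plus one parity block, as in array B.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
stripe = data + [xor_blocks(data)]

# One drive lost: its block can be recomputed from the other four.
one_lost = list(stripe)
one_lost[3] = None
assert rebuild_missing(one_lost) == b"DDDD"

# A second unreadable block during the rebuild: recovery is impossible.
two_lost = list(one_lost)
two_lost[1] = None
try:
    rebuild_missing(two_lost)
except ValueError as exc:
    print(f"rebuild failed: {exc}")
```

The rebuild has to read every surviving drive in full, so a single unreadable region on any of them can be enough to stall it.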
