Thursday, April 11, 2019

raid - Why would a RAID5 rebuild fail?



I have an IBM System x3650 server with a ServeRAID controller and two RAID5 arrays, each consisting of 3 disks.



Yesterday, one disk failed (in the RAID array that holds the data; the system is on the other, healthy array). I naively trusted the RAID controller to rebuild the array. I shut down the server and replaced the failed disk with a new, similar one. I booted into the controller BIOS, where I could see that it had recognized the new disk and was ready to rebuild (I had nothing to do; everything was automatic). I started the server and it rebuilt the array.




This morning everything seemed OK. The rebuild had finished and the array seemed sound. Only a few hours later, the MySQL service crashed with a corrupted database. I managed to dump part of the data and restored the rest from backup. I thought I was OK.



But then I found that some active log files were corrupt: they contained blocks from different, random files. If I understand the situation correctly, only files modified since the rebuild started are corrupted, but I'm not yet 100% sure of this. Somehow, the rebuild must have corrupted the data.
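One way to test the suspicion that only files modified since the rebuild started are affected is to list every file whose modification time falls after that moment and inspect those first. The following is a minimal sketch, not a definitive tool: the function name `files_modified_since`, the example path, and the example start time are all hypothetical, and a recent modification time only marks a file as a candidate; it does not prove corruption.

```python
import os
from datetime import datetime


def files_modified_since(root, since):
    """Return sorted paths under `root` whose mtime is at or after `since`.

    `since` is a datetime for the (assumed) moment the rebuild started.
    """
    cutoff = since.timestamp()
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
            if mtime >= cutoff:
                hits.append(path)
    return sorted(hits)


# Hypothetical usage: the path and the rebuild start time are placeholders.
# for path in files_modified_since("/path/to/data", datetime(2019, 4, 10, 18, 0)):
#     print(path)
```

Comparing the listed files against the most recent backup would then show whether the damage is really confined to them.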



I am asking this question to learn from the mistake. I hope there will never be a next time...



What could be the reason the rebuild failed? What can I do better next time?
Is it compulsory to disconnect the server from the network during a rebuild? I thought the controller should be able to handle the rebuild concurrently with ordinary reads and writes.
Or should this never happen, and maybe the controller is faulty?


Answer



From your description, it seems that the rebuild did not fail, in the sense that the array was up and running. However, it seems that the rebuild process caused some blocks to be wrongly placed/remapped, which is an extraordinarily rare but dangerous thing.
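To see why a rebuild can complete "successfully" and still leave bad data behind, it helps to recall how RAID5 reconstructs a lost block: within each stripe, the missing block is the XOR of the surviving blocks (data plus parity). The sketch below illustrates that general principle only. It is not the ServeRAID firmware's actual algorithm, and the function names and sample block contents are made up for the example.

```python
from functools import reduce


def xor_blocks(*blocks):
    """XOR equally sized byte blocks together, byte by byte."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild_missing(surviving):
    """Reconstruct the lost block of one stripe from the surviving blocks."""
    return xor_blocks(*surviving)


# One stripe of a 3-disk RAID5: two data blocks and one parity block.
data_a = b"mysql-page-0001!"
data_b = b"logfile-chunk-07"
parity = xor_blocks(data_a, data_b)

# The disk holding data_b fails; the rebuild recovers it from the other two.
assert rebuild_missing([data_a, parity]) == data_b

# If a surviving block is read from the wrong place, the rebuild still
# "succeeds" but silently writes garbage to the new disk.
misplaced = b"some-other-file!"
assert rebuild_missing([misplaced, parity]) != data_b
```

The point of the second case is that plain XOR parity carries no information about which block is wrong, so a misplaced or misread block during the rebuild produces an array that looks healthy while holding corrupted data.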




I suggest you take the time to examine the situation. Did you read and follow the RAID card manual? Are you 100% sure that you did the right things? If the answer to both questions is "yes", you should immediately open a support case with your server vendor/consultant.

