We've struggled with the RAID controller in our database server, a Lenovo ThinkServer RD120. It is a rebranded Adaptec that Lenovo / IBM dubs the ServeRAID 8k.
We have patched this ServeRAID 8k up to the very latest and greatest:
- RAID BIOS version
- RAID backplane BIOS version
- Windows Server 2008 driver
This RAID controller has had multiple critical BIOS updates even in the short four months we've owned it, and the change history is just... well, scary.
We've tried both write-back and write-through strategies on the logical RAID drives. We still get intermittent I/O errors under heavy disk activity. They are not common, but serious when they happen, as they cause SQL Server 2008 I/O timeouts and sometimes failure of SQL connection pools.
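For anyone trying to reproduce this kind of intermittent failure, a minimal stress-and-verify harness might look like the sketch below. It is hypothetical, not a vendor tool; the scratch path and sizes are assumptions, and you'd point it at the RAID volume and size it larger than RAM so read-back isn't served entirely from the OS file cache.

```python
# Hypothetical stress-and-verify harness (not a vendor tool): write
# pseudo-random blocks to the suspect array, flush, then read them back
# and compare checksums to surface intermittent I/O errors or corruption.
import hashlib
import os

SCRATCH = r"D:\stress\scratch.bin"  # assumption: a path on the RAID volume
BLOCK = 1024 * 1024                 # 1 MiB per block
COUNT = 8192                        # ~8 GiB; size larger than RAM so reads
                                    # aren't served from the OS cache

def write_pass(seed: bytes) -> list:
    """Write COUNT deterministic blocks, returning their checksums."""
    digests = []
    with open(SCRATCH, "wb") as f:
        for i in range(COUNT):
            chunk = hashlib.sha256(seed + i.to_bytes(8, "big")).digest()
            block = chunk * (BLOCK // len(chunk))
            digests.append(hashlib.sha256(block).hexdigest())
            f.write(block)
        f.flush()
        os.fsync(f.fileno())        # push writes through to the controller
    return digests

def verify_pass(digests) -> int:
    """Read the file back and count blocks that don't match."""
    errors = 0
    with open(SCRATCH, "rb") as f:
        for i, expected in enumerate(digests):
            block = f.read(BLOCK)
            if hashlib.sha256(block).hexdigest() != expected:
                errors += 1
                print(f"block {i}: checksum mismatch (I/O error or corruption)")
    return errors

if __name__ == "__main__":
    os.makedirs(os.path.dirname(SCRATCH), exist_ok=True)
    digests = write_pass(os.urandom(16))
    print(f"{verify_pass(digests)} bad blocks out of {COUNT}")
```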
We were at the end of our rope troubleshooting this problem. Short of hardcore measures like replacing the entire server or the RAID hardware, we were getting desperate.
When I first got the server, I had a problem where drive bay #6 wasn't recognized. Strangely, switching to a different brand of hard drive fixed this -- and updating the RAID BIOS (for the first of many times) fixed it permanently, so I was able to use the original "incompatible" drive in bay 6. On a hunch, I began to suspect that the Western Digital SATA hard drives I chose were somehow incompatible with the ServeRAID 8k controller.
Buying 6 new hard drives was one of the cheaper options on the table, so I went for 6 Hitachi (aka IBM, aka Lenovo) hard drives under the theory that an IBM/Lenovo RAID controller is more likely to work with the drives it's typically sold with.
Looks like that hunch paid off -- we've been through three of our heaviest load days (Monday, Tuesday, Wednesday) without a single I/O error of any kind. Prior to this we regularly had at least one I/O "event" in that time frame. It sure looks like switching brands of hard drive has fixed our intermittent RAID I/O problems!
While I understand that IBM/Lenovo probably tests their RAID controller exclusively with their own brand of hard drives, I'm disturbed that a RAID controller would have such subtle I/O problems with particular brands of hard drives.
So my question is: is this sort of SATA drive incompatibility common with RAID controllers? Are there brands of drives that work better than others, or that are "validated" against particular RAID controllers? I had sort of assumed that all commodity SATA hard drives were alike and would work reasonably well with any given RAID controller (of sufficient quality).
Answer
Yes, I have encountered this with low-end cards and buggy drivers. But no, not on an up-to-date rebranded Adaptec card -- wow is all I can say. One thing to consider: it may be more a bug with the drives than with the RAID controller.
I don't have a good answer, but since you seem to have exhausted most of your options short of replacing the card (and replacing the drives did the trick), here are a few ideas to consider in your troubleshooting:
The WD drives were RE (RAID Edition) drives, right? Time-limited error recovery matters: if you don't have it and the drive is attempting to recover a bad sector, you are going to get a looooong pause from that drive. If the RAID controller is being patient and not dropping the drive, you'll have a big problem on your hands -- exactly the kind of stall that produces I/O timeouts. You can check for a bounded recovery timeout with the sketch below.
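A minimal sketch, assuming smartmontools is installed and the drives are attached where smartctl can reach them (drives sitting behind a hardware RAID controller often aren't directly queryable, so this is easiest on the pulled drives in another machine). The device paths are placeholders; SCT Error Recovery Control is the ATA feature behind WD's TLER.

```python
# Sketch: query SCT Error Recovery Control (the ATA feature behind WD's
# TLER) with smartmontools. Assumptions: smartctl is on the PATH and the
# drives are attached where it can reach them.
import subprocess

DEVICES = ["/dev/sda", "/dev/sdb"]  # placeholder device paths

for dev in DEVICES:
    result = subprocess.run(
        ["smartctl", "-l", "scterc", dev],
        capture_output=True, text=True,
    )
    print(f"--- {dev} ---")
    print(result.stdout.strip())
    # "Disabled" means the drive has no bounded read-recovery timeout and
    # can stall for tens of seconds on a bad sector instead of failing
    # fast back to the RAID controller.
```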
Check the SMART data on the drives you removed and see if there is anything interesting.
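For example, a small sketch that filters smartctl's attribute table down to the counters most likely to explain intermittent errors; the attribute names follow smartmontools' standard table, and the device path is a placeholder for one of the pulled drives.

```python
# Sketch: filter smartctl's attribute table down to the counters most
# likely to explain intermittent I/O errors. The device path is a
# placeholder for one of the pulled WD drives.
import subprocess

SUSPECT_ATTRS = (
    "Reallocated_Sector_Ct",   # sectors already remapped
    "Current_Pending_Sector",  # sectors waiting to be remapped
    "Offline_Uncorrectable",   # sectors that failed offline reads
    "UDMA_CRC_Error_Count",    # cabling/backplane signal problems
)

out = subprocess.run(
    ["smartctl", "-A", "/dev/sdb"],
    capture_output=True, text=True,
).stdout

for line in out.splitlines():
    if any(attr in line for attr in SUSPECT_ATTRS):
        print(line)
```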
Another comment on the importance of the time-limited error recovery (TLER) feature, from a NAS/RAID vendor's support staff:
As I mentioned before, we always suggest customers use enterprise-level drives when the drives are used in RAID configurations. Enterprise-level drives have more consistent response times, so the RAID will be safer.