I'm in an environment that contains many Supermicro servers equipped with Adaptec and LSI MegaRAID hardware RAID controllers. These controllers contain battery-backed cache modules to help boost write performance and protect data in transit.
A frequent support issue is RAID controller battery failure, which shifts the array from write-back to write-through mode. The system then runs with degraded write speed until a downtime window can be established to power it down and replace the battery.
This is a very routine operation for us, occurring almost weekly across several thousand physical servers. We even have charging stations in place to prep replacement batteries so that they can be swapped in without a charge cycle.
Perhaps I'm spoiled by a long history with HP ProLiant servers and Smart Array RAID controllers, but HP systems typically had battery lifetimes of 4-6 years. HP eventually eliminated RAID batteries around 2009, replacing them with supercapacitor-backed memory modules (flash-backed write cache, or FBWC) that don't require replacement, disposal or a lengthy initial charge cycle.
Since I see the Adaptec and LSI controller battery failures sometimes occurring on systems that have been in service for less than 12 months, I wonder if this is common in other environments.
If this is common, how do other large server environments handle this?
- Any tips or tricks for handling RAID battery replacements?
- Are there any configuration parameters that can help?
- How disruptive is this to operations in your environment?
- Could poor chassis cooling and temperature be a factor?
- Are we doing something wrong?
- Dell PERC controllers are made by LSI. Do Dell environments experience the same short battery lifetimes?
LSI product literature outlining a new-generation battery that can last longer in service than 1 year.
HP ProLiant DL585 G2 server with 1000+ day uptime and a happy RAID battery...
# uptime
05:38:08 up 1031 days, 44 min, 31 users, load average: 0.49, 0.64, 0.99
# hpacucli
Cache Board Present: True
Cache Status: OK
Accelerator Ratio: 50% Read / 50% Write
Total Cache Size: 512 MB
Battery Pack Count: 1
Battery Status: OK
Answer
I suspect your Supermicros are broken one way or the other - possibly the battery packs are overheating. Most recent LSI controllers report the battery temperature through MegaCLI, so you might want to monitor this value on the servers that needed replacements.
root@host:~/SOLARIS# ./MegaCli -AdpBbuCmd -GetBbuStatus -aALL
BBU status for Adapter: 0
BatteryType: BBU
[...]
Temperature: 41 C
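To track that value over time, the MegaCli output could be parsed by a small script. The following is only a rough sketch: it assumes the text layout shown above (a "BBU status for Adapter: N" header followed by a "Temperature: NN C" line), and the function names and the 45 C alert limit are made up for illustration, not taken from LSI documentation.

```python
import re

# Patterns based on the output layout shown above (an assumption;
# other MegaCli versions may format these lines differently).
ADAPTER_RE = re.compile(r"BBU status for Adapter:\s*(\d+)")
TEMP_RE = re.compile(r"\s*Temperature\s*:\s*(\d+)\s*C\b")


def parse_bbu_temperatures(output):
    """Return {adapter_number: temperature_in_C} from MegaCli BBU status text."""
    temps = {}
    adapter = None
    for line in output.splitlines():
        header = ADAPTER_RE.search(line)
        if header:
            adapter = int(header.group(1))
            continue
        reading = TEMP_RE.match(line)
        # Keep only the first numeric temperature seen for each adapter.
        if reading and adapter is not None and adapter not in temps:
            temps[adapter] = int(reading.group(1))
    return temps


def overheated(temps, limit_c=45):
    """List adapters above limit_c (45 is a hypothetical threshold)."""
    return sorted(a for a, t in temps.items() if t > limit_c)


sample = """BBU status for Adapter: 0
BatteryType: BBU
Temperature: 41 C
"""
readings = parse_bbu_temperatures(sample)
print(readings, overheated(readings))
```

In practice the text would come from running the same MegaCli command on each host and logging the result, so that temperatures on servers with failed batteries can be compared against healthy ones.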
I have seen a couple of Dell and Fujitsu systems with LSI BBU controllers, and none of them needed yearly battery pack replacement (unless the pack had been damaged by a deep discharge). The typical lifetime has been around 3 to 5 years.