Saturday, July 11, 2015

Hardware RAID controller cache battery failure frequency/lifetime?




I'm in an environment that contains many Supermicro servers equipped with Adaptec and LSI MegaRAID hardware RAID controllers. These controllers contain battery-backed cache modules to help boost write performance and protect data in-transit.



A frequent support issue is RAID controller battery failure, which shifts the array from write-back to write-through mode. The system then runs with degraded write speed until a downtime window can be scheduled to power it down and replace the battery.



This is a very routine operation for us, occurring almost weekly across several thousand physical servers. We even have charging stations in place to prep replacement batteries so that they can be swapped in without a charge cycle.



Perhaps I'm spoiled by a long history with HP ProLiant servers and Smart Array RAID controllers, but HP systems typically had battery lifetimes of 4-6 years. HP eventually eliminated RAID batteries around 2009, replacing them with supercapacitor-backed memory modules (flash-backed write cache, or FBWC), which don't require replacement, disposal or a lengthy initial charge cycle.



Since I see the Adaptec and LSI controller battery failures sometimes occurring on systems that have been in service for less than 12 months, I wonder if this is common in other environments.




If this is common, how do other large server environments handle this?




  • Any tips or tricks to handling RAID battery replacements?

  • Are there any configuration parameters that can help?

  • How disruptive is this to operations in your environment?

  • Could poor chassis cooling and temperature be a factor?

  • Are we doing something wrong?

  • Dell PERC controllers are made by LSI. Do Dell environments experience the same short battery lifetimes?




LSI product literature outlining a new-generation battery that can last longer in service than 1 year.



HP ProLiant DL585 G2 server with 1000+ day uptime and a happy RAID battery...



# uptime 
05:38:08 up 1031 days, 44 min, 31 users, load average: 0.49, 0.64, 0.99

# hpacucli

Cache Board Present: True
Cache Status: OK
Accelerator Ratio: 50% Read / 50% Write
Total Cache Size: 512 MB
Battery Pack Count: 1
Battery Status: OK

Answer



I suspect something is wrong with your Supermicros; possibly the battery packs are overheating. Most recent LSI controllers report the BBU temperature through MegaCli, so you might want to monitor this value on the servers that have needed battery replacements.




root@host:~/SOLARIS# ./MegaCli -AdpBbuCmd -GetBbuStatus -aALL

BBU status for Adapter: 0

BatteryType: BBU
[...]
Temperature: 41 C
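
To monitor this value across many servers, the output above could be parsed by a small script. The following is only a minimal sketch in Python, assuming the output format matches the excerpt shown here; the function names and the 45 C warning threshold are hypothetical and should be adjusted to whatever your battery vendor specifies.

```python
import re

# Patterns based on the MegaCli excerpt above; other firmware versions
# may format these lines differently (assumption).
ADAPTER_RE = re.compile(r"BBU status for Adapter:\s*(\d+)")
TEMP_RE = re.compile(r"Temperature\s*:\s*(\d+)\s*C\b")


def parse_bbu_temperatures(output):
    """Return {adapter_number: temperature_in_C} from GetBbuStatus text."""
    temps = {}
    adapter = None
    for line in output.splitlines():
        line = line.strip()
        adapter_match = ADAPTER_RE.match(line)
        if adapter_match:
            adapter = int(adapter_match.group(1))
            continue
        temp_match = TEMP_RE.match(line)
        if temp_match and adapter is not None:
            # Keep the first numeric reading seen for each adapter.
            temps.setdefault(adapter, int(temp_match.group(1)))
    return temps


def hot_adapters(temps, threshold_c=45):
    """List adapters at or above threshold_c (hypothetical default)."""
    return sorted(a for a, t in temps.items() if t >= threshold_c)


if __name__ == "__main__":
    sample = (
        "BBU status for Adapter: 0\n"
        "\n"
        "BatteryType: BBU\n"
        "Temperature: 41 C\n"
    )
    readings = parse_bbu_temperatures(sample)
    print(readings, hot_adapters(readings))
```

In practice the text would come from running the MegaCli command shown above on each host, with the readings fed into whatever monitoring system is already in place, so that servers with repeated battery failures can be compared against healthy ones.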


I have seen a couple of Dell and Fujitsu systems with LSI BBU controllers, and none of them needed yearly battery pack replacement (unless the pack had been damaged by a deep discharge). The typical lifetime has been around 3 to 5 years.


