Wednesday, August 5, 2015

Non-volatile cache RAID controllers: what kind of protection is there against NVCACHE failure?

The battery back-up (BBU) model:




  • admin enables write-back cache with BBU

  • writes are cached to the RAID controller's RAM (major performance benefit)

  • the battery saves uncommitted and cached data in the event of a power loss (reliability)



If I lose power and it comes back within a day or so, while the battery still holds a charge, my data should be both complete and uncorrupted.




The downside is that if the battery is dead or low, OR EVEN IF IT IS IN A RELEARN CYCLE (drain/charge loops that check the battery's health), the controller reverts to write-through mode and performance suffers. What's more, relearn cycles usually run on an automatic schedule, which can easily land in the middle of peak traffic. If that's a concern, the automatic schedule has to be disabled and the relearn kicked off manually during off-hours. Annoying either way.
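To make that fallback concrete, here is a minimal sketch of the behavior as I understand it. It is a toy model, not any vendor's firmware logic or CLI: the `BatteryState` values and the `effective_write_policy` function are names I made up for illustration, and I'm assuming the controller honors write-back only while the battery is fully healthy.

```python
from enum import Enum


class BatteryState(Enum):
    # Hypothetical states; real controllers report their own vendor-specific set.
    OPTIMAL = "optimal"
    LOW = "low"
    DEAD = "dead"
    RELEARNING = "relearning"


def effective_write_policy(configured: str, battery: BatteryState) -> str:
    """Return the policy the controller would actually use in this toy model.

    Assumption: write-back is honored only while the battery can protect the
    cache. In any other state the controller falls back to write-through.
    """
    if configured == "write-back" and battery is not BatteryState.OPTIMAL:
        return "write-through"
    return configured


# Show what an admin who configured write-back really gets in each state.
for state in BatteryState:
    print(f"{state.value:<10} -> {effective_write_policy('write-back', state)}")
```

The point of the sketch is that the configured policy and the effective policy are two different things, and only one of the four states here actually delivers the write-back performance I asked for.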



NV caches instead have capacitors that hold enough charge to copy any data not yet committed to disk into flash. Not only does that survive longer outages, but you also don't have to concern yourself with battery death, wear-out, or relearn cycles.



All of that sounds great to me. What doesn't sound great to me is the prospect of that flash module having an issue, though. What if it's completely hosed? What if it's only partially hosed? A bit corrupted at the edges? Relearn cycles can tell when something like a simple battery is failing, but is there a similar process to verify that the flash is functional? I'm just far more trusting of a battery, warts and all.



I know the card's RAM can fail and the card itself can fail, but that's common territory.



In case you didn't guess, yeah, I've experienced a shocking-to-me amount of flash/SSD/etc. failure :)
