Monday, December 28, 2015

ext4 - How does SSD meta-data corruption on power-loss happen? And can I minimize it?



Note: This is a follow-up question to Is there a way to protect SSD from corruption due to power loss?. I got good info there but it basically centered in three area, "get a UPS", "get better drives", or how to deal with Postgres reliability.




But what I really want to know is whether there is anything I can do to protect the SSD against meta-data corruption especially in old writes. To recap the problem. It's an ext4 filesystem on Kingston consumer-grade SSDs with write-cache enabled and we're seeing these kinds of problems:




  • files with the wrong permissions

  • files that have become directories (for example, toggle.wav is now a directory with files in it)

  • directories that have become files (not sure of content..)

  • files with scrambled data



The problem is less with these things happening on data that's being written while the drive goes down, or shortly before. It's a problem but it's expected and I can handle that in other ways.




The bigger surprise and problem is that there is meta-data corruption happening on the disk in areas that were not recently written to (ie, a week or more before).



I'm trying to understand how such a thing can happen at the disk/controller level. What's going on? Does the SSD periodically "rebalance" and move blocks around so even though I'm writing somewhere else? Like this:



enter image description here



And then there is a power loss when D is being rewritten. There may be pieces left on block 1 and some on block 2. But I don't know if it works this way. Or maybe there is something else happening..?



In summary - I'd like to understand how this can happen and if there anything I can do to mitigate the problem at the OS level.




Note: "get better SSDs" or "use a UPS" are not valid answers here - we are trying to move in that direction but I have to live with the reality on the ground and find the best outcome with what we have now. If there is no solution with these disks and without a UPS, then I guess that's the answer.



References:



Is post-sudden-power-loss filesystem corruption on an SSD drive's ext3 partition "expected behavior"?
This is similar but it's not clear if he was experiencing the kinds of problems we are.



EDIT:
I've also been reading issues with ext4 that might have problems with power-loss. Ours are journaled, but I don't know about anything else.




Prevent data corruption on ext4/Linux drive on power loss



http://www.pointsoftware.ch/en/4-ext4-vs-ext3-filesystem-and-why-delayed-allocation-is-bad/


Answer



For how metadata corruption can happen after an unexpected power failure, give a look at my other answer here.



Disabling cache can significantly reduce the likehood of in-flight data loss; however, based on your SSDs, data-at-rest remain at risk of being corrupted. Moreover, it commands a massive performance loss (I saw 500+ MB/s SSDs to write at a mere 5 MB/s after disabling the private DRAM cache).



If you can't trust your SSDs, the only "solution" (or, rather, workaround) is to use an end-to-end checksumming filesystem as ZFS or BTRFS and a RAID1/mirror setup: in this manner, any eventual single-device (meta)data corruption can be recovered from the other mirror side by running a check/scrub.



No comments:

Post a Comment

linux - How to SSH to ec2 instance in VPC private subnet via NAT server

I have created a VPC in aws with a public subnet and a private subnet. The private subnet does not have direct access to external network. S...