Saturday, March 30, 2019

backup - How to get cheap disaster recovery for a 124 TB Isilon filesystem?

On our Isilon cluster, we have a 124 TB file system. It is currently 38 percent full, with 31 million files. About half the data are image files, and the mean file size is 1.5 MB. We use snapshots to protect against accidental deletion, but we need something different to protect against total failure (e.g., sysadmin error, software error, or water, heat, or fire damage). And because we're a poor research lab, it shouldn't be too expensive.



We currently try to back up to tape, but that has two problems. First, just traversing the directory tree and stat-ing each file takes more than five days, so even an incremental backup takes over a week. Second, and most important, a restore would take many weeks, even months.
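
As a rough sanity check, the scan rate implied by those figures can be worked out as below. This is only a sketch based on the numbers quoted in this question (31 million files, a walk of more than five days); the function name is made up for illustration.

```python
# Back-of-the-envelope scan rate implied by the figures in the question.
FILES = 31_000_000   # files on the file system (from the question)
SCAN_DAYS = 5        # lower bound: the walk takes "more than five days"

SECONDS_PER_DAY = 24 * 60 * 60


def files_per_second(files: int, days: float) -> float:
    """Average number of files visited per second over the whole walk."""
    return files / (days * SECONDS_PER_DAY)


# Because the walk takes MORE than five days, this is an upper bound
# on the rate actually achieved.
rate = files_per_second(FILES, SCAN_DAYS)
print(f"at most about {rate:.0f} files stat-ed per second")
```

That works out to at most roughly 72 files per second, which suggests the walk is limited by per-file metadata latency rather than by data throughput.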



Ideally, we'd like to have access to much of the data again within a week of a disaster. (It's fine to get the data back gradually over the course of several weeks if we can choose which directories to restore first, but sourcing new storage equipment and restoring would likely take much longer than that.) The only way I can think of to recover within a week is to maintain a replica on disk at a separate location. It's OK to lose a few days of work, so the replication can lag a little or cover the file system over the course of several days. It's also OK for the replica to have much poorer performance than the original.
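
To put the one-week target in perspective, here is a rough sizing sketch. It uses only the figures quoted in this question and assumes decimal units (1 TB = 10**12 bytes); the helper name is hypothetical.

```python
# Rough sizing of the data set, using only figures quoted in the question.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes).
TB = 10**12
MB = 10**6

CAPACITY_TB = 124       # file system size
FILL_FRACTION = 0.38    # "38 percent full"
FILES = 31_000_000
MEAN_FILE_MB = 1.5
RESTORE_DAYS = 7        # target: data back within a week

# Two independent estimates of the data actually stored.
used_bytes = CAPACITY_TB * TB * FILL_FRACTION
cross_check_bytes = FILES * MEAN_FILE_MB * MB


def sustained_rate_mb_per_s(total_bytes: float, days: float) -> float:
    """Average throughput (MB/s) needed to move total_bytes in the given days."""
    return total_bytes / (days * 86_400) / MB


rate = sustained_rate_mb_per_s(used_bytes, RESTORE_DAYS)
print(f"used: {used_bytes / TB:.1f} TB (cross-check: {cross_check_bytes / TB:.1f} TB)")
print(f"needed: {rate:.0f} MB/s sustained, about {rate * 8:.0f} Mbit/s")
```

Both estimates land near 47 TB, and copying that much in seven days needs roughly 78 MB/s sustained. That ignores per-file overhead, which the slow directory walk suggests would dominate in practice.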




The Isilon solution would be to use SyncIQ to replicate the file system to another cluster. Because this operates at the block level, it avoids the problem of traversing the file system and stat-ing each file. As can be expected, the cost is a little steep: the license for the SyncIQ software is $55k, and then there is the cost of the expensive Isilon storage to synchronize to (although using their cheaper NL storage helps a bit). I expect that the Isilon solution will come to somewhere between $500 and $1000 per TB, which is far better than the $1300–1900/TB we paid for the primary storage, but still a lot of money for us.



Given that raw hard drives can be had for $60/TB these days, I would hope that 124 TB of slow storage can be cobbled together for far below Isilon prices, and that there is a way to replicate changes in less than a week. Can you think of a way?
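
For comparison, the price points mentioned above can be scaled to the full 124 TB. This is only a sketch: the dictionary labels are mine, the raw-drive figure ignores redundancy, enclosures, servers and power, and I have not tried to separate out the SyncIQ license.

```python
# Scale the per-TB prices quoted in the question to the full file system size.
CAPACITY_TB = 124

# (low, high) USD per TB, all taken from the question.
PRICES_PER_TB = {
    "raw hard drives": (60, 60),
    "Isilon replica (my estimate)": (500, 1000),
    "primary Isilon storage (what we paid)": (1300, 1900),
}


def total_cost(per_tb_range, capacity_tb=CAPACITY_TB):
    """Return the (low, high) total cost in USD for the given capacity."""
    low, high = per_tb_range
    return low * capacity_tb, high * capacity_tb


for name, per_tb in PRICES_PER_TB.items():
    low, high = total_cost(per_tb)
    if low == high:
        print(f"{name}: ${low:,}")
    else:
        print(f"{name}: ${low:,} to ${high:,}")
```

Raw drives alone come to about $7,440 for 124 TB, against $62,000 to $124,000 for the Isilon estimate, so there is a lot of headroom for the hardware and redundancy a real replica would need.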
