Saturday, August 23, 2014

filesystems - "Hot data" and erasure coding: how to know if it is efficiently handled?

It is well known that erasure coding adds extra complexity because of its encoding and decoding operations. Due to this drawback, most cloud services recommend using data replication for hot data and erasure coding for cold data.

For example, from the Ceph documentation:

The erasure-coded pool crush ruleset targets hardware designed for cold storage with high latency and slow access time. The replicated pool crush ruleset targets faster hardware to provide better response times.
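
To be concrete about what I mean by encoding/decoding overhead, here is the simplest possible erasure code: a single XOR parity chunk over k data chunks (RAID-5 style). Production systems such as Ceph typically use Reed-Solomon codes with several parity chunks, but the principle is the same; the chunk size and the function names below are just illustrative, not taken from any real system:

    # Minimal single-parity erasure code: k data chunks + 1 parity chunk.
    # Any single lost chunk can be rebuilt by XOR-ing the surviving ones.
    # Chunk size and names are illustrative, not from a real system.

    def xor_chunks(chunks):
        """XOR a list of equally sized byte chunks together."""
        out = bytearray(len(chunks[0]))
        for chunk in chunks:
            for i, b in enumerate(chunk):
                out[i] ^= b
        return bytes(out)

    def encode(data, k, chunk_size=4096):
        """Split data into k chunks and append one parity chunk."""
        data = data.ljust(k * chunk_size, b"\0")    # pad to a full stripe
        chunks = [data[i * chunk_size:(i + 1) * chunk_size] for i in range(k)]
        return chunks + [xor_chunks(chunks)]        # k data chunks + 1 parity chunk

    def decode(chunks, lost_index):
        """Rebuild the chunk at lost_index from the surviving ones."""
        survivors = [c for i, c in enumerate(chunks) if i != lost_index]
        return xor_chunks(survivors)

    stripe = encode(b"hot data example", k=4)
    assert decode(stripe, lost_index=2) == stripe[2]

Replication only copies bytes, whereas erasure coding has to touch every chunk of a stripe on encode, and again on every degraded read or small overwrite, which is where the extra latency for hot data comes from.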

Is there a better definition of hot data than "data that is accessed more often than other data"?

Let's consider a storage system that relies on erasure coding, and an application running on top of it that generates an intensive I/O workload. Is that considered hot data?

Now, how can I tell whether my storage system's erasure code is viable or not? Is it relevant to measure IOPS from the application side for some specific tests (e.g. random/sequential reads and writes)?
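
For example, the crudest application-side probe I can think of is timing synchronous random 4 kB writes and counting them (a dedicated tool such as fio would be more rigorous, but the idea is the same; the path, file size and duration below are arbitrary choices):

    # Rough application-side IOPS probe: synchronous random 4 kB writes.
    # Single-threaded and fsync-bound, so it understates what the system
    # can do with deeper queues; path/size/duration are arbitrary choices.
    import os, random, time

    PATH = "/mnt/ecpool/iops_test.bin"   # hypothetical mount point on the EC-backed storage
    FILE_SIZE = 256 * 1024 * 1024        # 256 MiB test file
    BLOCK = 4096                         # 4 kB blocks, as in the question
    DURATION = 30                        # seconds

    fd = os.open(PATH, os.O_RDWR | os.O_CREAT, 0o644)
    os.ftruncate(fd, FILE_SIZE)
    block = os.urandom(BLOCK)

    ops = 0
    deadline = time.monotonic() + DURATION
    while time.monotonic() < deadline:
        offset = random.randrange(FILE_SIZE // BLOCK) * BLOCK
        os.pwrite(fd, block, offset)     # random 4 kB write
        os.fsync(fd)                     # push it past the page cache
        ops += 1
    os.close(fd)

    print("random 4 kB write IOPS:", ops / DURATION)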

Is there a threshold that says erasure codes are not viable for hot data because I only record (for example) one hundred IOPS application-side for random writes of 4 kB blocks?
What if I record one hundred billion IOPS?

Are IOPS relevant for this kind of test (maybe another metric would say more)?
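
For instance, instead of only counting operations I could record the latency of each one and look at the tail, since that is usually where the reconstruction and read-modify-write costs of erasure coding show up:

    # Per-operation latency percentiles as a complement to raw IOPS.
    # The latencies_ms values are made up; in practice they would come from
    # timing each individual I/O in a benchmark like the sketch above.
    import math

    def percentile(samples, p):
        """p-th percentile (0 < p <= 100), nearest-rank method."""
        ordered = sorted(samples)
        rank = math.ceil(p / 100 * len(ordered))
        return ordered[rank - 1]

    latencies_ms = [0.8, 0.9, 0.9, 0.9, 1.0, 1.0, 1.1, 1.2, 7.5, 25.0]
    print("p50:", percentile(latencies_ms, 50), "ms")   # median
    print("p99:", percentile(latencies_ms, 99), "ms")   # tail latency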

I am full of questions about this, and any help would be greatly appreciated.
