Erasure Coding-X (EC-X): Predictably Increase Usable Storage Capacity

By Tim Isaacs

Now that the dust has settled from our inaugural .NEXT user conference, it’s time to go into the details of a technology that we briefly discussed there, called Erasure Coding-X (EC-X). Customers can now deploy EC-X in non-production environments by upgrading to Nutanix Operating System (NOS) version 4.1.3, which was released recently. Jump to this blog post to read about key NOS 4.1.3 features and functionality.

EC-X is a proprietary, native, patent-pending implementation of Erasure Coding. With EC-X, Nutanix customers are able to increase their usable storage capacity by up to 70%.

However, before we discuss EC-X in detail, let’s frame the topic of storage efficiency.

Storage vendors have implemented many features to make storage more efficient. Deduplication, Compression, Thin Provisioning, Snapshots, Clones and RAID are all household names today. While we have been shipping the full suite of storage efficiency features since our early days, we never implemented RAID. Our web-scale brethren (Google, Azure, AWS, Facebook) have taken the same approach for the same reason: RAID hasn’t stood the test of time. Disk drives have become larger but their reliability has stayed the same. Larger disks mean that the time needed to reconstruct a failed disk using RAID parity information has become significantly longer, increasing the probability of a second disk failure or other errors before reconstruction can complete. It’s not uncommon for a 4TB disk to take days to rebuild.
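To make the rebuild-time point concrete, here is a rough back-of-the-envelope sketch. The rebuild rates are hypothetical values chosen only for illustration, not measurements from any particular array; the takeaway is simply that rebuild time grows linearly with disk size when the whole disk has to be rewritten through a single rebuild stream.

```python
def rebuild_hours(disk_tb: float, rebuild_mb_per_s: float) -> float:
    """Hours needed to rewrite an entire disk at a sustained rebuild rate."""
    disk_mb = disk_tb * 1_000_000  # decimal TB -> MB
    return disk_mb / rebuild_mb_per_s / 3600


# Hypothetical sustained rates (assumptions); real rates depend on the RAID
# level, the controller, and how much production I/O competes with the rebuild.
for rate in (100, 50, 20):
    hours = rebuild_hours(4, rate)
    print(f"4 TB at {rate} MB/s: {hours:.1f} h ({hours / 24:.1f} days)")
```

At the slower assumed rates, which is what a busy array may be left with once production traffic takes its share, a single 4TB disk lands in the multi-day range.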

Instead of RAID, we (and our web-scale brethren) rely on a technology called Replication Factor.

Replication Factor (RF) is a quick and efficient technique of creating multiple (2 or 3) data copies across the cluster, making the cluster highly resilient with the ability to tolerate up to two simultaneous node (server) failures without downtime or data loss. Replication Factor is also the king of rebuilds. If a node or disk fails, the cluster rebuilds back to the desired resiliency level using the power of all spare resources in the cluster, far quicker than any traditional method. However, all these benefits come at a cost. Replication Factor, as the name implies, creates data copies and therefore consumes more capacity than traditional RAID 5 or 6 schemes.

This is where Nutanix EC-X fits in.

EC-X overcomes the capacity cost of Replication Factor without taking away any of the benefits. EC-X is an implementation of Erasure Coding and therefore works by creating a mathematical function around a data set such that if a member of the data set is lost, the lost data can be recovered easily from the rest of the members of the set.
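The general idea can be illustrated with the simplest erasure code there is: a single XOR parity block. This sketch is emphatically not the EC-X algorithm, which is proprietary; the block contents and function names below are made up for illustration. It only shows how a mathematical function over a set of blocks lets one lost member be rebuilt from the others.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of a list of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Compute one parity block over the data blocks."""
    return xor_blocks(data_blocks)


def recover(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors and the parity."""
    return xor_blocks(surviving_blocks + [parity])


# Four illustrative data blocks, imagined to live on four different nodes.
data = [b"node", b"disk", b"tier", b"flsh"]
parity = encode(data)

# Simulate losing one block, then rebuild it from the remaining members.
lost_index = 2
survivors = data[:lost_index] + data[lost_index + 1:]
rebuilt = recover(survivors, parity)
assert rebuilt == data[lost_index]
```

A single XOR parity protects against one lost block per set. Tolerating more simultaneous losses requires additional, independent parity functions.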

Strictly speaking, RAID too is an implementation of Erasure Coding, albeit one tied to disk geometry and disk arrangement. This constraint greatly limits RAID’s usefulness in today’s day and age, as we previously discussed.

Common industry uses of Erasure Coding include Data Protection (in the object storage world) and Storage Efficiency (in the web-scale world). The most commonly used class of Erasure Codes is Reed-Solomon codes.

EC-X does not use Reed-Solomon codes. Instead, it uses a proprietary, patent-pending algorithmic scheme that affords us more flexibility and speed than Reed-Solomon and Reed-Solomon-like implementations (more on this topic in another blog post).

So what savings can be attributed to EC-X?

A Replication Factor of 2 (RF2) makes about 50% of raw storage capacity usable. EC-X can raise this utilization to 80%.
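The arithmetic behind these two figures can be sketched as follows. The 4-data-plus-1-parity layout is an assumption picked because it yields the 80% figure; this post does not state the strip sizes EC-X actually uses, and the 100 TB raw capacity is just an example number.

```python
def replication_efficiency(copies: int) -> float:
    """Usable fraction of raw capacity when every block is stored `copies` times."""
    return 1 / copies


def erasure_efficiency(data_blocks: int, parity_blocks: int) -> float:
    """Usable fraction of raw capacity for a strip of data plus parity blocks."""
    return data_blocks / (data_blocks + parity_blocks)


raw_tb = 100  # example raw capacity (assumption)

rf2 = replication_efficiency(2)
ec_4_1 = erasure_efficiency(4, 1)  # assumed 4+1 layout, not a documented EC-X setting

print(f"RF2:    {rf2:.0%} usable, {raw_tb * rf2:.0f} TB of {raw_tb} TB raw")
print(f"EC 4+1: {ec_4_1:.0%} usable, {raw_tb * ec_4_1:.0f} TB of {raw_tb} TB raw")
```

Unlike a deduplication or compression ratio, both fractions depend only on the layout and not on the data being stored.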

 

Also, unlike savings from Deduplication and Compression, which depend entirely on the Dedupe-ability and Compressibility of the data set, EC-X applies to pretty much all workloads and the savings are therefore deterministic.

There’s more.

The EC-X algorithm also has one of the lowest computational costs. Coding and rebuilds are distributed across the nodes of the cluster. This distributed, low-cost algorithm affords faster rebuilds and thus shorter vulnerability windows and improved availability. EC-X also optimizes the flash tier, improving performance by increasing effective flash capacity. Finally, EC-X does not break Data Locality on the Nutanix cluster, further helping performance since read requests will be served locally.

Interested customers can use EC-X in non-production environments starting with Nutanix OS version 4.1.3. The generally available, production-ready version of EC-X will follow in a few months.

This write-up will be complemented by another blog post delving into the workings of EC-X and how it differs from other Erasure Coding implementations. We will also discuss complementary technologies and put things into perspective with All Flash systems.

Hope this piqued your interest. Share your thoughts and continue the conversation on the Nutanix NEXT community.