Nutanix Cluster-based Replication vs. Hypervisor-based Replication Solutions

One of the things I like about my work at Nutanix is up-close customer interactions. These let me and the team understand real customer challenges and solve them via innovative products. Recently, in one of my conversations with a customer, we were asked how Nutanix Replication/Disaster Recovery compares to hypervisor-based replication products. This motivated me to write a blog post to illustrate how the distributed architecture of NDFS provides a clear advantage over hypervisor-based replication solutions.

Both VMware and Microsoft have replication support out of the box. VMware calls it vSphere Replication, and Microsoft calls theirs Hyper-V Replica. Both of these approaches use the same basic technique: logging IOs to virtual disks in a file and shipping the delta across the wire to a remote cluster. Of course, this sounds simpler than it actually is, since every failure case needs to be handled carefully: the network flapping or going down, hosts failing, and so on.

Note, however, that in both vSphere Replication and Hyper-V Replica the host is the key piece of the replication puzzle. If a single host runs a number of critical VMs that you want to protect, then that host’s physical resources (CPU, memory, disk, network) become the bottleneck. To get around this, you have to place the VMs intelligently across multiple hosts and predict metrics such as the data change rate of each workload, which is hard to do.

Another fundamental issue with this approach is that repeated writes to the same region of the disk lead to all of that data being shipped across. Yes, there are tricks you can play here as well, e.g. replicate at a periodic interval and hope that you can compress your changeset. But when you blindly ship IOs across, there is no real way to collapse overwrites of the same region. (In this respect, vSphere Replication seems to do better than Hyper-V Replica, which seems to ship each write, as opposed to the proper change block tracking in the vSphere case.)
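To put rough numbers on the overwrite problem, here is a toy model in Python. It is only a sketch under stated assumptions: the 4 KB block size, the function names, and the workload are made up for illustration, and it does not describe how vSphere Replication or Hyper-V Replica are actually implemented. It simply compares shipping every logged write against shipping each changed block once.

```python
BLOCK_SIZE = 4096  # bytes per block; an assumed value for illustration


def bytes_shipped_by_log(writes):
    """Ship every logged write, including repeated writes to one block."""
    return len(writes) * BLOCK_SIZE


def bytes_shipped_by_tracking(writes):
    """Ship each changed block once, however often it was written."""
    return len(set(writes)) * BLOCK_SIZE


# Hypothetical workload: block 7 is rewritten 100 times, block 9 once.
writes = [7] * 100 + [9]

log_bytes = bytes_shipped_by_log(writes)
tracked_bytes = bytes_shipped_by_tracking(writes)
print(log_bytes, tracked_bytes)
```

In this made-up workload the write log carries 101 blocks while only 2 distinct blocks actually changed, which is the gap that periodic replication and compression try to narrow.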

Contrast this method with the Nutanix Replication/DR approach, which is a true distributed replication solution. Nutanix computes a diff between the latest snapshot and the previously replicated snapshot, and sends over the delta. This provides two key advantages:

1. A host other than the one serving IO on the active virtual disk can do the work of computing diffs across the snapshots. This means we are no longer bottlenecked on a host that might have critical VMs doing work and needing the resources of this host.
2. Diff across snapshots means that overwrites to the same region are effectively collapsed. In cases of data being zeroed out in regions of the virtual disk, this approach is superior to change block tracking since a simple metadata update can be shipped across.
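
The second point can be illustrated with a minimal Python sketch. This is not the NDFS implementation: representing a snapshot as a dictionary of block contents, the `snapshot_diff` and `payload_bytes` helpers, and the 4 KB block size are all assumptions chosen to show the idea. Because the diff only compares the two end states, intermediate overwrites never appear in the delta, and a zeroed block becomes a metadata-only entry.

```python
BLOCK_SIZE = 4096               # assumed block size, for illustration only
ZERO_BLOCK = bytes(BLOCK_SIZE)  # a block consisting entirely of zero bytes


def snapshot_diff(previous, latest):
    """Return the delta that turns `previous` into `latest`.

    Each snapshot is modeled as a dict of block index -> block contents.
    Changed blocks yield ("data", index, contents); blocks that became
    all zeros yield ("zero", index, None), i.e. metadata only.
    """
    delta = []
    for block in sorted(set(previous) | set(latest)):
        old = previous.get(block, ZERO_BLOCK)
        new = latest.get(block, ZERO_BLOCK)
        if new == old:
            continue  # unchanged between snapshots: nothing to ship
        if new == ZERO_BLOCK:
            delta.append(("zero", block, None))
        else:
            delta.append(("data", block, new))
    return delta


def payload_bytes(delta):
    """Count only the block data that has to cross the wire."""
    return sum(len(data) for kind, _, data in delta if kind == "data")


previous = {0: b"a" * BLOCK_SIZE, 1: b"b" * BLOCK_SIZE, 2: b"c" * BLOCK_SIZE}
latest = dict(previous)
for i in range(1, 101):          # block 1 is overwritten 100 times
    latest[1] = bytes([i]) * BLOCK_SIZE
latest[2] = ZERO_BLOCK           # block 2 is zeroed out

delta = snapshot_diff(previous, latest)
print([(kind, block) for kind, block, _ in delta], payload_bytes(delta))
```

In this sketch the 100 overwrites of block 1 collapse into a single data entry, and the zeroed block 2 adds no payload at all. Note also that `snapshot_diff` reads only the two snapshots, so nothing in it has to run on the host serving live IO.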

Why does this matter to you as a customer? It’s simple – Nutanix’s Replication approach allows for more efficient CPU and bandwidth usage.

There are a number of other advantages of the Nutanix Replication/DR solution which are covered throughout our blogs as well as this technology brief. Architecturally, the Nutanix approach has proven to be a more natural approach to achieve true scale-out replication and disaster recovery.