Nutanix – High Availability and Data Protection

In an earlier post, I spoke about a new capability called “Cloud Connect” that lets users build a hybrid-cloud strategy for their datacenter. I wanted to step back a little and cover the High Availability and Data Protection capabilities that Nutanix provides, so that you have context around where Cloud Connect fits in.

Nutanix offers a wide range of capabilities across the stack to make sure data and VMs are continuously available. Failures can happen at multiple levels within the datacenter: individual HW components can fail, individual nodes can go down (mostly due to human error), data stored on disks can get corrupted or accidentally deleted, racks may fail, and even an entire site can go down during a disaster. Today, you will find a wide range of solutions from different vendors that handle each of these failures. HW vendors provide redundancy at the HW level, hypervisors provide the ability to move VMs from one node to another when a node has an issue, backup vendors provide the ability to back up data to remote sites, cloud providers let you manage data in the public cloud, and finally separate DR solutions are sold as bolt-ons to existing storage solutions. The result is a vendor and management palooza. As you can imagine, the associated Capex and Opex are extremely high as well.

At Nutanix, we have built high availability and data protection into our platform from the ground up. One of the core tenets of being web-scale is the ability to self-heal. We build our software with the fundamental assumption that the hardware it runs on will fail and that companies still hire interns who will accidentally pull out cables. We strive hard to ensure that VMs and Apps are never impacted when such failures happen and that the underlying issue is fixed automatically, without any user intervention. The Nutanix hardware platform has redundant power supplies, memory and CPU. The platform is resilient to individual hard disk failures. If a disk fails, data is automatically rebuilt without any impact to the App. It is important to know that we are a true software-defined platform. Unlike other solutions, we don’t use HW RAID to protect data locally. It is a true ‘crowd-sourced’ (aka distributed) solution: if a disk or node fails, the other nodes in the cluster work together to bring up the VM and repopulate the data locally, automatically. The more nodes in the cluster, the faster and more efficient the rebuild process is.
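To see why more nodes help the rebuild, here is a back-of-the-envelope sketch. It assumes that every surviving node contributes an equal share of rebuild throughput, which is a simplification for illustration and not a description of the actual Nutanix rebuild logic. The function name and the numbers are hypothetical.

```python
def estimated_rebuild_seconds(lost_bytes, node_count, per_node_bytes_per_sec):
    """Rough rebuild time when all surviving nodes share the work evenly.

    Assumption (not from Nutanix documentation): each surviving node
    re-replicates data at the same rate, and nothing else is a bottleneck.
    """
    if node_count < 2:
        raise ValueError("need at least one surviving node to rebuild")
    if per_node_bytes_per_sec <= 0:
        raise ValueError("per-node throughput must be positive")
    surviving_nodes = node_count - 1
    return lost_bytes / (surviving_nodes * per_node_bytes_per_sec)


if __name__ == "__main__":
    one_tb = 10**12
    # Hypothetical figure: 100 MB/s of rebuild throughput per node.
    for nodes in (3, 4, 8, 16):
        seconds = estimated_rebuild_seconds(one_tb, nodes, 100 * 10**6)
        print(f"{nodes:>2} nodes: about {seconds / 60:.0f} minutes")
```

Under this model the rebuild time falls roughly in proportion to the number of surviving nodes, which is the intuition behind the "crowd-sourced" rebuild.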

Controller and Hypervisor upgrades are touch-free and non-disruptive as well. With a feature called “Tunable Redundancy”, data gets synchronously replicated across multiple nodes in the cluster (minimum of 2). Metadata is also automatically replicated across multiple nodes within the cluster. Tunable Redundancy lets you decide the level of protection based on the application SLA and enable policies at a VM level. Building upon this is the concept of Availability Domains, which determine the optimal placement for replicas based on node / block / rack awareness. With Availability Domains you can lose a full Nutanix block (1, 2 or 4 nodes), unlikely as that is, and still have copies of the data available. There’s no admin interaction required to enable this. There are also silent data-integrity checks to ensure data corruption issues are identified and fixed even before the VMs/Apps notice them.
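To make block awareness concrete, here is a minimal sketch of replica placement. It is an illustration under stated assumptions, not the actual Nutanix placement algorithm: the node and block names, the `place_replicas` function and the simple "prefer an unused block" rule are all hypothetical.

```python
def place_replicas(nodes, primary, replication_factor=2):
    """Choose which nodes hold the copies of one piece of data.

    nodes: dict mapping node name -> block name (hypothetical layout).
    primary: node holding the first copy, e.g. the node running the VM.
    Copies are spread across different blocks whenever enough blocks exist.
    """
    if primary not in nodes:
        raise ValueError(f"unknown node: {primary}")
    if replication_factor > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")

    chosen = [primary]
    used_blocks = {nodes[primary]}

    # First pass: prefer nodes in blocks that do not hold a copy yet.
    for node in sorted(nodes):
        if len(chosen) == replication_factor:
            break
        if nodes[node] not in used_blocks:
            chosen.append(node)
            used_blocks.add(nodes[node])

    # Second pass: fall back to any remaining node (node awareness only).
    for node in sorted(nodes):
        if len(chosen) == replication_factor:
            break
        if node not in chosen:
            chosen.append(node)

    return chosen


def survives_block_loss(placement, nodes):
    """True if losing any single block still leaves at least one copy."""
    return len({nodes[node] for node in placement}) > 1


if __name__ == "__main__":
    cluster = {"A1": "blockA", "A2": "blockA", "B1": "blockB", "B2": "blockB"}
    placement = place_replicas(cluster, primary="A1", replication_factor=2)
    print(placement, survives_block_loss(placement, cluster))
```

With two blocks available, the second copy lands in the other block, so a full block failure still leaves one copy. With a single block, the sketch falls back to plain node awareness.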

You can learn more about this by watching a quick whiteboard video – Data Protection within a Nutanix Cluster

At the VM level, you have the ability to back up VMs locally, to a remote site or to the public cloud (Cloud Connect). The backup is WAN-optimized by default: data is deduplicated and compressed before it gets transferred. With support for differential backups, only data that has changed gets sent over the wire. Recovery is initiated in seconds, irrespective of where the data is, and data can be recovered in a matter of minutes. For a lot of customers and workloads, this eliminates the need for a dedicated backup solution. Since the backup happens at VM granularity, backup and recovery are super-efficient. With VSS integration in our native snapshots, we can provide app-consistent backups for Microsoft workloads such as Microsoft SQL, Microsoft Exchange, etc. We also integrate seamlessly with third-party backup solutions, if there is a need.
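The combination of differential backup, deduplication and compression can be sketched in a few lines. This is a generic content-addressed illustration, not how Nutanix implements it: the 4 KB block size, the SHA-256 fingerprints, the `zlib` compression and the in-memory `remote_store` dict standing in for the remote site are all assumptions made for the example.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # hypothetical block size for this sketch


def split_blocks(data, block_size=BLOCK_SIZE):
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def differential_backup(data, remote_store, block_size=BLOCK_SIZE):
    """Send only blocks the remote side does not already hold.

    remote_store: dict fingerprint -> compressed block, standing in for
    the remote site. Returns (manifest, bytes_sent), where the manifest
    is the ordered list of fingerprints needed to rebuild the data.
    """
    manifest = []
    bytes_sent = 0
    for block in split_blocks(data, block_size):
        fingerprint = hashlib.sha256(block).hexdigest()
        manifest.append(fingerprint)
        if fingerprint not in remote_store:      # dedup: skip known blocks
            payload = zlib.compress(block)       # compress before "sending"
            remote_store[fingerprint] = payload
            bytes_sent += len(payload)
    return manifest, bytes_sent


def restore(manifest, remote_store):
    """Rebuild the original data from a manifest."""
    return b"".join(zlib.decompress(remote_store[fp]) for fp in manifest)


if __name__ == "__main__":
    store = {}
    monday = b"a" * 4096 + b"b" * 4096 + b"a" * 4096
    tuesday = b"a" * 4096 + b"c" * 4096 + b"a" * 4096  # one block changed
    m1, sent1 = differential_backup(monday, store)
    m2, sent2 = differential_backup(tuesday, store)
    print(f"first backup sent {sent1} bytes, second sent {sent2} bytes")
    assert restore(m1, store) == monday and restore(m2, store) == tuesday
```

The second backup only transfers the one block that changed, and both backups remain fully restorable from their manifests.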

Enterprise-grade DR is built into the software as well. You can replicate your VMs locally or to another site based on a schedule that suits each workload. With support for 1:Many, Many:1 and Many:Many replication, you can protect all your workloads across all your sites. Again, since data is deduped and compressed before it goes over the wire, the network and storage resources used are significantly lower. With REST APIs, CLIs and powerful management through PRISM, the entire DR process can be automated, resulting in minimal downtime during a DR event. We are not stopping there. Very soon you will see us talk about continuous availability across different metros too. Stay tuned.
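As a rough illustration of per-workload schedules with one or more target sites, here is a small sketch. It is not the Nutanix API or data model: the `ReplicationPolicy` class, its fields, the site names and the minute-based clock are all hypothetical and exist only to show the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationPolicy:
    """Hypothetical policy: one workload, one or more target sites."""
    workload: str
    targets: tuple          # names of remote sites (1:Many when more than one)
    interval_minutes: int   # how often this workload should replicate


def due_replications(policies, last_run, now_minutes):
    """Return (workload, target) pairs whose interval has elapsed.

    last_run maps (workload, target) -> minute of the last successful
    replication; a missing entry means it has never replicated.
    """
    due = []
    for policy in policies:
        for target in policy.targets:
            last = last_run.get((policy.workload, target))
            if last is None or now_minutes - last >= policy.interval_minutes:
                due.append((policy.workload, target))
    return due


if __name__ == "__main__":
    policies = [
        ReplicationPolicy("sql", ("siteB", "siteC"), interval_minutes=60),
        ReplicationPolicy("web", ("siteB",), interval_minutes=240),
    ]
    last_run = {("sql", "siteB"): 100, ("sql", "siteC"): 30, ("web", "siteB"): 0}
    print(due_replications(policies, last_run, now_minutes=120))
```

Here the critical database replicates hourly to two sites while the web tier replicates every four hours to one, so each workload gets the schedule that matches its SLA.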

I hope this gives you a good overview of some of the HA and Data Protection capabilities that are built into the product. I will digress slightly, but I just want to leave you with a few closing thoughts. Just look at all the consoles you have open right now for managing your datacenter infrastructure across storage, compute, hypervisor, backup, DR, etc. Do you use all the features across all these consoles? What stands out the most to you? Is it one single ‘feature’ across all those windows you have open? Or is it just the fact that you are dealing with so many vendors and management consoles, with complex user flows to get simple things done? I would think it is the latter. We fundamentally believe that datacenter management shouldn’t be this complex. We like to operate with the left brain of Google and the right brain of Apple: deal with datacenter complexities the way Google does, with an Apple-like simplicity. It is not important to have every feature under the sun; it only matters that we have the ones you want. We think we do.

If you are already a Nutanix customer, enable backup or replication for a couple of your workloads and see how it works for you. If you aren’t one, you know what you need to do.