
No-Click Data Center Site Failovers Powered by Metro Availability Witness

By Mike McGhee

Nutanix Metro Availability is a continuous availability solution that enables synchronous replication and stretched clustering for Nutanix environments. Metro Availability was first released with AOS 4.1 in early 2015. With Metro Availability, Nutanix was the first hyper-converged platform to provide a native continuous availability solution. While this was a strong achievement in the industry, it was only a first step in delivering the intended functionality: a simple solution providing seamless migration, zero RPO, and near-zero RTO for unplanned events across datacenters.

Eighteen Months of Innovation for Metro Availability

Over the past year and a half there have been several improvements to Metro Availability. Some of those improvements include:

  • Synchronous replication support for both ESXi and Hyper-V environments, in the 4.1.3 release.
  • Asynchronous replication interoperability, including three-site support, introduced with the 4.1.4 release.
  • The capability to fully transition workloads between sites with zero downtime during planned events, available with the 4.6 release.

Each of these features could be enabled with a simple, non-disruptive software update, so customers could adopt them on their existing infrastructure investments. In 18 months, customers have gained increased data protection capabilities and support for replicating to more than two sites at no additional cost.

Announcing No-Click Site Failovers, Powered by Metro Availability Witness

Looking ahead, a key planned feature of Metro Availability is the ability to automate the failover of virtual machines between datacenters following unplanned events. Customers are looking for application failovers to occur without human intervention, especially for transient events such as power outages. Failover automation helps to relieve administrative burden and greatly reduces the recovery time objective (RTO) of the solution.

To address this use case, Nutanix is planning to debut a witness service in the upcoming software release codenamed “Asterix.” The witness will act as a broker and control plane for automating the transition of virtual machines during a site failure.

MAW Image

Once released, the Metro Availability witness will run as a standalone virtual machine, in a separate failure domain from the sites hosting the Metro cluster. The witness should be connected to the two protected sites via independent network connections with a round-trip latency of less than 200 ms. Whether the witness is running on Nutanix or non-Nutanix infrastructure, the witness virtual machine can be deployed in minutes and registered with an existing Metro cluster with just a few mouse clicks. The witness itself exposes a basic monitoring UI that shows the state of the Protection Domains it is protecting, and a set of internal APIs, used by the protected Nutanix clusters, that provide the actual witness functionality. A single witness virtual machine can protect hundreds of Protection Domains across a large number of clusters.
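The two placement requirements above (a separate failure domain and a round-trip latency below 200 ms to each protected site) can be captured as a small pre-deployment check. The following is only an illustrative sketch in Python: the WitnessPlacement type, the placement_problems function, and the site names are hypothetical and are not part of any Nutanix tool or API.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Latency limit stated in the text; everything else here is a hypothetical model.
MAX_WITNESS_RTT_MS = 200.0


@dataclass(frozen=True)
class WitnessPlacement:
    """Hypothetical description of where a witness VM would be deployed."""
    witness_failure_domain: str
    site_failure_domains: Tuple[str, str]
    rtt_ms_to_sites: Tuple[float, float]  # measured round-trip times, in ms


def placement_problems(p: WitnessPlacement) -> List[str]:
    """Return human-readable problems; an empty list means the checks pass."""
    problems = []
    # The witness must not fail together with either protected site.
    if p.witness_failure_domain in p.site_failure_domains:
        problems.append("witness shares a failure domain with a protected site")
    # Round-trip latency to each site must be below the stated limit.
    for domain, rtt in zip(p.site_failure_domains, p.rtt_ms_to_sites):
        if rtt >= MAX_WITNESS_RTT_MS:
            problems.append(
                f"RTT to {domain} is {rtt} ms; expected below {MAX_WITNESS_RTT_MS} ms"
            )
    return problems


if __name__ == "__main__":
    good = WitnessPlacement("dc-c", ("dc-a", "dc-b"), (35.0, 120.0))
    print(placement_problems(good))
```

An empty result means both checks pass. Placing the witness in one of the protected sites, or behind a link slower than the limit, would each add an entry to the list.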

MAW Image

The witness will detect cluster failures and can automatically promote the standby site, which allows vSphere High Availability to restart virtual machines automatically. Metro Availability witness support will prevent split-brain scenarios should a network partition occur between sites. Additionally, the witness integration can detect rolling disasters and help prevent data loss scenarios.
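To make the split-brain point concrete, here is a minimal sketch of the general third-site arbiter pattern, written in Python. It is not the Nutanix witness algorithm, which the text does not describe. The Action enum, the arbitrate function, and its three boolean inputs are assumptions chosen for illustration. The property it demonstrates is that at most one site is ever allowed to serve writes.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes of a witness decision."""
    NO_CHANGE = "no change"
    KEEP_ACTIVE_ONLY = "active site continues alone; standby stays passive"
    PROMOTE_STANDBY = "promote standby site"


def arbitrate(witness_sees_active: bool,
              witness_sees_standby: bool,
              sites_see_each_other: bool) -> Action:
    """Pick a single site to keep serving writes, assuming a generic arbiter model."""
    # A witness that cannot reach either site has no basis for a decision.
    if not witness_sees_active and not witness_sees_standby:
        return Action.NO_CHANGE
    # Replication link is healthy: nothing to arbitrate.
    if sites_see_each_other:
        return Action.NO_CHANGE
    # Sites are cut off from each other. If the active site is still reachable,
    # it keeps ownership, so the standby is never promoted alongside it.
    if witness_sees_active:
        return Action.KEEP_ACTIVE_ONLY
    # Only the standby is reachable: treat the active site as failed.
    return Action.PROMOTE_STANDBY


if __name__ == "__main__":
    # Inter-site partition while the witness still reaches both sites.
    print(arbitrate(True, True, False).value)
    # Active site lost entirely.
    print(arbitrate(False, True, False).value)
```

In this model, a network partition between the sites never results in a promotion as long as the witness can still reach the active site, which is what rules out two sites accepting writes at the same time.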

Upgrading a current Metro Availability deployment to start making use of the witness functionality requires just a quick upgrade operation. Once released, the witnessed failure mode will be the recommended configuration, but the current failure handling modes (automatic resume and manual) will continue to be supported.

The witness represents the next step in the continuous innovation Nutanix has been providing for Metro Availability, and helps to complete the industry’s first fully integrated, hyper-converged, continuous availability solution with zero RPO and now near-zero RTO. Please continue the conversation on the Nutanix Next Community and let us know what you think.