Introduction

As the landscape continues to evolve for existing cloud-integrated virtualization solutions, achieving a rapid and predictable Recovery Time Objective (RTO) can be a critical priority. In this article, we discuss our recent validation testing of Nutanix Cloud Clusters (NC2) on AWS, exploring the failover mechanics and design constraints behind a test that recovered 1,000 virtual machines.

Failing over from owned and operated datacenters to hyperscalers is a plan many IT organizations rely on, but playing chess rather than checkers means also examining resiliency within the cloud after such an event. As more organizations evaluate their hybrid cloud strategies and look for viable alternatives to existing virtualization solutions in hyperscalers, the conversation inevitably turns to risk, reliability, and recovery. When disaster strikes or a planned migration is underway, the speed at which you can bring your workloads back online is critical.

Recently, as part of updating the NC2 on AWS Migration Nutanix Validated Design (NVD), we had the opportunity to perform recovery testing of NC2 on AWS. We wanted to see exactly what kind of performance gains and RTO improvements we could achieve when failing over a large-scale environment.

The results were eye-opening.

The Setup: Architecture and Design Constraints

For this validation, we split management and workload functions between purpose-built clusters, then packed 1,000 mixed-workload VMs onto a single cluster of just 9 i4i.metal workload-only AWS nodes running Nutanix AOS 7.5. The failover target was configured exactly the same as the source. This wasn't a lightweight test; the nodes were highly utilized to simulate a typical, dense enterprise environment. The VM profile breakdown included 50 large, 100 medium, and 850 small virtual machines, all with active workloads.
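To put the density in numbers, here is a minimal Python sketch of the arithmetic. The VM counts and node count come from the paragraph above; the even spread across nodes is an assumption made only for illustration, since the actual per-node placement is not described here.

```python
# VM counts and node count are taken from the article text.
VM_PROFILE = {"large": 50, "medium": 100, "small": 850}
WORKLOAD_NODES = 9


def vm_density(profile, nodes):
    """Return (total VMs, average VMs per node), assuming an even spread."""
    total = sum(profile.values())
    return total, total / nodes


def profile_share(profile):
    """Return each VM size's fraction of the total VM count."""
    total = sum(profile.values())
    return {size: count / total for size, count in profile.items()}


total_vms, per_node = vm_density(VM_PROFILE, WORKLOAD_NODES)
print(f"{total_vms} VMs on {WORKLOAD_NODES} nodes: about {per_node:.0f} VMs per node")
for size, share in profile_share(VM_PROFILE).items():
    print(f"{size}: {share:.0%} of VMs")
```

On average that works out to roughly 111 VMs per node, with small VMs making up 85% of the population.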

Because our design depends on only two Availability Zones (AZs) in a particular AWS region, a critical design decision was that there would be no partial failover between AZs. A failover is an all-or-nothing event for the workloads in that AZ, and the decision to fail over is made by a human. We used Nutanix Flow Virtual Networking (FVN) to manage the network routing, making manual External Routing Policy (ERP) changes to cut VMs and networks over to the newly active AZ.
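The all-or-nothing rule can be expressed as a simple pre-flight check. The sketch below is purely illustrative: the function `check_all_or_nothing` and the VM names are hypothetical and are not part of any Nutanix API or tooling. It only shows the idea of rejecting a plan that covers a subset of the AZ's workloads.

```python
def check_all_or_nothing(vms_in_az, vms_in_plan):
    """Raise ValueError unless the plan covers exactly the VMs in the AZ.

    Hypothetical helper for illustration; not a Nutanix API.
    """
    az, plan = set(vms_in_az), set(vms_in_plan)
    missing = az - plan        # VMs in the AZ that the plan leaves behind
    unexpected = plan - az     # VMs in the plan that do not belong to the AZ
    if missing or unexpected:
        raise ValueError(
            f"partial failover rejected: {len(missing)} VM(s) missing, "
            f"{len(unexpected)} unexpected"
        )
    return len(plan)


# Hypothetical VM names standing in for the 1,000 workloads.
az_vms = [f"vm-{i:04d}" for i in range(1, 1001)]

print(check_all_or_nothing(az_vms, az_vms))  # a full plan is accepted

try:
    check_all_or_nothing(az_vms, az_vms[:500])  # a half plan is rejected
except ValueError as err:
    print(err)
```

A check like this would sit in front of the human decision to fail over, not replace it.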

 

The Results: Blisteringly Fast RTO

For context, it's important to understand the mechanics behind our recovery methods and why they have different recovery times:

  • Unplanned Failover: In a true disaster scenario, this process simply registers and boots the VMs on the destination cluster using the last available replicated snapshot. Because there is no graceful shutdown or final data synchronization required from the source, this is typically the fastest failover method.
  • Planned Failover: This process involves gracefully shutting down the virtual machines at the source cluster, performing a final storage synchronization to capture the exact state at power-off, and then registering and powering on those VMs at the destination cluster. It provides a clean, consistent cutover with a brief window of downtime, but takes longer than an unplanned failover due to the shutdown and final sync phases.
  • Planned Failover with Live Migration: This method utilizes the live migration capabilities of Nutanix AHV to move the active compute state (memory and CPU) of the VMs across to the destination cluster without powering them down. This method is always the longest of the three because synchronizing the active memory state and disk changes across the wire while the VM is still running requires transferring significantly more data than a standard shutdown and boot sequence. However, it results in zero disruption to the running applications; the VMs simply continue executing on the new hardware.
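The three methods above differ mainly in which phases they run. The following Python sketch models them as plain data to make that comparison explicit. The phase names are paraphrased from the descriptions above, the `needs_source_online` and `vm_downtime` flags are inferred from them, and `choose_method` is a hypothetical decision helper, not a Nutanix feature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverMethod:
    name: str
    phases: tuple              # ordered steps, paraphrased from the text
    needs_source_online: bool  # assumption inferred from the descriptions
    vm_downtime: bool          # whether VMs are powered off or rebooted


UNPLANNED = FailoverMethod(
    "unplanned",
    ("register VMs from last replicated snapshot", "power on at destination"),
    needs_source_online=False,
    vm_downtime=True,
)
PLANNED = FailoverMethod(
    "planned",
    ("graceful shutdown at source", "final storage sync",
     "register at destination", "power on at destination"),
    needs_source_online=True,
    vm_downtime=True,
)
LIVE_MIGRATION = FailoverMethod(
    "planned with live migration",
    ("sync disk changes while VM runs", "transfer memory and CPU state",
     "continue executing at destination"),
    needs_source_online=True,
    vm_downtime=False,
)


def choose_method(source_online: bool, downtime_ok: bool) -> FailoverMethod:
    """Hypothetical helper: pick a method from two yes/no questions."""
    if not source_online:
        return UNPLANNED       # the source cannot take part in the failover
    return PLANNED if downtime_ok else LIVE_MIGRATION


for method in (UNPLANNED, PLANNED, LIVE_MIGRATION):
    print(f"{method.name}: " + " -> ".join(method.phases))
```

The trade-off described in the list shows up directly in the data: the unplanned path has the fewest phases, the planned path adds shutdown and sync, and live migration avoids downtime at the cost of moving more state.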

Operating under this all-or-nothing constraint, we initiated a planned failover of all 1,000 VMs between the two clusters. The entire process, including the manual FVN ERP changes to swing the routing, completed in under 30 minutes. That’s a massive block of VMs up and running in short order, achieving a true 0 RPO (Recovery Point Objective) and a sub-hour RTO. Even when leveraging live migration for the same workload on failback (which moves the active compute state), we saw the failover complete in under an hour.

Check out this quick video of the failover in action:

Continuous RTO Enhancements

It’s worth noting that the incredible speeds we saw during this validation aren't just a one-off anomaly. Nutanix has been relentlessly optimizing recovery performance. Between 2023 and 2025, we've delivered a 50-60%+ cumulative reduction in mean RTO across VMs protected under any RPO (from Async to Metro).
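To see what a cumulative reduction of this size means in practice, here is a short Python sketch of the arithmetic. Only the 50-60% range comes from the text above; the 60-minute baseline and the per-release percentages are hypothetical numbers chosen for illustration, not measured results.

```python
def rto_after_reduction(baseline_minutes, reduction):
    """Apply a fractional reduction (0.5 means 50%) to a baseline RTO."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return baseline_minutes * (1 - reduction)


def cumulative_reduction(per_release_reductions):
    """Combine successive fractional reductions into one cumulative figure.

    Reductions compound multiplicatively, so they do not simply add up.
    """
    remaining = 1.0
    for reduction in per_release_reductions:
        remaining *= 1 - reduction
    return 1 - remaining


# Hypothetical baseline; the article does not state absolute mean RTO values.
baseline = 60.0
for reduction in (0.50, 0.60):
    new_rto = rto_after_reduction(baseline, reduction)
    print(f"{reduction:.0%} reduction: {baseline:.0f} min -> {new_rto:.0f} min")

# Hypothetical per-release improvements, to show how they compound.
print(f"two 30% improvements combine to {cumulative_reduction([0.30, 0.30]):.0%}")
```

Under these assumed inputs, a 60-minute mean RTO would fall to somewhere between 24 and 30 minutes, and two successive 30% improvements would compound to 51% rather than 60%.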

[Figure: Recovery Time]

What’s truly impressive is that as the scale of the environment has increased with each release, the RTO has actually continued to improve as well. Throughout all of these enhancements, Nutanix has maintained full-stack consistency and predictability, designing so that the recovery process remains rock solid whether failing over a handful of critical databases or thousands of mixed-workload VMs.

To put that in perspective: many organizations, especially those with legacy systems, are often stuck with multi-hour or even multi-day outages during large-scale failovers or migrations. With NC2, those lengthy downtime windows can be put in the rearview mirror. When you can fail over 1,000 VMs in less time than a lunch break, the excuses for prolonged outages start to vanish.

Migrating away from other cloud-based virtualization solutions doesn't have to mean compromising on performance or taking on unacceptable levels of risk. In fact, our testing indicates the opposite. By leveraging NC2 on AWS, you not only retain the benefits of AWS public cloud services but also gain the operational simplicity and elite disaster recovery capabilities of the Nutanix Cloud Platform.

Whether you are looking to secure your environment against regional outages or simply execute a massive datacenter exit strategy, the numbers don't lie. NC2 delivers a different kind of world for disaster recovery: one where extreme performance and low RTO are just part of the standard design.

 

©2026 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product and service names mentioned are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. All other brand names mentioned are for identification purposes only and may be the trademarks of their respective holder(s). This content reflects an experiment in a test environment. Results, benefits, savings, or other outcomes described depend on a variety of factors including use case, individual requirements, and operating environments, and this publication should not be construed as a promise or obligation to deliver specific outcomes.