Question 1

What is RTO (Recovery Time Objective) and why is it important?

Accepted Answer

RTO is the maximum acceptable downtime for your systems after a failure or disaster. It is critical because it directly affects business continuity: the faster you recover, the less revenue and productivity are lost during an outage. When failing over workloads to the cloud, achieving a short RTO can be a strategic priority for IT organizations.

Question 2

What is RPO (Recovery Point Objective) and how does it differ from RTO?

Accepted Answer

RPO is the maximum acceptable amount of data loss, measured in time. It determines how frequently you need to back up or replicate your data. While RTO measures how long recovery takes, RPO measures how far back in time the recovered data may be. The article demonstrates achieving a 'Zero RPO' through continuous replication, meaning no data loss during failover, alongside a sub-hour RTO.
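To make the distinction concrete, here is a minimal sketch of how the two values could be computed from event timestamps. The function names, the timestamps, and the 28-minute recovery are hypothetical illustrations, not figures or tooling from the article.

```python
from datetime import datetime, timedelta


def achieved_rto(failure_at: datetime, restored_at: datetime) -> timedelta:
    """Downtime actually experienced: from the failure until service is restored."""
    return restored_at - failure_at


def achieved_rpo(failure_at: datetime, last_replicated_at: datetime) -> timedelta:
    """Data-loss window: from the last replicated write until the failure."""
    return failure_at - last_replicated_at


def meets_objectives(failure_at, last_replicated_at, restored_at,
                     rto_target: timedelta, rpo_target: timedelta) -> bool:
    """True when both the measured downtime and data-loss window are within target."""
    return (achieved_rto(failure_at, restored_at) <= rto_target
            and achieved_rpo(failure_at, last_replicated_at) <= rpo_target)


# Hypothetical timeline: replication is current at the moment of failure,
# and service is back 28 minutes later.
failure = datetime(2025, 1, 1, 12, 0)
last_replicated = failure
restored = failure + timedelta(minutes=28)

rto = achieved_rto(failure, restored)
rpo = achieved_rpo(failure, last_replicated)
ok = meets_objectives(failure, last_replicated, restored,
                      rto_target=timedelta(hours=1),
                      rpo_target=timedelta(0))
```

With these assumed timestamps the data-loss window is zero and the downtime is under an hour, so both objectives are met.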

Question 3

What was the scale of the Nutanix Cloud Clusters (NC2) failover validation test?

Accepted Answer

The validation involved 1,000 mixed-workload virtual machines (50 large, 100 medium, and 850 small VMs) running on a single cluster of 9 i4i.metal AWS nodes. The environment was highly utilized with active workloads to simulate a typical dense enterprise environment, making this a real-world scale test.
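As a quick sanity check on those numbers, the following sketch totals the VM mix and derives an average density per node. The per-node average is a derived figure that assumes an even spread across nodes, which the article does not state.

```python
# VM counts and node count as given in the answer above.
vm_mix = {"large": 50, "medium": 100, "small": 850}
nodes = 9

total_vms = sum(vm_mix.values())

# Assumption: VMs are spread evenly across nodes (not stated in the article).
avg_vms_per_node = total_vms / nodes

# Fraction of the fleet in each size class.
shares = {size: count / total_vms for size, count in vm_mix.items()}

print(f"total VMs: {total_vms}")
print(f"average VMs per node: {avg_vms_per_node:.1f}")
for size, share in shares.items():
    print(f"{size}: {share:.0%}")
```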

Question 4

How fast was the planned failover for 1,000 VMs in the test?

Accepted Answer

The entire planned failover process, including manual network routing changes via Nutanix Flow Virtual Networking (FVN), completed in under 30 minutes. This represents zero RPO with a sub-hour RTO, significantly faster than traditional multi-hour or multi-day failovers.

Question 5

What are the three types of failover methods discussed?

Accepted Answer

1. Unplanned Failover: The fastest method. It registers and boots VMs from the last replicated snapshot without a graceful shutdown.
2. Planned Failover: Involves a graceful shutdown and a final sync, then registration and power-on at the destination.
3. Planned Failover with Live Migration: Uses AHV to migrate active compute state without downtime, resulting in zero application disruption but a longer total time.
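The three methods above can be sketched as ordered step lists. This is an illustrative data model only: the dictionary keys and step wording are hypothetical paraphrases of the descriptions above, not a Nutanix API.

```python
# Hypothetical model of the three failover methods; step names paraphrase
# the descriptions in the answer and do not correspond to real API calls.
FAILOVER_METHODS = {
    "unplanned": {
        "steps": [
            "register VMs at destination from last replicated snapshot",
            "boot VMs at destination",
        ],
        "graceful_shutdown": False,
        "zero_disruption": False,
    },
    "planned": {
        "steps": [
            "gracefully shut down VMs at source",
            "run final sync to destination",
            "register VMs at destination",
            "power on VMs at destination",
        ],
        "graceful_shutdown": True,
        "zero_disruption": False,
    },
    "planned_live_migration": {
        "steps": [
            "migrate active compute state to destination while VMs keep running",
        ],
        "graceful_shutdown": False,
        "zero_disruption": True,
    },
}


def describe(method: str) -> list[str]:
    """Return the numbered steps for a method, in execution order."""
    steps = FAILOVER_METHODS[method]["steps"]
    return [f"{i}. {step}" for i, step in enumerate(steps, start=1)]


for line in describe("planned"):
    print(line)
```

Keeping the steps as data rather than code makes the ordering difference between the methods easy to compare side by side.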

Question 6

Why does live migration failover take longer than planned failover?

Accepted Answer

Live migration requires transferring significantly more data across the network, including the active memory state and the disk changes made while VMs are still running. The result is zero application disruption and a seamless transition, but the process takes longer than a standard shutdown-and-boot sequence because of the additional data that must be synchronized.

Question 7

What network technology did Nutanix use to manage the failover routing?

Accepted Answer

Nutanix Flow Virtual Networking (FVN) was used to manage network routing, with manual External Routing Policy (ERP) changes to facilitate the cutover to the new active Availability Zone for VMs and networks.

Question 8

How much has Nutanix improved RTO performance between 2023 and 2025?

Accepted Answer

Nutanix delivered a cumulative reduction of 50-60% or more in mean RTO across VMs protected under any RPO level (from Async to Metro) between 2023 and 2025. RTO continued to improve even as the scale of protected environments increased with each release.

Question 9

What architectural constraint was used for the failover test?

Accepted Answer

The design used two Availability Zones (AZs) in an AWS region with a strict constraint of no partial failover between AZs. When a failover occurs, it's an all-or-nothing event for workloads in that AZ and requires a human decision to initiate, ensuring controlled and deliberate recovery operations.

Question 10

What does the article suggest about migrating from other cloud virtualization solutions?

Accepted Answer

The article demonstrates that migrating to NC2 on AWS doesn't require compromising on performance or taking on unacceptable risk. By leveraging NC2, organizations retain the benefits of AWS public cloud services while gaining Nutanix Cloud Platform's operational simplicity and disaster recovery capabilities.

Question 11

How does NC2 failover performance compare to traditional systems?

Accepted Answer

Many organizations with legacy systems experience multi-hour or even multi-day outages during large-scale failovers. With NC2, these lengthy downtime windows can be avoided: the validation recovered 1,000 VMs in less than 30 minutes.

Question 12

What was the testing environment setup for this validation?

Accepted Answer

The validation split management and workload functions between purpose-built clusters. The workload cluster consisted of 9 i4i.metal AWS nodes running Nutanix AOS 7.5. The failover target was configured identically to the source, ensuring consistent performance testing and realistic failover simulation.

Question 13

Is this RTO performance a one-time anomaly or ongoing standard?

Accepted Answer

The article notes that these performance speeds are not a one-off anomaly. Nutanix has been continuously optimizing recovery performance, with consistent improvements across releases while maintaining full-stack consistency and predictability for both small and large-scale failovers.

Question 14

What are the key benefits of achieving sub-hour RTO with zero RPO?

Accepted Answer

Sub-hour RTO with zero RPO provides businesses with extreme performance and reliability for disaster recovery, enabling rapid recovery from regional outages or data center exits. This eliminates excuses for prolonged outages and significantly reduces recovery-related business impact and costs.

Zero RPO and Fast RTO: What We Learned from NC2 Failover Validation