We announced our Metro Availability solution last week. More details on this announcement may be found here. Simply put, Metro Availability policies complete the data availability spectrum by enabling customers to keep applications online even during full site disasters. Being seamless is key: when things fail, the system should handle the failure automatically so that applications and end users never see disruptions or even blips. Regardless of the failure (a component such as a disk, memory, motherboard or power supply, or an entire server, rack or site), applications stay online without any downtime.
Metro Availability is the last leg of the data availability spectrum, with the express goal of minimizing or eliminating application downtime during unplanned failure events like site disasters and planned events such as site maintenance.
Unplanned downtime covers unexpected events such as site disasters, hardware or software failures, and human error. Planned downtime covers scheduled activities like datacenter maintenance, technology refreshes, software upgrades or patching, application migrations, and any other event requiring a maintenance window or outage.
Nutanix Metro Availability is a policy applied to a datastore containing one or many virtual machines, protecting them against unplanned and planned downtime. Think of the policy as effectively “spanning or stretching” the datastore across two sites. On site failure, the policy relies on hypervisor HA (like VMware HA) to spin up virtual machines on the partner site, where the virtual machines continue serving data from the datastore.
Handling Site Disasters
When a site fails, hypervisor HA re-starts virtual machines on the partner site. Since the datastore spans both sites, re-started virtual machines see their virtual disks on the surviving partner site and continue serving data from where they previously left off. Our solution, like all competing metro clustering solutions, assumes a layer-2 network across the two sites.
Performing Site Maintenance
Leveraging the same site disaster handling mechanisms, customers may non-disruptively move (using vMotion, for example) virtual machines to hosts on the partner site. Since the datastore spans both sites, the migrated virtual machines see their virtual disks and therefore continue serving data from where they previously left off.
Metro Availability Policy Requirements
Ensuring datastore availability across two sites requires data to be present at both sites in real time. Therefore, the round-trip time (RTT) between the sites must be low enough to make this solution viable. We support round-trip latencies of 5ms and below. Inter-site bandwidth should be sized according to the application’s write profile, and our recommendation is to maintain enough bandwidth to accommodate peak writes.
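To make the two requirements concrete, here is a minimal sketch of a link sanity check. It is not part of the product: `LinkProfile` and `link_issues` are hypothetical names, and the only figure taken from this post is the 5ms round-trip limit. The bandwidth rule (link capacity at least equal to the peak write rate) is a simplified reading of the recommendation above.

```python
from dataclasses import dataclass

MAX_RTT_MS = 5.0  # supported round-trip latency limit quoted in this post


@dataclass
class LinkProfile:
    """Hypothetical description of an inter-site link and its workload."""
    rtt_ms: float               # measured round-trip time between the sites
    bandwidth_mbps: float       # usable inter-site bandwidth, in megabits/s
    peak_write_mbytes_s: float  # application's peak write rate, in megabytes/s


def link_issues(link: LinkProfile) -> list[str]:
    """Return the problems found; an empty list means the link looks adequate."""
    issues = []
    if link.rtt_ms > MAX_RTT_MS:
        issues.append(f"RTT {link.rtt_ms} ms exceeds {MAX_RTT_MS} ms")
    required_mbps = link.peak_write_mbytes_s * 8  # megabytes/s -> megabits/s
    if link.bandwidth_mbps < required_mbps:
        issues.append(
            f"bandwidth {link.bandwidth_mbps} Mbit/s is below peak writes "
            f"({required_mbps} Mbit/s)"
        )
    return issues
```

Calling `link_issues(LinkProfile(rtt_ms=3.0, bandwidth_mbps=10000, peak_write_mbytes_s=200))` returns an empty list, while a 7.5ms link with 1000 Mbit/s of bandwidth and the same write peak would report both problems.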
Applying the Metro Availability Policy
Nutanix clusters already in operation can apply the Metro Availability policy once upgraded to the new release (NOS 4.1). Metro Availability policies are enabled on Nutanix containers (containers correspond to datastores as far as virtual machines are concerned). Enabling the policy requires a few clicks and shouldn’t take more than a few minutes. Contrast our simplicity of deployment with the solutions on the market today. I haven’t heard of any competing solution deployed without expensive professional services lasting at least a week.
Metro Availability policies interoperate seamlessly with other data management policies on the cluster like Deduplication, Compression & Redundancy factor, to name a few.
Setup, Management & Monitoring
Metro Availability policies are managed from the Nutanix UI – Prism. Setup should take a couple of minutes and a few clicks.
Choose Data Protection from the main menu on the Prism home page, then identify the Metro Availability tab. This tab displays all Metro Availability relationships currently on the system:
Set up a new policy by choosing Metro Availability from the Protection Domain menu:
A wizard will walk you through the setup process. Be sure to choose storage containers with the same name on both sites (in this example we use ‘test’) and the desired partner site (remote site). The wizard runs the necessary pre-checks on latency, available capacity and consistent container naming, and allows the operation to proceed only if they pass.
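The three pre-checks can be pictured roughly as follows. This is an illustrative sketch only, not the wizard's actual logic: the `Site` class, the `precheck` function and the capacity rule (the standby site needs at least as much free space as the active container currently uses) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical view of one site; not a Nutanix data structure."""
    name: str
    containers: dict[str, int]  # container name -> bytes currently used
    free_bytes: int             # free capacity at this site


def precheck(active: Site, standby: Site, container: str,
             rtt_ms: float, max_rtt_ms: float = 5.0) -> list[str]:
    """Return the names of failed checks; an empty list means all passed."""
    failures = []
    # Latency: the inter-site round trip must be within the supported limit.
    if rtt_ms > max_rtt_ms:
        failures.append("latency")
    # Naming: the same container name must exist on both sites.
    if container not in active.containers or container not in standby.containers:
        failures.append("container naming")
    # Capacity (assumed rule): standby must have room for the active data.
    if active.containers.get(container, 0) > standby.free_bytes:
        failures.append("capacity")
    return failures
```

With a container named ‘test’ on both sites, a 2ms round trip and enough free space on the standby side, `precheck` returns an empty list and the setup would be allowed to continue.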
Start by giving the policy a name:
Choose the storage container (datastore) where you will enable the policy:
Choose the correct partner site:
Choose how network or standby site failures are handled. When the network or partner site fails, VM writes temporarily halt to preserve consistency and avoid losing transactions in a situation like a rolling site failure. VM writes resume only once the policy is disabled, which can be done manually by the end user or automatically after a timeout:
Review your policy and then click Create:
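The difference between the manual and automatic options can be sketched as a small state model. This is a simplified illustration under stated assumptions, not how the product is implemented: `MetroPolicy`, its fields and the `tick` method are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetroPolicy:
    """Hypothetical model of one policy on the active site."""
    auto_disable_after_s: Optional[float]  # None means manual handling
    enabled: bool = True
    partner_lost_at: Optional[float] = None

    def partner_lost(self, now: float) -> None:
        """Record when the network or partner site became unreachable."""
        if self.partner_lost_at is None:
            self.partner_lost_at = now

    def disable(self) -> None:
        """Disable the policy (the manual action, or the timeout firing)."""
        self.enabled = False

    def tick(self, now: float) -> None:
        """Apply the automatic timeout, if one was configured."""
        if (self.enabled
                and self.partner_lost_at is not None
                and self.auto_disable_after_s is not None
                and now - self.partner_lost_at >= self.auto_disable_after_s):
            self.disable()

    def writes_allowed(self) -> bool:
        """Writes halt while the policy is enabled and the partner is lost."""
        return not self.enabled or self.partner_lost_at is None
```

In this model, a policy created with `auto_disable_after_s=None` keeps writes halted until someone calls `disable()`, whereas a policy with a 30-second timeout resumes writes on the first `tick` at or after 30 seconds from the failure.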
The newly created policy appears in the Metro Availability tab on both the Active and Standby sites. In this example, Brussels is ‘Active’ and Amsterdam is ‘Standby’:
Note: Active vs. Standby applies to the policy, not the cluster. Another policy could easily be created [on a different datastore] where Amsterdam is Active and Brussels is Standby.
After setup, you can monitor policy health, inter-site characteristics like latency & bandwidth and standard capacity metrics. Anomalies are flagged through events & alerts with the associated criticality (warning vs. critical):
When a site fails, promote the container (datastore) on the Standby site to Active using 1-click: