When it comes to healing I think most people would agree with:
- Corrective action should be taken right away
- Fixing the underlying issue shouldn’t cause something else to fail
Wouldn’t it make sense to take that same philosophy and apply it to your infrastructure?
Nutanix, being a converged distributed solution, is designed for failure. It’s not a matter of if, but when. Distributed systems are extremely complex, for reasons that include:
- Implementing consistent, distributed, fine-grained metadata.
- No single node has complete knowledge.
- Isolating faulty peers through consensus.
- Handling communication with peers running older or newer software during a rolling upgrade.
These are the reasons you need safeguards in place. This post explains how Nutanix heals from drive failure and node failure while letting current workloads carry on as if nothing were out of the ordinary.
Nutanix Failed Disk Drive = Recovery starts immediately
Nutanix Node Failure = Recovery is started in 60 seconds
Running workloads are not forced to go under the knife looking for the failed component. Nutanix prioritizes internal replication traffic. Each node has a queue called Admission Control. VM IO (front-end adapter) and maintenance tasks have a 75/25 split on each node in the cluster.
When a drive or node goes down, the metadata is quickly scanned to see which workloads have been affected. This work is evenly distributed among all the nodes in the cluster as a MapReduce job. The replication tasks are queued into a cluster-wide background task scheduler and trickle-fed to the nodes as their resources permit. More nodes, faster rebuild. By default, 10 tasks per second out of the 25 can be used for internal replication. This number can be changed, but leaving it alone is recommended. Internal replication tasks have a higher priority than auto-tiering, but auto-tiering won’t be starved. Remember that Nutanix has the ability to write to a remote node but prefers to write locally for speed.
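The split described above can be sketched as a small per-node budget. This is only an illustration: the post gives the 75/25 split and the 10-out-of-25 default, but treating them as slots out of a total of 100 is an assumption, and `AdmissionBudget` and its field names are hypothetical, not Nutanix identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdmissionBudget:
    """Hypothetical per-node budget; names and the 100-slot total are assumptions."""

    total_slots: int = 100         # assumed unit so that 75/25 maps to whole slots
    front_end_share: float = 0.75  # VM IO share, from the 75/25 split in the post
    replication_slots: int = 10    # default replication share of the maintenance 25

    @property
    def front_end_slots(self) -> int:
        # Slots reserved for VM IO (the front-end adapter).
        return round(self.total_slots * self.front_end_share)

    @property
    def maintenance_slots(self) -> int:
        # Everything that is not VM IO.
        return self.total_slots - self.front_end_slots

    @property
    def other_maintenance_slots(self) -> int:
        # What remains for tasks such as auto-tiering, so they are not starved.
        return self.maintenance_slots - self.replication_slots


budget = AdmissionBudget()
print(budget.front_end_slots, budget.maintenance_slots, budget.other_maintenance_slots)
```

Under these assumptions, replication never takes the whole maintenance share, which matches the point that auto-tiering is deprioritized but not starved.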
In a four-node cluster, that gives 40 parallel replication tasks across the cluster. As of today, data is moved around in what we call an extent group. An extent group is 4 MB in size, which means we are trying to replicate a maximum of 40 * 4 = 160 MB of data at any given time. Nutanix did use a 16 MB extent group for a while but found it could move more data using the smaller size, with less impact on the IO path. This type of knowledge only comes through product maturity with the code base.
Since data is spread equally across the cluster, and the bandwidth reserved per disk for replication is 75 MB/s (including reading the replica and writing the replica), the maximum bandwidth across nodes will be 75 MB/s * number of disks, which is significantly higher than the 160 MB in flight. So we can assume that the maximum replication throughput will be about 160 MB/s.
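The reasoning above can be written out as a few lines of arithmetic. This is a rough sketch: the function names are made up for illustration, the 16-disk count in the example is an assumption (the post does not say how many disks the four-node cluster has), and, like the text, it treats the amount of data in flight as an approximate MB/s ceiling.

```python
EXTENT_GROUP_MB = 4         # extent group size from the post
TASKS_PER_NODE = 10         # default replication tasks per node
DISK_REPLICATION_MB_S = 75  # bandwidth reserved per disk for replication


def in_flight_mb(nodes: int) -> int:
    """Data being replicated at once: nodes * tasks per node * extent group size."""
    return nodes * TASKS_PER_NODE * EXTENT_GROUP_MB


def disk_bound_mb_s(total_disks: int) -> int:
    """Upper bound set by the per-disk replication reservation."""
    return total_disks * DISK_REPLICATION_MB_S


def replication_ceiling_mb_s(nodes: int, total_disks: int) -> int:
    """The smaller of the two bounds wins (same simplification as the text)."""
    return min(in_flight_mb(nodes), disk_bound_mb_s(total_disks))


# Four nodes; 16 disks is an assumed figure for illustration only.
print(in_flight_mb(4))                  # task-based bound
print(disk_bound_mb_s(16))              # disk-based bound
print(replication_ceiling_mb_s(4, 16))  # effective ceiling
```

With any realistic disk count the task-based bound is the limiting one, which is why the text settles on 160 MB/s for four nodes.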
If we were to take a 32-node Nutanix cluster (512 TB raw) made up of 6260s, each with 4 * 4 TB drives, at 50% capacity, this is how we would figure out the rebuild time of a drive:
- 32 nodes * 40 MB (10 tasks * 4 MB per node) = 1280 MB/s of rebuild throughput. One disk is not the bottleneck, as the data is placed on all of the drives based on proper redundancy.
- 4 TB at 50% = 2 TB of actual data to rebuild = ~28 minutes to rebuild a 4 TB drive under heavy load. If you had to rebuild the whole node, it would take ~1.8 hours.
Lightly loaded clusters will complete these tasks faster, and therefore get back to parity quicker.
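For anyone who wants to plug in their own cluster size, here is the same calculation as a small script. It is a back-of-the-envelope sketch, not a sizing tool: `rebuild_minutes` is a hypothetical helper, it assumes 1 TB = 1024 * 1024 MB, and it assumes every node contributes the full 40 MB/s for the whole rebuild.

```python
MB_PER_TB = 1024 * 1024  # assumption: binary units


def rebuild_minutes(data_tb: float, nodes: int, mb_per_node_s: float = 40) -> float:
    """Minutes to re-replicate data_tb of data across a cluster of `nodes` nodes."""
    throughput_mb_s = nodes * mb_per_node_s  # 32 * 40 = 1280 MB/s in the example
    return data_tb * MB_PER_TB / throughput_mb_s / 60


# One 4 TB drive at 50% capacity holds 2 TB of actual data.
drive_minutes = rebuild_minutes(data_tb=2, nodes=32)

# A whole node is 4 such drives, so 8 TB of actual data.
node_hours = rebuild_minutes(data_tb=8, nodes=32) / 60

print(f"drive: ~{drive_minutes:.0f} min, node: ~{node_hours:.1f} h")
```

The drive figure comes out just over 27 minutes, in line with the ~28 minutes quoted above, and the node figure lands at about 1.8 hours.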
One of the biggest concerns is rebuild time. When data is limited to certain disks, a rebuild after a 4 TB HDD failure could take days. The failure of a second HDD could push the rebuild time to a week or so, and the data is vulnerable while the disks are being rebuilt.
A fast, low-impact rebuild is only possible when you spread the data across the cluster at write time instead of limiting yourself to certain disks. You also need a mechanism to level the data in the cluster when it becomes unbalanced. Nutanix provides both of these core services for distributed converged storage. Not having a mechanism to balance your storage can put your cluster at significant risk of data loss and degraded performance.
How many times can you go under the knife?