
NOS 4.0 Cluster Health – Preemptive Disk Failure Alerting for the ROBO

 

By Dwayne Lessner

Prior to 4.0, Nutanix would track disk errors, eventually mark a failing drive offline, and rebuild its data. Cluster Health gives the end user insight into what's going on before the drive is marked offline. Two of the 55 Cluster Health tests help reveal a disk that is gradually failing.

Stargate Disk Corruption Level
Checks whether the number of disk-corruption-related errors in the last log_collection_duration seconds exceeds disk_corruptions_threshold.

Stargate Disk Operations
Checks whether the number of Stargate disk operation errors in the last log_collection_duration seconds exceeds disk_read_write_errors_threshold.
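
Both checks follow the same pattern: count the errors seen within a recent time window and compare the count to a threshold. The sketch below illustrates that pattern only; it is not Nutanix's implementation. The parameter names log_collection_duration and the two thresholds come from the check descriptions above, while DiskErrorEvent, exceeds_threshold, and the sample values are hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class DiskErrorEvent:
    """Hypothetical record of one disk error parsed from a log."""
    timestamp: float  # seconds since the epoch
    disk_id: str


def exceeds_threshold(events, disk_id, now, log_collection_duration, threshold):
    """Return True if disk_id logged more than `threshold` errors
    in the last `log_collection_duration` seconds before `now`."""
    window_start = now - log_collection_duration
    count = sum(
        1
        for e in events
        if e.disk_id == disk_id and window_start <= e.timestamp <= now
    )
    return count > threshold


# Illustrative data only: two recent errors on disk "d1", one on "d2".
events = [
    DiskErrorEvent(100, "d1"),
    DiskErrorEvent(950, "d1"),
    DiskErrorEvent(990, "d1"),
    DiskErrorEvent(995, "d2"),
]

# Assumed values standing in for the real settings.
d1_failing = exceeds_threshold(events, "d1", now=1000,
                               log_collection_duration=300, threshold=1)
d2_failing = exceeds_threshold(events, "d2", now=1000,
                               log_collection_duration=300, threshold=1)
print(d1_failing, d2_failing)
```

The same function could serve either check by passing disk_corruptions_threshold or disk_read_write_errors_threshold as the threshold, assuming the two checks differ only in which errors they count.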

In most cases you should plan on replacing the drive shortly. Because a Nutanix cluster can quickly heal itself and then tolerate the loss of additional drives, this is a great fit for remote sites where it may be hard to get people on site. If you can see that a drive is failing and the cluster has enough capacity remaining, you can bring the replacement drives on your next scheduled trip instead of making a special trip.
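
Whether the replacement can wait for a scheduled trip comes down to a capacity question: would the data already stored still fit if the failing drive were gone? A rough back-of-the-envelope check might look like the following. The function name, the headroom factor, and the sample numbers are assumptions for illustration; a real cluster reports its own usage and resiliency figures, which should be trusted over this arithmetic.

```python
def can_defer_replacement(drive_capacities_tb, used_tb, failing_index, headroom=0.9):
    """Rough check: does the used data fit on the remaining drives?

    drive_capacities_tb: raw capacity of each drive in the cluster (TB)
    used_tb: raw capacity currently consumed, replicas included (TB)
    failing_index: position of the failing drive in drive_capacities_tb
    headroom: assumed fraction of remaining capacity we are willing to fill
    """
    remaining_tb = sum(
        cap for i, cap in enumerate(drive_capacities_tb) if i != failing_index
    )
    return used_tb <= remaining_tb * headroom


# Illustrative numbers only: four 4 TB drives, drive 0 is failing.
drives = [4, 4, 4, 4]
print(can_defer_replacement(drives, used_tb=10, failing_index=0))
print(can_defer_replacement(drives, used_tb=11, failing_index=0))
```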

@dlink7