NOS 4.0 Cluster Health – Slices & Dices

 

By Dwayne Lessner

It slices and it dices! Nutanix Cluster Health is a new feature that will be another great asset in maintaining availability for your Tier 1 workloads. Cluster Health lets you monitor and visualize the overall health of cluster nodes, VMs, and disks from a variety of views. With the ability to set different HA requirements at the application level, Cluster Health visually dissects what’s important and gives you guidance on how to take corrective action.

[Figure: Multiple views to meet your needs]

Once inside the Cluster Health section in Prism, you have access to over 55 tests, and no additional setup is required other than upgrading to 4.0.

Some of the tests include:
VM Related
o Dropped Packets – Possibly a VMware Tools issue with the NIC driver
o CPU Ready Time
o Memory Ballooning
Networking
o Checking Latency between Virtual Storage Controllers
o Insight into which Virtual Storage Controllers are rerouting traffic for availability.
o 10 Gb NIC – Ensuring that the Storage Controller is using a 10 Gb NIC. I’ve seen cases where the 10 Gb connection failed over to a standby 1 Gb NIC and then never failed back to the 10 Gb NIC.

Hardware
o SMART Status – Ability to run the SMART tools on any hard drive if needed.
ClusterNOS
o Bit Rot Detection
o Collision Rates between storage controllers
o Identify large network latencies between CVMs that cause service crashes
o Identify bad CVM health – down/hung/extremely busy
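
To make the 10 Gb NIC item in the list above a little more concrete, here is a minimal Python sketch of that kind of check. It is not how Cluster Health is implemented, and it does not call any Nutanix API; the `NicStatus` record, the controller names, and the sample speeds are all hypothetical and only illustrate the idea of flagging a controller whose active link is slower than expected.

```python
from dataclasses import dataclass

TEN_GBE_MBPS = 10_000  # nominal 10 Gb link speed, expressed in Mbps


@dataclass
class NicStatus:
    """Hypothetical record describing a storage controller's active NIC."""
    controller: str
    active_nic: str
    speed_mbps: int


def find_degraded_links(statuses, minimum_mbps=TEN_GBE_MBPS):
    """Return the statuses whose active NIC is slower than the minimum."""
    return [s for s in statuses if s.speed_mbps < minimum_mbps]


# Made-up sample data: the second controller failed over to a 1 Gb standby NIC.
sample = [
    NicStatus("cvm-a", "eth2", 10_000),
    NicStatus("cvm-b", "eth0", 1_000),
]

for status in find_degraded_links(sample):
    print(f"{status.controller}: active NIC {status.active_nic} "
          f"is running at {status.speed_mbps} Mbps")
```

A check like this catches the failover case described above, where traffic quietly stays on the 1 Gb standby link.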

While the test defaults are set for optimal cluster health, the feature has been built so you can customize them, for example to run tests more often while troubleshooting or to run a quick health check before an upgrade.
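
As a rough illustration of customizing test runs, the sketch below keeps a set of default schedules and applies shorter intervals for a troubleshooting session. Everything here is an assumption made for the example: the check names, the default intervals, and the `with_overrides` helper are hypothetical and do not reflect the actual Cluster Health settings or defaults.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CheckSchedule:
    """Hypothetical schedule entry for a single health check."""
    name: str
    interval_seconds: int
    enabled: bool = True


# Made-up defaults, not the product's real values.
DEFAULTS = {
    "cvm_latency": CheckSchedule("cvm_latency", 300),
    "smart_status": CheckSchedule("smart_status", 3600),
}


def with_overrides(defaults, overrides):
    """Return a new schedule map with per-check interval overrides applied."""
    result = dict(defaults)
    for name, interval in overrides.items():
        if name not in result:
            raise KeyError(f"unknown check: {name}")
        if interval <= 0:
            raise ValueError("interval must be positive")
        result[name] = replace(result[name], interval_seconds=interval)
    return result


# Run the latency check every 30 seconds while troubleshooting.
troubleshooting = with_overrides(DEFAULTS, {"cvm_latency": 30})
print(troubleshooting["cvm_latency"])
```

Returning a new map rather than editing the defaults in place makes it easy to go back to the recommended settings once troubleshooting is done.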

[Figure: A history and schedule of past tests]

Finding a problem is great, but finding out how to fix it is even better. Cluster Health also includes the cause and resolution where possible, so you can address issues in the shortest amount of time.
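
One way to picture a test result that carries its own cause and resolution is the small Python sketch below. The `CheckResult` structure, the check name, and the example text are hypothetical and only show the shape of the idea; they are not taken from the product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    """Hypothetical result of one health check."""
    check: str
    passed: bool
    cause: Optional[str] = None
    resolution: Optional[str] = None


def summarize(result):
    """Format a result, including cause and resolution when they are known."""
    if result.passed:
        return f"{result.check}: OK"
    lines = [f"{result.check}: FAILED"]
    if result.cause:
        lines.append(f"  Cause: {result.cause}")
    if result.resolution:
        lines.append(f"  Resolution: {result.resolution}")
    return "\n".join(lines)


# Made-up example of a failed check with guidance attached.
example = CheckResult(
    check="10 Gb NIC in use",
    passed=False,
    cause="Active link failed over to the 1 Gb standby NIC",
    resolution="Restore the 10 Gb link and fail back to it",
)
print(summarize(example))
```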

[Figure: Showing the cause and resolution for a problem]

Taking local storage and forming a storage fabric spread out over many nodes/servers can be very complex. As the cluster size grows, the chances of getting faults increase. Quick isolation and resolution of a problem are important to maintaining high availability. A lot of the tests and warnings were in the product before, but you had to use the CLI or call Nutanix Support and get a Support Reliability Engineer (SRE) on the phone. With NOS 4.0, it’s like having an SRE monitoring your cluster 24-7. Having a highly functional visual aid empowers everyone to use the right side of their brain to quickly come to a resolution.

Cluster Health is a big data application that is just the start of a larger analytics commitment at Nutanix. The first release is impressive and only a sign of things to come from some of the best engineering minds in the world.

@dlink7