Next-Generation Serviceability: The Four Keys from Nutanix

Next Generation Serviceability

The serviceability of any product relies on four fundamental actions: (i) observe, (ii) inform, (iii) fix, and (iv) learn. While most products attempt to implement these in some form or another, very few get it right. Today I would like to share with you the secret sauce required to build the most serviceable system in the world.

1. Reliable observation. This is the toughest problem and also the easiest to overlook. The component that observes failures should, by definition, be the most reliable component in the system. It needs to work even when everything else is failing. In today’s complex distributed systems, it needs to be truly highly available, it needs to be resistant to network partitions, and it needs to record data reliably. For high availability and partition tolerance, Nutanix has had great success with Apache ZooKeeper. For data storage, any reliable database is fine. We use a NoSQL database with a replication factor of 3. Go this extra mile and you will never miss a failure.
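
To make the "record data reliably" point concrete, here is a minimal sketch of writing a failure event to three replicas and accepting it once a majority acknowledge. Only the replication factor of 3 comes from the text above. The Replica class, the record_failure function, the majority rule, and the event fields are all assumptions made for illustration; this is not how the Nutanix database works.

```python
from dataclasses import dataclass, field

REPLICATION_FACTOR = 3  # from the text; everything else here is hypothetical


@dataclass
class Replica:
    """Hypothetical in-memory stand-in for one database replica."""
    name: str
    healthy: bool = True
    events: list = field(default_factory=list)

    def write(self, event: dict) -> bool:
        # An unhealthy replica rejects the write instead of storing it.
        if not self.healthy:
            return False
        self.events.append(event)
        return True


def record_failure(replicas: list, event: dict) -> bool:
    """Attempt the write on every replica; succeed if a majority acknowledge."""
    acks = sum(1 for replica in replicas if replica.write(event))
    quorum = len(replicas) // 2 + 1
    return acks >= quorum


replicas = [Replica(f"node-{i}") for i in range(REPLICATION_FACTOR)]
replicas[1].healthy = False  # simulate one node being down
ok = record_failure(replicas, {"component": "disk-7", "error": "io_timeout"})
```

With one of the three nodes down, the event is still recorded on the other two, which is the property the observer needs: it keeps working while other things fail.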

2. No-noise information. While it is fairly easy to get systems to “call home” when things go wrong, it is both a science and an art to make the call homes precise and meaningful. First of all, never send multiple emails for the same underlying problem, even if it means comparing stack traces in core dumps or doing automated root cause analysis on the box itself. In fact, we designed a 2-step call home for any error. In the first step, the cluster contacts Nutanix with the error code and consults a database of current issues. If there is a match, the real call home is sent with an analysis of the problem. Our L1 support could see a call home that reads “Hit bug # 3212. Version 2.5.0-stable has a fix for this”. The primary and relentless goal should be to make the support organization highly efficient by improving the quality of the information flowing in from the customer machines. For cross-customer repeat problems, this reduces resolution time from hours to just minutes.
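
The deduplication and the 2-step lookup described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual Nutanix implementation: the CallHome class, the stack_signature helper, the error code "E1042", and the in-memory dictionary standing in for the database of current issues are all hypothetical. Only the bug number and version string are taken from the example message above.

```python
import hashlib
import re


def stack_signature(trace: str) -> str:
    """Normalize a stack trace so repeats of the same bug hash identically."""
    frames = []
    for line in trace.strip().splitlines():
        line = re.sub(r"0x[0-9a-fA-F]+", "ADDR", line)  # drop memory addresses
        line = re.sub(r":\d+", "", line)                # drop line numbers
        frames.append(line.strip())
    return hashlib.sha256("\n".join(frames).encode()).hexdigest()[:12]


class CallHome:
    """Hypothetical sender: one message per underlying problem."""

    def __init__(self, known_issues: dict):
        self.known_issues = known_issues  # stand-in for the issues database
        self.already_sent = set()
        self.outbox = []

    def report(self, error_code: str, trace: str):
        key = (error_code, stack_signature(trace))
        if key in self.already_sent:
            return None  # same underlying problem: stay quiet
        self.already_sent.add(key)
        # Step 1: consult the database of current issues by error code.
        match = self.known_issues.get(error_code)
        if match:
            bug, fixed_in = match
            message = f"Hit bug # {bug}. Version {fixed_in} has a fix for this"
        else:
            message = f"Unrecognized error {error_code}, signature {key[1]}"
        # Step 2: send the real call home, with the analysis attached.
        self.outbox.append(message)
        return message


home = CallHome({"E1042": (3212, "2.5.0-stable")})
first = home.report("E1042", "crash at 0x7ffe12\n  io.cc:120 in Read()")
second = home.report("E1042", "crash at 0x7aaa00\n  io.cc:131 in Read()")
```

The two traces differ only in addresses and line numbers, so they normalize to the same signature and the second report is suppressed. How aggressively to normalize is a judgment call: too little and support gets duplicates, too much and distinct bugs collapse into one.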

3. Live analysis. Never (or rarely) ship logs out from customer machines for troubleshooting. In my opinion, that is just a lame excuse for an on-call developer to have an extended lunch break. All debugging and analysis tools need to be available in real time on the live box. The goal should be to resolve the issue before the customer finishes his lunch. To make this happen, we use (i) “reverse tunnels” that can create an on-demand VPN from the system to the Nutanix Service Center with the click of a button, (ii) distributed log viewing and analysis right on the box, (iii) on-demand, cluster-wide visualization of live and historical stats, (iv) click-of-a-button problem analysis tools on the box, (v) live, on-demand upgrades of the debugging tool chest, and (vi) adherence to a “be nice” philosophy, which means that troubleshooting should consume only unused CPU and memory resources in the system. None of these are trivial to implement, but the founders of Nutanix decided to build a product for the long haul, and the early investments are bearing fruit today.
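
As a small illustration of the “be nice” idea, the sketch below computes how many CPU cores a diagnostic tool may claim, given what the system is already using. The function name, the safety reserve, and the numbers are assumptions chosen for the example; the text above only states the principle that troubleshooting should use unused resources.

```python
def troubleshooting_budget(total_cores: float, cores_in_use: float,
                           reserve: float = 1.0) -> float:
    """Cores a diagnostic tool may use: idle capacity minus a safety reserve.

    The reserve is a hypothetical cushion so that a sudden spike in the
    production workload does not have to compete with the diagnostics.
    """
    idle = total_cores - cores_in_use
    return max(0.0, idle - reserve)


quiet_box = troubleshooting_budget(total_cores=16, cores_in_use=10.5)
busy_box = troubleshooting_budget(total_cores=16, cores_in_use=15.5)
```

On the busy box the budget is zero, so a well-behaved tool would wait or run at the lowest scheduling priority rather than take cycles from the workload.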

4. Adaptation: the anti-virus model. Think of bugs like viruses. Your job is to limit the number of customers who get affected by the same bug. A serviceability system that improves only when you upgrade the entire system software is quite useless. It is like anti-virus software that does not update its virus definition files every day, but only when you buy the next version of the software. You need to design the serviceability component of your system on the design principles of anti-virus software. What is monitored, how it is monitored, and what actions are taken should be constantly updated as the community learns about the system. At Nutanix, we designed such a component, called Aegis, that enables serviceability to constantly learn (with permission from the admin) how to monitor the system better. Careful design keeps it completely isolated from the I/O path. If a problem occurs that has already happened at another customer site, Aegis might already know how to fix it through pre-canned fix procedures, leading to truly zero service times. If the error is in the I/O path, Aegis will inform the customer with a very meaningful message and the course of action.
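
The anti-virus model can be sketched as a monitor whose definitions are data that can be replaced at runtime, independently of the system software. The sketch below is a hypothetical illustration of that principle and not the design of Aegis itself: the Definition and Monitor classes, the version rule, the "stale_lock" check, and the state dictionary are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Definition:
    """One updatable rule: what to check, what to say, and how to fix it."""
    version: int
    check: Callable[[dict], bool]
    message: str
    fix: Optional[Callable[[dict], None]] = None  # pre-canned fix, if known


class Monitor:
    """Hypothetical monitor that learns new definitions without an upgrade."""

    def __init__(self):
        self.definitions = {}

    def update(self, name: str, definition: Definition,
               admin_approved: bool) -> bool:
        # Definitions change only with permission from the admin.
        if not admin_approved:
            return False
        current = self.definitions.get(name)
        if current and current.version >= definition.version:
            return False  # never replace a definition with an older one
        self.definitions[name] = definition
        return True

    def scan(self, state: dict) -> list:
        actions = []
        for name, definition in sorted(self.definitions.items()):
            if not definition.check(state):
                continue
            if definition.fix:
                definition.fix(state)
                actions.append(f"{name}: fixed automatically")
            else:
                actions.append(f"{name}: {definition.message}")
        return actions


def has_stale_locks(state: dict) -> bool:
    return state["stale_locks"] > 0


def clear_locks(state: dict) -> None:
    state["stale_locks"] = 0


monitor = Monitor()
# Version 1 can only report the problem.
monitor.update("stale_lock",
               Definition(1, has_stale_locks, "Stale locks found"),
               admin_approved=True)
# Version 2 arrives later, after another site hit the bug, and carries a fix.
monitor.update("stale_lock",
               Definition(2, has_stale_locks, "Stale locks found", clear_locks),
               admin_approved=True)

state = {"stale_locks": 3}
actions = monitor.scan(state)
```

The point of the sketch is the update path: the second definition changes both what is done and how, with no upgrade of the system around it, much like a new virus definition file.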

Nutanix strives to adhere to each of the serviceability principles I’ve outlined, to become a leader in next-generation serviceability, one of the key concerns of datacenters both large and small.