It is 2:30am and the overnight shift for a corporate data center’s operations team is more than halfway complete. A wall of monitors displays a spectrum of graphs and surveillance feeds. The phone rings and it is an on-call engineer from the infrastructure team. She’s calling to confirm reports that some of the overnight batch jobs for the Business Intelligence app have failed. All of the end-of-month reports have failed to run and will be unavailable for tomorrow’s monthly accounting audit if the root cause is not identified soon. The DC operations staff doesn’t see any major alerts in their monitoring systems. A deeper investigation will be required.
IT infrastructure organizations invest considerable time and money evaluating and selecting products, collecting requirements from their stakeholders, designing robust infrastructure and solutions, and building operational workflows and run books. Despite the best-laid plans, incidents occur that can impact the services delivered by IT. If an issue cannot be diagnosed easily from system alerts, a variety of other techniques must be used. These can include examining log files and contacting one or more vendors for support. What if scenarios like the overnight batch failure described above could be diagnosed using automated, intelligent analysis of logs, alerts, and system telemetry?
About three years ago, Nutanix wanted to extend the consumer-grade experience it was delivering for data center infrastructure to troubleshooting and root cause analysis. The Nutanix Cluster Check tool, or NCC, is a modular service that can be packaged and installed on Nutanix CVMs running AOS. It runs a series of “pass/fail” tests that help diagnose problems and rule out potential problem areas. The modular design means that new tests can be written to provide robust troubleshooting support for new Acropolis and Prism features or new solutions running on Nutanix infrastructure. NCC quickly became the go-to tool for Nutanix support, helping to lower case resolution times and increase customer satisfaction.
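To make the idea of a modular set of pass/fail tests concrete, here is a minimal sketch in Python. It is purely illustrative and is not NCC's code or API: the registry, the check names (`disk_usage_check`, `time_sync_check`), the thresholds, and the shape of the `state` dictionary are all assumptions made for this example.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a check name to a function that
# returns True (pass) or False (fail). Not part of NCC.
CHECKS: Dict[str, Callable[[dict], bool]] = {}


def register(name: str):
    """Decorator that adds a check to the registry under `name`."""
    def decorator(fn: Callable[[dict], bool]):
        CHECKS[name] = fn
        return fn
    return decorator


@register("disk_usage_check")  # hypothetical check name and threshold
def disk_usage_check(state: dict) -> bool:
    return state["disk_used_pct"] < 90


@register("time_sync_check")  # hypothetical check name and threshold
def time_sync_check(state: dict) -> bool:
    return abs(state["clock_drift_s"]) < 5


def run_all(state: dict) -> Dict[str, str]:
    """Run every registered check and report PASS or FAIL for each."""
    return {name: ("PASS" if fn(state) else "FAIL") for name, fn in CHECKS.items()}


results = run_all({"disk_used_pct": 95, "clock_drift_s": 1})
for name, status in results.items():
    print(f"{name}: {status}")
```

In a design like this, supporting a new feature means writing one more decorated function; the runner itself does not change.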
NCC is also the engine that powers the Prism Cluster Health page and augments the operational insights of Prism Alerts. But some functions performed by NCC still needed a home in the Prism UI. Today at the .NEXT On-Tour event in Mexico City, we are happy to announce that we are planning to add, in the Asterix release, Prism integration for NCC! Once released, you can enjoy improved control of NCC within Prism. NCC CLI commands will no longer be required, because you will be able to run checks on demand, select one or more tests, and download test results from an easy-to-use graphical interface.
The proposed design adds actions in the top right corner of the Health page that will let users manage checks, run checks, or use the log collector tool. One notable option is the ability to rerun only the tests that previously failed or printed warnings. This can be helpful when a problem has been identified, corrective changes have been made, and the resolution needs to be verified.
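The "rerun what failed or warned" workflow can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the status strings (`PASS`, `WARN`, `FAIL`), the check names, and the `rerun` helper are hypothetical and do not represent how Prism or NCC implement the feature.

```python
from typing import Callable, Dict, List

# Hypothetical status values; only these two trigger a rerun.
RERUN_STATUSES = {"FAIL", "WARN"}


def select_for_rerun(previous: Dict[str, str]) -> List[str]:
    """Return the names of checks whose last result was FAIL or WARN."""
    return sorted(name for name, status in previous.items() if status in RERUN_STATUSES)


def rerun(previous: Dict[str, str],
          checks: Dict[str, Callable[[dict], str]],
          state: dict) -> Dict[str, str]:
    """Rerun only the failed or warned checks; keep earlier PASS results as-is."""
    updated = dict(previous)
    for name in select_for_rerun(previous):
        updated[name] = checks[name](state)
    return updated


# Hypothetical checks, each returning a status string.
checks = {
    "disk_usage_check": lambda s: "PASS" if s["disk_used_pct"] < 90 else "FAIL",
    "time_sync_check": lambda s: "PASS" if abs(s["clock_drift_s"]) < 5 else "WARN",
    "dns_check": lambda s: "PASS",
}

previous = {"disk_usage_check": "FAIL", "time_sync_check": "WARN", "dns_check": "PASS"}

# After a corrective change frees disk space, verify the fix.
after_fix = rerun(previous, checks, {"disk_used_pct": 40, "clock_drift_s": 9})
print(after_fix)
```

Here the disk check now passes, the time sync check still warns because the drift was not addressed, and the check that passed earlier is not run again.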
The Tasks view in Prism will also show NCC tasks alongside other completed tasks. Selecting an NCC task will show a summary for that task and a link to download the output. These improvements will give users more control of their environments and streamline working with Nutanix support on root cause analysis.
Taking the guesswork out of troubleshooting never felt more rewarding than at 2:30am, when the source of an application failure is quickly identified using the diagnostic powers of NCC and the one-click simplicity of Prism. A holistic focus on the product and support experience is one of the many ways Nutanix delivers the best Enterprise Cloud Platform built on your terms. Join us on the Next Community site and let us know what you think.