COVID-19 Crisis Reveals Virtues of Remote Data Center Management

Lights-out operations and automation provide timely benefits as IT leaders from Nutanix and elsewhere contemplate an AI-driven future.

By Stan Gibson May 22, 2020

Although the novel coronavirus has upended everyday life, IT operations at many organizations are proving remarkably resilient. The reason: data center automation and remote operations technologies are keeping things humming without direct human involvement.

“When we made the shift to working from home during the pandemic, we saw no interruption in our global data center services,” said Wendy M. Pfeiffer, CIO of Nutanix. She noted that Nutanix is benefiting from having implemented remote operations through software-defined networking (SDN) a year ago when the company moved its three main data centers from California to other states.

“Instead of having onsite network engineering personnel who were constantly configuring physical switches and routers, we now provision and take down networks remotely,” Pfeiffer explained.

By automating network management tasks on its SDN from Big Switch Networks (recently acquired by Arista), the company no longer needs to configure hundreds of network switches individually, according to Eric Pearce, IT systems architect at Nutanix.

“We are currently writing our own Python code that uses the Big Switch REST API,” said Pearce.
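
As a rough illustration of the kind of automation Pearce describes, the sketch below builds one REST request per switch from a single network definition instead of configuring each device by hand. The controller address, the `/api/v1/segments` path, the payload fields, and the function name are all hypothetical placeholders, not the actual Big Switch REST API, and the example only prepares requests without sending anything.

```python
import json
from urllib.request import Request

# Hypothetical controller address and endpoint path, for illustration only.
CONTROLLER = "https://sdn-controller.example.com"
SEGMENT_PATH = "/api/v1/segments"


def build_segment_requests(segment_name, vlan_id, switches, token):
    """Return one prepared POST request per switch for the same segment."""
    prepared = []
    for switch in switches:
        # Assumed payload shape; a real controller defines its own schema.
        payload = {"name": segment_name, "vlan": vlan_id, "switch": switch}
        prepared.append(
            Request(
                url=CONTROLLER + SEGMENT_PATH,
                data=json.dumps(payload).encode("utf-8"),
                headers={
                    "Content-Type": "application/json",
                    "Authorization": f"Bearer {token}",
                },
                method="POST",
            )
        )
    return prepared


if __name__ == "__main__":
    for req in build_segment_requests("lab-net", 120, ["leaf-01", "leaf-02"], "TOKEN"):
        # Printed rather than sent; urllib.request.urlopen(req) would send it.
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

The point of the design is that the list of switches becomes data: adding a hundred more devices changes the input, not the procedure.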

The Big Switch product also integrates with the Nutanix Prism virtual data center management platform, so Prism administrators can configure all their cluster networking from within the Prism GUI without having to involve the networking team. 

“In the past, a Prism administrator would have to request networking changes via a ticketing system and have to wait for the networking team to respond,” said Pearce. “This Big Switch and Prism integration brings self-service networking to the Nutanix administrator.”

Reducing Human Error

Although data center automation and remote operations are paying dividends during this unusual time, a recurrent problem they address is downtime caused by human error. Studies differ on how large a role human error plays, but the Uptime Institute estimates that 70% of all data center outages are caused by human error.

Figures aside, even a single human error can have significant consequences. In 2017, for example, a British Airways outage was traced to an engineer who disconnected and reconnected a power supply. The resulting electrical surge damaged IT equipment, leading to hundreds of cancelled flights and $112 million in customer refunds and compensation.

This event is an example of why remote management can sometimes work better than being there.

“Managing network infrastructure remotely is more effective than assigning a person to do the task on-site,” said Pearce. “If someone walks by a rack, it might look fine visually. But we can go in [remotely] and use the features of the Big Switch SDN to determine if everything is cabled correctly and to verify actual connectivity.”

Fixing physical cabling issues remains a manual task that must be done on-site.

“We have created tools that allow both the remote and on-site DC staff to immediately verify and audit their own work without having to rely on external teams,” Pearce said.
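
A minimal sketch of what such a self-service audit could look like, assuming the intended cabling plan and the links the controller actually sees are both available as simple mappings. The data layout, the function name, and the sample device names are hypothetical; the article does not describe how Nutanix's tools are built.

```python
def audit_cabling(planned, observed):
    """Compare planned links with observed ones.

    Both arguments map a (switch, port) tuple to the neighbor device
    expected on, or detected on, that port.
    """
    missing = {}   # planned link, but nothing detected on the port
    miswired = {}  # port is connected, but to the wrong neighbor
    for port, expected in planned.items():
        actual = observed.get(port)
        if actual is None:
            missing[port] = expected
        elif actual != expected:
            miswired[port] = (expected, actual)
    # Links that are live but appear nowhere in the plan.
    unexpected = {p: n for p, n in observed.items() if p not in planned}
    return {"missing": missing, "miswired": miswired, "unexpected": unexpected}


if __name__ == "__main__":
    plan = {("leaf-01", 1): "node-a", ("leaf-01", 2): "node-b", ("leaf-02", 1): "node-c"}
    seen = {("leaf-01", 1): "node-a", ("leaf-01", 2): "node-c", ("leaf-02", 7): "node-d"}
    for category, links in audit_cabling(plan, seen).items():
        print(category, links)
```

A rack that "looks fine visually" can still show up here as miswired, which is the gap between walking past the hardware and verifying actual connectivity.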

AI Becoming Table Stakes

Sophisticated lights-out management of remote data centers and networks is getting a boost from artificial intelligence (AI), which can save labor and increase uptime, according to Steve McDowell, senior analyst, storage and data center technologies at Moor Insights & Strategy.

“We’re seeing a trend across all the vendors to implement AI-based decision assistance,” McDowell said. Storage devices, for example, can send information that is correlated with AI algorithms to predict failures.
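
To make the idea of predicting failures from device telemetry concrete, here is a deliberately simple sketch. A plain least-squares trend line stands in for the AI models McDowell refers to, and the telemetry counter, the failure threshold, and the function name are assumptions made for illustration only.

```python
def days_until_threshold(readings, threshold):
    """Estimate days until a daily error counter reaches a threshold.

    readings: one counter value per day, oldest first.
    Returns the estimated number of days after the last reading, or
    None when there is too little data or no upward trend.
    """
    n = len(readings)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(readings) / n
    # Ordinary least-squares slope and intercept over day indexes 0..n-1.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(readings))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - (n - 1))


if __name__ == "__main__":
    # Hypothetical daily error counts reported by one storage device.
    print(days_until_threshold([0, 2, 4, 6, 8], threshold=20))
```

Production tools correlate many signals across many devices; the sketch only shows the basic step of turning a stream of readings into an early warning.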

In another case, McDowell explained, IT can model the behavior of thousands of virtual desktop infrastructure (VDI) nodes when considering capacity expansion. “AI-driven predictive analytics tools are almost table stakes now if you’re delivering infrastructure automation tools,” he added.

The trend toward more automation and remote operations will change the job descriptions of IT operations managers, McDowell said.

“IT is in the midst of a transition right now,” he said. “We’re up-leveling these jobs. It’s less about bits and bytes, and more about understanding workloads.”

The increased use of AI tools will help IT pros model application performance and enable them to make financial decisions as to where those jobs should run, whether on-premises or in the cloud, McDowell explained.

Lessons from the Pandemic

Although automated data centers are weathering the coronavirus storm, the pandemic is providing a stress test that will teach lessons for the future.

“Six months from now, IT is going to look back and ask how hard or easy it was and where the pain was,” McDowell said. “We’re going through a great experiment that will reveal gaps in automation. You’ll see a recognition of what works and what doesn’t that will push IT more toward software-defined infrastructure. [Software-defined] tools that are cloud-aware are going to rise to the top because you need to provision [resources] on-prem and in the cloud.”

According to Pfeiffer, Nutanix is well on its way down that path.

“We already operate in a hybrid cloud mode. We have significant operations in public cloud infrastructure provided by Amazon and Google,” she said.

That hybrid model came in handy when the coronavirus caused a sudden spike in the number of employees working remotely who needed to utilize the company’s Citrix VDI implementation.

“We run a significant Citrix farm in one of our data centers that is provisioned to support about 2,500 remote sessions,” Pfeiffer said. “However, as that capacity was stretched to the breaking point as more of our engineers began to work from home, instead of procuring and provisioning additional servers in our data centers, we provisioned additional VDI capacity in AWS to scale out our Nutanix Frame VDI farm.”

“Today, we are running thousands of VDI sessions in both [Amazon Web Services] utilizing Nutanix Frame and in our data centers utilizing Citrix running certified on Nutanix [Acropolis Hypervisor]. This mixed mode is possible because both environments share the underlying foundation of our hybrid OS [Acropolis Hypervisor plus Acropolis operating system], and no physical hands-on provisioning was needed,” said Pfeiffer.
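
The scale-out decision Pfeiffer describes can be reduced to simple arithmetic, sketched below. Only the figure of about 2,500 on-premises sessions comes from the article; the demand number, the sessions-per-node value, and the function name are hypothetical.

```python
import math


def plan_burst(demand, on_prem_capacity, sessions_per_cloud_node):
    """Split VDI demand between the on-prem farm and cloud overflow."""
    overflow = max(0, demand - on_prem_capacity)
    return {
        "on_prem_sessions": min(demand, on_prem_capacity),
        "cloud_sessions": overflow,
        # Round up: a partly used node still has to be provisioned.
        "cloud_nodes": math.ceil(overflow / sessions_per_cloud_node),
    }


if __name__ == "__main__":
    # 2,500 is from the article; 3,200 and 50 are made-up inputs.
    print(plan_burst(demand=3200, on_prem_capacity=2500, sessions_per_cloud_node=50))
```

Because the overflow is provisioned in software, the same calculation can be rerun as demand changes, with no hardware to procure.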

Looking beyond the pandemic, it doesn’t take a crystal ball to see that more intelligent and more powerful remote data center operations management lies ahead.

“It doesn’t really matter that you can’t get into the data center anymore, the management processes are the same regardless of whether the data center is in your building or a thousand miles away,” said Pearce.

Going forward, he said, IT teams should set things up that way from the get-go.

Stan Gibson is a contributing writer with 36 years of experience as a technology journalist.

© 2020 Nutanix, Inc. All rights reserved.