Data Center Risk Management: A Comprehensive and Effective Plan

Companies with data centers need to prepare for multiple natural and man-made risks while maintaining compliance.

By Dipti Parmar April 15, 2021

“Men may come and men may go, but I go on forever,” declared The Brook in a poem of the same name by Alfred, Lord Tennyson 135 years ago. That could well be the cry of data, which continued to flow through the 200-odd data centers in Texas even as Winter Storm Uri wreaked havoc, taking down the state’s power grid with it.

While things didn’t go smoothly, most data centers came out unscathed and managed to stay up through the storm. Some were even able to provide outside assistance.

Smart CIOs treat data centers as capital assets, with their own budgeting, management objectives and periodic upgrade needs. With the exponential growth of cloud computing, mobile applications, IoT, end-user computing (EUC), and remote work in recent years, data centers have exploded in complexity. But IT leaders still must manage external and internal risks to avoid downtime, which can result in losing millions of dollars a day.

Protecting and maintaining the IT ecosystem requires strategic, long-term Data Center Infrastructure Management (DCIM) planning that mitigates risks in multiple areas. Here are some concrete steps organizations can take in this direction.

Who needs data center risk management?

The question is largely rhetorical. Since data centers are, at their core, physical facilities that house business-critical data and applications, the risks they face are immense, regardless of whether they’re built and run within the enterprise, managed by an MSP, or hosted off-site by a cloud service provider.

For organizations that need to comply with legal, contractual, or regulatory requirements, periodic data center risk assessments and disaster testing are inevitable. Not having a risk management plan in place can lead to the whole data center going down because of a single point of failure anywhere in the architecture, leading to significant disruptions to operations and consequent losses in revenue.

Go for organization-wide, integrated risk management

Data centers, as we know them today, came into existence 25-odd years ago in an attempt to handle the monolithic workloads of the time. In 2021, colocation and private cloud hosting services have added convenience, as well as reduced costs for businesses large and small, but the question of complexity remains subjective.

With the rise of EUC, BYOD, and remote work practices, and the explosion of cloud apps, organizations need to take a fresh, holistic look at the risks they face, including natural disasters, in-facility risks, data risks, and supplier or vendor-specific risks.

This means any single risk factor might not apply just to the data center but impact the entire organization. Any data center risk management plan should draw connections between external, local, and organization-wide risks and prepare for each of them or multiple events happening simultaneously.

Identifying and mitigating all-pervasive risks involves a process called integrated risk management (IRM). Gartner defines IRM as “a set of practices and processes supported by a risk-aware culture and enabling technologies that improve decision making and performance through an integrated view of how well an organization manages its unique set of risks.”

So, organizations need the right tools and processes to monitor each moving part of the data center and deal with any risks that come up at any point in time, including malicious cyberattacks. Big data and analytics are instrumental in forming an accurate and comprehensive assessment of the risks to various operations that the data center enables, such as data access, application mobility, and DevOps. They also enable the implementation and execution of dynamic disaster recovery plans.

However, people — not processes — play a central role in creating these plans.

“There are specialists such as IT admins who are responsible for day-to-day IT operations to ensure uptime,” said Tuhina Goel, senior product marketing manager of business continuity and disaster recovery at Nutanix.

“But decision makers such as the CIO, VP or Director of IT are ultimately responsible for data center risk management. They own the budget and other resources to invest in right security measures, tooling and employee training.”

In an article on Data Center Knowledge, Kevin Read, GIO UK senior delivery center manager at IT consulting company Capgemini, reveals how he has developed a risk management approach that is designed to identify risks, their probabilities, the potential business impact, and estimated mitigation costs. His model changes over time.

“At Capgemini, we have put in place a monthly risk management system that logs all risks and issues with containment and action plans,” said Read. “An investment budget is made available if changes are required.”
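The four inputs Read mentions (the risk, its probability, the potential business impact, and the estimated mitigation cost) lend themselves to a simple scoring exercise. The following is a minimal sketch, not Capgemini's actual system: the Risk class, its field names, and every figure are hypothetical, and it assumes for simplicity that a mitigation removes the risk entirely.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    annual_probability: float  # chance of the incident in a given year (0 to 1)
    impact_cost: float         # estimated business loss if it happens
    mitigation_cost: float     # estimated yearly cost of the mitigation

    @property
    def expected_annual_loss(self) -> float:
        # Classic expected-value estimate: probability times impact.
        return self.annual_probability * self.impact_cost

    @property
    def mitigation_pays_off(self) -> bool:
        # Simplifying assumption: the mitigation removes the risk entirely.
        return self.mitigation_cost < self.expected_annual_loss


def prioritize(risks):
    """Return the risks sorted by expected annual loss, highest first."""
    return sorted(risks, key=lambda r: r.expected_annual_loss, reverse=True)


# Illustrative figures only; none of them come from the article.
risk_log = [
    Risk("Utility power outage", 0.20, 1_000_000, 150_000),
    Risk("Cooling failure", 0.10, 400_000, 60_000),
    Risk("Ransomware attack", 0.05, 3_000_000, 100_000),
]

for risk in prioritize(risk_log):
    print(
        f"{risk.name}: expected annual loss {risk.expected_annual_loss:,.0f}, "
        f"mitigation pays off: {risk.mitigation_pays_off}"
    )
```

Re-running a script like this every month, with updated probabilities and costs, is one way a model can "change over time" as the article describes.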

Assessment comes before management

Any risk management plan needs to be in place before a disaster (okay, an “incident”) occurs. Risk assessment and auditing is the first step here. This begins with an evaluation of your existing owned and operated facilities from the point of view of facility design, IT architecture and topology, as well as operational sustainability.

Further, if there have been any outages in the past, a post-mortem and root cause analysis are needed to identify and address the inadequacies specific to the parts of the ecosystem that were affected.

Finally, if the organization has a hybrid infrastructure with multiple data centers in place and there are plans for data center expansion or consolidation, each asset needs to be individually assessed for resiliency.

It helps to create a chart or sheet for handy reference that lists the major risk categories, mentions all the crucial systems each category affects, estimates the damage and recovery costs, and makes it clear what to do in case of an incident.
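A reference sheet of this kind can start life as a small structured table. The sketch below is only one possible layout: the column names are assumptions, the categories echo the ones discussed in this article, and the systems, costs, and actions are placeholders to be replaced with an organization's own assessment.

```python
import csv
import io

# Hypothetical rows. The systems, cost estimates, and actions are
# placeholders, not real figures or recommended procedures.
RISK_SHEET = [
    {
        "category": "Power outage",
        "systems_affected": "Racks; cooling",
        "est_damage_cost": 500_000,
        "est_recovery_cost": 80_000,
        "first_action": "Confirm UPS load and start backup generators",
    },
    {
        "category": "Fire",
        "systems_affected": "Affected hall; adjacent racks",
        "est_damage_cost": 900_000,
        "est_recovery_cost": 250_000,
        "first_action": "Evacuate and trigger suppression",
    },
    {
        "category": "Security breach",
        "systems_affected": "Customer data stores; access control",
        "est_damage_cost": 2_000_000,
        "est_recovery_cost": 300_000,
        "first_action": "Isolate affected systems and notify the security lead",
    },
]


def to_csv(rows):
    """Serialize the sheet as CSV text so it can be shared or printed."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def lookup(rows, category):
    """Return the row for a risk category (case-insensitive), or None."""
    wanted = category.strip().lower()
    for row in rows:
        if row["category"].lower() == wanted:
            return row
    return None


print(to_csv(RISK_SHEET))
print(lookup(RISK_SHEET, "fire")["first_action"])
```

Keeping the sheet as data rather than as a formatted document makes it easy to export to a spreadsheet and to query during an incident.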

Types of data center risks

It isn’t easy to categorize or even list out all the kinds of risks that a data center faces. Consequently, CTOs and IT teams have many uncertainties to worry about.

“While there are multiple reasons for IT downtime, power outages and human error remain the top and most frequent for downtime. However, in the last one year with IT practitioners working remotely, there’s been a noticeable rise in cybersecurity attacks causing unplanned downtime,” said Goel, when asked to list out the most significant problems that cause downtime in data centers.

Here are some more of these hazards and tips on how to mitigate them.

Geographic threats: Topographical and climate risks should be evaluated at the time of choosing a data center location and then again during the facility planning phase. If areas at higher risk of natural disasters such as earthquakes, hurricanes, floods, and bushfires can’t be avoided, consider the use of stronger construction materials in the buildings to offset the risk.

Luckily, many natural disasters can be forecast and, therefore, prepared for. Further, data centers built in cooler climates have natural, renewable options for energy savings and cooling, which is why Nordic countries are a popular destination for building data centers.

In addition to natural hazards, data center managers should also consider man-made dangers. Make sure airports, power grids, chemical plants, military bases, and water bodies are a safe distance away. On the other hand, it helps if there is a fire station, hospital, and police station nearby.

Power outage: Power disruption can pose an existential threat to a mission-critical data center. Organizations need to make sure there is enough resilience built in, with UPS-backed power routes to each rack and cooling system. Having dual power sources with a direct connection to a multi-substation power grid for the site is the minimum protection against local substation power failure. On top of that, backup generators can be kept on standby as a last resort.
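The benefit of dual feeds and standby generators can be estimated with the standard formula for redundant components in parallel: the combined supply is unavailable only when every source is unavailable at once. The sketch below assumes the sources fail independently, which real sites only approximate (a regional grid failure can hit both feeds), and the availability figures are illustrative rather than measured.

```python
import math

HOURS_PER_YEAR = 8760


def parallel_availability(availabilities):
    """Availability of redundant sources where any single one is sufficient.

    Assumes independent failures, so the combined unavailability is the
    product of the individual unavailabilities.
    """
    unavailable = math.prod(1.0 - a for a in availabilities)
    return 1.0 - unavailable


def downtime_hours_per_year(availability):
    """Convert an availability fraction into expected hours of outage a year."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Illustrative availability figures, not vendor or utility data.
scenarios = {
    "single utility feed": [0.999],
    "dual utility feeds": [0.999, 0.999],
    "dual feeds plus generator": [0.999, 0.999, 0.99],
}

for label, sources in scenarios.items():
    combined = parallel_availability(sources)
    print(f"{label}: {downtime_hours_per_year(combined):.5f} hours/year")
```

Even under these optimistic assumptions, the calculation shows why each extra independent power path cuts expected downtime by orders of magnitude rather than by a fixed amount.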

Water seepage: Water is a double-edged sword for data centers. Even a few drops on critical hardware can cause irreparable and permanent damage. At the same time, water supply and storage for cooling and fire control systems need to be maintained at optimal levels.

Acoustics: Exposure to high-decibel sounds for prolonged periods of time is one of the most overlooked risks when building data centers. Hard drives and storage systems are particularly susceptible to loud sounds – high-frequency sound vibrations can significantly lower read and write performance, possibly compromising data quality and integrity.

It follows that the data center should be located far away from airports, arenas, and the like. Acoustic suppression technologies play a critical role in reducing equipment exposure to sonic shockwaves from high-decibel noise sources such as security and fire alarms or other apparatus on and around the premises.

Fire: Fires in data centers are mostly caused by power surges in the electrical equipment. One fire could destroy thousands of dollars’ worth of devices if not detected and put out immediately. In the early stages of a fire, the amount of smoke is so low that conventional smoke detectors can’t pick it up. Further, air conditioning and circulating systems disperse it quickly. The solution is Aspirating Smoke Detectors (ASD), which detect smoke at a very early stage and alert staff as soon as minimum thresholds are crossed.

Security: Security failures in a data center could include anything from a network breach to sabotage and damage caused by individuals present at the site. One of the biggest threats is cyberattacks that result in leakage of account data or personally identifiable information (PII) belonging to customers.

Certain application or system failures may result in security personnel being unable to verify card holders’ identity or authorize them to go to certain areas. Video cameras and doors with access control might lose their connection to the central system too.

Breaches and threats caused by ransomware can only be dealt with using a multilayered approach to data protection, which has three aspects: prevention, detection, and recovery. Specific defense mechanisms include educating end users, regular vulnerability scanning, role-based access control, and regular data backups (the proverbial last line of defense).

System failure: This is where the greatest number of things can go wrong, and where they go wrong most often. It is important to identify and fix every single point of failure in the entire IT infrastructure that could affect the data center.

This starts with a resilient network architecture and connectivity. Redundant fiber optic connectivity is the gold standard for data centers. Then come servers with multiple tenants or multiple applications running on them. Clustering, mirroring, and duplication help in ensuring continuous access and delivery and minimize the possibilities of downtime.
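One way to hunt for single points of failure is to model the infrastructure as a dependency graph and test whether the service survives the loss of each component in turn. The sketch below is a simplified illustration: the topology and component names are hypothetical, and it only checks whether a path still exists from an entry point to a target, ignoring capacity and failover time.

```python
from collections import deque


def reachable(graph, source, target, removed=None):
    """Breadth-first search from source to target, skipping one removed node."""
    if removed in (source, target):
        return False
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbor in graph.get(node, ()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def single_points_of_failure(graph, source, target):
    """List components whose loss alone cuts every path from source to target."""
    nodes = set(graph) | {n for neighbors in graph.values() for n in neighbors}
    return sorted(
        node
        for node in nodes - {source, target}
        if not reachable(graph, source, target, removed=node)
    )


# Hypothetical topology: two carriers and two servers, but only one core switch.
topology = {
    "internet": ["carrier_a", "carrier_b"],
    "carrier_a": ["core_switch"],
    "carrier_b": ["core_switch"],
    "core_switch": ["server_1", "server_2"],
    "server_1": ["storage"],
    "server_2": ["storage"],
    "storage": [],
}

print(single_points_of_failure(topology, "internet", "storage"))
```

In this made-up layout the carriers and servers are duplicated, so the check flags only the lone core switch, which is the kind of gap that redundant connectivity and clustering are meant to close.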

Modern HCI-powered data centers now pack everything together and deliver IT infrastructure as a resilient, secure, and self-healing platform.

Backing up data and files is a routine procedure for most organizations, but immediate recovery of real-time or transactional data in the event of downtime should be a priority for data centers. This is done in different ways in different companies according to the regulatory standards applicable to their industry. Again, by consolidating multiple backup solutions into a single turnkey platform such as Nutanix Mine, organizations can simplify data lifecycle management and get complete visibility and control over their data.

Another risk is when software applications go rogue on the data center and take down systems and servers with them. IT needs to make sure that these applications can run seamlessly over the entire infrastructure without causing any glitches in servers located in the data center or any other environment.

Poor Disaster Recovery planning: Identifying and minimizing any and all risks isn’t the end of the story. Any risk management plan worth its salt should know exactly what to do when (not if) disaster strikes and include a step-by-step recovery plan for every imaginable undesirable event. This starts with having systems in place that monitor key environmental factors and alert the concerned people when certain thresholds are crossed.

Failing this, the situation might quickly get out of hand and losses will escalate in the event of a sudden disaster.
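The monitoring-and-alerting idea described above reduces to comparing each reading against an allowed range. This is a minimal sketch under stated assumptions: the metric names and limits are placeholders (real limits come from equipment specifications and site policy), and a production system would route the alerts to on-call staff rather than print them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    low: float
    high: float


# Placeholder limits for illustration only.
THRESHOLDS = {
    "temperature_c": Threshold(18.0, 27.0),
    "humidity_pct": Threshold(20.0, 80.0),
    "ups_charge_pct": Threshold(50.0, 100.0),
}


def check_readings(readings, thresholds=THRESHOLDS):
    """Return one alert message for every reading outside its allowed range."""
    alerts = []
    for metric, value in readings.items():
        limit = thresholds.get(metric)
        if limit is None:
            continue  # No threshold defined for this metric; nothing to check.
        if value < limit.low:
            alerts.append(f"{metric} low: {value} is below {limit.low}")
        elif value > limit.high:
            alerts.append(f"{metric} high: {value} is above {limit.high}")
    return alerts


# Hypothetical sensor snapshot.
snapshot = {"temperature_c": 31.5, "humidity_pct": 45.0, "ups_charge_pct": 40.0}

for alert in check_readings(snapshot):
    print("ALERT:", alert)
```

Keeping the thresholds in one table makes it straightforward to review them alongside the recovery plan and to adjust them as equipment changes.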

And yet, organizations are caught on the wrong foot with surprising frequency when disaster strikes. “Most of the causes of disasters are not only underrated by organizations but also under budgeted for which leads to being under prepared in the event of unplanned downtime,” laments Goel.

Platforms that are flexible and automated are critical for non-disruptive recovery in the event of a disaster. Nutanix Xi Leap is a DR orchestration solution that is simple to deploy and manage, as well as adaptable to on-premises or cloud sites. It eliminates data silos and facilitates replication and recovery from a single user interface.

Balancing the Ecosystem with Data Center Risk Management

A data center has a thousand moving parts. It is itself a cog in the organizational wheel, so to speak. One small misalignment can upset the equilibrium of the whole organization, across departments.

Risk mitigation, therefore, is a shared responsibility. Each employee or stakeholder can help keep the facility operating at its optimal level, either by following the rules or by enforcing them, and by learning how to do both better. IT leaders should know exactly what it costs to keep everyone trained and where that money goes, and should make sure people have access to the resources they need for any task that involves the data center. The responsibility falls on the CTO or CIO to set expectations and give clarity on these operations.

Of course, neither the data center nor the wider IT infrastructure functions in isolation. Spending money on data center risk management may not be a top priority for all managers; most departmental objectives pale in comparison to meeting revenue targets.

“Conflicting goals can be hard to address, but one of the most effective methods of doing so is to have a highly efficient process for continuously identifying where a risk resides. You also need a predictable, reliable method of updating systems without impacting the overarching business goals of the organization,” said Gavin Millard, VP of Product Marketing at Tenable.

As with everything else in IT, people are as important as technology in data center management. Standardized processes and methodologies such as DevOps can help streamline workflows and align all components of data center facility management with broader business objectives.

Dipti Parmar is a marketing consultant and contributing writer to Nutanix. She writes columns on major tech and business publications such as IDG’s CIO.com, Adobe’s CMO.com, Entrepreneur Mag, and Inc. Follow her on Twitter @dipTparmar and connect with her on LinkedIn.
