Of Mice and Men and Mother Nature: Preparing for Disaster
Most people prefer not to think about disasters–which is why, despite all of the available information, most people are grossly underprepared when disasters do occur. The same holds true for enterprises. No one disputes that a datacenter outage–or, even worse, the total loss of a datacenter–can be catastrophic in terms of revenue, productivity, and reputation. And yet only a fraction of enterprises have a comprehensive, well-tested disaster recovery (DR) strategy comprising tools, processes, and people. We recently conducted an informal survey of our customers to learn about their experiences with datacenter disasters and DR. We wanted to share a few of their stories to prompt you to think about your own DR strategy–or lack thereof.
It may surprise you that several respondents admitted they had no DR strategy at all–unless you count responses such as “hope and prayers” and “buying all the milk, bread, toilet paper, and water” as strategies. Yet disasters come in many forms, from the spectacular and tragic (hurricanes, tornadoes, fires, and earthquakes) to the predictable (failing devices and human error) to the comically trivial.
One member of the Nutanix community recalled a full weekend outage at a hospital thanks to a rodent taking down an electrical grid by chewing through lines. “UPS [uninterruptible power supply] kicked in, but the generator did not.” It took 30 hours for the storage vendor to come on site, which meant that “the hospital lost a few million dollars from cancelling every single non-urgent appointment the Monday following. The Board never approved a DR site. ROI would have easily been 400-500% in that single day of outage.”
Wayne Conrad, a Nutanix Consulting Architect, noted the varying fortunes of enterprises in Hurricane Sandy in 2012: “Goldman Sachs HQ was lit up like a Christmas tree, while all those hospitals had gone dark. Why? Goldman Sachs realized that DR and disaster prep are like buying insurance, and the hospitals were cutting and scraping by on smaller budgets. If you’re facing a nasty credit card bill, who says, ‘eh, I’ll just skip paying the car and house insurance.’ IT leadership does this all the time with DR sites.”
Some users made sensible, good-faith efforts, but ran into problems nonetheless: “In my last job we placed the core of our data center in a building designed to withstand an F5 tornado. Safest place in the city. It turned out that heat in the summer was our highest threat to our servers because they started turning the A/C off to that part of the building in the afternoons to save money.”
And of course there were many close calls. One community member said when a potential disaster was imminent, such as a tornado or hurricane, he would back up his company’s data onto two external hard drives. “We were never directly hit with any of the storms, but it’s a little nerve-racking knowing that you have the whole server room stored in your backpack while being evacuated for a storm.”
"We were never directly hit with any of the storms, but it’s a little nerve-racking knowing that you have the whole server room stored in your backpack while being evacuated for a storm."
Nutanix user Tre Bell observed that there is a “common misconception that a successful backup strategy equates to a successful DR strategy.” He admonished that “the restoration of systems in a different location or environment is not always cut and dry – restoring an environment to be fully functional often requires reconfiguration beyond simple backup and restore. Let’s say you have 50 systems in an environment that is a complete loss due to a disaster – most, if not all, of these systems have integrations between one another that require reconfiguration once you are able to successfully restore them to the new backup location or environment.” Bell says that “successfully restoring systems is only the first part of a successful DR strategy – DR testing is also crucial; you don’t know what you don’t know until you perform DR testing to verify 100 percent functionality.”
Bell’s observations were borne out by another respondent’s experience. He said they had two fully functional sites–one for HQ and another for DR–but when they finally got around to testing failover, “we were unable to switch back to the main site within the down time and we had to run on the DR site for months until we got the down time again.”
Bell reminds us that it’s vital to conduct a thorough Business Impact Analysis (BIA) and identify target recovery point objectives (RPOs) and recovery time objectives (RTOs). Once you’ve done this, perform DR testing to confirm that you can restore systems to a functional state and that you can meet those RPOs and RTOs.
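As a rough illustration of what “meeting your RPOs and RTOs” means in practice (the targets and drill measurements below are hypothetical, not Nutanix tooling), a DR test result can be scored against its objectives like this:

```python
from datetime import timedelta

def meets_dr_targets(data_loss: timedelta, restore_time: timedelta,
                     rpo: timedelta, rto: timedelta) -> dict:
    """Compare measured DR-test results against target objectives.

    RPO (recovery point objective): maximum tolerable data loss,
    i.e. the age of the most recent usable recovery point.
    RTO (recovery time objective): maximum tolerable time to restore
    systems to a functional state.
    """
    return {
        "rpo_met": data_loss <= rpo,
        "rto_met": restore_time <= rto,
    }

# Hypothetical drill results: the last replicated recovery point was
# 10 minutes old, and full restoration of services took 3.5 hours.
result = meets_dr_targets(
    data_loss=timedelta(minutes=10),
    restore_time=timedelta(hours=3, minutes=30),
    rpo=timedelta(minutes=15),  # target: lose no more than 15 min of data
    rto=timedelta(hours=4),     # target: back online within 4 hours
)
print(result)  # both targets met in this example
```

The point of the exercise is the comparison itself: until a drill produces real numbers for data loss and restore time, the targets from the BIA are untested assumptions.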
Some community members shared successes as well. Doan Nguyen recalled severe high winds and heavy rain causing “eight utility poles to fall outside of our building. The power went out, the road was blocked by hot wires and transformers, and everyone who made it into work that morning were trapped in the building. Initially, battery and generator backup provided phone and Internet capability. And by utilizing resources at several other locations, the company was able to continue to function until we got the all-clear to evacuate. That’s when DR efforts began in full. We executed on our own DR plan, and by 3 PM were operating completely remotely, with some of our employees at our Business Resumption Center and others working from home. Customer service calls, billing, email, phones–everything we needed to keep functioning was operational. Lessons learned: Conducting DR drills and testing our DR plan quarterly was and is fundamental. Even little disasters can have a huge impact. You need to be as prepared for a mundane disruption as for a catastrophic one.”
Given the enormous benefits of proper disaster preparation, why don’t more people take steps to properly protect themselves, or their enterprises? In The Ostrich Paradox: Why We Underprepare for Disasters, Robert Meyer and Howard Kunreuther point to several widely shared cognitive biases:
- Short memories when thinking about the painful lessons of the past
- Short horizons when thinking about the future, especially when weighing immediate costs against the potential benefits of protective actions
- Unwarranted optimism–it won’t happen to me!
- Oversimplification of cost-benefit analyses when considering risk
- A tendency to follow the actions of others–that is, herding
- A tendency to default to the status quo when faced with complexity and uncertainty
The good news is that there are now offerings that mitigate some of the biases keeping us from tackling DR by eliminating the complexity and uncertainty associated with traditional DR solutions. Disaster Recovery as a Service (DRaaS) solutions such as Xi Leap provide recovery automation and on-demand, non-disruptive testing to help ensure business continuity.
Xi Leap is part of the Nutanix Enterprise Cloud OS, which means that IT doesn’t need to master another management console or worry about reconfiguring network and security settings during DR failover.
Meyer and Kunreuther point out that humans actually have something to learn from ostriches when preparing for disaster–not by sticking our heads in the sand, but by adapting to circumstances in order to survive. The ostrich compensates for the vulnerability of being flightless with speed and agility. Rather than avoiding proper preparation for disaster by doing nothing (hope and prayers), or defaulting to the status quo (complex DR systems), enterprises may want to consider embracing a simpler, faster, and more agile alternative.