What Does Your DR Look Like? Or "Holy #$*%! Everything is down!"

IT Management

Mar 2

This is a topic near and dear to me these days. Having suffered a recent outage at my job with over 9 hours of downtime, this is now a major issue for me to work through. Everyone always gives disaster recovery (DR) lip service. They come up with ways to back up data, provide alternative networking access as they can afford, and try to create plans. My feeling, much like what I have dealt with at prior positions, is that no one really invests in DR. I hope to provide a few cautionary tales to help you convince your management to make the investment.

Disaster recovery is insurance. All the investment that is made in DR is insurance against downtime. At the same time, everyone keeps saying "it will never happen to me." I can tell you from experience that it does happen, and the outcomes can be brutal for a business. Downtime can lead to lost business opportunities, a change in customer perception that reduces their business with you, the loss of customers entirely, or the complete collapse and closure of the business. Businesses invest in disaster recovery to mitigate the impact of downtime and offset these outcomes.

When looking at DR, the first thing people need to determine is which systems are critical: which systems would impact the business most if you lost them? For a manufacturing company, it could be the control systems for their machinery. For a datacenter, it could be the power and networking systems that keep the hosted systems online. For a healthcare company, it could be all the systems involved with patient care. The IT team needs to sit with the business and management teams to determine which systems are critical, along with all of the infrastructure that supports them.

Now that the critical systems are identified and their infrastructure is determined, a full risk assessment of those systems and infrastructure needs to be completed. Are there devices that are single points of failure? Can servers be connected to the network over diverse paths, also known as teaming? Can the software be set up with clustering technologies so that more than one server can be set up and kept in sync? Which equipment is the oldest and has a higher possibility of failure? Working through the risk assessment with knowledgeable team members in both the IT and business teams will help find the answers quickly.
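As a rough illustration of that risk assessment, here is a minimal Python sketch that walks an inventory and flags single points of failure and aging equipment. The component names, the instance counts, and the five-year age threshold are all made-up assumptions for the example, not data from my environment.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    instances: int     # how many redundant copies exist (1 means no redundancy)
    age_years: float   # age of the oldest unit


def assess(components, max_age_years=5.0):
    """Return (component name, finding) pairs for a simple risk review."""
    findings = []
    for c in components:
        if c.instances < 2:
            findings.append((c.name, "single point of failure"))
        if c.age_years > max_age_years:
            findings.append((c.name, "aging hardware"))
    return findings


# Hypothetical inventory, for illustration only.
inventory = [
    Component("core-switch", instances=1, age_years=7.0),
    Component("database-server", instances=2, age_years=3.0),
    Component("internet-uplink", instances=1, age_years=1.0),
]

for name, finding in assess(inventory):
    print(f"{name}: {finding}")
```

A spreadsheet does the same job. The point is that every critical component gets asked the same questions.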

Now that the risks are identified, the professionals need to step in and make some plans to mitigate those risks. That planning can include duplicate systems, cluster creation, backup and recovery techniques, additional networking equipment and lines, and warm/cold spare hardware, to name a few. Each of these plans needs to be fully thought out, including the costs of creation and ongoing maintenance.

Part of the maintenance of backup systems is using them, a largely overlooked step of DR planning. Both business and IT teams need to role-play disasters to ensure these policies, procedures, and systems will work. These sorts of tests interrupt normal business operations but should be done on a regular basis to ensure all systems are go for a real disaster. After each test, the affected teams should get together and review the test event to improve policies, procedures, or systems in the future.

I know that what I have said so far is something that everyone else has said to their management to push for better DR planning and testing. I have said it myself at times. Going through a large outage that affected my company's business has brought it to the forefront for me and gotten the attention of my company, which runs 24x7. We lost our primary datacenter, the hosting location for our primary servers and the hub of our network, for approximately 9 hours on a Thursday night, which is one of our busiest times of the week. While we had some basic processes and procedures in place, it was thanks to the hard-working teams at my company that we made it through the outage.

During the outage, the primary datacenter lost its primary power at the Automatic Transfer Switch (ATS), the device that lets them select either the utility company or their generators as the power source. Not only did they lose power there, the ATS literally blew up, blowing out part of the wall behind it. In trying to get the datacenter power back online, they also found that a fuse in the transformer was bad, possibly causing the whole problem. The utility crew did not have a spare fuse on hand, and going to their warehouse to get one and returning would have taken up to 2.5 hours. The faster option was to fail the datacenter's second power source over from utility to generator so that the utility crew could pull a fuse from that second transformer.

While this seemed like a simple fix, it would have impacted a part of the datacenter that was still operational and hosting one of their biggest customers. That customer did not want any more change introduced into their hosting systems. As a customer impacted by the continued outage, I pushed the datacenter to start the change with haste. This put the datacenter in the middle between customers.

Eventually, this was resolved and the generator was added to the second circuit, allowing the utility to repair the primary circuit. This is where good process and planning helped my team, because we knew which systems had to be started first and in what order to restart our business effectively. Once we got our systems up, the business teams started cleaning up their issues from the outage.
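Knowing the restart order ahead of time is something you can write down and even check automatically. Here is a minimal sketch, using Python's standard `graphlib` module (Python 3.9 or later), that derives a startup order from a dependency map. The system names and dependencies are hypothetical examples, not our actual runbook.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each system lists what must be running before it.
dependencies = {
    "network": set(),
    "storage": {"network"},
    "database": {"storage"},
    "app_servers": {"database"},
    "web_frontend": {"app_servers"},
}


def startup_order(deps):
    """Return the systems in an order where every dependency starts first."""
    return list(TopologicalSorter(deps).static_order())


for step, system in enumerate(startup_order(dependencies), start=1):
    print(f"{step}. start {system}")
```

If someone accidentally documents a circular dependency, `TopologicalSorter` raises a `CycleError`, which is a useful thing to find out before an outage rather than during one.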

After the outage, an emphasis was placed on all parts of my company to determine ways to improve our business resilience to outages. This includes alternative network connectivity for outages, secondary datacenters, hardened systems, and improved policies and procedures to reduce the impact on our customers if we have another outage.

I will admit that I wrote this blog entry a while ago but could not finish it off until now. It was difficult to read what I wrote because it would make me go back and remember all that happened; reading my blog entry brought back all of those memories and feelings as if they were happening again. Major service interruptions are difficult for any group. What made this worse for me was that there was nothing I could do but wait for our hosting provider to fix their facility and services. Since this occurred, they also have taken some steps to improve their offering to ensure clients like my company do not suffer through something like this again. Improvement can happen for you directly or for your providers and partners.

The key takeaway is that outages will occur. The better your systems and networks are designed, and the more time is invested in both business and IT policies and procedures, the more the impact of downtime can be reduced and the happier customers can be kept during those outages. The best outcome that IT and business teams can hope for is no impact on their customers at all while systems are offline or unavailable. No single system can stay 100% available forever, but well-designed systems and networks can offer the "Five 9's of availability" (99.999%), which works out to just over 5 minutes of downtime per year.
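To show where that "just over 5 minutes" figure comes from, here is a small Python sketch of the arithmetic. It assumes an average year of 365.25 days, and the function names are just mine for illustration.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes in an average year


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted at a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def achieved_availability_percent(downtime_minutes):
    """Availability for a year that contained the given minutes of downtime."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


for level in (99.9, 99.99, 99.999):
    print(f"{level}% allows {allowed_downtime_minutes(level):.2f} minutes per year")

# A single 9-hour outage, like the one described in this post.
print(f"9 hours down: {achieved_availability_percent(9 * 60):.3f}% for the year")
```

By this math, one 9-hour outage on its own leaves a year at roughly 99.9% availability, which is two full nines short of the five-nines target.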

What are you doing for your disaster recovery? Is it even a thought for you or your company?

Jared Shockley https://jaredontech.com
