Several years ago I was called on to manage the recovery of a large-system IBM data center that had been completely destroyed by a regional natural disaster: in this case, a California earthquake. Eleven days after I received that initial phone call, business resumed with a newly built mainframe data center in a distant city. The tens of thousands of end-user clients who relied on that computer system had no idea what had gone on behind the scenes to keep things running.
This was a great success then, wasn’t it?
To the outside world this seemed like a resounding success. But having been in the middle of it all, I would give us collectively a C-minus at best, maybe even a “D” grade.
Huh? So what was the problem?
The problem is this: The recovery was performed by, and made possible in large part by, a small group of extremely experienced, talented technical people who worked in my group. In short, we overcame many failures to plan and prepare for this disaster by deploying a bunch of really smart, really experienced people. The bottom line is: We got very lucky!
We should have been able to make a phone call, activate a protocol, and resume business in an orderly way. Instead, eleven days of chaos and burnout ensued, which is a really lousy way to run any effort–professional or otherwise.
The company suffering this disaster had its own small but very capable (brilliant even?) staff who surely could have performed the recovery instead of us. They each had considerable technical skill and specific knowledge about the system that needed to be recovered. The only problems were that 1) they had written down very little of that critical knowledge, and 2) every one of these people was busy tending to their own personal safety and that of their families, which is exactly what they should have been doing.
So why would “boring” have been better?
I work with another company that pretends, once per year, that its IBM mainframe computer system has been destroyed by a natural disaster. Every year a handful of staff members from this company and I fly to New York and arrive at an IBM Corp cold site about an hour outside of New York City. Armed with nothing more than a big box of backup tapes, we have 48 hours to create a fully functioning system that the company could use in a true disaster.
It’s important to note that before ever doing a live test of our disaster recovery procedure, a bunch of really smart people spent months refining the plan, and honing it to a fine edge. We were ready!
At the risk of boring you, let me give you a quick review of how it went:
Year 1: What a disaster! We were all up for 48 hours straight–and at the end of that 48 hours, we were exhausted and were nowhere near having a working system. Ouch! We went down in flames even though we had a bunch of brilliant people working on it! We were not ready.
After we caught up on lost sleep, we spent months going over what went wrong, what to do differently next time, and so on.
Year 2: What a disaster! We had different problems this time, but we still came nowhere near having a working system after 48 hours. Keep in mind that there is nothing magical about a 48-hour test window, except that if you can’t recover a system in 48 hours, you may not be able to do it in 480 hours either, or 4,800 hours! Exhausted and with bruised egos, we flew home and spent more time devising a new, improved DR plan.
Let’s skip ahead to year 5 now:
We didn’t even fly to New York this time. By this point, we had all spent so much time over the past 5 years of live tests refining and improving the recovery plan that we were sick of even looking at it.
It had become boring!
For the test this year we all just sat at our individual desks, at our own computers, and talked to each other on the phone. We telecommuted. A couple of staff members even brought their guitars to work to ease the boredom while the various long-running recovery processes were running.
Although the test was now very boring, just 24 hours after it started it was complete and 100% successful. The recovery plan no longer relied on brilliant people “putting out fires,” so to speak. Instead, it relied on working and refining the plan, and improving it over and over again from our collective experience until it became routine. Boring even.
When it comes to responding to, and recovering from, a disaster: Boring is good.
Here is what goes into a good, boring plan for recovering from any disaster (in no particular order):
- Make a formal, detailed plan for recovery, or operations, or whatever you need to be accomplishing as a team;
- Write everything down. Commit nothing to memory;
- No one person should *ever* be indispensable. In a true disaster, that “expert” person may not be able to participate or help. They could be dead, or just have a dead cell-phone battery. Either way, they can’t help;
- Include in this script/plan/document a detailed list of everything you need to do right, including a one-sentence description of what constitutes success for that task/unit of work;
- Expect things to go wrong;
- Assume nothing;
- Grade the team’s success on every single item in the script. If you don’t measure it, it’s hard to know if you are improving;
- Don’t presume expertise on the part of the person running the plan. The person executing this plan may be much less knowledgeable, or experienced, than you. Assume that a junior-level person may be called upon to execute this plan in real life;
- Make the plan a complete, stand-alone document. If it needs appendices, that’s fine. But don’t ever assume that someone working the plan has the time or ability or resources to track down some manual, or piece of documentation, or whatever;
- Make the plan the “Bible” for operations.
- Test, and test more. When everything is working, test some more.
- Give the final plan to the newest person on the team. Can he or she effectively lead the team effort with minimal extra input? If not, then consider adjusting the level of detail to enable any reasonably trained team member to take the lead.
- Always be asking: What are we missing? Are there things we don’t know that we don’t know?
- When things get crazy during a real disaster (or a test even), stop, and take a moment to breathe. Rushing around in a panic (even if you feel that way inside) will only make things worse. So stop… Breathe! And above all:
- Stay Safe!
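If it helps to see the "one-sentence description of success" and "grade every single item" ideas in concrete form, here is a minimal sketch in Python. It is only one possible way to keep score, not the tooling we actually used. The class names (`Step`, `RecoveryTest`) and the three example steps are hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One unit of work in the plan, with its one-sentence definition of success."""
    name: str
    success_criterion: str
    passed: Optional[bool] = None  # None means "not graded yet"
    notes: str = ""


@dataclass
class RecoveryTest:
    """One run of the plan (a live test or a real recovery)."""
    label: str
    steps: List[Step] = field(default_factory=list)

    def grade(self, name: str, passed: bool, notes: str = "") -> None:
        """Record the result for the step called `name`."""
        for step in self.steps:
            if step.name == name:
                step.passed = passed
                step.notes = notes
                return
        raise KeyError(f"no step named {name!r} in the plan")

    def score(self) -> float:
        """Fraction of steps that passed; ungraded steps count as not passed."""
        if not self.steps:
            return 0.0
        return sum(1 for s in self.steps if s.passed is True) / len(self.steps)

    def open_items(self) -> List[Step]:
        """Steps that failed or were never graded: the agenda for the next revision."""
        return [s for s in self.steps if s.passed is not True]


# Hypothetical steps, for illustration only.
year_one = RecoveryTest(
    label="Year 1 live test",
    steps=[
        Step("Inventory backup tapes", "Every tape on the packing list is accounted for."),
        Step("Restore system volumes", "The restored system starts without manual fixes."),
        Step("Verify applications", "A test user completes a sample transaction."),
    ],
)
year_one.grade("Inventory backup tapes", passed=True)
year_one.grade("Restore system volumes", passed=False, notes="Instructions skipped a step.")

print(f"{year_one.label}: {year_one.score():.0%} of steps passed")
for step in year_one.open_items():
    print(f"  revisit: {step.name} ({step.notes or 'not graded'})")
```

Counting ungraded steps as failures is a deliberate choice in this sketch: if nobody got far enough to grade a step, the plan has not yet proven that step works.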
Keep repeating the above steps until bored silly. Bored will beat Brilliant, every time!