Dealing with Development House Disasters?
Skinnytie asks: "I was recently asked by the CEO of the company for which I work to find a resource from which to better understand what to do in the event of a disaster. 'I'm on it, Sir' was my response, and I ran back to my desk and started writing contingency plans and trying to imagine what to do if a meteorite strikes our co-lo facility. I quickly came to realize that there is far more that *could* happen (the CTO gets hit by a bus, or the in-house server room gets abducted by aliens...you get the point) than I am even prepared to write plans for. I thought I'd hit the Slashdot audience up for some ideas/horror stories regarding avoiding, dealing with and getting past whatever disasters have occurred at your development houses. Have at you!"
I don't have experience with this myself, but if I were in your boat I would make a system for classifying types of disaster and the appropriate recovery methods for each. For instance, at the top level you would have a disaster resulting in either physical damage or non-physical damage. From there you could classify disaster types according to how much physical damage occurred and what kind.
So, your meteorite example would probably fall into something between a horrible fire and earthquake, as the kind of damage inflicted on your facility would be similar in such events.
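A rough sketch of that classification idea, in Python. Everything in it is an assumption made up for illustration: the class names, the 1-to-3 severity scale, and the recovery descriptions are placeholders, not a recommended taxonomy. The point is only that many specific events (meteorite, fire, earthquake) collapse onto a handful of classes, and you write a plan per class, not per event.

```python
from dataclasses import dataclass
from enum import Enum


class Damage(Enum):
    PHYSICAL = "physical"
    NON_PHYSICAL = "non-physical"


@dataclass(frozen=True)
class DisasterClass:
    damage: Damage
    severity: int   # hypothetical scale: 1 = minor, 3 = facility destroyed
    recovery: str   # placeholder description of the recovery plan


# Hypothetical classes; names, severities and plans are illustrative only.
CLASSES = {
    "facility_destroyed": DisasterClass(
        Damage.PHYSICAL, 3, "bring up services at an off-site facility"),
    "partial_damage": DisasterClass(
        Damage.PHYSICAL, 2, "restore affected systems from backup"),
    "key_person_lost": DisasterClass(
        Damage.NON_PHYSICAL, 2, "hand over using documented procedures"),
}

# Specific events map onto a class; the plan is written once per class.
EVENTS = {
    "meteorite": "facility_destroyed",
    "fire": "facility_destroyed",
    "earthquake": "facility_destroyed",
    "cto_hit_by_bus": "key_person_lost",
}


def recovery_plan(event: str) -> str:
    """Look up the recovery plan for a specific event via its class."""
    return CLASSES[EVENTS[event]].recovery


if __name__ == "__main__":
    for event in EVENTS:
        print(f"{event}: {recovery_plan(event)}")
```

Adding a new scenario then means adding one line to EVENTS, unless it really is a new kind of damage.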
I've done a lot of disaster scenario planning for emergency service providers - and we come up with some really wild stuff, but the above is perhaps the best advice I've seen here so far. Don't worry about the aliens hijacking the data center, worry about the data center resources not being available for whatever reason.
The concept of breaking down the recovery phases is the best recovery advice I can give you. Worry about things in sort of a concentric ring of problems, much as the previous poster presented. Start with the simplest broken piece and move on to the more complicated.
What companies went through due to the WTC going poof is a true real-world example of the worst case scenario occurring - not only did the data center disappear, so did the staff that ran IT there!
Of the recovery efforts I've read about - the guys that had deals with hot-standby facilities out of the immediate area came back the quickest.
Have you compiled your kernel today??
Adding to the last point you mention, the most critical thing you do to any plan is TEST IT.
Stage a disaster.
Either that, or just fake it. Take the data to another location and try to bring it all up, put it on the internet (under a dr.www.?????.com DNS name for example) and see what's accessible.
In case of failure, tune the plan, and try again in 6 months.
In case of success, tune the plan, and try again in a year.
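The retest rule above is simple enough to put in a script that nags you. A minimal sketch, assuming "6 months" means 182 days and "a year" means 365 days; the function name next_test_date is made up for this example.

```python
from datetime import date, timedelta

# Assumed day counts for the intervals in the rule above.
RETEST_AFTER_FAILURE = timedelta(days=182)  # roughly 6 months
RETEST_AFTER_SUCCESS = timedelta(days=365)  # roughly a year


def next_test_date(tested_on: date, passed: bool) -> date:
    """Return when the plan should be tested again.

    Either way the plan gets tuned first; a failed test just
    shortens the wait before the next one.
    """
    interval = RETEST_AFTER_SUCCESS if passed else RETEST_AFTER_FAILURE
    return tested_on + interval


if __name__ == "__main__":
    today = date.today()
    print("after a failed test, retest on:", next_test_date(today, passed=False))
    print("after a passed test, retest on:", next_test_date(today, passed=True))
```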
Jason
Zapman
Years ago (back in the days of drum memory), I was taken on a tour of a data center (the first computer I ever saw).
The guy told me something about their disaster preps (it was financial data).
The first thing he pointed out was that there were 2 complete mainframes, side by side. Each was capable of doing the whole job, but....
The next was pointing out that they had redundant power, plus a generator, and a lesson they learned the hard way - the generator had the ability to power the Air Conditioner as well as the computers - if your server room gets too hot...
Then he said, "we have a second identical data center about 5 miles across town"
Each data center could handle all the customers from that region - yes there would be a performance hit, but...
Then he said, there are 7 more cities around the world, each with 2 data centers like this one - all transactions go to all 8 cities.
And then last, he said there was one more data center, in the Outback of Australia. They figured that it was the least likely place to get nuked, and they even planned for that
Yep, paranoid enough that they wanted their data to survive even if all the major financial centers in the world ALL went "kaboom"
-- 73 de KG2V For the Children - RKBA! "You are what you do when it counts" - the Masso
There is a natural tendency to think that this is all about keeping the data safe, or about having procedures in place, but I have a different way of looking at it that I think is more practical:
Make sure the company stays profitable.
All this involves is insuring against disasters, and making sure the payout will *exceed* whatever it costs to recover.
If my office exploded and I had to re-build everything from scratch, I'm fine, because my company will soon be getting a cheque that will cover everything, including the expected profits for the next 6 months. If we rebuild in 5 months, the disaster is actually a revenue generator for the company.
Compare this to someone with excellent plans and the ability to get rebuilt in just a month, but no insurance on the lost profits. They are facing a net loss even if they work like crazy and get the rebuild done in 3 weeks.
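To put numbers on that comparison, here is a back-of-the-envelope sketch. The figures (100,000 profit per month, 500,000 to rebuild) are invented, and it assumes the rebuild cost itself is insured in both cases, so the only difference is whether lost profits are covered. Premiums are ignored.

```python
def net_outcome(monthly_profit: float,
                rebuild_cost: float,
                months_down: float,
                covered_profit_months: float) -> float:
    """Net gain (positive) or loss (negative) from a disaster.

    Assumes the insurer pays the full rebuild cost plus
    `covered_profit_months` worth of expected profit.
    """
    payout = rebuild_cost + monthly_profit * covered_profit_months
    loss = rebuild_cost + monthly_profit * months_down
    return payout - loss


if __name__ == "__main__":
    # Invented figures, for illustration only.
    profit, rebuild = 100_000, 500_000

    # 6 months of profit insured, rebuilt in 5 months.
    insured = net_outcome(profit, rebuild, months_down=5,
                          covered_profit_months=6)

    # No lost-profit cover, rebuilt in 3 weeks (0.75 of a month).
    uninsured = net_outcome(profit, rebuild, months_down=0.75,
                            covered_profit_months=0)

    print("profits insured, slow rebuild:", insured)
    print("profits uninsured, fast rebuild:", uninsured)
```

With these made-up numbers the slow but insured company comes out one month of profit ahead, and the fast but uninsured one is down three weeks of profit.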
Of course you should still do off-site backups to deal with problems that are 'serious' but not 'disastrous', and have contingency plans in place for day to day traumas, but I think the best way to deal with 'the big one' is simply to insure against it, and make sure the insurance covers the lost profits.
A pizza of radius z and thickness a has a volume of pi z z a