Dealing with Development House Disasters?
Skinnytie asks: "I was recently asked by the CEO of the company for which I work to find a resource from which to better understand what to do in the event of a disaster. 'I'm on it, Sir' was my response, and I ran back to my desk and started writing contingency plans and trying to imagine what to do if a meteorite strikes our co-lo facility. I quickly came to realize that there is far more that *could* happen (the CTO gets hit by a bus, or the in-house server room gets abducted by aliens...you get the point) than I am even prepared to write plans for. I thought I'd hit the Slashdot audience up for some ideas/horror stories regarding avoiding, dealing with and getting past whatever disasters have occurred at your development houses. Have at you!"
One really hard part of plans is to catch *all* the things you are going to need to recover. The only reliable way to find the gaps is to try the plan out, which is kinda like the old advice to actually test your backups occasionally.
For example, if you have plans to relocate the headquarters to a different site, then every 6 or 12 months or so try it out to sort out the "glitches". To expand on this example, does everyone know where the other site is and how to get there? How will they know to go there? Are there problems of quorum, where half the managers will be at one site and half at the other, making contradictory decisions? Etc, etc, etc... This is also a good time to learn about how your organization operates.
Whatever your plans and contingencies, do regular dry runs of them, to the extent that it is practically possible.
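For the "actually test your backups" part, a dry run can be as simple as restoring to a scratch directory and comparing checksums against the live data. Here is a minimal sketch in Python, assuming plain files on disk; the function names `tree_digests` and `verify_restore` and the paths in the usage comment are made up for illustration, not taken from any particular backup tool.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            name = path.relative_to(root).as_posix()
            digests[name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def verify_restore(original, restored):
    """Return the sorted relative paths that are missing from, or differ in,
    the restored tree when compared with the original tree."""
    want = tree_digests(original)
    got = tree_digests(restored)
    return sorted(name for name, digest in want.items() if got.get(name) != digest)


# Hypothetical usage after restoring last night's backup to a scratch area:
#   problems = verify_restore("/srv/projects", "/tmp/restore-test")
#   an empty list means every original file came back intact
```

It reads whole files into memory, so it only suits a small sample of the data, and files that changed since the backup was taken will show up as mismatches. For a dry run that is usually enough to tell you whether the restore procedure works at all.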
break it down into procedures that you can take based on real problems, not causes. some things to consider:
etc. i've seen too many recovery plans that are focused on causes rather than solutions, which is really what these plans are all about. if you really need to, you can cross reference the plans with potential causes. this seems to satisfy the cio types who stay up late wondering 'what happens if _____'.
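one way to picture the cross reference: keep the procedures keyed by the problem you actually observe, list the causes only as a lookup aid, and generate the cause index from that. a rough sketch in python follows. every entry, every 'action' string and the name `index_by_cause` are invented for illustration (the causes are borrowed from the question, plus a made-up isp outage).

```python
from collections import defaultdict

# Hypothetical plan entries. Each procedure is keyed by the observed problem;
# the causes are listed only so they can be cross-referenced later.
PROCEDURES = {
    "co-lo facility unreachable": {
        "action": "fail over to the secondary site",
        "causes": ["meteorite strike", "isp outage"],
    },
    "key person unavailable": {
        "action": "hand duties to the named deputy",
        "causes": ["cto hit by bus"],
    },
    "in-house server room lost": {
        "action": "restore from off-site backups onto spare hardware",
        "causes": ["alien abduction", "meteorite strike"],
    },
}


def index_by_cause(procedures):
    """Build the reverse lookup: cause -> sorted list of problems to work through."""
    index = defaultdict(list)
    for problem, entry in procedures.items():
        for cause in entry["causes"]:
            index[cause].append(problem)
    return {cause: sorted(problems) for cause, problems in index.items()}


if __name__ == "__main__":
    for cause, problems in sorted(index_by_cause(PROCEDURES).items()):
        print(f"what happens if {cause}? see: {', '.join(problems)}")
```

the procedures stay the single source of truth, and the 'what happens if' list is just a generated view of them.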
of course, a backup plan is totally useless if the 'course of action' section is not possible to carry out, due to bad backup practices or lack of failover equipment. having a disaster recovery plan is no substitute for good policy and an adequate hardware/isp budget.