Slashdot Mirror


Dealing with Development House Disasters?

Skinnytie asks: "I was recently asked by the CEO of the company for which I work to find a resource from which to better understand what to do in the event of a disaster. 'I'm on it, Sir' was my response, and I ran back to my desk and started writing contingency plans and trying to imagine what to do if a meteorite strikes our co-lo facility. I quickly came to realize that there is far more that *could* happen (the CTO gets hit by a bus, or the in-house server room gets abducted by aliens...you get the point) than I am even prepared to write plans for. I thought I'd hit the Slashdot audience up for some ideas/horror stories regarding avoiding, dealing with and getting past whatever disasters have occurred at your development houses. Have at you!"

2 of 59 comments

  1. Partial answer: do dry runs of the plans by herrlich_98 · · Score: 4, Insightful

    One really hard part of plans is to catch *all* the things you are going to need to recover. This advice is kinda like the old advice to actually test your backups occasionally.

    For example, if you have plans to relocate the headquarters to a different site, then every 6 or 12 months or so try it out to sort out the "glitches". To expand on this example, does everyone know where the other site is and how to get there? How will they know to go there? Are there problems of quorum, where half the managers will be at one site and half at the other, making contradictory decisions? Etc, etc, etc... This is also a good time to learn about how your organization operates.

    Whatever your plans and contingencies, do regular dry runs of them, to the extent that it is practically possible.
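As a rough illustration of keeping dry runs regular, here is a minimal Python sketch that flags plans whose last dry run is older than a chosen interval. The function name `overdue_dry_runs`, the plan names, and the 365-day default are assumptions made up for this example, not anything prescribed above.

```python
from datetime import date, timedelta

def overdue_dry_runs(last_runs, today, max_age_days=365):
    """Return the names of plans that are due for a dry run, sorted by name.

    last_runs maps a plan name to the date of its last dry run,
    or to None if the plan has never been exercised (hypothetical layout).
    """
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name
        for name, last in last_runs.items()
        if last is None or last < cutoff
    )

# Illustrative data only; these plans and dates are invented.
example_runs = {
    "relocate headquarters": date(2023, 1, 10),
    "restore from backup": date(2024, 5, 1),
    "failover to backup colo": None,
}
overdue = overdue_dry_runs(example_runs, today=date(2024, 6, 1))
```

With the example data, the never-tested failover plan and the relocation plan last run in early 2023 are reported, while the recently tested restore is not.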

  2. simple by farnsworth · · Score: 4, Insightful

    having worked both on disaster recovery plans and in a data center in the World Trade Center that was completely destroyed, I can say that the best recovery plans are extremely simple.

    break it down into procedures that you can take based on real problems, not causes. some things to consider:

    1. internet connectivity down (switch to backup colo)
    2. unrecoverable db (drive to offsite backup storage and get backup data)
    3. app servers fried (engage hot standby boxes)
    4. all of the above (shit, it's going to be a long night)

    etc. I've seen too many recovery plans that are focused on causes rather than solutions, which are really what these plans are all about. if you really need to, you can cross reference the plans with potential causes. this seems to satisfy the CIO types who stay up late wondering 'what happens if _____'.
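One way to picture a plan organized around problems rather than causes is a lookup keyed by the observed symptom, with causes kept in a separate cross-reference. The Python sketch below is only an illustration: the names `RUNBOOK`, `CAUSES`, and both functions are hypothetical, and the "backhoe cuts fiber" cause is an invented example.

```python
# Hypothetical runbook keyed by the problem you actually observe.
RUNBOOK = {
    "internet connectivity down": "switch to backup colo",
    "unrecoverable db": "drive to offsite backup storage and get backup data",
    "app servers fried": "engage hot standby boxes",
}

# Optional cross-reference from potential causes to the problems they produce.
# The causes listed here are illustrative assumptions.
CAUSES = {
    "meteorite strikes co-lo": [
        "internet connectivity down",
        "unrecoverable db",
        "app servers fried",
    ],
    "backhoe cuts fiber": ["internet connectivity down"],
}

def procedures_for(symptoms):
    """Return the procedure for each observed symptom, in the order given.

    Raises KeyError for a symptom the runbook does not cover.
    """
    return [RUNBOOK[symptom] for symptom in symptoms]

def procedures_for_cause(cause):
    """Answer a 'what happens if ...' question via the cross-reference."""
    return procedures_for(CAUSES[cause])
```

The recovery steps live in one place, and a 'what happens if _____' question is answered by looking up the cause and reusing the same steps.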

    of course, a backup plan is totally useless if the 'course of action' section is not possible to carry out, due to bad backup practices or lack of failover equipment. having a disaster recovery plan is no substitute for good policy and an adequate hardware/isp budget.

    --

    There aint no pancake so thin it doesn't have two sides.