Dealing with Development House Disasters?
Skinnytie asks: "I was recently asked by the CEO of the company for which I work to find a resource from which to better understand what to do in the event of a disaster. 'I'm on it, Sir' was my response, and I ran back to my desk and started writing contingency plans and trying to imagine what to do if a meteorite strikes our co-lo facility. I quickly came te realize that there is far more that *could* happen (the CTO gets hit by bus, or the in-house server room gets abducted by aliens...you get the point) than I am even prepared to write plans for. I thought I'd hit the Slashdot audience up for some ideas/horror stories regarding avoiding, dealing with and getting past whatever disasters that have occurred at your development houses. Have at you!"
First and foremost, make sure you have your data backed up and secure. Nothing else matters if the data is missing. When did you last test your backups?
Next ensure you can get you backup data when you need it. You state colo I'm assuming you have a duplicate system at your colo. Can you get access at 22:15 on July 4th?
Now draw up your usual emergencies fire, flood, tornado, earthquake etc. Have a plan how to get your systems up and running. Do you need to rent office space? What about net conectivity.
Lastly, re-check your data backups, do you have everything, is it error free, do you have more than one copy. If you have the data your company can recover.
No data, no job, it's as simple as that.
'I'm on it, Sir' was my response
A friend of mine once gave a response that was less gentle:
Sir, you just laid off half the developers, and half of the support staff, but you didn't reduce the marketing staff.
There is one manager for every 5 non-manager, we're still not meeting our financial targets, our new "Premium services" campaign is earning $1 for every $1000 we invested, we don't have enough tech staff to fix the bugs, the QA department was reduced to a single person and can't even find the bugs, and tech support is dealing with a growing number of irate customers every day.
We can barely keep up with the endless list of new tasks that you assign, sir, and you want me to waste my time daydreaming about asteroids?
We don't need a contigency plan sir, we ARE IN the contigency plan.
Get real, sir.
Still kept his job. Ok, maybe he wasn't that snotty...
"Can of worms? The can is open... the worms are everywhere."
One really hard part of plans is to catch *all* the things you are going to need to recover. This advice is kinda like the old advice to actually test your backups occasionally.
For example, if you have plans to relocate the headquarters to a different site then every 6 or 12 months so try it out to sort out the "glitches". To expand on this example, does everyone know where the other site is and how to get there? How will they know to go there? Are there problems of quorum, where half the managers will be at one site and half at the other making contradicting decisions? Etc, etc, etc... This is also a good time to learn about how your organization operates.
Whatever your plans and contingencies do regular dry runs of them, to the extent that it is practically possible.
break it down into procedures that you can take based on real problems, not causes. some things to consider:
etc. I've seen too many recovery plans that are focused on the cause, rather than solutions which is really what these plans are all about. if you really need to, you can cross reference the plans with potential causes. this seems to satisfy the cio types who stay up late wondering 'what happens if _____'.
of course, a backup plan is totally useless if the 'course of action' section is not possible to carry out, due to bad backup practices or lack of failover equipment. having a disaster recovery plan is no substitute for good policy and an adequate hardware/isp budget.
There aint no pancake so thin it doesn't have two sides.