Dealing with Development House Disasters?
Skinnytie asks: "I was recently asked by the CEO of the company for which I work to find a resource from which to better understand what to do in the event of a disaster. 'I'm on it, Sir' was my response, and I ran back to my desk and started writing contingency plans and trying to imagine what to do if a meteorite strikes our co-lo facility. I quickly came to realize that there is far more that *could* happen (the CTO gets hit by a bus, or the in-house server room gets abducted by aliens...you get the point) than I am even prepared to write plans for. I thought I'd hit the Slashdot audience up for some ideas/horror stories regarding avoiding, dealing with and getting past whatever disasters have occurred at your development houses. Have at you!"
One really hard part of plans is to catch *all* the things you are going to need to recover. This advice is kinda like the old advice to actually test your backups occasionally.
For example, if you have plans to relocate the headquarters to a different site, then every 6 or 12 months or so try it out to sort out the "glitches". To expand on this example, does everyone know where the other site is and how to get there? How will they know to go there? Are there problems of quorum, where half the managers will be at one site and half at the other, making contradictory decisions? Etc, etc, etc... This is also a good time to learn about how your organization operates.
Whatever your plans and contingencies, do regular dry runs of them, to the extent that it is practically possible.
I know of one outfit that shall remain nameless.
They laid off their sys admin.
Then soon found out what a root password was for.
134340: I am not a number. I am a free planet!
Instead of planning for fire, flood, and alien attack, categorize things by level of severity: whether or not the primary site is intact, the amount of time it will take to resume normal operations, whether the event is isolated/regional/national, etc. It doesn't matter if your office is ruined by an asteroid or a terrorist's bomb, the company will still be doing its work at another site.
:)
Identify critical paths and personnel. An organization can function for several weeks without C*O's, but without the "worker bees" the company will grind to a halt almost immediately. Also consider the effects of losing large numbers of staff. If every developer and admin quit, what would be the effect on the company?
Verify the physical security of all sites. Imagine "man with a gun" security breaches -- would an armed intruder (willing to kill) be able to cause significant damage?
I visited the colo site for a company I was with. They had multiple electrical hookups and enough fuel to run the diesel generators for 48 hours. There were two separate water main connections (a holdover from the water-cooled mainframe days). The colo was connected to two different phone companies (on opposite sides of the building), plus a dish for satellite uplink. It was incredibly expensive to build and maintain, but it would be more expensive if there was ever any downtime.
Also take cost into account. How much is your company willing and able to spend for disaster prevention and recovery? Don't forget to include your time in those figures...
break it down into procedures that you can take based on real problems, not causes. some things to consider:
etc. I've seen too many recovery plans that are focused on the cause, rather than solutions which is really what these plans are all about. if you really need to, you can cross reference the plans with potential causes. this seems to satisfy the cio types who stay up late wondering 'what happens if _____'.
of course, a backup plan is totally useless if the 'course of action' section is not possible to carry out, due to bad backup practices or lack of failover equipment. having a disaster recovery plan is no substitute for good policy and an adequate hardware/isp budget.
There aint no pancake so thin it doesn't have two sides.
Little realized thing until it's too late: Do you have raid arrays? If so, you might want to ponder making sure you can always "get" another raid card of the same type currently running your array. Either have another on a shelf (off site, whatever), or another compatible system.
Nothing sucks more than to have something stupid like the raid card (or motherboard) die, and you can't get a replacement that will recognize your current array of drives.
What you've been asked (volunteered) to do is a risk analysis. This is a whole lot more than being a l33t admin of a high-availability site, as many Slashdotters seem to think. You've hit onto some of the non-technical risks (your CTO example), but to address this properly you need to concentrate on identifying risks, and how to handle them.
The first thing to realise is that you can't have a preformed contingency plan for everything. What you can do is identify every point of risk, weight it according to likelihood and severity, and develop plans for the "likely worst cases" that you discover. The rest of the risk is a business risk, that is, you insure yourself against it and deal with it if and when it happens.
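To make the weighting concrete, here is a rough sketch of a risk register in Python. Everything in it is an assumption for illustration: the `Risk` class, the likelihood-times-severity score, the 1 to 5 severity scale, the threshold, and all of the numbers are made up, not taken from any real audit. The idea is only to show how "likely worst cases" get separated from the risks you insure against.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: float  # guessed chance of it happening in a year, 0..1 (illustrative)
    severity: int      # 1 = nuisance .. 5 = company-ending (hypothetical scale)

    @property
    def score(self) -> float:
        # Simple weighting: likelihood multiplied by severity.
        return self.likelihood * self.severity


def rank_risks(risks, plan_threshold):
    """Sort risks by score and split them at the threshold.

    Returns (planned, insured): risks scoring at or above the threshold get a
    written contingency plan; the rest are treated as insured business risk.
    """
    ordered = sorted(risks, key=lambda r: r.score, reverse=True)
    planned = [r for r in ordered if r.score >= plan_threshold]
    insured = [r for r in ordered if r.score < plan_threshold]
    return planned, insured


# Made-up entries based on examples from this thread.
RISKS = [
    Risk("RAID controller dies", 0.30, 3),
    Risk("Sole sysadmin leaves", 0.20, 4),
    Risk("Meteorite hits co-lo", 0.000001, 5),
]

planned, insured = rank_risks(RISKS, plan_threshold=0.5)
for r in planned:
    print(f"PLAN   {r.name}: {r.score:.2f}")
for r in insured:
    print(f"INSURE {r.name}: {r.score:.6f}")
```

With these guesses the dead RAID card and the departing sysadmin get written plans, and the meteorite is left to the insurance policy. A real register would need better likelihood estimates than anyone can pull out of the air, so treat the scores as a way to order the list, not as precise figures.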
You should also bear in mind that, from a technical viewpoint, there are no absolute guarantees. Almost all high-availability strategies protect against a single point of failure, but this isn't enough. What if you have multiple failures? How quickly can you detect and respond to a failure? How long can you suffer a complete outage? (This is really important to know, and "we can't" is not an acceptable answer.) Uptime costs money, so calculate the point of balance.
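One way to look for that point of balance is to add what each level of protection costs per year to what the outages it still allows would cost you. The sketch below does that in Python. The option names, prices, and expected outage hours are invented for illustration, and it assumes downtime cost scales linearly with hours down, which is a simplification.

```python
def total_annual_cost(option, cost_per_hour_down):
    """Yearly spend on protection plus the expected yearly cost of outages."""
    return option["annual_cost"] + option["expected_outage_hours"] * cost_per_hour_down


def cheapest_option(options, cost_per_hour_down):
    """Pick the option with the lowest combined cost for a given downtime cost."""
    return min(options, key=lambda o: total_annual_cost(o, cost_per_hour_down))


# Hypothetical options; all figures are made up.
OPTIONS = [
    {"name": "single server", "annual_cost": 5_000, "expected_outage_hours": 40},
    {"name": "hot standby", "annual_cost": 30_000, "expected_outage_hours": 4},
    {"name": "second site", "annual_cost": 150_000, "expected_outage_hours": 0.5},
]

for hourly in (500, 5_000, 100_000):
    best = cheapest_option(OPTIONS, hourly)
    print(f"downtime at ${hourly}/hour: {best['name']}")
```

The point of the exercise is that the answer depends on an honest figure for what an hour of outage costs, which is why "we can't be down" doesn't work as an input.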
Ask Google about "organizational risk" - you'll find a lot of information about auditing risk that can put you on the right path.
i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net