Dealing with Development House Disasters?

Posted by Cliff on Thursday April 3, 2003 @12:45PM from the planning-for-the-worst dept.

Skinnytie asks: "I was recently asked by the CEO of the company for which I work to find a resource from which to better understand what to do in the event of a disaster. 'I'm on it, Sir' was my response, and I ran back to my desk and started writing contingency plans and trying to imagine what to do if a meteorite strikes our co-lo facility. I quickly came to realize that there is far more that *could* happen (the CTO gets hit by a bus, or the in-house server room gets abducted by aliens...you get the point) than I am even prepared to write plans for. I thought I'd hit the Slashdot audience up for some ideas/horror stories regarding avoiding, dealing with and getting past whatever disasters have occurred at your development houses. Have at you!"

Secure the data by Usquebaugh · 2003-04-03 12:55 · Score: 4, Informative

First and foremost, make sure you have your data backed up and secure. Nothing else matters if the data is missing. When did you last test your backups?

Next, ensure you can get your backup data when you need it. You mention a colo, so I'm assuming you have a duplicate system there. Can you get access at 22:15 on July 4th?

Now draw up your usual emergencies: fire, flood, tornado, earthquake, etc. Have a plan for how to get your systems up and running. Do you need to rent office space? What about net connectivity?

Lastly, re-check your data backups: do you have everything, is it error free, do you have more than one copy? If you have the data, your company can recover.

No data, no job, it's as simple as that.
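
One way to make "is it error free?" checkable is to compare checksums of the live files against the backup copy. The following is only a minimal sketch: it assumes the backup is a plain directory tree mirroring the source (not a tape or an archive format), and the names `sha256_of` and `verify_backup` are made up for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir: Path, backup_dir: Path) -> dict:
    """Compare every file under source_dir with its copy under backup_dir.

    Assumption: the backup mirrors the source directory layout.
    Returns relative paths that are missing from, or differ in, the backup.
    """
    report = {"missing": [], "corrupt": []}
    for source_file in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = source_file.relative_to(source_dir)
        backup_file = backup_dir / relative
        if not backup_file.is_file():
            report["missing"].append(relative.as_posix())
        elif sha256_of(backup_file) != sha256_of(source_file):
            report["corrupt"].append(relative.as_posix())
    return report
```

An empty report only says the copy matched the source at the time of the check. It does not replace an actual restore test, and it says nothing about whether you can reach the backup when you need it.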
1. Re:Secure the data by stefanlasiewski · 2003-04-03 12:59 · Score: 3, Informative
  
  Lastly, re-check your data backups, do you have everything, is it error free, do you have more than one copy. If you have the data your company can recover.
  
  And something else as important: Where are your backup tapes?
  
  If they are sitting in a locked cabinet right next to the computers, they won't survive the asteroid blast either.
  
  Off-site backups. A pain to maintain, but a good idea for any contingency plan.
  
  --
  "Can of worms? The can is open... the worms are everywhere."
2. Re:Secure the data by 0x0d0a · 2003-04-03 19:55 · Score: 3, Funny
  
  It fills me with pride to know that even if an asteroid pastes me and half the people on earth, my TPS cover sheet templates will still be readable.
  
  --
  May we never see th
A certain friend of mine by stefanlasiewski · 2003-04-03 12:56 · Score: 4, Funny

'I'm on it, Sir' was my response

A friend of mine once gave a response that was less gentle:

Sir, you just laid off half the developers, and half of the support staff, but you didn't reduce the marketing staff.

There is one manager for every 5 non-managers, we're still not meeting our financial targets, our new "Premium services" campaign is earning $1 for every $1000 we invested, we don't have enough tech staff to fix the bugs, the QA department was reduced to a single person and can't even find the bugs, and tech support is dealing with a growing number of irate customers every day.

We can barely keep up with the endless list of new tasks that you assign, sir, and you want me to waste my time daydreaming about asteroids?

We don't need a contingency plan, sir, we ARE IN the contingency plan.

Get real, sir.

Still kept his job. Ok, maybe he wasn't that snotty...

--
"Can of worms? The can is open... the worms are everywhere."
The Sky is falling!!! by TheDarkRogue · 2003-04-03 12:56 · Score: 3, Funny

The ceiling of the server room (or the floor of the party upstairs, a company success celebration) fell in at my friend's place of work, impaling their file server with a piece of rebar through the motherboard and a RAID/IDE controller card. All was well till the hole caught fire, due to what was later found to be a damaged coffee cup heater that a stack of books had fallen onto. No one was seriously injured by the event other than the group of people who designed the building.

This was an architectural firm.

--
(Score:0, Interesting)
Partial answer: do dry runs of the plans by herrlich_98 · 2003-04-03 12:59 · Score: 4, Insightful

One really hard part of plans is to catch *all* the things you are going to need to recover. This advice is kinda like the old advice to actually test your backups occasionally.

For example, if you have plans to relocate the headquarters to a different site, then every 6 or 12 months or so try it out to sort out the "glitches". To expand on this example, does everyone know where the other site is and how to get there? How will they know to go there? Are there problems of quorum, where half the managers will be at one site and half at the other, making contradictory decisions? Etc, etc, etc... This is also a good time to learn about how your organization operates.

Whatever your plans and contingencies, do regular dry runs of them, to the extent that it is practically possible.
simple by farnsworth · 2003-04-03 14:07 · Score: 4, Insightful
having both worked on disaster recovery plans and having worked in a data center in the world trade center that was completely destroyed, I can say that the best recovery plans are extremely simple.
break it down into procedures that you can take based on real problems, not causes. some things to consider:
1. internet connectivity down (switch to backup colo)
2. unrecoverable db (drive to offsite backup storage and get backup data)
3. app servers fried (engage hot standby boxes)
4. all of the above (shit, it's going to be a long night)
etc. I've seen too many recovery plans that are focused on the cause, rather than solutions which is really what these plans are all about. if you really need to, you can cross reference the plans with potential causes. this seems to satisfy the cio types who stay up late wondering 'what happens if _____'.
of course, a backup plan is totally useless if the 'course of action' section is not possible to carry out, due to bad backup practices or lack of failover equipment. having a disaster recovery plan is no substitute for good policy and an adequate hardware/isp budget.
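
The symptom-first layout above can be sketched as a small lookup table. This is only an illustration: the three procedures are taken from the list above, while the `CAUSES` cross-reference, its entries, and the function names are hypothetical.

```python
from enum import Enum


class Symptom(Enum):
    CONNECTIVITY_DOWN = "internet connectivity down"
    DB_UNRECOVERABLE = "unrecoverable db"
    APP_SERVERS_FRIED = "app servers fried"


# Procedures keyed by the observable problem, not by what caused it.
PROCEDURES = {
    Symptom.CONNECTIVITY_DOWN: "switch to backup colo",
    Symptom.DB_UNRECOVERABLE: "drive to offsite backup storage and get backup data",
    Symptom.APP_SERVERS_FRIED: "engage hot standby boxes",
}

# Optional cross-reference for the 'what happens if ___' questions.
# These cause-to-symptom mappings are made-up examples.
CAUSES = {
    "meteorite hits the colo": {Symptom.CONNECTIVITY_DOWN, Symptom.APP_SERVERS_FRIED},
    "server room fire": {Symptom.DB_UNRECOVERABLE, Symptom.APP_SERVERS_FRIED},
}


def plan_for(symptoms):
    """Return the procedures for the observed symptoms, in a fixed order."""
    return [PROCEDURES[s] for s in Symptom if s in symptoms]


def plan_for_cause(cause):
    """Translate a hypothetical cause into symptoms, then into procedures."""
    return plan_for(CAUSES[cause])
```

The point of the design is that a new cause only adds a row to `CAUSES`; the procedures themselves stay few and simple.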
--
There aint no pancake so thin it doesn't have two sides.
1. Re:simple by stevew · 2003-04-03 14:20 · Score: 3, Interesting
  
  I've done a lot of disaster scenario planning for emergency service providers - and we come up with some really wild stuff, but the above is perhaps the best advice I've seen here so far. Don't worry about the aliens hijacking the data center, worry about the data center resources not being available for whatever reason.
  
  The concept of breaking down the recovery phases is the best recovery advice I can give you. Worry about things in sort of a concentric ring of problems, much as the previous poster presented. Start with the simplest broken piece and move on to the more complicated.
  
  The things companies went through due to the WTC going poof are a true real-world example of the worst case scenario occurring - not only did the data center disappear, so did the staff that ran IT there!
  
  Of the recovery efforts I've read about - the guys that had deals with hot-standby facilities out of the immediate area came back the quickest.
  
  --
  Have you compiled your kernel today??
Years ago by CharlieG · 2003-04-03 15:56 · Score: 3, Interesting

Years ago (Back in the days of drum memory), I was taken on a tour of a data center (first computer I ever saw)

The guy told me something about their disaster preps (it was financial data)

The first thing he pointed out was that there were 2 complete mainframes, side by side. Each was capable of doing the whole job, but....

The next was pointing out that they had redundant power, plus a generator, and a lesson they learned the hard way - the generator had the ability to power the Air Conditioner as well as the computers - if your server room gets too hot...

Then he said, "we have a second identical data center about 5 miles across town"

Each data center could handle all the customers from that region - yes there would be a performance hit, but...

Then he said, there are 7 more cities around the world, each with 2 data centers like this one - all transactions go to all 8 cities.

And then last, he said there was one more data center, in the Outback of Australia. They figured that it was the least likely place to get nuked, and they even planned for that

Yep, paranoid enough that they wanted their data to survive even if all the major financial centers in the world ALL went "kaboom"

--
-- 73 de KG2V For the Children - RKBA! "You are what you do when it counts" - the Masso
Decide where to spend your effort. by PinglePongle · 2003-04-03 23:57 · Score: 3, Informative
There have been a number of posts already, but I would create a simple spreadsheet breaking down all the risks you can think of as follows :
- Risk category : eg. people risk (CTO gets hit by a bus), infrastructure risk (your server is destroyed by aliens), legal risk (you get sued by the RIAA for having an MP3 somewhere on your network), commercial risk (your biggest client goes bust), regulatory risk (the government licenses development shops) etc.
- Risk impact : - the impact on the business if the risk were to occur. Probably best to summarize into 5 or so levels from "negligible" to "unrecoverable"
- Likelihood - the chances of the risk occurring. Your office is unlikely to get hit by a meteorite, but your top coder may get headhunted and take all that knowledge with him.
You then have to reach a business decision how much to spend on mitigating each risk. Clearly it's worth spending time on "high likelihood, disastrous impact" risks, but you may not care much about a transport strike stopping the cleaning staff from getting into work for a couple of days.
When you know which risks you care about, identify mitigation strategies. Typically, this starts with identifying an owner for the risk, who is in charge of the mitigation strategy. For instance, the development manager may have to find ways of mitigating the "top coder headhunted" risk by implementing code review processes, knowledge sharing systems, etc.
You should not let the business view risk management as a technology issue - it's a business issue. Risk mitigation has associated costs, either financial, time, or opportunity cost - the best way to avoid not getting paid for your work is to avoid working for unreliable bill payers. If you come up with a wonderful risk list and proposals for mitigation, your work will be wasted unless the business is willing to bear the cost of implementing your proposals.
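
The spreadsheet described above could be prototyped in a few lines. This is a rough sketch under stated assumptions: the level names between "negligible" and "unrecoverable", the likelihood scale, the impact-times-likelihood score, and the example ratings are all illustrative choices, not something prescribed here.

```python
from dataclasses import dataclass

# Assumed 5-level scales; only "negligible" and "unrecoverable" come from the post.
IMPACT_LEVELS = ["negligible", "minor", "serious", "disastrous", "unrecoverable"]
LIKELIHOOD_LEVELS = ["rare", "unlikely", "possible", "likely", "near certain"]


@dataclass
class Risk:
    category: str
    description: str
    impact: str
    likelihood: str
    owner: str = "unassigned"

    @property
    def score(self) -> int:
        """Impact rank times likelihood rank (an assumed scoring rule)."""
        impact_rank = IMPACT_LEVELS.index(self.impact) + 1
        likelihood_rank = LIKELIHOOD_LEVELS.index(self.likelihood) + 1
        return impact_rank * likelihood_rank


def prioritize(risks, threshold=1):
    """Drop risks scoring below the threshold and sort the rest, worst first."""
    kept = [risk for risk in risks if risk.score >= threshold]
    return sorted(kept, key=lambda risk: risk.score, reverse=True)


# Example rows with made-up ratings.
register = [
    Risk("infrastructure", "office hit by a meteorite", "unrecoverable", "rare"),
    Risk("people", "top coder gets headhunted", "serious", "likely",
         owner="development manager"),
    Risk("infrastructure", "transport strike keeps cleaning staff away",
         "negligible", "possible"),
]

for risk in prioritize(register, threshold=5):
    print(risk.score, risk.description, "owner:", risk.owner)
```

The threshold is where the business decision lives: whoever pays for the mitigation decides which scores are worth acting on, and every risk above the line should end up with a named owner.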
--
It's all very well in practice, but it will never work in theory.