Slashdot Mirror


Dealing with Development House Disasters?

Skinnytie asks: "I was recently asked by the CEO of the company for which I work to find a resource from which to better understand what to do in the event of a disaster. 'I'm on it, Sir' was my response, and I ran back to my desk and started writing contingency plans and trying to imagine what to do if a meteorite strikes our co-lo facility. I quickly came to realize that there is far more that *could* happen (the CTO gets hit by a bus, or the in-house server room gets abducted by aliens...you get the point) than I am even prepared to write plans for. I thought I'd hit the Slashdot audience up for some ideas/horror stories regarding avoiding, dealing with and getting past whatever disasters have occurred at your development houses. Have at you!"

19 of 59 comments

  1. Secure the data by Usquebaugh · · Score: 4, Informative

    First and foremost, make sure you have your data backed up and secure. Nothing else matters if the data is missing. When did you last test your backups?

    Next, ensure you can get your backup data when you need it. You mention a colo, so I'm assuming you have a duplicate system there. Can you get access at 22:15 on July 4th?

    Now draw up your usual emergencies: fire, flood, tornado, earthquake, etc. Have a plan for how to get your systems up and running. Do you need to rent office space? What about net connectivity?

    Lastly, re-check your data backups: do you have everything, is it error free, do you have more than one copy? If you have the data, your company can recover.

    No data, no job, it's as simple as that.
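
One way to make "is it error free, do you have more than one copy" a routine check is to compare checksums. The following is a minimal sketch, not a full restore test: it assumes the backups are plain directory copies of a live source directory, and the names `sha256_of` and `verify_copies` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copies(source_dir: Path, backup_dirs: list[Path]) -> list[str]:
    """List the problems found when comparing each backup copy to the source.

    An empty list means every source file exists in every copy with
    identical content, and at least two copies were checked.
    """
    problems = []
    for source_file in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = source_file.relative_to(source_dir)
        expected = sha256_of(source_file)
        for backup_dir in backup_dirs:
            copy = backup_dir / relative
            if not copy.is_file():
                problems.append(f"missing: {copy}")
            elif sha256_of(copy) != expected:
                problems.append(f"corrupt: {copy}")
    # "Do you have more than one copy?" (assumed minimum of two)
    if len(backup_dirs) < 2:
        problems.append("fewer than two backup copies configured")
    return problems
```

Calling `verify_copies(Path("/data"), [Path("/mnt/onsite"), Path("/mnt/offsite")])` with your own (here made-up) paths returns a list you can mail to yourself. It only proves the files match; it does not prove you can rebuild a working system from them.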

    1. Re:Secure the data by stefanlasiewski · · Score: 3, Informative

      Lastly, re-check your data backups: do you have everything, is it error free, do you have more than one copy? If you have the data, your company can recover.

      And something else as important: Where are your backup tapes?

      If they are sitting in a locked cabinet right next to the computers, they won't survive the asteroid blast either.

      Off site backups. A pain to maintain, but a good idea for any contingency plan.

      --
      "Can of worms? The can is open... the worms are everywhere."
    2. Re:Secure the data by 0x0d0a · · Score: 3, Funny

      It fills me with pride to know that even if an asteroid pastes me and half the people on earth, my TPS cover sheet templates will still be readable.

  2. A certain friend of mine by stefanlasiewski · · Score: 4, Funny

    'I'm on it, Sir' was my response

    A friend of mine once gave a response that was less gentle:


    Sir, you just laid off half the developers, and half of the support staff, but you didn't reduce the marketing staff.

    There is one manager for every 5 non-managers, we're still not meeting our financial targets, our new "Premium services" campaign is earning $1 for every $1000 we invested, we don't have enough tech staff to fix the bugs, the QA department was reduced to a single person and can't even find the bugs, and tech support is dealing with a growing number of irate customers every day.

    We can barely keep up with the endless list of new tasks that you assign, sir, and you want me to waste my time daydreaming about asteroids?

    We don't need a contingency plan sir, we ARE IN the contingency plan.

    Get real, sir.


    Still kept his job. Ok, maybe he wasn't that snotty...

    --
    "Can of worms? The can is open... the worms are everywhere."
  3. The Sky is falling!!! by TheDarkRogue · · Score: 3, Funny

    The ceiling of the server room (or the floor of the company success party going on upstairs) fell in at my friend's place of work, impaling their file server with a piece of rebar through the motherboard and a RAID/IDE controller card. All was well until the hole caught fire, due to what was later found to be a damaged coffee cup heater that a stack of books had fallen onto. No one was seriously injured by the event, other than the group of people who designed the building.

    This was an architectural firm.

    --
    (Score:0, Interesting)
  4. Partial answer: do dry runs of the plans by herrlich_98 · · Score: 4, Insightful

    One really hard part of plans is to catch *all* the things you are going to need to recover. This advice is kinda like the old advice to actually test your backups occasionally.

    For example, if you have plans to relocate the headquarters to a different site, then every 6 or 12 months try it out to sort out the "glitches". To expand on this example, does everyone know where the other site is and how to get there? How will they know to go there? Are there problems of quorum, where half the managers will be at one site and half at the other, making contradictory decisions? Etc, etc, etc... This is also a good time to learn about how your organization operates.

    Whatever your plans and contingencies, do regular dry runs of them, to the extent that it is practically possible.

  5. Categorize by Andrew+Lockhart · · Score: 2, Interesting

    I don't have experience with this myself, but if I were in your boat I would make a system for classifying types of disaster and the appropriate recovery methods for each. For instance, at the top level you would have a disaster resulting in either physical or non-physical damage. From there you could classify disaster types according to how much and what physical damage occurred.

    So, your meteorite example would probably fall into something between a horrible fire and earthquake, as the kind of damage inflicted on your facility would be similar in such events.

  6. Volcano Disaster Plan by Rick+the+Red · · Score: 2, Funny

    I worked at a large aircraft manufacturer in the Pacific Northwest back when Mt. St. Helens blew. They quickly imposed a "Volcano Disaster Plan" that, AFAIK, is still officially in place. We never followed it, however, because it included such mandates as turning off all equipment at night and sealing it up in plastic and duct tape just in case the building got dusted with ash. Yeah, right! (remember, this was 1980, well before a computer could fit in a garbage bag) It was bad enough for us with our CAD workstations and Tektronix terminals; I can imagine what the boys running the IBM big iron thought of that plan. Where are you going to find a plastic bag big enough for a 370 mainframe?

    --
    If all this should have a reason, we would be the last to know.
  7. What not to do by the_other_one · · Score: 2, Insightful

    I know of one outfit that shall remain nameless.

    They laid off their sys admin.

    Then soon found out what a root password was for.

    --
    134340: I am not a number. I am a free planet!
  8. A couple of thoughts by travail_jgd · · Score: 2, Insightful

    Instead of planning for fire, flood, and alien attack, categorize things by level of severity: whether or not the primary site is intact, the amount of time it will take to resume normal operations, whether the event is isolated/regional/national, etc. It doesn't matter if your office is ruined by an asteroid or a terrorist's bomb, the company will still be doing its work at another site.
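
The three criteria above can be turned into a small classifier. This is only a sketch: the type names, the four severity levels and the one-day threshold are assumptions made for illustration, not anything a real plan prescribes.

```python
from dataclasses import dataclass
from enum import IntEnum


class Scope(IntEnum):
    ISOLATED = 1
    REGIONAL = 2
    NATIONAL = 3


@dataclass(frozen=True)
class Event:
    primary_site_intact: bool
    days_to_resume: float  # estimated time until normal operations resume
    scope: Scope


def severity(event: Event) -> int:
    """Return a level from 1 (minor) to 4 (worst); thresholds are hypothetical."""
    level = 1
    if event.days_to_resume > 1:
        level = 2
    if not event.primary_site_intact:
        level = 3
    if not event.primary_site_intact and event.scope >= Scope.REGIONAL:
        level = 4
    return level


# The cause does not appear anywhere: an asteroid and a bomb that both
# destroy the office and nothing else get the same level.
asteroid = Event(primary_site_intact=False, days_to_resume=30, scope=Scope.ISOLATED)
bomb = Event(primary_site_intact=False, days_to_resume=30, scope=Scope.ISOLATED)
```

Both `severity(asteroid)` and `severity(bomb)` land on the same level, so one "primary site lost" procedure covers them.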

    Identify critical paths and personnel. An organization can function for several weeks without C*O's, but without the "worker bees" the company will grind to a halt almost immediately. Also consider the effects of losing large numbers of staff. If every developer and admin quit, what would be the effect on the company?

    Verify the physical security of all sites. Imagine "man with a gun" security breaches -- would an armed intruder (willing to kill) be able to cause significant damage?

    I visited the colo site for a company I was with. They had multiple electrical hookups and enough fuel to run the diesel generators for 48 hours. There were two separate water main connections (a holdover from the water-cooled mainframe days). The colo was connected to two different phone companies (on opposite sides of the building), plus a dish for satellite uplink. It was incredibly expensive to build and maintain, but it would be more expensive if there was ever any downtime.

    Also take cost into account. How much is your company willing and able to spend for disaster prevention and recovery? Don't forget to include your time in those figures... :)

  9. simple by farnsworth · · Score: 4, Insightful
    having both worked on disaster recovery plans and having worked in a data center in the world trade center that was completely destroyed, I can say that the best recovery plans are extremely simple.

    break it down into procedures that you can take based on real problems, not causes. some things to consider:

    1. internet connectivity down (switch to backup colo)
    2. unrecoverable db (drive to offsite backup storage and get backup data)
    3. app servers fried (engage hot standby boxes)
    4. all of the above (shit, it's going to be a long night)

    etc. I've seen too many recovery plans that are focused on the cause, rather than solutions which is really what these plans are all about. if you really need to, you can cross reference the plans with potential causes. this seems to satisfy the cio types who stay up late wondering 'what happens if _____'.

    of course, a backup plan is totally useless if the 'course of action' section is not possible to carry out, due to bad backup practices or lack of failover equipment. having a disaster recovery plan is no substitute for good policy and an adequate hardware/isp budget.
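
a rough sketch of that structure, with procedures keyed by the problem and causes kept only as a cross reference. the three procedures come from the list above; the cause names and both function names are made up for illustration.

```python
# Procedures are keyed by the observable problem, not by what caused it.
PROCEDURES = {
    "internet connectivity down": "switch to backup colo",
    "unrecoverable db": "drive to offsite backup storage and get backup data",
    "app servers fried": "engage hot standby boxes",
}

# Hypothetical cross reference for the 'what happens if _____' questions:
# each cause maps to the problems it would produce.
CAUSES = {
    "ISP outage": ["internet connectivity down"],
    "fire in server room": ["unrecoverable db", "app servers fried"],
    "meteorite strike": list(PROCEDURES),  # all of the above
}


def plan_for_problems(problems):
    """Return the procedures for the given problems, in the order given.

    An unknown problem raises KeyError, which is a hint that the plan has a gap.
    """
    return [PROCEDURES[problem] for problem in problems]


def plan_for_cause(cause):
    """Answer a 'what if' question by looking up the problems a cause produces."""
    return plan_for_problems(CAUSES[cause])


for step in plan_for_cause("fire in server room"):
    print(step)
```

adding a new cause never adds a new procedure, only a new row in the cross reference, which keeps the plan itself short.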

    --

    There aint no pancake so thin it doesn't have two sides.

    1. Re:simple by stevew · · Score: 3, Interesting

      I've done a lot of disaster scenario planning for emergency service providers - and we come up with some really wild stuff, but the above is perhaps the best advice I've seen here so far. Don't worry about the aliens hijacking the data center, worry about the data center resources not being available for whatever reason.

      The concept of breaking down the recovery phases is the best recovery advice I can give you. Worry about things in sort of a concentric ring of problems, much as the previous poster presented. Start with the simplest broken piece and move on to the more complicated.

      The things companies went through due to the WTC going poof is a true real-world example of the worst case scenario occurring - not only did the data center disappear, so did the staff that ran IT there!

      Of the recovery efforts I've read about - the guys that had deals with hot-standby facilities out of the immediate area came back the quickest.

      --
      Have you compiled your kernel today??
    2. Re:simple by Zapman · · Score: 2, Interesting

      Adding to the last point you mention, the most critical thing you do to any plan is TEST IT.

      Stage a disaster.

      Either that, or just fake it. Take the data to and try to bring up all the data, bring it up on the internet (under a dr.www.?????.com DNS name for example) and see what's accessible.

      In case of failure, tune the plan, and try again in 6 months.

      In case of success, tune the plan, and try again in a year.

      Jason

      --
      Zapman
  10. Backups... and another RAID board by Anonymous Coward · · Score: 2, Insightful
    Yes backups backups backups... OFF SITE BACKUPS. Get a fire safe and keep a set in there too, what the hell.

    A little-realized thing until it's too late: Do you have RAID arrays? If so, you might want to ponder making sure you can always "get" another RAID card of the same type currently running your array. Either have another on a shelf (off site, whatever), or another compatible system.

    Nothing sucks more than to have something stupid like the RAID card (or motherboard) die, and you can't get a replacement that will recognize your current array of drives.

  11. Years ago by CharlieG · · Score: 3, Interesting

    Years ago (Back in the days of drum memory), I was taken on a tour of a data center (first computer I ever saw)

    The guy told me something about their disaster preps (it was financial data)

    The first thing he pointed out was that there were 2 complete mainframes, side by side. Each was capable of doing the whole job, but....

    The next was pointing out that they had redundant power, plus a generator, and a lesson they learned the hard way - the generator had the ability to power the Air Conditioner as well as the computers - if your server room gets too hot...

    Then he said, "we have a second identical data center about 5 miles across town"

    Each data center could handle all the customers from that region - yes there would be a performance hit, but...

    Then he said, there are 7 more cities around the world, each with 2 data centers like this one - all transactions go to all 8 cities.

    And then last, he said there was one more data center, in the Outback of Australia. They figured that it was the least likely place to get nuked, and they even planned for that.

    Yep, paranoid enough that they wanted their data to survive even if all the major financial centers in the world ALL went "kaboom"

    --
    -- 73 de KG2V For the Children - RKBA! "You are what you do when it counts" - the Masso
  12. Risk management by Twylite · · Score: 2, Insightful

    What you've been asked (volunteered) to do is a risk analysis. This is a whole lot more than being a l33t admin of a high-availability site, despite what many Slashdotters seem to think. You've hit on some of the non-technical risks (your CTO example), but to address this properly you need to concentrate on identifying risks, and how to handle them.

    The first thing to realise is that you can't have a preformed contingency plan for everything. What you can do is identify every point of risk, weight it according to likelihood and severity, and develop plans for the "likely worst cases" that you discover. The rest of the risk is a business risk, that is, you insure yourself against it and deal with it if and when it happens.

    You should also bear in mind that, from a technical viewpoint, there are no absolute guarantees. Almost all high-availability strategies protect against a single point of failure, but this isn't enough. What if you have multiple failures? How quickly can you detect and respond to a failure? How long can you suffer a complete outage (this is really important to know, and "we can't" is not an acceptable answer). Uptime costs money, calculate the point of balance.
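
The "point of balance" can be estimated once you have put a price on an hour of outage. Here is a minimal sketch under stated assumptions: every option, cost and outage figure below is invented for illustration, and `Option`, `total_annual_cost` and `cheapest` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    annual_cost: float            # what the mitigation costs per year
    expected_outage_hours: float  # expected hours of outage per year with it


def total_annual_cost(option: Option, outage_cost_per_hour: float) -> float:
    """Mitigation spend plus the expected cost of the outages that remain."""
    return option.annual_cost + option.expected_outage_hours * outage_cost_per_hour


def cheapest(options, outage_cost_per_hour: float) -> Option:
    """Pick the option with the lowest total annual cost."""
    return min(options, key=lambda o: total_annual_cost(o, outage_cost_per_hour))


# All figures are made up; substitute your own estimates.
OPTIONS = [
    Option("do nothing", annual_cost=0, expected_outage_hours=40),
    Option("hot standby", annual_cost=50_000, expected_outage_hours=4),
    Option("second data center", annual_cost=400_000, expected_outage_hours=0.5),
]

for cost_per_hour in (1_000, 10_000, 1_000_000):
    print(cost_per_hour, cheapest(OPTIONS, cost_per_hour).name)
```

The answer flips as the hourly cost of an outage rises, which is why "we can't" is not a usable figure: without a number there is nothing to balance against.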

    Ask Google about "organizational risk" - you'll find a lot of information about auditing risk that can put you on the right path.

    --
    i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
  13. Make sure the disaster is profitable by Andy_R · · Score: 2, Interesting

    There is a natural tendency to think that this is all about keeping the data safe, or about having procedures in place, but I have a different way of looking at it that I think is more practical:

    Make sure the company stays profitable.

    All this involves is insuring against disasters, and making sure the payout will *exceed* whatever it costs to recover.

    If my office exploded and I had to re-build everything from scratch, I'm fine, because my company will soon be getting a cheque that will cover everything, including the expected profits for the next 6 months. If we rebuild in 5 months, the disaster is actually a revenue generator for the company.

    Compare this to someone with excellent plans and the ability to get rebuilt in just a month, but no insurance on the lost profits. They are facing a net loss even if they work like crazy and get the rebuild done in 3 weeks.
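
The comparison above works out as follows. This is a sketch with invented figures: a monthly profit of 100,000, a rebuild cost of 500,000, and the assumption that the second company's policy covers the rebuild but not the lost profits.

```python
def net_effect(payout, rebuild_cost, monthly_profit, months_down):
    """Insurance payout minus the rebuild cost and the profit lost while down."""
    return payout - rebuild_cost - monthly_profit * months_down


MONTHLY_PROFIT = 100_000  # hypothetical
REBUILD_COST = 500_000    # hypothetical

# Policy pays for the rebuild plus 6 months of expected profit; rebuilt in 5.
insured = net_effect(
    payout=REBUILD_COST + 6 * MONTHLY_PROFIT,
    rebuild_cost=REBUILD_COST,
    monthly_profit=MONTHLY_PROFIT,
    months_down=5,
)

# Policy pays for the rebuild only; rebuilt in 3 weeks (0.75 of a month).
uninsured_profits = net_effect(
    payout=REBUILD_COST,
    rebuild_cost=REBUILD_COST,
    monthly_profit=MONTHLY_PROFIT,
    months_down=0.75,
)

print(insured, uninsured_profits)
```

With these numbers the slow but fully insured rebuild ends one month of profit ahead, and the fast rebuild without lost-profit cover ends three weeks of profit behind.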

    Of course you should still do off-site backups to deal with problems that are 'serious' but not 'disastrous', and have contingency plans in place for day to day traumas, but I think the best way to deal with 'the big one' is simply to insure against it, and make sure the insurance covers the lost profits.

    --
    A pizza of radius z and thickness a has a volume of pi z z a
  14. Dev? by photon317 · · Score: 2, Informative


    For a development shop, I should think that all you really need is to make sure you've got secure, recent, usable backups of your source code and important licensing/contract data.

    Unless you live in an area prone to a certain type of natural disaster, the types of things that cause real contingency plans to go into effect have a statistically small chance of ever happening to you. It's just not worth the money and effort to go to great lengths to make sure the company is running full speed the next day rather than two to three weeks down the road. As long as your data is safe, you should be ok. Just take good backups in duplicate - put one set in a fire safe onsite and ship one to a secure offsite location, perhaps using a service provider like Iron Mountain (although I'd encrypt that tape before I gave it to some random Iron Mountain driver if I were you).

    Some businesses have to worry about 24/7 production operations that can't be allowed to stop. Typical examples of extreme uptime environments are stock exchanges, utility/telco companies, various emergency services, etc. In a lot of these sorts of cases, it's actually justifiable to double or even quadruple the cost of your implementations and the ongoing maintenance and salary costs just to make sure that when a 1:1,000,000 chance event occurs, you experience a 10 second performance hiccup rather than a serious outage. In some cases a one hour outage simply cannot be tolerated at any cost. These are the environments that really have a hard time pushing the bleeding edge of engineering geographically redundant "systems", where systems includes the machines, the networks, and the people using them. A development house, in contrast, is a pretty easy problem.

    --
    11*43+456^2
  15. Decide where to spend your effort. by PinglePongle · · Score: 3, Informative
    There have been a number of posts already, but I would create a simple spreadsheet breaking down all the risks you can think of as follows :

    • Risk category : eg. people risk (CTO gets hit by a bus), infrastructure risk (your server is destroyed by aliens), legal risk (you get sued by the RIAA for having an MP3 somewhere on your network), commercial risk (your biggest client goes bust), regulatory risk (the government licenses development shops) etc.

    • Risk impact : the impact on the business if the risk were to occur. Probably best to summarize into 5 or so levels from "negligible" to "unrecoverable".
    • Likelihood : the chances of the risk occurring. Your office is unlikely to get hit by a meteorite, but your top coder may get headhunted and take all that knowledge with him.

    You then have to reach a business decision how much to spend on mitigating each risk. Clearly it's worth spending time on "high likelihood, disastrous impact" risks, but you may not care much about a transport strike stopping the cleaning staff from getting into work for a couple of days.
    When you know which risks you care about, identify mitigation strategies. Typically, this starts with identifying an owner for the risk, who is in charge of the mitigation strategy. For instance, the development manager may have to find ways of mitigating the "top coder headhunted" risk by implementing code review processes, knowledge sharing systems, etc.
    You should not let the business view risk management as a technology issue - it's a business issue. Risk mitigation has associated costs, either financial, time, or opportunity cost - the best way to avoid not getting paid for your work is to avoid working for unreliable bill payers. If you come up with a wonderful risk list and proposals for mitigation, your work will be wasted unless the business is willing to bear the cost of implementing your proposals.
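
If a spreadsheet is not to hand, the same risk list fits in a few lines of code. This is a sketch only: the intermediate impact labels, the 1 to 5 likelihood scale, the individual ratings and the impact-times-likelihood score are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Only "negligible" and "unrecoverable" are given; the middle labels are invented.
IMPACT_LEVELS = ["negligible", "minor", "serious", "severe", "unrecoverable"]


@dataclass
class Risk:
    description: str
    category: str
    impact: int      # 1..5, position in IMPACT_LEVELS (1 = negligible)
    likelihood: int  # 1..5, 1 = very unlikely (assumed scale)
    owner: str = "unassigned"

    @property
    def score(self) -> int:
        """A simple ranking score: impact times likelihood (an assumed convention)."""
        return self.impact * self.likelihood


def prioritize(risks):
    """Return the risks sorted from highest to lowest score."""
    return sorted(risks, key=lambda risk: risk.score, reverse=True)


# Ratings below are guesses, there to show the mechanics.
REGISTER = [
    Risk("CTO gets hit by a bus", "people", impact=4, likelihood=1),
    Risk("office hit by a meteorite", "infrastructure", impact=5, likelihood=1),
    Risk("top coder headhunted", "people", impact=4, likelihood=4,
         owner="development manager"),
    Risk("transport strike keeps cleaning staff away", "people", impact=1, likelihood=2),
]

for risk in prioritize(REGISTER):
    print(risk.score, IMPACT_LEVELS[risk.impact - 1], risk.description, "-", risk.owner)
```

Anything near the top of the list that still says "unassigned" is the first thing to take to the business.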
    --
    It's all very well in practice, but it will never work in theory.