Dealing with Development House Disasters?
Skinnytie asks: "I was recently asked by the CEO of the company for which I work to find a resource from which to better understand what to do in the event of a disaster. 'I'm on it, Sir' was my response, and I ran back to my desk and started writing contingency plans and trying to imagine what to do if a meteorite strikes our co-lo facility. I quickly came to realize that there is far more that *could* happen (the CTO gets hit by a bus, or the in-house server room gets abducted by aliens...you get the point) than I am even prepared to write plans for. I thought I'd hit the Slashdot audience up for some ideas/horror stories regarding avoiding, dealing with and getting past whatever disasters have occurred at your development houses. Have at you!"
First and foremost, make sure you have your data backed up and secure. Nothing else matters if the data is missing. When did you last test your backups?
Next, ensure you can get your backup data when you need it. You mention a colo, so I'm assuming you have a duplicate system there. Can you get access at 22:15 on July 4th?
Now draw up your usual emergencies: fire, flood, tornado, earthquake, etc. Have a plan for how to get your systems up and running. Do you need to rent office space? What about net connectivity?
Lastly, re-check your data backups: do you have everything, is it error-free, do you have more than one copy? If you have the data, your company can recover.
No data, no job, it's as simple as that.
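For what it's worth, "test your backups" can be as simple as restoring to a scratch directory and comparing checksums against the live data. A minimal sketch in Python; the function names (sha256_of, verify_restore) and the two-directory layout are assumptions for illustration, not any particular backup tool's API:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> list[str]:
    """List files that are missing or differ in the restored copy.

    An empty list means every file under source_dir was found in
    restored_dir with identical contents.
    """
    problems = []
    for src in source_dir.rglob("*"):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file():
            problems.append(f"missing: {rel.as_posix()}")
        elif sha256_of(src) != sha256_of(restored):
            problems.append(f"corrupt: {rel.as_posix()}")
    return problems
```

Run it after a real restore, not against the backup media's own catalogue; the point is to prove the data actually comes back.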
'I'm on it, Sir' was my response
A friend of mine once gave a response that was less gentle:
Sir, you just laid off half the developers, and half of the support staff, but you didn't reduce the marketing staff.
There is one manager for every 5 non-managers, we're still not meeting our financial targets, our new "Premium services" campaign is earning $1 for every $1000 we invested, we don't have enough tech staff to fix the bugs, the QA department was reduced to a single person and can't even find the bugs, and tech support is dealing with a growing number of irate customers every day.
We can barely keep up with the endless list of new tasks that you assign, sir, and you want me to waste my time daydreaming about asteroids?
We don't need a contingency plan sir, we ARE IN the contingency plan.
Get real, sir.
Still kept his job. Ok, maybe he wasn't that snotty...
"Can of worms? The can is open... the worms are everywhere."
The ceiling of the server room (or the floor of the party, a company success celebration, going on upstairs) fell in at my friend's place of work, impaling their file server with a piece of rebarb(sp?) through the motherboard and a RAID/IDE controller card. All was well till the hole caught on fire due to what was later found to be a damaged coffee cup heater that a stack of books had fallen onto. No one was seriously injured by the event other than the group of people who designed the building.
This was an architectural firm.
One really hard part of plans is to catch *all* the things you are going to need to recover. This advice is kinda like the old advice to actually test your backups occasionally.
For example, if you have plans to relocate the headquarters to a different site, then every 6 or 12 months or so try it out to sort out the "glitches". To expand on this example, does everyone know where the other site is and how to get there? How will they know to go there? Are there problems of quorum, where half the managers will be at one site and half at the other making contradictory decisions? Etc, etc, etc... This is also a good time to learn about how your organization operates.
Whatever your plans and contingencies, do regular dry runs of them, to the extent that it is practically possible.
Of everything - not just those files in CVS. Every person, every concept, every document needs to be duplicated, or be /easily/ reconstructed from others that are.
No person is so special that someone else can't be trained to be 'their backup' - no single person should ever hold the only set of 'keys to the kingdom'.
I was recently asked by the CEO of the company for which I work to find a resource from which to better understand what to do in the event of a disaster. (...) the CTO gets hit by a bus, or the in-house server room gets abducted by aliens...you get the point
On the other hand, the CEO could get hit by a bus, and you wouldn't have to deal with those disasters at all...
I don't have experience with this myself, but if I were in your boat I would make a system for classifying types of disaster and the appropriate recovery methods for each. For instance, at the top level you would have a disaster resulting in either physical damage or non-physical damage. From there you could classify disaster types according to how much and what physical damage occurred.
So, your meteorite example would probably fall into something between a horrible fire and earthquake, as the kind of damage inflicted on your facility would be similar in such events.
I worked at a large aircraft manufacturer in the Pacific Northwest back when Mt. St. Helens blew. They quickly imposed a "Volcano Disaster Plan" that, AFAIK, is still officially in place. We never followed it, however, because it included such mandates as turning off all equipment at night and sealing it up in plastic and duct tape just in case the building got dusted with ash. Yeah, right! (remember, this was 1980, well before a computer could fit in a garbage bag) It was bad enough for us with our CAD workstations and Tektronix terminals; I can imagine what the boys running the IBM big iron thought of that plan. Where are you going to find a plastic bag big enough for a 370 mainframe?
If all this should have a reason, we would be the last to know.
I know of one outfit that shall remain nameless.
They laid off their sysadmin.
Then soon found out what a root password was for.
134340: I am not a number. I am a free planet!
Instead of planning for fire, flood, and alien attack, categorize things by level of severity: whether or not the primary site is intact, the amount of time it will take to resume normal operations, whether the event is isolated/regional/national, etc. It doesn't matter if your office is ruined by an asteroid or a terrorist's bomb, the company will still be doing its work at another site.
:)
Identify critical paths and personnel. An organization can function for several weeks without C*O's, but without the "worker bees" the company will grind to a halt almost immediately. Also consider the effects of losing large numbers of staff. If every developer and admin quit, what would be the effect on the company?
Verify the physical security of all sites. Imagine "man with a gun" security breaches -- would an armed intruder (willing to kill) be able to cause significant damage?
I visited the colo site for a company I was with. They had multiple electrical hookups and enough fuel to run the diesel generators for 48 hours. There were two separate water main connections (a holdover from the water-cooled mainframe days). The colo was connected to two different phone companies (on opposite sides of the building), plus a dish for satellite uplink. It was incredibly expensive to build and maintain, but it would be more expensive if there was ever any downtime.
Also take cost into account. How much is your company willing and able to spend for disaster prevention and recovery? Don't forget to include your time in those figures...
break it down into procedures that you can take based on real problems, not causes. some things to consider:
etc. I've seen too many recovery plans that are focused on the cause, rather than solutions which is really what these plans are all about. if you really need to, you can cross reference the plans with potential causes. this seems to satisfy the cio types who stay up late wondering 'what happens if _____'.
of course, a backup plan is totally useless if the 'course of action' section is not possible to carry out, due to bad backup practices or lack of failover equipment. having a disaster recovery plan is no substitute for good policy and an adequate hardware/isp budget.
There aint no pancake so thin it doesn't have two sides.
Make sure that you know if the backups are actually where they are supposed to be, and that people know for sure where they are located.
Also, if you are in charge of backups, and come up with a "clever" storage location, please tell at least one other person where it is.
After all, you never know what the future holds, and if you are unable (for any reason) to tell other people where the backups are when they are needed, it is as if they never existed in the first place.
(Posted anon, and with no details to preserve my job).
Offsite backups, and more than one person who knows and can perform the same critical aspects of the business (code or other specialized information), are all that is really necessary. Keep a standardized software inventory (OSs, versions, other software and versions, service packs) of what is on which machines. This is really a procedural issue. This is about all you can really do to be ready for anything that is thrown at you.
plan for the effect, not the cause. aliens, fire, flood are all different causes, but they can all cause the same result - destruction of the building.
plan for that.
if you need to relocate an office, move it to a hotel. you have space (rooms, convention space) all designed to have furniture brought in, with phones, power and data, and available on short notice. you also have staff to keep it all clean.
plan the order for IT systems restoration. development can wait a week but if your cash flow stops you're screwed.
those are my tidbits of advice.
hope it helps.
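The restoration-order tidbit can be written down as data rather than left in someone's head. A rough sketch in Python, assuming each system is tagged with how long the business can survive without it; the system names and day counts below are made up for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    max_outage_days: float  # assumed: how long the business survives without it


def restoration_order(systems: list[System]) -> list[str]:
    """Return system names, least tolerable outage first."""
    ranked = sorted(systems, key=lambda s: s.max_outage_days)
    return [s.name for s in ranked]


# Hypothetical inventory: cash flow first, development can wait a week.
SYSTEMS = [
    System("development", 7),
    System("billing", 0.5),
    System("email", 2),
]

if __name__ == "__main__":
    for position, name in enumerate(restoration_order(SYSTEMS), start=1):
        print(position, name)
```

The useful part is arguing about the numbers before the disaster, not the sort itself.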
First, find out how much they want to spend. For example, somewhere like the NYSE has 3 remote sites that immediately mirror every transaction that occurs.
The only reason the NYSE stopped during 9/11 was for political reasons. Most of the brokerage firms in the WTC had warehouses in Jersey rented for years waiting for something like 9/11 to happen.
It comes down to cost: how much is it worth if you lose all your records? How much is it worth if you are down for 2 weeks? First figure out how much a minute of downtime costs, then a week, then a month; then figure out how much each of these services costs.
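As a back-of-the-envelope sketch of that comparison in Python: price the downtime, then see whether a given service pays for itself. The expected-value comparison, the 24-hour cost day, and every number used are assumptions for illustration, not figures from anyone's real budget:

```python
MINUTES_PER_DAY = 24 * 60  # assumes downtime hurts around the clock


def downtime_cost(cost_per_minute: float, days: float) -> float:
    """Total cost of an outage lasting the given number of days."""
    return cost_per_minute * MINUTES_PER_DAY * days


def worth_buying(annual_price: float, annual_probability: float,
                 cost_per_minute: float, days_saved: float) -> bool:
    """True if the expected yearly saving exceeds the yearly price.

    annual_probability is a guess at how likely the disaster is in a
    given year; days_saved is how much outage the service would avoid.
    """
    expected_saving = annual_probability * downtime_cost(cost_per_minute, days_saved)
    return expected_saving > annual_price


if __name__ == "__main__":
    # Hypothetical: $10/minute, service saves 14 days, 5% chance per year.
    print(downtime_cost(10, 14))
    print(worth_buying(5_000, 0.05, 10, 14))
```

A service that looks expensive on its own can look cheap next to two weeks of outage, and vice versa.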
Mostly what you have to plan for is suppliers being eliminated, loss of the block your office is in, loss of the city you are in, loss of the nation, and the resulting loss of personnel. Once you've factored in those things you will have a good basis for a contingency plan, and you will find out how much really needs to be planned for and how much is "if it happens, we're dead anyway".
Little realized thing until it's too late: Do you have RAID arrays? If so, you might want to ponder making sure you can always "get" another RAID card of the same type currently running your array. Either have another on a shelf (off site, whatever), or another compatible system.
Nothing sucks more than having something stupid like the RAID card (or motherboard) die, and you can't get a replacement that will recognize your current array of drives.
1. off site backup, of ALL data (and out of country or state backup, if you're worried about rebels, or floods.)
2. make sure that every person in the company can be replaced, directly (by someone else), or indirectly (by a combination of people)
--meh--
Years ago (Back in the days of drum memory), I was taken on a tour of a data center (first computer I ever saw)
The guy told me something about their disaster preps (it was financial data)
The first thing he pointed out was that there were 2 complete mainframes, side by side. Each was capable of doing the whole job, but....
The next was pointing out that they had redundant power, plus a generator, and a lesson they learned the hard way - the generator had the ability to power the Air Conditioner as well as the computers - if your server room gets too hot...
Then he said, "we have a second Identical data center about 5 miles across town"
Each data center could handle all the customers from that region - yes there would be a performance hit, but...
Then he said, there are 7 more cities around the world, each with 2 data centers like this one - all transactions go to all 8 cities.
And then last, he said there was one more data center, in the Outback of Australia. They figured that it was the least likely place to get nuked, and they even planned for that
Yep, paranoid enough that they wanted their data to survive even if all the major financial centers in the world ALL went "kaboom"
-- 73 de KG2V For the Children - RKBA! "You are what you do when it counts" - the Masso
As to what your organization should do, that depends. How much do you have to spend? Disaster Recovery is a big subject, and there are a lot of choices to make. I could help, but I'd have to charge you.
There's a workaround for bypassing the 'eprom' password on Sparcs (actually it's NVRAM with a battery built into the module). You remove the NVRAM chip from its socket, boot up the system to the OK prompt, then plug the chip in live, with the system running, and make your security changes. I have successfully done this on SparcStations that I bought on eBay that had a password. It's slightly risky, but on older Sparc boxes (all those nice classic SparcStations) it would be NUTS to have to buy new NVRAMs.
The technique is documented here. And here. And here too.
There's also a technique to tack on a replacement external battery on those NVRAMs. There's no reason to EVER buy a new one for non-critical boxes. Most of my older Sparc boxes have had that surgery performed on their NVRAM chips (involves actual physical surgery on the module) and live happily powered by a pair of AAA cells.
What you've been asked (volunteered) to do is a risk analysis. This is a whole lot more than being a l33t admin of a high-availability site, as many Slashdotters seem to think. You've hit on some of the non-technical risks (your CTO example), but to address this properly you need to concentrate on identifying risks, and how to handle them.
The first thing to realise is that you can't have a preformed contingency plan for everything. What you can do is identify every point of risk, weight it according to likelihood and severity, and develop plans for the "likely worst cases" that you discover. The rest of the risk is a business risk, that is, you insure yourself against it and deal with it if and when it happens.
You should also bear in mind that, from a technical viewpoint, there are no absolute guarantees. Almost all high-availability strategies protect against a single point of failure, but this isn't enough. What if you have multiple failures? How quickly can you detect and respond to a failure? How long can you suffer a complete outage (this is really important to know, and "we can't" is not an acceptable answer). Uptime costs money, calculate the point of balance.
Ask Google about "organizational risk" - you'll find a lot of information about auditing risk that can put you on the right path.
i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
A great way to test the UPS batteries and auto-shutdown software is to walk over to the wall and yank the power cord of the UPS out of the socket.
Plugging it back in after 30 seconds is a good way to test the "power came back, cancel the shutdown" part of the software, too.
You cannot apply a technological solution to a sociological problem. (Edwards' Law)
There is a natural tendency to think that this is all about keeping the data safe, or about having procedures in place, but I have a different way of looking at it that I think is more practical:
Make sure the company stays profitable.
All this involves is insuring against disasters, and making sure the payout will *exceed* whatever it costs to recover.
If my office exploded and I had to re-build everything from scratch, I'm fine, because my company will soon be getting a cheque that will cover everything, including the expected profits for the next 6 months. If we rebuild in 5 months, the disaster is actually a revenue generator for the company.
Compare this to someone with excellent plans and the ability to get rebuilt in just a month, but no insurance on the lost profits. They are facing a net loss even if they work like crazy and get the rebuild done in 3 weeks.
Of course you should still do off-site backups to deal with problems that are 'serious' but not 'disastrous', and have contingency plans in place for day to day traumas, but I think the best way to deal with 'the big one' is simply to insure against it, and make sure the insurance covers the lost profits.
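To put rough numbers on the two cases above, here is a small Python sketch. The rebuild cost and monthly profit are invented for illustration, and it takes the payout terms described above at face value; check what your actual policy pays:

```python
def net_position(payout: float, rebuild_cost: float,
                 monthly_profit: float, months_down: float) -> float:
    """Insurance payout minus rebuild cost and the profit lost while down."""
    return payout - rebuild_cost - monthly_profit * months_down


# Hypothetical figures.
REBUILD_COST = 500_000
MONTHLY_PROFIT = 100_000

# Payout covers the rebuild plus 6 months of expected profit; rebuild takes 5 months.
insured_profits = net_position(REBUILD_COST + 6 * MONTHLY_PROFIT,
                               REBUILD_COST, MONTHLY_PROFIT, 5)

# Payout covers the rebuild only; rebuild takes 3 weeks (0.75 of a month).
uninsured_profits = net_position(REBUILD_COST, REBUILD_COST, MONTHLY_PROFIT, 0.75)

if __name__ == "__main__":
    print(insured_profits)
    print(uninsured_profits)
```

With these made-up figures the slow, fully insured rebuild ends up one month of profit ahead, while the fast rebuild without lost-profit cover ends up three weeks of profit behind.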
A pizza of radius z and thickness a has a volume of pi z z a
For a development shop, I should think that all you really need is to make sure you've got secure, recent, usable backups of your source code and important licensing/contract data.
Unless you live in an area prone to a certain type of natural disaster, the types of things that cause real contingency plans to go into effect have a statistically small chance of ever happening to you. It's just not worth the money and effort to go to great lengths to make sure the company is running full speed the next day rather than two to three weeks down the road. As long as your data is safe, you should be ok. Just take good backups in duplicate - put one set in a fire safe onsite and ship one to a secure offsite location, perhaps using a service provider like Iron Mountain (although I'd encrypt that tape before I gave it to some random Iron Mountain driver if I were you).
Some businesses have to worry about 24/7 production operations that can't be allowed to stop. Typical examples of extreme uptime environments are stock exchanges, utility/telco companies, various emergency services, etc. In a lot of these sorts of cases, it's actually justifiable to double or even quadruple the cost of your implementations and the ongoing maintenance and salary costs just to make sure that when a 1:1,000,000 chance event occurs, you experience a 10 second performance hiccup rather than a serious outage. In some cases a one hour outage simply cannot be tolerated at any cost. These are the environments that really have a hard time pushing the bleeding edge of engineering geographically redundant "systems", where systems includes the machines, the networks, and the people using them. A development house, in contrast, is a pretty easy problem.
11*43+456^2
Risk category : eg. people risk (CTO gets hit by a bus), infrastructure risk (your server is destroyed by aliens), legal risk (you get sued by the RIAA for having an MP3 somewhere on your network), commercial risk (your biggest client goes bust), regulatory risk (the government licenses development shops) etc.
Risk impact
You then have to reach a business decision how much to spend on mitigating each risk. Clearly it's worth spending time on "high likelihood, disastrous impact" risks, but you may not care much about a transport strike stopping the cleaning staff from getting into work for a couple of days.
When you know which risks you care about, identify mitigation strategies. Typically, this starts with identifying an owner for the risk, who is in charge of the mitigation strategy. For instance, the development manager may have to find ways of mitigating the "top coder headhunted" risk by implementing code review processes, knowledge sharing systems, etc.
You should not let the business view risk management as a technology issue - it's a business issue. Risk mitigation has associated costs, either financial, time, or opportunity cost - the best way to avoid not getting paid for your work is to avoid working for unreliable bill payers. If you come up with a wonderful risk list and proposals for mitigation, your work will be wasted unless the business is willing to bear the cost of implementing your proposals.
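A risk list like this fits in a very small data structure. The sketch below, in Python, assumes 1-to-5 scales for likelihood and impact and a simple likelihood-times-impact score; the scales, the threshold, and most of the owners are assumptions for illustration, not a standard:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    category: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (expected)
    impact: int      # assumed scale: 1 (nuisance) to 5 (disastrous)
    owner: str       # person in charge of the mitigation strategy

    @property
    def score(self) -> int:
        return self.likelihood * self.impact


def worth_mitigating(risks: list[Risk], threshold: int = 10) -> list[Risk]:
    """Risks scoring at or above the threshold, highest score first."""
    keep = [r for r in risks if r.score >= threshold]
    return sorted(keep, key=lambda r: r.score, reverse=True)


# Hypothetical register; scores and most owners are invented.
REGISTER = [
    Risk("cleaners stopped by transport strike", "people", 3, 1, "office manager"),
    Risk("server destroyed", "infrastructure", 2, 5, "IT manager"),
    Risk("top coder headhunted", "people", 4, 4, "development manager"),
]

if __name__ == "__main__":
    for risk in worth_mitigating(REGISTER):
        print(risk.score, risk.name, "->", risk.owner)
```

Where to set the threshold is the business decision; the code only makes it visible.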
It's all very well in practice, but it will never work in theory.
The electric company uses a backhoe to cut all your data lines?
We actually kept supplying a Big3 auto manufacturer in a "Just in time" sequencing operation when that happened to us.
Not bad...
The worst disaster a company that depends on tech can have is slashdotting.
Plus, it's a fscking good excuse to get new, fun hardware that rocks for after-hours Unreal 200x Tournament!
a piece of rebarb(sp?) through the motherboard
This stuff is called 're-bar', which is short for 'reinforcing-bar'. It is a metal rod about a half-inch in diameter (there are larger/smaller versions) that is used to add strength to concrete structures. Re-bar is made with a coarse pattern on the outside so the concrete can get a grip.
For those of you who are still lost, this is the stuff that Cordelia fell on in 'Lover's Walk', an episode from the second or third season of Buffy the Vampire Slayer. Does that help?
I want to drag this out as long as possible. Bring me my protractor.
I wonder how businesses in the towers coped after 11/9. They must have had to apply their contingency plans there; perhaps you could try looking for someone in one of those companies.
For those of you who are still lost, this is the stuff that Cordelia fell on in 'Lover's Walk', an episode from the second or third season of Buffy the Vampire Slayer. Does that help? :-)
I find that the following explanation is a better example:
When you were a kid, it was the metal bars you stole from the new house next door to play "Darth Vader vs. Luke Skywalker".
"Can of worms? The can is open... the worms are everywhere."
I thought those things got hot enough to melt tapes. Certainly the last ones I laid eyes on said explicitly that they wouldn't work for tapes. More expensive models may or may not do better, but if the assumption about the heat and duration of the fire exceeds their rating, you're still pretty screwed.
I'm not saying that you shouldn't plan and protect, I'm saying that, in real life, you have to look at what your risks are, what your legal obligations are, and what your competitors are doing. Accepting the risk of going out of business is part of doing business in the first place, and aliens abducting your mainframe should probably be lower on your list of worries than having an understaffed tech support line.