Disaster Recovery?


Posted by Cliff on Monday February 4, 2002 @02:44AM from the living-thru-an-IT-nightmare dept.

M. Grochmal asks: "A three-alarm fire at Southern Maine Technical College burned through the Computer Technology and Technical Graphics departments. We have salvaged most of what we can, but cannot return to the building until the asbestos risk decreases. The hard part now is rebuilding the networks in another building. The schedules have been rearranged, and many of the department's students and faculty are volunteering to relocate salvageable computers, as well as install/configure the new computers that will be arriving in the next day or so. On top of that, we have to rebuild the Netware servers, restore from backups, and get them networked again. I was wondering how other Slashdot readers were able to recuperate from unforeseen damage to their work (and learning) environments. You can read about the fire here and see what the schedule is. Wish us luck."

18 comments

NDS and the 3 day rule by CounterZer0 · 2002-02-04 02:50 · Score: 2, Informative

Just make sure you get your NDS stuff restored within three days. After that, have fun rebuilding your tree! Unless, of course, you followed guidelines and had offsite replicas of each partition!
1. Re:NDS and the 3 day rule by AndyDeck · 2002-02-04 10:52 · Score: 2, Interesting
  
  Bull. Utter crap. There is ABSOLUTELY NO reason that you would be forced to restore your NDS within three calendar days or rebuild your tree. Where did you get your information? The only three day limit I know of is a recommendation that if a server was going to be offline for more than this time, it might be better to remove it from the tree. This does keep NDS updating cleanly, as a partition can't advance its timestamps to the current time until the changes have been seen by all replicas.
  
  That said, of course you should have had offsite replicas of each partition. With offsite replicas (or at least replicas on servers that didn't get destroyed) you can remove the crashed servers, still have access to your NDS resources (IDs, passwords, etc), and then re-install the crashed servers to the tree later.
  
  But with or without offsite replicas - three days makes no sense. If the whole tree is involved in the disaster (all servers containing replicas of the tree are gone), it doesn't matter how long it takes you to begin NDS recovery. If the whole tree is not involved, but ALL replicas of a partition are, you may have issues with external references and subordinate reference partitions until you've completed recovery - but nothing that would require 'rebuilding your tree'. If all replicas are not gone, you do NOT have a problem, just a challenge :) Follow the TIDs and you will be fine.
  
  Netware crashed server recovery has been interesting over the years. Netware 5.x & NDS8 removed the old DSMaint NLM that could preserve server references, so until the recent release of the XBROWSE program, the only clean way to quickly recover from a crashed server while preserving server references was to hang on to an old NW4x box to run the DSMaint process.
  
  Recovering NDS from real disasters can be... challenging. If you don't know what you are doing, you *can* really mess things up. Be careful and you'll be fine.
  
  --
  
  The Crystal Wind is the Storm, and the Storm is Data, and the Data is Life
2. Re:NDS and the 3 day rule by CounterZer0 · 2002-02-04 11:28 · Score: 1
  
  I was exaggerating a bit with the 'rebuilding the tree' deal, but putting a server that's been offline (ESP if it holds/held a r/w or master of a partition) for more than 3 calendar days back into the tree is NOT something that is going to happen. You'd have better luck slipping a well, you understand. My comment was based on the assumption that he has an active tree still alive and CHANGING. In which case, it'd be asinine to put one of those servers back in after 3 days. PS: Glad to see at least one other person familiar with NDS on /. ;)
A Day late and a Dollar short question of the year by xinu · 2002-02-04 02:56 · Score: 2, Informative

I was wondering how other Slashdot readers were able to recuperate from unforeseen damage to their work (and learning) environments.

Uhh, you would have a DR plan BEFORE the place burns down, one developed around the business or school needs. Tape backups and another site to restore to. Perhaps the information was even mirrored to that site via a SAN. Everyone keeps their tapes offsite, right?
That's all I've been doing since the 9/11 incident, and I think it has something to do with that: I work in the Boston WTC, and everyone is a bit paranoid about data loss since we had offices that got toasted in NYC.
Been there, done that ... by ninewands · 2002-02-04 03:35 · Score: 2

I do wish you luck. I'm an admin in the Engineering Computer Center at the University of Houston. Fortunately the College of Engineering was one of the 10 buildings that escaped flood damage during Tropical Storm Allison last summer.

Even though our Telecomm Department was able to pull enough equipment out of the Telecomm Engineering lab to get the network sort of back up, we were without full connectivity for almost a month. It took about 4 days to get the electricity back on to our undamaged building and we didn't have phone for about 2 weeks. There are a few buildings on campus that are still unusable.

Best of luck. It sounds like your situation is going to be more tedious than difficult, though.

--
utter rubbish
Some Idea ... by NWT · 2002-02-04 04:31 · Score: 2, Informative

Nothing is better than a recent, working and complete backup ... but a few days ago, I saw an advertisement from a firm (DriveSavers) that specialises in data recovery for destroyed hard disks; maybe they can help!
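
For what "recent, working and complete" might look like as an automated check, here is a minimal sketch. Everything in it is an assumption for illustration: the `check_backup` helper, the manifest of expected SHA-256 hashes, and the 7-day freshness limit are hypothetical, not part of any real backup product.

```python
import hashlib
import time
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large backup files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_backup(backup_dir, manifest, max_age_days=7, now=None):
    """Return a list of problems; an empty list means the backup passed.

    manifest is a hypothetical dict of file name -> expected SHA-256 hex digest.
    """
    now = time.time() if now is None else now
    problems = []
    for name, expected_hash in manifest.items():
        path = Path(backup_dir) / name
        if not path.is_file():
            problems.append(f"missing: {name}")   # backup is not complete
            continue
        age_days = (now - path.stat().st_mtime) / 86400
        if age_days > max_age_days:
            problems.append(f"stale: {name}")     # backup is not recent
        if sha256_of(path) != expected_hash:
            problems.append(f"corrupt: {name}")   # backup is not working
    return problems
```

Running something like `check_backup("/mnt/offsite", manifest)` on a schedule would flag a bad backup set before the day it is needed; the path is only a placeholder.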

Besides, I suppose it would be best to see the positive side of that incident; I'm sure it will be a good experience rebuilding the network! Anyway, good luck to you ...

--
Life sucks.
Seize the moment! by martyb · 2002-02-04 04:45 · Score: 5, Insightful

I was visiting some friends at your campus just this past December; sorry to hear about your loss.

Sadly, I can't give you any suggestions on how to better recover from your current situation -- seems like what can be done now is being done. It seems there's not been much of a response to this as yet, so I'll go out on a limb and offer some ideas that may sound obvious, but forest and trees and all that.

I'm reading between the lines, but I suspect that prior thoughts of backups and disaster recovery were shot down by the PHBs as being too expensive or time consuming. Here's your chance!

You now have a rare opportunity where proposals for FUTURE disaster recovery would actually be listened to!

First off, document what you are doing now! Write it down in a notebook, carry around a pocket tape recorder, use a PDA, hire some students who will answer a phone so that when something comes to mind, you can just dial a phone and get it recorded; whatever, but document what it is actually costing to recover! And not just the hardware/software expenses either! Increased calls to the help desk. Impact on faculty and students' schedules. Reconstructing the network topology.

Anything you can think of, now, document it! If, upon later review, some things are questionable, you can omit them then. But what if, during that later review, the thought is: "Gee, this took more than we had thought it would, too bad we didn't keep track..."? Get the picture?

So, now you'll have some kind of baseline as to what the actual recovery costs were, in this case. With that, you can now make a strong business case to implement a solid disaster recovery plan. Include server configs, backups, inventory of hardware and software... in short you've got a list of what you actually had to do to recover from this disaster; use that to identify what you'd need to do again.

Other ideas off the top of my head: Get a fire suppression system. Split some of the equipment (e.g. labs) across multiple buildings so that if one burns down, there's some infrastructure that is still usable. You'll have a working system that you can refer to while rebuilding the destroyed system, too.
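
The "document what it is actually costing" advice could be kept as simply as a running log that is totalled per category afterwards. This is only a sketch: the `CostEntry` fields, the category names, and the hourly rate are all made up for illustration, and the sample numbers are not real figures from any recovery.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CostEntry:
    """One logged recovery item (all fields are hypothetical)."""
    category: str         # e.g. "hardware", "help desk", "scheduling"
    description: str
    hours: float = 0.0    # staff or volunteer time spent
    dollars: float = 0.0  # direct out-of-pocket spending


def summarize(entries, hourly_rate):
    """Total cost per category, pricing logged hours at hourly_rate."""
    totals = defaultdict(float)
    for entry in entries:
        totals[entry.category] += entry.dollars + entry.hours * hourly_rate
    return dict(totals)


# Illustrative entries only; the amounts are invented.
log = [
    CostEntry("hardware", "replacement switch", dollars=150.0),
    CostEntry("help desk", "extra calls, week 1", hours=3.0),
    CostEntry("help desk", "extra calls, week 2", hours=2.0),
]
print(summarize(log, hourly_rate=20.0))
```

Per-category totals like these are the baseline for the business case: they show where the recovery effort actually went, not just what the replacement hardware cost.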
What is the question? by Adrian+Voinea · 2002-02-04 04:55 · Score: 3, Informative

I am sorry that such a terrible thing happened. I hope you find this helpful. I bumped into this webpage a while ago:

Disaster Recovery Resources - it contains a lot of useful articles about disaster recovery.
I wish you luck!
Disaster Recovery..... by Chanc_Gorkon · 2002-02-04 05:35 · Score: 3, Insightful

First off, servers belong in a nice server room, not in a closet near the lab. A closet may be OK for your home network, but for a network at a college or company, a server room is a must. Also, if you can, have the server room in one building, and labs in others. This way your lab may go up in smoke and your servers will be fine, or your servers may get damaged and your clients will be fine. When doing a server room, make sure it has elevated floors (about 1 foot above the rest of the floor), conveyance trays, redundant air conditioning, FM200 fire suppression, TSM or some other backup solution, possible offsite mirroring of servers, NO WINDOWS (the glass kind, not the OS kind), and UPSes. If possible, make it a hardened, 1-floor building with the chillers located inside (storms can't rip the chillers off the ground if they are inside), generator backup, some bathrooms, food storage, and maybe even a shower facility if admins must pull an all-nighter. This may sound silly for a school, but that depends on how important your data is. We used to have servers serving the labs all over campus, but now they are all centrally located in the data center. Management is easier, but then we have more to lose if our data center is hit. That's why we have halon fire suppression (until the new center is built, and it will use FM200) and a disaster recovery plan including a hotsite. Having all of the servers centrally located also assists in running backups, either via a networked TSM-type solution (Tivoli software, IBM hardware) or individual tapes (not recommended, but better than nothing).

--
Gorkman
1. Re:Disaster Recovery..... by scarl · 2002-02-04 08:13 · Score: 1
  
  Having used both IBM's TSM/ADSM (Tivoli Storage Manager) and Legato's Networker application, I FULLY recommend Legato for this purpose (or any environment with more than 5 servers). Remember, unless stated otherwise, this is IMHO.
  
  PROs:
  1) scheduled full backups (TSM uses a full incremental package, only "full" is your first run)
  2) Better user interface (layout) (TYPES -- Legato: Unix-X/cli, Win32-gui/cli TSM: Unix-cli, Any-http)
  3) Better/more complete client/group/schedule configuration
  4) Interwoven, multi-session push to device (not single threaded)
  5) Can get site license (maybe w/TSM too?)
  6) GREAT spin up time for even moderately tech person ('tier 1' (gen. operational support) in 1week, 'tier 2' (installs/restores/basic troubleshooting) in ~1-3 months, YMMV)
  7) Others I'm not thinking of
  
  CONs:
  1) BOTH: Complete lack of metrics/reporting characteristics (Legato allows for a '@completionDO' script execution, so think DB....)
  2) TSM has a more complete Table of Contents/Index while viewing Volumes in GUI(Legato needs restore initiation to read TOC)
  3) LEGATO: Client interface is a "canary in a coal mine". If there are issues causing slowdowns, you often see it first in the interfaces.
  4) BOTH: Media management for offsite. TSM has DRM (Disaster Recovery Manager) which apparently makes it easier to deal with, but Legato (I think) has a media management program called AlphaStor(?) that is supposed to be end-all/be-all (nothing like marketing, ehhh?).
  5) Others I'm not thinking of
  
  Overall, I have not had as 'happy' a time working with TSM as I have had with Legato. I'm sure that someone who is more comfortable with TSM could/would have counters to some of TSM's apparent shortcomings, and I welcome hearing about them. But having been pitched in the deep end in an environment of Legato and TSM, and mostly learning as I go, Legato is working out better for me.
  
  Lastly, having NOT worked with Veritas NetBackup, but having talked to some who have, the biggest complaint I have heard involved convoluted, non-intuitive user interface and assumed Looooooong spin-up times to reach operational support levels. In this particular case, I was told by someone with 2 backup products under their belt that it would take them at least one month to reach the point of being able to do general operational support.......and 37 different active views/interfaces to the application.
  
  I do not speak for my team, my company, my government, my race, or, sometimes, myself. Please insert 'Large Grain o'Salt'(TM) in mouth at this time.
  
  --
  Papa's got a brand GNU bag. -- Advertisement: year 30 ALC (After Linux Commercialization)
Good time to get rid of legacy shit... by duffbeer703 · 2002-02-04 06:38 · Score: 2

Now might be a good time to take a good look at what you wanted to get rid of in your old network.

Since everything is destroyed for the most part, use this as an opportunity to get rid of those pesky NT 3.51, Novell Netware, and Vax machines that have been cluttering up the computer room.

Ditch that legacy shit and start anew with the insurance check. (Presuming the machines were insured.)

--
Conformity is the jailer of freedom and enemy of growth. -JFK
1. Re:Good time to get rid of legacy shit... by sphealey · 2002-02-04 07:29 · Score: 2
  
  [...] Novell Netware [...] machines that have been cluttering up the computer room.
  
  Ditch that legacy shit and start anew with the insurance check. (Presuming the machines were insured.)
  Yeah, that would be smart. Dump perfectly good, well-designed technology for - what? Samba? You'll be sorry. Windows NT? _Really_ sorry.
  sPh
2. Re:Good time to get rid of legacy shit... by Detritus · 2002-02-05 21:55 · Score: 1
  
  The people who get their work done by running their applications on that "legacy shit" may have other ideas. The computer center is not your personal toy box.
  
  --
  Mea navis aericumbens anguillis abundat
Questions about the schedule. by Anonymous Coward · 2002-02-04 09:05 · Score: 0

Is this a student run network or something?

Looking through the schedule, I see you've got random students crimping cable, "Before I (you notice I have dropped the We...), will allow you to crimp a connector, I will expect you to have read the pages at the above web site that describe "How-to". "

Nothing wrong with doing your own connectors if you have the proper test equipment to check it with. But having students who have never crimped before, doing so? Seems like a good way to learn how to break Ethernet, especially if you don't test.

As well, students installing the servers after taking a course, "Seniors, who have taken the Network System Management and Network Engineering courses should take advantage of this opportunity to build NetWare servers."

I assume you will be wiping the drives and installing them with your own secure setup after?

Your situation is the kind of thing I'd volunteer a weekend to help with, if you were local. I just have to wonder about what your network is going to look like after it is set up.
Data Recovery Is Only Half the Battle... by NOT-2-QUICK · 2002-02-04 09:06 · Score: 2, Insightful

Contrary to much popular belief, a good data recovery contingency (off-site back-ups, etc...) is only half of a sound DRP. When it comes to recovering from a cataclysmic disaster of this nature - the second, and equally critical component of a well thought out DRP is an all-inclusive BCP (Business Continuation Plan)...

Without this vital aspect, companies such as Deutsche Bank (who were ravaged by the WTC disaster on 9/11), would have been down for days/weeks while attempting to relocate, rebuild and restore their data center operations...

I, for instance, work at a rather large, international fortune 500 company and we have BCP strategies that include a complete off-site location. This facility houses fail-over systems for all business critical processes including a 1.2 terabyte, mirrored SAP database that can go online within minutes notice, and a phone bank/workstations for our 50+ CSR's (customer service reps) and our global helpdesk. Even more, we frequently (twice yearly) perform non-production drills to validate the systems health and improve upon our strategies...

This is obviously a bit late for you, but I would suggest reading up on the matter a bit more thoroughly prior to redesigning your future systems and developing your next DRP...

--
Beer is proof that God loves us and wants us to be happy. -- Benjamin Franklin
1. Re: Data Recovery Is Only Half the Battle... by sphealey · 2002-02-05 02:09 · Score: 2
  
  I, for instance, work at a rather large, international fortune 500 company and we have BCP strategies that include a complete off-site location. This facility houses fail-over systems for all business critical processes including a 1.2 terabyte, mirrored SAP database that can go online within minutes notice, and a phone bank/workstations for our 50+ CSR's (customer service reps) and our global helpdesk
  Nice if you can justify it. The problem for a small- to medium-sized organization is that there is no positive cost/benefit or return on investment for DRP/BCP plans of this scope. Do backups? Of course. Store them offsite? Yeah. Have a list of your installed equipment? That would be nice.
  But from there the decision tree goes like this: cost of having a real disaster recovery plan like the big guys? $X. Probability of a disaster? p%. Cost of having our sysadmin and his buddies work 25 hours a day for 3 or 4 days, slapping together whatever equipment he can find at CompUSA and doing the minimum necessary to get back online? $Z. Cost of not having our computer system for 3 days? $C.
  And for a small organization, $X is almost always greater than p% * ($Z + $C). So creating an extensive DRP isn't justified. Tough luck for the guys who DO end up working 25 hours a day for a week or so (been there), but the economics are usually pretty clear.
  sPh
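
One way to read the comparison above is the plan's cost $X set against the expected loss from a disaster, p * ($Z + $C). A small sketch under that assumption; the function names and the sample figures are hypothetical and only illustrate the arithmetic.

```python
def expected_disaster_loss(probability, scramble_cost, downtime_cost):
    """Expected loss without a plan: p * (Z + C).

    probability   -- chance of a disaster over the planning period (0..1)
    scramble_cost -- Z, cost of the emergency rebuild effort
    downtime_cost -- C, cost of being without systems during the outage
    """
    return probability * (scramble_cost + downtime_cost)


def plan_is_justified(plan_cost, probability, scramble_cost, downtime_cost):
    """True when the plan (X) costs less than the expected loss it avoids."""
    return plan_cost < expected_disaster_loss(
        probability, scramble_cost, downtime_cost
    )


# Invented figures: a 2% chance of disaster, a $20,000 scramble,
# $50,000 of downtime, and a $100,000 full-scale plan.
print(expected_disaster_loss(0.02, 20_000, 50_000))
print(plan_is_justified(100_000, 0.02, 20_000, 50_000))
```

The sketch treats the plan as removing the whole loss and ignores anything that cannot be priced, such as data that is gone for good, so it is a simplification of the real decision.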
Re:A Day late and a Dollar short question of the y by mjoconnor81 · 2002-02-04 10:05 · Score: 1

I agree. The IT department I work for has an excellent Disaster Recovery program: if the data center I am currently in were to burn to the ground, we could have everything operational as if nothing happened within 36 hours.

As for recovering without a plan: if there was anything that you always wished you could go back to the beginning and change, now's the time :-)

--
Pseudocode is code to demonstrate a concept, not designed to be run. Like certain M$ software.
You have never worked at a university by oneiros27 · 2002-02-06 04:23 · Score: 2

Most of the system administrators would love to be able to consolidate systems down to a few supported platforms, break services up so we're not supporting 'mainframe-esque' systems running 20+ applications. Unfortunately, the system admins are not what drives the university. Research dollars are. And to get the research dollars, you have to have faculty, and you have to keep them happy, which means installing undocumented software at some faculty member's whim, or keeping a Wang still running so that they can do their word processing. [It does, however, keep pizza warm, so it's not all bad].

Yes, some systems can be consolidated down, upgraded, or otherwise be made more space efficient, but you need to maintain the same OS and similar hardware, or you're looking at significantly increasing your workload due to instability, installation headaches, etc.

Now, there may be some systems that just can't be recovered, but it's not the system admin's job to decide that-- it's management's. The system admin can give advice, but they don't run the university, and if management decides that it's in the best interest to restore 17 year old mainframes that suck down $300k/yr in maintenance contracts and cooling costs, and occupy 1/2 of the space in the machine room, it's their decision. You can either do what they tell you to, or find a new job.

Now, if your management repeatedly doesn't listen to you, and continues to do what you warn them against, you'd probably be happier finding a new job. One of the nice benefits of university jobs is that you can carry over TIAA-CREF to most schools.

Personally, I would recommend first recovering every system possible, and once that's done, and you have everything back and operational again, work on migrating out machines that are harder to maintain and recover. Do not worry about getting rid of systems unless they just can't be recovered. Don't worry about anything else until after the systems are recovered.

PS. Don't put machine rooms in basements. Sewer pipes breaking above the machine room is bad.

--
Build it, and they will come^Hplain.