Slashdot Mirror


Hospital Brought Down by Networking Glitch

hey! writes "The Boston Globe reports that Beth Israel Deaconess hospital suffered a major network outage due to a problem with spanning tree protocol. Staff had to scramble to find old paper forms that hadn't been used in six years so they could transfer vital patient records and prescriptions. Senior executives were reduced to errand runners as the hospital struggled with moving information around the campus. People who have never visited Boston's Medical Area might not appreciate the magnitude of this disaster: these teaching hospitals are huge, with campuses and staff comparable to a small college, and many, many computers. The outage lasted for days, despite Cisco engineers from around the region rushing to the hospital's aid. Although the article is short on details, the long term solution proposed apparently is to build a complete parallel network. Slashdot network engineers (armchair and professional): do you think the answer to having a massive and unreliable network is to build a second identical network?"

14 of 569 comments (clear)

  1. Problem was with an application, by Anonymous Coward · · Score: 5, Insightful

    according to the coverage in the printed 11/25/02 Network World magazine I read yesterday. My immediate reaction was that this person who brought down the net using his research tool should not have been using a production network.

    Large campus networks hosting extremely critical live applications may need to be subdivided by more than a switch, yes.

    1. Re:Problem was with an application, by cryptowhore · · Score: 5, Insightful

      Agreed, I work for a bank and we have several environments to work in, including multiple UAT, SIT, and Performance Testing Environments. Poor infrastructure managment.

      --
      Happiness is a slider variable
  2. Of course they need another network by virtual_mps · · Score: 5, Insightful

    Why on earth would a researcher be plugged into the same network as time-sensitive patient information? Yes it's expensive, but critical functions should be seperated from non-critical functions. And the critical network needs to be fairly rigidly controlled (i.e., no researchers should "accidentally" plug into it.) Note further information in http://www.nwfusion.com/news/2002/1125bethisrael.h tml

  3. Re:No. by barberio · · Score: 5, Insightful

    The problem here is that it will take days, maybe weeks to do this. Hospitals want the data flowing *Now*.

    So the answer is - Yes. In a situation where 100% uptime is demanded, the only solution is redundant systems.

  4. I don't buy it by hey! · · Score: 5, Insightful

    The same explanation was floated in the Globe, but I don't buy it.

    People when they are doing debugging tend to fasten onto some early hypotheses and work with it until proven definitively false. Even if jobs aren't on the line people often hold onto their first explanation too hard,. When jobs are on the line nobody wants to say the assumptions they were working under for days were wrong, and some people will start looking for scapegoats.

    The idea that one researcher was able to bring the network down doesn't pass the sniff test. If this researcher was able to swamp the entire campus network from a single workstation it would suggest to me bad design. The fact that the network did not recover on its own and could not be recovered quickly by direction intervention pretty much proves to me the design was faulty.

    One thing I would agree with you is that the hospital probably needs a separate network for life critical information.

    --
    Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
  5. The Solutoin by Shishak · · Score: 5, Insightful

    Is to not bother with a second network. They need to break the spanning tree up a bit with some layer 3 routers. Sometimes it is fun to have a nice big layer 2 network. It makes life easy. It sucks to debug it when one half of a leg goes down and you get spanning-tree loops. The switches go down in a ball of flames that way.

    The solution is to put some edge routers in every building (Cisco 6509's with MSFC cards). segment each building into different IP networks. Route between the networks. That way you may lose a building if the spanning-tree goes futzed but you won't lose the whole campus.

    Sure you'll be a touch slower routing between the segments but you'll have much more reliability.

    --
    Now I hope and pray that I will But today I am still, just a bill
  6. Add a second network? Not likely to help by markwelch · · Score: 5, Insightful
    > Do you think the answer to having an massive and unreliable network is to build a second identical network? <

    Of course not. Two solutions are more obvious:

    1. Fix or replace the existing network with a more reliable one (probably one that is less centralized so outages would not affect the entire campus); or
    2. If a second network is going to be added to provide reliable backup, then the second network should certainly not use the same technology as the first.
    A third, and somewhat obvious, solution would be to make sure that
    • crucial data is kept on the local server farm, but also copied in real time to a remote server; and
    • a backup access mode (such as a public dial-up internet connection, with strong password protection and encryption) is provided for access to either or both servers, in the event of a crippling "local" network outage.

    This might also be a good reminder to get very aggressive "liquidated damages" clauses in contracts like this, or to buy insurance. If a patient dies because of the network outage, I am sure that everyone in the supply chain will be named in the lawsuit.

    The liquidated damage clause is intended to provide an unambiguous motivation for the technology provider to fix the problem quickly, while the insurance would cover all or a portion of the losses if there is a failure.

    I would be extremely surprised if a huge campus like this one did not have a substantial number of different technologies in use, including wireless, and clearly networking them all into the same patient-records database is a challenge.

    --
    -- http://www.MarkWelch.com/ Pleasanton California
  7. Re:Hospital Systems by gorf · · Score: 5, Insightful

    To be fair, they have gotten much better...

    You seem to have forgotten to explain why they were worse.

    If they are running thick ethernet and VAX machines, it is probably because nobody has looked at the system recently, presumably because it hasn't failed. This is how things should be.

    ...truly terrified me...

    What terrifies me is that places like hospitals (where things really need to keep working) run systems which have only been around for a few years, and in that time proved themselves to be extremely unreliable, in general.

    New features should not be added at the cost of stability, and this is what people seem to be doing all the time. People are perfectly capable of carrying on using paper, and should be trained and have a procedure to do so at a moment's notice. If the job is so complex that paper is simply not an option (this seems unlikely; even air traffic controllers can manage without computers), then computers should have a ridiculous amount of redundancy built in to them, something I've only heard of NASA even approaching.

  8. I work at a teaching hospital... by pacsman · · Score: 5, Insightful

    The network isn't too bad, but the incompetence of the people that run it astounds me. I've had large segments of it go out unnoticed by them because a UPS failed in a closet somewhere. Took them forever to track it down, too. In the end it's not the routers/switches that scare me, but the tons of old, outdated, unpatched Solaris machines that exist on this network. There are so many manufacturers out there that use crappy installations to run their MRI and CAT scanners that it terrifies me. It's really only a matter of time until all me and my company's doomsaying (we're a third party vendor that supports a medical image archive) will come true. Unfortunately, I think it will collapse on us because the IS people will be unable to handle it.

  9. Re:No. by ostiguy · · Score: 5, Insightful

    If a network problem breaks down network 1, what is going to stop it from breaking network #2? If the problem was with the firmware in device#23a, the problem will reoccur on network 2 with device #23b

    ostiguy

  10. Re:Life threatening? by benwb · · Score: 5, Insightful

    Test results and labs come back on computer these days. More and more hospitals are moving to filmless radiology, where all images are delivered over the network. I don't know that much about this particular hospital, but I do know that hospitals en masse are rapidly aproaching the point where a network outage is life threatening. This is not because the machine that goes ping is going to go off line, but because doctors won't have access to the diagnostic tools that they have now.

  11. Contribution to causality responsibility by hey! · · Score: 5, Insightful

    Suppose you have footbridge crossing a stream that takes heavy traffic. One day, it collapses with many people on it. One of the people on the bridge weighed 300 lb.

    Would it be fair to say that the bridge collapsed because a 300 lb man was on it? It is completely clear that he contributed to the collapse of the bridge, in the sense that he contributed to the stresses on the structure. One might even say he is more responsible than a 100lb woman who was also on the structur at the time.

    But, we'd generally expect that a footbridge be engineered to support a 300lb man. Or if not, to isolate the failure (e.g. the planks under him might fall out, but the bridge as a whole should not collapse). It's part of the designer's job to anticipate this.

    I've done a lot of troubleshooting in my time, of networks and other systems. One thing I've learned is that in the case of failure you just can't fasten on one thing that is out of the ordinary. At any given time, in a big enough system, something's bound to be out of the ordniary. Even if you can trace, step by step, the propagation of a problem from a single anamoulous event, it is the capacity of the system to propagate the problem that is the real issue, at least if you take a conservative, defensive stance in design.

    --
    Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
  12. Downtime Procedures by Kraegar · · Score: 5, Insightful
    Posting this kind of late, but it needs to be said.

    I work at a hospital, on the networking side of things. It's a fairly large hospital, and we've got some pretty amazing tech here that runs this place. But BY LAW we have downtime procedures. ALL STAFF MUST KNOW THEM. We have practice sessions monthly in which staff uses downtime procedures (pen and paper) to insure that if our network were to be completely lost, we could still help patients. It's the friggin law. Whoever fucked up and hadn't looked at downtime procedures in 6 years should be fired. That's just bullshit.

    I don't know how that hospital was able to pass inspections.

  13. It's all about the Benjamins by sjbe · · Score: 5, Insightful

    My wife is a doctor. From what I've observed hospitals tend to be penny wise and pound foolish, particularly with regard to their computer systems. Largely for financial reasons they are generally unwilling to hire the IT professionals and spend the $ they need to do the job right.

    The computer systems at my wife's medical school were apparently run by a herd of poorly trained monkeys. Systems would crash constantly, admin policies were absurd, and very little was done to fix anything. At her current hospital, the residents in her department are stuck with machines that literally crash 10+ times daily. Nothing is done to fix them because that would take expertise, time and $, all of which are either in short supply or withheld.

    Hospitals really need serious IT help and it is a very serious problem. This article just illustrates how pathetically bad they do the job right now. I wish I could say I was surprised by this but I'm not.