Hospital Brought Down by Networking Glitch

Posted by michael on Wednesday November 27, 2002 @02:41AM from the risks-digest dept.

hey! writes "The Boston Globe reports that Beth Israel Deaconess hospital suffered a major network outage due to a problem with spanning tree protocol. Staff had to scramble to find old paper forms that hadn't been used in six years so they could transfer vital patient records and prescriptions. Senior executives were reduced to errand runners as the hospital struggled with moving information around the campus. People who have never visited Boston's Medical Area might not appreciate the magnitude of this disaster: these teaching hospitals are huge, with campuses and staff comparable to a small college, and many, many computers. The outage lasted for days, despite Cisco engineers from around the region rushing to the hospital's aid. Although the article is short on details, the long term solution proposed apparently is to build a complete parallel network. Slashdot network engineers (armchair and professional): do you think the answer to having a massive and unreliable network is to build a second identical network?"

No. by Clue4All · 2002-11-27 02:45 · Score: 5, Interesting

do you think the answer to having a massive and unreliable network is to build a second identical network?

No, the answer is to fix what is broken. This might be a new concept to some people, but things don't break on their own. If you're doing network upgrades and something stops working, REVERT THE CHANGES AND FIGURE IT OUT. This is reckless and irresponsible behavior.

--

Is your browser retarded?
1. Re:No. by pubjames · 2002-11-27 03:38 · Score: 5, Interesting
  
  I spoke to an electrician at our local hospital recently. He told me the hospital had three separate electricity systems - one connected to the national grid, one connected to an onsite generator which is running all the time, and a third connected to some kind of highly reliable battery system (sorry can't remember the details) for life support and operating theatres in case both the national grid and the on-site generator fail simultaneously.
  
  If they have that level of redundancy for the electrics then I see no reason why they shouldn't for the network.
Disaster recovery by laughing_badger · 2002-11-27 02:50 · Score: 4, Interesting

do you think the answer to having a massive and unreliable network is to build a second identical network?
No. They did everything right. Falling back to paper and runners is the best they could do to safeguard patients' lives. An 'identical' network would be susceptible to the same failure modes as the primary.
That said, hopefully it wasn't really six years since they had run a disaster exercise where they pretended that the computers were unavailable...

--
Help children born unable to swallow - www.tofs.org.uk
Re:Problem was with an application, by sugrshack · 2002-11-27 02:52 · Score: 5, Interesting

that's a good initial assumption, however my experience with similar issues tells me that you can't pin all of this on one person.
Yes, this person should have been using an ad hoc database (assuming one is set up), however access to various things like this tends to get tied up due to "odd" management practices.
realistically a backup network sounds good, however there are other ways around this... it could have been prevented with correct administration of the network itself; for instance, in Sybase systems, there are procedures set up to handle bottlenecks like this. (of course, I could be talking out of my a$$, as I'm one of those people without real access anyway... far from root... more like a leaf).

--
I can't believe it's not lard!
Re:Well! Woopsy! by hey! · 2002-11-27 02:52 · Score: 4, Interesting

I don't think that deliberate malicious action is a very likely cause. The article wasn't for technical folk, so it's anyone's guess; mine is that the network grew gradually to the point where it couldn't be restarted. You can always add a few nodes to a large network, but it isn't necessarily possible to start such a network from a dead stop. Probably a handful of well placed routers would have prevented this.

However, a network like this could be life-critical, and there probably should be contingencies for a variety of circumstances, including deliberate subversion.

--
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
Re:That's why I hate automatic routing by Swannie · 2002-11-27 02:56 · Score: 5, Interesting

Routing has nothing to do with this; spanning tree is a layer two function, responsible for allowing multiple links and redundancy between switches in a network. A properly set-up network running properly set-up spanning tree works wonderfully. Unfortunately, many, many people play with things they don't understand (on a production network, no less).
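
To make the layer two point concrete, here is a minimal sketch of what spanning tree computes: given switches joined by redundant links, pick a root (the lowest bridge ID) and keep just enough links to reach every switch, blocking the rest so that no loop can form. This is a simplified model, not real 802.1D: it ignores port costs, BPDUs, and timers, and the switch IDs and topology are made up for illustration.

```python
from collections import deque

def spanning_tree(links):
    """Return (root, forwarding, blocked) for an undirected switch topology.

    Simplified model: the root is the lowest switch ID and every link has
    equal cost, so a breadth-first search from the root picks the tree.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    root = min(neighbors)
    visited = {root}
    forwarding = set()
    queue = deque([root])
    while queue:
        switch = queue.popleft()
        for peer in sorted(neighbors[switch]):
            if peer not in visited:
                visited.add(peer)
                forwarding.add(frozenset((switch, peer)))
                queue.append(peer)

    # Every link that is not part of the tree is redundant and gets blocked.
    blocked = {frozenset(link) for link in links} - forwarding
    return root, forwarding, blocked

# Hypothetical topology: core switches 1 and 2, closet switches 10 and 11,
# with each closet dual-homed to both cores.
links = [(1, 2), (1, 10), (2, 10), (1, 11), (2, 11)]
root, forwarding, blocked = spanning_tree(links)
print("root:", root)
print("blocked:", sorted(tuple(sorted(link)) for link in blocked))
```

The blocked links are the redundancy: if a forwarding link fails, the real protocol recomputes the tree and unblocks one of them. A misconfigured or overloaded switch that fails to block is what turns that redundancy into a loop.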

This whole situation arises from poor training and poor design. Having several friends that work in hospitals, I know that they typically don't offer a lot of money for IT/Network jobs, and this is what happens when underpaid (read: inexperienced) people are allowed to run such a network.

Done ranting now, can you tell I was laid off a while ago and now stuck in a contract with a network designed by a bunch of inexperienced people? :)

Swannie

--
:q!
Re:Spanning tree by GLX · 2002-11-27 02:57 · Score: 5, Interesting

This would imply that either:

A) A campus could afford to do Layer 3 at every closet switch

or

B) Live without Layer 2 redundancy back to the Layer 3 core.

I'm sure in a healthcare environment, neither is an option. The first is too expensive (unless you buy cheap, and hence unreliable equipment) and the second is too risky.

Spanning tree didn't cause the problem here. Mismanagement of spanning tree sounds like it caused the problem.

Spanning tree is our friend, when used properly.

--
Sig (appended to the end of comments you post, 120 chars)
Complexity brings bugs by stevens · 2002-11-27 02:58 · Score: 5, Interesting

The network at my company is quickly becoming so complex that neither I nor the admins can troubleshoot it.

We have redundant everything -- firewalls, routers, load balancers, app servers, etc. The idea is to have half of everything offsite, so either the main site or the co-lo can go down, and we still rock.

But with all the zones and NATs and rules and routing oddities, the network is less reliable than before. It takes days for them to fix routing problems or firewall problems. Every little problem means we need three people troubleshooting it instead of one admin.

Developers suspect that there's a simpler way to do it all, but since we're not networking experts, it's just a suspicion.
Cisco implementation of Spanning Tree sucks by xaoslaad · 2002-11-27 03:01 · Score: 4, Interesting

I am not up to speed on spanning tree, but speaking with a coworker after reading this article it is my understanding that Cisco equipment runs a new instance of spanning tree each time a new VLAN is created. As you can imagine, in such a large campus environment there can be many tens if not hundreds of VLANs. In a short time you turn your network into a spanning tree nightmare. I'd much rather use some nice Extreme Networks (or Foundry or whatever) Layer 3 switching equipment at the core and turn off spanning tree. Use tagged VLANs from the closets to the core and voila, no need for spanning tree... Use Cisco edge devices for WAN links. Building out a second rat's nest out of the same equipment seems foolish.
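
A rough sketch of why one instance per VLAN adds up. It assumes every trunk port carries every VLAN and that the switch sends one BPDU per instance per trunk port each hello interval (2 seconds is the 802.1D default). The VLAN and port counts are illustrative, not taken from the hospital's network, and the function names are hypothetical.

```python
def stp_instances(vlan_count, per_vlan=True):
    """Number of spanning tree instances a switch has to run."""
    return vlan_count if per_vlan else 1

def bpdus_per_second(vlan_count, trunk_ports, hello_seconds=2.0, per_vlan=True):
    """Rough upper bound on BPDUs sent: one per instance per trunk port per hello."""
    return stp_instances(vlan_count, per_vlan) * trunk_ports / hello_seconds

# Illustrative numbers only: 200 VLANs trunked over 24 ports.
print(bpdus_per_second(vlan_count=200, trunk_ports=24))                  # per-VLAN instances
print(bpdus_per_second(vlan_count=200, trunk_ports=24, per_vlan=False))  # single instance
```

Under these assumptions the control-plane load scales linearly with the number of VLANs, which is the "nightmare" being described.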

I'm not even sure how much Layer 3 switching equipment Cisco has; not much at all from my talking around in the past. It may not be possible to turn around and re-engineer it with the existing equipment; but I think that I would much rather throw out the vendor and reengineer the entire thing correctly before putting in a second shabby network.

I speak from having assisted on something like this in a very small campus environment (1,500 nodes maybe), where we basically tore out a disgusting mess of a LAN and implemented a fully switched, beautifully laid out network with redundant links to all closets, an 8 GB trunk between two buildings, etc., in the span of one weekend. Obviously there was tons of planning involved, cabling run in preparation and so on, but what a fantastic move it was.

Sure there were hiccups Monday morning, but everything was perfectly fine by the end of the week.

Two wrongs don't make a right.
YES- air traffic management experience... by mekkab · 2002-11-27 03:08 · Score: 5, Interesting

Yes. You do things in parallel and you make things redundant. You are fabricating reliability out of unreliable components, much as TCP does over IP.

Let's talk about real-time systems. No, not "Voice over IP" or "streaming video" crap, I mean REAL human grade real-time systems.

How do they get 99.99999% reliability? The components they use may be good, but nothing is that good! They get it by 1) removing single points of failure and 2) rigorously analyzing common mode failures (a sequence of failures that brings everything down).

How is this done? You put things in parallel. Machines are multi-homed. Critical applications are Hot-standby, as are their critical servers. You have the nightmare of constant Standby-Data Management (the Primary sending a copy of its every transaction to the secondary and to the tertiary) but when the power on one side goes out (of course your primary and standby are in different buildings connected to different power supplies, right?!) the secondary steps right up.
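
The arithmetic behind "put things in parallel" can be sketched as follows, assuming the redundant components fail independently, which is exactly the assumption that common mode failures break. The 99.9% per-component figure is illustrative and not from any real system.

```python
def parallel_availability(component_availability, copies):
    """Availability of `copies` redundant components, any one of which suffices.

    Assumes failures are independent; a shared power feed or a shared
    software bug violates this and makes the real number worse.
    """
    unavailability = (1.0 - component_availability) ** copies
    return 1.0 - unavailability

single = 0.999  # illustrative: one component that is up 99.9% of the time
for copies in (1, 2, 3):
    print(copies, f"{parallel_availability(single, copies):.9f}")
```

Under the independence assumption, two copies of a 99.9% component are unavailable one millionth of the time and three copies one billionth of the time. That is how very high reliability gets built from parts that are merely good, and why the separate buildings and separate power supplies matter.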

--
In the future, I would want to not be isolated from my friends in the Space Station.
Re:Problem was with an application, by nolife · 2002-11-27 03:16 · Score: 5, Interesting

Not only that, but they gave the impression no one had problems using the old paper method. They actually noted that at times the network was fine, but they decided to stick with the backup method until the issue was resolved, because switching back and forth whenever the network came up was harder. All in all, though, they made a point that no appointments were missed, no surgeries were cancelled, etc., meaning business went on as usual using a manual backup method.

I have not read Network World enough to form an impression of their style. Is it watered down to favor advertisers and the general IT purchasing people, or is it really a nuts-and-bolts, down-to-earth mag?

--
Bad boys rape our young girls but Violet gives willingly.
Re:Spanning tree by stilwebm · 2002-11-27 03:21 · Score: 5, Interesting

I don't think disabling spanning tree would help at all, especially on a network with two campuses with redundant connections between buildings, etc. This is just the type of network spanning tree should help. But it sounds to me like they need to do some better subnetting and trunking, not necessarily using Layer 3 switches. They might consider hiring a network engineer with experience on similar campuses, even large university campuses, to help them redesign the underlying architecture. Spanning tree wasn't the problem; the architecture, and thus the way spanning tree was being used, was the problem.
Re:Leading question by enkidu55 · 2002-11-27 03:23 · Score: 4, Interesting

Isn't that the whole point in posting a story? To foster your own personal agendas? What would be the point in making a contribution to /. then if everything was vanilla in format and taste. You would think that the members of the /. community would feel a certain sense of pride knowing that their collective knowledge could help another business/community out with some free advice.

IMHO if you don't like it then stop reading the damn thing. It's just like TV... If you don't like the channel you're watching then turn it, or turn it off and do something else, but don't bitch because you don't like the content.
Re:Simple Answer by gorilla · 2002-11-27 03:43 · Score: 4, Interesting

Having worked in a hospital, I'll tell you that's not acceptable.
Medical records are probably the most sensitive records there are, and therefore it's essential that any access to them is both authenticated and audited. The first ensures that only authorized people can access them. The second ensures that in the event of misuse of the records, this can be detected - e.g. if someone who has authorization to access records decides to look up their neighbours without good reason.
Fraternal Twins by SEWilco · 2002-11-27 03:51 · Score: 5, Interesting

I hope the "second redundant network" uses equipment by a different manufacturer and has at least one network technician whose primary duty is that network. That person's secondary duty should be to monitor the primary network and look for problems there. Someone in the primary network staff should have a secondary duty to monitor and check the backup network.
The ideal would be to actually use both networks, such as by using each on alternating weeks. This ensures that both networks can handle full normal operations and are both operational.
Mission Critical Networks 101 by rhoads · 2002-11-27 04:11 · Score: 5, Interesting

One of the fundamental concepts in building mission critical networks is what is referred to as "A/B Diversity" -- also sometimes called "salt and peppering". The idea is that you build two or more physically and logically separate network infrastructures and distribute the user population evenly across them. Thus, when a catastrophic failure occurs in one of the network "domains", the other will continue to function and business can continue in "degraded" mode.
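
A minimal sketch of the A/B idea, assuming a simple round-robin assignment so that neighbouring desks land on different domains. The desk IDs, domain names, and function names are hypothetical.

```python
from collections import Counter

def assign_domains(users, domain_names):
    """Spread users evenly across network domains, round-robin."""
    return {user: domain_names[i % len(domain_names)]
            for i, user in enumerate(users)}

def fraction_affected(assignment, failed_domain):
    """Share of users who lose service when one domain fails."""
    counts = Counter(assignment.values())
    return counts[failed_domain] / len(assignment)

users = [f"desk-{n}" for n in range(1000)]  # hypothetical desk IDs
domains = ["A", "B", "C", "D"]              # four separate network domains
assignment = assign_domains(users, domains)
print(fraction_affected(assignment, "A"))
```

With four evenly loaded domains, losing any one of them takes out a quarter of the users, and every affected desk sits next to one that still works.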

We have a long way to go before data networks reach the stability of, for example, the public telephone system. The modern reality is that these networks are susceptible to a host of trivial yet potentially catastrophic failure scenarios. Spanning Tree (STP) is a very unreliable protocol. It has the potential to fail under many conditions, such as the presence of physical errors, very high load, or as a consequence of a bug in the OS of one or many network devices.

Broadcast storms will occur. ARP storms will occur. OS bugs will crop up. Facilities personnel will play jump rope with your cable plant.

These problems can be mitigated, but not eliminated, by good network design. Thus, in environments such as hospitals and banks, where the cost of network downtime is too great to bear, it is common practice to build one or several parallel infrastructures as I have described.

FUNNY NETWORK TRICKS

I used to be in charge of the NOC at a large investment bank in New York. One of our buildings had six floors each housing 1,000 equities traders -- and this was during the stock market boom. Network downtime was not tolerated during trading hours. Therefore, the building was divided into four separate network domains connected to each other, server farms, and the WAN/MAN environment via a layer-3 core.

-- One time a printer became wedged and proceeded to send out ARP requests at the rate of thousands per second. The flood of messages pegged the CPUs of the routers servicing that domain and brought network services to a halt. Time To Resolution: 20 minutes (proud to say) to deploy sniffer, identify offending host, and rip its cable out of the wall with extreme prejudice. % of building affected: 25.

-- Over the course of several months, the Novell/NT team progressively decommissioned Novell servers and replaced them with W2K servers. Unfortunately, nobody thought to turn off the Netware services in the roughly 1,000 printers deployed throughout the building. On one glorious day, the very last Netware server was decommissioned in a particular domain leaving the printers in that domain with no server to "attach" to. The resultant flood of SAP messages became so great that the Cisco routers could not service them in a timely manner and they became cached in memory. The routers would gradually run out of memory, spontaneously reboot, and repeat the cycle. Time To Resolution: ONE FULL DAY. % of building affected: 25. Number of hours spent in postmortem meetings: ~15.

-- On several occasions, Spanning Tree failed resulting in loss of network services for the affected domain. Time To Resolution: 15 minutes to identify problem and perform coordinated power cycle of Distribution switches. % of building affected: 25.
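
The "identify offending host" step in the wedged printer story comes down to counting frames per source. A minimal sketch, assuming the sniffer's capture has already been reduced to (timestamp in seconds, source MAC) pairs for ARP requests. The threshold, the function name, and the sample capture are all made up.

```python
from collections import Counter

def arp_talkers(arp_requests, threshold_per_second):
    """Return {mac: rate} for sources whose average ARP request rate exceeds the threshold.

    `arp_requests` is an iterable of (timestamp_seconds, source_mac) pairs.
    """
    requests = list(arp_requests)
    if not requests:
        return {}
    timestamps = [ts for ts, _ in requests]
    # Treat very short captures as one second long to avoid dividing by ~0.
    duration = max(max(timestamps) - min(timestamps), 1.0)
    counts = Counter(mac for _, mac in requests)
    return {mac: count / duration
            for mac, count in counts.items()
            if count / duration > threshold_per_second}

# Made-up capture: one wedged printer among two quiet hosts, about two seconds long.
capture = [(i / 2000.0, "00:00:5e:00:53:01") for i in range(4000)]
capture += [(0.5, "00:00:5e:00:53:02"), (1.5, "00:00:5e:00:53:03")]
print(arp_talkers(capture, threshold_per_second=100))
```

Only the flooding source crosses the threshold; the next step in the story (the cable and the wall) is left to the operator.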

And the list of stories goes on. You get the point.
Re:Um.. by Anonymous Coward · 2002-11-27 04:47 · Score: 5, Interesting

They're called "accountants". My father is a netadmin by trade, and the thing that stresses him most about his job is how, quote, "fucking bean counters" make the purchasing decisions for him.

Example: They want to replace Netware fileservers (they've something around four years uptime, and that's including them having their RAIDs expanded. All that's going to stop them is a man with a sledgehammer) with Windows ones. While Windows servers, if configured correctly, are really stable, they are not stable enough for truly mission-critical jobs (in this case, dealing with insurance companies and medical evacuation. Time is not just money, it's life) yet the idiots in charge have been suckered by Microsoft's marketing.

In this case, staying with netware has saved lives.

Accountants have too much control. They do not understand that if something is vital, you do NOT give it anything less than the very best money can buy. So it'll cut into your profit margins. So what? At least you will still have the margins.