Hospital Brought Down by Networking Glitch
hey! writes "The Boston Globe reports that Beth Israel Deaconess hospital suffered a major network outage due to a problem with spanning tree protocol. Staff had to scramble to find old paper forms that hadn't been used in six years so they could transfer vital patient records and prescriptions. Senior executives were reduced to errand runners as the hospital struggled with moving information around the campus. People who have never visited Boston's Medical Area might not appreciate the magnitude of this disaster: these teaching hospitals are huge, with campuses and staff comparable to a small college, and many, many computers. The outage lasted for days, despite Cisco engineers from around the region rushing to the hospital's aid. Although the article is short on details, the long term solution proposed apparently is to build a complete parallel network. Slashdot network engineers (armchair and professional): do you think the answer to having a massive and unreliable network is to build a second identical network?"
according to the coverage in the printed 11/25/02 Network World magazine I read yesterday. My immediate reaction was that this person who brought down the net using his research tool should not have been using a production network.
Large campus networks hosting extremely critical live applications may need to be subdivided by more than a switch, yes.
do you think the answer to having a massive and unreliable network is to build a second identical network?
No, the answer is to fix what is broken. This might be a new concept to some people, but things don't break on their own. If you're doing network upgrades and something stops working, REVERT THE CHANGES AND FIGURE IT OUT. This is reckless and irresponsible behavior.
Is your browser retarded?
A Bank in America [;)] had an outage back in 1998 where all their Stratacom went down for similar reasons. The Gateway/Network Engineering group had been saying for a couple years that we needed more redundancy but senior executives just saw the expenses and not the liability ... until every single Stratacom went down.
... it took a week. All non-critical traffic had to be cut-off as we pushed everything through the backup T1s and ISDN lines. It cost the bank MILLIONS of dollars.
We had to rebuild the entire network
Suddenly, that backup network was real cheap. They are now quite proud to tout their redundancy.
Yes, I believe we should rush to conclusions and blame it on foreign terrorists since there is nothing suggesting terrorism, and that just proves that they're extremely sneaky.
You may now begin to panic in an orderly fashion, thank you.
Wax-Museum Fire Results In Hundreds Of New Danny DeVito Statues
Spanning-Tree Protocol is a link management protocol that provides path redundancy while preventing undesirable loops in the network. For an Ethernet network to function properly, only one active path can exist between two stations.
Multiple active paths between stations cause loops in the network. If a loop exists in the network topology, the potential exists for duplication of messages. When loops occur, some switches see stations appear on both sides of the switch. This condition confuses the forwarding algorithm and allows duplicate frames to be forwarded.
To provide path redundancy, Spanning-Tree Protocol defines a tree that spans all switches in an extended network. Spanning-Tree Protocol forces certain redundant data paths into a standby (blocked) state. If one network segment in the Spanning-Tree Protocol becomes unreachable, or if Spanning-Tree Protocol costs change, the spanning-tree algorithm reconfigures the spanning-tree topology and reestablishes the link by activating the standby path.
Spanning-Tree Protocol operation is transparent to end stations, which are unaware whether they are connected to a single LAN segment or a switched LAN of multiple segments.
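The blocking behavior described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the switch names are made up, the tree is found with a plain breadth-first search from a root chosen by hand, and there is no root election, path cost or BPDU exchange as in the real protocol. It only shows the idea that redundant links are held in standby and brought back when the topology is recomputed after a failure.

```python
from collections import deque

def spanning_tree(links, root):
    """Return (forwarding, blocked) link sets for a switch topology.

    links: iterable of (switch_a, switch_b) pairs (hypothetical names).
    A breadth-first search from the root keeps the first link that reaches
    each switch; every other link is redundant and goes into standby.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    forwarding = set()
    visited = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for peer in sorted(neighbors.get(current, [])):
            if peer not in visited:
                visited.add(peer)
                forwarding.add(frozenset((current, peer)))
                queue.append(peer)

    blocked = {frozenset(link) for link in links} - forwarding
    return forwarding, blocked

# Hypothetical triangle of three switches: one link has to be blocked.
links = [("sw1", "sw2"), ("sw2", "sw3"), ("sw1", "sw3")]
forwarding, blocked = spanning_tree(links, root="sw1")

# If the sw1-sw3 link fails, recomputing the tree reactivates sw2-sw3.
remaining = [link for link in links if set(link) != {"sw1", "sw3"}]
forwarding_after, blocked_after = spanning_tree(remaining, root="sw1")
```

In the triangle, the sw2-sw3 link ends up in standby; after the simulated failure it carries traffic again and nothing is left blocked.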
See this page for more info.
Why on earth would a researcher be plugged into the same network as time-sensitive patient information? Yes it's expensive, but critical functions should be separated from non-critical functions. And the critical network needs to be fairly rigidly controlled (i.e., no researchers should "accidentally" plug into it.) Note further information in http://www.nwfusion.com/news/2002/1125bethisrael.html
If Spanning Tree is what brought them down, and it had campus wide effect, then they're running their production networks as one big flat layer 2 network. This is almost definitely the root of the problem. Modern network design would divide the campus (and often individual buildings) into multiple subnets, using routing to get between nets. That way if something like STP goes wrong in one spot, it doesn't affect the others.
Building a parallel identical net is almost definitely the wrong answer. Especially if it uses the same design and equipment!
Unfortunately, often older networks grow in a piecemeal way and end up like this, commonly having application level stuff that requires it to be flat. The job of a good network engineer (and diplomat) is to slowly have all the apps converted to being routable and then subnet the net.
This whole situation arises from poor training and poor design. Having several friends that work in hospitals, I know that they typically don't offer a lot of money for IT/Network jobs, and this is what happens when underpaid (read: inexperienced) people are allowed to run such a network.
Done ranting now, can you tell I was laid off a while ago and now stuck in a contract with a network designed by a bunch of inexperienced people?
Swannie
:q!
This would imply that either:
A) A campus could afford to do Layer 3 at every closet switch
or
B) Live without Layer 2 redundancy back to the Layer 3 core.
I'm sure in a healthcare environment, neither is an option. The first is too expensive (unless you buy cheap, and hence unreliable equipment) and the second is too risky.
Spanning tree didn't cause the problem here. Mismanagement of spanning tree sounds like it caused the problem.
Spanning tree is our friend, when used properly.
The network at my company is quickly becoming so complex that neither I nor the admins can troubleshoot it.
We have redundant everything -- firewalls, routers, load balancers, app servers, etc. The idea is to have half of everything offsite, so either the main site or the co-lo can go down, and we still rock.
But with all the zones and NATs and rules and routing oddities, the network is less reliable than before. It takes days for them to fix routing problems or firewall problems. Every little problem means we need three people troubleshooting it instead of one admin.
Developers suspect that there's a simpler way to do it all, but since we're not networking experts, it's just a suspicion.
was a human error. We were a smallish hospital (270 beds). I was the new IS Manager. I was looking for power outlets in the computer room for all the new equipment I had ordered. Well, there were a lot of dead plugs. Also, I was told to stop since electricity-based things like that were left up to the union guys. No big deal. I called them and asked them to locate and label the outlets under the raised floor. While I was sitting at my desk later that day the power went off for a sec then on.... I got up and looked toward the data center. The lights AND the equipment went off then on. I ran in to find the union guys flipping switches on the UPS (on/off). They had stuck a light bulb w/plug in each of the open outlets and were flicking the power on and off to see which bulb was affected. They were on the equipment side of the UPS! All of our servers, network gear, and such took hard downs that day! Ahhh!!! Who needs technology to make things not work! This was the same union that wrote me up for moving a cube wall to get at an outlet. Moving furniture was a union duty!
"If you are on fire you can just stop, drop, and roll. If you fall into Lava you are just dead." - my 5yr old daughter
The same explanation was floated in the Globe, but I don't buy it.
People when they are doing debugging tend to fasten onto some early hypothesis and work with it until proven definitively false. Even if jobs aren't on the line people often hold onto their first explanation too hard. When jobs are on the line nobody wants to say the assumptions they were working under for days were wrong, and some people will start looking for scapegoats.
The idea that one researcher was able to bring the network down doesn't pass the sniff test. If this researcher was able to swamp the entire campus network from a single workstation it would suggest to me bad design. The fact that the network did not recover on its own and could not be recovered quickly by direct intervention pretty much proves to me the design was faulty.
One thing I would agree with you is that the hospital probably needs a separate network for life critical information.
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
That it was a network upgrade, sometimes it's not, and you have no clue what was changed, by *someone else*...
As far as a parallel network, that's a tad overkill.. proper redundant pathways should be enough.. and plenty of packet filtering/shaping/monitoring.
and keep a tighter rein on what is allowed to be attached to the PRODUCTION network..
---- Booth was a patriot ----
Yes. You do things in parallel and you make things redundant. You are fabricating reliability out of unreliable components vis-a-vis TCP over IP.
Let's talk about real-time systems. No, not "Voice over IP" or "streaming video" crap, I mean REAL human grade real-time systems.
How do they get 99.99999% reliability? The components they use may be good, but nothing is that good! They get it by 1) removing single points of failure and 2) rigorously analyzing common mode failures (a sequence of failures that brings everything down).
How is this done? You put things in parallel. Machines are multi-homed. Critical applications are Hot-standby, as are their critical servers. You have the nightmare of constant Standby-Data Management (the Primary sending a copy of its every transaction to the secondary and to the tertiary) but when the power on one side goes out (of course your primary and standby are in different buildings connected to different power supplies, right?!) the secondary steps right up.
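The arithmetic behind "nothing is that good, so put things in parallel" can be sketched as follows. The numbers are illustrative only, and the formula assumes the parallel copies fail independently of each other. That assumption is exactly what a common mode failure breaks, so treat the result as an upper bound rather than a promise.

```python
def parallel_availability(component_availability, copies):
    """Availability of `copies` redundant components, assuming independent
    failures: the system is down only if every copy is down at once."""
    return 1.0 - (1.0 - component_availability) ** copies

def copies_needed(component_availability, target):
    """Smallest number of independent copies that reaches the target."""
    if not 0.0 < component_availability < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("availabilities must be strictly between 0 and 1")
    copies = 1
    while parallel_availability(component_availability, copies) < target:
        copies += 1
    return copies

# Illustrative figure: a component that is up 99% of the time.
for n in range(1, 5):
    print(n, parallel_availability(0.99, n))

# Copies of that 99% component needed to reach 99.99999%.
print(copies_needed(0.99, 0.9999999))
```

Under the independence assumption, four copies of a 99% component are enough for seven nines; a shared power feed or a shared software bug would make the real figure much worse.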
In the future, I would want to not be isolated from my friends in the Space Station.
Is to not bother with a second network. They need to break the spanning tree up a bit with some layer 3 routers. Sometimes it is fun to have a nice big layer 2 network. It makes life easy. It sucks to debug it when one half of a leg goes down and you get spanning-tree loops. The switches go down in a ball of flames that way.
The solution is to put some edge routers in every building (Cisco 6509's with MSFC cards). segment each building into different IP networks. Route between the networks. That way you may lose a building if the spanning-tree goes futzed but you won't lose the whole campus.
Sure you'll be a touch slower routing between the segments but you'll have much more reliability.
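As a rough sketch of what "segment each building into different IP networks" means, here is a Python example using the standard `ipaddress` module. The 10.20.0.0/16 block, the /24 size and the building names are all assumptions made up for illustration, not anything taken from the hospital's actual addressing plan.

```python
import ipaddress

# Assumed campus address block and hypothetical building names.
campus = ipaddress.ip_network("10.20.0.0/16")
buildings = ["east", "west", "research", "clinic"]

# Carve one /24 per building; each becomes its own broadcast domain.
subnets = dict(zip(buildings, campus.subnets(new_prefix=24)))

def building_for(address):
    """Return the building whose subnet contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for name, network in subnets.items():
        if ip in network:
            return name
    return None

def needs_routing(address_a, address_b):
    """True when two hosts sit in different subnets, so traffic between
    them has to cross a router instead of being switched at layer 2."""
    return building_for(address_a) != building_for(address_b)

for name, network in subnets.items():
    print(name, network)
```

With this layout a layer 2 problem such as a spanning-tree loop stays inside one /24, because broadcast frames do not cross the routed boundary between buildings.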
Now I hope and pray that I will But today I am still, just a bill
Of course not. Two solutions are more obvious:
- Fix or replace the existing network with a more reliable one (probably one that is less centralized so outages would not affect the entire campus); or
- If a second network is going to be added to provide reliable backup, then the second network should certainly not use the same technology as the first.
A third, and somewhat obvious, solution would be to make sure that [...]
This might also be a good reminder to get very aggressive "liquidated damages" clauses in contracts like this, or to buy insurance. If a patient dies because of the network outage, I am sure that everyone in the supply chain will be named in the lawsuit.
The liquidated damage clause is intended to provide an unambiguous motivation for the technology provider to fix the problem quickly, while the insurance would cover all or a portion of the losses if there is a failure.
I would be extremely surprised if a huge campus like this one did not have a substantial number of different technologies in use, including wireless, and clearly networking them all into the same patient-records database is a challenge.
-- http://www.MarkWelch.com/ Pleasanton California
To be fair, they have gotten much better...
You seem to have forgotten to explain why they were worse.
If they are running thick ethernet and VAX machines, it is probably because nobody has looked at the system recently, presumably because it hasn't failed. This is how things should be.
What terrifies me is that places like hospitals (where things really need to keep working) run systems which have only been around for a few years, and in that time proved themselves to be extremely unreliable, in general.
New features should not be added at the cost of stability, and this is what people seem to be doing all the time. People are perfectly capable of carrying on using paper, and should be trained and have a procedure to do so at a moment's notice. If the job is so complex that paper is simply not an option (this seems unlikely; even air traffic controllers can manage without computers), then computers should have a ridiculous amount of redundancy built in to them, something I've only heard of NASA even approaching.
1) introduction of routed domains to separate groups of switches
2) ensure that more than one redundant switching loop does not terminate in a switch. I've had a single switch be the linchpin between two loops, had the switch go down and back up, and spanning-tree would not converge. If you want redundancy in your switches, spread out the loops.
3) Put QoS on the network. Identify mission-critical traffic and give it priority and guaranteed bandwidth (Cisco uses LLQ and CBWFQ using DiffServ, CoS, and IP precedence). That way even if someone puts loads of traffic on mission critical paths, the effect should be limited to the local switch port or router, depending how it is implemented.
4) lastly try a redundant network. You would still want QoS to stop a jabbering NIC from hosing your local bandwidth, and you might want to run diagnostics with your pocket PC or laptop, so you would still need to plug into that isolated net anyway. I would recommend that last due to cost, space, and connectivity issues.
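To make point 3 a little more concrete, here is a minimal Python sketch of strict priority queueing. It is only an illustration of the idea that marked traffic gets served first; it is not how LLQ or CBWFQ are implemented, it gives no bandwidth guarantee to the lower classes, and the class numbers and packet labels are invented for the example.

```python
import heapq
from itertools import count

class PriorityScheduler:
    """Strict priority queue: lower class number is served first,
    and packets within a class keep their arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # tie-breaker that preserves FIFO order

    def enqueue(self, packet, traffic_class):
        heapq.heappush(self._heap, (traffic_class, next(self._arrival), packet))

    def dequeue(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

# Hypothetical classes: 0 = mission critical, 1 = everything else.
scheduler = PriorityScheduler()
for n in range(3):
    scheduler.enqueue(f"bulk-transfer-{n}", traffic_class=1)
scheduler.enqueue("patient-record", traffic_class=0)

sent = [scheduler.dequeue() for _ in range(len(scheduler))]
print(sent)
```

Even though the bulk packets arrived first, the mission-critical packet leaves the queue ahead of them. A real deployment would also cap the priority class so it cannot starve everything else.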
Thank you.
I don't think disabling spanning tree would help at all, especially on a network with two campuses with redundant connections between buildings, etc. This is just the type of network spanning tree should help. But it sounds to me like they need to do some better subnetting and trunking, not necessarily using Layer 3 switches. They might consider hiring a network engineer with experience on similar campuses, even large university campuses, to help them redesign the underlying architecture. Spanning tree wasn't the problem, the architecture and thus the way spanning tree was being used was the problem.
The network isn't too bad, but the incompetence of the people that run it astounds me. I've had large segments of it go out unnoticed by them because a UPS failed in a closet somewhere. Took them forever to track it down, too. In the end it's not the routers/switches that scare me, but the tons of old, outdated, unpatched Solaris machines that exist on this network. There are so many manufacturers out there that use crappy installations to run their MRI and CAT scanners that it terrifies me. It's really only a matter of time until all me and my company's doomsaying (we're a third party vendor that supports a medical image archive) will come true. Unfortunately, I think it will collapse on us because the IS people will be unable to handle it.
Test results and labs come back on computer these days. More and more hospitals are moving to filmless radiology, where all images are delivered over the network. I don't know that much about this particular hospital, but I do know that hospitals en masse are rapidly approaching the point where a network outage is life threatening. This is not because the machine that goes ping is going to go off line, but because doctors won't have access to the diagnostic tools that they have now.
Suppose you have a footbridge crossing a stream that takes heavy traffic. One day, it collapses with many people on it. One of the people on the bridge weighed 300 lb.
Would it be fair to say that the bridge collapsed because a 300 lb man was on it? It is completely clear that he contributed to the collapse of the bridge, in the sense that he contributed to the stresses on the structure. One might even say he is more responsible than a 100 lb woman who was also on the structure at the time.
But, we'd generally expect that a footbridge be engineered to support a 300 lb man. Or if not, to isolate the failure (e.g. the planks under him might fall out, but the bridge as a whole should not collapse). It's part of the designer's job to anticipate this.
I've done a lot of troubleshooting in my time, of networks and other systems. One thing I've learned is that in the case of failure you just can't fasten on one thing that is out of the ordinary. At any given time, in a big enough system, something's bound to be out of the ordinary. Even if you can trace, step by step, the propagation of a problem from a single anomalous event, it is the capacity of the system to propagate the problem that is the real issue, at least if you take a conservative, defensive stance in design.
The ideal would be to actually use both networks, such as by using each on alternating weeks. This ensures that both networks can handle full normal operations and are both operational.
"Dammit, Jim, I'm a doctor, not a CCIE!"
--
"Open source is good." - Steve Jobs
"Open source is evil." - Microsoft
I've consulted here. No not on the network design! Desktop staff - big hello to the much expanded Research Support team!
AFAIK the BI network has gradually evolved from the 60/70s and has included several massive growth spurts to incorporate the expansions, refits, windfalls etc. I once participated in an after-hours Cisco cutover where we yanked connections and waited for the data to flow (IPX round/robin servers listing) to find the specific segments affected. Very much a live trial and error process.
I got the feeling no-one is completely certain where/how all the data flows especially in the older Research segments e.g. Dana Farber. In fact, I'm guessing this is where the failure originated. Heavy duty number crunching and spanning tree errors lead me to some sort of distributed unix process across network segments. I want to blame a certain notorious geek (Dr P's) unix and mac labs but in truth it could be any one of the overworked and underfunded labrats in any of the segments.
The wiring closets used to look way worse than any posted at the recent Register article. A single Cat 5 cable run to a data jack is sometimes split to host 2 connections: unfortunately as the Research areas are grant funded, this is still bloody cheaper than a hub/switch! There is probably still some localtalk cabling in some labs, coax runs to a DG and Novell serial connections with 1 or 2 Mac Classic and SE holdouts running Dos and DG terminal emulators!!!
The network team in the Hospital (2 afaik) coped with daily routing failures, buggy failovers, the crappy Novell IPX 802.3 implementation and servers around every corner. Those folks team with a great desktop staff to nursemaid outdated equipment into the 21st century. It stuns me to this day what a superior job these folks did and probably do. They certainly made my job easier.
I feel this could have happened any time and disaster has been averted one too many times before. Halamka and the exec staff owe these guys more than just a few column inches of chagrined praise.
If one researcher sitting at his desk can take down the whole hospital system accidentally just by "overusing" the network, it's just a matter of time.
One of the fundamental concepts in building mission critical networks is what is referred to as "A/B Diversity" -- also sometimes called "salt and peppering". The idea is that you build two or more physically and logically separate network infrastructures and distribute the user population evenly across them. Thus, when a catastrophic failure occurs in one of the network "domains", the other will continue to function and business can continue in "degraded" mode.
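A minimal sketch of the "salt and peppering" idea in Python, assuming just two domains called A and B and a made-up row of desks. Alternating the assignment means neighbouring desks land on different infrastructures, so a failure of either domain leaves roughly half of every work area connected.

```python
from itertools import cycle

def salt_and_pepper(desks, domains=("A", "B")):
    """Assign adjacent desks to alternating network domains."""
    return {desk: domain for desk, domain in zip(desks, cycle(domains))}

def survivors(assignment, failed_domain):
    """Desks that still have network service when one domain fails."""
    return [desk for desk, domain in assignment.items() if domain != failed_domain]

# Hypothetical row of eight desks.
desks = [f"desk-{n:02d}" for n in range(1, 9)]
assignment = salt_and_pepper(desks)

print(assignment)
print(survivors(assignment, failed_domain="A"))
```

With eight desks and two domains, losing domain A still leaves the four even-numbered desks working, which is the "degraded mode" described above.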
We have a long way to go before data networks reach the stability of, for example, the public telephone system. The modern reality is that these networks are susceptible to a host of trivial yet potentially catastrophic failure scenarios. Spanning Tree (STP) is a very unreliable protocol. It has the potential to fail under many conditions such as the presence of physical errors, very high load, or as a consequence of a bug in the OS of one or many network devices.
Broadcast storms will occur. ARP storms will occur. OS bugs will crop up. Facilities personnel will play jump rope with your cable plant.
These problems can be mitigated, but not eliminated, by good network design. Thus, in environments such as hospitals and banks, where the cost of network downtime is too great to bear, it is common practice to build one or several parallel infrastructures as I have described.
FUNNY NETWORK TRICKS
I used to be in charge of the NOC at a large investment bank in New York. One of our buildings had six floors each housing 1,000 equities traders -- and this was during the stock market boom. Network downtime was not tolerated during trading hours. Therefore, the building was divided into four separate network domains connected to each other, server farms, and the WAN/MAN environment via a layer-3 core.
-- One time a printer became wedged and proceeded to send out ARP requests at the rate of thousands per second. The flood of messages pegged the CPUs of the routers servicing that domain and brought network services to a halt. Time To Resolution: 20 minutes (proud to say) to deploy sniffer, identify offending host, and rip its cable out of the wall with extreme prejudice. % of building affected: 25.
-- Over the course of several months, the Novell/NT team progressively decommissioned Novell servers and replaced them with W2K servers. Unfortunately, nobody thought to turn off the Netware services in the roughly 1,000 printers deployed throughout the building. On one glorious day, the very last Netware server was decommissioned in a particular domain leaving the printers in that domain with no server to "attach" to. The resultant flood of SAP messages became so great that the Cisco routers could not service them in a timely manner and they became cached in memory. The routers would gradually run out of memory, spontaneously reboot, and repeat the cycle. Time To Resolution: ONE FULL DAY. % of building affected: 25. Number of hours spent in postmortem meetings: ~15.
-- On several occasions, Spanning Tree failed resulting in loss of network services for the affected domain. Time To Resolution: 15 minutes to identify problem and perform coordinated power cycle of Distribution switches. % of building affected: 25.
And the list of stories goes on. You get the point.
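The "deploy sniffer, identify offending host" step in the printer story boils down to counting requests per source. A hedged Python sketch of that counting, assuming the capture has already been reduced to (timestamp, source MAC) pairs; the MAC addresses, the one-second window and the threshold are all invented for the example, and no real capture library is involved.

```python
from collections import Counter

def noisy_hosts(arp_requests, window_seconds, max_per_second):
    """Return source MACs that sent more ARP requests than the allowed
    rate over the capture window.

    arp_requests: iterable of (timestamp_seconds, source_mac) tuples,
    all assumed to fall inside the same capture window.
    """
    counts = Counter(mac for _, mac in arp_requests)
    limit = max_per_second * window_seconds
    return sorted(mac for mac, seen in counts.items() if seen > limit)

# Hypothetical one-second capture: a wedged printer and a normal workstation.
printer = "00:00:5e:00:53:01"
workstation = "00:00:5e:00:53:02"
capture = [(i / 3000, printer) for i in range(3000)]
capture += [(0.2, workstation), (0.7, workstation)]

offenders = noisy_hosts(capture, window_seconds=1, max_per_second=100)
print(offenders)
```

Only the printer exceeds the assumed limit of 100 requests per second, so it is the one whose cable gets pulled.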
I work at a hospital, on the networking side of things. It's a fairly large hospital, and we've got some pretty amazing tech here that runs this place. But BY LAW we have downtime procedures. ALL STAFF MUST KNOW THEM. We have practice sessions monthly in which staff uses downtime procedures (pen and paper) to ensure that if our network were to be completely lost, we could still help patients. It's the friggin law. Whoever fucked up and hadn't looked at downtime procedures in 6 years should be fired. That's just bullshit.
I don't know how that hospital was able to pass inspections.
There was an electrician named Joe at the place I used to work who was counting the days to retirement. He never did a lick of work he didn't absolutely have to, and he never cared if his work would last 24 hours after his retirement.
The NEC (National Electrical Code) was the first casualty of his attitude. But not the last!
I discovered that he carried a heavy-duty plug in his pocket with the two hot leads wired directly together. He called it his "pigtail".
When Joe needed to find what circuit breaker controlled an outlet, he jammed in the pigtail (with an audible *snap* of electric arc) and then calmly walked down to the nearest breaker box to see what had tripped.
You could tell he was working in a building because you'd see scientists running down the hallways tearing their hair and screaming "My research!!! My research!! Ten years of research ruined!!" as the voltage spikes destroyed their equipment...
They're called "accountants". My father is a netadmin by trade, and the thing that stresses him most about his job is how, quote, "fucking bean counters" make the purchasing decisions for him.
Example: They want to replace Netware fileservers (they've something around four years uptime, and that's including them having their RAIDs expanded. All that's going to stop them is a man with a sledgehammer) with Windows ones. While Windows servers, if configured correctly, are really stable, they are not stable enough for truly mission-critical jobs (in this case, dealing with insurance companies and medical evacuation. Time is not just money, it's life) yet the idiots in charge have been suckered by Microsoft's marketing.
In this case, staying with netware has saved lives.
Accountants have too much control. They do not understand that if something is vital, you do NOT give it anything less than the very best money can buy. So it'll cut into your profit margins. So what? At least you will still have the margins.
No application can cause a spanning tree loop. It is simply impossible.
A spanning tree loop causes broadcast frames - correctly used in small numbers in many different circumstances - to loop endlessly about the network (clogging it up), using paths that are provided for redundancy, but which are normally stopped from passing traffic by the "spanning tree protocol".
There are 2 likely causes:
Unidirectional link failure. If a connection between switches passes traffic in only one direction (normally they are bi-directional), then spanning tree can be 'fooled' into allowing traffic on a path that creates a loop and lets frames loop endlessly.
Misconfiguration of switches, possibly combined with erroneous cabling. If spanning tree is configured off on a port, (or, maybe, put into a mode called portfast), it's possible for interconnection of switch ports (through a crossover cable or other means) to cause this to occur.
A third possible cause is that the spanning tree software itself screws up and allows a loop when it shouldn't have. This was known to occasionally happen in Cisco switches some years ago. I haven't heard of it lately.
This all happens way below the application layer. Unless the application is specifically written to send huge numbers of broadcast frames (there is no legitimate reason for an app to do this), it couldn't bring down the network. And even if it did, this would not be a 'spanning tree loop' and disconnecting the offending station would immediately fix the problem.
Probably, the network should be using routers to partition it into smaller LANs. But this can still happen to any single LAN so created and if it happens to the one your servers are on, you're still cooked.
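To illustrate why a loop is so destructive, here is a toy Python simulation of broadcast flooding. It assumes idealized switches that copy a broadcast out of every link except the one it arrived on, with hypothetical switch names and no MAC learning, timing or queue limits. Ethernet frames carry no hop count, so nothing in the model (or on a real LAN) ever discards a looping frame.

```python
def flood(links, source, max_hops=10):
    """Count broadcast frame copies in flight after each hop.

    links: iterable of (switch_a, switch_b) pairs that are forwarding.
    Each switch re-sends a broadcast on every link except the one the
    frame arrived on.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    in_flight = [(None, source)]  # (arrived_from, current_switch)
    history = []
    for _ in range(max_hops):
        next_frames = []
        for came_from, here in in_flight:
            for peer in neighbors.get(here, ()):
                if peer != came_from:
                    next_frames.append((here, peer))
        history.append(len(next_frames))
        in_flight = next_frames
        if not in_flight:
            break
    return history

switches = ["sw1", "sw2", "sw3", "sw4"]

# Every redundant link forwarding at once (spanning tree failed or disabled).
full_mesh = [(a, b) for i, a in enumerate(switches) for b in switches[i + 1:]]
storm = flood(full_mesh, "sw1", max_hops=5)

# Same switches with redundant links blocked, leaving a tree rooted at sw1.
tree = [("sw1", "sw2"), ("sw1", "sw3"), ("sw1", "sw4")]
calm = flood(tree, "sw2")

print(storm)
print(calm)
```

In the fully meshed case the number of copies doubles on every hop from a single broadcast, while on the tree the same frame reaches every switch and then dies out. Pulling the original sender off the network does nothing for the first case, which matches the point made above.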
"that's not encryption - it's a new perl script that I'm working on..." - from some Matrix parody
My wife is a doctor. From what I've observed hospitals tend to be penny wise and pound foolish, particularly with regard to their computer systems. Largely for financial reasons they are generally unwilling to hire the IT professionals and spend the $ they need to do the job right.
The computer systems at my wife's medical school were apparently run by a herd of poorly trained monkeys. Systems would crash constantly, admin policies were absurd, and very little was done to fix anything. At her current hospital, the residents in her department are stuck with machines that literally crash 10+ times daily. Nothing is done to fix them because that would take expertise, time and $, all of which are either in short supply or withheld.
Hospitals really need serious IT help and it is a very serious problem. This article just illustrates how pathetically badly they do the job right now. I wish I could say I was surprised by this but I'm not.