Hospital Brought Down by Networking Glitch
hey! writes "The Boston Globe reports that Beth Israel Deaconess hospital suffered a major network outage due to a problem with spanning tree protocol. Staff had to scramble to find old paper forms that hadn't been used in six years so they could transfer vital patient records and prescriptions. Senior executives were reduced to errand runners as the hospital struggled with moving information around the campus. People who have never visited Boston's Medical Area might not appreciate the magnitude of this disaster: these teaching hospitals are huge, with campuses and staff comparable to a small college, and many, many computers. The outage lasted for days, despite Cisco engineers from around the region rushing to the hospital's aid. Although the article is short on details, the long term solution proposed apparently is to build a complete parallel network. Slashdot network engineers (armchair and professional): do you think the answer to having a massive and unreliable network is to build a second identical network?"
A Bank in America [;)] had an outage back in 1998 where all their Stratocom went down for similar reasons. The Gateway/Network Engineering group had been saying for a couple years that we needed more redundancy but senior executives just saw the expenses and not the liability ... until every single Stratacom went down.
... it took a week. All non-critical traffic had to be cut-off as we pushed everything through the backup T1s and ISDN lines. It cost the bank MILLIONS of dollars.
We had to rebuild the entire network
Suddenly, that backup network was real cheap. They are now quite proud to tote their redundancy.
I also used to work at a teaching hospital (Wishard for Indiana University) and I learned more there about networking and systems support than in years of college. I remember one day we found a still used piece of thick-net (you know...old firehose). It was connecting the ambulance office's systems to the rest of the hostpital. The rest of the hospital ran on DEC VAX clusters and terminals. To be fair, they have gotten much better (I don't work there anymore either), but this wasn't the first hospital network I had seen that truly terrified me, and it hasn't been the last.
[RIAA] says its concern is artists. That's true, in just the sense that a cattle rancher is concerned about its cattle.
Spanning-Tree Protocol is a link management protocol that provides path redundancy while preventing undesirable loops in the network. For an Ethernet network to function properly, only one active path can exist between two stations.
Multiple active paths between stations cause loops in the network. If a loop exists in the network topology, the potential exists for duplication of messages. When loops occur, some switches see stations appear on both sides of the switch. This condition confuses the forwarding algorithm and allows duplicate frames to be forwarded.
To provide path redundancy, Spanning-Tree Protocol defines a tree that spans all switches in an extended network. Spanning-Tree Protocol forces certain redundant data paths into a standby (blocked) state. If one network segment in the Spanning-Tree Protocol becomes unreachable, or if Spanning-Tree Protocol costs change, the spanning-tree algorithm reconfigures the spanning-tree topology and reestablishes the link by activating the standby path.
Spanning-Tree Protocol operation is transparent to end stations, which are unaware whether they are connected to a single LAN segment or a switched LAN of multiple segments.
see this page for mode info
If Spanning Tree is what brought them down, and it had campus wide effect, then they're running their production networks as one big flat layer 2 network. This is almost definitely the root of the problem. Modern network design would divide the campus (and often individual buildings) into multiple subnets, using routing to get between nets. That way if something like STP goes wrong in one spot, it doesn't affect the others.
Building a parallel identical net is almost definitely the wrong answer. Especially if it uses the same design and equipment!
Unfortunately, often older networks grow in a piecemeal way and end up like this, commonly having application level stuff that requires it to be flat. The job of a good network engineer (and diplomat) is to slowly have all the apps converted to being routable and then subnet the net.
As an employee at BIDMC (the Beth Israel Deaconess Medical Center) I can tell you that they did not just install a parallel network. The first network was completely redesigned to be more stable and once it proved its stability, then a second redundant network was put in place to ensure that if the network ever became unstable again for any reason there was a backup that was known to work immediately instead of having to wait to fix the original again. Most of the housestaff at BIDMC were already familiar with the paper system as the transition to paperless had only occured over the last two years and in stages. The real problems was obtaining lab and test results as these have been on computer for years.
was a human error. We were a smallish hospital (270 beds). I was the new IS Manager. I was looking for power outlets in the computer room for all the new euqipment I had ordered. Well, there were a lot of dead plugs. Also, I was told to stop since electricity based things like that were left up to the union guys. No big deal. I called them and asked them to locate and label the outlets under the raised floor. While I was sitting at my desk later that day the power went off for a sec then on.... I got up and looked toward the data center. The lights AND the equipment went off then on. I ran in to find the union guys flipping switches on the UPS (on/off). They had stuck a light bulb w/plug in each of the open outlets and were flicking the power on and off to see what bulb was effected. They were on the equipment side of the UPS! All of our servers, network gear, and such took hard downs that day! Ahhh!!! Who needs technology to make things not work! This was the same union that wrote me up for moving a cube wall to get at an outlet. Moving furniture was a union duty!
"If you are on fire you can just stop, drop, and roll. If you fall into Lava you are just dead." - my 5yr old daughter
That it was a network upgrade, sometimes its not, and you have no clue what was changed, by *someone else*...
As far as a parallel network, thats a tad overkill.. proper redundant pathways should be enough.. and plenty of packet filtering/shaping/monitoring.
and keep a tighter reign on what is allowed to be attached to the PRODUCTION network..
---- Booth was a patriot ----
1) introduction of routed domains to seperate groups of switches
2) insure that more than one redundant switching loop does not terminate in a switch. I've had a single switch be the lynch-pin between two loops, had the switch go down and back up, and spanning-tree would not converge. If you want redundancy in your switches, spread out the loops.
3) Put QoS on the network. Identify mission-critical traffic and give it priority and guarenteed bandwidth (Cisco uses LLQ and CBWFQ using DiffServ, CoS, and IP precendence). That way even if someone puts loads of traffic on mission critical paths, the effect should be limited to the local switch port or router, depending how it is implemented.
4) lastly try a redundant network. You would still want QoS to stop a jabbering NIC from hosing your local bandwidth, and you might want to run diagnostics with your pocket PC or laptop, so you would still need to plug into that isolated net anyway. I would recommend that last due to cost, space, and connectivity issues.
Thank you.
Cisco only runs per-VLAN spanning tree if you're using ISL as your trunking protocol. The reason you don't get it on Extreme Networks stuff is because they use 802.1q. In fact, Cisco devices trunking w/ the IEEE protocol run single instances, just like the Extreme product.
There are tradeoffs, of course. STP recalculations (when running) can be kind of intensive, and if you've got to run them for each of your 200 VLANs, it can take a while. However, for my particular environment, per-VLAN STP is a better solution.
"The crisis had nothing to do with the particular software the researcher was using."
"The large volume of data the researcher was uploading happened to be the last drop that made the network overflow. "
While it is never said directly, the implication is that the network was a in bad shape to begin with, and when this guy started doing whatever he was doing, it just pushed things over the edge.
There is no "-1 offended" or "-1 you don't agree with me" mod options for a reason.
The most common reason spanning tree problems occur is because no one tells the spanning tree domain who the root of the network is. This leads to the switches deciding to gets to be the root. In most implimentations of spanning tree, the lowest MAC address wins.
Because Cisco switches come with Spanning-Tree enabled by default, and because most network "engineers" don't know what spanning tree is, many corporate networks have a random switch serving as the root of the spanning tree. And so when spanning tree tries to do it's job: fail-over to a redundant link, it doesn't do a very good job because the humans who set up the network were either lazy or ignorant.
Laziness and ignorance are the villians of most network problems.
Now if Cisco implimented the follow up to spanning tree: rapid spanning tree protocol (802.1w) like the rest of the industry, you'd eliminate the problem of impatient network admins trying to "tune" their network convergence times. Sadly, at most, you're going to shave 8 seconds off the 30 to 50 seconds of convergence time of STP unless you have a very small network. So tuning STP timers is an excersize in navel-meditation. RSTP (802.1w) solves alot of the convergence time problems with original STP (802.1d) and is nicely backwards compatible.
No.
You can only multiply them together like you have done if the two variables are independent.
Here this is clearly not the case; if the networks are identical and one fails, it is more likely that the second will fail because the cause might be identical.
I've consulted here. No not on the network design! Desktop staff - big hello to the much expanded Research Support team!
AFAIK the BI network has gradually evolved from the 60/70s and has including several massive growth spurts to incorporate the expansions, refits, windfalls etc. I once participated in an after hour Cisco cutover where we yanked connections and waited for the data to flow (IPX round/robin servers listing) to find the specific segments affected. Very much a live trial and error process.
I got the feeling no-one is completely certain where/how all the data flows especially in the older Research segments e.g. Dana Farber. In fact, I'm guessing this is where the failure originated. Heavy duty number crunching and spanning tree errors lead me to some sort of distributed unix process across network segments. I want to blame a certain notorious geek (Dr P's) unix and mac labs but in truth it could be any one of the overworked and underfunded labrats in any of the segments.
The wiring closets used to look way worse than any posted at the recent Register article. A single Cat 5 cable run to a data jack is sometimes split to host 2 connections: unfortunately as the Research areas are grant funded, this is still bloody cheaper than a hub/switch! There is probably still some localtalk cabling in some labs, coax runs to a DG and Novell serial connections with 1 or 2 Mac Classic and SE holdouts running Dos and DG terminal emulators!!!
The network team in the Hospital (2 afaik) coped with daily routing failures, buggy failovers, the crappy Novell IPX 802.3 implementation and servers around every corner. Those folks team with a great desktop staff to nursemaid outdated equipment into the 21st century. It stuns me to this day what a superior job these folks did and probably do. They certainly made my job easier.
I feel this could have happened any time and disaster has been averted one too many times before. Halamka and the exec staff owe these guys more that just a few column inches of chagrined praise.
If one resercher sitting at his desk can take down the whole hospital system accidentally just by "overusing" the network, it's just a matter of time.
No application can cause a spanning tree loop. It is simply impossible.
A spanning tree loop causes broadcast frames - correectly used in small numbers in many different circumstances - to loop endlessly about the network (clogging it up), using paths that are provided for redunancy, but which are normally stopped form passing traffic by the "spanning tree protocol".
There are 2 likely causes:
Unidirectional link failure. If a connection between switches passes traffic in only one direction (normally they are bi-directional), then spanning tree can be 'fooled' into allowing traffic on a path that creates a loop and lets frames loop endlessly.
Misconfiguration of switches, possibly combined with erroneous cabling. If spanning tree is configured off on a port, (or, maybe, put into a mode called portfast), it's possible for interconnection of switch ports (through a crossover cable or other means) to cause this to occur.
A third possible cause is that the spanning tree software itself screws up and allows a loop when it shouldn't have. This was known to occasionally happen in Cisco switches some years ago. I haven't heard of it lately.
This all happens way below the application layer. Unless the application is speccific written to send huge numbers of broadcast frames (there is no legitimate reason for an app to do this), it couldn't bring down the network. And even if it did, this would not be a 'spanning tree loop' and disconnecting the offending station woul immediately fix the problem.
Probably, the network should be using routers to partition it into smaller LANs. But ths can stilll happen to any single LAN so creaeted and if it happens to the one your servers are on, you're still cooked.
"that's not encryption - it's a new perl script that I'm working on..." - from some Matrix parody
I contacted Dr. John D. Halamka to see if he could provide more detail on the network outage. Dr. Halamka is the chief information officer for CareGroup Health System, the parent company of the Beth Israel Deaconess medical center. His reply is as follows: "Here's the technical explanation for you. When TAC was first able to access and assess the network, we found the Layer 2 structure of the network to be unstable and out of specification with 802.1d standards. The management vlan (vlan 1) had in some locations 10 Layer2 hops from root. The conservative default values for the Spanning Tree Protocol (STP) impose a maximum network diameter of seven. This means that two distinct bridges in the network should not be more than seven hops away from one to the other. Part of this restriction is coming from the age field Bridge Protocol Data Unit (BPDU) carry: when a BPDU is propagated from the root bridge towards the leaves of the tree, the age field is incremented each time it goes though a bridge. Eventually, when the age field of a BPDU goes beyond max age, it is discarded. Typically, this will occur if the root is too far away from some bridges of the network. This issue will impact convergence of the spanning tree. A major contributor to this STP issue was the PACS network and its connection to the CareGroup network. To eliminate its influence on the Care Group network we isolated it with a Layer 3 boundary. All redundancy in the network was removed to ensure no STP loops were possible. Full connectivity was restored to remote devices and networks that were disconnected in troubleshooting efforts prior to TACs involvement. Redundancy was returned between the core campus devices. Spanning Tree was stabilized and localized issues were pursued. Thanks for your support. CIO Magazine will devote the February issue to this event and Harvard Business School is doing a case study."