Hospital Brought Down by Networking Glitch
hey! writes "The Boston Globe reports that Beth Israel Deaconess hospital suffered a major network outage due to a problem with spanning tree protocol. Staff had to scramble to find old paper forms that hadn't been used in six years so they could transfer vital patient records and prescriptions. Senior executives were reduced to errand runners as the hospital struggled with moving information around the campus. People who have never visited Boston's Medical Area might not appreciate the magnitude of this disaster: these teaching hospitals are huge, with campuses and staff comparable to a small college, and many, many computers. The outage lasted for days, despite Cisco engineers from around the region rushing to the hospital's aid. Although the article is short on details, the long term solution proposed apparently is to build a complete parallel network. Slashdot network engineers (armchair and professional): do you think the answer to having a massive and unreliable network is to build a second identical network?"
according to the coverage in the printed 11/25/02 Network World magazine I read yesterday. My immediate reaction was that this person who brought down the net using his research tool should not have been using a production network.
Large campus networks hosting extremely critical live applications may need to be subdivided by more than a switch, yes.
... "an old boys' network"
do you think the answer to having a massive and unreliable network is to build a second identical network?
No, the answer is to fix what is broken. This might be a new concept to some people, but things don't break on their own. If you're doing network upgrades and something stops working, REVERT THE CHANGES AND FIGURE IT OUT. This is reckless and irresponsible behavior.
Is your browser retarded?
Yes, a second, fully redundant network would be "good" from a stance of giving better fail-over potential.
But will anyone know when one network fails? If not, then how will they fix it? If they don't fix it, then doesn't that mean that they really only have one network?
Which puts them right back to where they were.
Of course, if they put a redundant network in, then fix their problems to try to prevent this issue happening in future, then they'll be in much better shape the next time their network gets flushed with the medical waste.
A Bank in America [;)] had an outage back in 1998 where all their Stratacom went down for similar reasons. The Gateway/Network Engineering group had been saying for a couple of years that we needed more redundancy but senior executives just saw the expenses and not the liability ... until every single Stratacom went down.
... it took a week. All non-critical traffic had to be cut-off as we pushed everything through the backup T1s and ISDN lines. It cost the bank MILLIONS of dollars.
We had to rebuild the entire network
Suddenly, that backup network was real cheap. They are now quite proud to tout their redundancy.
do you think the answer to having a massive and unreliable network is to build a second identical network?
Am I the only person getting tired of story submitters using Slashdot to support their personal agendas?
Yes, I believe we should rush to conclusions and blame it on foreign terrorists since there is nothing suggesting terrorism, and that just proves that they're extremely sneaky.
You may now begin to panic in an orderly fashion, thank you.
Wax-Museum Fire Results In Hundreds Of New Danny DeVito Statues
do you think the answer to having a massive and unreliable network is to build a second identical network?
I think the answer is to disable spanning tree.
We had a similar problem here (large academic installation, hundreds of workstations, several sites) with things (before my time I hasten to add) being one Big Flat Network (shudder) using IPX primarily and Novell. Needless to say this was not good. I've since redesigned things using IP and multiple VLANs, however there is still the odd legacy system that needs access to the old net.
My solution was to tap the protocols running in the flat network and to put these into VLANs that can be safely propagated around the layer 3 switched network and presented wherever we wish. The entire "flat" network is tapped into a VLAN, and the IP services running on it are routed in. If problems with either network ever got bad enough, we would just pull the routes linking the two together.
I also used to work at a teaching hospital (Wishard for Indiana University) and I learned more there about networking and systems support than in years of college. I remember one day we found a still-used piece of thick-net (you know...old firehose). It was connecting the ambulance office's systems to the rest of the hospital. The rest of the hospital ran on DEC VAX clusters and terminals. To be fair, they have gotten much better (I don't work there anymore either), but this wasn't the first hospital network I had seen that truly terrified me, and it hasn't been the last.
[RIAA] says its concern is artists. That's true, in just the sense that a cattle rancher is concerned about its cattle.
Or as was said in the movie "Contact" -
"Why buy one when you can buy two at twice the price?"
No. They did everything right. Falling back to paper and runners is the best they could do to safeguard patients' lives. An 'identical' network would be susceptible to the same failure modes as the primary.
That said, hopefully it wasn't really six years since they had run a disaster exercise where they pretended that the computers were unavailable...
Help children born unable to swallow - www.tofs.org.uk
In six years they never thought to have a backup/redundant system in place in case of a failure like this?
Even the best networks will come unglued sooner or later. It's surprising to see that most businesses' networks need prime operating conditions to function properly.
Be you Admins? nay, we are but lusers!
Yes I think having a 2nd network for a vital system is a good idea. This sort of thing is used all the time for things like fiber rings where you have the work and protect path. If the primary work path goes down (cut, maintenance, whatever) then you roll to the protect. Yes it is a bit more expensive but in a case like this maybe it is needed.
man
No manual entry for
I don't think that deliberate malicious action is a very likely cause. The article wasn't for technical folk, so it's anyone's guess; mine is that the network grew gradually to the point where it couldn't be restarted. You can always add a few nodes to a large network, but it isn't necessarily possible to start such a network from a dead stop. Probably a handful of well placed routers would have prevented this.
However, a network like this could be life-critical, and there probably should be contingencies for a variety of circumstances, including deliberate subversion.
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
And your change in routing policy is going to affect spanning tree how?
How do you handle mobile users? What about dialup static IP addresses from multiple RAS devices?
Hand-editing of routing tables works only in the most simple of networks.
Should there be a few replacement devices on hand for failures? Yes. Should there be backups of the IOS and configurations for all of the routers? Yes. Should this stuff be anal-retentively documented in triplicate by someone who knows how to write documentation that is detailed yet at the same time easy to understand? Yet another yes.
If it is so critical, it should be done right in the first place. If a physically damaged or otherwise down link is ESSENTIAL to the operation or is responsible for HUMAN LIFE, then there should be duplicate circuits in place throughout the campus to be used in the event of an emergency; just like certain organizations have special failover or dedicated circuits to other locations for emergencies.
Last but absolutely certainly not least; the 'researcher', regardless of their position at the school, should be taken severely to task for this. You don't experiment on production equipment at all. If you need switching fabric; you get it physically separated from the rest of the network or if you really need outside access you drop controls in place like a firewall, etc. to severely restrict your influence on other fabric areas.
I think with the interesting people, their lives can't possibly be wrapped up into a nice little package.
Spanning-Tree Protocol is a link management protocol that provides path redundancy while preventing undesirable loops in the network. For an Ethernet network to function properly, only one active path can exist between two stations.
Multiple active paths between stations cause loops in the network. If a loop exists in the network topology, the potential exists for duplication of messages. When loops occur, some switches see stations appear on both sides of the switch. This condition confuses the forwarding algorithm and allows duplicate frames to be forwarded.
To provide path redundancy, Spanning-Tree Protocol defines a tree that spans all switches in an extended network. Spanning-Tree Protocol forces certain redundant data paths into a standby (blocked) state. If one network segment in the Spanning-Tree Protocol becomes unreachable, or if Spanning-Tree Protocol costs change, the spanning-tree algorithm reconfigures the spanning-tree topology and reestablishes the link by activating the standby path.
Spanning-Tree Protocol operation is transparent to end stations, which are unaware whether they are connected to a single LAN segment or a switched LAN of multiple segments.
See this page for more info
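To make the description above concrete, here is a minimal Python sketch of the end result that spanning tree is after: one loop-free tree, with every redundant link put into a blocked state. It is not the real IEEE 802.1D election (no BPDUs, bridge priorities, or path costs); it simply assumes the lowest-named switch is the root and builds a breadth-first tree. The switch names and topology are made up for illustration.

```python
from collections import deque

def spanning_tree(links, root):
    """Split links into (forwarding, blocked) using a BFS tree from root."""
    # Build an adjacency map from the undirected link list.
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    visited = {root}
    forwarding = set()
    queue = deque([root])
    while queue:
        switch = queue.popleft()
        for peer in sorted(neighbors[switch]):
            if peer not in visited:
                # First path found to this switch stays active.
                visited.add(peer)
                forwarding.add(frozenset((switch, peer)))
                queue.append(peer)

    # Every link that is not part of the tree would create a loop.
    blocked = {frozenset(link) for link in links} - forwarding
    return forwarding, blocked

# Hypothetical four-switch topology with two redundant links (loops).
links = [("sw1", "sw2"), ("sw2", "sw3"), ("sw3", "sw1"),
         ("sw3", "sw4"), ("sw4", "sw1")]
root = min(switch for link in links for switch in link)
forwarding, blocked = spanning_tree(links, root)
print("forwarding:", sorted(tuple(sorted(l)) for l in forwarding))
print("blocked:   ", sorted(tuple(sorted(l)) for l in blocked))
```

With four switches, three links stay active and the other two sit in standby. If an active link is removed from `links` and the function is run again, one of the previously blocked links takes over, which is the reconfiguration step the text describes.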
Why on earth would a researcher be plugged into the same network as time-sensitive patient information? Yes it's expensive, but critical functions should be separated from non-critical functions. And the critical network needs to be fairly rigidly controlled (i.e., no researchers should "accidentally" plug into it.) Note further information in http://www.nwfusion.com/news/2002/1125bethisrael.html
If you're just using a Primary Domain Controller, that could be your problem. I'd recommend adding a backup PDC, as well as a Tertiary Domain Controller, and add an X.25 backup network layer to give you hot-swappability and real-time rollover capabilities.
If Spanning Tree is what brought them down, and it had campus wide effect, then they're running their production networks as one big flat layer 2 network. This is almost definitely the root of the problem. Modern network design would divide the campus (and often individual buildings) into multiple subnets, using routing to get between nets. That way if something like STP goes wrong in one spot, it doesn't affect the others.
Building a parallel identical net is almost definitely the wrong answer. Especially if it uses the same design and equipment!
Unfortunately, often older networks grow in a piecemeal way and end up like this, commonly having application level stuff that requires it to be flat. The job of a good network engineer (and diplomat) is to slowly have all the apps converted to being routable and then subnet the net.
This whole situation arises from poor training and poor design. Having several friends who work in hospitals, I know that they typically don't offer a lot of money for IT/Network jobs, and this is what happens when underpaid (read: inexperienced) people are allowed to run such a network.
Done ranting now, can you tell I was laid off a while ago and now stuck in a contract with a network designed by a bunch of inexperienced people?
Swannie
:q!
The crisis began on a Wednesday afternoon, Nov. 13, and lasted nearly four days.
Did that mean the doctors couldn't play Quake for four days!?
Do your VLANS share the same physical cable? If so, how are they connected? Do you use a one-armed router?
-ted
The network at my company is quickly becoming so complex that neither I nor the admins can troubleshoot it.
We have redundant everything -- firewalls, routers, load balancers, app servers, etc. The idea is to have half of everything offsite, so either the main site or the co-lo can go down, and we still rock.
But with all the zones and NATs and rules and routing oddities, the network is less reliable than before. It takes days for them to fix routing problems or firewall problems. Every little problem means we need three people troubleshooting it instead of one admin.
Developers suspect that there's a simpler way to do it all, but since we're not networking experts, it's just a suspicion.
They have a huge hot lab in California where they have pre-configured switches, routers, etc. running and ready to go at a moment's notice. When my ISP went down, they sent (same day) three new racks of modems configured with our last known "good" configuration so all we had to do was unplug, pull, connect.
It would be redundant to have one on each coast, because they were able to get our stuff to us the same day in rural Mississippi.
-- Mark Lyon http://www.marklyon.org
However, the probability of both failing at the same time is:
0.1 * 0.1 = 1%
So as long as it can run on just one out of two, you get a ten-fold increase in stability.
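The arithmetic above generalizes to any number of copies, as long as you assume the copies fail independently of each other (a big assumption for two networks built from the same design). A minimal Python sketch, with the function name made up for illustration:

```python
def outage_probability(p_fail, copies):
    """Probability that all independent copies are down at the same time."""
    return p_fail ** copies

# 0.1 is the per-network failure probability used in the comment above.
for copies in (1, 2, 3):
    print(copies, "network(s):", f"{outage_probability(0.1, copies):.4f}")
```

Each extra independent copy multiplies the outage probability by the per-copy failure probability, which is where the ten-fold figure comes from.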
Because Cisco is very California-centric, and the fact is that when it comes to their switching and routing gear, there is very little "hardware" you can bring in to troubleshoot that's anything more than commodity software loaded onto a commodity PC.
The best thing they had was the input of (hopefully) knowledgeable Cisco engineers. God knows if they relied on Cisco TAC Level 1 support they'd still be down today.
That's how mirrored RAID arrays work: you increase your chances of a disk failure by adding more disks to the system due to probability; but your chances of recovering the data in the event of a crash go up, since more than one disk failing at once is unlikely.
Obviously, if something fails due to design, then duplicating the design duplicates the problem. While this can be a useful troubleshooting tool, it makes somewhat less sense for production environments.
I would be willing to guess that the network was one giant collision domain, and that the trouble springs from that. But it is just a guess.
Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves.
isn't that hard to troubleshoot. You look at the device ID that most recently made a Topology Change Notification, and then start looking at the hardware diagnostics for that system. If they're showing clean, reboot the switch. If, while the device is rebooting, the network stabilizes, you've found the problem. When the system finishes its boot, check the hardware diagnostics again (Ciscos only run H/W diags at POST, and a reset is the only way to re-run them); odds are that you'll see there's a failed component.
A previous poster nailed it too, simply back out the changes you made (obviously the problem you were fixing is of a lower magnitude than a total outage), and things should start working again.
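The first step described above (find the device that most recently sent a Topology Change Notification) can be sketched in Python. Everything in the data here is an assumption for illustration: the event list, the switch names, and the extra idea of counting TCNs per switch. Real data would come from each switch's spanning-tree status output, whose format is not modeled here.

```python
from collections import Counter

# Hypothetical (seconds since start, switch name) records of topology
# change notifications, e.g. written down by hand from each switch.
tcn_events = [
    (10, "closet-3"),
    (12, "core-1"),
    (15, "closet-7"),
    (16, "closet-7"),
    (19, "closet-7"),
]

def suspects(events):
    """Return (most recent TCN source, most frequent TCN source)."""
    latest = max(events, key=lambda event: event[0])[1]
    noisiest = Counter(switch for _, switch in events).most_common(1)[0][0]
    return latest, noisiest

latest, noisiest = suspects(tcn_events)
print("most recent TCN from:", latest)
print("most TCNs from:      ", noisiest)
```

When both answers point at the same switch, that is the one to run hardware diagnostics on first.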
was a human error. We were a smallish hospital (270 beds). I was the new IS Manager. I was looking for power outlets in the computer room for all the new equipment I had ordered. Well, there were a lot of dead plugs. Also, I was told to stop since electricity based things like that were left up to the union guys. No big deal. I called them and asked them to locate and label the outlets under the raised floor. While I was sitting at my desk later that day the power went off for a sec then on.... I got up and looked toward the data center. The lights AND the equipment went off then on. I ran in to find the union guys flipping switches on the UPS (on/off). They had stuck a light bulb w/plug in each of the open outlets and were flicking the power on and off to see what bulb was affected. They were on the equipment side of the UPS! All of our servers, network gear, and such took hard downs that day! Ahhh!!! Who needs technology to make things not work! This was the same union that wrote me up for moving a cube wall to get at an outlet. Moving furniture was a union duty!
"If you are on fire you can just stop, drop, and roll. If you fall into Lava you are just dead." - my 5yr old daughter
There will probably be many lawsuits after this.
The line of thinking will be something like this:
How many people died or will die, or get improper treatment because of this networking glitch? If the hospital is as large as described, certainly a number of persons were given inadequate healthcare while they were there.
Some may have a good case.
I am not up to speed on spanning tree, but speaking with a coworker after reading this article it is my understanding that Cisco equipment runs a new instance of spanning tree each time a new VLAN is created. As you can imagine in such a large campus environment there can be many tens if not hundreds of VLANs. In a short time you turn your network into a spanning tree nightmare. I'd much rather use some nice Extreme Networks (or Foundry or whatever) Layer 3 switching equipment at the core and turn off spanning tree. Use tagged VLANs from the closets to the core and voila no need for spanning tree... Use Cisco edge devices for WAN links. Building out a second rat's nest out of the same equipment seems foolish.
I'm not even sure how much Layer 3 switching equipment Cisco has; not much at all from my talking around in the past. It may not be possible to turn around and re-engineer it with the existing equipment; but I think that I would much rather throw out the vendor and reengineer the entire thing correctly before putting in a second shabby network.
I speak from having assisted on something like this on a very small campus environment (1,500 nodes maybe) and we basically tore out a disgusting mess of a LAN and implemented a fully switched, beautifully laid out network with redundant links to all closets, an 8 GB trunk between two buildings, etc. in the span of one weekend. Obviously there was tons of planning involved, cabling run in preparation and so on, but what a fantastic move it was.
Sure there were hiccups Monday morning, but everything was perfectly fine by the end of the week.
Two wrongs don't make a right.
Disable STP? And create, or at least take the risk of creating bridging loops? That will bring the network right back down to its knees!
No, disabling STP is NOT an option. Learning how to use STP properly is the option.
Insert offensive troll-style sig here. Please mod or respond appropriately.
There was no central organization that handled the networking for the associated hospitals, so more networks just got bolted on until it couldn't handle the load.
So what are the lessons?
1) Make sure your solution scales, and be ready in case it doesn't.
2) Make sure some overall organization can control how networks get connected.
The same explanation was floated in the Globe, but I don't buy it.
People doing debugging tend to fasten onto some early hypothesis and work with it until it is proven definitively false. Even if jobs aren't on the line people often hold onto their first explanation too hard. When jobs are on the line nobody wants to say the assumptions they were working under for days were wrong, and some people will start looking for scapegoats.
The idea that one researcher was able to bring the network down doesn't pass the sniff test. If this researcher was able to swamp the entire campus network from a single workstation it would suggest to me bad design. The fact that the network did not recover on its own and could not be recovered quickly by direct intervention pretty much proves to me the design was faulty.
One thing I would agree with you is that the hospital probably needs a separate network for life critical information.
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
I'm a little confused here:-
:)
Prob train A fails = 0.1
Prob train B fails = 0.1
Prob train A doesn't = 0.9
Prob train B doesn't = 0.9
So Prob neither fail = 0.9 * 0.9 = 0.81
So prob at least one fails = 0.19 = 19%
One of us has got the maths wrong.
Can someone who's not trying to remember his stats courses from years back tell me if it's me
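Both numbers can be right, because they answer different questions: 19% is the chance that at least one of the two fails, while 1% is the chance that both fail. A minimal Python sketch that enumerates the four outcomes, assuming the two trains fail independently with the 0.1 probability used above:

```python
from itertools import product

P_FAIL = 0.1  # per-train failure probability from the comment above

# Map "number of failed trains" to the total probability of that outcome.
outcomes = {0: 0.0, 1: 0.0, 2: 0.0}
for a_fails, b_fails in product((False, True), repeat=2):
    p_a = P_FAIL if a_fails else 1 - P_FAIL
    p_b = P_FAIL if b_fails else 1 - P_FAIL
    outcomes[a_fails + b_fails] += p_a * p_b

p_at_least_one_fails = outcomes[1] + outcomes[2]
p_both_fail = outcomes[2]
print(f"neither fails:      {outcomes[0]:.2f}")
print(f"at least one fails: {p_at_least_one_fails:.2f}")
print(f"both fail:          {p_both_fail:.2f}")
```

If the service only needs one of the two to survive, the number that matters is the last one.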
In the Real World, where you can't shut everything down at upgrade time, a PDP-11 connected to terminals was put in 25 years ago. The PDP-11 was replaced with a VAX, which ran in parallel with the PDP-11 while it was brought online. A few years later a couple of PC's (running DOS 3.0) were hooked up to each other via a Novell network, which was connected to the VAX. Ten years ago the VAX was replaced with a few servers, which ran in parallel with the VAX until they were trusted. Along the way various hubs, switches, and routers were installed. And upgraded as the need arose. The cables were upgraded, also as the need arose, and not all at once.
Best Slashdot Co
If you have something that's broken, and you need its functionality soon, and don't have a fucking clue as to what's wrong with it, you might want to replace it.
It may not be the right way to do it, but they're running a hospital, and might not have the time to let their network people puzzle it out.
In the article, it also states that they had just approved a contractor to do a network analysis: "on Oct. 1, hospital officials had approved a consultant's plan to overhaul the network - just not quite in time." If the article summary gives the correct information, I'll bet that large parts of their network were overburdened and hadn't been upgraded in years.
They were probably running at around 30-35% capacity and most networks get REAL funny at around that point. The following comment is rather telling: "The large volume of data the researcher was uploading happened to be the last drop that made the network overflow."
Another telling comment about the situation was: "network function was fading in and out".
I don't know what SAT is, but I think you made some mistakes.
:
... 98%, much better than the previous 90%.
if your 10% is the probability that 1 train will fail during NY -> LA trip then you've got the following probability
0 train fails = 0.9 * 0.9 = 0.81
1 train fails = 2 * 0.1 * 0.9 = 0.18
2 train fails = 0.1 * 0.1 = 0.01
which means that the probability of having at least one train going from NY -> LA is
#include "coucou.h"
While paper-based may seem like the best solution to you; what you don't realize is that paper-based is just a single phrase for the rest of these 'bases':
sneaker-based when everyone must run throughout passing paper;
warehouse-based when rows upon rows of storage are now required to keep all these bits of paper;
administrative overhead based when you realize that it takes two minimum-wage file clerks for every one form per desk - not functional area - to file and find and that takes a LOT of time;
and Mexican-based (yes, I said Mexican - who do you think most major businesses pay to do this? I know for a fact they ship things like this there by the truckload.) when you need cheap data entry and "error checking" [which is very unreliable when they can't read your language!] to enter information that could not be read from handwriting and then index them with a reasonable filing code.
Having spent a considerable amount of time as an admin assistant myself; and later as a document imaging and workflow support person, I can tell you that the cost and manpower savings far outweigh any perception or consideration for robustness or reliability.
The PHBs - or very likely the 'managed care' people (and that should have been put in quotes too) that provide a lot of the funding for the hospitals likely decided to save a few thousand since it wasn't lifesaving equipment or blood products/pharmaceuticals/etc.
I think with the interesting people, their lives can't possibly be wrapped up into a nice little package.
I have seen this happen before in an organisation I have worked for. It happened when a second Cisco network (installed by a large well known company) was joined to an existing one and the routing protocol problems of the new network corrupted the existing one. Solution was to disconnect the two and force the external company to rebuild the new network from scratch.
That it was a network upgrade, sometimes it's not, and you have no clue what was changed, by *someone else*...
As far as a parallel network, that's a tad overkill.. proper redundant pathways should be enough.. and plenty of packet filtering/shaping/monitoring.
and keep a tighter rein on what is allowed to be attached to the PRODUCTION network..
---- Booth was a patriot ----
Yes. You do things in parallel and you make things redundant. You are fabricating reliability out of unreliable components vis-a-vis TCP over IP.
Let's talk about real-time systems. No, not "Voice over IP" or "streaming video" crap, I mean REAL human grade real-time systems.
How do they get 99.99999% reliability? The components they use may be good, but nothing is that good! They get it by 1) removing single points of failure and 2) rigorously analyzing common mode failures (a sequence of failures that brings everything down).
How is this done? You put things in parallel. Machines are multi-homed. Critical applications are Hot-standby, as are their critical servers. You have the nightmare of constant Standby-Data Management (the Primary sending a copy of its every transaction to the secondary and to the tertiary) but when the power on one side goes out (of course your primary and standby are in different buildings connected to different power supplies, right?!) the secondary steps right up.
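Why common mode failures deserve the rigorous analysis mentioned above can be shown with a deliberately simple model in Python. The model and all the numbers are assumptions for illustration: each copy fails on its own with probability `p_independent`, and separately there is a shared cause (same design flaw, same power feed) that takes out every copy at once with probability `p_common`.

```python
def outage_with_common_mode(p_independent, p_common, copies):
    """Outage chance when copies fail independently OR share one common cause."""
    all_fail_independently = p_independent ** copies
    # Either the shared cause strikes, or it does not and every copy
    # still happens to fail on its own.
    return p_common + (1 - p_common) * all_fail_independently

# Made-up numbers: 1% independent failure, 0.1% common-mode failure.
for copies in (1, 2, 3, 4):
    print(copies, f"{outage_with_common_mode(0.01, 0.001, copies):.6f}")
```

Under these assumptions the outage probability drops quickly from one copy to two and then flattens out just above `p_common`, no matter how many more copies are added. Past that point only removing the shared cause helps, which is the argument for separate buildings and separate power.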
In the future, I would want to not be isolated from my friends in the Space Station.
The sad thing is I've seen this so many times before in different medical environments I've been in. They usually aren't very motivated to spend money on *any* infrastructure costs. Hospitals may spend some, but it's usually with the motivation to increase donations; "Oh look! It's shiny!"
Just like any other critical service, it costs big bucks to be prepared. How much you want to bet they 1) didn't have version control, 2) didn't have change control and
I am proud of them for one thing in particular. IMHO, your last line of redundancy, backups and recovery, etc. should ALWAYS be tangible. When you are involved with something life, death or riches, dead tree backups are the most reliable form I know. I am glad not everyone has lost their common sense to electron envy.
Democrats and Republicans only disagree about how to enslave you
In a large switched network spanning tree can save your butt and burn it. We try to test our switch changes before they are implemented. ON A TEST NETWORK.
I know it's hard for everyone to believe, but vendors lie and those whiz bang network tools can screw you over.
We have several thousand users on our campus with several thousand computers. We run about a half a dozen 6500 series Cisco Switches. Spanning tree re-calculations take about a second or 2. This is no big deal. And your traffic is re-routed nicely when something goes wrong. But if an interface (which is an uplink into the other switches) is freaking out and going up or down, the whole network will grind to a halt with spanning tree.
Test Network GOOD (if you have the money).
Is to not bother with a second network. They need to break the spanning tree up a bit with some layer 3 routers. Sometimes it is fun to have a nice big layer 2 network. It makes life easy. It sucks to debug it when one half of a leg goes down and you get spanning-tree loops. The switches go down in a ball of flames that way.
The solution is to put some edge routers in every building (Cisco 6509s with MSFC cards). Segment each building into different IP networks. Route between the networks. That way you may lose a building if the spanning-tree goes futzed but you won't lose the whole campus.
Sure you'll be a touch slower routing between the segments but you'll have much more reliability.
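As a sketch of what "segment each building into different IP networks" could look like on paper, here is a short Python example using the standard `ipaddress` module. The campus address range, the building names, and the choice of one /24 per building are all hypothetical.

```python
import ipaddress

campus = ipaddress.ip_network("10.20.0.0/16")       # hypothetical campus range
buildings = ["east", "west", "research", "clinic"]  # hypothetical building names

# Carve one /24 per building. Traffic between buildings then has to cross
# a router, so a layer 2 loop stays inside the building where it started.
plan = dict(zip(buildings, campus.subnets(new_prefix=24)))
for name, net in plan.items():
    print(f"{name:10s} {net}  ({net.num_addresses - 2} usable hosts)")
```

A real plan would size each subnet to the building and leave room for growth; the point is only that each building becomes its own broadcast and spanning tree domain.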
Now I hope and pray that I will But today I am still, just a bill
I'm surprised I'm not seeing the really simple, obvious answer here to the question that's posed in the story.
do you think the answer to having a massive and unreliable network is to build a second identical network?
Don't build a second identical network. Just set it up so that whenever a file is saved, it's dumped onto a secondary network that's locked down so tightly that it doesn't run programs, search for documents, or anything like that. It just provides documents and that's it. For instance, it could be just a bare bones, huge-ass listing of links to patient data in a single document, and you would just use Ctrl+F or some such to find the name, and then click through it to see a TXT or HTML document with the patient's data in it. That way, you can have fancy programs and extensive information and such on the normal network without risking the network instability that comes with them.
Of course not. Two solutions are more obvious:
- Fix or replace the existing network with a more reliable one (probably one that is less centralized so outages would not affect the entire campus); or
- If a second network is going to be added to provide reliable backup, then the second network should certainly not use the same technology as the first.
A third, and somewhat obvious, solution would be to make sure that
This might also be a good reminder to get very aggressive "liquidated damages" clauses in contracts like this, or to buy insurance. If a patient dies because of the network outage, I am sure that everyone in the supply chain will be named in the lawsuit.
The liquidated damage clause is intended to provide an unambiguous motivation for the technology provider to fix the problem quickly, while the insurance would cover all or a portion of the losses if there is a failure.
I would be extremely surprised if a huge campus like this one did not have a substantial number of different technologies in use, including wireless, and clearly networking them all into the same patient-records database is a challenge.
-- http://www.MarkWelch.com/ Pleasanton California
I live in the Boston area, and I have the perfect solution: they should hire me. I'll make sure their network never fails.
Well, maybe not. But I still need a job... =)
Networks are fragile, I'm surprised there aren't more massive outages.
The answer might be to hire competent network staff, and perhaps train some other IT employees with basic knowledge to help in emergencies. A second network seems a little extreme--both cost and management wise.
KISS: Keep it simple, stupid!
The article is a little light on technical details, but does anyone know what internal routing protocol they were using? We've got a network with 11 Cisco routers running OSPF. The routing changes happen very often, because there's a bunch of dial-ups and a few dozen routes that come and go with short-term connections (like backups from a remote office or running a CC authorization from a remote office). Everything works perfectly if none of our three newest routers are the first powered up. Those three are running IOS 11.0. After several calls to Cisco (we buy all Cisco internally and for our customer ends, so we get very good support from them) over the past three years, Cisco is still stumped as to what the problem could be. The OSPF section of the config file is only five lines long, so we (and Cisco) are sure there's no problem there. The hospital's problem sounds like it's of the same sort.
If this hospital is like any of the medical institutions I've worked for, then it's not unreasonable to expect that the IT group has been begging for more money to upgrade the infrastructure because they knew this kind of thing could happen. This usually falls on deaf ears at the doctor and senior administration level of the hospital because they see computers and networks as "magic" and don't take any time to understand the kind of reliance that is now placed on those systems. Also, it is very common for doctors to reject any spending on IT because it will bring their 8 figure salaries down to 7 figures and that is totally unacceptable!!! The story did say they are looking at $3 million for future upgrades, but that ONLY happened after this disaster.
Believe in things of which no person has ever learned
A lot of people here have said "build a 2nd network," to which some have basically said, "that's stupid, make your first network run right." I think that if we're talking about life and death of patients, a second network would be a good idea. It's like the high factors of safety built into things like, say, an elevator -- a failure can cause death, so you overbuild it. Remember that you don't have to make everything redundant, just those critical parts of the system. Maybe all the administrators can only use the primary network, but the blood testing labs and nurses' stations and such can use either primary or secondary. Cutting off non-critical traffic during an outage also helps keep the whole system more stable.
I hope "The machine that goes ping" does not require the network to run. My guess is that much of that equipment is plugged into the red outlets and can run on its own for a fair amount of time. If it is hooked up to the network, it is to report the machine status, which is independent of machine operation.
The network crash in question screwed up the document process, slowed everything down, and made life inconvenient, but I doubt anyone's life was at risk.
Remember, You are unique...just like everyone else.
I used to work for a systems integrator. As a matter of general practice, anything that was mission critical was on a separate network.... if not two different networks. This is most likely a WinXP machine on which somebody played with the STP/VLAN settings.
Speaking of teaching hospitals... Yes, they are large..... I live just a few miles from Wake Forest/Baptist Hospital. They add or renovate a wing a year.... There are always large cranes over the building... and since I'm looking for work... I applied there... Even though they had a plethora of positions open for Network Techs, and since I'm well overqualified, and cheap... you would have thought they would have hired me... they did not... they seem to go for bottom of the barrel regarding techs... cheapest... most likely they think A+ is the best cert you can get.
1) Introduce routed domains to separate groups of switches.
2) Ensure that more than one redundant switching loop does not terminate in a switch. I've had a single switch be the linchpin between two loops, had the switch go down and back up, and spanning-tree would not converge. If you want redundancy in your switches, spread out the loops.
3) Put QoS on the network. Identify mission-critical traffic and give it priority and guaranteed bandwidth (Cisco uses LLQ and CBWFQ using DiffServ, CoS, and IP precedence). That way even if someone puts loads of traffic on mission-critical paths, the effect should be limited to the local switch port or router, depending on how it is implemented.
4) Lastly, try a redundant network. You would still want QoS to stop a jabbering NIC from hosing your local bandwidth, and you might want to run diagnostics with your pocket PC or laptop, so you would still need to plug into that isolated net anyway. I would recommend that last due to cost, space, and connectivity issues.
Thank you.
Health Insurance Portability and Accountability Act.
Most health care organizations are far from clueless, believe me. Your average healthcare IT manager is well aware of HIPAA, as they've been working on the transaction and privacy aspects for quite a while.
The techs in the trenches may know less, mostly because the data security regulations (the 3rd, and largest portion of the HIPAA work) are not yet finalized. The real work doesn't begin until then: probably sometime later this year.
They need a smaller test environment that ALL changes have to be checked off on before implementing. They need images of all router configs they can roll back to if necessary, and they need a diff comparison tool (mantrap or somesuch) to see what's changed between their known good configuration and what exists now.
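As a rough sketch of the "diff against known good" idea, here is what it could look like with Python's standard difflib. The config lines and the "known-good"/"running" labels are made up for illustration; this is not the tool named above, just one way to see what changed.

```python
import difflib


def config_diff(known_good: str, running: str) -> list[str]:
    """Return unified-diff lines between a known-good config and the running one."""
    return list(difflib.unified_diff(
        known_good.splitlines(),
        running.splitlines(),
        fromfile="known-good",
        tofile="running",
        lineterm="",
    ))


# Hypothetical config fragments, for illustration only.
known_good = "hostname core-sw1\nspanning-tree vlan 1 priority 4096\n"
running = "hostname core-sw1\nspanning-tree vlan 1 priority 32768\n"

for line in config_diff(known_good, running):
    print(line)
```

An empty result means the running config matches the saved image, which is the check you would want before and after every change.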
Oh yeah, and they need a signed piece of paper with the moron's signature saying the change wouldn't impact the network. (A paper trail, as archaic as that seems.)
"Draco dormiens nunquam titillandus."
do you think the answer to having a massive and unreliable network is to build a second identical network?
Take the number of patients in the hospital, A, multiply by the probable rate of death should the network fail, B, multiply by the average out-of-court settlement, C. A times B times C equals X. If X is less than the cost of a redundant network, we don't build one.
do not read this line twice.
True. For the most part, having a Cisco cert means you studied hard on how to pass the cert; it really has little bearing on whether or not you can do the work. Not to say that a chimp can pass them, but I have met some people with CCNPs who couldn't troubleshoot a toaster problem.
Yes, I have some Cisco certs.
Carpe Deez
The network isn't too bad, but the incompetence of the people that run it astounds me. I've had large segments of it go out unnoticed by them because a UPS failed in a closet somewhere. Took them forever to track it down, too. In the end it's not the routers/switches that scare me, but the tons of old, outdated, unpatched Solaris machines that exist on this network. There are so many manufacturers out there that use crappy installations to run their MRI and CAT scanners that it terrifies me. It's really only a matter of time until all the doomsaying from me and my company (we're a third party vendor that supports a medical image archive) comes true. Unfortunately, I think it will collapse on us because the IS people will be unable to handle it.
This sounds like a case of poor network infrastructure management. That being said, you can't pin it all on IT. Organizations like this have networks that grow out of necessity, and are often nearly impossible to make large changes to.
Perhaps the seemingly ridiculous "secondary" parallel network can be put in place not for redundancy, but as a tool to migrate the existing devices to a properly configured and routed network. If STP brought the whole thing down to begin with, they are probably flat. VLANs and subnetting at closets with appropriate L1 redundancy and L3 routing is most likely the modern network design their IT staff has known for years that they should have, but never had the convincing argument they needed to get management to foot the bill and allow the service disruptions required to make the switch.
Do not fold, spindle or mutilate.
As for why it's good, it can provide layer two redundancy at a very small cost (basically the cost of an additional cable). While the same can be provided with a routed network, at layer 3, the cost is much higher, and a properly configured spanning tree based network will fail over very quickly and provide lots of trouble-free operation.
Beyond that, spanning tree can often protect people from themselves. What happens when that intern plugs a cable in the wrong place and creates a bridging loop? You guessed it: no spanning tree, no protection from bridging loops, and you can kiss all or part (depending on the design) of your network goodbye. Oh, and good luck finding that cable, especially if it's a big place. Don't think that intern is going to admit his error and get fired...
Swannie
:q!
Of course the answer is to build a completely separate network if I am the one who you will pay to build it ;)
This is obvious.
In truth the network problem was not a physical one, so the solution should not be a physical one.
The story I heard was that they had already approved the new network and it was still a few months away from being implemented when the old chewing-gum-and-baling-wire network prematurely fell apart.
Way I see it, there are 2 things that need to get done.
1) Policy change. Only production machines on a production network.
2) Topology change. Make it easy to get a non-production network connection so people don't violate #1
The most common reason spanning tree problems occur is that no one tells the spanning tree domain who the root of the network is. This leaves the switches to decide among themselves who gets to be the root. In most implementations of spanning tree, the lowest MAC address wins (strictly, the lowest bridge ID, which is priority first and MAC address second; with every priority left at the default, the MAC decides).
Because Cisco switches come with Spanning-Tree enabled by default, and because most network "engineers" don't know what spanning tree is, many corporate networks have a random switch serving as the root of the spanning tree. And so when spanning tree tries to do its job (fail over to a redundant link), it doesn't do a very good job, because the humans who set up the network were either lazy or ignorant.
Laziness and ignorance are the villains of most network problems.
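To make the root election concrete, here is a small sketch of how the winner falls out when nobody sets a priority. The switch names and MAC addresses are invented; the only real rule used is that the lowest (priority, MAC) pair wins, with 32768 as the 802.1D default priority.

```python
from typing import NamedTuple


class Bridge(NamedTuple):
    name: str
    priority: int  # 802.1D default is 32768; lower wins
    mac: str       # tie-breaker when priorities are equal; lower wins


def elect_root(bridges: list[Bridge]) -> Bridge:
    """Pick the bridge with the lowest (priority, MAC) pair."""
    return min(bridges, key=lambda b: (b.priority, int(b.mac.replace(":", ""), 16)))


# Hypothetical switches, all left at the default priority.
switches = [
    Bridge("core-1", 32768, "00:50:56:aa:00:10"),
    Bridge("closet-old", 32768, "00:00:0c:01:02:03"),
    Bridge("core-2", 32768, "00:50:56:aa:00:20"),
]
print(elect_root(switches).name)  # closet-old: the lowest MAC wins by default

# Lowering one priority by hand puts the root where you actually want it.
tuned = [b._replace(priority=4096) if b.name == "core-1" else b for b in switches]
print(elect_root(tuned).name)  # core-1
```

The point of the second half is just that one configured priority removes the guesswork about which box ends up as root.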
Now if Cisco implemented the follow-up to spanning tree, rapid spanning tree protocol (802.1w), like the rest of the industry, you'd eliminate the problem of impatient network admins trying to "tune" their network convergence times. Sadly, at most, you're going to shave 8 seconds off the 30 to 50 seconds of convergence time of STP unless you have a very small network. So tuning STP timers is an exercise in navel-meditation. RSTP (802.1w) solves a lot of the convergence time problems with original STP (802.1d) and is nicely backwards compatible.
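For anyone wondering where the 30 and 50 come from, here is a back-of-the-envelope sketch using the standard 802.1D default timers (max age 20 s, forward delay 15 s). It only does the arithmetic; it does not model how far the timers can safely be lowered, which depends on network diameter.

```python
def stp_convergence_seconds(max_age: int = 20, forward_delay: int = 15) -> tuple[int, int]:
    """Return (direct, indirect) worst-case reconvergence times for classic 802.1D."""
    # A port that must start forwarding spends forward_delay in listening
    # and another forward_delay in learning.
    direct = 2 * forward_delay
    # If the failure is not on a local link, stale BPDU information must
    # first age out (max_age) before listening and learning begin.
    indirect = max_age + 2 * forward_delay
    return direct, indirect


print(stp_convergence_seconds())  # (30, 50)
```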
I spent three years (1995-1998) at Perot Systems as a consultant designing and implementing hospital networks for Tenet Healthcare (2nd largest hospital chain in the US). There was at least one hospital that had the budget and the foresight to see that reliance on the network would do nothing but increase.
For that hospital, my network design was one that incorporated as much redundancy as possible at the time. For each patient care area, such as nurses' stations and ancillary areas such as radiology, cardiology, surgical theaters, etc., it was decided that each of the two network jacks would terminate in separate closets. This meant doubling the number of closets required in order to meet distance limitations, but the hospital had already started working on allocating that space for the closets. Also, for any important ancillary areas such as the lab and central supply, there were also two separate networks. For the server farms themselves, the Patient Care systems all had redundant connections to the primary and backup networks as well.
As each wall jack terminated into a different closet, each closet had two separate networks as well. Each closet would house the primary network for half of the jacks served, and the backup network for the other half of the jacks served. The fiber paths from each closet took disparate paths back to separate data center rooms, one external to the main building of the campus and one inside the main building. At the time, layer 3 switches, or switch routers such as the Foundry Big Irons or Cisco 6500s, were not available. So as much as I dislike using Spanning Tree, I used it at the time. All priorities were manually set, though, so there was no doubt where the root was and where it would move to in case of failure.
So, the switches terminated on another switch which was partitioned into several segments. Switch connections were made between the two data centers as well. Each segment had a connection to a Cisco 7507 Fast Ethernet port local to that computer room, and another in the second computer room. Forming the core were two sets of two Cisco 7507s. In order to prevent one OSPF network from affecting the other OSPF network, static routes were used (I would use BGP if I had to do it over again). Outside WAN connections were terminated redundantly on the two patient care networks as well.
While the primary network in the hospital also supported the non-patient care areas (such as administration), the backup network was only for the patient care areas. That was just to prevent the type of thing that happened in the article, where something non-patient care related ends up taking everything down.
Reverting to backup paper systems would be nearly impossible once the "tube" systems were sealed up. Much like the movie Brazil, hospitals used to have pneumatic tubes running all over the place, especially between the lab and the nurse stations. Running samples and results back and forth would definitely introduce a LOT of delay for a doctor trying to make a life and death decision.
I am sure that I would design things differently these days (for one, Layer 3 would go all the way to every single edge switch and collapse on a fast switch router), but I think the design probably held together well. I should check back in someday and see how long and well it lasted, if they did replace it.
Jay
Suppose you have a footbridge crossing a stream that takes heavy traffic. One day, it collapses with many people on it. One of the people on the bridge weighed 300 lb.
Would it be fair to say that the bridge collapsed because a 300 lb man was on it? It is completely clear that he contributed to the collapse of the bridge, in the sense that he contributed to the stresses on the structure. One might even say he is more responsible than a 100 lb woman who was also on the structure at the time.
But, we'd generally expect that a footbridge be engineered to support a 300lb man. Or if not, to isolate the failure (e.g. the planks under him might fall out, but the bridge as a whole should not collapse). It's part of the designer's job to anticipate this.
I've done a lot of troubleshooting in my time, of networks and other systems. One thing I've learned is that in the case of failure you just can't fasten on one thing that is out of the ordinary. At any given time, in a big enough system, something's bound to be out of the ordinary. Even if you can trace, step by step, the propagation of a problem from a single anomalous event, it is the capacity of the system to propagate the problem that is the real issue, at least if you take a conservative, defensive stance in design.
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
The ideal would be to actually use both networks, such as by using each on alternating weeks. This ensures that both networks can handle full normal operations and are both operational.
No.
You can only multiply them together like you have done if the two variables are independent.
Here this is clearly not the case; if the networks are identical and one fails, it is more likely that the second will fail because the cause might be identical.
Seeing as these paper forms hadn't been used for 6 years, I'd have to assume that the network was very reliable. Problems do occur from time to time, but it doesn't mean that the whole thing should be replaced. Just fix the issue and move on.
The Globe was indeed short on technical details. What puzzles me is that they say the network was down for four days.
NOT a rhetorical question:
Why didn't they power-cycle the whole complex? Maybe even literally? Presumably a hospital should be able to handle a short interruption in AC power... and presumably the network equipment wouldn't preserve the "I'm broken" state in nonvolatile memory. Why wouldn't a scheduled power outage for 10 minutes at 2 a.m. have been less disruptive than the network being down for four days?
Less drastically, couldn't they have called every operator and system administrator in and said "Synchronize your watches... at 2 a.m. power off every piece of computer gear within a hundred feet of your chair, then at 2:10 a.m. power them on again"?
"How to Do Nothing," kids activities, back in print!
...the TV show you intend to watch is there. It may begin a few seconds late, on purpose or as a result of some discrepancy, but the TV show you want to watch is there.
For the past few years, networks on the national and local levels have all been switching over to server-based content play-out. TV from Computers! How Exciting! How Wonderful! How... frickin' scary, for those whose jobs it has been to ensure that Buffy plays down at 8, and not 8:02, or 8:15, or - Powers-That-Be Forbid! - Wednesday morning.
Professional TV Master Control operations traditionally operate (often contractually) to "five 9's" of reliability, 24x7, assessed monthly. Full Stop, Period, End-of-Story. TV Master Control geeks, their supervisors, and the maintenance engineers who support them have ever been a priesthood apart when it comes to worship at the Uptime Altar.
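To put a number on "five 9's, assessed monthly", here is a quick sketch of the downtime budget. The 30-day month is an assumption made for the arithmetic; real contracts may define the period differently.

```python
def allowed_downtime_seconds(nines: int, period_days: float = 30.0) -> float:
    """Downtime budget, in seconds, for an availability of N nines over the period."""
    unavailability = 10.0 ** (-nines)  # five nines -> 0.00001
    return period_days * 24 * 3600 * unavailability


for n in (3, 4, 5):
    print(f"{n} nines over 30 days: {allowed_downtime_seconds(n):.1f} s")
```

Five nines over a 30-day month works out to roughly 26 seconds of total outage, which is why a late-starting show gets noticed.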
So what has their industry done, to ensure that all this "new wave" server and automation technology provides them with the same reliability as manual control and tape-based playback? Why, buy two of everything, of course! EV-ER-Y THING!
The server industry is only getting around to understanding that now, and is beginning to price their wares accordingly. I've attended dozens of vendor meetings over the past ten years where the salesguys, who six months earlier were selling mailservers to sysAdmins, are now selling their new video servers to Master Control guys. (Chum dished into a shark tank is the only comparable visual I can come up with.) What makes the sale is never the reliability of server over tape or (especially) the quality of server over tape, but desire of management to run more channels with fewer bodies. In the past this has led to management re-assessment of just how "inexpensive" server-based playout technology was and, in many cases I have seen, an increase in the number of channels created or planned as a means to justify the hardware costs.
The only debate point in most TV Master Controls comes down to what components are in-chassis redundant, which are external-chassis "hot" spares, and which are shelf spares.
My point (and I do have one...) is that it is unconscionable that a hospital, where lives are at stake, lacks the war-room mentality that an entertainment operation has. It's real simple at the end of the day to assess which components in a network chain (info or video or both) are critical, and buy two of them and keep it all lit and tested. Lives are at stake, and your signature is on the shift report? You rent a tertiary back-up system to bring online while you do your regular and frequent preventive maintenance on your primary and secondary.
The guys who take care of Buffy do it. I would have thought that the guys who take care of sick babies and grandmothers would be playing in the same league.
"Dammit, Jim, I'm a doctor, not a CCIE!"
--
"Open source is good." - Steve Jobs
"Open source is evil." - Microsoft
I've consulted here. No not on the network design! Desktop staff - big hello to the much expanded Research Support team!
AFAIK the BI network has gradually evolved from the 60s/70s and has included several massive growth spurts to incorporate the expansions, refits, windfalls etc. I once participated in an after-hours Cisco cutover where we yanked connections and waited for the data to flow (IPX round/robin servers listing) to find the specific segments affected. Very much a live trial and error process.
I got the feeling no-one is completely certain where/how all the data flows especially in the older Research segments e.g. Dana Farber. In fact, I'm guessing this is where the failure originated. Heavy duty number crunching and spanning tree errors lead me to some sort of distributed unix process across network segments. I want to blame a certain notorious geek (Dr P's) unix and mac labs but in truth it could be any one of the overworked and underfunded labrats in any of the segments.
The wiring closets used to look way worse than any posted at the recent Register article. A single Cat 5 cable run to a data jack is sometimes split to host 2 connections: unfortunately as the Research areas are grant funded, this is still bloody cheaper than a hub/switch! There is probably still some localtalk cabling in some labs, coax runs to a DG and Novell serial connections with 1 or 2 Mac Classic and SE holdouts running Dos and DG terminal emulators!!!
The network team in the Hospital (2 afaik) coped with daily routing failures, buggy failovers, the crappy Novell IPX 802.3 implementation and servers around every corner. Those folks team with a great desktop staff to nursemaid outdated equipment into the 21st century. It stuns me to this day what a superior job these folks did and probably do. They certainly made my job easier.
I feel this could have happened any time and disaster has been averted one too many times before. Halamka and the exec staff owe these guys more than just a few column inches of chagrined praise.
I develop business practices for large industries (including in the past the Trans-Alaska pipeline, et al.). These industries rely heavily on computers, and each has developed plans and trained their critical personnel for emergencies like power failures, computer failures, etc. Reliance on a single tool to protect safety & environment is bad, m'kay?
If one researcher sitting at his desk can take down the whole hospital system accidentally just by "overusing" the network, it's just a matter of time.
One of the fundamental concepts in building mission critical networks is what is referred to as "A/B Diversity" -- also sometimes called "salt and peppering". The idea is that you build two or more physically and logically separate network infrastructures and distribute the user population evenly across them. Thus, when a catastrophic failure occurs in one of the network "domains", the other will continue to function and business can continue in "degraded" mode.
We have a long way to go before data networks reach the stability of, for example, the public telephone system. The modern reality is that these networks are susceptible to a host of trivial yet potentially catastrophic failure scenarios. Spanning Tree (STP) is a very unreliable protocol. It has the potential to fail under many conditions such as the presence of physical errors, very high load, or as a consequence of a bug in the OS of one or many network devices.
Broadcast storms will occur. ARP storms will occur. OS bugs will crop up. Facilities personnel will play jump rope with your cable plant.
These problems can be mitigated, but not eliminated, by good network design. Thus, in environments such as hospitals and banks, where the cost of network downtime is too great to bear, it is common practice to build one or several parallel infrastructures as I have described.
FUNNY NETWORK TRICKS
I used to be in charge of the NOC at a large investment bank in New York. One of our buildings had six floors each housing 1,000 equities traders -- and this was during the stock market boom. Network downtime was not tolerated during trading hours. Therefore, the building was divided into four separate network domains connected to each other, server farms, and the WAN/MAN environment via a layer-3 core.
-- One time a printer became wedged and proceeded to send out ARP requests at the rate of thousands per second. The flood of messages pegged the CPUs of the routers servicing that domain and brought network services to a halt. Time To Resolution: 20 minutes (proud to say) to deploy sniffer, identify offending host, and rip its cable out of the wall with extreme prejudice. % of building affected: 25.
-- Over the course of several months, the Novell/NT team progressively decommissioned Novell servers and replaced them with W2K servers. Unfortunately, nobody thought to turn off the Netware services in the roughly 1,000 printers deployed throughout the building. On one glorious day, the very last Netware server was decommissioned in a particular domain leaving the printers in that domain with no server to "attach" to. The resultant flood of SAP messages became so great that the Cisco routers could not service them in a timely manner and they became cached in memory. The routers would gradually run out of memory, spontaneously reboot, and repeat the cycle. Time To Resolution: ONE FULL DAY. % of building affected: 25. Number of hours spent in postmortem meetings: ~15.
-- On several occasions, Spanning Tree failed resulting in loss of network services for the affected domain. Time To Resolution: 15 minutes to identify problem and perform coordinated power cycle of Distribution switches. % of building affected: 25.
And the list of stories goes on. You get the point.
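The "% of building affected: 25" figure in each story is just A/B diversity doing its job. A toy sketch, assuming 6,000 desks (six floors of 1,000) spread round-robin over four domains; the desk names and domain labels are invented:

```python
def assign_domains(users: list[str], domains: list[str]) -> dict[str, str]:
    """Round-robin users across domains so each domain carries an even share."""
    return {user: domains[i % len(domains)] for i, user in enumerate(users)}


def fraction_affected(assignment: dict[str, str], failed: str) -> float:
    """Share of users who lose service when one domain fails."""
    hit = sum(1 for domain in assignment.values() if domain == failed)
    return hit / len(assignment)


users = [f"desk-{n}" for n in range(6000)]
domains = ["A", "B", "C", "D"]
assignment = assign_domains(users, domains)
print(fraction_affected(assignment, "A"))  # 0.25
```

Because neighbouring desks land in different domains, a failure leaves three out of four people at any given cluster of desks still working.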
Interesting how even an army of Cisco engineers couldn't fix the problem. Perhaps a testament to how overly (and needlessly) complex Cisco's equipment is...and/or how bad their certification/training is.
As for "identical separate network", at my old company, we had a pair of Cisco PIX units that were configured in stateful failover; this means they share enough information that if one keels over, not a single connection is dropped.
Unfortunately, the PIX OS release had a bug that would cause a crash every so often, and guess what?
One would crash, then the second would crash immediately.
As mentioned, the issue here was completely improper network structure, with research and production networks one and the same. Does this mean someone can walk in with a laptop and start spewing data and/or false routing info and crash the entire hospital? The responsible parties should be FIRED, given today's labor market; absolutely inexcusable.
I'd also guess improper change control procedures were involved here as well.
Whoever handles the hospital's emergency preparedness should also be fired for not keeping staff familiar with alternative methods (gasp, PAPER!). What if they had a power failure? Happens all the time, and not always because of external causes... "keeping the power on" is not as simple as "install a big backup power plant for the place." As Exodus discovered once at their CA datacenter, backup generators don't always work.
As pointed out elsewhere, the key assumption is independence -- that breakdowns are like rolling dice. You have to consider the causes of the failure. Virtually every realistic scenario you can think of has a dependent aspect which links the possible failure of trains.
Here are some examples of the ways in which failures can occur that have implied linkages:
(1) Both trains are damaged by an earthquake.
(2) New instructions for routine maintenance were printed incorrectly (e.g. causing the mechanics to under torque a critical bolt).
(3) The firm has cut the maintenance budget and is neglecting routine maintenance.
(4) The train is sabotaged by disgruntled employees or terrorists.
(5) Fuel filters delivered by manufacturer are faulty or incorrectly manufactured.
(6) Design flaw means trains do not meet expected performance specifications.
In reality, failures tend to be linked rather than independent. You can't use simple multiplicative logic; you have to use Bayesian logic. P(B|A) ≠ P(B): the probability of B given A is different from the probability of B in the absence of any other information. The FAA and military know this. If an aircraft crashes, then all aircraft of the same type are typically grounded for a period while the problem is analyzed to eliminate some kind of systematic flaw.
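A small numeric sketch of why the distinction matters. All three probabilities are made-up illustrative values, not measurements of any real network; the only thing taken as given is the identity P(A and B) = P(A) * P(B|A).

```python
def p_both_fail_independent(p_a: float, p_b: float) -> float:
    """Naive estimate: treats the two failures like separate dice rolls."""
    return p_a * p_b


def p_both_fail_dependent(p_a: float, p_b_given_a: float) -> float:
    """P(A and B) = P(A) * P(B|A), which allows for a shared cause."""
    return p_a * p_b_given_a


# Hypothetical numbers, for illustration only.
p_a = 0.01          # chance network A fails in some period
p_b = 0.01          # chance network B fails, knowing nothing else
p_b_given_a = 0.5   # chance B fails once A has already failed (same bug, same config)

naive = p_both_fail_independent(p_a, p_b)
linked = p_both_fail_dependent(p_a, p_b_given_a)
print(f"independent estimate: {naive:.6f}")
print(f"with shared cause:    {linked:.6f}")
```

With these numbers the linked estimate is about fifty times worse than the naive one, which is the gap an identical second network can hide.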
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
I work at a hospital, on the networking side of things. It's a fairly large hospital, and we've got some pretty amazing tech here that runs this place. But BY LAW we have downtime procedures. ALL STAFF MUST KNOW THEM. We have practice sessions monthly in which staff use downtime procedures (pen and paper) to ensure that if our network were to be completely lost, we could still help patients. It's the friggin law. Whoever fucked up and hadn't looked at downtime procedures in 6 years should be fired. That's just bullshit.
I don't know how that hospital was able to pass inspections.
Etherhose (10b5 thick coax) is a useable networking technology. It has very good resistance to RFI/EMF. Lots of hospitals still run it, on links where 10 Mb/sec is sufficient.
Etherhose is no longer a good investment because it is labor-intensive to work with (vampire taps, and thick, heavy cabling) and because nobody is developing the technology any more.
Today, fiber optics might seem a better choice for noise isolation, since the cost has come down to a reasonable level.
However, glass has the same potential for future obsolescence as etherhose - I have a half-dozen mutually incompatible fiber links here. And termination, splicing, and interconnection of fiber is at least as difficult as working with etherhose... having done both, I'd say drilling for a vampire tap is easier.
In short, don't replace a working piece of infrastructure needlessly (wait until you project a need for additional bandwidth) and for noise isolation cat 5e in a grounded metal conduit is probably your best bet. Large diameter, professional quality conduit runs through electrically noisy areas are costly but also a very safe investment.
I wouldn't knock that old etherhose - it does its job quite well, far better than the 10b2 thin coax that replaced it ever did. And it's far more physically sturdy than anything else outside of conduit.
I was hoping for at least a funny. :)
"Not knowing when the dawn will come, I open every door." - Emily Dickinson
This outage was caused by a researcher's data creating a storm of data which outpaced the network's ability to cope. The problem was allowing the research data to flow unimpeded across vital systems. The solution is to implement methods of controlling bandwidth, not just routing.
In order to prevent this from happening again, engineers should analyze the system to determine where to put data storage. In this case, almost certainly (although the article is unclear) data was stored in a central location but spanned across several servers and then backed up in another location. One part of the solution is to have distributed data storage spread across the institution and then that data backed up (across a separate network) to a central location.
The data storm itself could be prevented by using QoS bandwidth management. Of course, every network user believes that he/she should have unfettered access to all the bandwidth available, but quietly implementing some well-known techniques for limiting bandwidth usage would have at least mitigated the damage.
Finally, routing protocols other than spanning-tree or OSPF should be used. Creative implementation of internal addressing schemes (10.0.0.0 IP addresses) and a combination of BGP and last-resort static routes would certainly help to avoid these sorts of problems. I'm also wondering whether a *nix box running Zebra in critical locations might not reduce the problems. Certainly Zebra can remove the routing load from the Ciscos and, with plenty of RAM and processing speed available on PCs nowadays, could probably improve routing efficiency when a circuit goes down.
But the key to this problem is bandwidth management not routing management. Of course, the next problem could be routing. One seldom has the budget to solve everything.
No one ever had to evacuate a city because the solar panels broke!
What do you mean 'reduced to'? What else are they good for?
There was an electrician named Joe at the place I used to work who was counting the days to retirement. He never did a lick of work he didn't absolutely have to, and he never cared if his work would last 24 hours after his retirement.
The NEC (National Electrical Code) was the first casualty of his attitude. But not the last!
I discovered that he carried a heavy-duty plug in his pocket with the two hot leads wired directly together. He called it his "pigtail".
When Joe needed to find what circuit breaker controlled an outlet, he jammed in the pigtail (with an audible *snap* of electric arc) and then calmly walked down to the nearest breaker box to see what had tripped.
You could tell he was working in a building because you'd see scientists running down the hallways tearing their hair and screaming "My research!!! My research!! Ten years of research ruined!!" as the voltage spikes destroyed their equipment...
I read in a book about the number zero that I mentioned here before that the real cause was someone accidentally left a zero in a line of code, rather than a person pressing zero and crashing the entire network. Perhaps someone tried to execute a command that led to this faulty code being used by the ship's computers?
Maybe this was proven to be false later, I dunno.
Kind of funny though...
Yes, there is always the possibility you might be born blind, but most people don't have that genetic defect. They have two eyes which work very well, even if one of them happens to be broken by a random toothpick accident.
Redundancy is always good in a system where uptime is king. That is why so much of nature has organisms based around semi-redundant designs.
--
Internet Explorer (n): Another bug -- that is, a feature that can't be turned off -- in Windows.
Yes, i'm not the wizard of words (or apparently math ;) this morning am i?
My main reason for posting was to appease my instinctual reaction to the (somewhat intuitive) mistake sometimes made that having twice the stuff makes it twice as good/reliable, etc. That holds true for availability (10-fold in fact), but you'll get less in the case of reliability, and manageability is also a concern since you'll have to constantly check the backup network (if it's not in active use, failures are harder to find, or to predict for that matter). Also, failures aren't always randomly dispersed throughout the network, as the model might imply. You have to figure out how much failure each part of the network can sustain.
So, throwing more hardware, developers, or whatever at the problem isn't a real solution. Figuring out what was wrong in the first place will let them spend their money more wisely, rather than letting all that hardware go to waste, doing nothing. They could possibly get all the redundancy they want with less than twice the hardware and maybe even increase performance of the network during regular usage.
ok, i've totally overspent my $0.02.
Unfortunately more 'radical' minds prevailed and the project was eventually abandoned after $100M.
Seastead this.
First, I don't have all the details of what happened, nor do I have any idea of what the network looked like prior to the outage. However, I have a general design philosophy based on my experience with teaching hospitals and telco networks.
The concept is that of "a network of networks", much like Cisco's DCN solution for telco operators. This is a series of interconnected networks that are capable of standing alone in an emergency. These networks are normally oriented around particular application/traffic/usage patterns. An example would be a research network for research workstations, a lab network, a cardiac care network, and so on.
All of these networks could exist as separate layer 2 VLANs trunked back to the facility data center, if bandwidth is available. Within the data center, layer 3 routing could handle traffic that needed to cross between these networks. The data center would also have separate networks for each application group so that applications generally aren't able to interfere with each other.
Obviously this is an overly broad synopsis and leaves out many details; it is also just as obvious that I'm talking about a campus environment here and not a WAN, where the same theory will work, but with different implementation.
Sig??? I don't need no stinkin Sig!
If the hospital had been paper-based, this tragedy would not have occurred.
Tragedy? It sounds like they handled it quite well, and nobody died because of it.
The advantage of a paperless hospital is that you don't have to wait an hour for the lab results or X-rays to get to you (or longer, if they get lost). That saves time, letting the hospital save more patients.
If the problem is with spanning tree protocol then they already have redundant connections in place (or they wouldn't need spanning tree). From my experience spanning tree works really well on its own, and is even a little robust to people fucking with it. So the question is, why not deny everyone access to the switches and routers except for one or two administrators? It sounds to me like if they kept people from screwing with the network it would be fine.
Someone failed their vision test...
See that percent sign? The little "%" thingy?
Go Wireless, Use copper for Backup
I'm not talking 802.11, but military-grade spread spectrum. It would cost a lot less than laying new copper. And if some a$@hole inadvertently starts a DoS attack you could just flip off the main antenna array at your NOC for 10 minutes and let the network reset itself. Also throttle your nodes to, say, 10 Mbit. That way one node can't take down your entire network.
If a storm or other activity takes out the antenna array you still have the old copper. Keep a switch (a physical switch, not a hub-like switch) so that you could walk over to a panel and switch your node over to copper in a jiff. If they both fail then go carrier pigeon, CBs, or cellphones. Nothing like a good old analog message in a pinch.
You say things that offend me and I can deal with it. Can you?
First of all, this was apparently a flat layer-2 network. From the information I have seen, it was a very large network. Spanning tree is a wonderful protocol and layer-2 networks are not bad things, BUT spanning tree is very complex in a large network, and latency is going to be an issue if there are no routed boundaries to control traffic. I have experience in designing networks for hospitals (and financial institutions and universities and gov't institutions), so I am aware that implementing layer-3 to the edge is not necessarily feasible for many reasons - financial, legacy setups, etc. That being said, however, there should be some layer-3 at some point to segregate traffic and protect the critical pieces of the network. Identify the critical points of the networks and put redundancy there - i.e. the server farm, critical care monitoring systems, WAN connection. All network equipment vendors have some type of redundancy feature that would take care of automatic failover for these devices.
Full redundancy is impossible - are you really going to have dual NICs in every workstation and expect that everything would just work in the event of a failover? First of all, the expense would be incredible, and the maintenance would be a nightmare. If they are like most institutions, they are already understaffed and overworked - they wouldn't be able to keep something like that together. Dual-home closet switches to redundant routers/switches that are in turn dual-homed to a redundant core. Servers should have multiple NICs that are attached to multiple switches specifically to provide redundancy.
The worst problem here, though, was not the network itself. It was probably the most prevalent problem common to all institutions - they had no test environment. As multiple other posters have pointed out, this experimental database should never have been attached to a production network, regardless of the expected impact it might have. The key word about it is EXPERIMENTAL - you don't know how it might impact anything. As long as there is no separate environment for testing, there's really no such thing as redundancy no matter how the network is configured.
Say, for example, that the application took down the primary network, so the secondary comes up and takes over. Did anyone realize what caused the failover? Probably not, since a properly configured network will failover in a matter of seconds. So, the application is still running. How long until the secondary network fails as well? Then all of the expense and reconfiguration that went into building the redundant network were for nothing.
If this hospital is like most, they have an extremely diverse hodgepodge of equipment - some incredibly old stuff that they keep around because it works and some really cool cutting edge gadgets that everyone can see the benefit of. They've also expanded the network as needed and tried not to take anything down when they did it, so what they've ended up with is a logical rat's nest. VLANs probably have been created, but they're probably trunked everywhere, because the goal of the expansion was to connect more devices, not to segregate by function. Hospitals don't get down time, so it's not a simple thing to say that things have to be reconfigured. Odds are that the workstations may not all even be on DHCP, so changing an IP may require a person (back to that understaffed thing again) touching possibly hundreds of workstations. Yes that needs to be done, and I don't know a single network admin who wouldn't agree, but when you have to have outages cleared by upper management who are going to be chewed by the board if the time frame goes longer than you expected, it turns into a lot more than just what is actually best for the network.
The solution: use down time wisely. Stage implementations and keep them within the allotted time frames. And DOCUMENT. I know - nobody likes to do the documentation, but I think we can all say that it's a lot easier to plan migrations if you have documentation of what is currently there. Realize that no matter what you do, it's not going to last forever. Your cable plant probably has a lifespan of 10 years (not to say that you may not get 20 or even 30 years out of it, as long as you're willing to stay slow), but your network devices will probably only be there for 5 years. Are you still going to be there for the next change? Probably not, so be nice to the company and to the people who follow after you and document.
Just my $0.02, and I'm just that blond chick, so what do I know anyway...
So what are you going to do? Bleed on me?
0 train fails = 0.9 * 0.9 = 0.81
1 train fails = 2 * 0.1 * 0.9 = 0.18
2 train fails = 0.1 * 0.1 = 0.01
which means that the probability of having at least one train going from NY -> LA is
... 98%, much better than the previous 90%.
Erm... to quote you, "I think you made some mistakes."
100% - 1% = 99%.
81% + 18% = 99%.
How'd you get 98% out of those numbers?
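For anyone who wants to check the quoted arithmetic, here is a minimal Python sketch. It assumes two independent trains that each fail with probability 0.1, as in the figures above; the function name `outcome_probabilities` is made up for illustration.

```python
from itertools import product

def outcome_probabilities(p_fail, n_trains=2):
    """Return {number_of_failed_trains: probability} for independent trains."""
    dist = {k: 0.0 for k in range(n_trains + 1)}
    # Enumerate every combination of failed (True) / arrived (False).
    for outcome in product([True, False], repeat=n_trains):
        prob = 1.0
        for failed in outcome:
            prob *= p_fail if failed else (1 - p_fail)
        dist[sum(outcome)] += prob
    return dist

dist = outcome_probabilities(0.1)
# At least one train arrives unless both fail.
at_least_one_arrives = 1 - dist[2]
print(dist, at_least_one_arrives)
```

Under those assumptions the three cases come out to 0.81, 0.18 and 0.01, so "at least one train arrives" is 0.99 either way you add it up. The independence assumption is the weak point: two networks that share a design flaw do not fail independently.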
That this happened in a teaching hospital, rather than a large corporation, makes their response much different.
They have been open about the problem, in a way that a for profit corporation could never be. This allows the rest of the world to learn from the experience.
I don't really understand all of the comments saying a redundant network infrastructure is bad/stupid/etc.
If your network is critical to your business, you should absolutely consider backing up every bit of that network with one (or more?) redundant components. This means every router should have a redundant pair, every physical network link should be redundant (including how it's routed through the building), every firewall, switch, etc. If you have mission-critical servers, they should have two NIC cards. Upgrades should never occur on both "sides" of the infrastructure at the same time, and both sides should be capable of running alone.
Not only does this type of configuration resist failures, but upgrades or configuration changes to the A or B side should never impact the other side, and if it does, you should be able to shut down the offending sections without impacting availability.
If your network staff doesn't understand these concepts, you desperately need to train them better. If the expense cannot be justified by management, then that's a business decision and when failures like this occur, they should not be surprised.
Slashdot network engineers (armchair and professional): do you think the answer to having a massive and unreliable network is to build a second identical network?"
Since Michael asked it like that I will leave behind my network engineer role (professional) and pick up my role as armchair mathematician.
The item to be doubled is a network. Unreliability and massiveness are qualities of that network. So, using the distributive property of multiplication this would give us the equivalence of one network that is twice as large and twice as unreliable as the original.
I lived in Boston until 1999 and had my (ruptured) appendix removed at that hospital. That place is absolutely HUGE, many city blocks in size. Its network must be huge too and that's the problem. A LAN that size HAS to be sub-netted into smaller segments! Now, I'm not a whiz bang Network engineer, but I do know when something's done WRONG, and it sure seems like this is the case here. Building a parallel WRONG network won't solve the problem, it'll DOUBLE the problem! There are many gifted people here....why not come up with a solution for them here? Consider it a public service to a very public oriented hospital.
Well, mostly transparent to end stations.
Some workstations turn up their ethernet link by software, and then try to use the port right away to, for instance, obtain a DHCP lease.
Spanning tree starts doing its work as soon as it sees ethernet link. So, there's a delay between the time the link comes up and when traffic starts to pass.
Apple's DHCP implementation was bitten by this on some of their machines, affecting the startup of the Appletalk stack, which unlike DHCP, will not retry its initial auto-configuration and address discovery.
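The timing problem described above can be sketched in a few lines of Python. The only figure taken as given is the classic 802.1D default forward delay of 15 seconds, which a port spends once in listening and once in learning before it forwards. The retry schedules and the function name `first_successful_attempt` are hypothetical, chosen only to illustrate the point.

```python
FORWARD_DELAY = 15  # seconds; classic 802.1D default, applied twice (listening + learning)

def first_successful_attempt(attempt_times, link_up=0.0, forward_delay=FORWARD_DELAY):
    """Return the time of the first attempt sent after the port starts forwarding, or None."""
    forwarding_at = link_up + 2 * forward_delay
    for t in attempt_times:
        if t >= forwarding_at:
            return t
    return None

# Hypothetical schedules (seconds after link-up), not taken from any real client.
retrying_client = [0, 4, 12, 28, 60]   # keeps retrying with a growing backoff
one_shot_client = [1]                  # tries once and gives up

print(first_successful_attempt(retrying_client))
print(first_successful_attempt(one_shot_client))
# forward_delay=0 approximates an edge port that skips the wait.
print(first_successful_attempt(one_shot_client, forward_delay=0))
```

A client that keeps retrying eventually gets through; one that tries once during the first 30 seconds never does, unless the port is configured to skip the delay.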
I've always been skeptical of "intelligence" added to layers below 3. There are always unforeseen interactions and consequences to ANY variance from a set standard.
- Peter
INsigNIFICANT
Let your imaginations wander, and ponder a point in the future when all of our health care facilities will be run on Microsoft... .
Read the EFF's Fair Use FAQ
Mail any lucrative^h^h^h^h^h^h^h^h^h job offers to:
Former MIS Director,
Beth Israel Deaconess hospital
Boston, MA 02215
No application can cause a spanning tree loop. It is simply impossible.
A spanning tree loop causes broadcast frames - correctly used in small numbers in many different circumstances - to loop endlessly about the network (clogging it up), using paths that are provided for redundancy, but which are normally stopped from passing traffic by the "spanning tree protocol".
There are 2 likely causes:
Unidirectional link failure. If a connection between switches passes traffic in only one direction (normally they are bi-directional), then spanning tree can be 'fooled' into allowing traffic on a path that creates a loop and lets frames loop endlessly.
Misconfiguration of switches, possibly combined with erroneous cabling. If spanning tree is configured off on a port, (or, maybe, put into a mode called portfast), it's possible for interconnection of switch ports (through a crossover cable or other means) to cause this to occur.
A third possible cause is that the spanning tree software itself screws up and allows a loop when it shouldn't have. This was known to occasionally happen in Cisco switches some years ago. I haven't heard of it lately.
This all happens way below the application layer. Unless the application is specifically written to send huge numbers of broadcast frames (there is no legitimate reason for an app to do this), it couldn't bring down the network. And even if it did, this would not be a 'spanning tree loop' and disconnecting the offending station would immediately fix the problem.
Probably, the network should be using routers to partition it into smaller LANs. But this can still happen to any single LAN so created and if it happens to the one your servers are on, you're still cooked.
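To make the looping behaviour concrete, here is a toy Python model, not a simulation of any real switch. It assumes each switch floods a broadcast frame out of every link except the one it arrived on, and that layer-2 frames carry no hop limit. The function `flood` and the topologies are invented for illustration.

```python
def flood(links, origin, max_steps=20):
    """Flood one broadcast frame from `origin`; return frames in flight at each step."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    # Each in-flight frame is (switch it arrives at, switch it came from).
    in_flight = [(n, origin) for n in sorted(neighbors.get(origin, ()))]
    history = []
    for _ in range(max_steps):
        history.append(len(in_flight))
        if not in_flight:
            break
        nxt = []
        for at, came_from in in_flight:
            # Forward on every link except the one the frame arrived on.
            for n in sorted(neighbors[at]):
                if n != came_from:
                    nxt.append((n, at))
        in_flight = nxt
    return history

triangle = [("A", "B"), ("B", "C"), ("C", "A")]      # redundant link, nothing blocked
tree = [("A", "B"), ("B", "C")]                      # same switches, C-A link blocked
full_mesh = [("A", "B"), ("A", "C"), ("A", "D"),
             ("B", "C"), ("B", "D"), ("C", "D")]     # four switches, nothing blocked

looped = flood(triangle, "A")
pruned = flood(tree, "A")
meshed = flood(full_mesh, "A", max_steps=5)
print(looped, pruned, meshed)
```

In the pruned topology the frame dies out after reaching every switch. In the triangle it circulates for as long as the model runs, and in the full mesh the number of copies doubles at every step, which is the "clogging" described above.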
"that's not encryption - it's a new perl script that I'm working on..." - from some Matrix parody
Staff had to scramble to find old paper forms that hadn't been used in six years so they could transfer vital patient records and prescriptions.
I call sensationalist bullshit. It takes at most 15 minutes to switch over to a fully paper hospital here.
Either that or their hospital is really really shitty.
I live in a giant bucket.
Well, that's it you see! Alan Ralsky thought it said spamming tree protocol and tried to use the network!
General Relativity: Space-time tells matter where to go; Matter tells space-time what shape to be.
Well, this explains what happened when I was there after being hit by a truck. The doctors were great but the place was very disorganized. Hrm.
... And one that is hard to argue with because it seems to make so much sense is post hoc, ergo propter hoc. For something to be a valid proposition, it must meet two conditions, necessity and sufficiency. When someone pulls an "It happened after that happened" trick to pin blame, they are meeting the necessary condition with the apparent causal relation of actions. This is the stronger condition intuitively for people. But under the sufficient condition, we must show that there is evidence to support the causal relationship. Supporting a claim is counterintuitive. Just ask any foreign policy maker in the US...
Comparing it to Windows will be a moot point, since El Dorado is going to have a 40% larger code base than XP.
I was an operations manager for a large hospital for several years, and planning for things such as this should be a number one goal for IT staff.
The first rule in anything to do with hospitals is to ensure that they have disaster plans in place and that these are tested on a regular basis. The disaster plans should include scenarios such as total power outage, failures of vital equipment etc.
The second rule I used was to ensure that in critical areas there was a second independent network path that if needed could be isolated from the rest of the network. Usually this meant putting in a run of fibre that bypassed buildings etc.
The third rule is to ensure that vital equipment can be run without need for a network. Nothing should be so dependent on networking that if there is a failure it will stop it from working. If networking is a requirement (eg Medical Imaging) that network should be independent from the main network.
The fourth rule is to ensure that there is a secondary method of accessing electronic patient records in the event of an extended downtime. I wrote an application that would dump the most needed patient information and leave it available on PCs in critical areas in query only mode. This allowed access to most of the patient details for using the patient forms.
To begin with, it's unlikely a CCIE would have required a consultation w/ the inventor of the protocol, as they'd already have a firm understanding of the inner workings of STP. And there is no "quick start" to a CCIE. That's why there's less than 10,000 of them in the world. And why, even in the depressed tech market, CCIEs are still followed by headhunters bearing offers of $100K+/yr jobs...
Now, however, if two vlans get bridged (a computer with a wire in one vlan, and a wireless card in another vlan), the forwarding tables on the switches get confused because there are multiple paths to the same stp root.
Excuse me? Since when do end hosts forward BPDUs? Since when do end hosts forward _anything_, for that matter?
Unless you're going the el cheapo route, there's no reason that individual computers should be forwarding traffic. Okay, I'm sure some of you could show me valid scenarios, but I'll bet that none of them are realistic production environments (unless management has been incredibly stupid).
The last time I had a problem with a spanning tree algorithm I lost 12 points on my CS final!
Ok, so seriously, I'd be embarrassed if I screwed up a spanning tree algorithm on a test. If it took Cisco engineers 6 days to fix it, it musta been something really quirky, most likely the software not configuring something right. I can't imagine an application problem that would hose a network past a power toggle.
paintball
The above is specious. I know nothing about the network or campus in question. I'm sure the folks on hand know what to do. Good luck.
Friends don't help friends install M$ junk.
Why not buy M$ wireless 802.11b install W2K/XP on every computer and set up an MS exchange server. Who needs BSD when you have M$ :)
just kiddin', the uptime of the above mentioned network would be measured in nanoseconds, and then they will have to switch to the MS paper'n'pen method
Live for the present, learn from the past, and dream of the future!
My wife is a doctor. From what I've observed hospitals tend to be penny wise and pound foolish, particularly with regard to their computer systems. Largely for financial reasons they are generally unwilling to hire the IT professionals and spend the $ they need to do the job right.
The computer systems at my wife's medical school were apparently run by a herd of poorly trained monkeys. Systems would crash constantly, admin policies were absurd, and very little was done to fix anything. At her current hospital, the residents in her department are stuck with machines that literally crash 10+ times daily. Nothing is done to fix them because that would take expertise, time and $, all of which are either in short supply or withheld.
Hospitals really need serious IT help and it is a very serious problem. This article just illustrates how pathetically bad they do the job right now. I wish I could say I was surprised by this but I'm not.
The kidneys are internally redundant. You only need a 10% kidney function to continue to survive. Ditto for liver and other organs (aside from heart). They take years of abuse via smoking or drinking before they finally start to wear out to the point of causing system collapse.
--
Internet Explorer (n): Another bug -- that is, a feature that can't be turned off -- in Windows.
Algorhyme
I think that I shall never see
A graph more lovely than a tree.
A tree whose crucial property
Is loop-free connectivity.
A tree that must be sure to span
So packets can reach every LAN.
First, the root must be selected.
By ID, it is elected.
Least-cost paths from root are traced.
In the tree, these paths are placed.
A mesh is made by folks like me,
Then bridges find a spanning tree.
---Radia Perlman
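The steps in the poem (elect the root by ID, trace least-cost paths from it, keep those paths, block the rest) can be sketched in Python. This is a centralized simplification for illustration only: real bridges reach the same result by exchanging BPDUs with their neighbours, and the function name, bridge IDs and link costs here are all made up.

```python
import heapq

def spanning_tree(links):
    """links: {(bridge_a, bridge_b): cost}. Returns (root, forwarding_links, blocked_links)."""
    adj = {}
    for (a, b), cost in links.items():
        adj.setdefault(a, []).append((b, cost))
        adj.setdefault(b, []).append((a, cost))
    root = min(adj)  # "By ID, it is elected": lowest bridge ID wins
    dist, parent = {}, {}
    heap = [(0, root, None)]  # (path cost from root, bridge, upstream bridge)
    while heap:
        cost, node, via = heapq.heappop(heap)
        if node in dist:
            continue  # already reached by a cheaper (or equal, lower-ID) path
        dist[node] = cost
        if via is not None:
            parent[node] = via
        for nxt, link_cost in adj[node]:
            if nxt not in dist:
                heapq.heappush(heap, (cost + link_cost, nxt, node))
    forwarding = {frozenset((n, p)) for n, p in parent.items()}
    blocked = {frozenset(link) for link in links} - forwarding
    return root, forwarding, blocked

# Hypothetical three-bridge mesh with equal link costs.
mesh = {(1, 2): 4, (2, 3): 4, (1, 3): 4}
root, forwarding, blocked = spanning_tree(mesh)
print(root, forwarding, blocked)
```

With the mesh above, bridge 1 becomes root, the two links touching it stay forwarding, and the 2-3 link is the redundant one that gets blocked.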
"Those who make peaceful revolution impossible, make violent revolution inevitable" - JFK
It's always nice to see those people doing useful work for a change.
Sounds like a standard UPS system to me. You have the grid feeding banks of batteries. The batteries feed the hospital. The generators are between the grid and the batteries, but they are not wired in such a way as to allow a generator failure to disrupt power from the grid. If the grid fails, no one notices because the batteries are what feed the hospital. After a few minutes, the generators start and they keep the batteries full. Once the grid is back on, the generators shut down.
I'd rather you do it wrong, than for me to have to do it at all.
Actually there is more truth to that than you know. They can't keep any files locally and simply have to not rely on the systems for anything critical. Recently they had their computers taken away for 3 weeks (refurbishing offices), which was a terrible inconvenience, but it didn't bring work to a halt. Just made everyone's lives harder than they had to be.
Most union tradespeople I've encountered do actually take pride in doing their jobs right and well. You just have to realize that even the best ones won't generally work any harder than the work rules require them to.
My advice is to get to know any tradespeople you may have to deal with on a regular basis for things like electrical work, moving furniture, etc. It's amazing how far just treating them as fellow skilled professionals will get you. Resorting to bribery (aka "gifts") can also help. If you give the union electrician a bottle of nice scotch or a box of cigars when he adds some new circuits in the server room he is much more likely to come out at 3am on a Sunday morning when you need him NOW.
Happy Fun Ball is for external use only.
And how will you know if the backup network even works? Of course you could test it. But will it work under the kind of extreme live stress that would take down the primary network? And what if the issue is simply load that neither network can fully handle? Could you run both networks in tandem correctly? It sounds to me like the original problem was that the network was designed by someone who thinks of the switches as magical black boxes that will take care of everything ... someone that assumes perfect abstraction. That 3 million dollars to build a parallel network I think could be better spent by hiring competent people to build a correct network that includes redundancies structured in the right places. No matter what you do, there will be some single points of failure, such as the very logic used to switch over to the backup network if that's what you have (which would be a big waste if it sat there idle). The network engineering people need to know and understand those single points of failure and have plans to deal with failures at those points.
now we need to go OSS in diesel cars
To elaborate on what zzyrc said, TTL won't decrement when it passes through a typical layer 2 switch - only a router or other layer 3 device.
Never never never smoke crack before geometry class!
Probable results:
Tech Public Policy stuff
Meanwhile, the hospital was figuring out how to run at its usual pace without the 100,000 e-mails it usually sends a day.
So that's where they're doing all those penis enlargements!
"I'm tired of all this 'Aren't humanity great' bullshit. We're a virus with shoes" - Bill Hicks
Build a second parallel network because the network designers didn't know wtf they were doing? How are you going to fail over to this network? STP? (insert obnoxious chortle here)
10 bridged hops = big flat network = they needed layer 3 switching in the first place, ergo, the network was badly designed. The very fact that a root bridge STP reconverge occurred indicates a poorly framed implementation plan and obviously no backout plan.
Find somebody who knows what the hell they are doing and have them do a network audit.
Cisco Systems, the hospital's network provider...
I wonder if that open source ATC comment was for that UK airspace shutdown on May 17th...
It's nothing open source could fix...
he shouldn't worry though, we've put a fix in for that (Works damn well, too!)
In the future, I would want to not be isolated from my friends in the Space Station.
There's always some error in the calculations; in this case, the train driver forgot to lace his shoes.
#include "coucou.h"