Hospital Brought Down by Networking Glitch
hey! writes "The Boston Globe reports that Beth Israel Deaconess hospital suffered a major network outage due to a problem with spanning tree protocol. Staff had to scramble to find old paper forms that hadn't been used in six years so they could transfer vital patient records and prescriptions. Senior executives were reduced to errand runners as the hospital struggled with moving information around the campus. People who have never visited Boston's Medical Area might not appreciate the magnitude of this disaster: these teaching hospitals are huge, with campuses and staff comparable to a small college, and many, many computers. The outage lasted for days, despite Cisco engineers from around the region rushing to the hospital's aid. Although the article is short on details, the long term solution proposed apparently is to build a complete parallel network. Slashdot network engineers (armchair and professional): do you think the answer to having a massive and unreliable network is to build a second identical network?"
This is almost too good... could someone have hacked into their network and deliberately taken it down?
Time is an illusion, lunchtime doubly so. --Ford Prefect
i said n/t
according to the coverage in the printed 11/25/02 Network World magazine I read yesterday. My immediate reaction was that this person who brought down the net using his research tool should not have been using a production network.
Large campus networks hosting extremely critical live applications may need to be subdivided by more than a switch, yes.
Having an identical network would almost be like using RAID across several hard drives to have the data backed up (RAID 0+1, I think). It would almost guarantee a connection, unless of course they both go down. But how likely is that? :)
scapegoat
... "an old boys' network"
do you think the answer to having a massive and unreliable network is to build a second identical network?
No, the answer is to fix what is broken. This might be a new concept to some people, but things don't break on their own. If you're doing network upgrades and something stops working, REVERT THE CHANGES AND FIGURE IT OUT. This is reckless and irresponsible behavior.
Is your browser retarded?
Yes, a second, fully redundant network would be "good" from a stance of giving better fail-over potential.
But will anyone know when one network fails? If not, then how will they fix it? If they don't fix it, then doesn't that mean that they really only have one network?
Which puts them right back to where they were.
Of course, if they put a redundant network in, then fix their problems to try to prevent this issue happening in future, then they'll be in much better shape the next time their network gets flushed with the medical waste.
If the first one's bust, how's a second going to help? :)
I must admit that redundancy is a wonderful thing for servers, power supplies, etc., but for infrastructure?? Having identical copies of routers kicking around is extremely useful, but cost effectiveness comes into play. If you can afford it, I can't argue with the logic.
Hmm, a second parallel system. Would this include parallel wiring closets? I suspect that the cost involved (I once worked on a project team that was merely replacing wiring at a hospital, and it took 6 months) would have them continue to use existing wiring runs. You have now created a single point of failure for *both* networks.
For those who think that a hospital wouldn't cut corners in that way, think again. I know what we had to do with our project, and I for one will never let anyone I know stay at that hospital. If they were willing to cut there, where else will they cut?
Anon Coward
that's what u get when u sign onto monopolyware. fact is, with all the fancy toys that docs use like MRI and tomography, i haven't met one that knows anything about a computer. in fact they were probably glad their stuff crashed. in fact, it was probably a setup to get the old system back! lousy docs :(
"You never want a serious crisis to go to waste." - Rahm Emanuel
A Bank in America [;)] had an outage back in 1998 where all their StrataCom went down for similar reasons. The Gateway/Network Engineering group had been saying for a couple of years that we needed more redundancy, but senior executives just saw the expenses and not the liability ... until every single StrataCom went down.
We had to rebuild the entire network ... it took a week. All non-critical traffic had to be cut off as we pushed everything through the backup T1s and ISDN lines. It cost the bank MILLIONS of dollars.
Suddenly, that backup network was real cheap. They are now quite proud to tout their redundancy.
I would look at making the original network more reliable and what the hell, if the hospital has money to burn, redundancy is a good thing. I didn't read the article. Was this caused by some knucklehead that was testing in a production environment?
Money not found! A)bort, R)etry, D)eclare Bankruptcy
It's all fun and games until Bobby loses an eye because his doctor couldn't read his forwards.
-Eezy Bordone
do you think the answer to having a massive and unreliable network is to build a second identical network?
Am I the only person getting tired of story submitters using Slashdot to support their personal agendas?
do you think the answer to having a massive and unreliable network is to build a second identical network?
I think the answer is to disable spanning tree.
We had a similar problem here (large academic installation, hundreds of workstations, several sites) with things (before my time, I hasten to add) being one Big Flat Network (shudder) using IPX primarily and Novell. Needless to say, this was not good. I've since redesigned things using IP and multiple VLANs; however, there is still the odd legacy system that needs access to the old net.
My solution was to tap the protocols running in the flat network and to put these into VLANs that can be safely propagated around the layer 3 switched network and presented wherever we wish. The entire "flat" network is tapped into a VLAN, and the IP services running on it are routed in. If there are problems with either network, we just pull the routes linking the two together if it were to get that bad.
I also used to work at a teaching hospital (Wishard for Indiana University) and I learned more there about networking and systems support than in years of college. I remember one day we found a still-used piece of thick-net (you know...old firehose). It was connecting the ambulance office's systems to the rest of the hospital. The rest of the hospital ran on DEC VAX clusters and terminals. To be fair, they have gotten much better (I don't work there anymore either), but this wasn't the first hospital network I had seen that truly terrified me, and it hasn't been the last.
[RIAA] says its concern is artists. That's true, in just the sense that a cattle rancher is concerned about its cattle.
Or as was said in the movie "Contact" -
"Why buy one when you can buy two at twice the price?"
No. They did everything right. Falling back to paper and runners is the best they could do to safeguard patients' lives. An 'identical' network would be susceptible to the same failure modes as the primary.
That said, hopefully it wasn't really six years since they had run a disaster exercise where they pretended that the computers were unavailable...
Help children born unable to swallow - www.tofs.org.uk
This is an American hospital in Boston. Geez, if you are going to bash Israel, at least do it with something credible...
In six years they never thought to have a backup/redundant system in place in case of a failure like this?
Even the best networks will come unglued sooner or later. It's surprising to see that most businesses' networks need prime operating conditions to function properly.
Be you Admins? nay, we are but lusers!
Yes, I think having a 2nd network for a vital system is a good idea. This sort of thing is used all the time for things like fiber rings, where you have the work and protect paths. If the primary work path goes down (cut, maintenance, whatever) then you roll to the protect. Yes, it is a bit more expensive, but in a case like this maybe it is needed.
man
No manual entry for
This event has a lesson for us. Of course, I expect the Slashdot response to be something along the lines of "they should have used Linux," but the true fact is that all technology, even Linux, is unreliable. Rather than dicking around with which OS can provide the best network, we should accept that none of them provide the robustness necessary for things like hospitals and fire departments, and what we really need is to reduce our dependency on technology altogether. If the hospital had been paper-based, this tragedy would not have occurred.
Karma: Good (despite my invention of the Karma: sig)
and now their server gets slashdotted, administrators run around trying to work out what to do - rebooting NT boxes. Well, the article is on the Boston Globe, so their server is okay.
Doesn't really matter. If you had to deal with Med Students as we do, you'd die before you went to the doctor. Trust me.
Ok, so here's an SAT question for ya:
IF you have one train going from NY->LA that's likely to break down 10% of the time, and you get a second identical train going in the opposite direction, what's the probability that one of the trains will fail?
(number of trains) * (probability of failure)
= 2 * .10
= 20%
The more components in the system, the more likely it is that parts of the system will be down. This isn't to say that the extra redundancy isn't useful, but it doesn't give you more reliability...it decreases it. So additional management costs are incurred in making sure that enough redundancy is always available to compensate for parts of the system that are down, and replacing bad components.
Spanning Tree is a pretty robust protocol. Problems usually arise when admins get impatient with convergence times and start messing with the timers.... or enabling features like portfast, backbonefast and the like.
And your change in routing policy is going to affect spanning tree how?
How do you handle mobile users? What about dialup static IP addresses from multiple RAS devices?
Hand-editing of routing tables works only in the most simple of networks.
This is what they were testing.
Should there be a few replacement devices on hand for failures? Yes. Should there be backups of the IOS and configurations for all of the routers? Yes. Should this stuff be anal-retentively documented in triplicate by someone who knows how to write documentation that is detailed yet at the same time easy to understand? Yet another yes.
If it is so critical, it should be done right in the first place. If a physically damaged or otherwise down link is ESSENTIAL to the operation or is responsible for HUMAN LIFE, then there should be duplicate circuits in place throughout the campus to be used in the event of an emergency; just like certain organizations have special failover or dedicated circuits to other locations for emergencies.
Last but certainly not least, the 'researcher', regardless of their position at the school, should be taken severely to task for this. You don't experiment on production equipment at all. If you need switching fabric, you get it physically separated from the rest of the network, or if you really need outside access, you put controls in place like a firewall, etc., to severely restrict your influence on other fabric areas.
I think with the interesting people, their lives can't possibly be wrapped up into a nice little package.
Spanning-Tree Protocol is a link management protocol that provides path redundancy while preventing undesirable loops in the network. For an Ethernet network to function properly, only one active path can exist between two stations.
Multiple active paths between stations cause loops in the network. If a loop exists in the network topology, the potential exists for duplication of messages. When loops occur, some switches see stations appear on both sides of the switch. This condition confuses the forwarding algorithm and allows duplicate frames to be forwarded.
To provide path redundancy, Spanning-Tree Protocol defines a tree that spans all switches in an extended network. Spanning-Tree Protocol forces certain redundant data paths into a standby (blocked) state. If one network segment in the Spanning-Tree Protocol becomes unreachable, or if Spanning-Tree Protocol costs change, the spanning-tree algorithm reconfigures the spanning-tree topology and reestablishes the link by activating the standby path.
Spanning-Tree Protocol operation is transparent to end stations, which are unaware whether they are connected to a single LAN segment or a switched LAN of multiple segments.
see this page for more info
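To make the description above concrete, here is a rough sketch of the end result only: given a looped topology, keep a loop-free tree and put every other link in standby. It assumes equal link costs and a pre-chosen root, and it uses a plain breadth-first search in place of the real protocol (no root election, no BPDUs, no timers). The switch names and the `spanning_tree` function are made up for illustration.

```python
from collections import deque

def spanning_tree(links, root):
    """Split links into (forwarding, blocked) sets for a looped topology.

    Simplification: breadth-first search from a given root with equal
    link costs. Real STP elects the root and exchanges BPDUs; none of
    that is modelled here.
    """
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    visited = {root}
    forwarding = set()
    queue = deque([root])
    while queue:
        switch = queue.popleft()
        # sorted() only makes the result deterministic for the example
        for peer in sorted(neighbours.get(switch, ())):
            if peer not in visited:
                visited.add(peer)
                forwarding.add(frozenset((switch, peer)))
                queue.append(peer)

    all_links = {frozenset(link) for link in links}
    return forwarding, all_links - forwarding

# Hypothetical topology: three switches cabled in a triangle, i.e. one loop.
links = [("A", "B"), ("B", "C"), ("A", "C")]
forwarding, blocked = spanning_tree(links, root="A")

# If the A-B link dies, recomputing brings the standby B-C link into use.
after_failure, still_blocked = spanning_tree([("B", "C"), ("A", "C")], root="A")
```

With the triangle, the B-C link ends up blocked; after the A-B failure, nothing is blocked and B is reached through C.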
How complete of a moron are you?
No wait... you've already answered that in your post:
A total moron!
This hospital is not in Israel, it's in Boston, Massachusetts. Try reading the article before wasting everyone's time with your idiocy.
Did a company as large as Cisco seriously have no appropriate troubleshooting equipment on the WHOLE of the east coast or anywhere closer than California? What kind of mickey mouse support outfit are they running??
Why on earth would a researcher be plugged into the same network as time-sensitive patient information? Yes, it's expensive, but critical functions should be separated from non-critical functions. And the critical network needs to be fairly rigidly controlled (i.e., no researchers should "accidentally" plug into it). Note further information in http://www.nwfusion.com/news/2002/1125bethisrael.html
Having worked on several database systems, I can say that improper planning and maintenance are the main causes of large, unwieldy and ultimately unstable systems. In large organizations where IT is not a major business area, e.g. a hospital system, the existing database system has probably been augmented several times to increase functionality (and capacity) - probably by different parties as well. This multiple patching approach results in instability as the database has grown far beyond its original intended purpose. However, due to the vast stores of data, and the repeated tinkering with it by various parties, migration is a nightmare.
Rebuilding the system from the ground up poses several major hurdles. The first is the systematic migration of data while the original database is still running! For hospitals, this database is clearly mission critical!
The other problem is mimicking the interface and relationships within the database, so as to reduce retraining. Retraining is a major problem when switching systems. All in all, it is a major undertaking to redo the database, and probably not viable, in either time or money, for the hospital.
Sadly, I have to contend that duplication of their system is the best short to medium term solution.
oh, put a sock in it already.. the juvenile racism and hatred gets old after a while.
The network can be designed (hierarchically) such that a network fault will isolate only a part of the network that can be locally fixed and does not affect the entire network. The important network servers should be redundant and can be made fault tolerant by automatic switchovers during server faults. The main switches and routers can use loopback addressing to other network cards in case a network card on the switch or router goes down.
If you're just using a Primary Domain Controller, that could be your problem. I'd recommend adding a backup PDC, as well as a Tertiary Domain Controller, and add an X.25 backup network layer to give you hot-swappability and real-time rollover capabilities.
If Spanning Tree is what brought them down, and it had campus wide effect, then they're running their production networks as one big flat layer 2 network. This is almost definitely the root of the problem. Modern network design would divide the campus (and often individual buildings) into multiple subnets, using routing to get between nets. That way if something like STP goes wrong in one spot, it doesn't affect the others.
Building a parallel identical net is almost definitely the wrong answer. Especially if it uses the same design and equipment!
Unfortunately, often older networks grow in a piecemeal way and end up like this, commonly having application level stuff that requires it to be flat. The job of a good network engineer (and diplomat) is to slowly have all the apps converted to being routable and then subnet the net.
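For what the subnetting step looks like on paper, here is a minimal sketch using Python's standard `ipaddress` module. The address block and the building names are invented for the example, and the addressing plan is only half the job: the isolation comes from routing between the subnets instead of bridging them.

```python
import ipaddress

# Hypothetical campus address block and building names, for illustration only.
campus = ipaddress.ip_network("10.20.0.0/16")
buildings = ["east-campus", "west-campus", "research", "clinical"]

# Carve one /24 per building out of the campus block.
plan = dict(zip(buildings, campus.subnets(new_prefix=24)))

def building_for(address):
    """Return the building whose subnet contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for name, subnet in plan.items():
        if ip in subnet:
            return name
    return None

for name, subnet in plan.items():
    print(f"{name:12} {subnet}")
```

A host at 10.20.2.17 lands in the "research" subnet, so a layer 2 meltdown there stays behind that subnet's router interface, assuming the router does not bridge between segments.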
As much as we all laugh at the Windows "close all your applications and reboot" way of "solving" problems, there is something to be said for rebooting systems: If all else fails, you can quickly restore the system to a known working state.
Ideally, rebooting a system should be unnecessary. But practically speaking, people make dumb mistakes -- like the bug which caused the telephone crash of 1990 -- and Bad Things can happen. Rebooting a system should be a last resort; but it should be a last resort which always works.
Tarsnap: Online backups for the truly paranoid
So when I submitted this a week ago it gets rejected, but now that Mr. hey! submits it, it gets accepted. I see what's going on. Damn I need more punctuation in my handle.
"Not knowing when the dawn will come, I open every door." - Emily Dickinson
This whole situation arises from poor training and poor design. Having several friends that work in hospitals, I know that they typically don't offer a lot of money for IT/Network jobs, and this is what happens when underpaid (read: inexperienced) people are allowed to run such a network.
Done ranting now, can you tell I was laid off a while ago and now stuck in a contract with a network designed by a bunch of inexperienced people?
Swannie
:q!
The crisis began on a Wednesday afternoon, Nov. 13, and lasted nearly four days.
Did that mean the doctors couldn't play Quake for four days!?
I assume by "network" they just mean backbone. Obviously the backbone is what failed, otherwise it wouldn't have brought down the entire network. Obviously they need some redundancy there.
Do your VLANs share the same physical cable? If so, how are they connected? Do you use a one-armed router?
-ted
How many other organisations can run at all if their network dies? And if the execs really were running around as errand boys, that's just great. Nice to see the senior staff actually caring enough to help keep things going. Really they need a procedure to deal with the networks failing rather than a redundant network.
Nice math, but the point here is that only 1 train has to arrive, thus in those 20% we can still safely travel.
Linux hosting for $2.50/mo
The network at my company is quickly becoming so complex that neither I nor the admins can troubleshoot it.
We have redundant everything -- firewalls, routers, load balancers, app servers, etc. The idea is to have half of everything offsite, so either the main site or the co-lo can go down, and we still rock.
But with all the zones and NATs and rules and routing oddities, the network is less reliable than before. It takes days for them to fix routing problems or firewall problems. Every little problem means we need three people troubleshooting it instead of one admin.
Developers suspect that there's a simpler way to do it all, but since we're not networking experts, it's just a suspicion.
"Staff had to scramble to find old paper forms that hadn't been used in six years so they could transfer vital patient records and prescriptions."
By law they have to have a disaster recovery plan, all US hospitals HAVE to. So they "scrambled" to the disaster recovery plan, made copies of the forms, and were up. Big deal.
However, the probability of both failing at the same time is:
0.1 * 0.1 = 1%
So as long as it can run on just one out of two, you get a ten-fold increase in stability.
That's how mirrored RAID arrays work: you increase your chances of a disk failure by adding more disks to the system due to probability; but your chances of recovering the data in the event of a crash go up, since more than one disk failing at once is unlikely.
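The two numbers being argued over in this thread are easy to check. A quick sketch, assuming each unit fails independently with the same probability (which a shared design flaw would break); the function names are just for illustration:

```python
def at_least_one_fails(p, n):
    """Probability that one or more of n independent units fail."""
    return 1 - (1 - p) ** n

def all_fail(p, n):
    """Probability that all n independent units fail together."""
    return p ** n

p = 0.1  # the 10% figure from the train example
for n in (1, 2, 3):
    print(f"n={n}: some failure {at_least_one_fails(p, n):.3f}, "
          f"total outage {all_fail(p, n):.3f}")
```

For two units that gives 0.19 for "something is broken" (not 0.20) and 0.01 for "everything is broken", which is why adding a mirror means more repair work but fewer total outages.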
Obviously, if something fails due to design, then duplicating the design duplicates the problem. While this can be a useful troubleshooting tool, it makes somewhat less sense for production environments.
I would be willing to guess that the network was one giant collision domain, and that the trouble springs from that. But it is just a guess.
Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves.
The probability that one train will fail is still 0.1. It is irrelevant how many trains there are; the probability that any given one will fail will be 0.1 (of course, assuming the trains fail independently). The probability that both trains will fail simultaneously is 0.1*0.1
isn't that hard to troubleshoot. You look at the device ID that most recently made a Topology Change Notification, and then start looking at the hardware diagnostics for that system. If they're showing clean, reboot the switch. If, while the device is rebooting, the network stabilizes, you've found the problem. When the system finishes its boot, check the hardware diagnostics again (Ciscos only run H/W diags at POST, and a reset is the only way to re-run them); odds are that you'll see there's a failed component.
A previous poster nailed it too, simply back out the changes you made (obviously the problem you were fixing is of a lower magnitude than a total outage), and things should start working again.
Even if a network is engineered perfectly, someone could maliciously or accidentally physically harm it and cause downtime. Having a second, perhaps lower-end, backup network when you have people's lives at stake (missing prescription information could quickly cause a fatality) is a necessity, especially for a hospital with such a good reputation. Plus, the telecomm industry giants such as Cisco are just DYING for more business, so this could also help the economy :)
Actually we have a horribly complex network (Australian national network, multiple extranets to governmental offices, national dialup service, plus DialConnect global roaming hook-up). It's 95% statically routed.
Management is a key issue, with tools to aid deployment the next. Static routing in large networks is not impossible; sometimes you have to set limits and miss out on some "cool" features.
Probably having old school network engineers is a big part of this setup. They don't like giving up control to automated systems.
was a human error. We were a smallish hospital (270 beds). I was the new IS Manager. I was looking for power outlets in the computer room for all the new equipment I had ordered. Well, there were a lot of dead plugs. Also, I was told to stop since electricity-based things like that were left up to the union guys. No big deal. I called them and asked them to locate and label the outlets under the raised floor. While I was sitting at my desk later that day the power went off for a sec then on.... I got up and looked toward the data center. The lights AND the equipment went off then on. I ran in to find the union guys flipping switches on the UPS (on/off). They had stuck a light bulb w/plug in each of the open outlets and were flicking the power on and off to see what bulb was affected. They were on the equipment side of the UPS! All of our servers, network gear, and such took hard downs that day! Ahhh!!! Who needs technology to make things not work! This was the same union that wrote me up for moving a cube wall to get at an outlet. Moving furniture was a union duty!
"If you are on fire you can just stop, drop, and roll. If you fall into Lava you are just dead." - my 5yr old daughter
There will probably be many lawsuits after this.
The line of thinking will be something like this:
How many people died or will die, or get improper treatment because of this networking glitch? If the hospital is as large as described, certainly a number of persons were given inadequate healthcare while they were there.
Some may have a good case.
I am not up to speed on spanning tree, but speaking with a coworker after reading this article, it is my understanding that Cisco equipment runs a new instance of spanning tree each time a new VLAN is created. As you can imagine, in such a large campus environment there can be many tens if not hundreds of VLANs. In a short time you turn your network into a spanning tree nightmare. I'd much rather use some nice Extreme Networks (or Foundry or whatever) Layer 3 switching equipment at the core and turn off spanning tree. Use tagged VLANs from the closets to the core and voila, no need for spanning tree... Use Cisco edge devices for WAN links. Building out a second rat's nest out of the same equipment seems foolish.
I'm not even sure how much Layer 3 switching equipment Cisco has; not much at all from my talking around in the past. It may not be possible to turn around and re-engineer it with the existing equipment; but I think that I would much rather throw out the vendor and reengineer the entire thing correctly before putting in a second shabby network.
I speak from having assisted on something like this in a very small campus environment (1,500 nodes maybe), where we basically tore out a disgusting mess of a LAN and implemented a fully switched, beautifully laid out network with redundant links to all closets, an 8 GB trunk between two buildings, etc., in the space of one weekend. Obviously there was tons of planning involved, cabling run in preparation and so on, but what a fantastic move it was.
Sure there were hiccups Monday morning, but everything was perfectly fine by the end of the week.
Two wrongs don't make a right.
Disable STP? And create, or at least take the risk of creating bridging loops? That will bring the network right back down to its knees!
No, disabling STP is NOT an option. Learning how to use STP properly is the option.
Insert offensive troll-style sig here. Please mod or respond appropriately.
It's just too complex for people to understand.
One of the first things we learned when we got to this part of our networking class was that working out a spanning tree for more than a few nodes is damn near impossible for a human. We learned how to diagnose the problem if it occurred; we even studied ethernet frame dumps to watch the spanning tree build itself. But if you weren't there to watch the tree get built, there's no way at all to guess what exactly went wrong with it. You just pull all the bridges and routers, reset them all, and start over.
This was probably caused by a combination of bad hardware, and some nut plugging two branches of the network together that were already connected somehow. The hardware should have recognized this as a loop and cut it, but for some reason it didn't.
Well, hopefully they won't repeat the same loop in their backup network.
If I have been able to see further than others, it is because I bought a pair of binoculars.
There was no central organization that handled the networking for the associated hospitals, so more networks just got bolted on until it couldn't handle the load.
So what are the lessons?
1) Make sure your solution scales, and be ready in case it doesn't.
2) Make sure some overall organization can control how networks get connected.
To be compliant you should have massive amounts of validation documents covering everything from how to build *the whole system from scratch* in the event of an error, to your business continuity plan, your disaster recovery plan etc etc etc.
Your initial User Requirement Spec document when the system was implemented should have included details of failsafes and redundancy and been built in from the word go.
You would be on very shaky legal ground if you ran a system that was not FDA compliant like this.
What *really happened here*?
The same explanation was floated in the Globe, but I don't buy it.
People doing debugging tend to fasten onto some early hypothesis and work with it until it is proven definitively false. Even if jobs aren't on the line, people often hold onto their first explanation too hard. When jobs are on the line, nobody wants to say the assumptions they were working under for days were wrong, and some people will start looking for scapegoats.
The idea that one researcher was able to bring the network down doesn't pass the sniff test. If this researcher was able to swamp the entire campus network from a single workstation, it would suggest to me bad design. The fact that the network did not recover on its own and could not be recovered quickly by direct intervention pretty much proves to me the design was faulty.
One thing I would agree with you on is that the hospital probably needs a separate network for life-critical information.
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
Overall, as long as patient care wasn't diminished (the degree of diminishment is debatable), it is probably good that things like this occasionally happen. It's a great way to test non-technical systems that usually only get tested in a wide-spread disaster.
Sex - Find It
I'm a little confused here:
Prob train A fails = 0.1
Prob train B fails = 0.1
Prob train A doesn't = 0.9
Prob train B doesn't = 0.9
So prob neither fails = 0.9 * 0.9 = 0.81
So prob at least one fails = 0.19 = 19%
One of us has got the maths wrong.
Can someone who's not trying to remember his stats courses from years back tell me if it's me? :)
It depends on how it's set up. I think of it in terms of parallel or serial wiring. Your example is serial, in that if one goes down they both go down, thereby decreasing reliability. If you ask the question a different way, such as "What is the possibility that both trains break down" (i.e., parallel--if one goes down it doesn't affect the other one), the probability is .10*.10, which is .01: more reliable.
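The serial/parallel picture composes, which is handy for the "two networks, one wiring closet" case raised elsewhere in the thread. A small sketch, assuming independent components and using the 90% figure from the train example as a stand-in availability (`series` and `parallel` are just illustrative helper names; `math.prod` needs Python 3.8 or later):

```python
from math import prod

def series(*availabilities):
    """Everything in the chain must work, so multiply the availabilities."""
    return prod(availabilities)

def parallel(*availabilities):
    """Only one branch must work: 1 minus the chance that all branches fail."""
    return 1 - prod(1 - a for a in availabilities)

one_network = 0.9
two_networks = parallel(0.9, 0.9)                # about 0.99
# Two networks that both run through the same (hypothetical) 90% closet:
shared_closet = series(0.9, parallel(0.9, 0.9))  # about 0.891

print(one_network, two_networks, shared_closet)
```

Under these made-up numbers, the duplicated network behind a shared closet comes out slightly worse than a single network with no closet in the path, because the series element caps the whole thing.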
--RJ
In the Real World, where you can't shut everything down at upgrade time, a PDP-11 connected to terminals was put in 25 years ago. The PDP-11 was replaced with a VAX, which ran in parallel with the PDP-11 while it was brought online. A few years later a couple of PCs (running DOS 3.0) were hooked up to each other via a Novell network, which was connected to the VAX. Ten years ago the VAX was replaced with a few servers, which ran in parallel with the VAX until they were trusted. Along the way various hubs, switches, and routers were installed, and upgraded as the need arose. The cables were upgraded, also as the need arose, and not all at once.
Best Slashdot Co
If you have something that's broken, and you need its functionality soon, and don't have a fucking clue as to what's wrong with it, you might want to replace it.
It may not be the right way to do it, but they're running a hospital, and might not have the time to let their network people puzzle it out.
In the article, it also states that they had just approved a contractor to do a network analysis: "on Oct. 1, hospital officials had approved a consultant's plan to overhaul the network - just not quite in time." If the article summary gives the correct information, I'll bet that large parts of their network were overburdened and hadn't been upgraded in years.
They were probably running at around 30-35% capacity and most networks get REAL funny at around that point. The following comment is rather telling: "The large volume of data the researcher was uploading happened to be the last drop that made the network overflow."
Another telling comment about the situation was: "network function was fading in and out".
I don't know what SAT is, but I think you made some mistakes.
If your 10% is the probability that one train will fail during the NY -> LA trip, then you've got the following probabilities:
0 trains fail = 0.9 * 0.9 = 0.81
1 train fails = 2 * 0.1 * 0.9 = 0.18
2 trains fail = 0.1 * 0.1 = 0.01
which means that the probability of having at least one train going from NY -> LA is:
... 99%, much better than the previous 90%.
#include "coucou.h"
Why wouldn't a full network power-down followed by a power-up fix this? Surely that would be quicker than 4 days.
Or was it a case of poor management / incorrect configuration, resulting in bad configs in the devices' NVRAM?
Well... in the case of the network system things would be different. We can tolerate the failure of one network or the other, but not both.
P(Failure) = 0.1
P(Net1 Fails) = 0.1
P(Net2 Fails) = 0.1
P(Both Fail) = 0.1 * 0.1 = 0.01
P(Net1 Fails, Net2 Works) = 0.1 * 0.9 = 0.09
P(Net2 Fails, Net1 Works) = 0.1 * 0.9 = 0.09
P(Either Net Fails) = 0.09 + 0.09 + 0.01 = 0.19
Yes, we are more likely to experience a failure, but at the same time we are ten times less likely to experience a catastrophic failure.
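The table above can be checked by listing every combination of the two networks being up or down. This is only a sketch, assuming the two networks fail independently with probability 0.1 each; `outcome_table` is a hypothetical helper name.

```python
from itertools import product

P_FAIL = 0.1  # per-network failure probability used in the comment above


def outcome_table(p_fail=P_FAIL, n=2):
    """Map each tuple of per-network states (True = failed) to its probability."""
    table = {}
    for states in product((True, False), repeat=n):
        prob = 1.0
        for failed in states:
            prob *= p_fail if failed else 1 - p_fail
        table[states] = prob
    return table


table = outcome_table()
p_both = table[(True, True)]  # catastrophic: both networks down
p_any = sum(p for states, p in table.items() if any(states))  # at least one down

if __name__ == "__main__":
    for states, prob in table.items():
        print(states, round(prob, 2))
    print("both fail:", round(p_both, 2), "any fails:", round(p_any, 2))
```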
Spanning tree doesn't kick in just for fun. There was a problem, and suggesting another parallel network only means they haven't found it and that it's going to happen again. And it surely doesn't take days to figure out a problem like that; there are a huge number of ways to fix it temporarily anyway.
:-)
I dealt with Cisco in the past and believe me, if they're not the ones who designed and built your network, they're not the ones you want around fixing things. And surely not the ones making (sorry) stupid suggestions like this one (parallel network, duh!).
Cisco probably has some of the greatest hardware around, but it will still behave like crap when you don't have the right people managing it.
The quick answer is: find what happened, fix it, maintain it correctly, and document it. Then if something happens, before pulling plugs and messing around, RTFM.
Well, occupying another country is another kind of a job. It's hard to say "mañana" when you've got lethal adversaries like stone-throwing kids. No sir, that's when you just have to do something with your M-16.
That it was a network upgrade... sometimes it's not, and you have no clue what was changed by *someone else*...
As far as a parallel network, that's a tad overkill.. proper redundant pathways should be enough.. and plenty of packet filtering/shaping/monitoring.
and keep a tighter rein on what is allowed to be attached to the PRODUCTION network..
---- Booth was a patriot ----
Yes. You do things in parallel and you make things redundant. You are fabricating reliability out of unreliable components vis-a-vis TCP over IP.
Let's talk about real-time systems. No, not "Voice over IP" or "streaming video" crap, I mean REAL human grade real-time systems.
How do they get 99.99999% reliability? The components they use may be good, but nothing is that good! They get it by 1) removing single points of failure and 2) rigorously analyzing common mode failures (a sequence of failures that brings everything down).
How is this done? You put things in parallel. Machines are multi-homed. Critical applications are hot-standby, as are their critical servers. You have the nightmare of constant standby data management (the primary sending a copy of its every transaction to the secondary and to the tertiary), but when the power on one side goes out (of course your primary and standby are in different buildings connected to different power supplies, right?!) the secondary steps right up.
In the future, I would want to not be isolated from my friends in the Space Station.
Hrmm, says that many CISCO engineers rushed in to "save the day" and did not get it fixed. I have seen this before. Perhaps those CISCO CCNP/CCIEs are not really that good... Then again, as someone else pointed out, if the current network engineer at the hospital did not have the common sense to revert any changes that were made, or figure out a (relatively) simple spanning tree problem, he should be the 1st to go. Sheesh, people need to recall the fundamentals of networking and protocols before they are made heads of very large networks.
The sad thing is I've seen this so many times before in different medical environments I've been in. They usually aren't very motivated to spend money on *any* infrastructure costs. Hospitals may spend some, but it's usually with the motivation to increase donations; "Oh look! It's shiny!"
Just like any other critical service, it costs big bucks to be prepared. How much you want to bet they 1) didn't have version control, 2) didn't have change control and
I am proud of them for one thing in particular. IMHO, your last line of redundancy, backups and recovery, etc. should ALWAYS be tangible. When you are involved with something that is life, death or riches, dead tree backups are the most reliable form I know. I am glad not everyone has lost their common sense to electron envy.
Democrats and Republicans only disagree about how to enslave you
As a Cisco engineer I believe if the network is done right the first time there is no need for that drastic a disaster recovery plan. The sheer cost would be astronomical, and if there is a design flaw in the original model, why replicate that on the DR side? Just my 2 cents.
This type of problem will continue until the healthcare system adapts to common standards, protocols and SOFTWARE. The HIPAA regulations have started to put fear into the hearts of hospitals and software companies who have to be up to standard by next year. Millions of dollars are at stake yet some companies are still clueless. If you doubt this statement, ask an IT person at the hospital how prepared they are for the upcoming HIPAA implementation. It will enlighten you.
Tools that are used by the hospitals are another issue. Many hospitals are still using proprietary systems developed by vendors, which are thinking of their own interest. Unlike the internet (apache), there are very few open source healthcare information tools that a large hospital system can use that are HIPAA compliant. With all the great open source tools (java, gcc, KDE, ....) it is surprising that a sourceforge project does not exist that would allow a hospital with minimal hardware to run a Java based HIS or something they can run on their existing legacy hardware. This is the killer application. I would tell Beth Israel Deaconess to fire their software group. Hire an open source team and start the development of an open source project to do the job that their current system is not. The savings will be great, but the contribution to healthcare will be legendary.
In a large switched network spanning tree can save your butt and burn it. We try to test our switch changes before they are implemented. ON A TEST NETWORK.
I know it's hard for everyone to believe, but vendors lie and those whiz bang network tools can screw you over.
We have several thousand users on our campus with several thousand computers. We run about a half a dozen 6500 series Cisco Switches. Spanning tree re-calculations take about a second or 2. This is no big deal. And your traffic is re-routed nicely when something goes wrong. But if an interface (which is an uplink into the other switches) is freaking out and going up or down, the whole network will grind to a halt with spanning tree.
Test Network GOOD (if you have the money).
Valid point, but to nitpick (hey, I'm bored), your maths is wrong. By your reckoning, 11 trains would give
11 * 10% = 110% chance of failure! :-)
The actual maths is
(probability of one or more failing) = 100% - (probability of none failing)
= 100% - (90% * 90%)
= 100% - 81%
= 19%
So obviously not significant to your point, but mathematically significant
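The same calculation generalises to any number of trains. A quick sketch, assuming each train fails independently with probability 10% (function names are illustrative only):

```python
def naive_sum(n, p=0.10):
    """The mistaken 'just add them up' estimate, which can exceed 100%."""
    return n * p


def p_any_failure(n, p=0.10):
    """100% minus the chance that none of the n independent trains fail."""
    return 1 - (1 - p) ** n


if __name__ == "__main__":
    for n in (1, 2, 11):
        print(n, f"naive={naive_sum(n):.0%}", f"actual={p_any_failure(n):.1%}")
```

For 11 trains the adding-up method gives 110%, while the complement method stays below 100% (roughly 69%).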
Is to not bother with a second network. They need to break the spanning tree up a bit with some layer 3 routers. Sometimes it is fun to have a nice big layer 2 network. It makes life easy. It sucks to debug it when one half of a leg goes down and you get spanning-tree loops. The switches go down in a ball of flames that way.
The solution is to put some edge routers in every building (Cisco 6509's with MSFC cards). segment each building into different IP networks. Route between the networks. That way you may lose a building if the spanning-tree goes futzed but you won't lose the whole campus.
Sure you'll be a touch slower routing between the segments but you'll have much more reliability.
Now I hope and pray that I will But today I am still, just a bill
I guess it depends on the amount of parallelism.
Most enterprise nets I have worked on are parallel all the way to the access layer switch on the user end with dual homed servers.
that is parallel.
If they mean two networks that don't touch one another, I think that is retarded.
The bottom line is, if they got taken down by spanning tree for a whole day, their network was extremely poorly designed to begin with.
If you follow a simple network design principle, and make each access layer switch a subnet(vlan), and make that switch the root of spanning tree for that VLAN, you will NEVER have a spanning tree loop.
never ever ever.
I'd love to be the VAR for that hospital.
"Yes, the only way we can prevent this is by building another poorly designed network in parallel!"
ka-ching.
I'm surprised I'm not seeing the really simple, obvious answer here to the question that's posed in the story.
do you think the answer to having a massive and unreliable network is to build a second identical network?
Don't build a second identical network. Just set it up so that whenever a file is saved, it's dumped onto a secondary network that's locked down so tightly that it doesn't run programs, search for documents, or anything like that. It just provides documents and that's it. For instance, it could be just a bare bones, huge-ass listing of links to patient data in a single document, and you would just use Ctrl+F or some such to find the name, and then click through it to see a TXT or HTML document with the patient's data in it. That way, you can have fancy programs and extensive information and such on the normal network without risking the network instability that comes with them.
Of course not. Two solutions are more obvious:
- Fix or replace the existing network with a more reliable one (probably one that is less centralized so outages would not affect the entire campus); or
- If a second network is going to be added to provide reliable backup, then the second network should certainly not use the same technology as the first.
A third, and somewhat obvious, solution would be to make sure that ...
This might also be a good reminder to get very aggressive "liquidated damages" clauses in contracts like this, or to buy insurance. If a patient dies because of the network outage, I am sure that everyone in the supply chain will be named in the lawsuit.
The liquidated damage clause is intended to provide an unambiguous motivation for the technology provider to fix the problem quickly, while the insurance would cover all or a portion of the losses if there is a failure.
I would be extremely surprised if a huge campus like this one did not have a substantial number of different technologies in use, including wireless, and clearly networking them all into the same patient-records database is a challenge.
-- http://www.MarkWelch.com/ Pleasanton California
The opening question, "is it bad to build a second bad network," is ridiculous. What Cisco did was build a second network that all computers could be moved to if the efforts to correct the first network were not successful. The network crashed due to poor maintenance. The network was not maintained by anyone that knew much about a switched network, nor did they really even know what spanning tree was. This just shows that businesses don't really understand the value of experienced or educated network staff. Anyone with an inkling of how to maintain a switched network would have been able to foresee that continuing with the current design (or lack of a design) would end in a crash.
Sorry my point was unclear!
;)
When I wrote "This isn't to say that the extra redundancy isn't useful" I was saying (without saying) that the redundancy *increases* availability. As you guys promptly clarified, the likelihood that both will go down, and hence be completely unavailable, is reduced.
I was simply pointing out that the gut reaction that 2 is better than 1 doesn't always hold true. If I were them, my first priority would be to figure out why their current network failed so horribly (spanning tree apparently) and, rather than having two equally unreliable networks, create a more reliable network, with redundant backups for availability. In a hospital setting, availability is paramount to other concerns, but they're going to incur more than twice the management costs by doubling the same network.
thanks for calling me out though
I live in the Boston area, and I have the perfect solution: they should hire me. I'll make sure their network never fails.
Well, maybe not. But I still need a job... =)
The probability of failure goes like this:
The probability of both trains failing is:
P(1st train fails) * P(2nd train fails) = 0.01
The probability of neither train failing is:
P(1st train doesn't fail) * P(2nd train...) = 0.81
The probability of exactly one train failing is:
P(1st train fails) * P(2nd train doesn't)
+ P(1st train doesn't) * P(2nd train does)
= 2 * (0.1 * 0.9) = 0.18
(notice this adds up to 1, so far)
and the probability of at LEAST one train failing is P(exactly one fails) + P(both fail) = 0.19
QED
Don't use Spanning Tree unless your routers still use Valve (Vacuum Tube) Technology. It's fine until it breaks, and then it can be a twat to make it settle down again. retire it .
...and he grinned, like a fox eating shit out of a wire brush.
Networks are fragile, I'm surprised there aren't more massive outages.
The answer might be to hire competent network staff, and perhaps train some other IT employees with basic knowledge to help in emergencies. A second network seems a little extreme--both cost and management wise.
KISS: Keep it simple, stupid!
The article is a little light on technical details, but does anyone know what internal routing protocol they were using? We've got a network with 11 Cisco routers running OSPF. The routing changes happen very often, because there's a bunch of dial-ups and a few dozen routes that come and go with short-term connections (like backups from a remote office or running a CC authorization from a remote office). Everything works perfectly as long as none of our three newest routers is the first one powered up. Those three are running IOS 11.0. After several calls to Cisco (we buy all Cisco internally and for our customer ends, so we get very good support from them) over the past three years, Cisco is still stumped as to what the problem could be. The OSPF section of the config file is only five lines long, so we (and Cisco) are sure there's no problem there. The hospital's problem sounds like it's of the same sort.
Ahh yes, but what about the probability of both trains breaking down at the same time?
You are confusing a device which is twice as complex performing a given task with two machines of the same complexity independently performing the given task.
complexity = bad
redundancy = good.
If this hospital is like any of the medical institutions I've worked for, then it's not unreasonable to expect that the IT group has been begging for more money to upgrade the infrastructure because they knew this kind of thing could happen. This usually falls on deaf ears at the doctor and senior administration level of the hospital because they see computers and networks as "magic" and don't take any time to understand the kind of reliance that is now placed on those systems. Also, it is very common for doctors to reject any spending on IT because it will bring their 8 figure salaries down to 7 figures and that is totally unacceptable!!! The story did say they are looking at $3 million for future upgrades, but that ONLY happened after this disaster.
Believe in things of which no person has ever learned
The paperless office is still, and always will be, a myth.
"Times have not become more violent. They have just become more televised."
-Marilyn Manson
That would be redundant!
A lot of people here have said "build a 2nd network," to which some have basically said, "that's stupid, make your first network run right." I think that if we're talking about life and death of patients, a second network would be a good idea. It's like the high factors of safety built into things like, say, an elevator -- a failure can cause death, so you overbuild it. Remember that you don't have to make everything redundant, just those critical parts of the system. Maybe all the administrators can only use the primary network, but the blood testing labs and nurses' stations and such can use either primary or secondary. Cutting off non-critical traffic during an outage also helps keep the whole system more stable.
I hope "The machine that goes ping" does not require the network to run. My guess is that much of that equipment is plugged into the red outlets and can run on its own for a fair amount of time. If it is hooked up to the network, it is to report the machine status, which is independent of machine operation.
The network crash in question screwed up the document process, slowed everything down, and made life inconvenient, but I doubt anyone's life was at risk.
Remember, You are unique...just like everyone else.
I used to work for a systems integrator. As a general practice, anything that was mission critical was on a separate network.... if not two different networks. This is most likely a WinXP machine on which somebody played with the STP/VLAN settings.
Speaking of teaching hospitals... Yes, they are large..... I live just a few miles from Wake Forest/Baptist Hospital. They add, or renovate, a wing a year.... There are always large cranes over the building... and since I'm looking for work... I applied there... Even though they had a plethora of positions open for Network Techs, and since I'm well overqualified, and cheap... you would have thought they would have hired me... they did not... they seem to go for bottom of the barrel regarding techs... cheapest... most likely they think A+ is the best cert you can get.
1) introduction of routed domains to separate groups of switches
2) ensure that more than one redundant switching loop does not terminate in a switch. I've had a single switch be the linchpin between two loops, had the switch go down and back up, and spanning-tree would not converge. If you want redundancy in your switches, spread out the loops.
3) Put QoS on the network. Identify mission-critical traffic and give it priority and guaranteed bandwidth (Cisco uses LLQ and CBWFQ using DiffServ, CoS, and IP precedence). That way even if someone puts loads of traffic on mission critical paths, the effect should be limited to the local switch port or router, depending how it is implemented.
4) lastly try a redundant network. You would still want QoS to stop a jabbering NIC from hosing your local bandwidth, and you might want to run diagnostics with your pocket PC or laptop, so you would still need to plug into that isolated net anyway. I would recommend that last due to cost, space, and connectivity issues.
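To illustrate the idea behind point 3, here is a toy strict-priority queue. It is not LLQ or CBWFQ and has nothing to do with how a real switch implements QoS; it only shows mission-critical traffic being served before bulk traffic. The class name and the priority numbers are made up for the example.

```python
import heapq
import itertools


class StrictPriorityQueue:
    """Toy scheduler: lower priority number is served first, FIFO within a class."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps arrival order per class

    def enqueue(self, packet, priority):
        heapq.heappush(self._heap, (priority, next(self._seq), packet))

    def dequeue(self):
        """Return the next packet to send, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


if __name__ == "__main__":
    q = StrictPriorityQueue()
    q.enqueue("research bulk upload #1", priority=5)
    q.enqueue("lab result", priority=0)
    q.enqueue("research bulk upload #2", priority=5)
    while (pkt := q.dequeue()) is not None:
        print(pkt)
```

A pure strict-priority scheme can starve the low-priority class, which is one reason real deployments combine it with bandwidth guarantees.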
Thank you.
The article is definitely interesting. Since hospitals affect all of us at one time or another, it's interesting to see how their networks are set up.
There was some talk here about unit-based networks, or basically separated parts of networks. How would I get more information about this topic? How would that work? Don't you lose some benefits from having a central network where resources can be allocated to where you need them? What does it mean to be on a unit or divisional network? As a metaphor, I imagine a building that has lock-downs at certain places. But for a network, for lock-downs to work efficiently, there have to be highly effective detection devices (kind of like fire detectors).
I can't imagine it would be as easy as turning on a switch.
I agree that this person should not have been on the production server at all but on a development server.
I also agree they should have had backups available, though they did state none of the network has patient critical information. But can you imagine if your patient information had been inaccessible?
Health Insurance Portability and Accountability Act.
Most health care organizations are far from clueless, believe me. Your average healthcare IT manager is well aware of HIPAA, as they've been working on the transaction and privacy aspects for quite awhile.
The techs in the trenches may know less, mostly because the data security regulations (the 3rd, and largest portion of the HIPAA work) are not yet finalized. The real work doesn't begin until then: probably sometime later this year.
They need a smaller test environment that ALL changes have to be checked off on before implementing. They need images of all router configs they can roll back to if necessary, and they need a diff comparison tool (mantrap or somesuch) to see what's changed between their known good configuration and what exists now.
Oh yeah, and they need a signed piece of paper with the moron's signature saying the change wouldn't impact the network. (a papertrail, as archaic as that seems.)
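The "diff against a known good configuration" step needs nothing exotic. A minimal sketch using Python's standard `difflib`, treating the configs as plain text; the file label and the sample config lines are placeholders, not anyone's real setup:

```python
import difflib


def config_diff(known_good, current, name="router.cfg"):
    """Return unified-diff lines between a saved config and the current one."""
    return list(
        difflib.unified_diff(
            known_good.splitlines(),
            current.splitlines(),
            fromfile=f"{name} (known good)",
            tofile=f"{name} (current)",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Placeholder text standing in for two saved copies of a device config.
    saved = "hostname example-switch\nsetting-a on\nsetting-b 10\n"
    live = "hostname example-switch\nsetting-a off\nsetting-b 10\n"
    for line in config_diff(saved, live):
        print(line)
```

An empty result means nothing changed; anything else is what the person signing the change form has to explain.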
"Draco dormiens nunquam titillandus."
I'm the armchair kind. But wouldn't this solution have led to TWO identical networks down? Whatever triggered the problem in network A could easily be present in network B.
Unfortunately, downtimes are not fun in a hospital. In other places, it means that we can goof off and blame it on the IT department.
Ok, time to stop trolling...
No sig
do you think the answer to having a massive and unreliable network is to build a second identical network?"
Take the number of patients in the hospital, A, multiply by the probable rate of death should the network fail, B, multiply by the average out-of-court settlement, C. A times B times C equals X. If X is less than the cost of a redundant network, we don't build one.
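For what it's worth, the formula above written out as a sketch. Every number in the usage example is invented purely to exercise the function; none comes from the article.

```python
def expected_liability(patients, death_rate_if_down, avg_settlement):
    """X = A * B * C from the comment above."""
    return patients * death_rate_if_down * avg_settlement


def build_redundant_network(patients, death_rate_if_down, avg_settlement, network_cost):
    """Per the comment's rule: skip the build only when X is less than its cost."""
    x = expected_liability(patients, death_rate_if_down, avg_settlement)
    return x >= network_cost


if __name__ == "__main__":
    # Hypothetical inputs, chosen only for illustration.
    print(build_redundant_network(500, 0.001, 2_000_000, 3_000_000))
```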
do not read this line twice.
Why do you think telecom. networks cost so much.
The network isn't too bad, but the incompetence of the people that run it astounds me. I've had large segments of it go out unnoticed by them because a UPS failed in a closet somewhere. Took them forever to track it down, too. In the end it's not the routers/switches that scare me, but the tons of old, outdated, unpatched Solaris machines that exist on this network. There are so many manufacturers out there that use crappy installations to run their MRI and CAT scanners that it terrifies me. It's really only a matter of time until all me and my company's doomsaying (we're a third party vendor that supports a medical image archive) will come true. Unfortunately, I think it will collapse on us because the IS people will be unable to handle it.
This sounds like a case of poor network infrastructure management. That being said, you can't pin it all on IT. Organizations like this have networks that grow out of necessity, and are often nearly impossible to make large changes to.
Perhaps the seemingly ridiculous "secondary" parallel network can be put in place not for redundancy, but as a tool to migrate the existing devices to a properly configured and routed network. If STP brought the whole thing down to begin with, they are probably flat. VLANs and subnetting at closets with appropriate L1 redundancy and L3 routing is most likely the modern network design their IT staff has known for years that they should have, but never had the convincing argument they needed to get management to foot the bill and allow the service disruptions required to make the switch.
Do not fold, spindle or mutilate.
The Ramans did everything in threes for a reason. So in response to whether or not building a second identical network is a good idea... I think a third should be implemented also! Especially in a situation where lives are at stake.
I can say that hospital networks are a nightmare. You have dozens of departments, all with different combos of hardware, software, requirements, & operating systems. Workstations for patient entry. Workstations for patient tracking. VT300's to access legacy VAX/VMS databases that NOBODY knows exactly how to port to a newer platform. Besides, convincing the powers that be that they need to spend BIG $$$$ to modernize and streamline is an endless battle. Kudos to them for having a workable system for 6 years, but they never should have abandoned the paper backup. Just my 1 cent.
Maybe Washington Mutual?
do you think the answer to having a massive and unreliable network is to build a second identical network?"
The answer is to build reliable networks in the first place. From each computer to the other there should be multiple routes. Firewalls should be kept between departments to stop NetBIOS and ICMP broadcast storms and Linux be used to replace M$ systems. All DBASE 5 apps should be replaced with mysql/ncurses equivalents on RAID 1/XFS filesystems. A central computer with daily backups be kept, with multiple power sources for each department.
Having a full-time network administrator, and shielding him from Sys admin tasks while he keeps a list of network analysers/ monitors the servers and keeps extra routers and cables, helps.
Quantity is still no alternative to quality. Install 4 networks in parallel and a DDoS attack will take it out. Else something like the slapper worm or even unplugging an important server will still break the system. Install good quality hardware and dont be understaffed in the IT sector.
"Give orange me give eat orange me eat orange give me eat orange give me you." -Nim Chimpsky
Could a completely decentralized network, e.g. P2P, solve the problem? redundancy is built-in, so to speak.
_____________________________________
2 + 2 = 5 for very large values of 2
According to my former employers at WorldCom, yes!
Okay, maybe not the best example. But it was always fun to be on conference calls when we had to explain to the customer why their backup network had gone down at the exact same time as the primary ... assuming your idea of fun correlates with the deeper circles of Dante's hell.
"Freedom is kind of a hobby with me, and I have disposable income that I'll spend to find out how to get people more."
As for why it's good, it can provide layer two redundancy at a very small cost (basically the cost of an additional cable). While the same can be provided with a routed network, at layer 3, the cost is much higher, and a properly configured spanning tree based network will fail over very quickly and provide lots of trouble free operation.
Beyond that, spanning tree can often protect people from themselves. What happens when that intern plugs a cable in the wrong place and creates a bridging loop? You guessed it: no spanning tree, no protection from bridging loops, and you can kiss all, or part (depending on the design), of your network goodbye. Oh, and good luck finding that cable, especially if it's a big place. Don't think that intern is going to admit his error and get fired...
Swannie
:q!
Of course the answer is to build a completely separate network if I am the one who you will pay to build it ;)
This is obvious.
In truth the network problem was not a physical one, so the solution should not be a physical one.
I work at the hospital (not in IT however.) In reality, the IT department is reworking/fixing the existing infrastructure including much of the hardware, _and_ adding a new redundant network. It doesn't look like this will be a complete standalone parallel network, but more likely a limited one that serves only clinical applications.
While the data the mentioned researcher dumped into the network caused the crash, it was merely the proverbial straw. The amount of data the network shuffles around is astronomical - for example all imaging is online and images need to be passed all around (different clinicians, backups, etc). These images are huge (a CT scan for example may consist of the equivalent of 100 regular x rays), and need to be stored and transferred in a lossless format.
We set up redundant systems at an airline's training centers to make sure the pilot training side wouldn't fail (apparently the cost of having pilots come back was huge - go figure...). In the end the training center was actually more redundant and reliable than the actual reservation systems at the airline and we enjoyed 98.5% uptime (save for small things like the power company killing power to the entire building without notifying us). To make matters more interesting this system was spread over two locations that are 600 miles apart.
Essentially we had redundant routers put in place in each center so that if one failed the traffic would kick over to the second. In addition we had developed a small application that resided on the classroom computers that would check the application servers holding the training material. If the primary server was down it simply switched to the secondary server (there were three application servers in one center and two in another).
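A small client-side check like the one described could look roughly like this. It is a guess at the general shape, not the application that was actually deployed; the host names and port in the usage example are placeholders.

```python
import socket


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_server(servers, probe=tcp_probe):
    """Return the first reachable (host, port) in preference order, or None."""
    for host, port in servers:
        if probe(host, port):
            return (host, port)
    return None


if __name__ == "__main__":
    # Placeholder addresses: primary first, then the fallbacks.
    candidates = [("app1.example.invalid", 80), ("app2.example.invalid", 80)]
    print(pick_server(candidates))
```

Passing the probe in as a parameter keeps the selection logic testable without a live network.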
Furthermore, our database servers (two in one location, one in another; the primary server was located in the larger center and all machines went there first) had a product called DoubleTake installed which would cause the backup server to assume the identity of the primary server in the event of a failure. DoubleTake also allowed us to mirror image the data on our servers for consistency in the event of a failure. This was important because if we had a WAN failure the database server in the smaller facility would activate and act as the database server for that facility (this actually happened - we had our IT work farmed out to a large support company, which I shall not name, that actually once failed to notice a T1 line had failed for OVER A MONTH!!!).
There were a few glitches, such as the need to wait until afterhours to bring back the primary server in the event of a failure (if you didn't you would be bringing up another server with a duplicate IP address due to the DoubleTake software which caused all sorts of problems so both actually had to be brought down), but for the most part it works very well.
Heck, even if all that failed we had stand alone machines that could run off of a CD. I think that may be a little difficult for a hospital to do though.
Kris
The problem with the duplicate network is that it can fall victim to the same problems the original had. Say the first network goes down because of this problem. Ok.. first you have to re-patch all the network nodes into the new network (probably not an easy task). But if the new network is designed the same way, the professor replugs into it and starts his number crunching again. Now the new (2nd) network is down..
Worse, with the 2nd network as a backup, they may never know what caused the problems, and therefore it wouldn't get fixed.
It's kinda like putting a "backup" engine on a plane because the fuel is dirty and kills the engine.. it will kill both engines.. cleaning the fuel is a better fix..
The answer is to keep all life-critical systems on one completely separate network. Keep all research on another completely separate network.
If another researcher brings down the research network, that's fine. No one is going to die. But the life-critical network would be untouched, and that is the whole point of having a parallel network.
They should have done this in the first place. To not have done it was irresponsible. I would sue their asses off if I were a patient or a family member of a patient that died during those 4 days.
I used to work at a county teaching hospital. They had a really ancient, parts-a-muffin, mixed-topology network. Each department had its own separate, incompatible system. These systems were chosen by the department heads, not the IT staff. They then had to use Siemens OPENLink to tie them all together. They had downtimes all of the time. So all the staff was prepared for downtime procedures, because they had to use them once or twice a week, at least for the four months I worked there. So maybe a less reliable network is in order?
The story I heard was that they had already approved the new network and it was still a few months away from being implemented when the old chewing-gum-and-baling-wire network prematurely fell apart.
Way I see it, there are 2 things that need to get done.
1) Policy change. Only production machines on a production network.
2) Topology change. Make it easy to get a non-production network connection so people don't violate #1
This is a moot question. No one in the USA takes trains. Pure fiction!
-----
.....
0.1 * 0.1 = 1
-----
Someone failed math...
0.1 * 0.1 = 0.01
I must agree with all those folks complaining about the snippy zingers in /. articles recently. Please read the article *carefully* before you post a summary, and try to be more objective, "Michael".
The hospital was already in the process of overhauling their network with the help of a consultant. Now they're going to accelerate that work (doh!).
NOWHERE does it say they're going to build a "duplicate" network! They're going to add twice the amount of wire, but that's the only real detail cited, and that's hardly enough information to justify the petty jab.
Having a backup network based on the same design will not solve anything. The Ariane rocket disaster, caused by an integer overflow, had duplicate processing, and when the first processor crashed the redundant one took over and crashed too because of the same overflow. The solution (IMHO) would be to limit the resources a single individual/process/workstation can have so a single user cannot flood the system, causing a crash. I also believe there should be a development database/system set up that the user could test against before exposing critical production systems (corporations have them - why can't hospitals?).
...a completely digital, paperless
*echo chamber effect* Hospital of the Future *end effect*
here in Birmingham, AL (we have a big medical presence). I hope Scrushy's reading this...
Cake or Death? Cake Please!
I spent three years (1995-1998) at Perot Systems as a consultant designing and implementing hospital networks for Tenet Healthcare (2nd largest hospital chain in the US). There was at least one hospital that had the budget and the foresight to see that reliance on the network would do nothing but increase.
For that hospital, my network design was one that incorporated as much redundancy as possible at the time. For each patient care area, such as nurse's stations and ancillary areas such as radiology, cardiology, surgical theaters, etc., it was decided that each of the two network jacks would terminate in separate closets. This meant doubling the number of closets required in order to meet distance limitations, but the hospital had already started working on allocating that space for the closets. Also, for any important ancillary areas such as the lab and central supply, there were two separate networks as well. For the server farms themselves, the Patient Care systems all had redundant connections to the primary and backup networks as well.
As each wall jack terminated into a different closet, each closet had two separate networks as well. Each closet would house the primary network for half of the jacks served, and the backup network for the other half of the jacks served. The fiber paths from each closet took disparate paths back to separate data center rooms, one external to the main building of the campus and one inside the main building. At the time, layer 3 switches, or switch routers such as the Foundry Big Irons or Cisco 6500s, were not available. So as much as I dislike using Spanning Tree, I used it at the time. All priorities were manually set, though, so there was no doubt where the root was and where it would move to in case of failure.
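To make the point about manually set priorities concrete, here is a minimal Python sketch of how a spanning tree root is chosen: the bridge with the lowest (priority, MAC address) pair wins. The switch names, MAC addresses, and priority values are invented for illustration and are not taken from the design described above; 32768 is the usual default priority.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Bridge:
    name: str
    priority: int   # lower wins; 32768 is the common default
    mac: str        # colon-separated hex, used as the tie-breaker

def bridge_id(bridge):
    """Sort key for root election: priority first, then the MAC address as a number."""
    return (bridge.priority, int(bridge.mac.replace(":", ""), 16))

def elect_root(bridges):
    """Return the bridge that would become root among the ones still alive."""
    return min(bridges, key=bridge_id)

# Hypothetical layout: two core switches with pinned priorities, closets left at default.
switches = [
    Bridge("closet-3", 32768, "00:00:0c:00:00:03"),
    Bridge("core-a", 4096, "00:00:0c:00:0a:01"),
    Bridge("core-b", 8192, "00:00:0c:00:0b:01"),
]

print(elect_root(switches).name)                                       # root in normal operation
print(elect_root([s for s in switches if s.name != "core-a"]).name)    # root after core-a fails
```

With every priority left at the default, the tie-breaker is the MAC address alone, so the root can end up on whichever switch happens to have the lowest address. Pinning the priorities removes that guesswork.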
So, the switches terminated on another switch which was partitioned into several segments. Switch connections were made between the two data centers as well. Each segment had a connection to a Cisco 7507 Fast Ethernet port local to that computer room, and another in the second computer room. Forming the core were two sets of two Cisco 7507s. In order to prevent one OSPF network from affecting the other OSPF network, static routes were used (would use BGP if I had to do it over again). Outside WAN connections were terminated redundantly on the two patient care networks as well.
While the primary network in the hospital also supported the non-patient care areas (such as administration), the backup network was only for the patient care areas. That was just to prevent the type of thing that happened in the article, where something non-patient care related ends up taking everything down.
Reverting to backup paper systems would be nearly impossible once the "tube" systems were sealed up. Much like the movie Brazil, hospitals used to have pneumatic tubes running all over the place, especially between the lab and the nurse stations. Running samples and results back and forth would definitely introduce a LOT of delay for a doctor trying to make a life and death decision.
I am sure that I would design things differently these days (for one, Layer 3 would go all the way to every single edge switch and collapse on a fast switch router) but I think the design probably held together well. I should check back in someday and see how long and well it lasted, if they did replace it.
Jay
There really isn't enough information in the story to "ass-ume" any intelligent discussion. The answer to a fail-over network isn't really in the building of a second identical infrastructure, but really in a redundant design that lends itself to automatic failover. HSRP, STP, redundant core and distribution layers are all excellent tools to perform this type of redundancy, but if not set up properly, or managed properly are no good at all. If not constantly monitored for performance and faults, a network is only as good as the hardware itself - Budget/Finance and Administrators often bypass the ongoing expense of maintaining a network infrastructure once it's built (monitoring software/personnel). It will be interesting to see if Cisco issues a case study on the problems that caused this failure.
I guess that it was you who failed maths.
1% = 0.01
when the time spent debugging the problem surpasses the time you would spend just doing it over. The hard part is determining when to give up on fixing it and moving forward with a new plan. Would any of you like to trace a network fault on some of the "Most dangerous server rooms in the world" (see The Register)?
And 0.01 = 1%.
I can offer my services to straighten out their problem for $165.00/hr.
Their Network admins have no clue! First of all...
Why the hell did they design their network around
spanning tree? Poor design leads to failures like this.
Shut Spanning Tree Off. Lay out a plan, use routers
where needed (yes ethernet routers too). Or VLAN
if you must.
Segment the network. Man these people have no clue.
This is what happens when you have poorly laid out network.
This is of course redundant, but your webserver having 99.999% uptime is GREAT. A hospital having 99.999% uptime is a disaster. The ONLY way to responsibly manage a network like this is to build a redundant system. Fix what's broken of course, but have the backup. You do your best to make sure your company's database works all the time, but you still make back-ups, don't you?
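For scale, the arithmetic behind the "nines" can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration assuming a 365-day year; the four-day figure is the outage length mentioned elsewhere in this discussion, and nothing here comes from the hospital's actual records.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_minutes_per_year(availability_percent):
    """Minutes of downtime per year allowed by a given availability percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR

def availability_percent(downtime_minutes):
    """Availability over one year, given the total minutes of downtime."""
    return (1 - downtime_minutes / MINUTES_PER_YEAR) * 100

for nines in (99.9, 99.99, 99.999):
    print(f"{nines}% uptime allows {downtime_minutes_per_year(nines):.1f} minutes down per year")

four_days = 4 * 24 * 60
print(f"A four-day outage in one year works out to {availability_percent(four_days):.2f}% uptime")
```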
People who think they know everything really piss off those of us that actually do.
Suppose you have a footbridge crossing a stream that takes heavy traffic. One day, it collapses with many people on it. One of the people on the bridge weighed 300 lb.
Would it be fair to say that the bridge collapsed because a 300 lb man was on it? It is completely clear that he contributed to the collapse of the bridge, in the sense that he contributed to the stresses on the structure. One might even say he is more responsible than a 100 lb woman who was also on the structure at the time.
But, we'd generally expect that a footbridge be engineered to support a 300lb man. Or if not, to isolate the failure (e.g. the planks under him might fall out, but the bridge as a whole should not collapse). It's part of the designer's job to anticipate this.
I've done a lot of troubleshooting in my time, of networks and other systems. One thing I've learned is that in the case of failure you just can't fasten on one thing that is out of the ordinary. At any given time, in a big enough system, something's bound to be out of the ordinary. Even if you can trace, step by step, the propagation of a problem from a single anomalous event, it is the capacity of the system to propagate the problem that is the real issue, at least if you take a conservative, defensive stance in design.
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
I don't know much about this stuff, merely having the Cisco Certified Network and Design Professional certifications, and not yet having the CCIE, but here goes.
... messy. If they've really got a situation where they've got one big a$$ subnet with 1024 or 2048 IP addresses in it they're pretty much going to have to *build a parallel network* with proper L3 equipment and an IP address allocation plan, then go floor by floor and convert users to that scheme.
In the bad old days before layer three switches became inexpensive networks were either routed or bridged. Spanning tree is a tool used on redundant layer two networks to detect and eliminate loops.
If these guys were a hardcore Cisco shop and they used Cisco's Inter Switch Link (ISL) VLAN technology it is possible they might have a very complex topology with multiple spanning tree roots. That can't be done with the IEEE 802.1Q VLANs more commonly used today, but this sort of thing was deployed for campus redundancy in the mid 90s.
The right solution in a case like this is
It sounds crazy, but I've been responsible for a campus with 800 MAC addresses in the core switch's CAM table and it is the easiest, safest route to take.
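As a rough illustration of the address allocation plan mentioned above, here is a small Python sketch that carves one flat 2048-address network into per-floor subnets using the standard `ipaddress` module. The 10.20.0.0/21 range, the /24 size, and the floor labels are all assumptions chosen for the example, not details of the hospital's network.

```python
import ipaddress

def plan_subnets(flat_network, new_prefix, labels):
    """Split one flat network into equal subnets and hand them out to labels in order."""
    net = ipaddress.ip_network(flat_network)
    subnets = list(net.subnets(new_prefix=new_prefix))
    if len(labels) > len(subnets):
        raise ValueError("not enough subnets for the requested labels")
    return dict(zip(labels, subnets))

# Hypothetical example: a flat /21 (2048 addresses) carved into /24s, one per floor.
flat = "10.20.0.0/21"
print(ipaddress.ip_network(flat).num_addresses, "addresses in the flat network")

floors = [f"floor-{n}" for n in range(1, 7)]
for label, subnet in plan_subnets(flat, 24, floors).items():
    print(label, subnet)
```

Each subnet then sits behind its own layer 3 interface, so a broadcast problem on one floor stays on that floor while users are converted one floor at a time.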
I am very easy to get along with, but I don't have time to waste being nice to people who are being stupid. -Theo
Gee, you must be at CWRU. As an alum, no other school's network could have been as poorly designed (then).
Hi guys,
:)
I work as a system integrator for one of the biggest manufacturers of telecom equipment, and I don't understand how something that crucial could not have a redundant network. Whenever we sell something to any telco, the first question they ask is always "What's the backup plan if this or this or this fails?" and the second question is always "What is the backup to the backup?".
I can't believe that a network built for something as important as a hospital could not have a redundant network, or at least redundant nodes (switches, routers,..). If that's the case, then the guy who designed this network should be shot.
K
If only the hospital had been using computers with Linux (+1 for insightful) and used an open-source model (+1 for interesting) then it would've been more stable. Plus the open-source community (+1 for underrated) would have it back up and running minutes afterwards if it did crash.
The only thing Microsoft I want to see in that hospital is Bill Gates with the AIDS (+1 for funny).
The ideal would be to actually use both networks, such as by using each on alternating weeks. This ensures that both networks can handle full normal operations and are both operational.
HA! They'll both fail when they meet in the middle!!!
MUAHAHAHAHAHA!
No sig
No.
You can only multiply them together like you have done if the two variables are independent.
Here this is clearly not the case; if the networks are identical and one fails, it is more likely that the second will fail because the cause might be identical.
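The difference is easy to see with numbers. A minimal Python sketch, using the 0.1 failure probability from the posts above and an invented conditional probability of 0.9 for the identical-design case:

```python
def both_fail(p_first, p_second_given_first):
    """P(A and B) = P(A) * P(B | A). Reduces to P(A) * P(B) only if A and B are independent."""
    return p_first * p_second_given_first

# Independent networks: the second fails with the same 0.1 chance regardless of the first.
print(both_fail(0.1, 0.1))

# Identical networks (hypothetical figure): whatever killed the first very likely kills the second.
print(both_fail(0.1, 0.9))
```

The 0.9 is only a placeholder. The point is that the second factor has to be the probability of the second network failing given that the first already has.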
Seeing as these paper forms hadn't been used for 6 years, I'd have to assume that the network was very reliable. Problems do occur from time to time, but it doesn't mean that the whole thing should be replaced. Just fix the issue and move on.
Dude, you so don't know what you're talking about; Cisco is the #1 supplier of layer-3 switching gear in the world:
http://www.cisco.com/en/US/products/hw/switches/index.html
Nor is it true that 'Cisco equipment runs a new instance of spanning tree each time a new VLAN is created'. You have to know what you're doing, of course, but it's very easy to create a very large layer-2 spanning-tree domain with a good-sized ST diameter. With good network design principles (read more on http://www.cisco.com, attend their Networkers sessions) and an understanding of how the equipment works, this sort of problem should never occur.
The Globe was indeed short on technical details. What puzzles me is that they say the network was down for four days.
NOT a rhetorical question:
Why didn't they power-cycle the whole complex? Maybe even literally? Presumably a hospital should be able to handle a short interruption in AC power... and presumably the network equipment wouldn't preserve the "I'm-broken-state" in nonvolatile memory. Why wouldn't a scheduled power outage for 10 minutes at 2 a.m. have been less disruptive than the network being down for four days?
Less drastically, couldn't they have called every operator and system administrator in and said "Synchronize your watches... at 2 a.m. power off every piece of computer gear within a hundred feet of your chair, then at 2:10 a.m. power them on again"?
"How to Do Nothing," kids activities, back in print!
Spanning Tree is not the answer. Especially when you
have multiple platforms speaking multiple protocols.
You need to segment the network, each department
should be on its own segment, its own network address. As for other protocols... yes they
should be segmented as well. Each department should have its own ethernet router. In a hospital.. preferably a fiber router.
Shut Spanning tree off! Damn!
Ok.. I'd go with a higher end central router.
A Cisco 7000 series if you are going fiber.
A 3600 series if you are just running 100 BaseT cat 5. And here's where my expertise comes in.
Program the routers properly. Do not use any autodiscovering protocols. That goes for all your
protocols! And if they have Novell... don't SAP
every minute, SAP maybe every 10 minutes or so.
Static routes should be used for IP, don't use RIP. And a poorly managed network can
come crashing down if Spanning Tree is used.
In college I called this failure "Packet Avalanche". I bet if I put my Linux based laptop
on the network and analyzed the traffic there would be collisions up the wazoo.
...the TV show you intend to watch is there. It may begin a few seconds late, on purpose or as a result of some discrepancy, but the TV show you want to watch is there.
For the past few years, networks on the national and local levels have all been switching over to server-based content play-out. TV from Computers! How Exciting! How Wonderful! How... frickin' scary, for those whose jobs it has been to ensure that Buffy plays down at 8, and not 8:02, or 8:15, or - Powers-That-Be Forbid! - Wednesday morning.
Professional TV Master Control operations traditionally operate (often contractually) to "five 9's" of reliability, 24x7, assessed monthly. Full Stop, Period, End-of-Story. TV Master Control geeks, their supervisors, and the maintenance engineers who support them have ever been a priesthood apart when it comes to worship at the Uptime Altar.
So what has their industry done, to ensure that all this "new wave" server and automation technology provides them with the same reliability as manual control and tape-based playback? Why, buy two of everything, of course! EV-ER-Y THING!
The server industry is only getting around to understanding that now, and is beginning to price their wares accordingly. I've attended dozens of vendor meetings over the past ten years where the salesguys, who six months earlier were selling mailservers to sysAdmins, are now selling their new video servers to Master Control guys. (Chum dished into a shark tank is the only comparable visual I can come up with.) What makes the sale is never the reliability of server over tape or (especially) the quality of server over tape, but the desire of management to run more channels with fewer bodies. In the past this has led to management re-assessment of just how "inexpensive" server-based playout technology was and, in many cases I have seen, an increase in the number of channels created or planned as a means to justify the hardware costs.
The only debate point in most TV Master Controls comes down to what components are in-chassis redundant, which are external-chassis "hot" spares, and which are shelf spares.
My point (and I do have one...) is that it is unconscionable that a hospital, where lives are at stake, lacks the war-room mentality that an entertainment operation has. It's real simple at the end of the day to assess which components in a network chain (info or video or both) are critical, and buy two of them and keep it all lit and tested. Lives are at stake, and your signature is on the shift report? You rent a tertiary back-up system to bring online while you do your regular and frequent preventive maintenance on your primary and secondary.
The guys who take care of Buffy do it. I would have thought that the guys who take care of sick babies and grandmothers would be playing in the same league.
Dag-blamed technology always be messin things up!
J Moll - PC Load Letter - I know what it means!-
I've consulted here. No not on the network design! Desktop staff - big hello to the much expanded Research Support team!
AFAIK the BI network has gradually evolved from the 60s/70s and has included several massive growth spurts to incorporate the expansions, refits, windfalls etc. I once participated in an after-hours Cisco cutover where we yanked connections and waited for the data to flow (IPX round/robin servers listing) to find the specific segments affected. Very much a live trial and error process.
I got the feeling no-one is completely certain where/how all the data flows especially in the older Research segments e.g. Dana Farber. In fact, I'm guessing this is where the failure originated. Heavy duty number crunching and spanning tree errors lead me to some sort of distributed unix process across network segments. I want to blame a certain notorious geek (Dr P's) unix and mac labs but in truth it could be any one of the overworked and underfunded labrats in any of the segments.
The wiring closets used to look way worse than any posted at the recent Register article. A single Cat 5 cable run to a data jack is sometimes split to host 2 connections: unfortunately as the Research areas are grant funded, this is still bloody cheaper than a hub/switch! There is probably still some localtalk cabling in some labs, coax runs to a DG and Novell serial connections with 1 or 2 Mac Classic and SE holdouts running Dos and DG terminal emulators!!!
The network team in the Hospital (2 afaik) coped with daily routing failures, buggy failovers, the crappy Novell IPX 802.3 implementation and servers around every corner. Those folks team with a great desktop staff to nursemaid outdated equipment into the 21st century. It stuns me to this day what a superior job these folks did and probably do. They certainly made my job easier.
I feel this could have happened any time and disaster has been averted one too many times before. Halamka and the exec staff owe these guys more than just a few column inches of chagrined praise.
I develop business practices for large industries (including in the past the Trans-Alaska pipeline, et al.). These industries rely heavily on computers, and each has developed plans and trained their critical personnel for emergencies like power failures, computer failures, etc. Reliance on a single tool to protect safety & environment is bad, m'kay?
The more things change....
Those confused or interested in a good grounding should be reminded of Radia Perlman and her wonderful seminal book "Interconnections" subtitled something like "The theory of bridges and routers". As she is the inventor of the spanning tree algorithm and currently a Sun-employed networking guru in the Boston area, perhaps a savvy CCIE would have consulted her on this and thus shortened the MTTR. Those reading only "quick start" guides to certification, rather than broader texts, get what they deserve.
PS: each chapter starts with a humorous quote to enliven serious topics
If one researcher sitting at his desk can take down the whole hospital system accidentally just by "overusing" the network, it's just a matter of time.
When Cisco was called on for help, they didn't redirect their customer to a 900 number, they didn't shuffle them off to a service contract salesperson. They just rolled up their sleeves and solved the problem. It may have been Boston area Cisco engineers in the trenches but there were Cisco engineers in San Jose, RTP and probably elsewhere involved in this.
One of the fundamental concepts in building mission critical networks is what is referred to as "A/B Diversity" -- also sometimes called "salt and peppering". The idea is that you build two or more physically and logically separate network infrastructures and distribute the user population evenly across them. Thus, when a catastrophic failure occurs in one of the network "domains", the other will continue to function and business can continue in "degraded" mode.
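A toy Python sketch of the idea, assuming stations are simply dealt out to domains in rotation so that neighbouring desks land on different infrastructures. The station and domain names are made up, and a real deployment would do this at the patch panel rather than in software.

```python
from itertools import cycle

def salt_and_pepper(stations, domains):
    """Deal stations out to network domains in rotation, like dealing cards."""
    return {station: domain for station, domain in zip(stations, cycle(domains))}

def survivors(assignment, failed_domain):
    """Stations that keep working when one whole domain goes dark."""
    return [s for s, d in assignment.items() if d != failed_domain]

# Hypothetical nurse's station with eight desks split across two domains.
desks = [f"desk-{n}" for n in range(1, 9)]
assignment = salt_and_pepper(desks, ["A", "B"])

print(assignment)
print(survivors(assignment, "A"))  # every other desk is still up
```

With two domains a failure leaves half of the desks in any given area working. With four domains, three quarters survive.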
We have a long way to go before data networks reach the stability of, for example, the public telephone system. The modern reality is that these networks are susceptible to a host of trivial yet potentially catastrophic failure scenarios. Spanning Tree (STP) is a very unreliable protocol. It has the potential to fail under many conditions such as the presence of physical errors, very high load, or as a consequence of a bug in the OS of one or many network devices.
Broadcast storms will occur. ARP storms will occur. OS bugs will crop up. Facilities personnel will play jump rope with your cable plant.
These problems can be mitigated, but not eliminated, by good network design. Thus, in environments such as hospitals and banks, where the cost of network downtime is too great to bear, it is common practice to build one or several parallel infrastructures as I have described.
FUNNY NETWORK TRICKS
I used to be in charge of the NOC at a large investment bank in New York. One of our buildings had six floors each housing 1,000 equities traders -- and this was during the stock market boom. Network downtime was not tolerated during trading hours. Therefore, the building was divided into four separate network domains connected to each other, server farms, and the WAN/MAN environment via a layer-3 core.
-- One time a printer became wedged and proceeded to send out ARP requests at the rate of thousands per second. The flood of messages pegged the CPUs of the routers servicing that domain and brought network services to a halt. Time To Resolution: 20 minutes (proud to say) to deploy sniffer, identify offending host, and rip its cable out of the wall with extreme prejudice. % of building affected: 25.
-- Over the course of several months, the Novell/NT team progressively decommissioned Novell servers and replaced them with W2K servers. Unfortunately, nobody thought to turn off the Netware services in the roughly 1,000 printers deployed throughout the building. On one glorious day, the very last Netware server was decommissioned in a particular domain leaving the printers in that domain with no server to "attach" to. The resultant flood of SAP messages became so great that the Cisco routers could not service them in a timely manner and they became cached in memory. The routers would gradually run out of memory, spontaneously reboot, and repeat the cycle. Time To Resolution: ONE FULL DAY. % of building affected: 25. Number of hours spent in postmortem meetings: ~15.
-- On several occasions, Spanning Tree failed resulting in loss of network services for the affected domain. Time To Resolution: 15 minutes to identify problem and perform coordinated power cycle of Distribution switches. % of building affected: 25.
And the list of stories goes on. You get the point.
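The "identify offending host" step in the printer story usually boils down to counting who is talking the most. A minimal Python sketch, assuming the sniffer output has already been reduced to (timestamp, source MAC) pairs for ARP requests; the record format, MAC addresses, and threshold are all hypothetical.

```python
from collections import Counter

def noisy_hosts(arp_requests, window_seconds, max_per_second):
    """Return source MACs whose ARP request rate over the window exceeds the limit, worst first."""
    counts = Counter(src for _timestamp, src in arp_requests)
    limit = max_per_second * window_seconds
    offenders = [src for src, n in counts.items() if n > limit]
    return sorted(offenders, key=lambda src: counts[src], reverse=True)

# Hypothetical one-second capture: one wedged printer and two normal hosts.
capture = [(i / 3000, "00:10:83:aa:bb:cc") for i in range(3000)]
capture += [(0.2, "00:50:04:11:22:33"), (0.7, "00:50:04:44:55:66")]

print(noisy_hosts(capture, window_seconds=1, max_per_second=100))
```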
Interesting how even an army of Cisco engineers couldn't fix the problem. Perhaps a testament to how overly (and needlessly) complex Cisco's equipment is...and/or, how bad their certification/training is.
As for "identical separate network", at my old company, we had a pair of Cisco PIX units that were configured in stateful failover; this means they share enough information that if one keels over, not a single connection is dropped.
Unfortunately, the PIX OS release had a bug that would cause a crash every so often, and guess what?
One would crash, then the second would crash immediately.
As mentioned, the issue here was completely improper network structure, with research and production networks one and the same. Does this mean someone can walk in with a laptop and start spewing data and/or false routing info and crash the entire hospital? The responsible parties should be FIRED, given today's labor market; absolutely inexcusable.
I'd also guess improper change control procedures were involved here as well.
Whoever handles the hospital's emergency preparedness should also be fired for not keeping staff familiar with alternative methods (gasp, PAPER!). What if they had a power failure? Happens all the time, and not always because of external causes..."keeping the power on" is not as simple as "install a big backup power plant for the place." As Exodus discovered once at their CA datacenter, backup generators don't always work.
"Do you think the answer to having a massive and unreliable network is to build a second identical network?"
The answer to having a massive and unreliable operating system was to build a second, more reliable operating system named Linux. If we can do it with an OS, why not do the same with a network?
As pointed out elsewhere, the key assumption is independence -- that breakdowns are like rolling dice. You have to consider the causes of the failure. Virtually every realistic scenario you can think of has a dependent aspect which links the possible failure of trains.
Here are some examples of the ways in which failures can occur that have implied linkages:
(1) Both trains are damaged by an earthquake.
(2) New instructions for routine maintenance were printed incorrectly (e.g. causing the mechanics to under torque a critical bolt).
(3) The firm has cut the maintenance budget and is neglecting routine maintenance.
(4) The train is sabotaged by disgruntled employees or terrorists.
(5) Fuel filters delivered by manufacturer are faulty or incorrectly manufactured.
(6) Design flaw means trains do not meet expected performance specifications.
In reality, failures tend to be linked rather than independent. You can't use simple multiplicative logic, you have to use Bayesian logic. P(B|A) ≠ P(B): the probability of B given A is different than the probability of B in the absence of any other information. The FAA and military know this. If an aircraft crashes, then all aircraft of the same type are typically grounded for a period while the problem is analyzed to eliminate some kind of systematic flaw.
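One simple way to put numbers on linked failures is a common-cause model: each train can fail on its own, and in addition some shared cause (bad fuel filters, a misprinted maintenance manual) takes out both at once. The Python sketch below works that model through; the model itself and the probabilities fed into it are illustrative assumptions, not data.

```python
def p_one_fails(p_independent, p_common):
    """Marginal chance that a given train fails: shared cause, or its own fault otherwise."""
    return p_common + (1 - p_common) * p_independent

def p_both_fail(p_independent, p_common):
    """Chance that both trains fail: shared cause, or two separate faults otherwise."""
    return p_common + (1 - p_common) * p_independent ** 2

def p_second_given_first(p_independent, p_common):
    """P(B | A) = P(A and B) / P(A)."""
    return p_both_fail(p_independent, p_common) / p_one_fails(p_independent, p_common)

# Hypothetical figures: 10% chance of an individual fault, 5% chance of a shared cause.
p_ind, p_cc = 0.10, 0.05
print(f"P(B)     = {p_one_fails(p_ind, p_cc):.4f}")
print(f"P(B | A) = {p_second_given_first(p_ind, p_cc):.4f}")
print(f"P(A and B), naive multiplication = {p_one_fails(p_ind, p_cc) ** 2:.4f}")
print(f"P(A and B), with shared cause    = {p_both_fail(p_ind, p_cc):.4f}")
```

Setting the shared-cause probability to zero brings back the dice-rolling case, where multiplying the two probabilities is correct.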
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
Slashdot is all about personal agendas.
I work at a hospital, on the networking side of things. It's a fairly large hospital, and we've got some pretty amazing tech here that runs this place. But BY LAW we have downtime procedures. ALL STAFF MUST KNOW THEM. We have practice sessions monthly in which staff uses downtime procedures (pen and paper) to ensure that if our network were to be completely lost, we could still help patients. It's the friggin law. Whoever fucked up and hadn't looked at downtime procedures in 6 years should be fired. That's just bullshit.
I don't know how that hospital was able to pass inspections.
Simple solution: get another network to back up the production one...it does not have to be as fast since this is going to be used temporarily until the full production network is up again
Extra solution: make at least a couple of environments for work...development and production
Extra extra solution: back up production with a fail over environment (ie. when production falls this fail over will pick up where production left off)
Ack...just my 2 cents...
Etherhose (10b5 thick coax) is a useable networking technology. It has very good resistance to RFI/EMF. Lots of hospitals still run it, on links where 10 Mb/sec is sufficient.
Etherhose is no longer a good investment because it is labor-intensive to work with (vampire taps, and thick, heavy cabling) and because nobody is developing the technology any more.
Today, fiber optics might seem a better choice for noise isolation, since the cost has come down to a reasonable level.
However, glass has the same potential for future obsolescence as etherhose - I have a half-dozen mutually incompatible fiber links here. And termination, splicing, and interconnection of fiber is at least as difficult as working with etherhose... having done both, I'd say drilling for a vampire tap is easier.
In short, don't replace a working piece of infrastructure needlessly (wait until you project a need for additional bandwidth) and for noise isolation cat 5e in a grounded metal conduit is probably your best bet. Large diameter, professional quality conduit runs through electrically noisy areas are costly but also a very safe investment.
I wouldn't knock that old etherhose - it does its job quite well, far better than the 10b2 thin coax that replaced it ever did. And it's far more physically sturdy than anything else outside of conduit.
From the Globe article:
"It was Dr. John Halamka, the former emergency-room physician who runs Beth Israel Deaconess Medical Center's gigantic computer network."
So the network admin is a former ER doctor? Since when did:
1. Network admins make more than ER doctors?
2. Med schools teach Cisco?
Sounds like a case of acute administratia to me.
``do you think the answer to having a massive and unreliable network is to build a second identical network?''
Obviously having two, or any number, of unreliable networks doesn't build one reliable one. If some user can take down the first net, he can also take down the second or nth net. If this user has bad intentions, he likely will. If the taking down was due to a program running wild (apparently the case here), it might happen again. More backup does increase reliability here, but never makes it really reliable. What does make a network reliable? Nothing does. They can always be trashed. I think they know that in Twente. At best, a network can be reasonably reliable, and what makes a network reasonably reliable depends on what is reasonable.
Please correct me if I got my facts wrong.
I was hoping for at least a funny. :)
"Not knowing when the dawn will come, I open every door." - Emily Dickinson
oh, forget it.
Time is an illusion, lunchtime doubly so. --Ford Prefect
all sensationalization of every multi-site net admin's worst nightmares aside:
fundamental problems exist with the picture this article painted.
1) a researcher's "data" brought down the network ?
first off, critical hospital functions should be separate. Their own VLAN at a minimum. This is stability we're talking about here.
Second, when told that his data crunching was hurting performance, he should have done what he could to stop the application gracefully, not just "pull the plug"
2) network design.
This is the VLAN issue. a properly designed multi-campus network has separate networks for separate functions. if they were one big flat network, then yea, him pulling the plug would cause all sorts of hell...as each and every switch flooded layer 2 broadcast frames out every port trying to find his station. layer 2 broadcasts (broadcasts in general) are _BAD_.
3) ER physician turned network addict.
I'm not going to bag on anyone, really. However, the article fails to mention his network administration qualifications. How many years experience does he have configuring network gear? did he do this in his spare time? Seriously, seems to me that they need a bona fide Network Engineer or two.
To answer the question: backup network?
many NOCs have redundant networks. Some companies do the same for mission critical network gear. VLANs should be sufficient if admin'd correctly.
As for Spanning Tree Protocol 'failing': I've not heard of such, please point me to concrete examples! As for Cisco's implementation being 'boogered,' I don't think so. It works the way it is supposed to. Yes, their switches offer the option of turning off part of spanning-tree for end nodes (spanning-tree portfast), but properly used this doesn't present a problem (see above, design/qualifications).
A previous poster noted that stability should be of paramount importance in a hospital. They are absolutely correct. However, stability does come at a price, with HIPAA looming over all things EMR (Electronic Medical Records) you have to keep on your toes. Stable may not mean secure, and since that is one of HIPAA's stipulations, you have to go with secure (or relatively so).
That being said: a Layer-3 Switched network should be more than adequate for a multi-campus network with segregated NOC. a fully redundant near-line or off-line network may be overkill, but not altogether unnecessary. With a heavy-iron Cisco Catalyst driving the network at the core and Catalyst 3500 series switches at IDFs this should prove to be a very manageable and strong network.
This outage was caused by a researcher's data creating a storm of data which outpaced the network's ability to cope. The problem was allowing the research data to flow unimpeded across vital systems. The solution is to implement methods of controlling bandwidth, not just routing.
In order to prevent this from happening again, engineers should analyze the system to determine where to put data storage. In this case, almost certainly (although the article is unclear) data was stored in a central location but spanned across several servers and then backed up in another location. One part of the solution is to have distributed data storage spread across the institution and then that data backed up (across a separate network) to a central location.
The data storm itself could be prevented by using QoS bandwidth management. Of course, every network user believes that he/she should have unfettered access to all the bandwidth available, but quietly implementing some well-known techniques for limiting bandwidth usage would have at least mitigated the damage.
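For anyone wondering what "well-known techniques for limiting bandwidth usage" might look like, one classic is the token bucket. Below is a minimal Python sketch of the idea, not how any particular switch or router implements QoS. The class name, the parameter names and the numbers are my own illustration.

```python
import time


class TokenBucket:
    """Token-bucket rate limiter: a sketch, not any vendor's QoS implementation."""

    def __init__(self, rate_bps, burst_bits, clock=time.monotonic):
        self.rate_bps = rate_bps      # sustained rate allowed, in bits per second
        self.burst_bits = burst_bits  # bucket capacity, i.e. the largest burst
        self.tokens = burst_bits      # start with a full bucket
        self.clock = clock            # injectable clock, handy for testing
        self.last = clock()

    def allow(self, packet_bits):
        """Return True if the packet may go now, False if it should be dropped or queued."""
        now = self.clock()
        # Refill tokens for the elapsed time, capped at the bucket size.
        self.tokens = min(self.burst_bits,
                          self.tokens + (now - self.last) * self.rate_bps)
        self.last = now
        if packet_bits <= self.tokens:
            self.tokens -= packet_bits
            return True
        return False
```

A greedy sender gets its initial burst and is then held to the sustained rate, which is the "at least mitigated the damage" part: the research traffic still flows, it just can't starve everything else.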
Finally, routing protocols other than spanning-tree or OSPF should be used. Creative implementation of internal addressing schemes (10.0.0.0 IP addresses) and a combination of BGP and last-resort static routes would certainly help to avoid these sorts of problems. I'm also wondering whether a *nix box running Zebra in critical locations might not reduce the problems. Certainly Zebra can remove the routing load from the Ciscos and, with plenty of RAM and processing speed available on PCs nowadays, could probably improve routing efficiency when a circuit goes down.
But the key to this problem is bandwidth management not routing management. Of course, the next problem could be routing. One seldom has the budget to solve everything.
No one ever had to evacuate a city because the solar panels broke!
We have had similar problems with networks 'going down'. We have many vlans, so just one vlan went down, but it seemed to be a problem with how Cisco does STP for vlans on their newer equipment. Each vlan gets its own spanning tree, but the root identifiers are all the same, and the ethernet addresses for the vlans on our central switch are all the same. Older Cisco equipment had a different MAC address for each vlan. Thus, the root bridge identifiers were all unique, and when two vlans got bridged, loops didn't happen. Now, however, if two vlans get bridged (a computer with a wire in one vlan, and a wireless card in another vlan), the forwarding tables on the switches get confused because there are multiple paths to the same stp root.
This is really confusing to work through, but it really does look like cisco isn't implementing vlans the right way. We can't turn off stp on our whole network, so we turned on bpduguard on as many switch ports as possible. That way, if someone starts bridging, the port gets shut off as soon as a switch sees a bpdu packet. The down side is that nobody can plug in a hub or switch to our network.
It's worth noting that our problem arose when we installed a new central switch, and ran it redundantly. The new switch confused stp root identifiers wherever a bridge occurred.
We have many wireless laptops on our campus, and someone plugged a wireless laptop into a wired connection, which had a different vlan, and turned on windows network sharing, which started bridging the two interfaces.
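To make the root identifier point concrete, here is a small Python sketch. It assumes the classic 802.1D layout, where a bridge ID is a 16-bit priority followed by the 48-bit MAC and the lowest ID wins the root election. The MAC addresses are made-up documentation values, and the per-vlan address assignment is only a model of the behaviour described above, not a statement about what any specific Cisco box does.

```python
def bridge_id(priority, mac):
    """Classic 802.1D-style bridge ID: 16-bit priority, then 48-bit MAC, as one integer."""
    return (priority << 48) | int(mac.replace(":", ""), 16)


def elect_root(bridge_ids):
    """The bridge with the numerically lowest ID becomes root."""
    return min(bridge_ids)


DEFAULT_PRIORITY = 32768

# Hypothetical "newer" behaviour: one MAC shared by every vlan's spanning tree.
shared = {vlan: bridge_id(DEFAULT_PRIORITY, "00:00:5e:00:53:01")
          for vlan in (10, 20)}

# Hypothetical "older" behaviour: a distinct MAC per vlan.
per_vlan = {10: bridge_id(DEFAULT_PRIORITY, "00:00:5e:00:53:0a"),
            20: bridge_id(DEFAULT_PRIORITY, "00:00:5e:00:53:14")}

print(shared[10] == shared[20])      # True: both trees advertise the same root ID
print(per_vlan[10] == per_vlan[20])  # False: the roots are distinguishable
```

With identical IDs, a device that bridges two vlans sees what looks like one root reachable by two paths, which matches the confusion described above.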
"We are all geniuses when we dream"
- E.M. Cioran
What do you mean 'reduced to'? What else are they good for?
Lumberg: So you fired him?
The other "Bob": No, we "Fixed the Glitch"
prisoner# msce18xxxxx. Currently planning my escape.
There was an electrician named Joe at the place I used to work who was counting the days to retirement. He never did a lick of work he didn't absolutely have to, and he never cared if his work would last 24 hours after his retirement.
The NEC (National Electrical Code) was the first casualty of his attitude. But not the last!
I discovered that he carried a heavy-duty plug in his pocket with the two hot leads wired directly together. He called it his "pigtail".
When Joe needed to find what circuit breaker controlled an outlet, he jammed in the pigtail (with an audible *snap* of electric arc) and then calmly walked down to the nearest breaker box to see what had tripped.
You could tell he was working in a building because you'd see scientists running down the hallways tearing their hair and screaming "My research!!! My research!! Ten years of research ruined!!" as the voltage spikes destroyed their equipment...
Their network staff should be looking at all solutions. They know better than we do what their bandwidth and connectivity problems are. I only hope they don't make the same mistakes on both networks.
Pleeeease.
If my network fails, I can always rely on my vast web of fishing string with paper cups attached to the ends for reliable, secure data transmissions to any hospital room via voice communication. Actually I have found the hospital urine cups to transmit at 100mbps rather than the 10mbps the paper cups get.
What this means is that if say some error or catastrophe strikes and destroys the data access (the actual data, the database the interface or logic, etc) then you can have a backup copy that is read only and statically "linked" for lack of a better term. A standalone printer can then start chugging away the pertinent records (in the event of a total network takedown) or even better the various departments and nursing stations (for this instance) could be uploaded with the pertinent areas from the snapshot, assuming even the electronic devices are working. If they are not, then you better walk across the street and start printing up those records paying attention to priority based on condition (patient), scheduled events (operations, lab work due, etc), and most importantly the "flash override" of requested records. Since not every detail of the record is usually needed, merely the more relevant areas pertaining to the matter at hand (e.g. history of lung scans for the oncologist) then only those can be printed out and delivered.
Hand held devices utilizing a backup wireless network could be a method of shooting this data to various departments. If however there is a total disaster, like an EMP attack or more physical attack disabling the entire system, then really you are more screwed by that than anything else so backup is a luxury at that time. In that case you better go by the existing printed reports (you did make hard copies of current procedures and the latest status, didn't you?) and then interview the patient and family... that's what EMS has done, so you can too.
Fix the PROBLEM, not just one particular manifestation or symptom from it.
On a related but different note, ever wonder why the suited monkeys at many large government agencies and companies fail to understand the true meaning of redundant processes, data and systems? That's their job to think about things like that yet they are too busy being used car salesmen and patting their golfing buddies on the back. In war we call them Frag-bait... DO YOUR JOB!
Sure, install a second network... but what about power failures? Or if both networks go down?
The company I work for is too small for redundant networks & servers, but I make sure that there is a manual fall over: fax server doesn't pick up? Fax machines will. db server down? All the forms you need to do your job are available near your desk, and there are tons of extras in on-site storage.
I'm *very* surprised patient data was allowed to ride on a network that had multiple single points of failure.
Hopefully the network engineers no longer work there, or are being properly trained on how to do their job. What would be scary is if they were properly trained already and didn't have the funding to do the proper maintenance.
There should be an appropriate amount of fear of just this type of failure, so that enough redundant infrastructure is available for critical data to ride on.
Three different versions of Cisco IOS in three different locations. All talking (trying to, anyway) to a Linux box running FreeS/WAN. In order from OLDEST to NEWEST IOS release: One VPN works fine. One works if you keep something pinging across it. The third doesn't work at all.
Someone suggests using a Cisco as the "hub", instead of the Linux box. Now NONE of it works. Fancy that...
It's called a dead man's switch. Just have a simple ping going out every couple of seconds across the network from vital nodes, and if ten of them fail in a row, or a hundred, or whatever, then you know someone needs to take a look at it.
Hell, have it go out every couple of microseconds. That's nothing compared to the volume of traffic a network of this size must be expected to handle.
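Roughly what that could look like, as a Python sketch. The `probe` callable is a stand-in for whatever check you like (an ICMP ping, a TCP connect, whatever); the function name, the thresholds and the injectable `sleep` are just my illustration, not any real monitoring tool's API.

```python
import time


def watch(probe, max_failures=10, interval_s=2.0, sleep=time.sleep):
    """Call probe() repeatedly; return the number of probes made once
    max_failures of them have failed in a row."""
    consecutive = 0
    total = 0
    while True:
        total += 1
        if probe():
            consecutive = 0           # any success resets the streak
        else:
            consecutive += 1
            if consecutive >= max_failures:
                return total          # time for someone to take a look
        sleep(interval_s)
```

Counting consecutive failures rather than total failures is the point: one dropped ping on a busy network shouldn't page anybody, ten in a row should.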
The only surefire protection against Microsoft infections is abstinence. - The Onion
I have personally seen applications that used so much bandwidth accessing data across the network that a completely stable network was reduced to a non-functioning state. Case in point is one of my customers (I work for SBC Datacomm) who built the WAN/LAN they are now running on approx 3 years ago, with the knowledge that an outside software company would write a web-based application to run across it. The app uses DB2 data from the mainframe and then uses XML to present the data and edit it by the end user. Long story short the application was WAY bloated and was found (by me when the network was reduced to jelly) to pull about 4 megabits per second when it was processing data....with something like 2000 users all accessing it at the same time you can do the math and see that is a recipe for disaster. Tim Grossner Field Engineer, SBC Datacomm
I don't buy this BS about scrambling for paper forms not used in 6 years. Having worked for a world-renowned hospital ( name withheld ) I know for a fact that each hospital must have in place a manual paper system in the event of a computer failure. These processes are most dreaded as they result in errors and a great deal of lost revenue, but are required for certification.
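Doing the math from the post above, as a quick Python sketch. The 4 Mbit/s and 2000 users are the figures quoted; the 100 and 1000 Mbit/s link speeds are my own assumption for comparison, since the post doesn't say what the links were.

```python
def aggregate_mbps(per_user_mbps, users):
    """Worst-case demand if every user pulls data at the same moment."""
    return per_user_mbps * users


def users_supported(link_mbps, per_user_mbps):
    """How many simultaneous users a single link can carry at that rate."""
    return link_mbps // per_user_mbps


demand = aggregate_mbps(4, 2000)
print(demand, "Mbit/s")            # 8000 Mbit/s, i.e. 8 Gbit/s
print(users_supported(100, 4))     # 25 users saturate a 100 Mbit/s link
print(users_supported(1000, 4))    # 250 users saturate a gigabit link
```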
Of course, if the hospital isn't certified in a state or requires only the 'B.J. Clinton' certification of finger pointing...
I read in a book about the number zero that I mentioned here before that the real cause was someone accidentally left a zero in a line of code, rather than a person pressing zero and crashing the entire network. Perhaps someone tried to execute a command that led to this faulty code being used by the ship's computers?
Maybe this was proven to be false later, I dunno.
Kind of funny though...
Yes, there is always the possibility you might be born blind, but most people don't have that genetic defect. They have two eyes which work very well, even if one of them happens to be broken by a random toothpick accident.
Redundancy is always good in a system where uptime is king. That is why so much of nature has organisms based around semi-redundant designs.
--
Internet Explorer (n): Another bug -- that is, a feature that can't be turned off -- in Windows.
No. His notation is confusing, but his math is correct.
I got the impression that the secondary network would be inactive, unless the primary failed. Therefore an event that brought the first down, would not affect the second.
Unless of course, whatever broke the first, took the second down when it came online...
On a similar note, who wants to bet they'd put both networks on the same power source?
Unfortunately more 'radical' minds prevailed and the project was eventually abandoned after $100M.
Seastead this.
First, I don't have all the details of what happened, nor do I have any idea of what the network looked like prior to the outage. However, I have a general design philosophy based on my experience with teaching hospitals and telco networks.
The concept is that of "a network of networks", much like Cisco's DCN solution for telco operators. This is a series of interconnected networks that are capable of standing alone in an emergency. These networks are normally oriented around particular application/traffic/usage patterns. An example would be a research network for research workstations, a lab network, a cardiac care network, and so on.
All of these networks could exist as separate layer 2 vlans trunked back to the facility data center, if bandwidth is available. Within the data center, layer 3 routing could handle traffic that needed to cross between these networks. The data center would also have separate networks for each application group so that applications aren't able to interfere with each other, generally.
Obviously this is an overly broad synopsis and leaves out many details; it is also just as obvious that I'm talking about a campus environment here and not a WAN, where the same theory will work, but with different implementation.
Sig??? I don't need no stinkin Sig!
If the problem is with spanning tree protocol then they already have redundant connections in place (or they wouldn't need spanning tree). From my experience spanning tree works really well on its own, and is even a little robust to people fucking with it. So the question is, why not deny everyone access to the switches and routers except for one or two administrators. It sounds to me like if they kept people from screwing with the network it would be fine.
Identical networks may or may not offer any backup depending on how they're managed. If there is a strict policy regarding how each network is tested and debugged before changes are implemented on the other, then it might help. Otherwise, in a large network such as this, you'll perpetually be scratching your head as to why the two are different.
Redundancy merely implies redundant functionality, a backup link. This helps only if the backup links use different infrastructure to get from one place to another. But, again, if the two networks are bridging traffic with spanning trees then I still don't see how this helps the situation much.
Diversity is the solution. Use separately powered routers, different links, even separate wiring closets. It's not cheap. It's not easy to manage. But it will provide a connection with far more reliability than the others.
This Hospital network seems to me to be something that just plain grew without much planning. Somehow, it became the great big switched network of everything. This works until someone makes a short circuit link from one node to another and then the spanning tree falls into its belly-button.
I've seen the people with all the right certifications dive right in to that recursive problem and run themselves in to testing circles. The problem is that we don't teach diagnostic thinking in schools or in training classes. I'm not even sure that we can. Problems like this demand a scientific method approach (as outlined so nicely in Robert Pirsig's book "Zen and the Art of Motorcycle Maintenance"). It's slow. It's tedious. And in really tough problems such as this, it's the only method left that will repair the situation. I know of very few people who know how to do diagnostic thinking this way.
It's sort of like the difference between hacking a bunch of code together, doing limited testing and then saying "It Works" --or thinking of a concept, carefully planning the code around it, planning all the testing of each segment of the code, demonstrating that the final assembly of the software works, and then tentatively calling the product "Functional."
I feel sorry for the hospital staff who had to endure this. I hope their misfortune serves as an object lesson to pointy haired bosses about giant switched networks where everybody can see everything. But somehow, I'm almost certain the object lesson from this will be lost on them as they blame a black box rather than the people maintaining it.
Nearly fifty percent of all graduates come from the bottom half of the class!
Remember the phone network outage from maybe 10 years ago?
A fault in the initial system caused the network to go down, and the backup was switched on.
Unfortunately the backup had exactly the same fault, and the software had to be corrected before the network could be brought back online.
thank God the internet isn't a human right.
You need a RAID controller which can handle slightly different drives, and have at least one different drive in each row. Even better if you're using a configuration where two drives by different manufacturers have whole copies of the data, so failure of two drives is not fatal.
Someone failed their vision test...
See that percent sign? The little "%" thingy?
Go Wireless, Use copper for Backup
I'm not talking 802.11, but military grade Spread Spectrum. It would cost a lot less than laying new copper. And if some a$@hole inadvertently starts a DOS attack you could just flip off the main antenna array at your NOC for 10 minutes and let the network reset itself. Also throttle your nodes to say 10 mbit. That way one node can't take down your entire network.
If a storm or other activity takes out the antenna array you still have the old copper. Keep a switch (physical switch, not hub like switch) so that you could walk over to a panel and switch your node over to copper in a jiff. If they both fail then go carrier pigeon, CB's, or cellphones. Nothing like a good old analog message in a pinch.
You say things that offend me and I can deal with it. Can you?
Yes... if you need as close to 100% redundancy as possible, the only answer is complete physical resiliency.
Start thinking about the OSI model and its relevance to this. Yes, you were taught about it for a reason. You can create resiliency at a higher level (e.g. IP) but if you're relying on a single physical or datalink structure the network will always be prone to failure from physical or datalink issues.
I work for an ISP. We would never give an SLA over 99% (e.g. 99.999) unless physical redundancy was included.
The only problem is, it's so expensive it's hard to convince anyone the extra 0.999% is worth it... until they experience what happens when you ignore it.
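To put numbers on what that extra 0.999% buys, here is a small Python sketch that turns an availability percentage into allowed downtime per year. It assumes a plain 365-day year and that downtime is simply (1 - availability) x time; real SLAs usually carve out maintenance windows and the like, so treat it as back-of-the-envelope only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_pct):
    """Hours per year a service may be down and still meet the given availability."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


print(downtime_hours_per_year(99))            # roughly 87.6 hours, over three and a half days
print(downtime_hours_per_year(99.999) * 60)   # roughly 5.3 minutes
```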
First of all, this was apparently a flat layer-2 network. From the information I have seen, it was a very large network. Spanning tree is a wonderful protocol and layer-2 networks are not bad things, BUT spanning tree is very complex in a large network, and latency is going to be an issue if there are no routed boundaries to control traffic. I have experience in designing networks for hospitals (and financial institutions and universities and gov't institutions), so I am aware that implementing layer-3 to the edge is not necessarily feasible for many reasons - financial, legacy setups, etc. That being said, however, there should be some layer-3 at some point to segregate traffic and protect the critical pieces of the network. Identify the critical points of the networks and put redundancy there - i.e. the server farm, critical care monitoring systems, WAN connection. All network equipment vendors have some type of redundancy feature that would take care of automatic failover for these devices.
Full redundancy is impossible - are you really going to have dual NICs in every workstation and expect that everything would just work in the event of a failover? First of all, the expense would be incredible, and the maintenance would be a nightmare. If they are like most institutions, they are already understaffed and overworked - they wouldn't be able to keep something like that together. Dual-home closet switches to redundant routers/switches that are in turn dual-homed to a redundant core. Servers should have multiple NICs that are attached to multiple switches specifically to provide redundancy.
The worst problem here, though, was not the network itself. This is probably the most prevalent common problem to all institutions - they had no test environment. As multiple other posters have pointed out, this experimental database should never have been attached to a production network, regardless of the expected impact it might have. The key word about it is EXPERIMENTAL - you don't know how it might impact anything. As long as there is no separate environment for testing, there's really no such thing as redundancy no matter how the network is configured.
Say, for example, that the application took down the primary network, so the secondary comes up and takes over. Did anyone realize what caused the failover? Probably not, since a properly configured network will failover in a matter of seconds. So, the application is still running. How long until the secondary network fails as well? Then all of the expense and reconfiguration that went into building the redundant network were for nothing.
If this hospital is like most, they have an extremely diverse hodgepodge of equipment - some incredibly old stuff that they keep around because it works and some really cool cutting edge gadgets that everyone can see the benefit of. They've also expanded the network as needed and tried not to take anything down when they did it, so what they've ended up with is a logical rat's nest. VLANs probably have been created, but they're probably trunked everywhere, because the goal of the expansion was to connect more devices, not to segregate by function. Hospitals don't get down time, so it's not a simple thing to say that things have to be reconfigured. Odds are that the workstations may not all even be on DHCP, so changing an IP may require a person (back to that understaffed thing again) touching possibly hundreds of workstations. Yes that needs to be done, and I don't know a single network admin who wouldn't agree, but when you have to have outages cleared by upper management who are going to be chewed by the board if the time frame goes longer than you expected, it turns into a lot more than just what is actually best for the network.
The solution: use down time wisely. Stage implementations and keep them within the allotted time frames. And DOCUMENT. I know - nobody likes to do the documentation, but I think we can all say that it's a lot easier to plan migrations if you have documentation of what is currently there. Realize that no matter what you do, it's not going to last forever. Your cable plant probably has a lifespan of 10 years (not to say that you may not get 20 or even 30 years out of it, as long as you're willing to stay slow), but your network devices will probably only be there for 5 years. Are you still going to be there for the next change? Probably not, so be nice to the company and to the people who follow after you and document.
Just my $0.02, and I'm just that blond chick, so what do I know anyway...
So what are you going to do? Bleed on me?
0 train fails = 0.9 * 0.9 = 0.81
1 train fails = 2 * 0.1 * 0.9 = 0.18
2 train fails = 0.1 * 0.1 = 0.01
which means that the probability of having at least one train going from NY -> LA is
... 98%, much better than the previous 90%.
Erm... to quote you, "I think you made some mistakes."
100% - 1% = 99%.
81% + 18% = 99%.
How'd you get 98% out of those numbers?
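The numbers being argued over are easy to check. A short Python sketch, assuming two independent trains that each fail with probability 0.1 (the figures used in the quoted post); the function name is mine.

```python
from math import comb


def failure_distribution(n, p_fail):
    """P(exactly k of n independent trains fail), for k = 0..n (binomial)."""
    return [comb(n, k) * p_fail**k * (1 - p_fail)**(n - k) for k in range(n + 1)]


dist = failure_distribution(2, 0.1)   # approximately [0.81, 0.18, 0.01]
at_least_one_runs = 1 - dist[-1]      # everything except "both fail"

print([round(p, 2) for p in dist])
print(round(at_least_one_runs, 2))    # 0.99, i.e. 99%, not 98%
```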
...upgrade to gigabit.
I suggest Foundry equipment.
Really... I mean who needs proprietary layer 2 and 3 spanning tree/routing protocols? Anyone caught out using them deserves the pain they suffer.
Don't forget that in the real world some train failures cause derailments which destroy the track which goes in the other direction.
Defensive design requires considering both probabilities and physical reality. Lightning is less likely to damage fiber than copper, but copper might be better in a very hot environment (not that I'd like to run the network of a steel mill). The chance of two identical Cisco networks failing is small, unless the failure involves behavior of Cisco equipment which even Cisco engineers can't change.
That this happened in a teaching hospital, rather than a large corporation, makes their response much different.
They have been open about the problem, in a way that a for profit corporation could never be. This allows the rest of the world to learn from the experience.
I'd put the focus into refining the paper system.
It's the simplest form of communication, and
The most flexible when responding to a crisis.
If al-q, for instance, takes out your power source,
your 3 meg parallel system is pointless... just how
long are you going to run on battery backup?
I wonder who developed their systems. Can we get them to work on palladium?
;)
I suppose that if the problem was in a microsoft application, they already are
Contact Me (got tired of viruses emailing me).
I don't really understand all of the comments saying a redundant network infrastructure is bad/stupid/etc.
If your network is critical to your business, you should absolutely consider backing up every bit of that network with one (or more?) redundant components. This means every router should have a redundant pair, every physical network link should be redundant (including how it's routed through the building), every firewall, switch, etc. If you have mission-critical servers, they should have two NIC cards. Upgrades should never occur on both "sides" of the infrastructure at the same time, and both sides should be capable of running alone.
Not only does this type of configuration resist failures, but upgrades or configuration changes to the A or B side should never impact the other side, and if it does, you should be able to shut down the offending sections without impacting availability.
If your network staff doesn't understand these concepts, you desperately need to train them better. If the expense cannot be justified by management, then that's a business decision and when failures like this occur, they should not be surprised.
Slashdot network engineers (armchair and professional): do you think the answer to having a massive and unreliable network is to build a second identical network?"
Since Michael asked it like that I will leave behind my network engineer role (professional) and pick up my role as armchair mathematician.
The item to be doubled is a network. Unreliability and massiveness are qualities of that network. So, using the distributive property of multiplication this would give us the equivalence of one network that is twice as large and twice as unreliable as the original.
This recovery plan should include all the P's - People, Process, Place. If the plan doesn't account for all three it won't work. Why build a plan to keep the network up if the building no longer exists or the workstations don't have power to operate? Why build a redundant process if a flu bug can take out the people? Why have a plan in place if no one is trained to implement it? Why why why?
Techs fall into the trap of looking at those bits and bytes, while failing to take into account the entire BUSINESS process.
But only if you make sure that both networks are connected as mirrors to each other with a single non-redundant router so one network can bring the other one down.
date; gunzip; strip -v; touch -c; finger; mount -s; fsck -V; more; yes
"Waitress I need two more boat-drinks..."
I lived in Boston until 1999 and had my (ruptured) appendix removed at that hospital. That place is absolutely HUGE, many city blocks in size. Its network must be huge too and that's the problem. A LAN that size HAS to be sub-netted into smaller segments! Now, I'm not a whiz bang Network engineer, but I do know when something's done WRONG, and it sure seems like this is the case here. Building a parallel WRONG network won't solve the problem, it'll DOUBLE the problem! There are many gifted people here....why not come up with a solution for them here? Consider it a public service to a very public oriented hospital.
Healthcare networks (at least the ones I've built) require extreme amounts of failover and a high tolerance for error. As an example a 'small' hospital radiology department ( 200 studies/hour) has recently gone all digital removing all save one film processing unit, if proper controls are not in existence and a single point of failure exists in the department then an entire hospital could be without diagnostic imaging. Hence it is essential to develop not one failsafe but three (four counting a reversion to manual procedures w/ triage for critical situations). From telemetry being broadcast from a patient's room to a central nursing station to LIS (laboratory information systems) moving data to HIS (hospital information systems) failover and failure planning is key. Build a parallel network, build five; it matters not if you have not done critical assessments and failure planning.
I have been through this dance before (I design large mission-critical computer systems for a living). The words "spanning tree" caught my immediate attention, since I have faced similar issues while trying to build Ethernet networks into an approximation of a mesh topology. It can be done, but it tends to be fragile, and it is REALLY easy to introduce loops if you are not careful. The solution: ATM. (I know - insert derisive laughter here) ATM was designed for mesh topologies, and incorporates a least-cost-routing algorithm to help traffic negotiate the multiple paths between network nodes efficiently. It is a great solution to form the core section of a campus backbone, with edge devices to translate between Ethernet and ATM for traffic to and from the network clients. It will never happen though. ATM is not even on people's radar screens, much less actively considered for deployment. I have had no luck suggesting it as a solution in my network designs either. *SIGH*
Well, mostly transparent to end stations.
Some workstations turn up their ethernet link by software, and then try to use the port right away to, for instance, obtain a DHCP lease.
Spanning tree starts doing its work as soon as it sees ethernet link. So, there's a delay between the time the link comes up and when traffic starts to pass.
Apple's DHCP implementation was bitten by this on some of their machines, affecting the startup of the Appletalk stack, which unlike DHCP, will not retry its initial auto-configuration and address discovery.
I've always been skeptical of "intelligence" added to layers below 3. There are always unforeseen interactions and consequences to ANY variance from a set standard.
- Peter
INsigNIFICANT
Reading this thread perfectly illustrates the largest hurdle to clear when troubleshooting any major network issue. I have no way of knowing how many people were engaged in the resolution of this issue, but in my past experience with similar situations there are always way more hands reaching for the cookie jar than the jar can handle. Imagine trying to get everyone that's posted here to agree on a single next step. Difficult at best, and we haven't even talked to management yet!
It seems everybody has fallen under the spell of Ethernet. There ARE other networking technologies out there which have not had to be "patched" over the years to stay viable today. Ethernet was never designed to be redundant; spanning tree is merely a band-aid, as is almost every technology available for Ethernet. Traffic management could have saved this network. Cisco's attempt at Quality of Service, really Class of Service, may have made a difference. To build two redundant Ethernet networks is ridiculous. If you are going to spend the money, do it right: use a technology which was designed for very, very large networks. Build a carrier-class network. Use a technology like ATM and build a redundant mesh. ATM was designed from the ground up to allow for redundancy and Quality of Service, true Quality of Service. Redundant links will NOT be disabled; they will be used in a load-sharing manner, increasing backbone availability and capacity. The problems are inherent in Ethernet. An enterprise network of this scale should not be built with a cookie cutter. Ethernet is great for a home network and small enterprise. But very large networks should look for alternative technologies.
while i agree that the root of the problem should be fixed... lesson number one in network management is BACKUP EVERYTHING WITH A DUPLICATE..
/routers should be in at least 2's.
While this may be a patch-over-problems way of handling things, it handles one VERY important aspect of doing business... FAST EMERGENCY RECOVERY...
truth is if one protocol didn't cause the disaster then maybe a central server would have gone down in a few months causing another like disaster... or maybe a top level switch begins to malfunction causing trickle down network problems, or maybe two hard drives in a RAID unit fail simultaneously... all of these are pretty bad scenarios....
Solution? double them all up within reason.... and then back them all up... mirror your raid's... and have several backup servers... have a secondary bank of switches to swap in an emergency so you can fix the first bank...
i wouldn't go so far as to back up the client equipment, but realistically, if possible, everything in the server room down to your T1-T3 connections
how many serious companies do you know that operate with only 1 T1 in house?
--Enter the sig--
--Idiots, Every single one of YOU, A flaming mass of conglomerated morons, hey wait a second, isnt that how RAID works?
You're awful smug. Most small business, high schools, etc.. are not going to invest a lot of money in the network no matter what. They look at it as another sales gimmick and come-on. As far as badly designed networks: Sometimes you just make the best of what you are given and if it turns out badly, well you do the best you can..If the business says I have $1200 to spend on the two idf's serving 400 users my choices of switches and gear is pretty much decided for me, and it will suck, and you will come in and complain about poor planning and badly designed networks. Doesn't mean you have a duck's fart of an idea what went on, but you get to whine which is fun for you I guess.
Oh, you think I'm joking? There are copiers which are basically a scanner -- and they can make large numbers of copies very quickly by using several printers simultaneously.
(I don't know if that was actually a problem in this situation)
Let your imaginations wander, and ponder a point in the future when all of our health care facilities will be run on Microsoft... .
Read the EFF's Fair Use FAQ
You do actually have the VA system, VistA, which is free software with source available under the US FOIA (I'd like one of those here). I was in LA earlier this month at the meeting of OSHCA, the Open Source Healthcare Alliance, which has been working for three years on this and similar ideas and practice. Help is gratefully received... http://www.oshca.org/ VistA is maintained by the Hardhats, http://www.hardhats.org/ and has recently been ported to run on Sanchez' Open Source (GPL) reimplementation of MUMPS, or M as it is now called. So it is possible to have, and in fact I have on my laptop here, an Open Source hospital information system including the physician order entry system, running on a GPL'd database management system of long pedigree and industrial stability, on top of a GPL'd Operating System. You can get GT.M from Sanchez or SourceForge http://sourceforge.net/projects/sanchez-gtm , VistA from the VA or WorldVista, and then merely face a cliff-like learning curve for the domain knowledge, the programming language M, and the huge and complex system itself. But that is just work, no philosophical problems at all. The problems I am looking at are less about the actual technical programming or the development of Knowledge Service and decision support components (although the latter is a wicked problem and the former non-trivial and not finished yet) and more on the socio-political side. Realistically we need healthy companies to make a living by aggregating, installing, supporting, developing, and generally looking after our systems. What we need to get rid of is the lock-in to a vendor whose expected lifespan is an order of magnitude shorter than the lifespan of the data, the organisation that depends on it, and the patient - I bang on about "Ars Longa, Vita Brevis" on that.
There's always "The Formula" (a la Fight Club) to consider. Cost of a.) installing/maintaining said redundancy vs. b.) losses/liabilities incurred by primary system failure without redundancy. Work in the likelihood of failure and the value of Public Relations as a factor. If A > B, you don't make the redundant system. You simply accept the losses or downtime.
In this instance, the hospital needs to thoroughly investigate how the downtime impacted patient care. If the access to records proved to be just an inconvenience, well... who cares. Paper systems might be slow, but they worked for centuries before computers came along.
But if there were serious lapses caused by the outage, they need (at minimum) an isolated workstation that can access and print those records for distribution by hand. Parallel systems alone cannot guarantee 100% up-time. They'll apply the formula based on their own risks and loss control policies and make the decision.
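As a rough sketch of "The Formula" described above: compare the cost of redundancy (A) against the expected loss from failure (B), where B is taken here to be the likelihood of failure times the loss plus PR damage per failure. The function name, the way the PR cost is folded in, and every dollar figure are assumptions made up for illustration, not anything the hospital published.

```python
def worth_building(redundancy_cost, failure_probability, loss_per_failure,
                   pr_cost_per_failure=0.0):
    """Return True when the expected loss (B) exceeds the redundancy cost (A)."""
    expected_loss = failure_probability * (loss_per_failure + pr_cost_per_failure)
    return expected_loss > redundancy_cost


# All figures below are invented for illustration only.
print(worth_building(3_000_000, 0.05, 20_000_000, 5_000_000))  # False
print(worth_building(3_000_000, 0.25, 20_000_000, 5_000_000))  # True
```

With a 5% chance of failure the expected loss is below the cost of the redundant system, so A > B and you accept the downtime; at 25% the answer flips.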
Mail any lucrative^h^h^h^h^h^h^h^h^h job offers to:
Former MIS Director,
Beth Israel Deaconess hospital
Boston, MA 02215
Don't worry, you will pass it eventually. (.01 = 1%)
I'm a signature virus. Please copy me to your signature so I can replicate.
No application can cause a spanning tree loop. It is simply impossible.
A spanning tree loop causes broadcast frames - correctly used in small numbers in many different circumstances - to loop endlessly about the network (clogging it up), using paths that are provided for redundancy, but which are normally stopped from passing traffic by the "spanning tree protocol".
There are 2 likely causes:
Unidirectional link failure. If a connection between switches passes traffic in only one direction (normally they are bi-directional), then spanning tree can be 'fooled' into allowing traffic on a path that creates a loop and lets frames loop endlessly.
Misconfiguration of switches, possibly combined with erroneous cabling. If spanning tree is configured off on a port, (or, maybe, put into a mode called portfast), it's possible for interconnection of switch ports (through a crossover cable or other means) to cause this to occur.
A third possible cause is that the spanning tree software itself screws up and allows a loop when it shouldn't have. This was known to occasionally happen in Cisco switches some years ago. I haven't heard of it lately.
This all happens way below the application layer. Unless the application is specifically written to send huge numbers of broadcast frames (there is no legitimate reason for an app to do this), it couldn't bring down the network. And even if it did, this would not be a 'spanning tree loop' and disconnecting the offending station would immediately fix the problem.
Probably, the network should be using routers to partition it into smaller LANs. But this can still happen to any single LAN so created, and if it happens to the one your servers are on, you're still cooked.
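To illustrate why a loop is so destructive, here is a toy simulation. It assumes idealized switches that repeat a broadcast out of every port except the one it arrived on, and it relies on the fact that Ethernet frames carry no hop limit. The topologies and switch names are invented; this is a sketch of the mechanism, not a model of the hospital's network.

```python
def flood(links, start, max_hops):
    """Count frame copies in flight at each hop when switch `start`
    floods one broadcast and every switch repeats it out of all ports
    except the one it arrived on. Nothing here ever expires a frame,
    because Ethernet frames have no hop limit."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    # Each in-flight copy is recorded as (sender, receiver).
    in_flight = [(start, n) for n in sorted(neighbors.get(start, ()))]
    history = []
    for _ in range(max_hops):
        history.append(len(in_flight))
        next_flight = []
        for sender, receiver in in_flight:
            for n in sorted(neighbors[receiver]):
                if n != sender:
                    next_flight.append((receiver, n))
        in_flight = next_flight
    return history


tree = [("A", "B"), ("B", "C")]                   # redundant link blocked
triangle = [("A", "B"), ("B", "C"), ("C", "A")]   # one loop
full_mesh = [("A", "B"), ("A", "C"), ("A", "D"),
             ("B", "C"), ("B", "D"), ("C", "D")]  # several loops

print(flood(tree, "A", 6))       # [1, 1, 0, 0, 0, 0]
print(flood(triangle, "A", 6))   # [2, 2, 2, 2, 2, 2]
print(flood(full_mesh, "A", 6))  # [3, 6, 12, 24, 48, 96]
```

On the loop-free tree the broadcast dies out after two hops. With a single loop it circulates forever, and with several loops the number of copies doubles on every hop, all from one frame.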
"that's not encryption - it's a new perl script that I'm working on..." - from some Matrix parody
Networks go down.
You cannot always define the root cause.
It is never one person's fault entirely.
Every network will go through some type of major outage in its lifetime. That is unavoidable.
When your network does go down the thing that matters most is what you learn from that outage. Hopefully afterwards you will use what you have learned to improve.
It is not for us to judge the network as it existed, as we have no way of knowing all the variables contributing to its design. However, if nothing is learned or there is no improvement in the infrastructure, then you have most definitely left yourself open for comment.
As a network engineer who works on spanning tree daily, I can certainly appreciate the complexity of the situation. Spanning tree in itself is a fairly easy protocol to understand. But when you combine it with HSRP, VTP, trunking protocols, Ether(or gigE)channels, inter-VLAN routing, etc., things can quickly get out of hand. The problem most certainly could have been resolved more quickly with proper documentation of the as-built network. Cisco is no slouch at solving these things. With all the manpower they put behind this, I'd have to say the team of StormTroopers sent by Cisco first had to document the as-is network before they could really pinpoint the problems. I don't really agree with the second additional network. Now we can have two broken things instead of one. Any time you increase your redundancy you also increase complexity. I think that this will serve as a good lesson learned for the IT staff there.
Of course it is the best solution to make a fully redundant identical network. After all how else is Cisco going to maximize the profits?
Here is a suggestion: why not contract some consultants who do not tie their paycheck to how much product they manage to convince you to buy from their employer?
An independent consultant
You're referring to alternate physical paths. They are talking about a completely separate network. A very silly idea.
Insert offensive troll-style sig here. Please mod or respond appropriately.
Between the Yorktown being lamed by a 0, to the hypothetical bridge with a 300 lb guy on it, to the Hospital's network being brought down by whatever ... somebody ... Somebody ... SOMEBODY knows the truth. The guy that did it. Somebody did something, and BANG! the system got fuxored.
Instead of spending DAYS letting the corpse recovery crew autopsy the network - just say something. Admit that you screwed the pooch, admit it early and admit it often. Be eager to accept and admit that you fuxored the system and be eager to explain exactly what you did. (*)
This does two wonderful things for you:
1. Because they don't have to spend days finding someone to blame (because you eagerly accept the blame) and because they already know what the problem is (because you told them what you did) they can get it fixed in about 1/4th to 1/10th the time (because they already know what the problem is and don't have to dick around trying to figure out who to blame it on.)
2. When something really, really, really bad happens (think the Battleship Iowa main gun explosion, or the $1.3B lost in derivative trading by that banker from England, or Apollo 13) you have already established a history of eagerly admitting when you screwed up and eagerly accepting responsibility for your mistakes, so you basically get one 'get out of jail free' card. Just say 'hey I always eagerly admit it when I blow it - if this one was me I would have already said something.'
(*) - Note : this only works in places where they let people make mistakes and don't destroy your future for them.
Glonoinha the MebiByte Slayer
Staff had to scramble to find old paper forms that hadn't been used in six years so they could transfer vital patient records and prescriptions.
I call sensationalist bullshit. It takes at most 15 minutes to switch over to a fully paper hospital here.
Either that or their hospital is really really shitty.
I live in a giant bucket.
would you suggest? They could try and get some of that cool fairy dust the IBM commercial talks about, but I am betting it is really hard to find
Well, that's it you see! Alan Ralsky thought it said spamming tree protocol and tried to use the network!
General Relativity: Space-time tells matter where to go; Matter tells space-time what shape to be.
Having a second network available as a backup has several problems:
One) How do you connect it to the systems? Move the wires? Impractical. The edge devices that actually connect to the systems have to be part of both networks.
Two) Who's to say that it will not have the same design flaw as the primary? Have the second network be designed by a different individual, AND with different design goals. Similar goals can produce similar results. Have the second network be designed strictly as a backup.
Anyway, in the vast majority of cases, rather than having a second network, you're probably better off having that second person review the work of the main person. Having and MAINTAINING a second network is only valid under a narrow range of situations. OTOH, perhaps a hospital is a high risk enough environment to warrant that.
Personally, I think having a backup paper/whiteboard/people system (which they appeared to have), is the right solution, as this is also useful under emergency situations (earthquake, extended power outage, war, etc).
My two-bits worth (was there ever a coin in America called a bit?)
Brad
Well, this explains what happened when I was there after being hit by a truck. The doctors were great but the place was very disorganized. Hrm.
It would be the fault of the fat person. You always blame anything on the fat person, because they're always the ones screwing the rest of us up.
... And one that is hard to argue with because it seems to make so much sense is post hoc, ergo propter hoc. For something to be a valid proposition, it must meet two conditions, necessity and sufficiency. When someone pulls an "It happened after that happened" trick to pin blame, they are meeting the necessary condition with the apparent causal relation of actions. This is the stronger condition intuitively for people. But under the sufficient condition, we must show that there is evidence to support the causal relationship. Supporting a claim is counterintuitive. Just ask any foreign policy maker in the US...
Comparing it to Windows will be a moot point, since El Dorado is going to have a 40% larger code base than XP.
I don't know what they do with their M-16 rifles in Israel but if I had one... I'd stove it up your ass so your head could some company.
Stupid fuck...
I would say it sounds like they need a better network, not necessarily another network. To me it is silly and unnecessary to build a whole other "backup" network. I would simply upgrade their current infrastructure. I would upgrade servers, get more bandwidth (multiple T1's or T3 if necessary), get better routers/switches, etc.
While it wouldn't have allowed all workstations back online, throwing down a Cisco WiFi network within the buildings to create an "emergency network" would have taken a number of hours, and gotten enough of the net back to allow for patient tracking and record keeping.
First one to say security would have been broken in a short time evidently hasn't used the automated rolling WEP implementation Cisco has.
I'm surprised Cisco didn't have a LAN/WAN setup in a crate, complete with servers to handle the authentication, sitting somewhere ready to deploy in an emergency (think 9/11).
...networking glitch is brought down by hospital!!!
I was an operations manager for a large hospital for several years, and planning for things such as this should be a number one goal for IT staff.
The first rule in anything to do with hospitals is to ensure that they have disaster plans in place and that these are tested on a regular basis. The disaster plans should include scenarios such as total power outage, failures of vital equipment etc.
The second rule I used was to ensure that in critical areas there was a second independent network path that if needed could be isolated from the rest of the network. Usually this meant putting in a run of fibre that bypassed buildings etc.
The third rule is to ensure that vital equipment can be run without need for a network. Nothing should be so dependent on networking that if there is a failure it will stop it from working. If networking is a requirement (e.g. Medical Imaging) that network should be independent from the main network.
The fourth rule is to ensure that there is a secondary method of accessing electronic patient records in the event of an extended downtime. I wrote an application that would dump the most needed patient information and leave it available on PC's in critical areas in query only mode. This allowed access to most of the patient details for using the patient forms.
beat him, harshly and thoroughly
While it is true that an application could not have caused this problem, is it possible that a poorly designed application allowed the problem to continue, whereas a well-written application might have prevented it from being triggered? For example, setting the TTL too high. If this had been taken into account correctly, wouldn't that have prevented the whole problem in the first place?
Google whoring works better when you log in, wussy =).
Cheers,
~Tris.
-----[0_o]-----
We are not amused.
I am also planning a hospital network, and what we have is a network problem because we are out of money. We cannot afford even one decent network !!!!
...
;))
You guys are talking about production environments with big budgets
I think our patients should pray often
Just use twin co-ax... there should be plenty of that lying around these days.
--WooooHoooo--
I work for a company that supplies real-time market data information. Our office is just 3 blocks from the World Trade Center. The data center is actually just across the street from the WTC. On 9/11 this data center was damaged and unusable, but since we had a duplicate data center in NJ, we were able to be up and running a week later. We could have started sooner, but the markets were closed, so there was no data to send out. We also have three separate environments: development, staging and production. Only QA-approved software is allowed to run in the production environment. A hospital should have a backup network in case of catastrophe.
I'm sure someone has already pointed this out, but I didn't feel like reading - or even scanning - the 400+ posts. Sorry, it's a lazy day.
Anyhow, yes, having a second identical network would make sense if it is affordable. This would be a test lab. However, would you want to recreate every node that is on the production net? Probably not.
In this case it wasn't the network that failed, but a single application that generated a ton of network traffic when it was opened. Reminds me of that old poem about computers not doing what the user wants, but only what it's told. Don't blame the net for bad software.
Would it be fair to say that the bridge collapsed because a 300 lb man was on it?
CowboyNeal struck again... Of course it's his fault, he does it just for fun, the sadistic bastard.
BIDMC just recently announced they had job openings in the field of networking...
The last time I had a problem with a spanning tree algorithm I lost 12 points on my CS final!
Ok, so seriously, I'd be embarrassed if I screwed up a spanning tree algorithm on a test. If it took Cisco engineers 6 days to fix it, it musta been something really quirky, most likely the software not configuring something right. I can't imagine an application problem that would hose a network past a power toggle.
paintball
This is what happens when people do not want to pay the money for quality network engineers. If you're not willing to invest in your network, it ends up becoming a kludge. Education is the key. If your network engineers aren't knowledgable enough to solve a STP problem and have to rely on Cisco's TAC (many of which are just as unknowledgable), then you're walking on a thin line. I realize hospitals are low on money, but any mid-level engineer should have been able to solve this problem within a few hours.
The above is specious. I know nothing about the network or campus in question. I'm sure the folks on hand know what to do. Good luck.
Friends don't help friends install M$ junk.
This article is the very reason why I argue against using spanning tree. I have seen many similar outages. It does not help when the very thing that is there to help prevent the system from going down can cause it to go down.
I always opt to have an identical switch that I can fail key systems over to manually.
Spanning tree caused them to be down for days.
Why not build a switched network with no loops? Then if a switch fails it only affects the systems on that switch. And if you have the budget there will be a second identical switch powered up right under it in the rack. You unplug all the patches and plug them into the other switch. Down time = way less than days. The network becomes far simpler, ergo much simpler to maintain and fix.
Just my 2 Cents.
Why not buy M$ wireless 802.11b install W2K/XP on every computer and set up an MS exchange server. Who needs BSD when you have M$ :)
just kiddin', the uptime of the above-mentioned network would be measured in nanoseconds, and then they would have to switch to the MS paper'n'pen method
Live for the present, learn from the past, and dream of the future!
First off.. They should hire me.
Whoever designed a network of that size without redundancy in the first place is just stupid beyond compare.
If they say.. Oh.. it was the finance people who said we could not have the money for redundancy.. Then you jump your prices in your quote in the first place and tell them its single homed and build it redundant anyway.
Spanning tree has no forgiveness in it at all, probably someone put in a bad route or something and everything exploded.
I have designed many networks, and ALL of them have at least SOME level of redundancy.. most are complete hardware mirrors, but some are just extra paths or just extra cards in the switches to move cables to in case of x y or z problem.
This is most likely someone entered bad data into one single switch somewhere and it took Cisco forever to find it.. and of course guilty party didn't want to admit to doing it because he knows its his job.
Just three more hours seapeople and you can finally take me away from this crappy God Damned planet full of hippies
My wife is a doctor. From what I've observed hospitals tend to be penny wise and pound foolish, particularly with regard to their computer systems. Largely for financial reasons they are generally unwilling to hire the IT professionals and spend the $ they need to do the job right.
The computer systems at my wife's medical school were apparently run by a herd of poorly trained monkeys. Systems would crash constantly, admin policies were absurd, and very little was done to fix anything. At her current hospital, the residents in her department are stuck with machines that literally crash 10+ times daily. Nothing is done to fix them because that would take expertise, time and $, all of which are either in short supply or withheld.
Hospitals really need serious IT help and it is a very serious problem. This article just illustrates how pathetically bad they do the job right now. I wish I could say I was surprised by this but I'm not.
The kidneys are internally redundant. You only need 10% kidney function to continue to survive. Ditto for the liver and other organs (aside from the heart). They take years of abuse via smoking or drinking before they finally start to wear out to the point of causing system collapse.
--
Internet Explorer (n): Another bug -- that is, a feature that can't be turned off -- in Windows.
Algorhyme
I think that I shall never see
A graph more lovely than a tree.
A tree whose crucial property
Is loop-free connectivity.
A tree that must be sure to span
So packets can reach every LAN.
First, the root must be selected.
By ID, it is elected.
Least-cost paths from root are traced.
In the tree, these paths are placed.
A mesh is made by folks like me,
Then bridges find a spanning tree.
---Radia Perlman
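The steps in the poem can be sketched in a few lines. This is a centralized toy version that assumes every bridge ID and link cost is known up front; real bridges reach the same result by exchanging BPDUs and also break cost ties by bridge and port ID, which this sketch ignores. The bridge IDs, the mesh, and the uniform cost of 4 are invented for illustration.

```python
import heapq


def spanning_tree(bridge_ids, links):
    """links maps (a, b) -> cost. Returns (root, tree_links, blocked_links)."""
    root = min(bridge_ids)  # "By ID, it is elected."
    adjacency = {b: [] for b in bridge_ids}
    for (a, b), cost in links.items():
        adjacency[a].append((b, cost))
        adjacency[b].append((a, cost))

    # Dijkstra from the root: "Least-cost paths from root are traced."
    best = {root: 0}
    parent = {}
    heap = [(0, root, None)]
    done = set()
    while heap:
        cost, node, via = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if via is not None:
            parent[node] = via  # "In the tree, these paths are placed."
        for nxt, link_cost in adjacency[node]:
            new_cost = cost + link_cost
            if nxt not in done and new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt, node))

    tree = {frozenset((child, up)) for child, up in parent.items()}
    blocked = {frozenset(link) for link in links} - tree
    return root, tree, blocked


# A small mesh "made by folks like me": four bridges, five links.
mesh = {(1, 2): 4, (1, 3): 4, (2, 3): 4, (2, 4): 4, (3, 4): 4}
root, tree, blocked = spanning_tree([1, 2, 3, 4], mesh)

print(root)                                       # 1
print(sorted(tuple(sorted(l)) for l in tree))     # [(1, 2), (1, 3), (2, 4)]
print(sorted(tuple(sorted(l)) for l in blocked))  # [(2, 3), (3, 4)]
```

The two blocked links are the redundant paths: they stay physically connected but carry no traffic unless the tree has to be recomputed.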
"Those who make peaceful revolution impossible, make violent revolution inevitable" - JFK
I certainly agree with you, and I don't expect small companies to hire a team of network engineers with 6 figure salaries to handle their network. The small to medium sized company probably uses the network for file sharing, email, and internet access; certainly they could get by for a few days if they had to. On the other hand, this hospital couldn't access patient records because their network failed. So, if I'm a nurse and my patient is in pain, how do I find out if I can give him or her morphine? Has he/she had some already? What if he/she is allergic? All this (I assume) is in their patient file, on the network, which they can't access.
Swannie
Moderation totals: -1:smug
:q!
If I might be so bold as to pose an alternative probability of failure.... given that if one train uses the track and it's P=10%, then if a 2nd train is added going the opposite direction, and if both trains use the same track, then the probability of failure is 100%, as they will collide.
It's always nice to see those people doing useful work for a change.
Instead of making parallel networks, they can simply use parallel servers with the linux terminal server project, if a server dies, the other one will take over operations 8)
Sounds like a standard UPS system to me. You have the grid feeding banks of batteries. The batteries feed the hospital. The generators are between the grid and the batteries, but they are not wired in such a way as to allow a generator failure to disrupt power from the grid. If the grid fails, no one notices because the batteries are what feed the hospital. After a few minutes, the generators start and they keep the batteries full. Once the grid is back on, the generators shut down.
I'd rather you do it wrong, than for me to have to do it at all.
Because Cisco can't fucking get it right, especially where multiple VLANs are concerned.
Search cisco.com for "spanning tree caveats"; filter the results by IOS release versions and check the number of open or unresolved caveats for which there is no workaround.
It shouldn't take you more than a week to go through them all.
Pain is merely failure leaving the body
So true. Redundancy is king.
The only nitpick thing I have is that a subsystem in Standby mode quite often will actually do its own processing of the data, because if one machine corrupts the data you still have one good box. Only when you first bring a redundant box to standby ready will you actually see a data synch.
Oh and if that wasn't enough, if all else fails with our system there is a separate fallback system (written by another contractor) that will step in and take over the displays.
When ATC systems go down they route traffic around the down sectors because 1500 tracks in a small airspace is impossible to control safely without computer systems.
Your post reminded me of someone on /. a while back talking about open source ATC. I laughed at him then and I'm still laughing now :)
And as for testing... well, let's just say my job is only 20-30% code and the rest is test and requirement.
Actually there is more truth to that than you know. They can't keep any files locally and simply have to not rely on the systems for anything critical. Recently they had their computers taken away for 3 weeks (refurbishing offices), which was a terrible inconvenience, but it didn't bring work to a halt. Just made everyone's lives harder than they had to be.
Most union tradespeople I've encountered do actually take pride in doing their jobs right and well. You just have to realize that even the best ones won't generally work any harder than the work rules require them to.
My advice is to get to know any tradespeople you may have to deal with on a regular basis for things like electrical work, moving furniture, etc. It's amazing how far just treating them as fellow skilled professionals will get you. Resorting to bribery (aka "gifts") can also help. If you give the union electrician a bottle of nice scotch or a box of cigars when he adds some new circuts in the server room he is much more likely to come out at 3am on a Sunday morning when you need him NOW.
Happy Fun Ball is for external use only.
And how will you know if the backup network even works? Of course you could test it. But will it work under the kind of extreme live stress that would take down the primary network? And what if the issue is simply load that neither network can fully handle? Could you run both networks in tandem correctly? It sounds to me like the original problem was that the network was designed by someone who thinks of the switches as magical black boxes that will take care of everything ... someone who assumes perfect abstraction. That 3 million dollars to build a parallel network could, I think, be better spent by hiring competent people to build a correct network that includes redundancies structured in the right places. No matter what you do, there will be some single points of failure, such as the very logic used to switch over to the backup network if that's what you have (which would be a big waste if it sat there idle). The network engineering people need to know and understand those single points of failure and have plans to deal with failures at those points.
now we need to go OSS in diesel cars
802.1x? If they were running old CatOS code 802.1x packets from an XP box or other OS running Port Based Network Access Control could have killed the network. The MAC address used falls into the range that the switch thinks is Spanning Tree. It gets forwarded out of all ports, and the levels build up until the network grinds to a halt.
It sounds like they need to put a number of routers in and break the Spanning Tree domain into small chunks - and ensure they're running code that copes with 802.1x, or put in the known workarounds.
sounds like the typical cisco response....
Bullshit. John Halamka is exceptionally qualified. He has written Books named Real World UNIX and Best of CP/M.
While building an identical network is a nice idea, it's silly. Instead, start using WiFi. Also, compartmentalize this network, i.e., separate nodes so that if inventory ordering has a problem then personnel and radiology don't go down.
If a bad app comes up or a virus infestation occurs, have a duplicate server ready with the latest safe backup data. Also, have all clients off until a technician can make sure that each individual client is safe to bring back on. Start with mission critical systems, like radiology, patient records, etc.
The benefit of this approach is that first, you can set it up so that a client only connects when said connection is needed, not persistently.
Second, it's pretty easy to kill wireless access even against backdoors. There are no passwords, no need to unplug each server - all you do is cut power to each access point. Since it's compartmentalized, you may not even have to kill every system.
Third, you have an excuse to transition to WiFi, which, if you manually add another software layer of security, is a Killer App for hospitals, provided it doesn't have cell phone-like interference.
Furthermore, you can keep the ethernet up as a backup solution. Set up a separate honeypot on each to help keep records secure. The WiFi honeypot will prevent wardrivers, and the wire one will prevent malicious people from using the wired solutions in hidden locations - which will be plugged into the wall but disabled at the regular server level and in each client.
Obviously, this isn't practical for my local Northwest Medical Center which has 200 beds at the most, but for a large urban hospital this type of flexibility, simplicity and redundancy shouldn't be considered handy, it should be considered the rule, if not even the rule of law.
In fact, on Oct. 1, hospital officials had approved a consultant's plan to overhaul the network - just not quite in time.
They knew there was a problem, but as with anything, they decided to wait. Case Closed. Don't blame the engineers, blame the people who decided it wasn't important enough to overhaul.
-- MrMud
Just a prediction. After spending countless millions on this super backup network the same scenario will occur again. The now prepared admins will transfer the old network to the new, only to find the application that brought down the network works just as well on the new. Can anyone say resource limits? Why can one user put such a heavy load on the network that it brings down the network? Why can one network segment put such a heavy load on another segment as to bring down the entire network? And even more importantly, how will increasing the bandwidth, or adding a backup network, resolve this problem? Perhaps a better approach would be to look at methods of controlling bandwidth usage.
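One common way to "control bandwidth usage" per user or per port is a token bucket. The Python sketch below is only an illustration of the idea, not anything the hospital or its vendor is known to have used; the class name, the rate and burst numbers, and the explicit `now` argument (used so the example is deterministic) are all assumptions.

```python
# Minimal token-bucket sketch: each sender gets a sustained rate plus a
# bounded burst, so one misbehaving application cannot take everything.

class TokenBucket:
    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s      # sustained refill rate
        self.capacity = burst_bytes       # largest burst allowed
        self.tokens = burst_bytes         # start with a full bucket
        self.last = 0.0                   # time of the previous check

    def allow(self, size_bytes: float, now: float) -> bool:
        """Return True if a packet of size_bytes may be sent at time now."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate_bytes_per_s=1000, burst_bytes=1500)
    print(bucket.allow(1500, now=0.0))  # burst fits in the full bucket
    print(bucket.allow(1500, now=0.5))  # only 500 tokens refilled so far
    print(bucket.allow(1500, now=1.5))  # bucket has refilled to 1500
```

The point of the design is that the limit is enforced per sender, so a second identical network would not be needed just to survive one heavy user.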
I've been doing network implementations long enough to realize one very important thing. The less spanning tree in a network, the more stable the network. This is one of the reasons layer 3 switching has become so cheap. Most people just don't take the time to use it. The largest network I've built (over 12,000 ports) hasn't lost a day of uptime in the past two years because it's all layer 3! Admittedly, the network administration has a part to play in this in that most IT departments think they're being sold a layer 3 switch just so a vendor can sell a more expensive switch. In reality, layer 3 = stability.
... yeah I think it's a fabulous idea. STP would prevent loops and... oh... never mind.
This sounds like the same thing that has been going around, although the inability to recover is astounding.
Windows XP (Home and Professional) includes a feature called the "Network Bridge". Many people think this is nothing new, since NT could do IP forwarding (basic routing with RIP), but XP includes an 802.1d transparent bridge with spanning tree algorithm. This has been bringing down dorm nets, because a student with XP on a laptop, with ethernet and an 802.11b WiFi adapter, can easily and inadvertently create a bridge, and cause a bridge loop. Although XP supposedly includes support for the spanning tree algorithm, the number of problems out there suggests that it is either buggy, or the wireless access points don't support it properly.
IMHO, NO ONE needs a transparent bridge, certainly not as a default option when adding a second adapter through the "Network Setup Wizard". At the least, there should be a popup that says "Are you sure, this might bring down your campus network..."
If you manage Windows XP machines, your only recourse is to add the following registry key to your laptops to disable forwarding if a bridge gets configured:
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\BridgeMP]
"DisableForwarding"=dword:00000001
Slashdotters would already know that the technology's already there to have the failover for this. The problem has to do with planning and ongoing management.
Obviously during the planning stages of a network it needs to be decided whether this service is critical enough that we need to have some sort of failover. This needs to be revisited as a network grows and is depended on for more services.
I know that if this happened to my place of work then...
1. It'd be an IT failure, not a technology failure or a user error.
2. My CIO would be on the street. Period.
Sorry, technology be damned - our jobs are technology and production quality. The fact that "one user" could be the straw that breaks the network PROVES major flaws in their network design. Flaws that MUST have been well known by the CIO.
Of course, CIOs are busy, and they can't have all the answers. That's why CIOs should hire RESPONSIBLE PEOPLE and have THIRD PARTY ACCOUNTING of their critical systems - which, of course, includes their network design.
Stop pretending to be a CIO and be RESPONSIBLE. Stop blaming users. Stop blaming technology. Stop blaming your vendor's questionable firmware. Take responsibility and build quality in your network design.
Pretenders are EVERYWHERE. Even CIOs are pretenders.
John asked me to dispel a few misconceptions about the Caregroup/BIDMC network. He wrote and I quote:
It is cowardly, and a betrayal of whatever it means to be a Jew, to act as a white man
-James Baldwin
A. SMB
Meanwhile, the hospital was figuring out how to run at its usual pace without the 100,000 e-mails it usually sends a day.
So thats where they're doing all those penis enlargements!
"I'm tired of all this 'Aren't humanity great' bullshit. We're a virus with shoes" - Bill Hicks
Build a second parallel network because the network designers didn't know wtf they were doing? How are you going to fail over to this network? STP? (insert obnoxious chortle here)
10 bridged hops = big flat network = they needed layer 3 switching in the first place, ergo, the network was badly designed. The very fact that a root bridge STP reconverge occurred indicates a poorly framed implementation plan and obviously no backout plan.
Find somebody who knows what the hell they are doing and have them do a network audit.
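As a rough illustration of the "10 bridged hops = big flat network" complaint, a network audit could start by measuring how many bridges sit on the longest path through the layer 2 domain. The sketch below assumes the commonly cited 802.1D guidance that the default spanning tree timers are tuned for a diameter of at most seven bridges. The topology (a plain chain of ten switches named sw1..sw10) is hypothetical and is not the actual BIDMC layout.

```python
# Hypothetical audit helper: count bridges on the longest shortest path
# through a switched (layer 2) topology, using breadth-first search.
from collections import deque

MAX_RECOMMENDED_DIAMETER = 7  # assumption: default 802.1D timer tuning


def bridges_on_longest_path(links):
    """links is a list of (switch_a, switch_b) pairs; returns the largest
    number of bridges a frame crosses between any two switches."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    worst = 0
    for start in adjacency:
        bridges_seen = {start: 1}          # the start switch counts as one
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in bridges_seen:
                    bridges_seen[neighbour] = bridges_seen[node] + 1
                    queue.append(neighbour)
        worst = max(worst, max(bridges_seen.values()))
    return worst


if __name__ == "__main__":
    # Ten switches daisy-chained: sw1 - sw2 - ... - sw10
    chain = [(f"sw{i}", f"sw{i + 1}") for i in range(1, 10)]
    diameter = bridges_on_longest_path(chain)
    print(diameter, "bridges; over the limit:",
          diameter > MAX_RECOMMENDED_DIAMETER)
```

Breaking the domain up with routers or layer 3 switches shrinks each spanning tree instance, which is what brings this number back under the limit.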
Cisco Systems, the hospital's network provider...
There's always some error in calculs, in this case, the traindriver forgot to lace its shoes.
#include "coucou.h"
You would think that network redundancy would be something the net engineers would try to get done each year...I can only speak from my experience as a rural school telecom director...we are doing a complete campus rebuild and in the overall scheme the cost to add complete redundancy was maybe 4% of the overall cost...that is not much...I think it was both the engineers' and the administrators' fault...engineers should have pushed harder and the administrators should have got their heads out of their collective asses!!! bg
--|gillham|--
Do you know of any ASPs using this as their backend software?
OK, this is my first post, which I know demotes my relevance in this forum. What's more, I don't know much about the technical details of networks. But I am a physician, I was there, and I think I might be able to provide some perspective. I tried to read the large part of the posts made before writing this, and I hope that my post is not too irritating to the locals. So, here are some things to think about.
.5M) makes just under seven figures.
As at least one commenter suggested, the reason that the network wasn't better prepared is because the IT budget is woefully unfunded. You might be unaware of the ridiculously poor fiscal situation that most academic medical centers are in. Caught between increasing costs for diagnostic tests and pharmaceuticals, and decreasing reimbursements from the government (N.B. this post is in no way intended to take ANY position on government funding of health care), all hospitals are in an increasingly difficult position. When you throw in that academic centers will not turn away indigent / uninsured patients, and that there are ~40 million uninsured Americans, it is nigh impossible to even break even.
I am in a position to know personally just how underfunded the BIDMC IT department is. Without going into details, let me just say that BIDMC can barely replace 5-year-old desktop platforms. It is casually miraculous that BIDMC has been able to computerize laboratory reporting, medical records, and physician ordering, not to mention supply, billing, bed management, and dozens of other things we don't even think of when we think about running the hospital.
The IT department, led by John Halamka, has been turning straw into gold for years. Every year, they cut the IT budget, and every year, the computer system gets better. (Of course, I say this from an end-user perspective.) Don't let his MD and his emergency med training fool you, as it did one poster, into thinking that John is a duffer. If his description of the reason for the network crash, which I didn't understand, didn't convince you that he knows his stuff, let me add my bias. I know John personally. He works ridiculously long hours to keep his ship running. He is constantly on the lookout for ways to improve patient care with the computer network, and constantly soliciting the advice of parties in all specialties. To my knowledge, he holds the distinction of being the only CIO in history that the Hunter Group (a well-known health care consulting group) has not recommended firing.
So, if the CIO's so good, why did the network crash? I think that an agglomeration of your posters have already figured this one out. BIDMC already knew that the risk of network failure was increasing. For several months before the disaster, the IT department was upgrading network hardware and software as fast as their budget could allow. They were trying to prevent this, and their luck ran out before their ship came in.
Redundancy is expensive. New equipment is expensive. New software is expensive. Personnel are expensive. Look at the financials of BIDMC sometime. The hospital lost $26 million dollars this year, and that was considered a victory, because at the beginning of the year, BIDMC was projected to lose $40 million. The hospital hopes to be breaking even by the end of 2004, without compromising quality of patient care. All the prophylaxis that you've suggested "should have been there" needs to be taken into context with the larger financial picture. Should BIDMC have fired nurses to pay for routers? Cut back lab services to buy newer software?
So it happened. Next question: were patients endangered? Obviously I'm bound by all sorts of privacy concerns, but it's fair to say, probably, a little, but:
First of all, think about how the network impacts patient care. They mostly DO impact data retrieval, in the following ways:
* Getting diagnostic test results quickly, rather than having to call, or go to the lab
* Getting old patient data off of the online medical record, rather than waiting for the patient's old charts (ALL data is duplicated in the paper record)
* Entering patient care orders without handwriting them and making sure that a nurse sees them
But, they DON'T impact in the most important ways. The following things work without a network:
* The computers that monitor vital signs of sick patients and patients in operations.
* The computers inside emergency medical equipment such as defibrillators and respirators
Nobody drops dead instantly because of a network outage.
The network didn't crash all at once: it was up and down intermittently for about 24 hours. After several attempts to get the network running without shutting it down, they finally decided that they needed to shut the whole thing down and start it up again, piece by piece.
I'm not sure who got the idea that a "scramble" to restart paper ordering implies that BIDMC didn't have a plan in place. The hospital has paper backup systems prepared for everything. But you try to orchestrate a quick return to paper on dozens of inpatient wards, with a thousand patients, in short order. Good luck; that's a system involving hundreds of health care providers and separate physical locations. Suggesting that BIDMC ought to be able to throw the railroad switch and just do it easily is rather unrealistic.
That all being said, once BIDMC gave up on keeping the network up while fixing it, we had the whole hospital switched to paper in a matter of hours. This made the system slower and more error-prone, which is why we switched to computers in the first place!
In theory, such a situation could endanger patient care. Slower data retrieval, and the possibility of missing relevant data, could both cause medical errors and patient injury. But, lest you did not realize, medical errors and patient injury are part and parcel of daily healthcare. So many decisions are made on so many patients in a day, that errors happen all the time. Due to the multisystem nature of health care and the multiple levels of safeguards and error-checking, no patient injury happens for one reason alone.
Don't fool yourself into thinking that when the network's up, nothing ever goes wrong, and once the network's down, scores of patients are unjustly injured. Any difference would be incremental. In any theoretical particular case, it would be virtually impossible to prove that the network outage was the crucial component that caused the error.
Is BIDMC at fault? Well, if there were a snowstorm, would BIDMC be at fault if they didn't have enough snowplows on hand? If someone slipped on a banana peel, would BIDMC be at fault for not hiring enough janitors? If there were a fire, would BIDMC be at fault for not having appropriate fire safety?
Was BIDMC at fault? No more than for any other disaster; you can't be 100% prepared for everything, ever. Will they be sued? Probably. Will the suits be just? Probably not. Will they win? I hope not. Hopefully you agree with me.
And, by the way, the computerization of our hospital is multifaceted, and has taken place slowly over 6 years. It's not like we've had our current network in place for 6 years with no changes. Rather, it has grown geometrically with added functionalities as time goes by.
OK, let's end with some responses to comments that I think are informative, but which qualify as "personal agenda," so if you're not interested, you can stop reading here with my compliments.
First issue.
> Also, it is very common for doctors to reject
> any spending on IT because it will bring their
> 8 figure salaries down to 7 figures and that is
> totally unacceptable!!!
If you're going to pillory doctors, perhaps you should actually know what you're talking about. The average physician makes $180,000 a year. The most well-paid MD in Rochester, NY (a city of
Academic centers pay less than average; many grown-up MDs at my hospital don't even make 6 figures. Nobody who's in it for the money works at an academic center like BIDMC. These hospitals lose money, and every expense, yes, including doctor's salaries, suffers from it. Those who stay perceive intangible benefits beyond the monetary compensation.
Believe me, doctors are not cutting the IT budget to line their pockets. I can't speak for the administrators, some of whom are MDs and some of whom aren't, but BIDMC is a not-for-profit institution, and nobody is walking away with fat profits.
Next issue. Whoever suggested that we were unable to play Quake for 4 days: Probably you were just trying to be clever, but it's worth noting that we can't install software on any of the hospital computers.
And finally, whoever made fun of senior management for "running around like errand boys": Good for them! This was truly a crisis, and all hands pitched in to try to prevent any patients from being hurt. Laugh at them if you like; they could have stayed in their offices, but like the rest of us, they did whatever they could.
Hopefully you have found this informative. A disclaimer should not be necessary, but since it is, let me say that my opinions are in no way intended to reflect those of BIDMC, its administration or employees, the federal government, John Halamka, you, your dog, or anyone else other than me. Have a nice day.
John H. really appreciated your comments and asked that you give him a call. Hope you have e-mail notification turned on in your
It is cowardly, and a betrayal of whatever it means to be a Jew, to act as a white man
-James Baldwin
Thanks for the note. I do have e-mail notification turned on for replies to my comments, at a much lower threshold than +3. :)
My browsing at +3 is not an indication of my "faith in the moderation system" so much as it is an indication of my limited time. I can only afford to read these comments in so much depth. When I get really interested in a thread, I turn down the threshold to take a closer look.
I actually skimmed these comments at a threshold of 0. I did miss your posted corrections from John (sorry!), but that's only because there were >400 comments and I was moving pretty fast.
There is an eternal tradeoff between efficiency and fidelity. In medicine, we refer to the tradeoff between sensitivity (finding something important) and specificity (not finding something unimportant). It's kind of the same here, and I have chosen specificity over sensitivity.
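For anyone who wants the tradeoff as numbers, here is a small Python sketch of the two standard definitions. The counts are invented purely for illustration (a thread of 400 comments of which 40 are worth reading, browsed at a high threshold that shows 12 comments, 10 of them worthwhile); they are not measurements of this thread.

```python
# Standard definitions, with made-up counts for a comment-threshold analogy.

def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of the important items that were actually found."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of the unimportant items that were correctly skipped."""
    return true_neg / (true_neg + false_pos)


if __name__ == "__main__":
    # Hypothetical: 40 worthwhile comments, 360 not worthwhile.
    # High threshold shows 10 worthwhile and 2 not worthwhile comments.
    print("sensitivity:", sensitivity(true_pos=10, false_neg=30))
    print("specificity:", specificity(true_neg=358, false_pos=2))
```

With those made-up counts, browsing at a high threshold skips nearly everything unimportant but also misses most of what was worth reading, which is the choice described above.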
I'll contact John.
There is no simple answer. Chest radiographs for asbestos cases, even if only suspected asbestosis, have to be kept 30 years. The US Air Force keeps films for 5 years after the last year in which any film is taken. The community in which I worked kept films for 10 years. In Washington State one should keep films on children until they are 22, possibly longer depending upon how your lawyer interprets the state regulations and what the courts say. Some people think you should keep mammography films for the life of the patient or even a few years beyond.
Certainly any images involved in known litigation need to be kept till the case is settled.
The following quote is from page 4-27 of the MSCP Basic Disk Functions Manual, which is part of the UDA50 Programmers Doc Kit manuals:
As stated above, the host area of a disk is structured as a vector of
logical blocks. From a performance viewpoint, however, it is more
appropriate to view the host area as a four dimensional hyper-cube, the
four dimensions being cylinder, group, track, and sector.
. .
Referring to our hyper-cube analogy, the set of potentially accessible
blocks form a line parallel to the track axis. This line moves
parallel to the sector axis, wrapping around when it reaches the edge
of the hyper-cube.
- this post brought to you by the Automated Last Post Generator...