Hospital Brought Down by Networking Glitch

Problem was with an application, by Anonymous Coward · 2002-11-27 02:44 · Score: 5, Insightful

according to the coverage in the printed 11/25/02 Network World magazine I read yesterday. My immediate reaction was that this person who brought down the net using his research tool should not have been using a production network.

Large campus networks hosting extremely critical live applications may need to be subdivided by more than a switch, yes.

Re:Problem was with an application, by cryptowhore · 2002-11-27 02:51 · Score: 5, Insightful

Agreed, I work for a bank and we have several environments to work in, including multiple UAT, SIT, and Performance Testing Environments. Poor infrastructure management.

--
Happiness is a slider variable
Re:Problem was with an application, by sugrshack · 2002-11-27 02:52 · Score: 5, Interesting

that's a good initial assumption, however my experience with similar issues tells me that you can't pin all of this on one person.
Yes, this person should have been using an adhoc database (assuming one is set up), however access to various things like this tends to get tied up due to "odd" management practices.
realistically a backup network sounds good, however there are other ways around this... it could have been prevented with correct administration of the network itself; for instance, in Sybase systems, there are procedures set up to handle bottlenecks like this. (of course, I could be talking out of my a$$, as I'm one of those people without real access anyway... far from root... more like a leaf).

--
I can't believe it's not lard!
Re:Problem was with an application, by Anonymous Coward · 2002-11-27 02:56 · Score: 4, Insightful

So a researcher with a workstation isn't allowed to use the network to do his job? No, this stems from incompetence on the part of the network engineering team.
Re:Problem was with an application, by nolife · 2002-11-27 03:16 · Score: 5, Interesting

Not only that, but they gave the impression that no one had problems using the old paper method. They actually noted that at times the network was fine, but they decided to stick with the backup method until the issue was resolved, because switching back and forth whenever the network was working was harder. All in all, though, they made a point that no appointments were missed, no surgeries were cancelled, etc., meaning business was as usual but using a backup manual method.

I have not read Network World enough to form an impression of their style, is it watered down to favor advertisers and the general IT purchasing people or is it really a nuts and bolts down to earth mag?

--
Bad boys rape our young girls but Violet gives willingly.
Re:Problem was with an application, by aheath · 2002-11-27 08:45 · Score: 5, Informative

I contacted Dr. John D. Halamka to see if he could provide more detail on the network outage. Dr. Halamka is the chief information officer for CareGroup Health System, the parent company of the Beth Israel Deaconess medical center. His reply is as follows: "Here's the technical explanation for you. When TAC was first able to access and assess the network, we found the Layer 2 structure of the network to be unstable and out of specification with 802.1d standards. The management vlan (vlan 1) had in some locations 10 Layer 2 hops from root. The conservative default values for the Spanning Tree Protocol (STP) impose a maximum network diameter of seven. This means that two distinct bridges in the network should not be more than seven hops away from one another. Part of this restriction comes from the age field that Bridge Protocol Data Units (BPDUs) carry: when a BPDU is propagated from the root bridge towards the leaves of the tree, the age field is incremented each time it goes through a bridge. Eventually, when the age field of a BPDU goes beyond max age, it is discarded. Typically, this will occur if the root is too far away from some bridges of the network. This issue will impact convergence of the spanning tree. A major contributor to this STP issue was the PACS network and its connection to the CareGroup network. To eliminate its influence on the CareGroup network we isolated it with a Layer 3 boundary. All redundancy in the network was removed to ensure no STP loops were possible. Full connectivity was restored to remote devices and networks that were disconnected in troubleshooting efforts prior to TAC's involvement. Redundancy was returned between the core campus devices. Spanning Tree was stabilized and localized issues were pursued. Thanks for your support. CIO Magazine will devote the February issue to this event and Harvard Business School is doing a case study."
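To make the aging argument in the quote concrete, here is a toy sketch in Python. It is not a real 802.1D implementation: the switch names, the hop counts, and the `max_age` value passed in the example are illustrative assumptions, and the real protocol derives its diameter limit from several timers rather than a single counter.

```python
# Toy model of BPDU aging, not a real 802.1D implementation.
# Switch names, hop counts and the max_age used in the demo are assumptions.

MAX_DIAMETER = 7  # the conservative default diameter mentioned in the quote


def bpdu_survives(hops_from_root, max_age, age_per_hop=1):
    """Return True if a BPDU is still valid after crossing hops_from_root bridges."""
    age = 0
    for _ in range(hops_from_root):
        age += age_per_hop   # each bridge increments the age field
        if age > max_age:    # beyond max age: the BPDU is discarded
            return False
    return True


def switches_out_of_spec(hops_by_switch, max_diameter=MAX_DIAMETER):
    """List switches that sit more than max_diameter hops from the root."""
    return sorted(name for name, hops in hops_by_switch.items()
                  if hops > max_diameter)


if __name__ == "__main__":
    topology = {"core-1": 1, "closet-a": 5, "closet-b": 10}  # hypothetical
    print(switches_out_of_spec(topology))
    print(bpdu_survives(10, max_age=7))
```

In this simplified model a switch ten hops out never sees a valid BPDU from the root, which is roughly the failure mode the quote describes for the management VLAN.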
Re:Problem was with an application, by pyite · 2002-11-27 10:19 · Score: 3, Informative

Technically, hubs are faster than switches for N endpoints when N = 2. The reason is that hubs do not have to look at the frame being sent and either store-and-forward or cut-through like a switch does. The total number of possible collision pairs on a hub is N * (N - 1) / 2 (Gauss' formula for the sum of 1 to N - 1, coincidentally), where once again N is the number of endpoints. In a switch, your collision domain always has two endpoints, therefore your total possible collisions is 1, thus you get increased speed.
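A quick way to sanity-check the pair count is to enumerate the endpoint pairs directly. This is only a counting sketch in Python; the function names are made up for illustration and it says nothing about actual throughput.

```python
from itertools import combinations


def hub_collision_pairs(n):
    """Number of distinct endpoint pairs sharing one collision domain on a hub."""
    return n * (n - 1) // 2


def enumerate_pairs(n):
    """Explicitly list every pair of endpoints, numbered 0..n-1."""
    return list(combinations(range(n), 2))


if __name__ == "__main__":
    for n in (2, 4, 8, 24):
        # The closed form and the brute-force enumeration should agree.
        print(n, hub_collision_pairs(n), len(enumerate_pairs(n)))
```

For N = 2 the formula gives a single pair, which is the same collision domain size a switch port has.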

--
"Nature doesn't care how smart you are. You can still be wrong." - Richard Feynman

No. by Clue4All · 2002-11-27 02:45 · Score: 5, Interesting

do you think the answer to having a massive and unreliable network is to build a second identical network?

No, the answer is to fix what is broken. This might be a new concept to some people, but things don't break on their own. If you're doing network upgrades and something stops working, REVERT THE CHANGES AND FIGURE IT OUT. This is reckless and irresponsible behavior.

--

Is your browser retarded?

Re:No. by Anonymous Coward · 2002-11-27 03:01 · Score: 5, Informative

As an employee at BIDMC (the Beth Israel Deaconess Medical Center) I can tell you that they did not just install a parallel network. The first network was completely redesigned to be more stable and once it proved its stability, then a second redundant network was put in place to ensure that if the network ever became unstable again for any reason there was a backup that was known to work immediately instead of having to wait to fix the original again. Most of the housestaff at BIDMC were already familiar with the paper system as the transition to paperless had only occurred over the last two years and in stages. The real problem was obtaining lab and test results as these have been on computer for years.
Re:No. by barberio · 2002-11-27 03:02 · Score: 5, Insightful

The problem here is that it will take days, maybe weeks to do this. Hospitals want the data flowing *Now*.

So the answer is - Yes. In a situation where 100% uptime is demanded, the only solution is redundant systems.
Re:No. by ostiguy · 2002-11-27 03:27 · Score: 5, Insightful

If a network problem breaks down network 1, what is going to stop it from breaking network #2? If the problem was with the firmware in device#23a, the problem will reoccur on network 2 with device #23b

ostiguy
Re:No. by pubjames · 2002-11-27 03:38 · Score: 5, Interesting

I spoke to an electrician at our local hospital recently. He told me the hospital had three separate electricity systems - one connected to the national grid, one connected to an onsite generator which is running all the time, and a third connected to some kind of highly reliable battery system (sorry can't remember the details) for life support and operating theatres in case both the national grid and the on-site generator fail simultaneously.

If they have that level of redundancy for the electrics then I see no reason why they shouldn't for the network.
Re:No. by dirk · 2002-11-27 04:00 · Score: 3, Interesting

No, the answer is to fix what is broken. This might be a new concept to some people, but things don't break on their own. If you're doing network upgrades and something stops working, REVERT THE CHANGES AND FIGURE IT OUT. This is reckless and irresponsible behavior.

While in the short term the answer is to fix what is broken, they should have had an alternative network set up long ago. When you are dealing with something as important as a hospital, you should have redundancy for everything. That means true redundancy. There should be 2 T1 lines coming in from 2 different vendors from opposite directions if that is something that will endanger lives if it breaks. If something is truly mission critical, it should be redundant. If it is life-threatening critical, every single piece should be redundant.

--

"Information wants to be expensive" - Stewart Brand, the same guy who said "Information wants to be free"

Major American Bank Outage by MS_leases_my_soul · 2002-11-27 02:47 · Score: 5, Informative

A Bank in America [;)] had an outage back in 1998 where all their Stratacom went down for similar reasons. The Gateway/Network Engineering group had been saying for a couple years that we needed more redundancy but senior executives just saw the expenses and not the liability ... until every single Stratacom went down.

We had to rebuild the entire network ... it took a week. All non-critical traffic had to be cut-off as we pushed everything through the backup T1s and ISDN lines. It cost the bank MILLIONS of dollars.

Suddenly, that backup network was real cheap. They are now quite proud to tout their redundancy.

Re:Major American Bank Outage by passion · 2002-11-27 03:05 · Score: 3, Informative

If triple-redundancy is good enough for San Francisco's BART, and this "major bank", then why can't it be good enough for a hospital, where there are most likely many people on life support, or who need instant access to drug reactions, etc?

--
- passion

Leading question by Junks+Jerzey · 2002-11-27 02:48 · Score: 4, Insightful

do you think the answer to having a massive and unreliable network is to build a second identical network?

Am I the only person getting tired of story submitters using Slashdot to support their personal agendas?

Re:Leading question by enkidu55 · 2002-11-27 03:23 · Score: 4, Interesting

Isn't that the whole point in posting a story? To foster your own personal agendas? What would be the point in making a contribution to /. then if everything was vanilla in format and taste. You would think that the members of the /. community would feel a certain sense of pride knowing that their collective knowledge could help another business/community out with some free advice.

IMHO if you don't like it then stop reading the damn thing. It's just like TV... If you don't like the channel you're watching then turn it, or turn it off and do something else, but don't bitch because you don't like the content.

Re:Well! Woopsy! by Iamthefallen · 2002-11-27 02:49 · Score: 5, Funny

Yes, I believe we should rush to conclusions and blame it on foreign terrorists since there is nothing suggesting terrorism, and that just proves that they're extremely sneaky.

You may now begin to panic in an orderly fashion, thank you.

--
Wax-Museum Fire Results In Hundreds Of New Danny DeVito Statues

Hospital Systems by charnov · 2002-11-27 02:49 · Score: 4, Informative

I also used to work at a teaching hospital (Wishard for Indiana University) and I learned more there about networking and systems support than in years of college. I remember one day we found a still-used piece of thick-net (you know...old firehose). It was connecting the ambulance office's systems to the rest of the hospital. The rest of the hospital ran on DEC VAX clusters and terminals. To be fair, they have gotten much better (I don't work there anymore either), but this wasn't the first hospital network I had seen that truly terrified me, and it hasn't been the last.

--
[RIAA] says its concern is artists. That's true, in just the sense that a cattle rancher is concerned about its cattle.

Re:Hospital Systems by gorf · 2002-11-27 03:17 · Score: 5, Insightful

To be fair, they have gotten much better...

You seem to have forgotten to explain why they were worse.

If they are running thick ethernet and VAX machines, it is probably because nobody has looked at the system recently, presumably because it hasn't failed. This is how things should be.

...truly terrified me...

What terrifies me is that places like hospitals (where things really need to keep working) run systems which have only been around for a few years, and in that time proved themselves to be extremely unreliable, in general.

New features should not be added at the cost of stability, and this is what people seem to be doing all the time. People are perfectly capable of carrying on using paper, and should be trained and have a procedure to do so at a moment's notice. If the job is so complex that paper is simply not an option (this seems unlikely; even air traffic controllers can manage without computers), then computers should have a ridiculous amount of redundancy built in to them, something I've only heard of NASA even approaching.

A second (unreliable) network? by shrinkwrap · 2002-11-27 02:49 · Score: 4, Insightful

Or as was said in the movie "Contact" -

"Why buy one when you can buy two at twice the price?"

Disaster recovery by laughing_badger · 2002-11-27 02:50 · Score: 4, Interesting

do you think the answer to having a massive and unreliable network is to build a second identical network?

No. They did everything right. Falling back to paper and runners is the best they could do to safeguard patients' lives. An 'identical' network would be susceptible to the same failure modes as the primary.

That said, hopefully it wasn't really six years since they had run a disaster exercise where they pretended that the computers were unavailable...

--
Help children born unable to swallow - www.tofs.org.uk

Um.. by acehole · 2002-11-27 02:50 · Score: 4, Insightful

In six years they never thought to have a backup/redundant system in place in case of a failure like this?

Even the best networks will come unglued sooner or later. It's surprising to see that most businesses' networks need prime operating conditions to function properly.

--
Be you Admins? nay, we are but lusers!

Re:Um.. by Anonymous Coward · 2002-11-27 04:47 · Score: 5, Interesting

They're called "accountants". My father is a netadmin by trade, and the thing that stresses him most about his job is how, quote, "fucking bean counters" make the purchasing decisions for him.

Example: They want to replace Netware fileservers (they've something around four years uptime, and that's including them having their RAIDs expanded. All that's going to stop them is a man with a sledgehammer) with Windows ones. While Windows servers, if configured correctly, are really stable, they are not stable enough for truly mission-critical jobs (in this case, dealing with insurance companies and medical evacuation. Time is not just money, it's life) yet the idiots in charge have been suckered by Microsoft's marketing.

In this case, staying with netware has saved lives.

Accountants have too much control. They do not understand that if something is vital, you do NOT give it anything less than the very best money can buy. So it'll cut into your profit margins. So what? At least you will still have the margins.

2nd network by Rubbersoul · 2002-11-27 02:51 · Score: 4, Insightful

Yes I think having a 2nd network for a vital system is a good idea. This sort of thing is used all the time for things like fiber rings where you have the work and protect path. If the primary work path goes down (cut, maintenance, whatever) then you roll to the protect. Yes it is a bit more expensive but in a case like this maybe it is needed.

--
man .sig
No manual entry for .sig.

Re:Well! Woopsy! by hey! · 2002-11-27 02:52 · Score: 4, Interesting

I don't think that deliberate malicious action is a very likely cause. The article wasn't for technical folk, so it's anyone's guess; mine is that the network grew gradually to the point where it couldn't be restarted. You can always add a few nodes to a large network, but it isn't necessarily possible to start such a network from a dead stop. Probably a handful of well placed routers would have prevented this.

However, a network like this could be life-critical, and there probably should be contingencies for a variety of circumstances, including deliberate subversion.

--
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.

Re:That's why I hate automatic routing by parc · 2002-11-27 02:53 · Score: 3, Insightful

And your change in routing policy is going to affect spanning tree how?

How do you handle mobile users? What about dialup static IP addresses from multiple RAS devices?
Hand-editing of routing tables works only in the most simple of networks.

What is spanning tree protocol? (google whoring) by Anonymous Coward · 2002-11-27 02:53 · Score: 5, Informative

Spanning-Tree Protocol is a link management protocol that provides path redundancy while preventing undesirable loops in the network. For an Ethernet network to function properly, only one active path can exist between two stations.

Multiple active paths between stations cause loops in the network. If a loop exists in the network topology, the potential exists for duplication of messages. When loops occur, some switches see stations appear on both sides of the switch. This condition confuses the forwarding algorithm and allows duplicate frames to be forwarded.

To provide path redundancy, Spanning-Tree Protocol defines a tree that spans all switches in an extended network. Spanning-Tree Protocol forces certain redundant data paths into a standby (blocked) state. If one network segment in the Spanning-Tree Protocol becomes unreachable, or if Spanning-Tree Protocol costs change, the spanning-tree algorithm reconfigures the spanning-tree topology and reestablishes the link by activating the standby path.

Spanning-Tree Protocol operation is transparent to end stations, which are unaware whether they are connected to a single LAN segment or a switched LAN of multiple segments.

see this page for more info
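The blocking behaviour described above can be sketched with a small graph walk in Python. This is a deliberate simplification, not the real protocol: it assumes the root is simply the lowest-named switch (standing in for lowest bridge ID), treats every link as equal cost, and uses made-up switch names.

```python
from collections import deque


def spanning_tree(links, root=None):
    """Split links into (forwarding, blocked) via a breadth-first tree from root."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    if root is None:
        root = min(neighbours)  # stand-in for "lowest bridge ID wins"

    visited = {root}
    forwarding = set()
    queue = deque([root])
    while queue:
        switch = queue.popleft()
        for peer in sorted(neighbours[switch]):
            if peer not in visited:
                visited.add(peer)
                forwarding.add(frozenset((switch, peer)))
                queue.append(peer)

    # Any link not on the tree would form a loop, so it is put on standby.
    blocked = {frozenset(link) for link in links} - forwarding
    return forwarding, blocked


if __name__ == "__main__":
    triangle = [("A", "B"), ("B", "C"), ("A", "C")]  # hypothetical switches
    print(spanning_tree(triangle))
    # Simulate losing the A-C link: the formerly blocked B-C link is now used.
    print(spanning_tree([("A", "B"), ("B", "C")]))
```

With the triangle, the B-C link ends up blocked; remove A-C and rerunning the calculation puts B-C back into service, which mirrors the "activating the standby path" step in the description.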

Of course they need another network by virtual_mps · 2002-11-27 02:54 · Score: 5, Insightful

Why on earth would a researcher be plugged into the same network as time-sensitive patient information? Yes it's expensive, but critical functions should be separated from non-critical functions. And the critical network needs to be fairly rigidly controlled (i.e., no researchers should "accidentally" plug into it.) Note further information in http://www.nwfusion.com/news/2002/1125bethisrael.html

All Layer 2? by CatHerder · 2002-11-27 02:55 · Score: 5, Informative

If Spanning Tree is what brought them down, and it had campus wide effect, then they're running their production networks as one big flat layer 2 network. This is almost definitely the root of the problem. Modern network design would divide the campus (and often individual buildings) into multiple subnets, using routing to get between nets. That way if something like STP goes wrong in one spot, it doesn't affect the others.

Building a parallel identical net is almost definitely the wrong answer. Especially if it uses the same design and equipment!

Unfortunately, often older networks grow in a piecemeal way and end up like this, commonly having application level stuff that requires it to be flat. The job of a good network engineer (and diplomat) is to slowly have all the apps converted to being routable and then subnet the net.
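As a rough illustration of the "subnet the net" idea, the Python sketch below carves one address block into per-area subnets and checks whether two hosts would share a layer 2 segment or need to cross a router. The 10.20.0.0/16 block, the /24 size, and the area names are all assumptions made for the example, not details of the hospital's network.

```python
import ipaddress

# Hypothetical campus block and areas; none of this comes from the article.
campus = ipaddress.ip_network("10.20.0.0/16")
areas = ["east", "west", "research", "pacs"]

# Give each area its own /24 so a layer 2 problem stays inside one subnet.
subnets = dict(zip(areas, campus.subnets(new_prefix=24)))


def needs_routing(host_a, host_b, plan):
    """True if the two hosts sit in different subnets and must cross a router."""
    a = ipaddress.ip_address(host_a)
    b = ipaddress.ip_address(host_b)
    for net in plan.values():
        if a in net and b in net:
            return False  # same subnet, same layer 2 domain
    return True


if __name__ == "__main__":
    for name, net in subnets.items():
        print(name, net)
    print(needs_routing("10.20.0.5", "10.20.0.9", subnets))
    print(needs_routing("10.20.0.5", "10.20.3.9", subnets))
```

Traffic between the hypothetical "east" and "pacs" areas would go through a router here, which is the Layer 3 boundary that keeps a spanning tree problem in one area from spreading to the others.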

Re:That's why I hate automatic routing by Swannie · 2002-11-27 02:56 · Score: 5, Interesting

Routing has nothing to do with this, spanning tree is a layer two function, and is responsible for allowing multiple links and redundancy between switches in a network. A properly set-up network running properly set-up spanning tree works wonderfully. Unfortunately, many, many people play with things they don't understand (on a production network no less).

This whole situation arises from poor training and poor design. Having several friends that work in hospitals, I know that they typically don't offer a lot of money for IT/Network jobs, and this is what happens when underpaid (read: inexperienced) people are allowed to run such a network.

Done ranting now, can you tell I was laid off a while ago and now stuck in a contract with a network designed by a bunch of inexperienced people? :)

Swannie

--
:q!

Re:Spanning tree by GLX · 2002-11-27 02:57 · Score: 5, Interesting

This would imply that either:

A) A campus could afford to do Layer 3 at every closet switch

or

B) Live without Layer 2 redundancy back to the Layer 3 core.

I'm sure in a healthcare environment, neither is an option. The first is too expensive (unless you buy cheap, and hence unreliable equipment) and the second is too risky.

Spanning tree didn't cause the problem here. Mismanagement of spanning tree sounds like it caused the problem.

Spanning tree is our friend, when used properly.

--
Sig (appended to the end of comments you post, 120 chars)

Re:Spanning tree by TheMidget · 2002-11-27 02:58 · Score: 3, Insightful

I think the answer is to disable spanning tree.

On a network as complex and messy as theirs? That's basically the situation where you need spanning tree, or else it just crumbles to dust once they do produce a loop...

Complexity brings bugs by stevens · 2002-11-27 02:58 · Score: 5, Interesting

The network at my company is quickly becoming so complex that neither I nor the admins can troubleshoot it.

We have redundant everything -- firewalls, routers, load balancers, app servers, etc. The idea is to have half of everything offsite, so either the main site or the co-lo can go down, and we still rock.

But with all the zones and NATs and rules and routing oddities, the network is less reliable than before. It takes days for them to fix routing problems or firewall problems. Every little problem means we need three people troubleshooting it instead of one admin.

Developers suspect that there's a simpler way to do it all, but since we're not networking experts, it's just a suspicion.

Re:Reliability is inverse to the number of compone by Xugumad · 2002-11-27 02:59 · Score: 4, Insightful

However, the probability of both failing at the same time is:

0.1 * 0.1 = 0.01, i.e. 1%

So as long as it can run on just one out of two, you get a ten-fold increase in stability.
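The arithmetic only holds if the two copies fail independently. A small Python sketch makes that assumption explicit; the `p_common` parameter is an illustrative addition (not from the post) for a shared cause, such as the same firmware bug, that takes out every copy at once.

```python
def all_copies_fail(p_fail, copies=2, p_common=0.0):
    """Probability that every redundant copy is down at the same time.

    p_fail   -- chance that one copy fails on its own (assumed independent)
    copies   -- number of redundant copies
    p_common -- assumed chance of a shared cause that kills all copies
    """
    independent = p_fail ** copies
    return p_common + (1.0 - p_common) * independent


if __name__ == "__main__":
    single = 0.1
    print(all_copies_fail(single))                 # independent failures only
    print(all_copies_fail(single, p_common=0.05))  # with a shared failure mode
```

With purely independent failures the result is about 0.01, the ten-fold improvement described above; even a small shared failure mode eats most of that gain.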

Re:Spanning tree by AKnightCowboy · 2002-11-27 03:00 · Score: 3, Insightful

I think the answer is to disable spanning tree.

Are you talking about a different spanning tree protocol than I think you're talking about? Spanning tree is a very good thing to run to stop loops exactly like this. More than likely one of the hospital network techs misconfigured something and ended up disabling it (portfast on two access points linked into another switch accidentally or a rogue switch?).

My best hospital glitch by eaddict · 2002-11-27 03:01 · Score: 5, Informative

was a human error. We were a smallish hospital (270 beds). I was the new IS Manager. I was looking for power outlets in the computer room for all the new equipment I had ordered. Well, there were a lot of dead plugs. Also, I was told to stop since electricity based things like that were left up to the union guys. No big deal. I called them and asked them to locate and label the outlets under the raised floor. While I was sitting at my desk later that day the power went off for a sec then on.... I got up and looked toward the data center. The lights AND the equipment went off then on. I ran in to find the union guys flipping switches on the UPS (on/off). They had stuck a light bulb w/plug in each of the open outlets and were flicking the power on and off to see what bulb was affected. They were on the equipment side of the UPS! All of our servers, network gear, and such took hard downs that day! Ahhh!!! Who needs technology to make things not work! This was the same union that wrote me up for moving a cube wall to get at an outlet. Moving furniture was a union duty!

--
"If you are on fire you can just stop, drop, and roll. If you fall into Lava you are just dead." - my 5yr old daughter

Cisco implementation of Spanning Tree sucks by xaoslaad · 2002-11-27 03:01 · Score: 4, Interesting

I am not up to speed on spanning tree, but speaking with a coworker after reading this article it is my understanding that Cisco equipment runs a new instance of spanning tree each time a new VLAN is created. As you can imagine in such a large campus environment there can be many tens if not hundreds of VLANs. In a short time you turn your network into a spanning tree nightmare. I'd much rather use some nice Extreme Networks (or Foundry or whatever) Layer 3 switching equipment at the core and turn off spanning tree. Use tagged VLANs from the closets to the core and voila no need for spanning tree... Use cisco edge devices for WAN links. Building out a second rat's nest out of the same equipment seems foolish.

I'm not even sure how much Layer 3 switching equipment Cisco has; not much at all from my talking around in the past. It may not be possible to turn around and re-engineer it with the existing equipment; but I think that I would much rather throw out the vendor and reengineer the entire thing correctly before putting in a second shabby network.

I speak from having assisted on something like this on a very small campus environment (1,500 nodes maybe) and we basically tore out a disgusting mess of a LAN and implemented a fully switched, beautifully laid out network with redundant links to all closets, an 8 GB trunk between two buildings etc. in the span of one weekend. Obviously there was tons of planning involved, cabling run in preparation and so on, but what a fantastic move it was.

Sure there were hiccups Monday morning, but everything was perfectly fine by the end of the week.

Two wrongs don't make a right.

Re:Cisco implementation of Spanning Tree sucks by netwiz · 2002-11-27 03:20 · Score: 4, Informative

Cisco only runs per-VLAN spanning tree if you're using ISL as your trunking protocol. The reason you don't get it on Extreme Networks stuff is because they use 802.1q. In fact, Cisco devices trunking w/ the IEEE protocol run single instances, just like the Extreme product.

There are tradeoffs, of course. STP recalculations (when running) can be kind of intensive, and if you've got to run them for each of your 200 VLANs, it can take a while. However, for my particular environment, per-VLAN STP is a better solution.

The real problem by Enry · 2002-11-27 03:02 · Score: 4, Insightful

There was no central organization that handled the networking for the associated hospitals, so more networks just got bolted on until it couldn't handle the load.

So what are the lessons?

1) Make sure your solution scales, and be ready in case it doesn't.
2) Make sure some overall organization can control how networks get connected.

I don't buy it by hey! · 2002-11-27 03:02 · Score: 5, Insightful

The same explanation was floated in the Globe, but I don't buy it.

People when they are doing debugging tend to fasten onto some early hypothesis and work with it until proven definitively false. Even if jobs aren't on the line people often hold onto their first explanation too hard. When jobs are on the line nobody wants to say the assumptions they were working under for days were wrong, and some people will start looking for scapegoats.

The idea that one researcher was able to bring the network down doesn't pass the sniff test. If this researcher was able to swamp the entire campus network from a single workstation it would suggest to me bad design. The fact that the network did not recover on its own and could not be recovered quickly by direct intervention pretty much proves to me the design was faulty.

One thing I would agree with you is that the hospital probably needs a separate network for life critical information.

--
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.

Re:I don't buy it by DaveV1.0 · 2002-11-27 03:27 · Score: 5, Informative

Actually, if you read the article carefully, they say that the application the researcher was running was the straw that broke the camel's back.
"The crisis had nothing to do with the particular software the researcher was using."
"The large volume of data the researcher was uploading happened to be the last drop that made the network overflow. "
While it is never said directly, the implication is that the network was in bad shape to begin with, and when this guy started doing whatever he was doing, it just pushed things over the edge.

--
There is no "-1 offended" or "-1 you don't agree with me" mod options for a reason.

done right in the first place by wiredog · 2002-11-27 03:03 · Score: 3, Interesting

You've never worked in the Real World, have you? It is very rare for a network to be put in place, with everything attached in its final location, and then never ever upgraded until the entire thing is replaced.

In the Real World, where you can't shut everything down at upgrade time, a PDP-11 connected to terminals was put in 25 years ago. The PDP-11 was replaced with a VAX, which ran in parallel with the PDP-11 while it was brought online. A few years later a couple of PC's (running DOS 3.0) were hooked up to each other via a Novell network, which was connected to the VAX. Ten years ago the VAX was replaced with a few servers, which ran in parallel with the VAX until they were trusted. Along the way various hubs, switches, and routers were installed. And upgraded as the need arose. The cables were upgraded, also as the need arose, and not all at once.

--

Best Slashdot Co

Fix it the first way that works. by tomblackwell · 2002-11-27 03:03 · Score: 3, Insightful

If you have something that's broken, and you need its functionality soon, and don't have a fucking clue as to what's wrong with it, you might want to replace it.

It may not be the right way to do it, but they're running a hospital, and might not have the time to let their network people puzzle it out.

Re:the sad part by krinsh · 2002-11-27 03:04 · Score: 3, Insightful

While paper-based may seem like the best solution to you; what you don't realize is that paper-based is just a single phrase for the rest of these 'bases':

sneaker-based when everyone must run throughout passing paper;

warehouse-based when rows upon rows of storage are now required to keep all these bits of paper;

administrative overhead based when you realize that it takes two minimum-wage file clerks for every one form per desk - not functional area - to file and find and that takes a LOT of time;

and Mexican-based (yes, I said Mexican - who do you think most major businesses pay to do this? I know for a fact they ship things like this there by the truckload.) when you need cheap data entry and "error checking" [which is very unreliable when they can't read your language!] to enter information that could not be read from handwriting and then index them with a reasonable filing code.

Having spent a considerable amount of time as an admin assistant myself; and later as a document imaging and workflow support person, I can tell you that the cost and manpower savings far outweigh any perception or consideration for robustness or reliability.

The PHBs - or very likely the 'managed care' people (and that should have been put in quotes too) that provide a lot of the funding for the hospitals likely decided to save a few thousand since it wasn't lifesaving equipment or blood products/pharmaceuticals/etc.

--
I think with the interesting people, their lives can't possibly be wrapped up into a nice little package.

This assumes.. by nurb432 · 2002-11-27 03:07 · Score: 5, Informative

That it was a network upgrade, sometimes it's not, and you have no clue what was changed, by *someone else*...

As far as a parallel network, that's a tad overkill.. proper redundant pathways should be enough.. and plenty of packet filtering/shaping/monitoring.

and keep a tighter rein on what is allowed to be attached to the PRODUCTION network..

--
---- Booth was a patriot ----

YES- air traffic management experience... by mekkab · 2002-11-27 03:08 · Score: 5, Interesting

Yes. You do things in parallel and you make things redundant. You are fabricating reliability out of unreliable components, much as TCP does over IP.

Let's talk about real-time systems. No, not "Voice over IP" or "streaming video" crap, I mean REAL human grade real-time systems.

How do they get 99.99999% reliability? The components they use may be good, but nothing is that good! They get it by 1) removing single points of failure and 2) rigorously analyzing common mode failures (a sequence of failures that brings everything down).

How is this done? You put things in parallel. Machines are multi-homed. Critical applications are Hot-standby, as are their critical servers. You have the nightmare of constant Standby-Data Management (the Primary sending a copy of its every transaction to the secondary and to the tertiary) but when the power on one side goes out (of course your primary and standby are in different buildings connected to different power supplies, right?!) the secondary steps right up.
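To put rough numbers on the "put things in parallel" argument, here is a minimal sketch. It assumes every copy fails independently of the others, which is exactly the assumption that common mode failures break, and the 0.99 figure is made up for illustration.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components.

    Assumes failures are independent: the group is down only when
    every single copy is down at the same time.
    """
    if copies < 1:
        raise ValueError("need at least one copy")
    unavailability = 1.0 - component_availability
    return 1.0 - unavailability ** copies


# Hypothetical component that is up 99% of the time.
for n in (1, 2, 3):
    print(n, f"{parallel_availability(0.99, n):.6f}")
```

Each extra independent copy adds roughly two more nines here. A shared power feed or a shared bug removes that benefit, which is why the primary and standby go in different buildings.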

--
In the future, I would want to not be isolated from my friends in the Space Station.

Been there done that, got the ass beating by nt2UNIX · 2002-11-27 03:09 · Score: 3, Insightful

In a large switched network spanning tree can save your butt and burn it. We try to test our switch changes before they are implemented. ON A TEST NETWORK.

I know it's hard for everyone to believe, but vendors lie and those whiz bang network tools can screw you over.

We have several thousand users on our campus with several thousand computers. We run about a half a dozen 6500 series Cisco Switches. Spanning tree re-calculations take about a second or 2. This is no big deal. And your traffic is re-routed nicely when something goes wrong. But if an interface (which is an uplink into the other switches) is freaking out and going up or down, the whole network will grind to a halt with spanning tree.

Test Network GOOD (if you have the money).

The Solution by Shishak · 2002-11-27 03:09 · Score: 5, Insightful

Is to not bother with a second network. They need to break the spanning tree up a bit with some layer 3 routers. Sometimes it is fun to have a nice big layer 2 network. It makes life easy. It sucks to debug it when one half of a leg goes down and you get spanning-tree loops. The switches go down in a ball of flames that way.

The solution is to put some edge routers in every building (Cisco 6509's with MSFC cards). Segment each building into different IP networks. Route between the networks. That way you may lose a building if the spanning-tree goes futzed but you won't lose the whole campus.

Sure you'll be a touch slower routing between the segments but you'll have much more reliability.

--
Now I hope and pray that I will But today I am still, just a bill

Add a second network? Not likely to help by markwelch · 2002-11-27 03:11 · Score: 5, Insightful

> Do you think the answer to having an massive and unreliable network is to build a second identical network? <

Of course not. Two solutions are more obvious:

Fix or replace the existing network with a more reliable one (probably one that is less centralized so outages would not affect the entire campus); or
If a second network is going to be added to provide reliable backup, then the second network should certainly not use the same technology as the first.

A third, and somewhat obvious, solution would be to make sure that

crucial data is kept on the local server farm, but also copied in real time to a remote server; and
a backup access mode (such as a public dial-up internet connection, with strong password protection and encryption) is provided for access to either or both servers, in the event of a crippling "local" network outage.

This might also be a good reminder to get very aggressive "liquidated damages" clauses in contracts like this, or to buy insurance. If a patient dies because of the network outage, I am sure that everyone in the supply chain will be named in the lawsuit.

The liquidated damage clause is intended to provide an unambiguous motivation for the technology provider to fix the problem quickly, while the insurance would cover all or a portion of the losses if there is a failure.

I would be extremely surprised if a huge campus like this one did not have a substantial number of different technologies in use, including wireless, and clearly networking them all into the same patient-records database is a challenge.

--
-- http://www.MarkWelch.com/ Pleasanton California

I have the solution... by FleshWound · 2002-11-27 03:13 · Score: 4, Funny

I live in the Boston area, and I have the perfect solution: they should hire me. I'll make sure their network never fails.

Well, maybe not. But I still need a job... =)

Networks are fragile. by XPisthenewNT · 2002-11-27 03:14 · Score: 3, Interesting

I am an intern in a networking department where we use all Cisco stuff. Spanning tree and some other protocols are very scary because once one switch declares itself a server of a given protocol, other switches "fall for it" and believe the new switch over the router. Getting the network back is not as easy as turning off the offender, because the other switches are now set for a different switch server. Power outages are also very scary because if switches use any type of dynamic protocol, they have to come back up in the right order, which Murphy's Law seems to indicate would never happen.
Networks are fragile, I'm surprised there aren't more massive outages.
The answer might be to hire competent network staff, and perhaps train some other IT employees with basic knowledge to help in emergencies. A second network seems a little extreme, both cost and management wise.

KISS: Keep it simple, stupid!

Life threatening? by saider · 2002-11-27 03:17 · Score: 3, Insightful

I hope "The machine that goes ping" does not require the network to run. My guess is that much of that equipment is plugged into the red outlets and can run on its own for a fair amount of time. If it is hooked up to the network it is to report the machine status, which is independent of machine operation.

The network crash in question screwed up the document process, slowed everything down, and made life inconvenient, but I doubt anyone's life was at risk.

--

Remember, You are unique...just like everyone else.

Re:Life threatening? by benwb · 2002-11-27 03:41 · Score: 5, Insightful

Test results and labs come back on computer these days. More and more hospitals are moving to filmless radiology, where all images are delivered over the network. I don't know that much about this particular hospital, but I do know that hospitals en masse are rapidly approaching the point where a network outage is life threatening. This is not because the machine that goes ping is going to go off line, but because doctors won't have access to the diagnostic tools that they have now.

QoS and network boundaries by pangur · 2002-11-27 03:18 · Score: 5, Informative

There are several non-exclusive answers to the Beth Israel problem:

1) introduction of routed domains to separate groups of switches

2) ensure that more than one redundant switching loop does not terminate in a switch. I've had a single switch be the linchpin between two loops, had the switch go down and back up, and spanning-tree would not converge. If you want redundancy in your switches, spread out the loops.

3) Put QoS on the network. Identify mission-critical traffic and give it priority and guaranteed bandwidth (Cisco uses LLQ and CBWFQ using DiffServ, CoS, and IP precedence). That way even if someone puts loads of traffic on mission critical paths, the effect should be limited to the local switch port or router, depending how it is implemented.

4) lastly try a redundant network. You would still want QoS to stop a jabbering NIC from hosing your local bandwidth, and you might want to run diagnostics with your pocket PC or laptop, so you would still need to plug into that isolated net anyway. I would recommend that last due to cost, space, and connectivity issues.

Thank you.

It's HIPAA by mrneutron · 2002-11-27 03:19 · Score: 3, Informative

Health Insurance Portability and Accountability Act.

Most health care organizations are far from clueless, believe me. Your average healthcare IT manager is well aware of HIPAA, as they've been working on the transaction and privacy aspects for quite a while.

The techs in the trenches may know less, mostly because the data security regulations (the 3rd, and largest portion of the HIPAA work) are not yet finalized. The real work doesn't begin until then: probably sometime later this year.

Re:Spanning tree by stilwebm · 2002-11-27 03:21 · Score: 5, Interesting

I don't think disabling spanning tree would help at all, especially on a network with two campuses with redundant connections between buildings, etc. This is just the type of network spanning tree should help. But it sounds to me like they need to do some better subnetting and trunking, not necessarily using Layer 3 switches. They might consider hiring a network engineer with experience on similar campuses, even large university campuses, to help them redesign the underlying architecture. Spanning tree wasn't the problem; the architecture, and thus the way spanning tree was being used, was the problem.

I work at a teaching hospital... by pacsman · 2002-11-27 03:22 · Score: 5, Insightful

The network isn't too bad, but the incompetence of the people that run it astounds me. I've had large segments of it go out unnoticed by them because a UPS failed in a closet somewhere. Took them forever to track it down, too. In the end it's not the routers/switches that scare me, but the tons of old, outdated, unpatched Solaris machines that exist on this network. There are so many manufacturers out there that use crappy installations to run their MRI and CAT scanners that it terrifies me. It's really only a matter of time until all the doomsaying from me and my company (we're a third party vendor that supports a medical image archive) comes true. Unfortunately, I think it will collapse on us because the IS people will be unable to handle it.

Re:That's why I hate automatic routing by Swannie · 2002-11-27 03:25 · Score: 3, Interesting

Can you make a case why spanning tree is bad? Beyond "It's old", or "I've been burned before?" I've never, personally, heard a good argument as to why spanning tree is bad.

As for why it's good, it can provide layer two redundancy at a very small cost (basically the cost of an additional cable). While the same can be provided with a routed network, at layer 3, the cost is much higher, and a properly configured spanning tree based network will fail over very quickly and provide lots of trouble-free operation.

Beyond that, spanning tree can often protect people from themselves. What happens when that intern plugs a cable in the wrong place and creates a bridging loop? You guessed it, no spanning tree, no protection for bridging loops, and you can kiss all, or part (depending on the design) of your network good bye, oh and good luck finding that cable especially if it's a big place, don't think that intern is going to admit his error and get fired...

Swannie

--
:q!

Re:Sure it was STP? by jefftp · 2002-11-27 03:33 · Score: 4, Informative

The most common reason spanning tree problems occur is because no one tells the spanning tree domain who the root of the network is. This leads to the switches deciding among themselves who gets to be the root. In most implementations of spanning tree, with priorities left at their defaults, the lowest MAC address wins.
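The election described above can be sketched in a few lines. This is a simplified illustration, not switch firmware: the switch names and MAC addresses are hypothetical, and the only facts assumed are that 802.1D compares bridge priority first, then MAC address, and that the default priority is 32768.

```python
from typing import NamedTuple


class Bridge(NamedTuple):
    name: str
    priority: int  # 802.1D default is 32768; lower is preferred
    mac: str       # colon-separated hex, hypothetical values below


def bridge_id(bridge: Bridge) -> tuple:
    """Bridge ID as compared by spanning tree: priority first, then MAC."""
    return (bridge.priority, int(bridge.mac.replace(":", ""), 16))


def elect_root(bridges: list) -> Bridge:
    """The bridge with the numerically lowest bridge ID becomes root."""
    return min(bridges, key=bridge_id)


# Nobody configured anything: every switch has the default priority,
# so the lowest MAC address (here the closet switch) wins.
defaults = [
    Bridge("core-6509", 32768, "00:d0:ba:11:22:33"),
    Bridge("closet-switch", 32768, "00:10:7b:00:00:01"),
    Bridge("lab-switch", 32768, "00:50:54:aa:bb:cc"),
]
print(elect_root(defaults).name)

# Someone lowers the core's priority: now the root is chosen on purpose.
tuned = [defaults[0]._replace(priority=4096)] + defaults[1:]
print(elect_root(tuned).name)
```

The second case is the whole fix: give the switch you actually want as root a lower priority value than everything else.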

Because Cisco switches come with Spanning-Tree enabled by default, and because most network "engineers" don't know what spanning tree is, many corporate networks have a random switch serving as the root of the spanning tree. And so when spanning tree tries to do its job: fail-over to a redundant link, it doesn't do a very good job because the humans who set up the network were either lazy or ignorant.

Laziness and ignorance are the villains of most network problems.

Now if Cisco implemented the follow-up to spanning tree: rapid spanning tree protocol (802.1w) like the rest of the industry, you'd eliminate the problem of impatient network admins trying to "tune" their network convergence times. Sadly, at most, you're going to shave 8 seconds off the 30 to 50 seconds of convergence time of STP unless you have a very small network. So tuning STP timers is an exercise in navel-meditation. RSTP (802.1w) solves a lot of the convergence time problems with original STP (802.1d) and is nicely backwards compatible.

Re:Simple Answer by gorilla · 2002-11-27 03:43 · Score: 4, Interesting

Having worked in a hospital, I'll tell you that's not acceptable.

Medical records are probably the most sensitive records there are, and therefore it's essential that any access to them is both authenticated and audited. The first ensures that only authorized people can access them. The second ensures that in the event of misuse of the records, this can be detected - eg if someone who has authorization to access records decides to look up their neighbours without good reason.

Re:Spanning tree by Chanc_Gorkon · 2002-11-27 03:46 · Score: 4, Insightful

Egads no! Dedicated hardware designed for this is the only solution in this kind of case. A PC simply is not. You CAN'T use a hack in a hospital. You should not use a hack like this in a business either, but I understand if it's done this way. Hacks like this can become rather problematic once they're asked to grow. Also most PC's do not have redundancy in power supply and probably don't have a raid array (although I have seen a vpr Matrix machine at Best Buy with a raid array...Your standard adaptec type included in a lot of MB's now). If I were to do something similar, I would rather do something with AIX or if using Linux, using a server class machine. By the time you do that, you have already spent the money you'd spend on the dedicated stuff.

--

Gorkman

Contribution to causality responsibility by hey! · 2002-11-27 03:50 · Score: 5, Insightful

Suppose you have footbridge crossing a stream that takes heavy traffic. One day, it collapses with many people on it. One of the people on the bridge weighed 300 lb.

Would it be fair to say that the bridge collapsed because a 300 lb man was on it? It is completely clear that he contributed to the collapse of the bridge, in the sense that he contributed to the stresses on the structure. One might even say he is more responsible than a 100lb woman who was also on the structure at the time.

But, we'd generally expect that a footbridge be engineered to support a 300lb man. Or if not, to isolate the failure (e.g. the planks under him might fall out, but the bridge as a whole should not collapse). It's part of the designer's job to anticipate this.

I've done a lot of troubleshooting in my time, of networks and other systems. One thing I've learned is that in the case of failure you just can't fasten on one thing that is out of the ordinary. At any given time, in a big enough system, something's bound to be out of the ordinary. Even if you can trace, step by step, the propagation of a problem from a single anomalous event, it is the capacity of the system to propagate the problem that is the real issue, at least if you take a conservative, defensive stance in design.

--
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.

Fraternal Twins by SEWilco · 2002-11-27 03:51 · Score: 5, Interesting

I hope the "second redundant network" uses equipment by a different manufacturer and has at least one network technician whose primary duty is that network. That person's secondary duty should be to monitor the primary network and look for problems there. Someone in the primary network staff should have a secondary duty to monitor and check the backup network.

The ideal would be to actually use both networks, such as by using each on alternating weeks. This ensures that both networks can handle full normal operations and are both operational.

Re:Reliability is inverse to the number of compone by gorf · 2002-11-27 03:53 · Score: 4, Informative

No.

You can only multiply them together like you have done if the two variables are independent.

Here this is clearly not the case; if the networks are identical and one fails, it is more likely that the second will fail because the cause might be identical.

Re:friggin windoze users by b1t+r0t · 2002-11-27 04:02 · Score: 5, Funny

I'll let my doctor worry about curing what's wrong with my brain rather than dealing with high-order complex networking issues, thank you very much.

"Dammit, Jim, I'm a doctor, not a CCIE!"

--

--
"Open source is good." - Steve Jobs
"Open source is evil." - Microsoft

It's been coming for a long time by bolix · 2002-11-27 04:05 · Score: 5, Informative

I've consulted here. No not on the network design! Desktop staff - big hello to the much expanded Research Support team!

AFAIK the BI network has gradually evolved from the 60/70s and has included several massive growth spurts to incorporate the expansions, refits, windfalls etc. I once participated in an after hour Cisco cutover where we yanked connections and waited for the data to flow (IPX round/robin servers listing) to find the specific segments affected. Very much a live trial and error process.

I got the feeling no-one is completely certain where/how all the data flows especially in the older Research segments e.g. Dana Farber. In fact, I'm guessing this is where the failure originated. Heavy duty number crunching and spanning tree errors lead me to some sort of distributed unix process across network segments. I want to blame a certain notorious geek (Dr P's) unix and mac labs but in truth it could be any one of the overworked and underfunded labrats in any of the segments.

The wiring closets used to look way worse than any posted at the recent Register article. A single Cat 5 cable run to a data jack is sometimes split to host 2 connections: unfortunately as the Research areas are grant funded, this is still bloody cheaper than a hub/switch! There is probably still some LocalTalk cabling in some labs, coax runs to a DG and Novell serial connections with 1 or 2 Mac Classic and SE holdouts running DOS and DG terminal emulators!!!

The network team in the Hospital (2 afaik) coped with daily routing failures, buggy failovers, the crappy Novell IPX 802.3 implementation and servers around every corner. Those folks team with a great desktop staff to nursemaid outdated equipment into the 21st century. It stuns me to this day what a superior job these folks did and probably do. They certainly made my job easier.

I feel this could have happened any time and disaster has been averted one too many times before. Halamka and the exec staff owe these guys more than just a few column inches of chagrined praise.

Re:Contribution to causality responsibility by timeOday · 2002-11-27 04:09 · Score: 5, Informative

I agree, and let me refer you to a real life example. The USS Yorktown is that very famous Navy ship that was immobilized by a network outage. The whole thing was triggered by some seaman entering a 0 where he shouldn't have, so the Navy made some attempt to pin it on him. But it didn't fly. Operational errors like that are routine. It shouldn't have crashed the app. Having crashed the app, it shouldn't have taken down the whole network.

If one researcher sitting at his desk can take down the whole hospital system accidentally just by "overusing" the network, it's just a matter of time.

Mission Critical Networks 101 by rhoads · 2002-11-27 04:11 · Score: 5, Interesting

One of the fundamental concepts in building mission critical networks is what is referred to as "A/B Diversity" -- also sometimes called "salt and peppering". The idea is that you build two or more physically and logically separate network infrastructures and distribute the user population evenly across them. Thus, when a catastrophic failure occurs in one of the network "domains", the other will continue to function and business can continue in "degraded" mode.
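As a rough illustration of the "salt and peppering" idea, the sketch below deals users round-robin across separate domains and measures how many are hit when one domain dies. The function names, the user labels and the choice of four domains are assumptions made for the example, not a description of any real deployment.

```python
def assign_domains(users: list, domain_count: int = 4) -> dict:
    """Deal users out round-robin so neighbours land in different domains."""
    return {user: index % domain_count for index, user in enumerate(users)}


def affected_fraction(assignment: dict, failed_domain: int) -> float:
    """Share of users who lose service when one domain fails."""
    hit = sum(1 for domain in assignment.values() if domain == failed_domain)
    return hit / len(assignment)


# Hypothetical floor of 1,000 users listed in desk order.
users = [f"desk-{n}" for n in range(1000)]
assignment = assign_domains(users)
print(affected_fraction(assignment, failed_domain=2))
```

Because the assignment interleaves adjacent desks, a failed domain leaves working neighbours next to every affected user, which is what makes "degraded" mode usable.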

We have a long way to go before data networks reach the stability of, for example, the public telephone system. The modern reality is that these networks are susceptible to a host of trivial yet potentially catastrophic failure scenarios. Spanning Tree (STP) is a very unreliable protocol. It has the potential to fail under many conditions such as the presence of physical errors, very high load, or as a consequence of a bug in the OS of one or many network devices.

Broadcast storms will occur. ARP storms will occur. OS bugs will crop up. Facilities personnel will play jump rope with your cable plant.

These problems can be mitigated, but not eliminated, by good network design. Thus, in environments such as hospitals and banks, where the cost of network downtime is too great to bear, it is common practice to build one or several parallel infrastructures as I have described.

FUNNY NETWORK TRICKS

I used to be in charge of the NOC at a large investment bank in New York. One of our buildings had six floors each housing 1,000 equities traders -- and this was during the stock market boom. Network downtime was not tolerated during trading hours. Therefore, the building was divided into four separate network domains connected to each other, server farms, and the WAN/MAN environment via a layer-3 core.

-- One time a printer became wedged and proceeded to send out ARP requests at the rate of thousands per second. The flood of messages pegged the CPUs of the routers servicing that domain and brought network services to a halt. Time To Resolution: 20 minutes (proud to say) to deploy sniffer, identify offending host, and rip its cable out of the wall with extreme prejudice. % of building affected: 25.

-- Over the course of several months, the Novell/NT team progressively decommissioned Novell servers and replaced them with W2K servers. Unfortunately, nobody thought to turn off the Netware services in the roughly 1,000 printers deployed throughout the building. On one glorious day, the very last Netware server was decommissioned in a particular domain leaving the printers in that domain with no server to "attach" to. The resultant flood of SAP messages became so great that the Cisco routers could not service them in a timely manner and they became cached in memory. The routers would gradually run out of memory, spontaneously reboot, and repeat the cycle. Time To Resolution: ONE FULL DAY. % of building affected: 25. Number of hours spent in postmortem meetings: ~15.

-- On several occasions, Spanning Tree failed resulting in loss of network services for the affected domain. Time To Resolution: 15 minutes to identify problem and perform coordinated power cycle of Distribution switches. % of building affected: 25.

And the list of stories goes on. You get the point.

Counterexamples by hey! · 2002-11-27 04:14 · Score: 3, Interesting

As pointed out elsewhere, the key assumption is independence -- that breakdowns are like rolling dice. You have to consider the causes of the failure. Virtually every realistic scenario you can think of has a dependent aspect which links the possible failure of trains.

Here are some examples of the ways in which failures can occur that have implied linkages:

(1) Both trains are damaged by an earthquake.

(2) New instructions for routine maintenance were printed incorrectly (e.g. causing the mechanics to under torque a critical bolt).

(3) The firm has cut the maintenance budget and is neglecting routine maintenance.

(4) The train is sabotaged by disgruntled employees or terrorists.

(5) Fuel filters delivered by manufacturer are faulty or incorrectly manufactured.

(6) Design flaw means trains do not meet expected performance specifications.

In reality, failures tend to be linked rather than independent. You can't use simple multiplicative logic, you have to use Bayesian logic. P(B|A) ≠ P(B): the probability of B given A is different than the probability of B in the absence of any other information. The FAA and military know this. If an aircraft crashes, then all aircraft of the same type are typically grounded for a period while the problem is analyzed to eliminate some kind of systematic flaw.
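A small worked example of why the multiplication breaks down, under a deliberately simple assumed model: each of two systems can fail on its own with probability `p_own`, and a shared cause (same firmware bug, same bad maintenance sheet) takes both down with probability `p_common`. The numbers are invented for illustration.

```python
def p_single_fails(p_own: float, p_common: float) -> float:
    """P(A): one system is down, from the shared cause or its own fault."""
    return p_common + (1.0 - p_common) * p_own


def p_both_fail(p_own: float, p_common: float) -> float:
    """P(A and B): shared cause hits, or both fail on their own."""
    return p_common + (1.0 - p_common) * p_own ** 2


def p_fail_given_other_failed(p_own: float, p_common: float) -> float:
    """P(B|A) = P(A and B) / P(A)."""
    return p_both_fail(p_own, p_common) / p_single_fails(p_own, p_common)


p_own, p_common = 0.01, 0.005
naive = p_single_fails(p_own, p_common) ** 2   # pretends A and B are independent
actual = p_both_fail(p_own, p_common)
print(f"naive product : {naive:.6f}")
print(f"with linkage  : {actual:.6f}")
print(f"P(B)          : {p_single_fails(p_own, p_common):.4f}")
print(f"P(B|A)        : {p_fail_given_other_failed(p_own, p_common):.4f}")
```

Even a small shared cause dominates the joint failure probability, and once A has failed, B is far more likely to be down than its standalone rate suggests.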

--
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.

Downtime Procedures by Kraegar · 2002-11-27 04:17 · Score: 5, Insightful

Posting this kind of late, but it needs to be said.

I work at a hospital, on the networking side of things. It's a fairly large hospital, and we've got some pretty amazing tech here that runs this place. But BY LAW we have downtime procedures. ALL STAFF MUST KNOW THEM. We have practice sessions monthly in which staff uses downtime procedures (pen and paper) to ensure that if our network were to be completely lost, we could still help patients. It's the friggin law. Whoever fucked up and hadn't looked at downtime procedures in 6 years should be fired. That's just bullshit.

I don't know how that hospital was able to pass inspections.

I can top that! by Ashurbanipal · 2002-11-27 04:31 · Score: 5, Funny

There was an electrician named Joe at the place I used to work who was counting the days to retirement. He never did a lick of work he didn't absolutely have to, and he never cared if his work would last 24 hours after his retirement.

The NEC (National Electrical Code) was the first casualty of his attitude. But not the last!

I discovered that he carried a heavy-duty plug in his pocket with the two hot leads wired directly together. He called it his "pigtail".

When Joe needed to find what circuit breaker controlled an outlet, he jammed in the pigtail (with an audible *snap* of electric arc) and then calmly walked down to the nearest breaker box to see what had tripped.

You could tell he was working in a building because you'd see scientists running down the hallways tearing their hair and screaming "My research!!! My research!! Ten years of research ruined!!" as the voltage spikes destroyed their equipment...

A Case History by Baldrson · 2002-11-27 04:45 · Score: 3, Interesting

A major corporation wanted to go paperless. They had all sorts of IDEF graphs and stuff like that to go with. I was frightened for them and suggested that maybe a better route was to start by just going along the paper trails and, instead of transporting paper, transport physical digital media -- sneaker-net -- to workstations where digital images of the mail could be browsed. Then after they got that down they could put into place an ISDN network to the phone company which would allow them to go from sneaker-net to a network maintained by TPC. If TPC's ISDN support fell apart they could fall back to sneaker-net with physical digital media. Only after they had such a fail-safe "network" in place -- and deliberately fell back on it periodically and randomly to make it robust -- would the IDEF graphs start being generated from the actual flow of images/documents. By then of course there would be a general attitude toward networks and computers that is quite different from that of the culture that typically surrounds going paperless.

Unfortunately more 'radical' minds prevailed and the project was eventually abandoned after $100M.

--
Seastead this.

Why not fix spanning tree? by m1a1 · 2002-11-27 04:49 · Score: 3, Insightful

If the problem is with spanning tree protocol then they already have redundant connections in place (or they wouldn't need spanning tree). From my experience spanning tree works really well on its own, and is even a little robust to people fucking with it. So the question is, why not deny everyone access to the switches and routers except for one or two administrators. It sounds to me like if they kept people from screwing with the network it would be fine.

Interesting response by jhines · 2002-11-27 05:02 · Score: 3, Insightful

That this happened in a teaching hospital, rather than a large corporation, makes their response much different.

They have been open about the problem, in a way that a for profit corporation could never be. This allows the rest of the world to learn from the experience.

Re:Spanning tree by jroysdon · 2002-11-27 05:18 · Score: 3, Informative

Disabling spanning tree on a network of any size is suicide waiting to happen. Without spanning tree you'll be instantly paralyzed by any layer two loops.

For instance: Bonehead user wants to connect 2-3 more PCs at his desk, so he brings in a cheap hub or switch. Say it doesn't work for whatever reason, so he leaves the cable in and connects a second port from the wall (or say later on it stops working so he connects a second port to test). When both of those ports go active and you don't have spanning tree, you've just created a nice loop for that little hub or switch to melt your network. Just be glad it's going to be a cheap piece of hardware and not a large switch, or you'd never be able to even get into your production switches using a console connection until you find the connection and disable it (ask me how I know). How long does this take to occur? Not even a second.
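To see why a layer two loop is fatal, here is a toy flooding simulation. It is a sketch under loud assumptions: three hypothetical switches A, B and C wired in a triangle, every switch re-floods a broadcast out of every port except the one it arrived on, and the "spanning tree" is just a breadth-first tree from a chosen root rather than the real 802.1D algorithm. Ethernet frames carry no hop limit, so nothing ever expires the copies.

```python
from collections import deque


def build_neighbors(links):
    """Map each switch to the set of switches it is cabled to."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors


def spanning_tree(links, root):
    """Keep only the links of a breadth-first tree; the rest are 'blocked'."""
    neighbors = build_neighbors(links)
    seen, kept, queue = {root}, [], deque([root])
    while queue:
        node = queue.popleft()
        for peer in sorted(neighbors.get(node, ())):
            if peer not in seen:
                seen.add(peer)
                kept.append((node, peer))
                queue.append(peer)
    return kept


def frames_in_flight(links, source, hops):
    """Flood one broadcast from `source`; count live copies after each hop."""
    neighbors = build_neighbors(links)
    in_flight = [(source, peer) for peer in neighbors.get(source, ())]
    counts = [len(in_flight)]
    for _ in range(hops):
        in_flight = [(here, nxt)
                     for came_from, here in in_flight
                     for nxt in neighbors.get(here, ())
                     if nxt != came_from]   # never send back out the arrival port
        counts.append(len(in_flight))
    return counts


triangle = [("A", "B"), ("B", "C"), ("C", "A")]
print("loop open   :", frames_in_flight(triangle, "A", 5))
print("loop blocked:", frames_in_flight(spanning_tree(triangle, "A"), "A", 5))
```

With the loop open, the copies of a single broadcast never die out; with one redundant link blocked, the flood ends after the first hop. Real traffic has many broadcasts per second, and denser meshes multiply the copies instead of merely sustaining them.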

Spanning tree is your friend. If you're a network technician/engineer, learn how to use it. Learn how to use root guard to protect your infrastructure from rogue switches (or even evil end-users running "tools"). A simple search on "root guard" at Cisco.com returns plenty of useful hits.

At my present employer, we're actually overly strict and limit each port to a single MAC address and know what every MAC address in any company hardware is. We know where every port on our switches go to patch panels. If anything "extra" is connected, or a PC is moved, we're paged. If a printer is even disconnected, we're paged. The end-users know this, and they know to contact IT before trying to move anything.

Why do we do this? We've had users bring in wireless access points and hide them under their desks/cubes. We want to know instantly if someone is breaching security or opening us up to such a thing. Before wireless, I'd say this was overly anal, but now, it's pretty much a requirement. An added benefit is knowing if an end-user brings a personal PC from home, etc., onto the network (which means they possibly don't have updated MS-IE, virus scanners/patterns, may have "hacking tools", etc.). This isn't feasible on a student network or many other rapidly changing networks, but on a stable production network it's a very good idea. Overhead seems high at first, but it's the same as having to go patch a port to a switch for a new user - you just document the MAC address and enable port-level security on the switch port:

interface FastEthernet0/1
 port security action trap
 port sec max-mac-count

With Syslogging enabled, you'll know when this occurs and if you've got expect scripts to monitor and page you when another mac address is used on that port, and if you've got your network well documented, you can stop by the end-user while they're still trying to dink around hooking up their laptop and catch 'em in the act.

Yes, I know all about MAC address spoofing. Do my end-users? Probably not, and by the time they find out, they're on my "watch list" and their manager knows. Of course, that's where internal IDS is needed and things start to get much more complicated, but at least you're not getting flooded with odd-ball IDS reports if you manage your desktops tight so users can't install any ol' app they want. Higher upfront maintenance cost? Perhaps, but we've never had any end-user caused network issue.

I'm fairly certain that if someone was running a "bad" application like what hosed the network in this story, I'd find it in under 30 minutes with our current network documentation. Would it require a lot of foot traffic? Yes, as the network would possibly be hosed so management protocols wouldn't work, but I could isolate it fairly fast with console connections and manually pulling uplink ports.

And on an unrelated note... by Radical+Rad · 2002-11-27 05:42 · Score: 3, Funny

Mail any lucrative^h^h^h^h^h^h^h^h^h job offers to:

Former MIS Director,
Beth Israel Deaconess hospital
Boston, MA 02215

WRONG!: Re:Problem was with an application, by fanatic · 2002-11-27 05:51 · Score: 5, Informative

No application can cause a spanning tree loop. It is simply impossible.

A spanning tree loop causes broadcast frames - correctly used in small numbers in many different circumstances - to loop endlessly about the network (clogging it up), using paths that are provided for redundancy, but which are normally stopped from passing traffic by the "spanning tree protocol".

There are 2 likely causes:

Unidirectional link failure. If a connection between switches passes traffic in only one direction (normally they are bi-directional), then spanning tree can be 'fooled' into allowing traffic on a path that creates a loop and lets frames loop endlessly.

Misconfiguration of switches, possibly combined with erroneous cabling. If spanning tree is configured off on a port (or, maybe, put into a mode called portfast), it's possible for interconnection of switch ports (through a crossover cable or other means) to cause this to occur.

A third possible cause is that the spanning tree software itself screws up and allows a loop when it shouldn't have. This was known to occasionally happen in Cisco switches some years ago. I haven't heard of it lately.

This all happens way below the application layer. Unless the application is specifically written to send huge numbers of broadcast frames (there is no legitimate reason for an app to do this), it couldn't bring down the network. And even if it did, this would not be a 'spanning tree loop' and disconnecting the offending station would immediately fix the problem.

Probably, the network should be using routers to partition it into smaller LANs. But this can still happen to any single LAN so created, and if it happens to the one your servers are on, you're still cooked.
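To make the looping concrete, here is a toy sketch in Python (an illustration, not a model of any real switch). It assumes three hypothetical switches A, B and C, each of which floods a broadcast out every link except the one it arrived on. The `max_hops` cap exists only so the simulation terminates; real Ethernet frames carry no such counter.

```python
from collections import deque


def flood(links, start, max_hops=50):
    """Count frame transmissions when one broadcast is flooded from `start`.

    Each switch re-sends a received frame on every link except the one it
    came in on. Nothing in the frame limits its lifetime, so in a looped
    topology only the artificial `max_hops` cap stops the copies.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Each queue entry is one frame in flight: (sender, receiver, hop number).
    queue = deque((start, nbr, 1) for nbr in sorted(neighbors.get(start, ())))
    sent = 0
    while queue:
        src, dst, hops = queue.popleft()
        sent += 1
        if hops >= max_hops:
            continue  # artificial cutoff, for the simulation only
        for nxt in sorted(neighbors[dst]):
            if nxt != src:
                queue.append((dst, nxt, hops + 1))
    return sent


# Redundant triangle with no link blocked: the frame circulates until the cap.
looped = [("A", "B"), ("B", "C"), ("C", "A")]
# Same switches with the C-A link blocked, as spanning tree would do.
blocked = [("A", "B"), ("B", "C")]

print(flood(looped, "A"))   # grows with max_hops
print(flood(blocked, "A"))  # dies out after reaching every switch once
```

With the redundant link blocked the broadcast is sent twice and stops; with the loop open, the count is bounded only by the cap, which is the "clogging" described above.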

--
"that's not encryption - it's a new perl script that I'm working on..." - from some Matrix parody

Re:WRONG!: Re:Problem was with an application, by khafre · 2002-11-27 06:56 · Score: 4, Informative

Actually, it is possible for an application to cause Spanning Tree to fail. Most switches have a management port that allows remote access (via telnet, ssh, SNMP, etc.) to the switch. This management port is normally connected to its own VLAN isolated behind a router so user broadcasts & multicasts in another VLAN can't affect the switch CPU. This port can be overrun with broadcasts and multicasts from user applications, provided both the user and the switch are on the same VLAN. If the switch CPU is consumed by processing broadcasts, it may not have enough CPU time available to process and forward spanning tree BPDUs. If a blocked port becomes opened, a switch loop could form and, BINGO, network meltdown.
Re:WRONG!: Re:Problem was with an application, by Anonymous Coward · 2002-11-27 10:56 · Score: 4, Informative

Third possibility - and what I'd be confident is the initial cause.

The amount of traffic the researcher was putting onto the network caused spanning tree hello BPDUs to be dropped.

After a period of not receiving hello messages (20 seconds if memory serves), downstream devices believe the upstream device has failed, and decide to re-converge the spanning tree.

During this re-convergence, the network can become partitioned. It is preferable to partition the network to prevent loops in the layer 2 infrastructure. Datalink layer frames, e.g. Ethernet, don't have a hop count, so they will loop endlessly - potentially causing further failures of the spanning tree protocol.

Once the bulk traffic source is removed from the network, STP should stabilise within a fairly short period - 5 minutes or so - so there may also have been a bug in Cisco's IOS, which was triggered by this STP event.

Alternatively, the network admins may have played with traffic priorities, causing this researcher's traffic to have a higher priority than STP messages, causing the STP to fail.

Radia Perlman has a good description of STP in her book "Interconnections, 2nd ed" - but then she should - she invented it.
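The timeout described in this comment can be sketched as below. This is a simplified illustration only: it assumes a BPDU was heard at t=0, uses the 20-second figure quoted above as `MAX_AGE`, and takes a made-up list of arrival times. A real bridge's state machine does considerably more than this.

```python
from typing import Iterable, Optional

MAX_AGE = 20.0  # seconds; the figure quoted in the comment above


def first_timeout(received_at: Iterable[float],
                  observe_until: float,
                  max_age: float = MAX_AGE) -> Optional[float]:
    """Return when max_age first elapses with no BPDU heard, or None if it never does.

    `received_at` holds the times (in seconds) at which BPDUs actually
    arrived; dropped BPDUs are simply absent from the list.
    """
    last = 0.0  # assumption: a BPDU was heard at t=0
    for t in sorted(received_at):
        if t > observe_until:
            break
        if t - last > max_age:
            return last + max_age  # information aged out before this BPDU arrived
        last = t
    if observe_until - last > max_age:
        return last + max_age
    return None


# Healthy link: a hello every 2 seconds for a minute.
healthy = [float(t) for t in range(2, 61, 2)]
# Congested link: hellos stop arriving between t=10 and t=40.
congested = [2.0, 4.0, 6.0, 8.0, 10.0, 40.0, 42.0]

print(first_timeout(healthy, 60.0))
print(first_timeout(congested, 60.0))
```

In the congested case the downstream switch gives up on its upstream neighbour partway through the gap, even though that neighbour never actually failed; that is the point at which re-convergence would start.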

A common logical fallacy... by The+Ape+With+No+Name · 2002-11-27 06:25 · Score: 3, Insightful

... And one that is hard to argue with because it seems to make so much sense is post hoc, ergo propter hoc. For something to be a valid proposition, it must meet two conditions, necessity and sufficiency. When someone pulls an "It happened after that happened" trick to pin blame, they are meeting the necessary condition with the apparent causal relation of actions. This is the stronger condition intuitively for people. But under the sufficient condition, we must show that there is evidence to support the causal relationship. Supporting a claim is counterintuitive. Just ask any foreign policy maker in the US...

--
Comparing it to Windows will be a moot point, since El Dorado is going to have a 40% larger code base than XP.

These guys got off easy! by raehl · 2002-11-27 07:44 · Score: 3, Funny

The last time I had a problem with a spanning tree algorithm I lost 12 points on my CS final!

Ok, so seriously, I'd be embarrassed if I screwed up a spanning tree algorithm on a test. If it took Cisco engineers 6 days to fix it, it musta been something really quirky, most likely the software not configuring something right. I can't imagine an application problem that would hose a network past a power toggle.

--
paintball

Sure, and while we're at it!! by cybercomm · 2002-11-27 08:00 · Score: 3, Funny

Why not buy M$ wireless 802.11b, install W2K/XP on every computer, and set up an MS Exchange server? Who needs BSD when you have M$ :)

just kiddin', the uptime of the above-mentioned network would be measured in nanoseconds, and then they would have to switch to the MS paper'n'pen method

--
Live for the present, learn from the past, and dream of the future!

It's all about the Benjamins by sjbe · 2002-11-27 08:01 · Score: 5, Insightful

My wife is a doctor. From what I've observed hospitals tend to be penny wise and pound foolish, particularly with regard to their computer systems. Largely for financial reasons they are generally unwilling to hire the IT professionals and spend the $ they need to do the job right.

The computer systems at my wife's medical school were apparently run by a herd of poorly trained monkeys. Systems would crash constantly, admin policies were absurd, and very little was done to fix anything. At her current hospital, the residents in her department are stuck with machines that literally crash 10+ times daily. Nothing is done to fix them because that would take expertise, time and $, all of which are either in short supply or withheld.

Hospitals really need serious IT help and it is a very serious problem. This article just illustrates how pathetically badly they do the job right now. I wish I could say I was surprised by this but I'm not.

Executives working? by wandernotlost · 2002-11-27 08:23 · Score: 3, Funny

Senior executives were reduced to errand runners as the hospital struggled with moving information around the campus.

It's always nice to see those people doing useful work for a change.

Union "help" by ces · 2002-11-27 09:45 · Score: 3, Insightful

Most union tradespeople I've encountered do actually take pride in doing their jobs right and well. You just have to realize that even the best ones won't generally work any harder than the work rules require them to.

My advice is to get to know any tradespeople you may have to deal with on a regular basis for things like electrical work, moving furniture, etc. It's amazing how far just treating them as fellow skilled professionals will get you. Resorting to bribery (aka "gifts") can also help. If you give the union electrician a bottle of nice scotch or a box of cigars when he adds some new circuits in the server room, he is much more likely to come out at 3am on a Sunday morning when you need him NOW.

--
Happy Fun Ball is for external use only.
