Slashdot Mirror


Hospital Brought Down by Networking Glitch

hey! writes "The Boston Globe reports that Beth Israel Deaconess hospital suffered a major network outage due to a problem with spanning tree protocol. Staff had to scramble to find old paper forms that hadn't been used in six years so they could transfer vital patient records and prescriptions. Senior executives were reduced to errand runners as the hospital struggled with moving information around the campus. People who have never visited Boston's Medical Area might not appreciate the magnitude of this disaster: these teaching hospitals are huge, with campuses and staff comparable to a small college, and many, many computers. The outage lasted for days, despite Cisco engineers from around the region rushing to the hospital's aid. Although the article is short on details, the long term solution proposed apparently is to build a complete parallel network. Slashdot network engineers (armchair and professional): do you think the answer to having a massive and unreliable network is to build a second identical network?"

569 comments

  1. Well! Woopsy! by uberred · · Score: 1, Interesting

    This is almost too good... could someone have hacked in to their network and deliberately taken it down?

    --
    Time is an illusion, lunchtime doubly so. --Ford Prefect
    1. Re:Well! Woopsy! by Iamthefallen · · Score: 5, Funny

      Yes, I believe we should rush to conclusions and blame it on foreign terrorists since there is nothing suggesting terrorism, and that just proves that they're extremely sneaky.

      You may now begin to panic in an orderly fashion, thank you.

      --
      Wax-Museum Fire Results In Hundreds Of New Danny DeVito Statues
    2. Re:Well! Woopsy! by hey! · · Score: 4, Interesting

      I don't think that deliberate malicious action is a very likely cause. The article wasn't for technical folk, so it's anyone's guess; mine is that the network grew gradually to the point where it couldn't be restarted. You can always add a few nodes to a large network, but it isn't necessarily possible to start such a network from a dead stop. Probably a handful of well placed routers would have prevented this.

      However, a network like this could be life-critical, and there probably should be contingencies for a variety of circumstances, including deliberate subversion.

      --
      Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
    3. Re:Well! Woopsy! by Ken+Dods'+dad's+dog' · · Score: 2, Interesting

      I have seen this happen before in an organisation I have worked for. It happened when a second Cisco network (installed by a large well known company) was joined to an existing one and the routing protocol problems of the new network corrupted the existing one. Solution was to disconnect the two and force the external company to rebuild the new network from scratch.

    4. Re:Well! Woopsy! by skolya · · Score: 1

      Probably a pissed-off medical transcriptionist, like I used to be.... Been there, done that, don't even have a T-shirt to show for it.

  2. -1 leading questions (n/t) by Karamchand · · Score: 0

    i said n/t

    1. Re:-1 leading questions (n/t) by Anonymous Coward · · Score: 0

      i said n/t

      Fucking retard. What do you think this is, AOL EZ-Board on Teenchat.com?

  3. Problem was with an application, by Anonymous Coward · · Score: 5, Insightful

    according to the coverage in the printed 11/25/02 Network World magazine I read yesterday. My immediate reaction was that this person who brought down the net using his research tool should not have been using a production network.

    Large campus networks hosting extremely critical live applications may need to be subdivided by more than a switch, yes.

    1. Re:Problem was with an application, by cryptowhore · · Score: 5, Insightful

      Agreed, I work for a bank and we have several environments to work in, including multiple UAT, SIT, and Performance Testing Environments. Poor infrastructure management.

      --
      Happiness is a slider variable
    2. Re:Problem was with an application, by sugrshack · · Score: 5, Interesting
      that's a good initial assumption, however my experience with similar issues tells me that you can't pin all of this on one person.

      Yes, this person should have been using an adhoc database (assuming one is set up), however access to various things like this tends to get tied up due to "odd" management practices.

      realistically a backup network sounds good, however there are other ways around this... it could have been prevented with correct administration of the network itself; for instance, in Sybase systems, there are procedures set up to handle bottlenecks like this. (of course, I could be talking out of my a$$, as I'm one of those people without real access anyway... far from root... more like a leaf).

      --
      I can't believe it's not lard!
    3. Re:Problem was with an application, by Anonymous Coward · · Score: 4, Insightful

      So a researcher with a workstation isn't allowed to use the network to do his job? No, this stems from incompetence on the part of the network engineering team.

    4. Re:Problem was with an application, by GoofyBoy · · Score: 2

      How could one application, which they could shutdown/control, take down an entire network?

      I admit I'm mostly clueless when it comes to network hardware but shouldn't a massive reset/buffer clear have returned the network to a working state? Am I missing something here?

      --
      The surprise isn't how often we make bad choices; the surprise is how seldom they defeat us.
    5. Re:Problem was with an application, by spudnic · · Score: 1

      It didn't. From the article:

      > The crisis had nothing to do with the particular software the researcher was using. The problem had to do with a system called ''spanning tree protocol,'' which finds the most efficient way to move information through the network and blocks alternate routes to prevent data from getting stuck in a loop. The large volume of data the researcher was uploading happened to be the last drop that made the network overflow.
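
      The article's description of spanning tree (keep one path, block the alternates so nothing loops) can be sketched roughly as below. This is a simplified illustration, not real 802.1D: it assumes the root is given rather than elected and uses a plain breadth-first search instead of bridge IDs and path costs. The switch names and the `spanning_tree` function are hypothetical.

```python
from collections import deque


def spanning_tree(links, root):
    """Return (active, blocked) link sets for a loop-free tree rooted at `root`."""
    # Build an adjacency list from the undirected links.
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    visited = {root}
    active = set()
    queue = deque([root])
    while queue:
        switch = queue.popleft()
        for peer in sorted(neighbors[switch]):
            if peer not in visited:
                # First time we reach this switch: keep the link we used.
                visited.add(peer)
                active.add(frozenset((switch, peer)))
                queue.append(peer)

    # Every link that was not needed to reach a switch is redundant and gets blocked.
    blocked = {frozenset(link) for link in links} - active
    return active, blocked


# Three switches wired in a triangle (hypothetical names).
links = [("A", "B"), ("A", "C"), ("B", "C")]
active, blocked = spanning_tree(links, root="A")
print(sorted(sorted(link) for link in active))   # [['A', 'B'], ['A', 'C']]
print(sorted(sorted(link) for link in blocked))  # [['B', 'C']]
```

      In this toy triangle the B-C link ends up blocked, which is the "blocks alternate routes" part of the quote.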

      --
      load "linux",8,1
    6. Re:Problem was with an application, by rppp01 · · Score: 2, Offtopic

      Well.....I guess you could look at most of the sites we slashdot.....one application (IE, Mozilla, Opera, etc) takes down an entire site for hours and days and sometimes longer.

      --
      They stuck me in an institution, said it was the only solution, to...protect me from the enemy, myself
    7. Re:Problem was with an application, by nolife · · Score: 5, Interesting

      Not only that but they gave the impression no one had problems using the old paper method. Actually noting that at times the network was fine but they decided to stick with the backup method until the issue was resolved because it was harder switching back and forth when the network was working. All in all though they made a point that no appointments were missed, no surgeries were cancelled etc.. Meaning business was as usual but using a backup manual method.

      I have not read Network World enough to form an impression of their style, is it watered down to favor advertisers and the general IT purchasing people or is it really a nuts and bolts down to earth mag?

      --
      Bad boys rape our young girls but Violet gives willingly.
    8. Re:Problem was with an application, by ipstacks · · Score: 2, Interesting

      Routing is the solution. Anyone that runs a layer two network beyond one switch should be fired. Routing convergence is much faster than spanning-tree (even with the Cisco tweaks). Why would I want layer two when layer three routers are capable of wire-speed routing?!

      --
      Which distro does Linus use?
    9. Re:Problem was with an application, by Anonymous Coward · · Score: 0

      Yep, that's a nice regurgitation of what the parent poster said, troll.

    10. Re:Problem was with an application, by nm42 · · Score: 1

      Yes, but on a distributed scale.
      One person, running one copy of IE would be hard pressed to take down a network. Even Netscape only takes down the machine it's running on!

    11. Re:Problem was with an application, by GoofyBoy · · Score: 2


      First sentence says it wasn't the software, but how he/she was using it (uploading a huge amount of data).

      Why not effectively "kill" the upload and wouldn't that clear the problem?

      --
      The surprise isn't how often we make bad choices; the surprise is how seldom they defeat us.
    12. Re:Problem was with an application, by aheath · · Score: 1

      The most interesting thing about this incident is that it was publicized. This allows us to have a public discussion of the best practices to avoid network outages and the best practices to recover from network outages. According to NetworkWorldFusion (http://www.nwfusion.com/news/2002/1125bethisrael.html): "Halamka's candor with his peers suggests he wants to spread the word about potential risks of automation. "It is not surprising to those of us who know John that he is willing to share his experiences for the greater good," says Meg Aranow, vice president and CIO at Boston Medical Center. "He is very committed to improving the discipline of healthcare computing," she adds." Hopefully John Halamka will write up the technical details of the root cause of the outage and the steps taken to address the outage. It might be worth watching the following sites for more information: John Halamka's web page at http://informatics.caregroup.harvard.edu/people/jhalamka/ and the Massachusetts Health Data Consortium web page at http://www.mahealthdata.org/

    13. Re:Problem was with an application, by yknott · · Score: 1
    14. Re:Problem was with an application, by Anonymous Coward · · Score: 0

      Nope, the parent said that the user should not have been allowed to use a production network to do his job. However, the network exists to allow people that actually contribute something to an organization to do their jobs, which means the menial network drones are incompetent.

    15. Re:Problem was with an application, by mr.canuck · · Score: 0, Redundant

      Crap!

      The problem was with having unqualified personnel (a physician) serving the function of an IT guy who would be qualified enough to wonder if it's all going to come crashing down around him way, way before it actually does.

    16. Re:Problem was with an application, by MCZapf · · Score: 1
      According to the article, that's exactly what he did. He unplugged the network cable, even. Apparently, though, there was some sort of endless routing loop (the article doesn't really say).

      The Slashdot headline mentions a spanning-tree algorithm problem. This is a problem at the Ethernet level, IIRC. I didn't see any mention of that when I skimmed the article. If I had to guess, though, I'd say that some switches or bridges or whatever they're called lost track of what they were connected to and stopped passing packets along. Or they passed packets along in a loop, so the packets never went away.
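
      The "passed packets along in a loop, so the packets never went away" guess can be illustrated with a toy flooding model. It assumes three hypothetical switches, a single broadcast frame, and no spanning tree on the looped topology; Ethernet frames carry no hop limit, so nothing in the model ages a frame out.

```python
def frames_in_flight(neighbors, start, steps):
    """Flood one broadcast frame and count the copies still circulating after `steps` hops."""
    # Each frame is (switch it is at, switch it arrived from).
    frames = [(start, None)]
    for _ in range(steps):
        next_frames = []
        for switch, came_from in frames:
            # A switch floods a broadcast out of every port except the one it arrived on.
            for peer in neighbors[switch]:
                if peer != came_from:
                    next_frames.append((peer, switch))
        frames = next_frames
    return len(frames)


# Triangle of switches with every link forwarding (a layer 2 loop).
looped = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
# Same switches with the B-C link blocked, as spanning tree would leave it.
tree = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}

print(frames_in_flight(looped, "A", 10))  # 2
print(frames_in_flight(tree, "A", 10))    # 0
```

      In the loop-free case the broadcast dies out after one hop; in the triangle two copies keep circling no matter how long the model runs, even though the original sender is long gone.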

    17. Re:Problem was with an application, by Anonymous Coward · · Score: 0

      You must work for the hospital. Maybe if you did your job instead of reading/posting to Slashdot, you'd have a stable network. But probably not.

    18. Re:Problem was with an application, by Anonymous Coward · · Score: 0

      It's one of those publications that nobody pays for, yet everybody gets for some reason. Heavy on the advertising, and the content is generally aimed at management. Definitely not a 'nuts and bolts' mag.

    19. Re:Problem was with an application, by Anonymous Coward · · Score: 0

      I am no network guru, so I may be way off when I say that you sound like you know what you are talking about. If this is the case, then please explain why this is so wrong instead of just laying it out there like we should all just "know" why you are correct. Thanks.

    20. Re:Problem was with an application, by John+Hasler · · Score: 1, Redundant

      A network that can be brought down by an application is not of production quality.

      --
      Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
    21. Re:Problem was with an application, by Anonymous Coward · · Score: 0

      You obviously are not working in a realistic environment. Spanning tree convergence can be brought down to less than 5 seconds. Also, the difference between a layer 2 switch (6509 for example) and a layer 3 switch (6509 + msfc) costwise is considerable. Why route on every switch when you can have 1 or 2 routers that service an entire campus?

    22. Re:Problem was with an application, by Goose3254 · · Score: 0

      WHAT? Routing faster? No way. When the frames hit the switch backplane and then are SWITCHED to the proper port, that will be worlds faster than the router reading the packets, then sending them to the switch (even if the router is a blade in the switch chassis), then being forwarded to the port. The lower you stay on the 7 layer model, the faster you go.

      It sounds like the network was operating at or near saturation normally. Designing a network 5 years ago, for traffic loads at that time, then running the "2.1" versions of the same software, without looking at an upgrade to the infrastructure, is most probably the issue.

      As far as saying that the network was crap because it didn't recover when the offender was unplugged; That's not altogether true. Since I don't know the nature of the offending traffic, I know I couldn't make that call. I can relate how the removal of two networked printers crippled large sections of a large network because 2 AS400s and one mainframe printed reports to them, and continued to try to print to them upon failure because the job loop required a verification. 2 weeks of backed up reports rendered the network quite unusable on those sections of the Class B.

      Lesson learned: Mind your prints and queues ;-p

    23. Re:Problem was with an application, by sumdumgai · · Score: 1

      Absolutely, this is a mismanaged network. Any network with this level of critical service requirement should be better designed.

      --
      "In theory, theory and practice are the same. In practice, they are not." - Albert Einstein
    24. Re:Problem was with an application, by GarryOwen · · Score: 2, Informative

      You sound a bit old school. Routing nowadays can be as fast as switching; of course, routers that fast will cost a hell of a lot more. The reason is that most routers nowadays don't actually do per-packet inspection and routing. They route the first packet of a stream and then switch all following packets in that stream. Also, your statement that the lower you are on the 7 layer model, the faster you go, is wrong; otherwise hubs would be faster than switches (layer 1 vs layer 2).
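
      A minimal sketch of the "route the first packet, switch the rest" idea, assuming a software flow cache keyed on source and destination address. Real layer 3 switches do this in hardware and key on more fields; the `FlowCachingRouter` class, the route table, and the port names are hypothetical.

```python
import ipaddress


class FlowCachingRouter:
    def __init__(self, routes):
        # routes: list of (network in CIDR notation, outgoing port name)
        self.routes = [(ipaddress.ip_network(net), port) for net, port in routes]
        self.flow_cache = {}
        self.slow_path_lookups = 0

    def _route_lookup(self, dst):
        """Slow path: longest-prefix match over the whole routing table."""
        self.slow_path_lookups += 1
        addr = ipaddress.ip_address(dst)
        matches = [(net.prefixlen, port) for net, port in self.routes if addr in net]
        if not matches:
            return None
        return max(matches)[1]

    def forward(self, src, dst):
        """Fast path: reuse the cached decision for a flow already seen."""
        flow = (src, dst)
        if flow not in self.flow_cache:
            self.flow_cache[flow] = self._route_lookup(dst)
        return self.flow_cache[flow]


router = FlowCachingRouter([("10.0.0.0/8", "port1"), ("10.1.0.0/16", "port2")])
for _ in range(1000):
    port = router.forward("10.2.0.5", "10.1.0.9")
print(port, router.slow_path_lookups)  # port2 1
```

      A thousand packets in the same stream cost one routing lookup and 999 cache hits, which is roughly why the per-packet cost can approach that of switching.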

    25. Re:Problem was with an application, by aheath · · Score: 5, Informative

      I contacted Dr. John D. Halamka to see if he could provide more detail on the network outage. Dr. Halamka is the chief information officer for CareGroup Health System, the parent company of the Beth Israel Deaconess medical center. His reply is as follows: "Here's the technical explanation for you. When TAC was first able to access and assess the network, we found the Layer 2 structure of the network to be unstable and out of specification with 802.1d standards. The management vlan (vlan 1) had in some locations 10 Layer 2 hops from root. The conservative default values for the Spanning Tree Protocol (STP) impose a maximum network diameter of seven. This means that two distinct bridges in the network should not be more than seven hops away from one to the other. Part of this restriction is coming from the age field that Bridge Protocol Data Units (BPDUs) carry: when a BPDU is propagated from the root bridge towards the leaves of the tree, the age field is incremented each time it goes through a bridge. Eventually, when the age field of a BPDU goes beyond max age, it is discarded. Typically, this will occur if the root is too far away from some bridges of the network. This issue will impact convergence of the spanning tree. A major contributor to this STP issue was the PACS network and its connection to the CareGroup network. To eliminate its influence on the CareGroup network we isolated it with a Layer 3 boundary. All redundancy in the network was removed to ensure no STP loops were possible. Full connectivity was restored to remote devices and networks that were disconnected in troubleshooting efforts prior to TAC's involvement. Redundancy was returned between the core campus devices. Spanning Tree was stabilized and localized issues were pursued. Thanks for your support. CIO Magazine will devote the February issue to this event and Harvard Business School is doing a case study."
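
      The age-field mechanism described in the reply can be modeled in a few lines. This is a simplified sketch, not the 802.1D state machine: it assumes the root sends a BPDU with age 0, every bridge it passes through adds a fixed amount, and a BPDU whose age has gone beyond max age is discarded. The function names and the per-bridge increment parameter are assumptions for illustration.

```python
def bpdu_age_on_arrival(hops_from_root, age_added_per_bridge=1):
    """Age carried by a BPDU after passing through `hops_from_root` bridges."""
    return hops_from_root * age_added_per_bridge


def bpdu_reaches(hops_from_root, max_age=20, age_added_per_bridge=1):
    """True if the BPDU has not gone beyond max age when it arrives."""
    return bpdu_age_on_arrival(hops_from_root, age_added_per_bridge) <= max_age


def max_reach(max_age=20, age_added_per_bridge=1):
    """Largest hop count at which a BPDU is still accepted in this model."""
    hops = 0
    while bpdu_reaches(hops + 1, max_age, age_added_per_bridge):
        hops += 1
    return hops


print(bpdu_reaches(10))                          # True
print(bpdu_reaches(10, age_added_per_bridge=3))  # False
print(max_reach(age_added_per_bridge=1))         # 20
print(max_reach(age_added_per_bridge=3))         # 6
```

      In this model, whether a bridge 10 hops from root still hears the root depends entirely on how much age each bridge is assumed to add, so the result is only as good as that assumption.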

    26. Re:Problem was with an application, by darkonc · · Score: 2
      My immediate reaction was that this person who brought down the net using his research tool should not have been using a production network.

      RTFA: (from the Globe article)

      The crisis had nothing to do with the particular software the researcher was using . . . . . . The large volume of data the researcher was uploading happened to be the last drop that made the network overflow.
      The essential problem was that the network was (almost) overloaded. The data from the researcher was simply enough to complete the overload.. This probably either caused an overload in a fixed-sized table (oops) or it caused a router/switch to run out of memory. This caused the data loop. Shutting down the segment with the data loop caused a large chunk of dataflow to be re-routed along a secondary path --- overloading that path. "and they caused two more, and they caused two more and so on , and so on....".

      For all we know, this researcher could have been doing an FTP transfer (but my {blind} guess is that he was doing some sort of multi-system collaborative computing). His problem was that he put a bit more load onto an already groaning network, and broke its back{bone}.

      Now, as to preventing research work on a 'production' system: this is a teaching (read research) hospital. Research and production work go hand in hand. From reading the article, it appears that the reason why they're adding a second parallel network isn't because they want redundant connections. It's because they need the extra bandwidth (and knew that they needed it before this happened).

      In fact, on Oct. 1, hospital officials had approved a consultant's plan to overhaul the network - just not quite in time. ''Now,'' he said, ''we're going to do it faster.''
      In a sentence, preventing research on a production network would have been a PHB reaction. The only way that that sort of reaction would have had the required effect would have been to apply bandwidth/connectivity quotas to everybody on the campus. (which would have placed extra load on the routers which would, of course, have made the underlying problem worse, which......)
      --
      Sometimes boldness is in fashion. Sometimes only the brave will be bold.
    27. Re:Problem was with an application, by pyite · · Score: 3, Informative

      Technically, hubs are faster than switches for N endpoints when N = 2. The reason is hubs do not have to look at the frame being sent and either store-and-forward or cut-through like a switch does. Your total possible collision locations on a hub is N * (N - 1) / 2 (Gauss' formula for the sum of 1 to N - 1, coincidentally), where once again N is the number of endpoints. In a switch, your collision domain always has two endpoints, therefore your total possible collisions is 1, thus you get increased speed.
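
      The N * (N - 1) / 2 figure is the number of distinct pairs among N endpoints, which a short check makes concrete. The endpoint counts below are arbitrary examples, and `endpoint_pairs` is a name made up for this sketch.

```python
from itertools import combinations


def endpoint_pairs(n):
    """Number of distinct pairs among n endpoints sharing one collision domain."""
    return n * (n - 1) // 2


for n in (2, 4, 8, 24):
    # Cross-check the closed form against brute-force enumeration of pairs.
    enumerated = len(list(combinations(range(n), 2)))
    assert enumerated == endpoint_pairs(n)
    print(n, endpoint_pairs(n))
# 2 1
# 4 6
# 8 28
# 24 276
```

      With two endpoints there is exactly one pair, the same as a single switch port, and the count grows quadratically from there.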

      --

      "Nature doesn't care how smart you are. You can still be wrong." - Richard Feynman

    28. Re:Problem was with an application, by pyite · · Score: 1

      In fact, hubs are so simple that they're purely electrical and have no intelligence whatsoever. A simple repeater could be built with a simple hex inverter doing two inversions per direction.

      --

      "Nature doesn't care how smart you are. You can still be wrong." - Richard Feynman

    29. Re:Problem was with an application, by Anonymous Coward · · Score: 0

      The big question is: how did the network get configured so it was so vulnerable to "level 2" quirks??? Didn't the CIO get some independent auditors to check the quality of the network? OK - I can see losing a building's network for a couple hours... but a whole hospital for 3 days? Insane.

      Gee, I wonder what their network SECURITY is like. We're talking medical records... I wonder if they audit THAT? I'm scared!

      With such a network design, it'd be interesting to know if the network suffered from previous failures, or if this was its first outage. If it wasn't the first outage, it'd be interesting to see what caused the previous outages, and if they learned from those outages or just merely passed them off as quirks. User interviews would be VERY revealing here: "Have your PCs lost connectivity before?"

      I mean this is one of the most important, well respected hospitals in the world. How could its network be in such a poor state? Is this normal, or just part of this one organization?

      I've worked in some large companies. And we've had network failures. But never a total outage for most people, and never for more than an hour, and never more than once. And heads rolled. And it wasn't hidable - tons of employees experience such a failure at once, so you can't sweep it under the rug.

    30. Re:Problem was with an application, by Anonymous Coward · · Score: 0

      Today's bridges and switches decrement the Max Age field of a BPDU by 1 as the BPDU is forwarded out its switchports. However, with no tuning in a network, the Max Age parameter is set to 20. The fact that in some instances, 10 Layer 2 hops existed is irrelevant - because by the time the BPDU reaches those edge Layer 2 devices, the Max Age value is what, 10?

      The 7 hop diameter comes from when the 802.1D specification was originally developed--when bridges were software-based, where filters often caused delay. The conservatism comes from the assumption that each device will take up to 3 seconds to process the BPDU and decrement the Max Age parameter based on this time. Today's bridges and switches operate at wire speed - and don't work on the principle of decrementing delay in seconds but rather decrement by a static value of 1. So, the diameter of the hospital's network really wasn't an issue, and the solution of removing all Layer 2 redundancy isn't any solution at all. Wow, if that's what TAC is like today, perhaps buying Cisco stock really isn't a good thing after all.
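
      The link between a 7 hop diameter and a max age of 20 can be worked through with the timer formulas commonly given in Cisco's STP timer tuning guidance. Treat the formulas as an assumption here; neither the article nor this thread states them. The `stp_timers` function is a name made up for this sketch.

```python
def stp_timers(diameter, hello=2):
    """Return (max_age, forward_delay) in seconds for a given network diameter.

    Assumed formulas (commonly cited for 802.1D timer tuning):
        max_age       = 4 * hello + 2 * diameter - 2
        forward_delay = (4 * hello + 3 * diameter) / 2
    """
    max_age = 4 * hello + 2 * diameter - 2
    forward_delay = (4 * hello + 3 * diameter) / 2
    return max_age, forward_delay


print(stp_timers(7))   # (20, 14.5)
print(stp_timers(10))  # (26, 19.0)
```

      With a 2 second hello and a diameter of 7 this gives the familiar max age of 20 and a forward delay of 14.5, which rounds up to the usual default of 15. Under the same assumed formulas, a diameter of 10 would call for a max age of 26.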

    31. Re:Problem was with an application, by Anonymous Coward · · Score: 0

      Happily, this is the case of a new CIO doing his best to admit his failures. You just wouldn't see that in the business world.

      This guy is smart enough to realize that he can't blame a specific user, equipment provider or industry and get away with it. It was the failing of his organization that he has been managing, so he felt he should be public about it.

      He might be quietly fired in the end, but at least some other CIOs might learn a very important lesson... thanks to this guy's failure.

    32. Re:Problem was with an application, by Anonymous Coward · · Score: 0

      I have worked as a network/systems engineer in a hospital. IT is the VERY last thing to get any attention. The garbage compactor got more attention than the network. Even within the IT group there were individuals who were bending over backwards to "help" the hospital employees by plugging in hubs and switches in people's offices. We were begging and pleading almost every week to get the whole thing documented but in a 24/7 operation no one was willing to take down anything to help document. We set up standards that sat on the CFO's desk for months only for him to come back and say "I don't understand this can you redo it?" It was incredibly stupid. I no longer work there because it had gotten past the point where I figured I was doing any good and followed the money.

  4. an identical network by Anonymous Coward · · Score: 0

    having an identical network would almost be like raiding several hard drives to have the data backed up (raid 0+1 i think). It would almost guarantee a connection unless of course they both go down. But how likely is that? :)

    scapegoat

  5. This is what you call... by Anonymous Coward · · Score: 2, Funny

    ... "an old boys' network"

  6. No. by Clue4All · · Score: 5, Interesting

    do you think the answer to having a massive and unreliable network is to build a second identical network?

    No, the answer is to fix what is broken. This might be a new concept to some people, but things don't break on there own. If you're doing network upgrades and something stops working, REVERT THE CHANGES AND FIGURE IT OUT. This is reckless and irresponsible behavior.

    --

    Is your browser retarded?
    1. Re:No. by Anonymous Coward · · Score: 0

      They don't break on their own either.

    2. Re:No. by Anonymous Coward · · Score: 0

      Shaddup.

    3. Re:No. by passion · · Score: 2

      good idea, the problem is that most institutions don't do enough regression testing to see if *absolutely everything* is working. Oh sure, my cat's webpage with the 3-d rotating chrome logo still loads, but what about the machine that goes ping keeping Mr. Johnson alive just down the hall?

      --
      - passion
    4. Re:No. by Anonymous Coward · · Score: 5, Informative

      As an employee at BIDMC (the Beth Israel Deaconess Medical Center) I can tell you that they did not just install a parallel network. The first network was completely redesigned to be more stable and once it proved its stability, then a second redundant network was put in place to ensure that if the network ever became unstable again for any reason there was a backup that was known to work immediately instead of having to wait to fix the original again. Most of the housestaff at BIDMC were already familiar with the paper system as the transition to paperless had only occurred over the last two years and in stages. The real problem was obtaining lab and test results as these have been on computer for years.

    5. Re:No. by barberio · · Score: 5, Insightful

      The problem here is that it will take days, maybe weeks to do this. Hospitals want the data flowing *Now*.

      So the answer is - Yes. In a situation where 100% uptime is demanded, the only solution is redundant systems.

    6. Re:No. by StillNeedMoreCoffee · · Score: 1

      So you're suggesting that the network had help breaking .. tell Homeland Security quick.

    7. Re:No. by ostiguy · · Score: 5, Insightful

      If a network problem breaks down network 1, what is going to stop it from breaking network #2? If the problem was with the firmware in device#23a, the problem will reoccur on network 2 with device #23b

      ostiguy
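
      The point about an identical second network sharing the first one's flaw can be put into rough numbers. The probabilities below are made up purely for illustration; the model assumes each network can fail on its own with some probability, plus a separate chance of a shared cause (the same firmware bug, say) that takes down every copy at once. `outage_probability` is a hypothetical helper.

```python
def outage_probability(p_independent, p_common, copies=2):
    """Chance that every copy of the network is down at the same time."""
    # All copies happen to fail for unrelated reasons.
    all_fail_independently = p_independent ** copies
    # Either the shared cause strikes, or it does not and the copies all fail anyway.
    return p_common + (1 - p_common) * all_fail_independently


# Hypothetical numbers: 1% chance a single network is down for its own reasons.
print(outage_probability(0.01, p_common=0.0))    # about 0.0001
print(outage_probability(0.01, p_common=0.005))  # about 0.0051
```

      With these made-up figures, even a small shared-cause probability dominates the result, so a second identical network buys much less than the independent case suggests.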

    8. Re:No. by pubjames · · Score: 5, Interesting

      I spoke to an electrician at our local hospital recently. He told me the hospital had three separate electricity systems - one connected to the national grid, one connected to an onsite generator which is running all the time, and a third connected to some kind of highly reliable battery system (sorry can't remember the details) for life support and operating theatres in case both the national grid and the on-site generator fail simultaneously.

      If they have that level of redundancy for the electrics then I see no reason why they shouldn't for the network.

    9. Re:No. by hey! · · Score: 2

      Sounds good. Unfortunately, details didn't make it into the Globe article.

      A few questions, if I may. Is the design and scope of the redundant network the same as the original network? Personally I'd consider a smaller network to carry just the most critical information so that efforts to diagnose and recover that network, should that become necessary, will be more concentrated.

      Secondly, have the contingency plans considered the possibility of deliberate subversion, such as a buffer overflow attack on the equipment or DDOS on hosts? Again, this is where I'd consider a restricted network useful, as well as contingency plans to move data by paper or other media.

      --
      Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
    10. Re:No. by dirk · · Score: 3, Interesting

      No, the answer is to fix what is broken. This might be a new concept to some people, but things don't break on there own. If you're doing network upgrades and something stops working, REVERT THE CHANGES AND FIGURE IT OUT. This is reckless and irresponsible behavior.

      While in the short term the answer is to fix what is broken, they should have had an alternative network set up long ago. When you are dealing with something as important as a hospital, you should have redundancy for everything. That means true redundancy. There should be 2 T1 lines coming in from 2 different vendors from opposite directions if that is something that will endanger lives if it breaks. If something is truly mission critical, it should be redundant. If it is life-threatening critical, every single piece should be redundant.

      --

      "Information wants to be expensive" - Stewart Brand, the same guy who said "Information wants to be free"
    11. Re:No. by Lord+Crc · · Score: 1

      How about choosing a different network hardware supplier for that redundant net? If the redundant net is identical, there's no stopping the same from happening right afterwards if the error is something more than physically broken hardware.

    12. Re:No. by Openadvocate · · Score: 2

      "things don't break on there own"
      mkaeeyy, I'd like some of that hardware.

      I have never seen unbreakable network hardware.
      I have seen network hardware with redundancy to prevent loss of services in case of a breakdown and I have seen the redundancy fail also.
      :)

      --
      my sig
    13. Re:No. by jthuck · · Score: 1

      But that is redundant supply, not internal infrastructure. To match your argument, you would be advising them to get a T1 from Sprint, a T1 from MCI, and a satellite link, not having multiple internal paths.

      Now to match it back to other arguments, the life support systems were placed on a separate circuit from the other electrical circuits (lighting and whatnot). This would correlate to proper network design, where critical systems would get their own subnet.

    14. Re:No. by Anonymous Coward · · Score: 0

      Off topic

      The literature (papers on medical/health hospital computer systems, mainly databases) has lots of little tidbits about Beth Israel and their computer systems. They were a rather early adopter of trying to go more or less paperless; oddly enough, even with the computerized system, doctors frequently printed out mounds and mounds of paper anyways from the system because they wanted a hardcopy. Maybe this helped save them during this crisis. (And before anyone screams "what about the cost of the paper", costs were driven down despite the printing due to general time saved per patient as well as being more efficient (not ordering repeat tests because the first set couldn't be found)).

    15. Re:No. by Anonymous Coward · · Score: 0

      Uhh, no. Because you can build a non-computer redundant system. It's called paper.

      Doctors used to have problems (and still do) using computer systems. Beth Israel is one place that has tried to adopt a less paper strategy, to save both time and reduce repeat tests because something could not be located.

      In the past, when a computer system was still in place, some in the staff still felt compelled to print print print, even though it was supposedly in this safe environment. That was their redundant system, and far better than multiple parallel systems--redundancy is still another level of complexity which introduces another set of problems and often a false sense of security.

    16. Re:No. by w1r3sp33d · · Score: 1

      It's a mixed bag. The major hospital network I have been building up over the last two years has redundant systems for hardware failure. An application or STP problem on one VLAN would require isolating and fixing the problem; no amount of spare gear can fix it. It happened nine months ago when an IT staffer wanted more ports on one VLAN, powered up an old 5500 and uplinked it to the new 6500 at their core; STP crashed out the entire floor (per-VLAN STP, one VLAN per wing/floor). Since then the entire infrastructure team has been fired; the hospital feels it's better to have nobody around than a single person who thinks they know what they're doing but doesn't. I am not heartbroken: my support calls have dropped by 80% and I can focus on building out their network rather than changing my configs every time someone "helps me out."

    17. Re:No. by HamNRye · · Score: 2

      This whole thing makes no sense....

      They state that the problem was at the application/workstation level. The solution: install a second network. WTF?? If it really was a researcher at his workstation, disconnect his station, possibly reset his hub, and go. Problem solved, I'm back to finding out what shows are playing at Harvard Square this weekend.

      Now, Fast forward 5 years when the network goes out like this again... If their past maintenance performance is any judge, I'll just assume they did not maintain quarterly testing of the secondary network (It's a pain in the butt, and Hospitals are 24 hour operations) and I'll bet it doesn't work when they need it. The extra switches and such might come in handy, but I'm positive that they could have achieved better reliability for the same money by spending in other areas.

      For our 1,000 person operation, installing a second network would involve about 50-60 hubs, routers, switches, etc. Involve the extra telco racks, and running that cable, that's mighty frikken' expensive. We do have a backup for our backbone, but the entire thing?? Ewww....

      ~Hammy

    18. Re:No. by Idarubicin · · Score: 2
      If they have that level of redundancy for the electrics then I see no reason why they shouldn't for the network.

      Well, in a modern hospital, being without network access for a few minutes doesn't kill people. Losing power in an operating theatre can make someone very dead, very quickly. Yes, procedures exist to handle such a situation, but there really isn't a good backup to, say, a heart-lung machine.

      I know, there are /.ers that would die without their DSL lines, but most of them don't live in hospitals.

      --
      ~Idarubicin
    19. Re:No. by Anonymous Coward · · Score: 0

      The question is posed in the manner of a true non-medical arm-chair quarterback. Why have electrical backup and not network redundancy? If the electricity goes down for a few seconds-to-minutes, the patients on the ventilators DIE. If the network goes down, the physicians have to write their orders on paper -- no one DIES.

      Before asking the easy question, think about it a bit next time. Sure network back-up would be nice, but given how financially strapped most health-care systems are, this is way down the road.

  7. Of course it can help by Anonymous Coward · · Score: 2, Insightful

    Yes, a second, fully redundant network would be "good" from a stance of giving better fail-over potential.

    But will anyone know when one network fails? If not, then how will they fix it? If they don't fix it, then doesn't that mean that they really only have one network?

    Which puts them right back to where they were.

    Of course, if they put a redundant network in, then fix their problems to try to prevent this issue happening in future, then they'll be in much better shape the next time their network gets flushed with the medical waste.

    1. Re:Of course it can help by dprior · · Score: 1

      I'm pretty sure they'll realize their network is down when they are forced to start running around looking for paper forms again. That might clue 'em in.

    2. Re:Of course it can help by Anonymous Coward · · Score: 0

      But will anyone know when one network fails? If not, then how will they fix it? If they don't fix it, then doesn't that mean that they really only have one network?

      I am not trying to bash you specifically my AC brother from another mother. But everyone rating this as 'Insightful' should have their mod points taken away. They know nothing about network administration, but are taking it upon themselves to show which posts are truly 'Insightful'. They should learn to not moderate if they don't have a clue about the subject. Now, with that out of the way...

      Even the simplest design that includes network management capabilities (and with a network of this size it is definitely actively monitored) will be able to tell you when a switch fails, a router has even the slightest hiccup, or even if a single end-user connection starts to get saturated. If the network management utilities are set up and configured correctly, the Network Admin will know that there is a critical failure before the users even realize that it's not just their workstation working slow again.

  8. Well... by REBloomfield · · Score: 1

    If the first one's bust, how's a second going to help? :) Although i must admit that redundancy is a wonderful thing for servers, power supplies, etc, but for infrastructure?? Having identical copies of routers kicking around is extremely useful, but cost effectiveness comes into play. If you can afford it, I can't argue with the logic.

    1. Re:Well... by ibennetch · · Score: 1

      Having identical copies of routers kicking around is extremely useful, but cost effectiveness comes into play

      Not just cost effectiveness...I would think that if something's running that brings down the first network, switching everything over to the second network would cause the second network to go down just like the first. No, IMHO a second network is not any type of redundancy except for network hardware failure, which is easy to fix by having a few spare parts lying around and swapping if a problem pops up.

      Or it could be like the old Asante hubs we used to have (3 or 4 years ago) where I went to high school...every time there was a thunderstorm, we'd lose a few. Asante was really good about replacing them, but we went around to each of the closets and at the very least had to restart most of the hubs. There was always that one closet we'd forget and get phone calls later that day...

  9. Networked hospitals by Anonymous Coward · · Score: 0

    Hmm, a second parallel system. Would this include parallel wiring closets? I suspect that the cost involved (I once worked on a project team that was merely replacing wiring at a hospital, and it took 6 months) would have them continue to use existing wiring runs. You have now created a single point of failure for *both* networks.

    For those who think that a hospital wouldn't cut corners in that way, think again. I know what we had to do with our project, and I for one will never let anyone I know stay at that hospital. If they were willing to cut there, where else will they cut?

    Anon Coward

    1. Re:Networked hospitals by TheCarp · · Score: 1

      You're missing something about hospitals. I worked at one (in fact, one not far from Beth Israel) for a couple of years (in IT no less) and in my experience the problem is with budgeting.

      You see...hospitals don't work like companies. They (usually) are not-for-profit entities. There is no CEO; the central structure is a huge bureaucracy. It's a lot like a university.

      Generally IT departments are not at the top of the list to receive funding. Their budgets tend to be a lot smaller than they really need. In fact, where I worked, the budget cycle happened, and IT ran out of their budget for buying new servers in about 4 months!

      Now, when some big name research doctor said "give me 4 million to set up a lab to do stuff or I'm going somewhere else" they happily did it (I also noticed that the work orders I was doing for his little department had a funding code that listed as being from the US Army... neuroscience department... funny that...)

      Which illustrates another point... budgeting is weird. Money comes from all manner of places: central budget, grants. All stuff the IT departments generally don't see any of except from their chargebacks. (And boy do they ever charge back: we were charging departments $300 just to activate a network drop... activate... literally just to go into a wiring closet and connect a patch cable!)

      The problem is, nobody dies if IT is underfunded (at least not in a way that can easily be traced back and you can say "See, if IT had been better funded..."). They are also newer departments, not like Radiology, which has been around for the better part of the last century.

      -Steve

      --
      "I opened my eyes, and everything went dark again"
  10. friggin windoze users by kraksmoka · · Score: 0, Flamebait

    that's what u get when u sign onto monopolyware. fact is, with all the fancy toys that docs use like MRI and tomography, i haven't met one that knows anything about a computer. in fact they were probably glad their stuff crashed. in fact, it was probably a setup to get the old system back! lousy docs :(

    --
    "You never want a serious crisis to go to waste." - Rahm Emanuel
    1. Re:friggin windoze users by Anonymous Coward · · Score: 1, Funny
      worry about curing whats wrong with my brian than dealing

      Oh, that's so cute. Sounds like true love. How long have you and Brian been together? Where did you meet?

    2. Re:friggin windoze users by b1t+r0t · · Score: 5, Funny
      I'll let my doctor worry about curing whats wrong with my brian than dealing with high-order complex networking issues, thank you very much.

      "Dammit, Jim, I'm a doctor, not a CCIE!"

      --
      "Open source is good." - Steve Jobs
      "Open source is evil." - Microsoft
    3. Re:friggin windoze users by kraksmoka · · Score: 1
      i'm not talking about high-order networking with docs. i'm talkin about using explorer! or a machine based scheduling program. or getting those ct scan images over the web. most of the ones i have met are luddites outside of their profession, and it's a shame. it isn't intelligence, or time, but stubbornness.

      what can i say, my point is, that the docs are probably happy about the crash, and happy to have all paper. that disgusts me.

      --
      "You never want a serious crisis to go to waste." - Rahm Emanuel
  11. Major American Bank Outage by MS_leases_my_soul · · Score: 5, Informative

    A Bank in America [;)] had an outage back in 1998 where all their StrataComs went down for similar reasons. The Gateway/Network Engineering group had been saying for a couple years that we needed more redundancy but senior executives just saw the expenses and not the liability ... until every single StrataCom went down.

    We had to rebuild the entire network ... it took a week. All non-critical traffic had to be cut-off as we pushed everything through the backup T1s and ISDN lines. It cost the bank MILLIONS of dollars.

    Suddenly, that backup network was real cheap. They are now quite proud to tote their redundancy.

    1. Re:Major American Bank Outage by Pig+Hogger · · Score: 2

      Well, for a banker (and any ignorant bean-counting type), a pound of cure is better than an ounce of prevention...

    2. Re:Major American Bank Outage by passion · · Score: 3, Informative

      If triple-redundancy is good enough for San Francisco's BART, and this "major bank", then why can't it be good enough for a hospital, where there are most likely many people on life support, or who need instant access to drug reactions, etc?

      --
      - passion
    3. Re:Major American Bank Outage by Anonymous Coward · · Score: 0

      Nothing but FUD. Where is your proof? If such a major outage happened with such a major bank, surely there is a news article you can dredge up on it?

    4. Re:Major American Bank Outage by Anonymous Coward · · Score: 0

      I can substantiate the story, cause I was there.

      heh.
      erich trowbridge
      ccie 4653

    5. Re:Major American Bank Outage by Churchill · · Score: 1

      And I can substantiate that story as I was across the street at The Other Bank. Funny how these things get around.

      --
      What a life a mess can be.
    6. Re:Major American Bank Outage by Anonymous Coward · · Score: 0
      They are now quite proud to tote their redundancy.

      It's nice that they carry it. Do they also toute it?

    7. Re:Major American Bank Outage by Anonymous Coward · · Score: 0
      A Bank in America [;)] had an outage back in 1998 where all their StrataComs went down for similar reasons. The Gateway/Network Engineering group had been saying for a couple years that we needed more redundancy but senior executives just saw the expenses and not the liability ... until every single StrataCom went down.

      Seems like a good time for the IT staff to ask for a collective raise.
    8. Re:Major American Bank Outage by MS_leases_my_soul · · Score: 1

      Well, I'll tell you what. Walk up to any one of this bank's IT employees and ask them whatever happened to that guy who pushed the bad routing table out into production back in 1998 that caused all the StrataComs to fail. If they were there at the time, they cannot forget it, because the bank had just about every single IT person working in shifts 24 hours a day until it was back up. Then ask them about the "War Room" conference call lines they had manned.

      Those of us who could not fix routers manned conference call lines for 8 hour stretches and tested apps as links came back online.

    9. Re:Major American Bank Outage by Anonymous Coward · · Score: 0
      I work in AS400 Technical Support and one of the most important ideas that I try to get through to my customers is this:


      * If something is critical, have two of them. Plain and simple.

      I have seen too many times where a bank was brought down to its knees because it was relying on a printer and the printer failed. One time a bank had to print their checks by hand for a day and they had to have a new piece of printer hardware flown in. It would have been a lot cheaper to have another printer waiting in the wings.

      Now I'm not saying that this hospital should have 2 networks. You should expect to have some network PROBLEMS, but the entire network should not FAIL. This is simply unacceptable. Best thing would be to rebuild the network from the ground up for stability.

  12. MONEY by Botchka · · Score: 1

    I would look at making the original network more reliable and what the hell, if the hospital has money to burn, redundancy is a good thing. I didn't read the article. Was this caused by some knucklehead that was testing in a production environment?

    --
    Money not found! A)bort, R)etry, D)eclare Bankruptcy
  13. Bobby Lost An Eye by Anonymous Coward · · Score: 0

    It's all fun and games until Bobby loses an eye because his doctor couldn't read his forwards.

    -Eezy Bordone

  14. Leading question by Junks+Jerzey · · Score: 4, Insightful

    do you think the answer to having a massive and unreliable network is to build a second identical network?

    Am I the only person getting tired of story submitters using Slashdot to support their personal agendas?

    1. Re:Leading question by ibennetch · · Score: 1

      Am I the only person getting tired of story submitters using Slashdot to support their personal agendas?

      Maybe it's just a discussion-provoking question? I honestly don't know what's behind this submitter asking that question...

    2. Re:Leading question by Anonymous Coward · · Score: 0

      >> Am I the only person getting tired of story submitters using Slashdot to support their personal agendas?

      No.

    3. Re:Leading question by enkidu55 · · Score: 4, Interesting

      Isn't that the whole point in posting a story? To foster your own personal agendas? What would be the point in making a contribution to /. then if everything was vanilla in format and taste. You would think that the members of the /. community would feel a certain sense of pride knowing that their collective knowledge could help another business/community out with some free advice.

      IMHO if you don't like it then stop reading the damn thing. It's just like TV... If you don't like the channel you're watching then turn it, or turn it off and do something else, but don't bitch because you don't like the content.

    4. Re:Leading question by billybob2001 · · Score: 1

      Pardon me, but that's a leading question.

      Now we're all really tired, thanks a lot!

    5. Re:Leading question by hey! · · Score: 2

      I'm sorry if this kind of thing strikes you as cliche. You are correct in characterizing the question as a "leading" question. However, what I was trying to lead people to is not a conclusion, but an area of inquiry. Everyone knows techies don't always get the resources or time they need to do things right. If you had the opportunity presented by this kind of disaster, what would you do with it?

      I admit the question has a tone of disparagement which was perhaps unwarranted: the layman's article may not have accurately characterized the proposed solution. However, if the solution is as represented, it raises many important design strategy issues that apply not just to networks, but to any kind of mission critical, or in this case life critical, system. Redundancy is an easy sell because it is easy for non-technical people to understand. However, underlying the concept of redundancy is an assumption that one component is independent of another's problems, and that assumption may not be warranted.

      In my opinion, it is the concept of independence rather than redundancy that is key, and it is a concept that underlies many design principles.

      The direction I hope to lead the discussion in is more abstract and general, and it applies to the design of any system from a computer network to a nuclear power plant.
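
      A rough way to put numbers on independence versus redundancy, as a sketch only: the probabilities below are invented for illustration, and `outage_probability` is a hypothetical helper, not anything taken from the article.

```python
def outage_probability(p_fail: float, p_common: float) -> float:
    """Chance that BOTH networks are down at the same time.

    p_fail   -- chance each network fails from its own, independent causes
    p_common -- chance of a shared-cause failure that takes out both at once
    """
    # Both down = a common-cause event, OR (no common-cause event AND
    # both networks happen to fail independently).
    return p_common + (1.0 - p_common) * p_fail * p_fail


single = 0.01                                  # hypothetical: one network, down 1% of the time
independent = outage_probability(0.01, 0.0)    # second network shares nothing with the first
coupled = outage_probability(0.01, 0.005)      # second network shares a failure mode

print(f"single network:     {single:.4%}")
print(f"independent twin:   {independent:.4%}")
print(f"twin, shared cause: {coupled:.4%}")
```

      With these made-up figures the truly independent twin is down about 1 time in 10,000, while the twin that shares a failure mode is down roughly 1 time in 200, only about twice as good as having a single network. The shared cause dominates, which is the whole argument for independence over mere duplication.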

      --
      Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
    6. Re:Leading question by Alsee · · Score: 2

      >do you think the answer to having a massive and unreliable network is to build a second identical network?

      Am I the only person getting tired of story submitters using Slashdot to support their personal agendas?


      Fine. Just submit a duplicate story and end it with:

      "Shouldn't all life-critical systems like hospitals have identical backup systems in case the primary goes down?"

      -

      --
      - - You can't take something off the Internet! That's like trying to take pee out of a swimming pool.
    7. Re:Leading question by Anonymous Coward · · Score: 0

      Don't worry about it. The comment will be different when this story gets reposted in a week or so.

    8. Re:Leading question by Anonymous Coward · · Score: 0
      In what way is the issue of an identical twin of a failed network a personal agenda?

      A common redundancy design is to not duplicate in the two networks. For example, it is mentioned that Cisco staff couldn't help -- so how could they help if the second network was also using Cisco equipment? So don't use Cisco in the second network. Also avoid having both nets use the same bus, tree, or ring design -- and although they may share wiring closets in buildings (Fire in a wiring closet? It's amazing how quickly you can connect new routers when you can run temporary cables taped to ceilings and stairwells, and you only lost one floor anyway.), have the critical equipment in different buildings.

      Intentionally try to use different solutions in the two networks.

  15. Spanning tree by skinfitz · · Score: 2, Interesting

    do you think the answer to having a massive and unreliable network is to build a second identical network?

    I think the answer is to disable spanning tree.

    We had a similar problem here (large academic installation, hundreds of workstations, several sites) with things (before my time I hasten to add) being one Big Flat Network (shudder) using IPX primarily and Novell. Needless to say this was not good. I've since redesigned things using IP and multiple VLANs, however there is still the odd legacy system that needs access to the old net.

    My solution was to tap the protocols running in the flat network and to put these into VLANs that can be safely propagated around the layer 3 switched network and presented wherever we wish. The entire "flat" network is tapped into a VLAN and the IP services that are running on it are routed into. If there were any problems with either network, we would just pull the routes linking the two together if it were to get that bad.

    1. Re:Spanning tree by zyglow · · Score: 2

      Adding on to the VLAN idea, I'd also change the routing protocol to OSPF. They would be squandering a lot of money to run two networks side by side.

      --
      http://www.forum-addicts.com
    2. Re:Spanning tree by GLX · · Score: 5, Interesting

      This would imply that either:

      A) A campus could afford to do Layer 3 at every closet switch

      or

      B) Live without Layer 2 redundancy back to the Layer 3 core.

      I'm sure in a healthcare environment, neither is an option. The first is too expensive (unless you buy cheap, and hence unreliable equipment) and the second is too risky.

      Spanning tree didn't cause the problem here. Mismanagement of spanning tree sounds like it caused the problem.

      Spanning tree is our friend, when used properly.

      --
      Sig (appended to the end of comments you post, 120 chars)
    3. Re:Spanning tree by TheMidget · · Score: 3, Insightful
      I think the answer is to disable spanning tree.

      On a network as complex and messy as theirs? That's basically the situation where you need spanning tree, or else it just crumbles to dust once they do produce a loop...

    4. Re:Spanning tree by AKnightCowboy · · Score: 3, Insightful
      I think the answer is to disable spanning tree.

      Are you talking about a different spanning tree protocol than I think you're talking about? Spanning tree is a very good thing to run to stop loops exactly like this. More than likely one of the hospital network techs misconfigured something and ended up disabling it (portfast on two access points linked into another switch accidentally or a rogue switch?).
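
      For anyone who hasn't met it, the idea behind spanning tree can be sketched in a few lines of Python. This is a toy under loud assumptions: real STP elects the root and picks paths by exchanging BPDUs with priorities, path costs and timers, none of which is modelled here. The function name `spanning_tree` and the switch names are invented for illustration, the topology is assumed to be connected, and parallel cables between the same two switches are not handled.

```python
from collections import deque


def spanning_tree(links):
    """Split links into (forwarding, blocked) so the forwarding set has no loops.

    links -- iterable of (switch_a, switch_b) pairs, one per cable
    """
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    # Stand-in for "lowest bridge ID becomes root" (assumption, not real election).
    root = min(neighbours)

    reached = {root}
    forwarding = set()
    queue = deque([root])
    while queue:
        switch = queue.popleft()
        for peer in sorted(neighbours[switch]):
            if peer not in reached:
                # First time we reach this switch: keep the link that got us here.
                reached.add(peer)
                forwarding.add(frozenset((switch, peer)))
                queue.append(peer)

    # Every remaining cable would close a loop, so it is put into blocking.
    blocked = {frozenset(link) for link in links} - forwarding
    return forwarding, blocked


# Hypothetical triangle of three switches: one cable must be blocked.
forwarding, blocked = spanning_tree([("A", "B"), ("B", "C"), ("A", "C")])
print("forwarding:", sorted(sorted(link) for link in forwarding))
print("blocked:   ", sorted(sorted(link) for link in blocked))
```

      The blocked cable is still physically there, which is the point: if a forwarding link dies, the protocol can unblock it. Disable the protocol and that same cable is simply a loop.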

    5. Re:Spanning tree by hey! · · Score: 2

      Hmmm. But what happens in the rare instance (as here) that you have to bring up a large LAN from a dead stop? IIRC, once the network collapsed, they couldn't get the spanning tree to converge for days. All the equipment was operating correctly.

      Spanning tree is a remarkable protocol, but there are limits to its upward scalability, at least if you don't want problems like this.

      --
      Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
    6. Re:Spanning tree by lucifuge31337 · · Score: 0, Troll

      This would imply that either: A) A campus could afford to do Layer 3 at every closet switch
      [...]
      I'm sure in a healthcare environment, neither is an option. The first is too expensive (unless you buy cheap, and hence unreliable equipment) and the second is too risky.
      Good points, but in this scenario, L3 at all wiring closets seems like it would be much cheaper than a SECOND PARALLEL NETWORK. Most hospitals I've worked in (larger ones) are already running a class of switch in the closets that will support such features with a simple upgrade (Cat 55xx, etc.....). Toss in an RSM and enable VTP.

      --
      Do not fold, spindle or mutilate.
    7. Re:Spanning tree by WetCat · · Score: 1

      Hmmm... what about PC with Linux with ethernet cards to do Layer3 switching? Can it be cheaper than to buy special hardware?

    8. Re:Spanning tree by stilwebm · · Score: 5, Interesting

      I don't think disabling spanning tree would help at all, especially on a network with two campuses with redundant connections between buildings, etc. This is just the type of network spanning tree should help. But it sounds to me like they need to do some better subnetting and trunking, not necessarily using Layer 3 switches. They might consider hiring a network engineer with experience on similar campuses, even large university campuses, to help them redesign the underlying architecture. Spanning tree wasn't the problem; the architecture, and thus the way spanning tree was being used, was the problem.

    9. Re:Spanning tree by Chanc_Gorkon · · Score: 4, Insightful

      Egads no! Dedicated hardware designed for this is the only solution in this kind of case. A PC simply is not. You CAN'T use a hack in a hospital. You should not use a hack like this in a business either, but I understand if it's done this way. Hacks like this can become rather problematic once they're asked to grow. Also, most PCs do not have redundant power supplies and probably don't have a RAID array (although I have seen a vpr Matrix machine at Best Buy with a RAID array... your standard Adaptec type included in a lot of MBs now). If I were to do something similar, I would rather do it with AIX or, if using Linux, with a server-class machine. By the time you do that, you have already spent the money you'd spend on the dedicated stuff.

      --

      Gorkman

    10. Re:Spanning tree by rakslice · · Score: 2

      Using general-purpose hardware for routing may be slow, but that doesn't make it 'a hack'.

      Maybe I'm missing something obvious, but what do you need good mass storage on a router for?

    11. Re:Spanning tree by Anonymous Coward · · Score: 0

      You don't need to do L3 at each access switch to minimize the impact STP can have upon your network. You can do dual routers in the distribution layer with L3 links to each access switch. Then configure BPDU guard on the switch, along with minimizing the number of portfast ports. You'll have to allow HSRP and your routing protocol to control your redundancy, but we build five-nines networks using this design, and it scales.

      I agree that it sounds like mismanagement caused the problem, but STP also shouldn't be used as a redundancy design. Too many people don't understand how to manage STP, and a single loop will bring down the network up to the L3 edge. At most places the L3 edge will include the entire campus. Leaving STP as only a method of protection against loops, instead of a method of redundancy, does make it your friend.

    12. Re:Spanning tree by martyros · · Score: 1

      I don't know the details of the spanning tree, but aren't network protocols generally designed to be as stateless as possible (for exactly this reason)? If the network was dead anyway, couldn't they simply have turned off all the switches, routers, etc. for a minute or so, and turned them back on, and let them reconfigure themselves?

      --

      TCP: Why the Internet is full of SYN.

    13. Re:Spanning tree by Anonymous Coward · · Score: 0

      Or, do without Layer 3 switching to begin with. It's crap. Routers should route. Switches should switch. Never combine the two.

      I can do more on my network with a couple 7206s and a few 3500s (per segment) connected to my 12000s than another department can do with their 6500s.

      I have complete redundancy on my network with routers and switches, and mine costs the company a lot less money and is easier to expand. They have all of their stuff in a handful of segments, in one OSPF area doing layer 3. I have separate OSPF areas for each segment and I push that into BGP to go to the backbone or when I need to jump to their network.

      Layer 3: It just doesn't work.

    14. Re:Spanning tree by Inet · · Score: 1

      To disable STP means you have to ensure no loops are formed. This is a very hard thing to do in a large installation. What do you do against the geeks who think it is cool to plug their own hubs/switches into the network to extend their network ports? Two network outlets plugged into the same switch and you're done. BOOM!

      In a serious design you would separate user groups with VLANs and put routers between the VLANs. That way such an event would only black out a specific user group.

    15. Re:Spanning tree by Anonymous Coward · · Score: 0

      > Spanning tree is our friend, when used properly

      Tell that to those poor fsckers who are still running old versions cabletron/enterasys SFS.

      If you think "spanning tree protocols" are anywhere near a standard (despite numerous RFCs), then you haven't done much (multi-vendor)internetworking at layer 2.

    16. Re:Spanning tree by Anonymous Coward · · Score: 0

      Was the issue really spanning tree or was it EIGRP?

    17. Re:Spanning tree by Anonymous Coward · · Score: 0

      I was once supporting a product that couldn't work with Cisco spanning tree enabled. Seeing a call come in, I decided to give it to the newbie support guy and told him: just tell them to turn off the spanning tree option on the switch.

      email flies out

      "switch off the spanning 3 option"

      hahaha

    18. Re:Spanning tree by jroysdon · · Score: 3, Informative
      Disabling spanning tree on a network of any size is suicide waiting to happen. Without spanning tree you'll be instantly paralyzed by any layer two loops.

      For instance: Bonehead user wants to connect 2-3 more PCs at his desk, so he brings in a cheap hub or switch. Say it doesn't work for whatever reason, so he leaves the cable in and connects a second port from the wall (or say later on it stops working so he connects a second port to test). When both of those ports go active and you don't have spanning tree, you've just created a nice loop for that little hub or switch to melt your network. Just be glad it's going to be a cheap piece of hardware and not a large switch, or you'd never be able to even get into your production switches using a console connection until you find the connection and disable it (ask me how I know). How long does this take to occur? Not even a second.
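
      To see why a single extra cable is enough, here is a toy flooding model in Python. It is a sketch under assumptions, not a switch emulator: every device just floods a broadcast frame out of every port except the one it arrived on, there is no spanning tree, and the names `flood`, "core", "closet" and "hub" are invented for the example.

```python
from collections import defaultdict


def flood(links, start, hops):
    """Return how many copies of one broadcast frame are in flight after each hop.

    links -- list of (device_a, device_b) cables; a repeated pair is a second cable
    start -- device where a single broadcast frame is injected
    hops  -- number of forwarding steps to simulate
    """
    ports = defaultdict(list)                 # device -> [(link_id, neighbour), ...]
    for link_id, (a, b) in enumerate(links):
        ports[a].append((link_id, b))
        ports[b].append((link_id, a))

    # A frame in flight is (device it has arrived at, link it arrived on).
    in_flight = [(start, None)]
    history = []
    for _ in range(hops):
        next_round = []
        for device, arrived_on in in_flight:
            for link_id, neighbour in ports[device]:
                if link_id != arrived_on:     # flood out of every other port
                    next_round.append((neighbour, link_id))
        in_flight = next_round
        history.append(len(in_flight))
    return history


# One cable from the wall to the desk hub: the broadcast dies out.
tree = flood([("core", "closet"), ("closet", "hub")], "hub", 6)

# The user plugs a second wall port into the same hub: it never dies out.
loop = flood([("core", "closet"), ("closet", "hub"), ("closet", "hub")], "hub", 6)

print("no loop:", tree)
print("loop:   ", loop)
```

      In the loop-free case the copy count drops to zero after a couple of hops. With the second cable the same frame keeps circulating forever, and that is from one broadcast; a real network adds new broadcasts on top continuously.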

      Spanning tree is your friend. If you're a network technician/engineer, learn how to use it. Learn how to use root guard to protect your infrastructure from rogue switches (or even evil end-users running "tools"). A simple search on "root guard" at Cisco.com returns plenty of useful hits.

      At my present employer, we're actually overly strict and limit each port to a single MAC address and know what every MAC address in any company hardware is. We know where every port on our switches go to patch panels. If anything "extra" is connected, or a PC is moved, we're paged. If a printer is even disconnected, we're paged. The end-users know this, and they know to contact IT before trying to move anything.

      Why do we do this? We've had users bring in wireless access points and hide them under their desks/cubes. We want to know instantly if someone is breaching security or opening us up to such a thing. Before wireless, I'd say this was overly anal, but now, it's pretty much a requirement. The added benefit to knowing if an end-user brings a personal PC from home, etc., on to the network (which means they possibly don't have updated MS-IE, virus scanners/patterns, may have "hacking tools", etc.). This isn't feasible on a student network or many other rapidly changing networks, but on a stable production network it's a very good idea. Overhead seems high at first, but it's the same as having to go patch a port to a switch for a new user - you just document the MAC address and able port-level security on the switch port:
      interface FastEthernet0/1
      port security action trap
      port security max-mac-count 1
      With syslogging enabled, you'll know when this occurs. If you've got expect scripts to monitor the logs and page you when another MAC address is used on that port, and you've got your network well documented, you can stop by the end-user while they're still trying to dink around hooking up their laptop and catch 'em in the act.
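
      The bookkeeping side of this is simple enough to sketch in Python. This is only an illustration of comparing what shows up on a port against the documented inventory; the names `normalize` and `find_violations`, the port labels, and the MAC addresses are all hypothetical, and getting the (port, MAC) pairs out of your syslog or traps is left out because that depends on your gear.

```python
def normalize(mac):
    """Reduce a MAC address to bare lowercase hex so notations can be compared."""
    return mac.lower().replace("-", "").replace(":", "").replace(".", "")


def find_violations(documented, observed):
    """Return the (port, mac) pairs that do not match the documented inventory.

    documented: dict mapping port name -> the single MAC allowed on that port.
    observed:   iterable of (port, mac) pairs seen on the network.
    """
    violations = []
    for port, mac in observed:
        expected = documented.get(port)
        # Flag undocumented ports as well as unexpected MACs on known ports.
        if expected is None or normalize(mac) != normalize(expected):
            violations.append((port, mac))
    return violations


# Hypothetical inventory and sightings.
inventory = {"Fa0/1": "00:11:22:33:44:55"}
seen = [
    ("Fa0/1", "0011.2233.4455"),     # same MAC, Cisco dotted notation
    ("Fa0/1", "00-AA-BB-CC-DD-EE"),  # somebody's laptop
    ("Fa0/2", "00:11:22:33:44:55"),  # documented PC moved to another port
]
```

      Anything `find_violations(inventory, seen)` returns is what you would page on: here, the unknown laptop and the PC that moved without IT being told.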

      Yes, I know all about MAC address spoofing. Do my end-users? Probably not, and by the time they find out, they're on my "watch list" and their manager knows. Of course, that's where internal IDS is needed and things start to get much more complicated, but at least you're not getting flooded with odd-ball IDS reports if you manage your desktops tight so users can't install any ol' app they want. Higher upfront maintenance cost? Perhaps, but we've never had any end-user caused network issue.

      I'm fairly certain that if someone was running a "bad" application like what hosed the network in this story, I'd find it in under 30 minutes with our current network documentation. Would it require a lot of foot traffic? Yes, as the network would possibly be hosed so management protocols wouldn't work, but I could isolate it fairly fast with console connections and manually pulling uplink ports.
    19. Re:Spanning tree by arkane1234 · · Score: 1

      If you think "spanning tree protocols" are anywhere near a standard (despite numerous RFCs), then you haven't done much (multi-vendor)internetworking at layer 2.

      See, that's the problem. Anybody who integrates a network knows to use the same brand of router for a certain peering point.

      --
      -- This space for lease, low setup fee, inquire within!
    20. Re:Spanning tree by Anonymous Coward · · Score: 0

      The real answer is to use Split Multi-Link Trunks between your closets and the aggregation switches. This allows you to eliminate spanning tree, run a single network with plenty of redundancy, and skip the cost and complexity of routing in every closet!

    21. Re:Spanning tree by GLX · · Score: 1

      You don't need mass storage. You need a reliable, *bootable* flash memory solution, then. RAID5 on magnetic media is obviously impractical management wise if you have hundreds of these.

      Know of any?

      --
      Sig (appended to the end of comments you post, 120 chars)
    22. Re:Spanning tree by Cramer · · Score: 1

      Spanning tree is good for small networks. However, the larger the network -- in both number of nodes, and physical area coverage -- the longer it takes to map the topology and that map becomes extremely complex. Once the network is large enough for loop detection to take longer than the hold down time, it's all over. Add in the volume of normal traffic and the network may never be able to reconverge. They'd have to power the switched network up one node at a time from the STP root outward.

      Personally, I tend to disable spanning tree. I don't like seeing the broadcast spew it generates. My network(s) aren't large enough to matter. And yes, I've introduced loops by accident before -- I had two WAPs in bridge mode on the same channel too close to each other. The switch (cat 2948g) gets really annoyed when I do that. (it's actually kinda funny)

    23. Re:Spanning tree by Cramer · · Score: 2, Interesting

      That's handled by "partitioning" on the same switch. Most switches are smart enough to tell they've been plugged into themselves. And even if they don't, broadcast suppression will catch such setups really well -- all it takes is one broadcast packet to flood both ports. STP prevents loops between switches. In this case, that'd be plugging ports from multiple switches into the same hub.

      There's an even easier way to fix the problem in your example... don't give the idiots access to multiple ports in the same network. :-)

      And I would submit it's not very wise to create a city sized switched ethernet network.

    24. Re:Spanning tree by Chanc_Gorkon · · Score: 2

      Yeah. It's called a router. Yeah you don't need mass storage for it, but what else are you going to store your code on and have it be reliable if you use a pc? Flash memory? Running on what kind of BUS? How many PC's have integrated bootable flashram?

      Like I said, you don't do this kind of stuff especially in a hospital. You should not do it in business either. Yes, to me, this qualifies as a hack. First off performance would be dog slow. It's just too much stuff to put on the one PCI bus most PC's have especially if you run all of the slots full. Second, cost and setup time would just make it cheaper (and safer) to go with a real router.

      --

      Gorkman

    25. Re:Spanning tree by skinfitz · · Score: 2

      It's been interesting reading the replies here to my "drastic" suggestion of disabling spanning tree. Allow me to elaborate...

      We've had some very odd issues in the past with spanning tree, and it's for this reason we normally disable it. I do run it on some segments, but there are other segments that literally cannot have it enabled, otherwise things stop working. For example, Apple Macs really don't like spanning tree. (Plugging a Mac server into a spanning tree enabled switch can break it).

      On the rare occasion that we have had a loop, we only lose one segment. As when this happens it's noticed, and it could only have happened from one of several locations, we can easily track down the problem.

      VLANs have proven to be quite good at isolating segments from problems on other segments.

      Still think I'm crazy? ;)

  16. Hospital Systems by charnov · · Score: 4, Informative

    I also used to work at a teaching hospital (Wishard for Indiana University) and I learned more there about networking and systems support than in years of college. I remember one day we found a still-used piece of thick-net (you know...old firehose). It was connecting the ambulance office's systems to the rest of the hospital. The rest of the hospital ran on DEC VAX clusters and terminals. To be fair, they have gotten much better (I don't work there anymore either), but this wasn't the first hospital network I had seen that truly terrified me, and it hasn't been the last.

    --
    [RIAA] says its concern is artists. That's true, in just the sense that a cattle rancher is concerned about its cattle.
    1. Re:Hospital Systems by charnov · · Score: 1

      Oh yeah...Hey Joe...good luck out there...heh.

      --
      [RIAA] says its concern is artists. That's true, in just the sense that a cattle rancher is concerned about its cattle.
    2. Re:Hospital Systems by gorf · · Score: 5, Insightful

      To be fair, they have gotten much better...

      You seem to have forgotten to explain why they were worse.

      If they are running thick ethernet and VAX machines, it is probably because nobody has looked at the system recently, presumably because it hasn't failed. This is how things should be.

      ...truly terrified me...

      What terrifies me is that places like hospitals (where things really need to keep working) run systems which have only been around for a few years, and in that time proved themselves to be extremely unreliable, in general.

      New features should not be added at the cost of stability, and this is what people seem to be doing all the time. People are perfectly capable of carrying on using paper, and should be trained and have a procedure to do so at a moment's notice. If the job is so complex that paper is simply not an option (this seems unlikely; even air traffic controllers can manage without computers), then computers should have a ridiculous amount of redundancy built in to them, something I've only heard of NASA even approaching.

    3. Re:Hospital Systems by _14k4 · · Score: 1

      Mod the parent up, it makes more sense than the rest of most of these posts. Hospital systems especially. Especially when they are non-profit, etc, they have no excuse to not have the network needed to keep things up.. Hell, I've taken our sybase machines out with a really really bad accidental query.. but, I was using a test system, that was basically an ad hoc machine. It's what it's for.

    4. Re:Hospital Systems by lucifuge31337 · · Score: 1

      I also used to work at a teaching hospital (Wishard for Indiana University) and I learned more there about networking and systems support than in years of college.
      Welcome to the real world. It's not all clean and documented like the theoretical drivel they feed you in college. That's why people like me no longer hire people right out of school. After a while, you get tired of training people whose only useful knowledge is vocabulary and (if you're lucky) the ability to find and read useful documentation.

      Hopefully colleges will catch up to the reality of IT some day, but I seriously doubt it.

      --
      Do not fold, spindle or mutilate.
    5. Re:Hospital Systems by laughing_badger · · Score: 2, Funny
      computers should have a ridiculous amount of redundancy built in to them, something I've only heard of NASA even approaching

      Yeah, that ability to compute using both metric and imperial units in parallel really comes in useful ;-)

      --
      Help children born unable to swallow - www.tofs.org.uk
    6. Re:Hospital Systems by passthecrackpipe · · Score: 2
      "To be fair, they have gotten much better (I don't work there anymore either)"

      Is this a rather unfortunate juxtaposition of words, or an intentional statement of cause and effect?

      --
      People who think they know everything are a great annoyance to those of us who do.
    7. Re:Hospital Systems by gorf · · Score: 2, Interesting

      That wasn't a manned flight :-)

      I've heard stories about NASA having completely different teams of programmers in different cities being given the same specs. Of multiple computers running different programs independently controlling separate hydraulics, to the point where if one decides to move something one way, the others can physically force it correct. Now that's redundancy.

      I'll bet that people designing new computerized air traffic control systems have never even heard of a real-time system, never mind know what one is.

    8. Re:Hospital Systems by Anonymous Coward · · Score: 0

      What terrifies me is that places like hospitals (where things really need to keep working) run systems which have only been around for a few years..

      Ahh, so they should be running maybe...Debian! Let's not use these new features, 2.2-kernel is the way to go!

    9. Re:Hospital Systems by Anonymous Coward · · Score: 0

      ...well, NASA, military systems, some automotive systems, and many flight systems all have redundancy built in.

      Anything that could cause loss of life should be redundant as hell. Usually that is the case for many things that are government regulated. Commercial entities (ie. most hospitals) are more likely to not have backup systems.

    10. Re:Hospital Systems by pyrrho · · Score: 1

      ...but... but.... didn't you hear... newer is better.

      ====

      Just kidding, I agree with you 100%, but then I'm not the newest thing myself any more.

      --

      -pyrrho

    11. Re:Hospital Systems by Hast · · Score: 2

      The point of going to university/college isn't to learn the details of how to maintain a specific network. The point is to learn the basics and learn how to learn new material and adapt quickly.

      There will never be a college which teaches you exactly how to do your work at a specific workplace (at least not one worth going to); that's called job experience.

      Sometimes you might need to get someone with a lot of experience. One potential benefit of getting newly graduated people is that they are already accustomed to learning. So training one of them to suit your needs might prove a lot cheaper than trying to convert someone who already knows how to do things "best".

    12. Re:Hospital Systems by lucifuge31337 · · Score: 1

      Spoken like a true Computer Science graduate. Do they teach you that speech as well?

      College is to learn marketable skills. If you aren't going for that reason, don't waste your (daddy's) money. Since my post, some people have privately emailed me to tell me about their college's programs which are VERY HEAVY on internships. There's the only reasonable solution that I can see.

      When I say people coming out of school for CS know nothing, I mean NOTHING. If you went to school to be a biologist and got to your first job not able to identify a microscope or what its use is....but you knew a whole lot about where to find that information and have really good reading comprehension.....that's about how useful these CS grads are to me. I've hired carpenters that were more useful in data center buildouts than recent CS grads.

      --
      Do not fold, spindle or mutilate.
    13. Re:Hospital Systems by Hast · · Score: 2

      Oh my, we are cranky aren't we?

      Perhaps you should just try to find your applicants from other universities. I know that I have had to reverse engineer production code (from companies around where I study) and, e.g., implement TCP/IP and webservers on custom hardware in C/C++ as part of one course. Many other courses I've taken also required similar skills, i.e. for me to take an existing system and extend it in different ways to do new things. And in a variety of languages.

      We are also required to have spent a couple of weeks out "in the real world", 12 weeks as of now. And that sure taught me a lot of things. Mainly company politics and how many ways you can spend your day trying to start solving your problem. (For all the normal Dilbert-esque reasons.)

      And I don't really get what you mean about the "if a biologist didn't know how to identify a microscope..." analogy. Are you insinuating that your new recruits didn't know what a compiler and similar was? Then, as I stated above, recruit from a different place. Or get rid of the HR person who interviewed them and get someone competent on that job.

    14. Re:Hospital Systems by lucifuge31337 · · Score: 1

      I'm a network engineer, so I'm not commenting on coders. Maybe that's where your confusion lies.

      My analogy can quite literally be extended to the new guy who didn't know what a Cat 5500 series switch looked like and couldn't find it in a rack. Or how about the one who thought he could re-wire the patch panel in the data center at lunch time?

      And I don't recruit. Such is the state of resume submissions that I get to fill low level positions from people just out of school. Fortunately the market is soft and people with actual experience are looking for jobs. And, yes, I _AM_ cranky.

      --
      Do not fold, spindle or mutilate.
    15. Re:Hospital Systems by Hast · · Score: 1
      I guess I took your comment that

      It's not all clean and documented like the theoretical drivel they feed you in college. [...] Hopefully colleges will catch up to the reality of IT some day, but I seriously doubt it.

      As a more general statement than how it was meant. If you're talking about teaching network technicians, then you may very well be right. I have no idea about how those types of schools work. There's a huge difference between a technical school and a science/engineering school: technical schools teach you how to go out and do things with current technology, while science/engineering schools generally teach you how to create new technology.

      If you hire someone with a CS/CE background as a network tech then you are getting the wrong man for the job. CS/CE are typically over-educated for that (ie they have too abstract knowledge).
    16. Re:Hospital Systems by lucifuge31337 · · Score: 2, Interesting

      They aren't over-educated for a damn thing. They are under-educated for everything. Don't give out credit where it's not deserved.

      CS programs are supposed to teach both the theory AND the operations of current technology. This should allow CS grads to quickly learn new technology incrementally. That's the point of these programs.

      People coming out of tech schools are fine, but they often have no idea how things REALLY work (just "if "a" happens then I'm supposed to do "b" type of knowledge).

      OK...I'm pretty bored with the thread now.

      --
      Do not fold, spindle or mutilate.
  17. A second (unreliable) network? by shrinkwrap · · Score: 4, Insightful

    Or as was said in the movie "Contact" -

    "Why buy one when you can buy two at twice the price?"

  18. Disaster recovery by laughing_badger · · Score: 4, Interesting
    do you think the answer to having a massive and unreliable network is to build a second identical network?

    No. They did everything right. Falling back to paper and runners is the best they could do to safeguard patients' lives. An 'identical' network would be susceptible to the same failure modes as the primary.

    That said, hopefully it wasn't really six years since they had run a disaster exercise where they pretended that the computers were unavailable...

    --
    Help children born unable to swallow - www.tofs.org.uk
    1. Re:Disaster recovery by Anonymous Coward · · Score: 0

      There are excellent Wireless options from companies like Wi-LAN that give you redundancy, disaster recovery, and security (not all wireless devices can be hacked).
      Paper and Runners need to be ready and periodic 'fire drills' are essential; however, at relatively low cost excellent disaster recovery can be provided via wireless WANs.

  19. Re:The Israeli Way by Anonymous Coward · · Score: 0

    This is an American hospital in Boston. Geez, if you are going to bash Israel, at least do it with something credible...

  20. Um.. by acehole · · Score: 4, Insightful

    In six years they never thought to have a backup/redundant system in place in case of a failure like this?

    Even the best networks will come unglued sooner or later. It's surprising to see that most businesses' networks need prime operating conditions to function properly.

    --
    Be you Admins? nay, we are but lusers!
    1. Re:Um.. by Anonymous Coward · · Score: 0

      If I cut off your legs, will you be able to walk? Shame on you for not planning this 6 years ago.

    2. Re:Um.. by nolife · · Score: 2

      That is an issue with a lot of aspects of IT and in the real world. It is hard to justify the cost of a backup, redundancy, plan "B", virus software, firewall, faster network, more printers, wireless security, network intrusion detection, blah blah until you are burned by one or more of them.

      Normally a consultant will try to justify your need for these things to you but of course they are always selling the $perfect_product for that job also so naturally you take the suggestions with a grain of salt.

      The US may have needed a Department of Homeland Security years ago but no one wanted to jump on it until the WTC's.

      --
      Bad boys rape our young girls but Violet gives willingly.
    3. Re:Um.. by AKnightCowboy · · Score: 1
      In six years they never thought to have a backup/redundant system in place in case of a failure like this?

      What business keeps an entirely redundant network in place just IN CASE something like this happens? Networks take millions of dollars to build and you want them to spend twice as much on a network that'll just remain unused 99.999% of the time? Try justifying that to the financial people. Almost all problems are with individual components. It's fine to keep spares on hand for things like that, but keeping an entire redundant network is ludicrous.

    4. Re:Um.. by Anonymous Coward · · Score: 5, Interesting

      They're called "accountants". My father is a netadmin by trade, and the thing that stresses him most about his job is how, quote, "fucking bean counters" make the purchasing decisions for him.

      Example: They want to replace Netware fileservers (they've something around four years uptime, and that's including them having their RAIDs expanded. All that's going to stop them is a man with a sledgehammer) with Windows ones. While Windows servers, if configured correctly, are really stable, they are not stable enough for truly mission-critical jobs (in this case, dealing with insurance companies and medical evacuation. Time is not just money, it's life) yet the idiots in charge have been suckered by Microsoft's marketing.

      In this case, staying with netware has saved lives.

      Accountants have too much control. They do not understand that if something is vital, you do NOT give it anything less than the very best money can buy. So it'll cut into your profit margins. So what? At least you will still have the margins.

    5. Re:Um.. by GarryOwen · · Score: 1

      Netware servers aren't that great. I worked for a company that had its credit card processing databases on mirrored Novell 4.* something servers. They never went down except once, and when they did it was catastrophic. One dropped, the mirror one took over, and then after our netware guy brought the first one back up, the mirror died (same issue on both, was an overload on the I/O we later found out). Well since the first one didn't have enough time to resync (this all happened in about 30 minutes) we had a corrupt database and had to pull from backups. Lost about 3 hours of transactions (which was a lot of orders) so stuff was already getting shipped but no CC was ever charged.

    6. Re:Um.. by RustyTaco · · Score: 1
      The US may have needed a Department of Homeland Security years ago but no one wanted to jump on it until the WTC's.
      Just like the former USSR needs the KGB back.

      - RustyTaco
  21. 2nd network by Rubbersoul · · Score: 4, Insightful

    Yes I think having a 2nd network for a vital system is a good idea. This sort of thing is used all the time for things like fiber rings where you have the work and protect path. If the primary work path goes down (cut, maintenance, whatever) then you roll to the protect. Yes it is a bit more expensive but in a case like this maybe it is needed.

    --
    man .sig
    No manual entry for .sig.
  22. the sad part by tps12 · · Score: 1, Offtopic

    This event has a lesson for us. Of course, I expect the Slashdot response to be something along the lines of "they should have used Linux," but the true fact is that all technology, even Linux, is unreliable. Rather than dicking around with which OS can provide the best network, we should accept that none of them provide the robustness necessary for things like hospitals and fire departments, and what we really need is to reduce our dependency on technology altogether. If the hospital had been paper-based, this tragedy would not have occurred.

    --

    Karma: Good (despite my invention of the Karma: sig)
    1. Re:the sad part by krinsh · · Score: 3, Insightful

      While paper-based may seem like the best solution to you; what you don't realize is that paper-based is just a single phrase for the rest of these 'bases':

      sneaker-based when everyone must run throughout passing paper;

      warehouse-based when rows upon rows of storage are now required to keep all these bits of paper;

      administrative overhead based when you realize that it takes two minimum-wage file clerks for every one form per desk - not functional area - to file and find and that takes a LOT of time;

      and Mexican-based (yes, I said Mexican - who do you think most major businesses pay to do this? I know for a fact they ship things like this there by the truckload.) when you need cheap data entry and "error checking" [which is very unreliable when they can't read your language!] to enter information that could not be read from handwriting and then index them with a reasonable filing code.

      Having spent a considerable amount of time as an admin assistant myself; and later as a document imaging and workflow support person, I can tell you that the cost and manpower savings far outweigh any perception or consideration for robustness or reliability.

      The PHBs - or very likely the 'managed care' people (and that should have been put in quotes too) that provide a lot of the funding for the hospitals likely decided to save a few thousand since it wasn't lifesaving equipment or blood products/pharmaceuticals/etc.

      --
      I think with the interesting people, their lives can't possibly be wrapped up into a nice little package.
    2. Re:the sad part by PingvinRich · · Score: 1
      If the hospital had been paper-based, this tragedy would not have occurred.

      But what if it rains?
    3. Re:the sad part by Anonymous Coward · · Score: 0

      Yes, sure. Use paper-based patient journals.

      Takes some kind of imagination to figure out how long it takes to retrieve medical records in emergencies and if a patient is admitted at another hospital.

      Not to mention if somebody drops the journal (in some cases they are thick as hell and stuffed with pages).

    4. Re:the sad part by ceejayoz · · Score: 2

      If the hospital had been paper-based, this tragedy would not have occurred.

      Tragedy? It sounds like they handled it quite well, and nobody died because of it.

      The advantage of a paperless hospital is that you don't have to wait an hour for the lab results or X-rays to get to you (or longer, if they get lost). That saves time, letting the hospital save more patients.

    5. Re:the sad part by Anonymous Coward · · Score: 0

      Yep, and they should crunch numbers with an abacus just to be safeguarded against math bugs in processors...

      Or PERHAPS a hospital can run an efficient network, but not BET on it always working and being prepared to fall back on paper running.
      You know, kinda like use the tools you have to do your job efficiently, but also ensure that you can do the job just as well--except slower-- without them. Is this so alien to you?

    6. Re:the sad part by alizard · · Score: 2
      So who gets priced out of medical care with your "solution"?

      Probable results:

      • a new army of minimum wage clerks
      • you might die between when that utterly necessary record with the info required to treat you is delayed, screwed up, or lost. Not sure how much of a loss that would be in your case, of course. People involved with technology who don't understand what it's really for should listen carefully to the call of Darwin.
      • increased costs and reduced efficiency. Remember why they went to computerized records to begin with? It wasn't because of a passionate love for unglamorous back-office technology.
  23. and now.. by Anonymous Coward · · Score: 0

    and now their server gets slashdotted, administrators run around trying to work out what to do - rebooting NT boxes. Well the article is on the boston globe so their server is okay.

  24. Politics by Anonymous Coward · · Score: 1, Insightful
    I work at a med school / hospital and in my experience some of the greatest issues are political ones. The school is not for profit, but the hospital is privately owned. The outcome? The school gets fleeced - imagine paying over $50 a month for a port charge! The hospital should have enough money from that to build an adequate network...but that assumes that the focus is in the correct place. All too often the focus is on politics (in a place full of PhDs and MDs, the whole driving force is political power and reputation) instead of technology. The network suffers while the Senior Officers buy new handmade mahogany desks, that sort of thing.


    Doesn't really matter. If you had to deal with Med Students as we do, you'd die before you went to the doctor. Trust me.

    1. Re:Politics by Anonymous Coward · · Score: 0

      It doesn't get any better when they become doctors. Die if you stay away, die if you go to them.

      In Australia they admit to killing 20000 patients a year that should have lived. That's the ones that they admit to!

  25. Comment removed by account_deleted · · Score: 1, Offtopic

    Comment removed based on user account deletion

  26. Reliability is inverse to the number of components by ChimChim · · Score: 1

    Ok, so here's an SAT question for ya:

    IF you have one train going from NY->LA that's likely to break down 10% of the time, and you get a second identical train going in the opposite direction, what's the probability that one of the trains will fail?

    (number of trains) * (probability of failure)
    = 2 * .10
    = 20%

    The more components in the system, the more likely it is that parts of the system will be down. This isn't to say that the extra redundancy isn't useful, but it doesn't give you more reliability...it decreases it. So additional management costs are incurred in making sure that enough redundancy is always available to compensate for parts of the system that are down, and replacing bad components.
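
    For anyone who wants to check the arithmetic, here is a small Python sketch. It assumes the trains fail independently of each other, which the question doesn't actually state; the function names are made up for illustration. Under that assumption the chance that at least one of n components fails is 1 - (1 - p)^n, which is close to n * p when p is small, while the chance that all of them fail at once is p^n.

```python
def p_at_least_one_fails(p, n):
    """Probability that at least one of n independent components fails."""
    return 1 - (1 - p) ** n


def p_all_fail(p, n):
    """Probability that all n independent components fail at the same time."""
    return p ** n


if __name__ == "__main__":
    p, n = 0.10, 2
    print(f"at least one fails: {p_at_least_one_fails(p, n):.2%}")
    print(f"both fail:          {p_all_fail(p, n):.2%}")
```

    Both numbers matter for the redundancy argument: something is broken more often as you add components, but everything being broken at once gets rarer, provided the failures really are independent.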

  27. Sure it was STP? by 53x19 · · Score: 1

    Spanning Tree is a pretty robust protocol. Problems usually arise when admins get impatient with convergence times and start messing with the timers.... or enabling features like portfast, backbonefast and the like.

    1. Re:Sure it was STP? by jefftp · · Score: 4, Informative

      The most common reason spanning tree problems occur is because no one tells the spanning tree domain who the root of the network is. This leads to the switches deciding among themselves who gets to be the root. In most implementations of spanning tree, with every switch left at the default priority, the lowest MAC address wins.

      Because Cisco switches come with Spanning-Tree enabled by default, and because most network "engineers" don't know what spanning tree is, many corporate networks have a random switch serving as the root of the spanning tree. And so when spanning tree tries to do its job: fail-over to a redundant link, it doesn't do a very good job because the humans who set up the network were either lazy or ignorant.

      Laziness and ignorance are the villains of most network problems.

      Now if Cisco implemented the follow-up to spanning tree: rapid spanning tree protocol (802.1w) like the rest of the industry, you'd eliminate the problem of impatient network admins trying to "tune" their network convergence times. Sadly, at most, you're going to shave 8 seconds off the 30 to 50 seconds of convergence time of STP unless you have a very small network. So tuning STP timers is an exercise in navel-meditation. RSTP (802.1w) solves a lot of the convergence time problems with original STP (802.1d) and is nicely backwards compatible.
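
      A rough Python sketch of the root election described above, for illustration only. It models the comparison as (priority, MAC address) with the lowest value winning and uses 32768 as the default priority; the switch names, MAC addresses, and helper names are hypothetical, and real bridges exchange BPDUs rather than sorting a list.

```python
DEFAULT_PRIORITY = 32768  # 802.1D default bridge priority


def mac_to_int(mac):
    """Convert a MAC address string to an integer for numeric comparison."""
    return int(mac.replace(":", "").replace("-", "").replace(".", ""), 16)


def elect_root(bridges):
    """Return the name of the bridge with the lowest (priority, MAC) pair.

    bridges: list of (name, priority, mac) tuples.
    """
    return min(bridges, key=lambda b: (b[1], mac_to_int(b[2])))[0]


# Hypothetical network where nobody configured anything: the oldest closet
# switch happens to have the lowest MAC address, so it becomes the root.
untouched = [
    ("core-1", DEFAULT_PRIORITY, "00:50:aa:00:00:01"),
    ("closet-old", DEFAULT_PRIORITY, "00:10:aa:00:00:07"),
    ("closet-new", DEFAULT_PRIORITY, "00:90:aa:00:00:03"),
]

# Same network after someone lowers the priority value on the core switch.
configured = [("core-1", 4096, "00:50:aa:00:00:01")] + untouched[1:]
```

      Here `elect_root(untouched)` picks `closet-old`, which is the "random switch as root" situation, while `elect_root(configured)` picks `core-1`.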

  28. Re:That's why I hate automatic routing by parc · · Score: 3, Insightful

    And your change in routing policy is going to affect spanning tree how?

    How do you handle mobile users? What about dialup static IP addresses from multiple RAS devices?
    Hand-editing of routing tables works only in the most simple of networks.

  29. New Technology by Anonymous Coward · · Score: 0

    This is what they were testing.

    1. Re:New Technology by Anonymous Coward · · Score: 0

      "Microsoft Defib 98"... obviously it wasn't hooked up to the Internet so the victim's Passport could be validated.

  30. Short answer? No. by krinsh · · Score: 2

    Should there be a few replacement devices on hand for failures? Yes. Should there be backups of the IOS and configurations for all of the routers? Yes. Should this stuff be anal-retentively documented in triplicate by someone who knows how to write documentation that is detailed yet at the same time easy to understand? Yet another yes.

    If it is so critical, it should be done right in the first place. If a physically damaged or otherwise down link is ESSENTIAL to the operation or is responsible for HUMAN LIFE, then there should be duplicate circuits in place throughout the campus to be used in the event of an emergency; just like certain organizations have special failover or dedicated circuits to other locations for emergencies.

    Last but absolutely certainly not least; the 'researcher', regardless of their position at the school, should be taken severely to task for this. You don't experiment on production equipment at all. If you need switching fabric; you get it physically separated from the rest of the network or if you really need outside access you drop controls in place like a firewall, etc. to severely restrict your influence on other fabric areas.

    --
    I think with the interesting people, their lives can't possibly be wrapped up into a nice little package.
    1. Re:Short answer? No. by 42forty-two42 · · Score: 2, Insightful

      The researcher was just entering data in. Not experimenting with the network. Where do you expect him to store his experimental results? On a ZIP disk?

    2. Re:Short answer? No. by krinsh · · Score: 1

      eh? maybe I misread the article then. ouch. even so; the impression I got was that they were using something that wasn't normally used on the network. I guess I am as conservative as I claim I am not.

      --
      I think with the interesting people, their lives can't possibly be wrapped up into a nice little package.
  31. What is spanning tree protocol? (google whoring) by Anonymous Coward · · Score: 5, Informative

    Spanning-Tree Protocol is a link management protocol that provides path redundancy while preventing undesirable loops in the network. For an Ethernet network to function properly, only one active path can exist between two stations.

    Multiple active paths between stations cause loops in the network. If a loop exists in the network topology, the potential exists for duplication of messages. When loops occur, some switches see stations appear on both sides of the switch. This condition confuses the forwarding algorithm and allows duplicate frames to be forwarded.

    To provide path redundancy, Spanning-Tree Protocol defines a tree that spans all switches in an extended network. Spanning-Tree Protocol forces certain redundant data paths into a standby (blocked) state. If one network segment in the Spanning-Tree Protocol becomes unreachable, or if Spanning-Tree Protocol costs change, the spanning-tree algorithm reconfigures the spanning-tree topology and reestablishes the link by activating the standby path.

    Spanning-Tree Protocol operation is transparent to end stations, which are unaware whether they are connected to a single LAN segment or a switched LAN of multiple segments.
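
    To make the "standby (blocked) state" idea concrete, here is a toy sketch in Python. It is not the real protocol: there are no BPDUs, port costs, or timers. It assumes the root is simply the switch with the lowest name and builds the tree breadth-first. The switch names and the `spanning_tree` function are hypothetical.

```python
from collections import deque


def spanning_tree(links):
    """Return (forwarding, blocked) link sets for an undirected switch graph."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    root = min(neighbors)  # stand-in for "lowest bridge ID wins the election"
    visited = {root}
    forwarding = set()
    queue = deque([root])
    while queue:
        switch = queue.popleft()
        for peer in sorted(neighbors[switch]):
            if peer not in visited:
                visited.add(peer)
                forwarding.add(frozenset((switch, peer)))
                queue.append(peer)

    # Every link that is not part of the tree is redundant and gets blocked.
    blocked = {frozenset(link) for link in links} - forwarding
    return forwarding, blocked


# Hypothetical triangle of switches: one of the three links must be blocked.
links = [("sw1", "sw2"), ("sw2", "sw3"), ("sw1", "sw3")]
forwarding, blocked = spanning_tree(links)
print(sorted(tuple(sorted(link)) for link in blocked))  # [('sw2', 'sw3')]
```

    If the sw1-sw2 link were removed from the list, the same function would put sw2-sw3 into the forwarding set, which is the "activating the standby path" behaviour described above.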

    see this page for more info

  32. Re:The Israeli Way by Anonymous Coward · · Score: 0

    How complete of a moron are you?

    No wait... you've already answered that in your post:

    A total moron!

    This hospital is not in Israel, it's in Boston, Massachusetts. Try reading the article before wasting everyone's time with your idiocy.

  33. Why fly equipment from california?? by Viol8 · · Score: 1

    Did a company as large as Cisco seriously have no appropriate troubleshooting equipment on the WHOLE of the east coast or anywhere closer than California? What kind of Mickey Mouse support outfit are they running??

    1. Re:Why fly equipment from california?? by marklyon · · Score: 2, Interesting

      They have a huge hot lab in California where they have pre-configured switches, routers, etc. running and ready to go at a moment's notice. When my ISP went down, they sent (same day) three new racks of modems configured with our last known "good" configuration so all we had to do was unplug, pull, connect.

      It would be redundant to have one on each coast, because they were able to get our stuff to us the same day in rural Mississippi.

      --
      -- Mark Lyon http://www.marklyon.org
    2. Re:Why fly equipment from california?? by GLX · · Score: 2

      Because Cisco is very California-centric, and the fact is that when it comes to their switching and routing gear, there is very little "hardware" that you can bring in to troubleshoot that's anything more than commodity software loaded onto a commodity PC.

      The best thing they had was the input of (hopefully) knowledgeable Cisco engineers. God knows if they relied on Cisco TAC Level 1 support they'd still be down today.

      --
      Sig (appended to the end of comments you post, 120 chars)
    3. Re:Why fly equipment from california?? by Anonymous Coward · · Score: 0

      It all depends on support contracts. If you
      actually pay for equipment to be held for 4
      hour replacement, it'll be there. In this case
      the hospital had NO support contract, Cisco went
      in and redesigned the network and stayed there
      for days to make it happen.

      Network design isn't as simple as just throwing something at a problem. Having handled networking for company acquisitions, I know first hand how many variables each company has and how little the support teams are aware of the legacy equipment. You need to analyze the true requirements and design around that.

    4. Re:Why fly equipment from california?? by Anonymous Coward · · Score: 0

      For the record, the back-up equipment and hardware resources came from a local warehouse within two hours. Advanced engineering services came from Chelmsford, just outside of Boston. And it was the TAC, whose numerous engineers have forgotten more about STP than most people will ever know, that recovered this network. It's an amazing thing to watch a group of engineers gain a complete understanding of a network with 150 switches, with more redundant connections than there are speculations on this issue, then reconfigure switches and insert routers to decrease the spanning tree diameter, not to mention understand all the other configuration changes made as part of the recovery effort prior to their involvement. If you work on a team for 10 years with 100 other engineers who recover the most complex networks in the world on a daily basis, and if your company can have any product it has manufactured in the last five years at the doorstep of any of its millions of customers within four hours, and your company is willing to invest 500+ man-hours at no cost to its customer to recover their life-critical network, at that point you can comment on the "support outfit" of Cisco Systems.

  34. Of course they need another network by virtual_mps · · Score: 5, Insightful

    Why on earth would a researcher be plugged into the same network as time-sensitive patient information? Yes it's expensive, but critical functions should be separated from non-critical functions. And the critical network needs to be fairly rigidly controlled (i.e., no researchers should "accidentally" plug into it.) Note further information in http://www.nwfusion.com/news/2002/1125bethisrael.html

    1. Re:Of course they need another network by Anonymous Coward · · Score: 0

      Yes it's expensive, but critical functions should be separated from non-critical functions.

      Tell that to the senior citizens who are screaming about health-care costs...

      Until it breaks, there is never money for stuff like this. Until then, IT projects in many life-critical industries are no different than other industries - get it out the door quick. Sure, sometimes there are pounds of documentation to satisfy govt-reg box-checkers, but rarely is money really invested in getting a quality system.

      NASA can afford to spend tons on redundant systems - that is just tax dollars. Companies have to make a product that can sell in a competitive market...

    2. Re:Of course they need another network by ces · · Score: 2

      Not always true, most tech firms tend to have reasonably well designed networks, as do most companies that do a lot of OLTP such as airlines, banks, and brokerages.

      Large universities seem to have well designed redundant networks as well despite the difficulty of securing funds in that environment.

      --
      Happy Fun Ball is for external use only.
  35. Well the thing is... by Anonymous Coward · · Score: 1, Interesting

    Having worked on several database systems, improper planning and maintenance are the main causes of large, unwieldy and ultimately unstable systems. In large organizations where IT is not a major business area, i.e. a Hospital system, their existing database system has probably been augmented several times to increase functionality (and capacity) - probably by different parties as well. This multiple patching approach results in instability as the database has grown far beyond its original intended purpose. However, due to the vast stores of data, and the repeated tinkering with it by various parties, migration is a nightmare.

    Rebuilding the system from the ground up poses several major hurdles. The first is the systematic migration of data while the original database is still running! For hospitals, this database is clearly mission critical!

    The other problem is mimicking the interface and relationships within the database, so as to reduce retraining. Retraining is a major problem when switching systems. All in all, it is a major undertaking to redo the database, and probably not viable, in either time or money, for the hospital.

    Sadly, I have to contend that duplication of their system is the best short to medium term solution.

  36. Re:The Israeli Way by Anonymous Coward · · Score: 0

    oh, put a sock in it already.. the juvenile racism and hatred gets old after a while.

  37. Isolate faults by Anonymous Coward · · Score: 0

    The network can be designed (hierarchical) such that a network fault will isolate only a part of network that can be locally fixed and does not affect the entire network. The important network servers should be redundant and can be made fault tolerant by automatic switchovers during server faults. The main switches and routers can use loopback addressing to other network cards in case a network card on the switch or router goes down.

  38. How many domain controllers? by Hairy_Potter · · Score: 2

    If you're just using a Primary Domain Controller, that could be your problem. I'd recommend adding a backup PDC, as well as a Tertiary Domain Controller, and add an X.25 backup network layer to give you hot-swappability and real-time rollover capabilities.

  39. Comment removed by account_deleted · · Score: 2

    Comment removed based on user account deletion

  40. All Layer 2? by CatHerder · · Score: 5, Informative

    If Spanning Tree is what brought them down, and it had campus wide effect, then they're running their production networks as one big flat layer 2 network. This is almost definitely the root of the problem. Modern network design would divide the campus (and often individual buildings) into multiple subnets, using routing to get between nets. That way if something like STP goes wrong in one spot, it doesn't affect the others.

    Building a parallel identical net is almost definitely the wrong answer. Especially if it uses the same design and equipment!

    Unfortunately, often older networks grow in a piecemeal way and end up like this, commonly having application level stuff that requires it to be flat. The job of a good network engineer (and diplomat) is to slowly have all the apps converted to being routable and then subnet the net.
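
    A quick way to see what "subnet the net" means in practice, sketched with Python's standard `ipaddress` module. The 10.20.0.0/16 range and the /24 size are made-up values for illustration (the articles don't give the hospital's addressing), and `same_broadcast_domain` is a hypothetical helper.

```python
import ipaddress

# Hypothetical flat campus range; the real addressing is not in the article.
campus = ipaddress.ip_network("10.20.0.0/16")

# Carve it into /24 subnets, e.g. one routed segment per floor or closet.
subnets = list(campus.subnets(new_prefix=24))
print(len(subnets), subnets[0], subnets[-1])


def same_broadcast_domain(a, b, prefix=24):
    """True if two hosts would share a layer 2 segment under this plan."""
    net_a = ipaddress.ip_interface(f"{a}/{prefix}").network
    net_b = ipaddress.ip_interface(f"{b}/{prefix}").network
    return net_a == net_b


print(same_broadcast_domain("10.20.1.5", "10.20.1.200"))
print(same_broadcast_domain("10.20.1.5", "10.20.2.5"))
```

    Hosts in different subnets only reach each other through a router, so a layer 2 problem such as a spanning tree loop in one segment stays in that segment.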

    1. Re:All Layer 2? by AKnightCowboy · · Score: 1
      If Spanning Tree is what brought them down, and it had campus wide effect, then they're running their production networks as one big flat layer 2 network.

      Why do you assume that? Per-VLAN spanning tree has been available for quite some time and works fine.

      This is almost definitely the root of the problem. Modern network design would divide the campus (and often individual buildings) into multiple subnets, using routing to get between nets.

      Yep, and you still use spanning tree. Think multilayer switched networks and vlans not the antiquated central routers and a bunch of switches hanging off of it tying you to whatever subnet your switch happens to be uplinked to. These days you can move across the campus, plug into a completely separate switch, have your port set automatically via policy manager, and keep working using the same IP address and subnet. You still use spanning tree for loop avoidance and redundancy with these multilayer switches though.

    2. Re:All Layer 2? by CompPsi · · Score: 1

      Herding Cats! That's a Bennis line. Your answer seems the best so far. I'll know for sure when I get certified...

    3. Re:All Layer 2? by mikestro · · Score: 1

      Yep, and you still use spanning tree. Think multilayer switched networks and vlans not the antiquated central routers and a bunch of switches hanging off of it tying you to whatever subnet your switch happens to be uplinked to. These days you can move across the campus, plug into a completely separate switch, have your port set automatically via policy manager, and keep working using the same IP address and subnet. You still use spanning tree for loop avoidance and redundancy with these multilayer switches though.

      You can, but you are just adding unneeded complexity to the network by spanning segments across the WAN. I also understand that they may be one large campus, but you have two choices - either a network that the doctors design (by their requests/demands), or you have a network based on some simple common sense, based on real requirements (believe me, I know what kind of things doctors have to "HAVE". And if you balk, all they do is say that you are risking patient care, etc. Kind of like playing the race card, IMHO, but...).

      I just don't think it's good practice to have the same network segment span physical locations (WAN) just because some doctor/user may want to keep his same IP address.

      As for prevention, some strategic planning on their part also may have been able to prevent the complete system-wide meltdown that they had, i.e. the question that should have been asked is "is there any way that any single event can take down the entire network?"

      Of course, hindsight is always 20-20 :)

    4. Re:All Layer 2? by swb · · Score: 2

      We do some of this, although the logic and rationale most of the time for being able to do "any vlan to any port" has proven in my small environment (500 users, 6 floors, 6 VLANs) to be of somewhat limited value.

      I've trunked the DMZ to a port in our studio, kept closet switches on the core VLAN, and put a port in my office on the core network, but beyond that devices generally "belong" to the network they're on, and being able to dynamically move a given machine between ports and have it auto-home to the subnet it "belongs" to sounds like a lot of work and investment in time/software/record-keeping.

      We had a huge flat, shared-media (not switched) network when I started, now it's 100% 100MB switched with a Layer 3 switch at the core. I still get the willies when I think of the legwork alone required for fault isolation.

    5. Re:All Layer 2? by isdnip · · Score: 2

      BIDMC is a big place, too; two adjacent campuses (the old Beth Israel and Deaconess hospitals) and a lot of legacy stuff from pre-merger days. The articles are shy on details but from what I can tell, they had a mix of routable IP and non-routable protocols. The old ones (like LAT, or IPX if you don't route it) depend on bridging, and the routers try to be bridges too, and that's just not something they're good at.

      Indeed, bridging does not scale well. Campus-wide (both campuses, actually) support for any non-routing protocol is hazardous to a network's health. It's tempting to have a little bridged network and just add a little more, and a little more, but when it tips, it tips fast.

    6. Re:All Layer 2? by Anonymous Coward · · Score: 0

      "These days you can move across the campus, plug into a completely separate switch, have your port set automatically via policy manager, and keep working using the same IP address and subnet."

      I'm sure that this is neat and all. But why exactly do you need this network complexity? How often are devices moved from one campus to the other (Remember this is a big place).

      Why isn't DHCP, DynDNS, subnetting and routing sufficient?

      Remember this is a medical network, reliability should be the key requirement, not "new fangled" technology.

      I think that while building a redundant network is a good idea, a better idea is to also subnet the network to localize problems.

      I'm sure the designers had a good reason (eg: the network evolved, or legacy application support). But placing two campuses on the one logical LAN seems like a recipe for disaster to me.

  41. Rebooting is your friend by cperciva · · Score: 1

    As much as we all laugh at the Windows "close all your applications and reboot" way of "solving" problems, there is something to be said for rebooting systems: If all else fails, you can quickly restore the system to a known working state.

    Ideally, rebooting a system should be unnecessary. But practically speaking, people make dumb mistakes -- like the bug which caused the telephone crash of 1990 -- and Bad Things can happen. Rebooting a system should be a last resort; but it should be a last resort which always works.

    1. Re:Rebooting is your friend by ibennetch · · Score: 1

      Maybe you mean this outage:

      Misplaced break statement in AT&T's long distance software, causing 60,000 (that's sixty-thousand) people to lose long distance service for 9 hours?

    2. Re:Rebooting is your friend by Anonymous Coward · · Score: 0

      Your point is valid, but there's some irony in
      your example...

      The 1990 telephone crash was actually caused by
      phone switches that were effectively rebooting
      themselves in response to too many reboot messages
      coming at them at once.

    3. Re:Rebooting is your friend by Anonymous Coward · · Score: 0


      Maybe they should have used Java, then this wouldn't have happened!
      </sarcasm>

  42. :( Hey I submitted this a week ago :O by Flamesplash · · Score: 0, Troll

    So when I submitted this a week ago it gets rejected, but now that Mr. hey! submits it, it gets accepted. I see what's going on. Damn I need more punctuation in my handle.

    --
    "Not knowing when the dawn will come, I open every door." - Emily Dickinson
  43. Re:That's why I hate automatic routing by Swannie · · Score: 5, Interesting
    Routing has nothing to do with this, spanning tree is a layer two function, and is responsible for allowing multiple links and redundancy between switches in a network. A properly set-up network running properly set-up spanning tree works wonderfully. Unfortunately, many, many people play with things they don't understand (on a production network no less).


    This whole situation arises from poor training and poor design. Having several friends that work in hospitals, I know that they typically don't offer a lot of money for IT/Network jobs, and this is what happens when underpaid (read: inexperienced) people are allowed to run such a network.


    Done ranting now, can you tell I was laid off a while ago and now stuck in a contract with a network designed by a bunch of inexperienced people? :)


    Swannie

    --
    :q!
  44. OMG! by jmo_jon · · Score: 2, Funny

    The crisis began on a Wednesday afternoon, Nov. 13, and lasted nearly four days.

    Did that mean the doctors couldn't play Quake for four days!?

    1. Re:OMG! by Malicious · · Score: 1

      Doctors don't play Quake.
      Doctors play Links.

      --
      01101001001000000110000101101101001000000110001001 10000101110100011011010110000101101110
    2. Re:OMG! by yamcha666 · · Score: 1

      I guess so. Instead of fragging themselves in Quake, they ended up fragging a few of the patients to get their fix.

  45. Misleading question? by Anonymous Coward · · Score: 0

    I assume by "network" they just mean backbone. Obviously the backbone is what failed, otherwise it wouldn't have brought down the entire network. Obviously they need some redundancy there.

  46. Flat networks. by zerofoo · · Score: 2

    Do your VLANS share the same physical cable? If so, how are they connected? Do you use a one-armed router?

    -ted

    1. Re:Flat networks. by skinfitz · · Score: 2

      Do your VLANS share the same physical cable? If so, how are they connected? Do you use a one-armed router?

      The whole point of VLANS is so you can put multiple networks along the same cable. We distribute sets of VLANS to edge switches over fibre (and dark fibre to the remote sites at gigabit speed) where they are then separated out into 100Mbit ports on the switches.

    2. Re:Flat networks. by Derek+S · · Score: 1
      The whole point of VLANS is so you can put multiple networks along the same cable.

      That's only if you're using them with tag switching. VLANs are still useful even if they don't share cable.
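
      For anyone wondering what the tag actually is when VLANs do share a cable: with 802.1Q it's four extra bytes in the Ethernet frame. A small Python sketch of packing and reading one; the function names are made up, and this only handles the tag itself, not a whole frame.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q tag


def dot1q_tag(vlan_id, priority=0, dei=0):
    """Build the 4-byte 802.1Q tag inserted after the source MAC address."""
    if not 0 <= vlan_id <= 0xFFF:
        raise ValueError("VLAN ID must fit in 12 bits")
    # Tag control info: 3 bits priority, 1 bit drop eligible, 12 bits VLAN ID.
    tci = (priority << 13) | (dei << 12) | vlan_id
    return struct.pack("!HH", TPID_8021Q, tci)


def vlan_of(tag):
    """Extract the VLAN ID from a 4-byte tag."""
    tpid, tci = struct.unpack("!HH", tag)
    if tpid != TPID_8021Q:
        raise ValueError("not an 802.1Q tag")
    return tci & 0x0FFF


tag = dot1q_tag(42, priority=5)
print(tag.hex(), vlan_of(tag))  # 8100a02a 42
```

      The switch at the far end of the trunk reads the 12-bit VLAN ID and drops the frame onto the right VLAN, which is how several networks ride one fibre.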

  47. They did rather well really. by 91degrees · · Score: 1

    How many other organisations can run at all if their network dies? And if the execs really were running around as errand boys, that's just great. Nice to see the senior staff actually caring enough to help keep things going. Really they need a procedure to deal with the networks failing rather than a redundant network.

    1. Re:They did rather well really. by Detritus · · Score: 1

      I've noticed this problem with many local retail stores. If their server or network connection goes down, they have no backup manual system. It's rather frustrating to have the cash and product in your hand, only to be told that you can't buy the product because the system is down.

      --
      Mea navis aericumbens anguillis abundat
  48. Re:Reliability is inverse to the number of compone by Kajakske · · Score: 1

    Nice math, but the point here is that only 1 train has to arrive, thus in those 20% we can still safely travel.

  49. Complexity brings bugs by stevens · · Score: 5, Interesting

    The network at my company is quickly becoming so complex that neither I nor the admins can troubleshoot it.

    We have redundant everything -- firewalls, routers, load balancers, app servers, etc. The idea is to have half of everything offsite, so either the main site or the co-lo can go down, and we still rock.

    But with all the zones and NATs and rules and routing oddities, the network is less reliable than before. It takes days for them to fix routing problems or firewall problems. Every little problem means we need three people troubleshooting it instead of one admin.

    Developers suspect that there's a simpler way to do it all, but since we're not networking experts, it's just a suspicion.

    1. Re:Complexity brings bugs by Pig+Hogger · · Score: 1, Flamebait
      The network at my company is quickly becoming so complex that neither I nor the admins can troubleshoot it.
      ...
      Developers suspect that there's a simpler way to do it all, but since we're not networking experts, it's just a suspicion.
      If you're not a networking expert, then you definitely are a part of the problem. That's why you can't find solutions...
    2. Re:Complexity brings bugs by Mr+Guy · · Score: 2

      neither I nor the admins
      Developers suspect that there's a simpler way to do it all, but since we're not networking experts

      Sounds like he's a developer, not an IT guy. It's none of his business what the problem is, he's just screwed when it doesn't work.

    3. Re:Complexity brings bugs by Anonymous Coward · · Score: 0

      Asshole.

    4. Re:Complexity brings bugs by Laterite · · Score: 0
      What is with this obsession I observe in the IT industry where everyone *needs* to be an "expert" to troubleshoot or otherwise deliver input on an issue? I see this a lot both in the workplace and on Usenet forums, etc.

      Oh, so you're just an Oracle DBA? What makes you think you can comment on how Solaris handles processes?

      or

      A mere developer? How dare you think about networking? You couldn't possibly have anything useful to offer.

      *Rolls eyes* Let's get over ourselves here, people. A person's job title/function does not indicate all the knowledge they have.

      -Mark

  50. Fucking stupid by Anonymous Coward · · Score: 0

    "Staff had to scramble to find old paper forms that hadn't been used in six years so they could transfer vital patient records and prescriptions."

    By law they have to have a disaster recovery plan, all US hospitals HAVE to. So they "scrambled" to the disaster recovery plan, made copies of the forms, and were up. Big deal.

  51. Re:Reliability is inverse to the number of compone by Xugumad · · Score: 4, Insightful

    However, the probability of both failing at the same time is:

    0.1 * 0.1 = 1%

    So as long as it can run on just one out of two, you get a ten-fold increase in stability.

  52. Re:Reliability is inverse to the number of compone by pknoll · · Score: 2, Informative
    Sure, but that's not the point of redundancy. The question you want to ask is: How likely is it that both redundant components will fail at the same time?

    That's how mirrored RAID arrays work: you increase your chances of a disk failure by adding more disks to the system due to probability; but your chances of recovering the data in the event of a crash go up, since more than one disk failing at once is unlikely.
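
    Both halves of that trade-off are easy to check numerically. A small Python sketch, assuming independent failures and reusing the 0.1 per-component failure probability from the parent posts; the function names are just for illustration.

```python
def p_any_fails(p, n):
    """Probability that at least one of n independent components fails."""
    return 1 - (1 - p) ** n


def p_all_fail(p, n):
    """Probability that all n independent components fail together."""
    return p ** n


p = 0.1
for disks in (1, 2, 3):
    print(disks, round(p_any_fails(p, disks), 4), round(p_all_fail(p, disks), 4))
```

    With two components, the chance of seeing some failure rises from 0.1 to 0.19, but the chance of losing everything drops from 0.1 to 0.01. That only holds if the failures really are independent, which a shared design flaw would break.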

  53. Obviously not. by buss_error · · Score: 2
    do you think the answer to having a massive and unreliable network is to build a second identical network?

    Obviously, if something fails due to design, then duplicating the design duplicates the problem. While this can be a useful troubleshooting tool, it makes somewhat less sense for production environments.

    I would be willing to guess that the network was one giant collision domain, and that the trouble springs from that. But it is just a guess.

    --
    Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves.
  54. Re:Reliability is inverse to the number of compone by Anonymous Coward · · Score: 0

    The probability that one train will fail is still 0.1. It is irrelevant how many trains there are, the probability that any given one will fail will be 0.1 (of course assuming the trains fail independently). The probability that both trains will fail simultaneously is 0.1*0.1

  55. STP by netwiz · · Score: 2

    isn't that hard to troubleshoot. You look at the device ID that most recently made a Topology Change Notification, and then start looking at the hardware diagnostics for that system. If they're showing clean, reboot the switch. If, while the device is rebooting, the network stabilizes, you've found the problem. When the system finishes its boot, check the hardware diagnostics again (Ciscos only run H/W diags at POST, and a reset is the only way to re-run them); odds are that you'll see there's a failed component.

    A previous poster nailed it too, simply back out the changes you made (obviously the problem you were fixing is of a lower magnitude than a total outage), and things should start working again.
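
    The first step above (find whoever most recently signalled a topology change) is just a "pick the minimum" over whatever you've collected from the switches. A Python sketch with made-up data; the `TopologyChange` record, the switch names, and the numbers are all hypothetical and stand in for readings you'd gather by hand, not output from any real tool.

```python
from dataclasses import dataclass


@dataclass
class TopologyChange:
    bridge_id: str    # device that sent the notification
    seconds_ago: int  # age of the most recent change it reported


def first_suspect(changes):
    """Pick the device with the most recent topology change, if any."""
    if not changes:
        return None
    return min(changes, key=lambda c: c.seconds_ago).bridge_id


# Hypothetical readings collected by hand from each switch.
readings = [
    TopologyChange("closet-3a", 8400),
    TopologyChange("core-1", 12),
    TopologyChange("closet-7b", 610),
]
print(first_suspect(readings))  # core-1
```

    That gives you the box to run hardware diagnostics on first; it doesn't prove that box is the culprit.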

  56. It seems obvious by Woogiemonger · · Score: 1

    Even if a network is engineered perfectly, someone could maliciously or accidentally physically harm it and cause down time. Having a second, perhaps lower-end, backup network, when you have people's lives at stake (missing prescription information could quickly cause a fatality) ..it's a necessity, especially for a hospital with such a good reputation. Plus, the telecomm industry giants such as Cisco are just DYING for more business, so this could also help the economy :)

  57. Re:That's why I hate automatic routing by Anonymous Coward · · Score: 0

    Actually we have a horribly complex network (Australian national network, multiple extranets to governmental offices, national dialup service, plus DialConnect global roaming hook up). It's 95% static routed.

    Management is a key issue, with tools to aid deployment the next. Static in large networks is not impossible, sometimes you have to set limits and miss out on some "cool" features.

    Probably having old school network engineers is a big part of this setup. They don't like giving up control to automated systems.

  58. My best hospital glitch by eaddict · · Score: 5, Informative

    was a human error. We were a smallish hospital (270 beds). I was the new IS Manager. I was looking for power outlets in the computer room for all the new equipment I had ordered. Well, there were a lot of dead plugs. Also, I was told to stop since electricity based things like that were left up to the union guys. No big deal. I called them and asked them to locate and label the outlets under the raised floor. While I was sitting at my desk later that day the power went off for a sec then on.... I got up and looked toward the data center. The lights AND the equipment went off then on. I ran in to find the union guys flipping switches on the UPS (on/off). They had stuck a light bulb w/plug in each of the open outlets and were flicking the power on and off to see what bulb was affected. They were on the equipment side of the UPS! All of our servers, network gear, and such took hard downs that day! Ahhh!!! Who needs technology to make things not work! This was the same union that wrote me up for moving a cube wall to get at an outlet. Moving furniture was a union duty!

    --
    "If you are on fire you can just stop, drop, and roll. If you fall into Lava you are just dead." - my 5yr old daughter
    1. Re:My best hospital glitch by Anonymous Coward · · Score: 0

      To quote the Simpsons -

      "Ahh, Teamsters. So Lazy. So Surly."

    2. Re:My best hospital glitch by Anonymous Coward · · Score: 0

      Fuck unions. Those fucking lazy bastards should all be shot. Fucking leeches on society.

    3. Re:My best hospital glitch by msfodder · · Score: 1

      Arghhh.. I had a cable tech do the same thing with the power to the administrative IDF. He was in there splicing fiber and he decides to test the new cable.. Hits the power button, all switches die and the cable doesn't work, this all with users still trying to access resources, send mail and in the midst of distributed transactions to various DBs. This shit should be actionable.

      --
      ..Free Live Free...
    4. Re:My best hospital glitch by Anonymous Coward · · Score: 0

      100% agreed. Shoulda just done what you needed to do and ignored the union's cries. They're obviously incompetent.

  59. Lawsuit by Gary+Franczyk · · Score: 2

    There will probably be many lawsuits after this.

    The line of thinking will be something like this:

    How many people died or will die, or get improper treatment because of this networking glitch? If the hospital is as large as described, certainly a number of persons were given inadequate healthcare while they were there.

    Some may have a good case.

    1. Re:Lawsuit by Waab · · Score: 2

      I'm afraid in our lawsuit-oriented society the line of thinking will be something more like:
      How many people happened to be within 2 blocks of the hospital during this glitch and how many of them feel an overwhelming sense of entitlement that might motivate them to join a class-action suit?

      I fear a fairly large number of people will see this as an opportunity to sue, regardless of the quality of care they received during the network outage. I'm sure there are plenty of people who feel their lives weren't saved fast enough or at least weren't saved with the quality of service they feel they deserve.

      Oh, and IANAL.

    2. Re:Lawsuit by JohnnyBolla · · Score: 2

      That's true, many people are litigious asses. In fact, the people that can have that line of reasoning should be lined up and shot.

      --
      Carpe Deez
  60. Cisco implemenatation of Spanning Tree sucks by xaoslaad · · Score: 4, Interesting

    I am not up to speed on spanning tree, but speaking with a coworker after reading this article it is my understanding that Cisco equipment runs a new instance of spanning tree each time a new VLAN is created. As you can imagine in such a large campus environment there can be many tens if not hundreds of VLANs. In a short time you turn your network into a spanning tree nightmare. I'd much rather use some nice Extreme Networks (or Foundry or whatever) Layer 3 switching equipment at the core and turn off spanning tree. Use tagged VLANs from the closets to the core and voila, no need for spanning tree... Use Cisco edge devices for WAN links. Building out a second rat's nest out of the same equipment seems foolish.

    I'm not even sure how much Layer 3 switching equipment Cisco has; not much at all from my talking around in the past. It may not be possible to turn around and re-engineer it with the existing equipment; but I think that I would much rather throw out the vendor and reengineer the entire thing correctly before putting in a second shabby network.

    I speak from having assisted on something like this on a very small campus environment (1,500 nodes maybe) and we basically tore out a disgusting mess of a LAN and implemented a fully switched, beautifully laid out network with redundant links to all closets, an 8 GB trunk between two buildings etc., in the span of one weekend. Obviously there was tons of planning involved, cabling run in preparation and so on, but what a fantastic move it was.

    Sure there were hiccups Monday morning, but everything was perfectly fine by the end of the week.

    Two wrongs don't make a right.

    1. Re:Cisco implemenatation of Spanning Tree sucks by netwiz · · Score: 4, Informative

      Cisco only runs per-VLAN spanning tree if you're using ISL as your trunking protocol. The reason you don't get it on Extreme Networks stuff is because they use 802.1q. In fact, Cisco devices trunking w/ the IEEE protocol run single instances, just like the Extreme product.

      There are tradeoffs, of course. STP recalculations (when running) can be kind of intensive, and if you've got to run them for each of your 200 VLANs, it can take a while. However, for my particular environment, per-VLAN STP is a better solution.

    2. Re:Cisco implemenatation of Spanning Tree sucks by photon317 · · Score: 2


      Putting layer-3 switching only (no pure L2 devices) all the way out to the workstations is prohibitively expensive. Anytime you've got multiple L2 switches in a segment, you should have spanning tree turned on. Turning it off will seem like a gain, till some dumb user plugs two of your network connections into a 4-port hub under his desk and you start getting broadcast storms. Spanning Tree saves you from these types of disasters and a myriad of other possibilities.

      --
      11*43+456^2
    3. Re:Cisco implemenatation of Spanning Tree sucks by silas_moeckel · · Score: 1

      Obviously you have never tried to get Extreme gear up and running in a larger environment. I have worked with each of them and let me tell you, I have never had more problems than with Extreme gear. Working in a data center, as default configured Extreme gear will not autodetect on a gigE fiber link with most anything else, especially Cisco. Cisco's feature set is significantly more expansive. From the sounds of it this campus was rather flat with a lot of little switches. A redesign sounds like putting 65xx's in all the decent sized wiring closets and moving to a hub and spoke with 10gigE uplinks to 6513's. That would get you power over ethernet (important consideration for VoIP), plenty of bandwidth, and you can easily use the PFC/MFC cards to get routing done at a closet level to a few core routing networks. This does all assume they aren't running unroutable applications. If they are, that's the problem they should fix, but it very well may be more expensive and complicated than getting a new from-the-ground-up L2/L3 network installed.

      --
      No sir I dont like it.
    4. Re:Cisco implemenatation of Spanning Tree sucks by wein0 · · Score: 1

      Ahm... wrong... To say Cisco's spantree implementation is wrong is a ridiculous comment. It's like saying Ferrari's implementation of the wheel is flawed... If you take the time and think of a topology with multiple VLANs working across 100+ switches (such as the environment I work in), it is essential to run multiple instances of spantree (1 instance of spantree per VLAN). That gives you the possibility to load balance at Layer 2 and, more importantly, provides a more redundant topology. By having different root bridges for different VLANs you also minimize down time in case of hardware failure, i.e. if the root bridge for ALL VLANs fails, every VLAN is affected for a period of time. Cisco switches use PVST+ (per-VLAN spanning tree) whether you use 802.1q or ISL. I think you will find that anyone who is responsible for a LARGE campus is rather enthused about the ability to use multiple instances of spantree (you don't have to). The only down side to this is CPU cycles and a tiny bit of bandwidth.

    5. Re:Cisco implemenatation of Spanning Tree sucks by PatJensen · · Score: 2
      There are a few Cisco-related features in both CatOS and IOS that can improve spanning-tree convergence on large networks - but they have to be engineered at all layers from the get go. (core, distribution and access) All of your switches must have versions of software that support them as well.

      Spanning tree backbonefast lets your core layer switches reconverge after a link/switch failure quite rapidly. Used in connection with spanning tree uplinkfast, your distribution and access layer switches can switch over to another redundant copper or gigabit fiber link quickly without waiting for full spanning tree convergence.

      Another feature that seems to be widely used (and probably the most dangerous), is spanning-tree portfast - this gives access layer switches the capability to immediately begin forwarding a workstation's packets on the network. portfast should NOT however be used on trunk, channel or hub links as it can create a bridge loop by a user/site support mistakenly plugging in a crossover cable.

      Hope this helps!

      -Pat

    6. Re:Cisco implemenatation of Spanning Tree sucks by netengr1024 · · Score: 1

      Is this flamebait or what? Maybe it's just an un-informed comment that got a high rank because it sounds informative. Here are a few useful details:

      1) PVST can be turned off on Cisco gear so that you only have one spanning-tree for the whole network if that's what you want. However, you should consider that unless you have every VLAN on every switch, PVST actually saves your processor by creating smaller spanning-trees that reconverge independently. Without PVST, there's only one spanning-tree and it has to reconverge anytime there's trouble with any equipment on the layer-2 network.

      2) I used to say the same thing about Cisco's line of L3 switches because they had nothing to compare with Foundry's gear. However, in the last year or two, they've introduced several new products, including the Catalyst 3550 which happens to be one of my favorites now. These new products compare very nicely with Foundry (and others) and if I'm already using Cisco for WAN connections, it's all the same to me to use Cisco for the LAN to be consistent.

    7. Re:Cisco implemenatation of Spanning Tree sucks by Anonymous Coward · · Score: 0

      When using portfast, to stop users creating ST loops use: 'set spantree portfast bpdu-guard enable' on CatOS switches, or 'spanning-tree portfast bpduguard' on IOS ones. To stop unidirectional traffic problems use: set udld enable.

      Don't embarrass yourself by writing rubbish!

      PS Spanning tree issues exist where network designers do a crap job, period. ISL trunk on Ciscos from edge switches running VTP, PVST and you won't have issues; run a big flat network and expect things to fall over. On a big network use the industry standard 2-layer or 3-layer campus design models. One caveat, beware of asymmetric routing and ARP/CAM table timeout issues.

      PPS. regarding 802.1w, see this taken direct from Cisco's website:
      "Note: The availability of RSTP was first implemented as part of Multiple Spanning-Tree Protocol (MSTP) in CatOS 7.1 and Native IOS 12.1(11)EX and later. It will be available as a standalone protocol with the Rapid-Per-VLAN-Spanning-Tree (Rapid-PVST) mode in IOS 12.1(13)E and in CatOS 7.5. Under this mode, the switch runs an RSTP instance on each VLAN, following the usual PVST+ Cisco approach."

    8. Re:Cisco implemenatation of Spanning Tree sucks by JakiChan · · Score: 1

      I'd much rather use some nice Extreme Networks (or Foundry or whatever) Layer 3 switching equipment at the core and turn off spanning tree. Use tagged VLANs from the closets to the core and voila, no need for spanning tree...

      Guess what, you'd be using STP in that case too. The only reason you can turn STP off is if you are 100% sure there will not be any redundant paths in your layer two network. If you force spanning tree off and then have any redundant links then the first unknown MAC address or broadcast frame will cause your network to melt down in a broadcast storm.
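
      A rough sketch of that meltdown, assuming a toy model that is not anything from the article: four switches in a full mesh, every switch floods a broadcast out of every port except the one it arrived on, and nothing ever ages out. The function name `flood` and the topology are made up for illustration.

```python
def flood(links, blocked=frozenset(), steps=6):
    """Count broadcast copies in flight per step after one frame enters at "A".

    links   -- list of (switch, switch) pairs
    blocked -- indices into links that spanning tree would hold in blocking state
    """
    # Each in-flight copy is (switch it just arrived at, index of arrival link).
    in_flight = [("A", None)]
    history = []
    for _ in range(steps):
        next_hop = []
        for switch, arrived_on in in_flight:
            for idx, (a, b) in enumerate(links):
                # Never forward on a blocked link or back out the ingress link.
                if idx in blocked or idx == arrived_on:
                    continue
                if switch == a:
                    next_hop.append((b, idx))
                elif switch == b:
                    next_hop.append((a, idx))
        in_flight = next_hop
        history.append(len(in_flight))
    return history


# Hypothetical full mesh of four switches: AB, AC, AD, BC, BD, CD.
mesh = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")]

storm = flood(mesh)                      # spanning tree forced off
pruned = flood(mesh, blocked={3, 4, 5})  # BC, BD, CD blocked: a star rooted at A

print(storm)   # copies double every hop
print(pruned)  # copies reach the leaves and stop
```

      With the loops left in, one frame turns into 96 copies after six hops in this model; with the three redundant links blocked, it reaches every switch once and dies.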

      I'm not even sure how much Layer 3 switching equipment Cisco has; not much at all from my talking around in the past.

      Every 6500 with an MSFC or 5500 with an RSM (oooh, scary...my old company still had a 6500 MSM somewhere...) is a "layer 3 switch". In other words, Cisco has a lot of layer 3 switches.

      It sounds like you don't quite know what layer 3 switching is...it's also called routing. And it's not the magic solution to this problem, there are still layer 2 issues to be dealt with.

      implemented a fully switched, beautifully laid out network with redundant links

      And I'll bet a fair amount of money that yes, you are running spanning tree. Cisco implements standardized protocols pretty well. It's things like CEF that bite you in the ass.

      --
      "Where quality is like a dead stinking rat - you just can't miss it."
    9. Re:Cisco implemenatation of Spanning Tree sucks by xaoslaad · · Score: 1

      Wasn't flamebait. Said I wasn't sure. I was happy to be corrected... was a matter for conversation, not fanboy flamewars... that aside

      I had not been aware that Cisco's L3 lineup had grown so much. I started looking after so many people said otherwise. My bad; it does look as though they have a large quantity of equipment out these days...

      As for Spanning Tree here.... it's off, period. Ain't had a problem in nigh 4 years because of that either. Of course, _most_ of our end users would not know which end of a CAT 5 cable to plug into a hub if they had one... maybe in hospitals, information tech companies, etc., that is different. But we do not run ST anywhere, not on any of the layer 2 equipment, and we have not suffered for it.

      maybe this is an unusual case; but I still rather believe that it can be done elsewhere.

    10. Re:Cisco implemenatation of Spanning Tree sucks by Anonymous Coward · · Score: 0

      Maybe you should get your too-hot-to-handle girlie to turn on your spanning tree.

  61. Are you crazy? by AriesGeek · · Score: 2, Insightful

    Disable STP? And create, or at least take the risk of creating bridging loops? That will bring the network right back down to its knees!

    No, disabling STP is NOT an option. Learning how to use STP properly is the option.

    --
    Insert offensive troll-style sig here. Please mod or respond appropriately.
  62. Spanning Tree Protocol problems by Qzukk · · Score: 1

    It's just too complex for people to understand.

    One of the first things we learned when we got to this part of our networking class was that working out a spanning tree for more than a few nodes is damn near impossible for a human. We learned how to diagnose the problem if it occurred, we even studied ethernet frame dumps to watch the spanning tree build itself. But, if you weren't there to watch the tree get built, there's no way at all to guess what exactly went wrong with it. You just pull all the bridges and routers, reset them all, and start over.

    This was probably caused by a combination of bad hardware, and some nut plugging two branches of the network together that were already connected somehow. The hardware should have recognized this as a loop and cut it, but for some reason it didn't.

    Well, hopefully they won't repeat the same loop in their backup network.

    --
    If I have been able to see further than others, it is because I bought a pair of binoculars.
    1. Re:Spanning Tree Protocol problems by SpitFU · · Score: 1

      Better yet, this might have been caused by some overzealous security freak who wanted to monitor bandwidth or set up an IDS.

      I had this exact same thing happen but on a smaller scale and with less impact on operations.

      --
      reassign null to be the tape device - it's so much more economical on my time as I don't have to change tapes_BOFH
    2. Re:Spanning Tree Protocol problems by Anonymous Coward · · Score: 0

      We studied the algorithms behind the spanning tree in college. With a little graph theory and maybe a special program, it shouldn't be that hard to break it back down.

      Maybe network engineers need to take the math classes behind the tools that they use in order to understand them better. Mandatory graph-theory for network engineers!
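
      For what it's worth, the graph-theory part fits in a few lines. This is only a sketch under simplifying assumptions: the lowest bridge ID becomes root, every link has the same cost, and ties go to whichever upstream switch is visited first. Real 802.1D also weighs port cost and port ID, and the function name and topology here are invented for illustration.

```python
from collections import deque


def spanning_tree(links):
    """Return (root, forwarding_links, blocked_links) for a switch graph.

    links is a list of (bridge_id, bridge_id) pairs; all links cost the same.
    """
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    root = min(neighbours)          # lowest bridge ID wins the root election
    parent = {root: None}
    queue = deque([root])
    while queue:                    # breadth-first search = fewest hops to root
        node = queue.popleft()
        for peer in sorted(neighbours[node]):
            if peer not in parent:
                parent[peer] = node
                queue.append(peer)

    forwarding = {frozenset((n, p)) for n, p in parent.items() if p is not None}
    blocked = {frozenset(link) for link in links} - forwarding
    return root, forwarding, blocked


# Hypothetical four-switch topology with two redundant links.
links = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 2)]
root, forwarding, blocked = spanning_tree(links)

print("root bridge:", root)
print("forwarding:", sorted(tuple(sorted(l)) for l in forwarding))
print("blocked:   ", sorted(tuple(sorted(l)) for l in blocked))
```

      Every loop-free result keeps exactly one fewer link than there are switches, so anything beyond that count is a link the protocol should be blocking.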

  63. The real problem by Enry · · Score: 4, Insightful

    There was no central organization that handled the networking for the associated hospitals, so more networks just got bolted on until it couldn't handle the load.

    So what are the lessons?

    1) Make sure your solution scales, and be ready in case it doesn't.
    2) Make sure some overall organization can control how networks get connected.

  64. Business Continuity Plans? FDA, 21CFR11? by Anonymous Coward · · Score: 0
    Any medical system used for patient data is fair game for an FDA audit: these are *massive* in scope and they should issue you with a 483 if you're found not to be 21CFR11 compliant.
    To be compliant you should have massive amounts of validation documents covering everything from how to build *the whole system from scratch* in the event of an error, to your business continuity plan, your disaster recovery plan etc etc etc.
    Your initial User Requirement Spec document when the system was implemented should have included details of failsafes and redundancy, and those should have been built in from the word go.


    You would be on very shaky legal ground if you ran a system that was not FDA compliant like this.

    What *really happened here*?

  65. I don't buy it by hey! · · Score: 5, Insightful

    The same explanation was floated in the Globe, but I don't buy it.

    People when they are doing debugging tend to fasten onto some early hypothesis and work with it until proven definitively false. Even if jobs aren't on the line people often hold onto their first explanation too hard. When jobs are on the line nobody wants to say the assumptions they were working under for days were wrong, and some people will start looking for scapegoats.

    The idea that one researcher was able to bring the network down doesn't pass the sniff test. If this researcher was able to swamp the entire campus network from a single workstation it would suggest to me bad design. The fact that the network did not recover on its own and could not be recovered quickly by direct intervention pretty much proves to me the design was faulty.

    One thing I would agree with you is that the hospital probably needs a separate network for life critical information.

    --
    Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
    1. Re:I don't buy it by anonymous+loser · · Score: 2
      The idea that one researcher was able to bring the network down doesn't pass the sniff test. If this researcher was able to swamp the entire campus network from a single workstation it would suggest to me bad design. The fact that the network did not recover on its own and could not be recovered quickly by direct intervention pretty much proves to me the design was faulty.

      I had to reread this a couple of times. It looks to me like you're saying that it couldn't be a single application because that would indicate a poorly designed network, then go on to say the network was poorly designed.

    2. Re:I don't buy it by NecroPuppy · · Score: 2, Interesting

      I think he's laying more of the fault at
      the bad network design than any app that
      was run on it.

      I.e., the app was only able to do as much
      damage as it did because the network was
      so bad; if the network had been set up
      'properly', then the app could have only
      done localized damage.

      Does that make sense?

      --
      I like you, Stuart. You're not like everyone else, here, at Slashdot.
    3. Re:I don't buy it by DaveV1.0 · · Score: 5, Informative
      Actually, if you read the article carefully, they say that the application the researcher was running was the straw that broke the camel's back.

      "The crisis had nothing to do with the particular software the researcher was using."
      "The large volume of data the researcher was uploading happened to be the last drop that made the network overflow. "

      While it is never said directly, the implication is that the network was in bad shape to begin with, and when this guy started doing whatever he was doing, it just pushed things over the edge.

      --
      There is no "-1 offended" or "-1 you don't agree with me" mod options for a reason.
    4. Re:I don't buy it by Alsee · · Score: 2, Funny

      Someone modded Kathleen's "Yes" as Offtopic. He can kiss those moderator privileges goodbye.

      If "offtopic" results in a loss of moderation rights I'd hate to see what the consequences would have been for calling her a troll :)

      -

      --
      - - You can't take something off the Internet! That's like trying to take pee out of a swimming pool.
    5. Re:I don't buy it by Anonymous Coward · · Score: 0

      I
      think
      that
      you
      hit
      the
      nail
      right
      on
      th e
      head.

    6. Re:I don't buy it by ScuzzMonkey · · Score: 2

      Exactly. That's also implied by the fact the article mentions that an outside consultant had previously recommended a network overhaul and that it had already been approved--just not yet implemented, unfortunately.

      --
      No relation to Happy Monkey
    7. Re:I don't buy it by patter · · Score: 2, Interesting

      While it is never said directly, the implication is that the network was in bad shape to begin with, and when this guy started doing whatever he was doing, it just pushed things over the edge

      Makes a lot of sense actually. I've been doing a bit of a campaign for a while to have a separate domain or the ability to connect my test machines (in complete isolation of course) to only each other and maintain my OWN PDC... of course no one thinks this is a good idea, but some of the tests I need to run can bog down when the network's busy, and they of course are not helping the rest of the network be happy.

      Our network's reasonable, but people should give software folks what they need, not force them to work under the constraints the sales folks do (for example).

      Sure, we have to respect the 'rules' when joining the normal network for email and such, but testing of network applications should almost be on a smaller completely isolated network (to prevent dragging down the whole system when an automated test goes awry).

      Infinite loops don't just happen to stupid people ;). Anyone can get too tired to realise they're sending a billion packets a second because they reversed a conditional or something.

      I know a developer who had to leave one job because the IT folks didn't understand why he couldn't develop windows services without admin equivalence on his local machine (duh).

      --
      -- If at first you do succeed, try to hide your astonishment. -- Harry F. Banks
    8. Re:I don't buy it by John+Hasler · · Score: 2

      > Infinite loops don't just happen to stupid people
      > ;). Anyone can get too tired to realise they're
      > sending a billion packets a second because they
      > reversed a conditional or something.

      That would account for a temporary slowdown, but a robust network would have recovered as soon as he pulled the plug. This one didn't. Are they going to actually fix it, or just throw more hardware at the problem?

      --
      Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
    9. Re:I don't buy it by NumberGod · · Score: 1

      Hmmmmm, in my experience, in disasters like plane crashes etc, there is never a single failure that causes the problem. It is almost always a multi-systems failure, or a chain of events, any one of which, had it gone differently, could have prevented the problem.

      Perhaps they aren't taking this into account?

      I'm sure it'll all be covered in comp.risks once the dust has settled, and the real causes are identified.

      Hopefully they may learn what went wrong, and make their existing network more robust. I don't think that creating an entire new network alongside their existing one will help, it'll only create a new set of problems.

      I guess that this comes under safety-critical-systems design.

    10. Re:I don't buy it by Anonymous Coward · · Score: 0

      (sigh) have to post this anonymously. Last place I worked, some guys and I were sniffing around the network harmlessly--just browsing the network neighborhood from a Windows box. We found a 38 gig folder called "data". Yep. It was all the customers. Now, we were told that the data could be backed up if lost. We were also asked "please don't delete the data folder". Now, the fact that it was exposed was poor design, but if we had deleted it and they couldn't back it up, it isn't *just* their fault.

  66. "the last drop that made the network overflow" by dagg · · Score: 1
    ... just one more wafer :-) ...

    Overall, as long as patient care wasn't diminished (the degree of diminishment is debatable), it is probably good that things like this occasionally happen. It's a great way to test non-technical systems that usually only get tested in a wide-spread disaster.

    --
    Take this test before the hospital goes under.
    --
    Sex - Find It
  67. Re:Reliability is inverse to the number of compone by Alranor · · Score: 2

    I'm a little confused here:-

    Prob train A fails = 0.1
    Prob train B fails = 0.1

    Prob train A doesn't = 0.9
    Prob train B doesn't = 0.9

    So Prob neither fail = 0.9 * 0.9 = 0.81

    So prob at least one fails = 0.19 = 19%

    One of us has got the maths wrong.
    Can someone who's not trying to remember his stats courses from years back tell me if it's me :)

  68. Re:Reliability is inverse to the number of compone by Skater · · Score: 1

    It depends on how it's set up. I think of it in terms of parallel or serial wiring. Your example is serial, in that if one goes down they both go down, thereby decreasing reliability. If you ask the question a different way, such as "What is the possibility that both trains break down" (i.e., parallel--if one goes down it doesn't affect the other one), the probability is .10*.10, which is .01: more reliable.

    --RJ

  69. done right in the first place by wiredog · · Score: 3, Interesting
    You've never worked in the Real World, have you? It is very rare for a network to be put in place, with everything attached in its final location, and then never ever upgraded until the entire thing is replaced.

    In the Real World, where you can't shut everything down at upgrade time, a PDP-11 connected to terminals was put in 25 years ago. The PDP-11 was replaced with a VAX, which ran in parallel with the PDP-11 while it was brought online. A few years later a couple of PC's (running DOS 3.0) were hooked up to each other via a Novell network, which was connected to the VAX. Ten years ago the VAX was replaced with a few servers, which ran in parallel with the VAX until they were trusted. Along the way various hubs, switches, and routers were installed. And upgraded as the need arose. The cables were upgraded, also as the need arose, and not all at once.

    1. Re:done right in the first place by krinsh · · Score: 2

      Yes I have worked in the Real World before but I won't claim to be a super expert on any of this. It's just my opinion. And it should be documented. I've worked in a couple of places where the place closes down if their regulatory agency comes in and doesn't find all the proper documentation for everything, and that includes data processing.

      --
      I think with the interesting people, their lives can't possibly be wrapped up into a nice little package.
  70. Fix it the first way that works. by tomblackwell · · Score: 3, Insightful

    If you have something that's broken, and you need its functionality soon, and don't have a fucking clue as to what's wrong with it, you might want to replace it.

    It may not be the right way to do it, but they're running a hospital, and might not have the time to let their network people puzzle it out.

  71. Network Utilization Analysis not run yet by chopkins1 · · Score: 2, Interesting

    In the article, it also states that they had just approved a contractor to do a network analysis: "on Oct. 1, hospital officials had approved a consultant's plan to overhaul the network - just not quite in time." If the article summary gives the correct information, I'll bet that large parts of their network were overburdened and hadn't been upgraded in years.

    They were probably running at around 30-35% capacity and most networks get REAL funny at around that point. The following comment is rather telling: "The large volume of data the researcher was uploading happened to be the last drop that made the network overflow."

    Another telling comment about the situation was: "network function was fading in and out".

  72. Re:Reliability is inverse to the number of compone by dago · · Score: 2

    I don't know what SAT is, but I think you made some mistakes.

    if your 10% is the probability that 1 train will fail during NY -> LA trip then you've got the following probability :

    0 trains fail = 0.9 * 0.9 = 0.81
    1 train fails = 2 * 0.1 * 0.9 = 0.18
    2 trains fail = 0.1 * 0.1 = 0.01

    which means that the probability of having at least one train make it from NY -> LA is 0.81 + 0.18 = 99%, much better than the previous 90%.

    --
    #include "coucou.h"
  73. Probably stupid question. by Anonymous Coward · · Score: 0

    Why wouldn't a full network power down, then power up, fix this? Surely that would be quicker than 4 days.

    Or was it a case of poor management / incorrect configuration, resulting in bad configs in the devices' NVRAM?

  74. Re:Reliability is inverse to the number of compone by Anonymous Coward · · Score: 0

    Well... in the case of the network system things would be different. We can tolerate the failure of one network or the other, but not both.

    P(Failure) = 0.1

    P(Net1 Fails) = 0.1
    P(Net2 Fails) = 0.1
    P(Both Fail) = 0.1 * 0.1 = 0.01
    P(Net1 Fails, Net2 Works) = 0.1 * 0.9 = 0.09
    P(Net2 Fails, Net1 Works) = 0.1 * 0.9 = 0.09
    P(Either Net Fails) = 0.09 + 0.09 + 0.01 = 0.19

    Yes, we are more likely to experience a failure, but at the same time we are ten times less likely to experience a catastrophic failure.
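
    The arithmetic in this subthread can be checked by brute force. A small sketch, assuming the 0.1 failure figure used above and, more importantly, assuming the two networks fail independently, which a second network built with the same design flaws might not:

```python
from itertools import product

P_FAIL = 0.1  # per-network failure probability, the figure used in the thread

# Enumerate all four up/down combinations of the two networks.
outcomes = {}
for net1_up, net2_up in product([True, False], repeat=2):
    p1 = 1 - P_FAIL if net1_up else P_FAIL
    p2 = 1 - P_FAIL if net2_up else P_FAIL
    outcomes[(net1_up, net2_up)] = p1 * p2  # independence assumed

p_both_fail = outcomes[(False, False)]
p_any_fail = 1 - outcomes[(True, True)]
p_at_least_one_up = 1 - p_both_fail

print(f"some failure somewhere: {p_any_fail:.2f}")
print(f"total outage:           {p_both_fail:.2f}")
print(f"at least one net up:    {p_at_least_one_up:.2f}")
```

    So both readings hold at once: you see a failure of some kind more often (0.19 versus 0.10), but a total outage less often (0.01 versus 0.10).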

  75. Fix it Scotty ! by RedVortex · · Score: 1

    Spanning tree doesn't kick in just for fun; there was a problem, and suggesting another parallel network only means they haven't found it and that it's going to happen again. And it surely doesn't take days to figure out a problem like this; there are a huge number of ways to fix it temporarily anyway.

    I dealt with Cisco in the past and believe me, if they're not the ones who created and built your network, they're not the ones you want around fixing things. And surely not the ones making (sorry) stupid suggestions like this one (parallel network, duh!).

    Cisco probably has some of the greatest hardware around, but it will still behave like crap when you don't have the right people managing it.

    The quick answer is, find what happened, fix it and maintain it correctly and document it, then if something happens, before pulling plugs and messing around, RTFM :-)

    1. Re:Fix it Scotty ! by Anonymous Coward · · Score: 1, Interesting

      I wish there were more users like you. I work for Cisco, and MANY times have to ask 'who designed this for you?' or 'can I speak with your network administrator?' only to find out that the customer is the one that came up with the bad design in the first place. When I try to suggest that they screwed up on their topology and we should change it, then either they don't want to, don't have a maintenance window, or have some other lame excuse.

      Please RTFM *BEFORE* buying all the gear, when you are in the design phase. If you bought the wrong gear, or if you didn't buy enough memory or whatever, there's not a whole lot I can do to save you.

      Please RTFM *BEFORE* connecting it up. We (yes the people who you are on the phone with) write sample configs up for a reason. We set them up in the lab and verify that they work BEFORE they are submitted for the website.

  76. Re:The Israeli Way by Anonymous Coward · · Score: 0

    Well, occupying another country is another kind of a job. It's hard to say "mañana" when you've got lethal adversaries like stone throwing kids. No sir, that's when you just have to do something with your M-16.

  77. This assumes.. by nurb432 · · Score: 5, Informative

    That it was a network upgrade, sometimes it's not, and you have no clue what was changed, by *someone else*...

    As far as a parallel network, that's a tad overkill.. proper redundant pathways should be enough.. and plenty of packet filtering/shaping/monitoring.

    and keep a tighter rein on what is allowed to be attached to the PRODUCTION network..

    --
    ---- Booth was a patriot ----
  78. YES- air traffic management experience... by mekkab · · Score: 5, Interesting

    Yes. You do things in parallel and you make things redundant. You are fabricating reliability out of unreliable components, much as TCP does over IP.

    Let's talk about real-time systems. No, not "Voice over IP" or "streaming video" crap, I mean REAL human grade real-time systems.

    How do they get 99.99999% reliability? The components they use may be good, but nothing is that good! They get it by 1) removing single points of failure and 2) rigorously analyzing common mode failures (a sequence of failures that brings everything down).

    How is this done? You put things in parallel. Machines are multi-homed. Critical applications are Hot-standby, as are their critical servers. You have the nightmare of constant Standby-Data Management (the Primary sending a copy of its every transaction to the secondary and to the tertiary) but when the power on one side goes out (of course your primary and standby are in different buildings connected to different power supplies, right?!) the secondary steps right up.
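
    A toy sketch of that standby-data idea, not how any real air traffic system does it. The class names and the "first live node is primary" rule are assumptions for illustration, and it skips everything hard (failure detection, split brain, resynchronizing a node that comes back).

```python
class Node:
    """One server holding its own copy of the transaction log."""

    def __init__(self, name):
        self.name = name
        self.log = []
        self.alive = True

    def apply(self, txn):
        self.log.append(txn)


class Cluster:
    """The primary applies each transaction and copies it to every live standby."""

    def __init__(self, names):
        self.nodes = [Node(n) for n in names]  # listed in priority order

    @property
    def primary(self):
        # The first live node in priority order acts as primary.
        for node in self.nodes:
            if node.alive:
                return node
        raise RuntimeError("no live nodes left")

    def submit(self, txn):
        primary = self.primary
        primary.apply(txn)
        for node in self.nodes:
            if node is not primary and node.alive:
                node.apply(txn)  # standby data management
        return primary.name


cluster = Cluster(["primary", "secondary", "tertiary"])
served_first = cluster.submit("txn-1")

cluster.nodes[0].alive = False           # power goes out in building one
served_second = cluster.submit("txn-2")  # the secondary steps right up

print(served_first, served_second)
print(cluster.nodes[1].log)
```

    The point is that the secondary already holds every transaction when it takes over, so nothing has to be replayed from the dead primary.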

    --
    In the future, I would want to not be isolated from my friends in the Space Station.
  79. CCNP/CCIEs not what they are cracked up to be? by Anonymous Coward · · Score: 1, Insightful


    Hrmm, says that many CISCO engineers rushed in to "save the day" and did not get it fixed. I have seen this before. Perhaps those CISCO CCNP/CCIEs are not really that good... Then again, as someone else pointed out, if the current network engineer at the hospital did not have the common sense to revert any changes that were made, or figure out a (relatively) simple spanning tree problem, he should be the 1st to go. Sheesh, people need to recall the fundamentals of networking and protocols before they are made heads of very large networks.

    1. Re:CCNP/CCIEs not what they are cracked up to be? by JohnnyBolla · · Score: 2, Interesting

      True. For the most part, having a Cisco cert means you studied hard on how to pass the cert; it really has little bearing on whether or not you can do the work. Not to say that a chimp can pass them, but I have met some people with CCNPs who couldn't troubleshoot a toaster problem.
      Yes, I have some Cisco certs.

      --
      Carpe Deez
    2. Re:CCNP/CCIEs not what they are cracked up to be? by Fuzion · · Score: 1

      Doesn't the CCIE exam have a part where you have to demonstrate to an examiner that you are competent, by fixing various problems in a given network?

      --
      "Knowledge makes us accountable." - Che Guevara
    3. Re:CCNP/CCIEs not what they are cracked up to be? by Anonymous Coward · · Score: 0

      Yes. You spend 8 hrs. setting up a router and a switch according to set guidelines.

      If you have a CCIE, chances are you're reasonably competent - the other "paper" tests, not so much.

      But still, parachuting an engineer into a situation and expecting them to fix it quickly is a bit unreasonable...

    4. Re:CCNP/CCIEs not what they are cracked up to be? by Anonymous Coward · · Score: 0

      You evidently have not passed your CCIE lab exam.

      It's like pouring hot liquid lava up your arse.

      Painful.

      But a good test if you can do the grunt work.

    5. Re:CCNP/CCIEs not what they are cracked up to be? by Anonymous Coward · · Score: 0

      CCIE is acknowledged to be just about the hardest technology certification you can do. CCNP/CCDP and CCNA/CCDA are in a different league and don't mean anything in comparison. An analogy would be GCSEs (exams at age 16) and a PhD.

      I've known some sh8thot engineers who failed the CCIE lab. The lab now consists of a full day on various Cisco kit, which you firstly have to configure with every protocol/setup you can imagine, then fix 'problems' introduced by the invigilators. Something like 85-90% of candidates fail the exam.

    6. Re:CCNP/CCIEs not what they are cracked up to be? by Leon+da+Costa · · Score: 1

      The Cisco CCIE practical exam has changed from an older 2-day format to a new 1-day format. The old format used to include troubleshooting - the new format is mainly all the configuration of the previous two days and more crammed into a mere 8 hours.

      There was a _lot_ of discussion about the removal of troubleshooting. Speaking out of personal experience (I've done them both), you can test the skills you need to be a good troubleshooter just as well with giving you a very well thought-through exam as with giving you a broken network to fix.

      For those of you interested - please see the Cisco Blueprint for an idea of what you need to study for just the qualification portion of this exam.

  80. Redundancy, Redundancy, Redundancy by ChaosMt · · Score: 2
    If it's critical, YES! When something is life or death, such as a hospital, it is worth it to be prepared. N+1 redundancy.


    The sad thing is I've seen this so many times before in different medical environments I've been in. They usually aren't very motivated to spend money on *any* infrastructure costs. Hospitals may spend some, but it's usually with the motivation to increase donations; "Oh look! It's shiny!"


    Just like any other critical service, it costs big bucks to be prepared. How much you want to bet they 1) didn't have version control, 2) didn't have change control and ... I could go on. The point is everyone plans for system redundancy and recovery, but just assumes the network is resilient. The network is the computer - i.e., the system is the network.


    I am proud of them for one thing in particular. IMHO, your last line of redundancy, backups and recovery, etc. should ALWAYS be tangible. When you are involved with something life, death or riches, dead tree backups are the most reliable form I know. I am glad not everyone has lost their common sense to electron envy.

  81. Absolutely Not! by netwalkr · · Score: 1

    As a Cisco engineer I believe that if the network is done right the first time there is no need for that drastic a disaster recovery plan. The sheer cost would be astronomical, and if there is a design flaw in the original model, why replicate that on the DR side? Just my 2 cents.

  82. Open source Healthcare Information System by EastCoastLA · · Score: 1

    This type of problem will continue until the healthcare system adopts common standards, protocols and SOFTWARE. The HIPPA regulations have started to put fear into the hearts of hospitals and software companies who have to be up to standard by next year. Millions of dollars are at stake yet some companies are still clueless. If you doubt this statement, ask an IT person at the hospital how prepared they are for the upcoming HIPPA implementation. It will enlighten you.
    Tools that are used by the hospitals are another issue. Many hospitals are still using proprietary systems developed by vendors, which are thinking of their own interest. Unlike the internet (apache), there are very few open source healthcare information tools that a large hospital system can use that are HIPPA compliant. With all the great open source tools (java, gcc, KDE, ....) it is surprising that a sourceforge project does not exist that would allow a hospital with minimal hardware to run a Java based HIS or something they can run on their existing legacy hardware. This is the killer application. I would tell Beth Israel Deaconess to fire their software group, hire an open source team and start the development of an open source project to do the job that their current system is not doing. The savings will be great, but the contribution to healthcare will be legendary.

  83. Been there done that, got the ass beating by nt2UNIX · · Score: 3, Insightful

    In a large switched network spanning tree can save your butt and burn it. We try to test our switch changes before they are implemented. ON A TEST NETWORK.

    I know it's hard for everyone to believe, but vendors lie and those whiz bang network tools can screw you over.

    We have several thousand users on our campus with several thousand computers. We run about a half a dozen 6500 series Cisco Switches. Spanning tree re-calculations take about a second or 2. This is no big deal. And your traffic is re-routed nicely when something goes wrong. But if an interface (which is an uplink into the other switches) is freaking out and going up or down, the whole network will grind to a halt with spanning tree.

    Test Network GOOD (if you have the money).

    1. Re:Been there done that, got the ass beating by Anonymous Coward · · Score: 1, Insightful

      We implement UDLD aggressive mode to get around this. If UDLD detects a layer one problem, it immediately err-disables the port, thereby taking funky links out of the network.

  84. Re:Reliability is inverse to the number of compone by Anonymous Coward · · Score: 0

    Valid point, but to nitpick (hey, I'm bored), your maths is wrong. By your reckoning, 11 trains would give

    11 * 10% = 110% chance of failure!

    The actual maths is

    (probability of one or more failing) = 100% - (probability of none failing)
    = 100% - (90% * 90%)
    = 100% - 81%
    = 19%

    So obviously not significant to your point, but mathematically significant :-)
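
    The correction generalises to any number of trains. A minimal sketch, assuming each train fails independently with the same probability (the 10% figure is the one used in this thread; the function name is just for illustration):

```python
def prob_at_least_one_failure(p_fail: float, n: int) -> float:
    """P(one or more of n independent units fails) = 1 - P(none fails)."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be a probability")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p_fail) ** n


if __name__ == "__main__":
    for trains in (1, 2, 11):
        print(trains, round(prob_at_least_one_failure(0.10, trains), 4))
```

    Unlike the "multiply by the number of trains" shortcut, this can never exceed 100%, however many trains are added.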

  85. The Solutoin by Shishak · · Score: 5, Insightful

    Is to not bother with a second network. They need to break the spanning tree up a bit with some layer 3 routers. Sometimes it is fun to have a nice big layer 2 network. It makes life easy. It sucks to debug it when one half of a leg goes down and you get spanning-tree loops. The switches go down in a ball of flames that way.

    The solution is to put some edge routers in every building (Cisco 6509s with MSFC cards). Segment each building into different IP networks. Route between the networks. That way you may lose a building if the spanning-tree goes futzed but you won't lose the whole campus.

    Sure you'll be a touch slower routing between the segments but you'll have much more reliability.
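
    As a rough illustration of "one IP network per building", here is a small addressing-plan sketch using Python's standard `ipaddress` module. The campus block `10.20.0.0/16`, the /24 size and the building names are made-up assumptions, not anything known about the hospital's actual addressing.

```python
import ipaddress
from typing import Dict, List


def plan_building_subnets(campus_block: str, buildings: List[str],
                          prefix: int = 24) -> Dict[str, ipaddress.IPv4Network]:
    """Give every building its own routed subnet carved out of one campus block."""
    campus = ipaddress.ip_network(campus_block)
    subnets = campus.subnets(new_prefix=prefix)  # generator of non-overlapping subnets
    plan = {}
    for building in buildings:
        try:
            plan[building] = next(subnets)
        except StopIteration:
            raise ValueError("campus block too small for all buildings") from None
    return plan


if __name__ == "__main__":
    # Hypothetical building names, for illustration only.
    for name, net in plan_building_subnets("10.20.0.0/16", ["east", "west", "labs"]).items():
        print(f"{name}: {net}")
```

    Each subnet is its own layer 2 domain behind a router, so a spanning-tree meltdown in one building stays in that building.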

    --
    Now I hope and pray that I will But today I am still, just a bill
    1. Re:The Solutoin by SuicidalSquirrel · · Score: 1

      True, but don't forget it's a hospital. Have you priced a 6509 with an MSFC card lately? I know they're nice & all, but that's a lot of $$. Edge routers, yes, but they might be better served to use 2600 series routers and stick with L2 switches for port density.

      --
      So what are you going to do? Bleed on me?
    2. Re:The Solutoin by Large+Green+Mallard · · Score: 2

      I'm a network admin for a university department.. I think the smartest thing my department ever did was have all our subnets routed. Almost every other department is switched, so the thing with the default gateway for client machines is a switch up to several kilometres away ;)

      This was of course after my current manager with a clue about networking came along and saw the hub serving as a network core that then had 10 bridges hanging off it for segmenting the network into each subnet... Of course, he then bought a Nortel Accelar to use as the network core.. but he's seen the folly of his ways now, and we have a Cisco 3550 doing that now ;)

  86. most real networks are already parallel by Anonymous Coward · · Score: 0

    I guess it depends on the amount of parallelism.

    Most enterprise nets I have worked on are parallel all the way to the access layer switch on the user end with dual homed servers.

    that is parallel.

    If they mean two networks that don't touch one another, I think that is retarded.

    The bottom line is, if they got taken down by spanning tree for a whole day, their network was extremely poorly designed to begin with.

    If you follow a simple network design principle, and make each access layer switch a subnet(vlan), and make that switch the root of spanning tree for that VLAN, you will NEVER have a spanning tree loop.

    never ever ever.

    I'd love to be the VAR for that hospital.

    "Yes, the only way we can prevent this is by building another poorly designed network in parallel!"

    ka-ching.

  87. Simple Answer by DarkZero · · Score: 2

    I'm surprised I'm not seeing the really simple, obvious answer here to the question that's posed in the story.

    do you think the answer to having a massive and unreliable network is to build a second identical network?

    Don't build a second identical network. Just set it up so that whenever a file is saved, it's dumped onto a secondary network that's locked down so tightly that it doesn't run programs, search for documents, or anything like that. It just provides documents and that's it. For instance, it could be just a bare bones, huge-ass listing of links to patient data in a single document, and you would just use Ctrl+F or some such to find the name, and then click through it to see a TXT or HTML document with the patient's data in it. That way, you can have fancy programs and extensive information and such on the normal network without risking the network instability that comes with them.

    1. Re:Simple Answer by gorilla · · Score: 4, Interesting
      Having worked in a hospital, I'll tell you that's not acceptable.

      Medical records are probably the most sensitive records there are, and therefore it's essential that any access to them is both authenticated and audited. The first ensures that only authorized people can access them. The second ensures that in the event of misuse of the records, this can be detected - e.g. if someone who has authorization to access records decides to look up their neighbours without good reason.

  88. Add a second network? Not likely to help by markwelch · · Score: 5, Insightful
    > Do you think the answer to having an massive and unreliable network is to build a second identical network? <

    Of course not. Two solutions are more obvious:

    1. Fix or replace the existing network with a more reliable one (probably one that is less centralized so outages would not affect the entire campus); or
    2. If a second network is going to be added to provide reliable backup, then the second network should certainly not use the same technology as the first.
    A third, and somewhat obvious, solution would be to make sure that
    • crucial data is kept on the local server farm, but also copied in real time to a remote server; and
    • a backup access mode (such as a public dial-up internet connection, with strong password protection and encryption) is provided for access to either or both servers, in the event of a crippling "local" network outage.

    This might also be a good reminder to get very aggressive "liquidated damages" clauses in contracts like this, or to buy insurance. If a patient dies because of the network outage, I am sure that everyone in the supply chain will be named in the lawsuit.

    The liquidated damage clause is intended to provide an unambiguous motivation for the technology provider to fix the problem quickly, while the insurance would cover all or a portion of the losses if there is a failure.

    I would be extremely surprised if a huge campus like this one did not have a substantial number of different technologies in use, including wireless, and clearly networking them all into the same patient-records database is a challenge.

    --
    -- http://www.MarkWelch.com/ Pleasanton California
  89. They didn't by Anonymous Coward · · Score: 0

    The opening question, whether it is bad to build a second bad network, is ridiculous. What Cisco did was build a second network that all computers could be moved to if the efforts to correct the first network were not successful. The network crashed due to poor maintenance. The network was not maintained by anyone who knew much about a switched network, nor did they really even know what spanning tree was. This just shows that businesses don't really understand the value of experienced or educated network staff. Anyone with an inkling of how to maintain a switched network would have been able to foresee that continuing with the current design (or lack of a design) would end in a crash.

  90. CLARIFICATION by ChimChim · · Score: 1

    Sorry my point was unclear!

    When I wrote "This isn't to say that the extra redundancy isn't useful" I was saying (without saying) that the redundancy *increases* availability. As you guys promptly clarified, the likelihood that both will go down, and hence be completely unavailable, is reduced.

    I was simply pointing out that the gut reaction that 2 is better than 1 doesn't always hold true. If I were them, my first priority would be to figure out why their current network failed so horribly (spanning tree apparently) and, rather than having two equally unreliable networks, create a more reliable network, with redundant backups for availability. In a hospital setting, availability is paramount to other concerns, but they're going to incur more than twice the management costs by doubling the same network.

    thanks for calling me out though ;)

    1. Re:CLARIFICATION by pboulang · · Score: 1
      So you are now trying to say that "redundancy *increases* availability", which is perfectly clear, yet you also say "rather than having two equally unreliable networks, create a more reliable network, with redundant backups for availability", which is again unclear ;) Redundant backups as in a completely identical network, or using common parts throughout the network and having a small number of spares that can swap in to replace HW issues?

      In a hospital setting, availability is paramount to other concerns, but they're going to incur more than twice the management costs by doubling the same network.
      I disagree with you here. You think that if I have twice as much equipment to manage, I need twice as many people to do the management?

      Lastly, I think you are overlooking the point that in this particular instance, a completely redundant network would have alleviated the days of recovery.

      That said, I agree with your intention of suggesting throwing more brains rather than devices to solve these types of problems.

      --

      This comment is guaranteed*

      *not guaranteed

    2. Re:CLARIFICATION by ChimChim · · Score: 2, Interesting

      Yes, I'm not the wizard of words (or apparently math ;) this morning, am I?

      My main reason for posting was to appease my instinctual reaction to the (somewhat intuitive) mistake sometimes made that having twice the stuff makes it twice as good/reliable, etc. Which holds true for availability (10-fold in fact), but you'll get less in the case of reliability, and manageability is also a concern since you'll have to constantly check the backup network (if it's not in active use, failures are harder to find or predict for that matter). Also, failures aren't always randomly dispersed throughout the network, as the model might imply. You have to figure out how much failure each part of the network can sustain.

      So, throwing more hardware, developers, or whatever at the problem isn't a real solution. Figuring out what was wrong in the first place will let them spend their money more wisely, rather than letting all that hardware go to waste, doing nothing. They could possibly get all the redundancy they want with less than twice the hardware and maybe even increase performance of the network during regular usage.

      ok, i've totally over spent my $0.02.

  91. I have the solution... by FleshWound · · Score: 4, Funny

    I live in the Boston area, and I have the perfect solution: they should hire me. I'll make sure their network never fails.

    Well, maybe not. But I still need a job... =)

    1. Re:I have the solution... by Anonymous Coward · · Score: 0

      Well, seeing that you do have a fitting name you can consider yourself hired!

  92. Re:Reliability is inverse to the number of compone by EmagGeek · · Score: 1

    The probability of failure goes like this:

    The probability of both trains failing is:

    P(1st train fails) * P(2nd train fails) = 0.01

    The probability of neither train failing is:

    P(1st train doesn't fail) * P(2nd train...) = 0.81

    The probability of exactly one train failing is:

    P(1st train fails) * P(2nd train doesn't)
    + P(1st train doesn't) * P(2nd train does)
    = 2 * (0.1 * 0.9) = 0.18

    (notice this adds up to 1, so far)

    and the probability of at LEAST one train failing is P(exactly one fails) + P(both fail) = 0.19

    QED
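
    The same breakdown can be checked by brute force. The sketch below simply enumerates every fail/survive combination, assuming independent trains that each fail with probability 0.1 as in the figures above; the function name is illustrative.

```python
from itertools import product
from typing import Dict


def outcome_probabilities(p_fail: float, n: int = 2) -> Dict[int, float]:
    """Map 'number of failed units' -> probability, by enumerating all outcomes."""
    totals = {k: 0.0 for k in range(n + 1)}
    for outcome in product((True, False), repeat=n):  # True means that unit failed
        prob = 1.0
        for failed in outcome:
            prob *= p_fail if failed else (1.0 - p_fail)
        totals[sum(outcome)] += prob  # sum() counts the failed units
    return totals


if __name__ == "__main__":
    table = outcome_probabilities(0.1)
    print("none fail:", round(table[0], 4))
    print("exactly one fails:", round(table[1], 4))
    print("both fail:", round(table[2], 4))
    print("at least one fails:", round(table[1] + table[2], 4))
```

    Because the enumeration covers every case exactly once, the probabilities have to add up to 1, which is the sanity check noted in the comment.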

  93. Re:That's why I hate automatic routing by hplasm · · Score: 1

    Don't use Spanning Tree unless your routers still use Valve (Vacuum Tube) Technology. It's fine until it breaks, and then it can be a twat to make it settle down again. Retire it.

    --
    ...and he grinned, like a fox eating shit out of a wire brush.
  94. Networks are fragile. by XPisthenewNT · · Score: 3, Interesting
    I am an intern in a networking department where we use all Cisco stuff. Spanning tree and some other protocols are very scary because once one switch declares itself a server of a given protocol, other switches "fall for it" and believe the new switch over the router. Getting the network back is not as easy as turning off the offender, because the other switches are now set for a different switch server. Power outages are also very scary because if switches use any type of dynamic protocol, they have to come back up in the right order, which Murphy's Law seems to indicate would never happen.
    Networks are fragile; I'm surprised there aren't more massive outages.
    The answer might be to hire competent network staff, and perhaps train some other IT employees with basic knowledge to help in emergencies. A second network seems a little extreme--both cost and management wise.

    KISS: Keep it simple, stupid!

    1. Re:Networks are fragile. by Mr.+KaryHead · · Score: 2, Informative

      Networks can be fragile and spanning tree can certainly cause some of the problems. That is why one must design the spanning tree topology. When you say "one switch declares itself a server of a given protocol", I assume you mean "declares itself the root of a VLAN." The root is determined by the lowest advertised bridge ID from each switch. The bridge ID is the bridge priority plus the bridge address. Cisco switches have a default bridge priority. So then it boils down to whichever switch has the lowest bridge address becomes the root, which could be any switch anywhere in your network. The network admin should decide which switch will be the root for a given VLAN and set the bridge priority lower. And then he/she selects another switch to be a backup root and sets its priority to be lower than the default but higher than the root's priority. So if you don't manually set the root, then a new switch plugged into the network could very well become the root if all the switches have a default priority and the new switch has a lower bridge address than the current root.

      If this happens, you can just turn off the offender to get your root back. In STP only the root talks. If the other switches don't hear from the root in something like 20 seconds, then they'll elect a new root.

      -Kary
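
      The election rule described above (lowest bridge ID wins, where the ID is the priority followed by the bridge address) can be sketched in a few lines. This is a toy model, not switch code: the switch names and MAC addresses are invented, and 32768 is used as the default priority, which is the IEEE 802.1D default.

```python
from typing import Iterable, NamedTuple, Tuple

DEFAULT_PRIORITY = 32768  # IEEE 802.1D default bridge priority


class Bridge(NamedTuple):
    name: str                      # hypothetical label for the switch
    mac: str                       # bridge address, e.g. "00:1a:2b:00:00:10"
    priority: int = DEFAULT_PRIORITY


def bridge_id(bridge: Bridge) -> Tuple[int, int]:
    """Priority is compared first; the bridge address only breaks ties."""
    return (bridge.priority, int(bridge.mac.replace(":", ""), 16))


def elect_root(bridges: Iterable[Bridge]) -> Bridge:
    """The bridge advertising the lowest bridge ID becomes the root."""
    return min(bridges, key=bridge_id)


if __name__ == "__main__":
    core = Bridge("core", "00:1a:2b:00:00:10")
    closet = Bridge("closet", "00:1a:2b:00:00:20")
    newcomer = Bridge("newcomer", "00:00:0c:00:00:01")  # lower address, default priority

    print(elect_root([core, closet]).name)
    print(elect_root([core, closet, newcomer]).name)
    pinned_core = core._replace(priority=4096)           # admin pins the root
    print(elect_root([pinned_core, closet, newcomer]).name)
```

      With every switch left at the default priority, the newcomer with the lowest address takes over as root; once the intended root is given a lower priority, plugging in the newcomer changes nothing.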

  95. Was it OSPF? by Anonymous Coward · · Score: 2, Interesting

    The article is a little light on technical details, but does anyone know what internal routing protocol they were using? We've got a network with 11 Cisco routers running OSPF. The routing changes happen very often, because there's a bunch of dial-ups and a few dozen routes that come and go with short-term connections (like backups from a remote office or running a CC authorization from a remote office). Everything works perfectly if none of our three newest routers are the first powered up. Those three are running IOS 11.0. After several calls to Cisco (we buy all Cisco internally and for our customer ends, so we get very good support from them) over the past three years, Cisco is still stumped as to what the problem could be. The OSPF section of the config file is only five lines long, so we (and Cisco) are sure there's no problem there. The hospital's problem sounds like it's of the same sort.

  96. Re:Failed SAT huh? by Anonymous Coward · · Score: 0

    Ahh yes, but what about the probability of both trains breaking down at the same time?

    You are confusing a device which is twice as complex performing a given task with two machines of the same complexity independently performing the given task.

    complexity = bad
    redundancy = good.

  97. Previous lack of funding for IT? by quark2universe · · Score: 2

    If this hospital is like any of the medical institutions I've worked for, then it's not unreasonable to expect that the IT group has been begging for more money to upgrade the infrastructure because they knew this kind of thing could happen. This usually falls on deaf ears at the doctor and senior administration level of the hospital because they see computers and networks as "magic" and don't take any time to understand the kind of reliance that is now placed on those systems. Also, it is very common for doctors to reject any spending on IT because it will bring their 8 figure salaries down to 7 figures and that is totally unacceptable!!! The story did say they are looking at $3 million for future upgrades, but that ONLY happened after this disaster.

    --

    Believe in things of which no person has ever learned
  98. Remember... by Randolpho · · Score: 1

    The paperless office is still, and always will be, a myth.

    --
    "Times have not become more violent. They have just become more televised."
    -Marilyn Manson
  99. That would be redundant! by Anonymous Coward · · Score: 0


    That would be redundant!

  100. Redundancy and death by FearUncertaintyDoubt · · Score: 2, Insightful
    Of course, as open as they were about the whole incident, the hospital did not disclose whether any patients were affected or even died due to the breakdown (nurses having wrong information, staffing problems caused critical situations to wait too long, etc.).

    A lot of people here have said "build a 2nd network," to which some have basically said, "that's stupid, make your first network run right." I think that if we're talking about life and death of patients, a second network would be a good idea. It's like the high factors of safety built into things like, say, an elevator -- a failure can cause death, so you overbuild it. Remember that you don't have to make everything redundant, just those critical parts of the system. Maybe all the administrators can only use the primary network, but the blood testing labs and nurses' stations and such can use either primary or secondary. Cutting off non-critical traffic during an outage also helps keep the whole system more stable.

  101. Life threatening? by saider · · Score: 3, Insightful

    I hope "The machine that goes ping" does not require the network to run. My guess is that much of that equipment is plugged into the red outlets and can run on its own for a fair amount of time. If it is hooked up to the network it is to report the machine status, which is independent of machine operation.

    The network crash in question screwed up the document process, slowed everything down, and made life inconvenient, but I doubt anyone's life was at risk.

    --


    Remember, You are unique...just like everyone else.
    1. Re:Life threatening? by benwb · · Score: 5, Insightful

      Test results and labs come back on computer these days. More and more hospitals are moving to filmless radiology, where all images are delivered over the network. I don't know that much about this particular hospital, but I do know that hospitals en masse are rapidly approaching the point where a network outage is life threatening. This is not because the machine that goes ping is going to go off line, but because doctors won't have access to the diagnostic tools that they have now.

  102. Down here in North Carolina by LWolenczak · · Score: 2

    I used to work for a systems integrator. Just by general practice, anything that was mission critical was on a separate network.... if not two different networks. This is most likely a WinXP machine on which somebody played with the stp/vlan settings.

    Speaking of teaching hospitals... Yes, they are large..... I live just a few miles from Wake Forest/Baptist Hospital. They add, or renovate, a wing a year.... There are always large cranes over the building... and since I'm looking for work... I applied there... Even though they had a plethora of positions open for Network Techs, and since I'm well overqualified, and cheap... you would have thought they would have hired me... they did not... they seem to go for bottom barrel regarding techs... cheapest... most likely they think A+ is the best cert you can get.

    1. Re:Down here in North Carolina by buss_error · · Score: 2
      they did not... they seem to go for bottom barrel regarding techs

      I know at some places I've worked, the question is "Well, if they're that good, then they wouldn't settle for our wage. They'll just leave when a better paying job rolls around. Better to hire someone that will stay."

      --
      Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves.
  103. QoS and network boundaries by pangur · · Score: 5, Informative
    There are several non-exclusive answers to the Beth Israel problem:

    1) introduction of routed domains to separate groups of switches

    2) ensure that more than one redundant switching loop does not terminate in a switch. I've had a single switch be the linchpin between two loops, had the switch go down and back up, and spanning-tree would not converge. If you want redundancy in your switches, spread out the loops.

    3) Put QoS on the network. Identify mission-critical traffic and give it priority and guaranteed bandwidth (Cisco uses LLQ and CBWFQ using DiffServ, CoS, and IP precedence). That way even if someone puts loads of traffic on mission critical paths, the effect should be limited to the local switch port or router, depending how it is implemented.

    4) lastly try a redundant network. You would still want QoS to stop a jabbering NIC from hosing your local bandwidth, and you might want to run diagnostics with your pocket PC or laptop, so you would still need to plug into that isolated net anyway. I would recommend that last due to cost, space, and connectivity issues.
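
    To make point 3 a little more concrete, here is a toy strict-priority scheduler. It is only a sketch of the "mission-critical traffic goes first" idea: it is not how LLQ or CBWFQ are implemented (those also guarantee bandwidth to the lower classes), and the class numbers and packet labels are invented for the example.

```python
import heapq
from itertools import count
from typing import List, Optional, Tuple


class PriorityScheduler:
    """Strict-priority queue: lower class number is served first, FIFO within a class."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._seq = count()  # arrival counter keeps FIFO order inside a class

    def enqueue(self, traffic_class: int, packet: str) -> None:
        heapq.heappush(self._heap, (traffic_class, next(self._seq), packet))

    def dequeue(self) -> Optional[str]:
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


if __name__ == "__main__":
    CRITICAL, BULK = 0, 2  # hypothetical class numbers
    sched = PriorityScheduler()
    sched.enqueue(BULK, "backup chunk 1")
    sched.enqueue(BULK, "backup chunk 2")
    sched.enqueue(CRITICAL, "lab result")
    while (pkt := sched.dequeue()) is not None:
        print(pkt)
```

    Pure strict priority can starve the bulk class under sustained load, which is one reason real QoS schemes pair a priority queue with bandwidth guarantees for everything else.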

    Thank you.

    1. Re:QoS and network boundaries by caluml · · Score: 2

      Have a look at this device. Not quality of service, but guarantee of service. Very cool.

      The FlowFusion 2M and 5M are U4EA's first commercial hardware products to simultaneously manage all three of the factors affecting multi-service networks - throughput, loss and delay.

      While others have addressed bandwidth, U4EA has developed the GoS solution that allows network managers to manage packet delay and device buffers, as well as to isolate problematic streams to avoid random packet loss. Critical applications are guaranteed bandwidth - up to 2 Mbit/s (FlowFusion 2M) and up to 5 Mbit/s (FlowFusion 5M) at the WAN interface - and Quality of Service (QoS), even during extended periods of network overload.

      The FlowFusion units are typically installed between an office LAN and the WAN access equipment via two fast Ethernet ports, and can stand alone or be rack mounted.

      The network administrator is able to define treatment parameters for each application so that mission-critical applications get the exact resources when needed, while maintaining the WAN resource at near 100% utilisation. For the first time, mixed networks can achieve maximum efficiency through a single connection, accelerating the deployment of converged services like VoIP and online videoconferencing.


      http://www.u4eagroup.com/pdf/data%20sheet;%206844.pdf

  104. unit network by azileretsis · · Score: 1

    The article is definitely interesting. Since hospitals affect all of us at one time or another, it's interesting to see how their networks are set up.

    There was some talk here about unit-based network or basically separated parts of networks. How would I get more information about this topic? How would that work? Don't you lose some benefits from having a central network where resources can be allocated to when you need it? What does it mean to be on a unit or divisional network? As a metaphor, I imagine a building that has lock-downs at certain places. But for a network, for lock-downs to work efficiently, there has to be highly effective detection devices (kindof like fire detectors).

    I can't imagine it would be as easy as turning on a switch.

    I agree that this person should not have been on the production server at all but on a development server.

    I also agree they should have had backups available, though they did state none of the network has patient critical information. But can you imagine if your patient information had been inaccessible?

  105. It's HIPAA by mrneutron · · Score: 3, Informative

    Health Insurance Portability and Accountability Act.

    Most health care organizations are far from clueless, believe me. Your average healthcare IT manager is well aware of HIPAA, as they've been working on the transaction and privacy aspects for quite a while.

    The techs in the trenches may know less, mostly because the data security regulations (the 3rd, and largest portion of the HIPAA work) are not yet finalized. The real work doesn't begin until then: probably sometime later this year.

  106. Enterprise applications need Enterprise CYA by Matey-O · · Score: 2

    They need a smaller test environment that ALL changes have to be checked off on before implementing. They need images of all router configs they can roll back to if necessary, and they need a diff comparison tool (mantrap or somesuch) to see what's changed between their known good configuration and what exists now.
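    As a rough illustration of the "diff against a known good configuration" idea, here is a minimal sketch using Python's standard difflib module. The configuration snippets and the name config_diff are made up for the example; a real shop would pull the texts from wherever its router configs are archived.

```python
import difflib

def config_diff(known_good: str, running: str) -> list[str]:
    """Return unified-diff lines between two configuration texts."""
    return list(
        difflib.unified_diff(
            known_good.splitlines(),
            running.splitlines(),
            fromfile="known-good",
            tofile="running",
            lineterm="",
        )
    )

# Hypothetical config snippets, for illustration only.
known_good = """hostname closet-3a
spanning-tree vlan 10 priority 4096
interface FastEthernet0/1
 switchport access vlan 10"""

running = """hostname closet-3a
spanning-tree vlan 10 priority 32768
interface FastEthernet0/1
 switchport access vlan 10"""

# Print only what changed between the archived and the live config.
for line in config_diff(known_good, running):
    print(line)
```

    An empty result means the running configuration still matches the archived one, which is the cheap check to run before and after any change.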

    Oh yeah, and they need a signed piece of paper with the moron's signature saying the change wouldn't impact the network. (a papertrail, as archaic as that seems.)

    --
    "Draco dormiens nunquam titillandus."
  107. Second identical network? by secolactico · · Score: 1
    Slashdot network engineers (armchair and professional): do you think the answer to having a massive and unreliable network is to build a second identical network?

    I'm the armchair kind. But wouldn't this solution have led to TWO identical networks down? Whatever triggered the problem in network A could easily be present in network B.

    Unfortunately, downtimes are not fun in a hospital. In other places, it means that we can goof off and blame it on the IT department.

    Ok, time to stop trolling...

    --
    No sig
  108. Simple...apply the formula by liquidsin · · Score: 2

    do you think the answer to having a massive and unreliable network is to build a second identical network?"

    Take the number of patients in the hospital, A, multiply by the probable rate of death should the network fail, B, multiply by the average out-of-court settlement, C. A times B times C equals X. If X is less than the cost of a redundant network, we don't build one.
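    Spelled out as arithmetic, the formula looks like the sketch below. Every number here is invented purely to show the calculation; none of them comes from the article or from any real hospital.

```python
def expected_liability(patients: int, death_rate: float, avg_settlement: float) -> float:
    """X = A * B * C, as described in the comment."""
    return patients * death_rate * avg_settlement

def build_redundant_network(x: float, network_cost: float) -> bool:
    """Per the quoted rule: skip the build when X is less than the network's cost."""
    return x >= network_cost

# All inputs are hypothetical.
x = expected_liability(patients=500, death_rate=0.002, avg_settlement=2_000_000)
print(x)
print(build_redundant_network(x, network_cost=3_000_000))
```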

    --
    do not read this line twice.
    1. Re:Simple...apply the formula by jemoody · · Score: 1

      I am Jack's complete lack of surprise.

  109. Cisco engineers to the rescue-ha by Anonymous Coward · · Score: 0


    Why do you think telecom networks cost so much?

  110. I work at a teaching hospital... by pacsman · · Score: 5, Insightful

    The network isn't too bad, but the incompetence of the people that run it astounds me. I've had large segments of it go out unnoticed by them because a UPS failed in a closet somewhere. Took them forever to track it down, too. In the end it's not the routers/switches that scare me, but the tons of old, outdated, unpatched Solaris machines that exist on this network. There are so many manufacturers out there that use crappy installations to run their MRI and CAT scanners that it terrifies me. It's really only a matter of time until all the doomsaying from me and my company (we're a third-party vendor that supports a medical image archive) comes true. Unfortunately, I think it will collapse on us because the IS people will be unable to handle it.

    1. Re:I work at a teaching hospital... by Anonymous Coward · · Score: 0

      there are plenty of easy fixes here:

      for the lazy with money it's called HP OpenView NNM or Micromuse Netcool

      for the poor with time on their hands it's called MRTG, NetSaint, etc., etc.

      Network failures only last as long as the problem goes undetected and repair cannot begin....managing..resources..WHAT A CONCEPT!?

    2. Re:I work at a teaching hospital... by DNS-and-BIND · · Score: 2

      Hey, it's better than the X-ray machines controlled by Visual Basic apps. As you get ready to be irradiated, you watch the technician click through several dialog boxes of errors as she reassures you "it's OK, this always happens".

      --
      Shutting down free speech with violence isn't fighting fascism. It IS fascism!
    3. Re:I work at a teaching hospital... by eam · · Score: 1

      Wow! We must work at the same hospital ;-)

    4. Re:I work at a teaching hospital... by Anonymous Coward · · Score: 0

      Ohh please, the software that is made for medical equipment is by far the lousiest software imaginable. Error boxes appear for "not enough free space" on drives that are not even the destination drives, or for wrong file name codes according to some ill-conceived manufacturer's concept of what a file name should be. So don't think that you are getting an X-ray of a monkey's spine instead of your left wrist. It usually isn't the tech's fault; it is the fault of the geekheads at the manufacturer for making such terrible systems to begin with.
      I have been running and writing Sleep Medicine software for well over a decade, and have seen enough poor Fortran, C, VB, and yes, macro programming to go around. I have even offered direct user input to a global manufacturer's head software engineer, who told me directly that he has his own way of thinking about the software and is not taking input.

    5. Re:I work at a teaching hospital... by Associate · · Score: 1

      How and how long do hospitals archive CAT and MRI's? I had a CAT scan several years ago and I'm a bit curious.

      --
      Someone hates these cans.
  111. Maybe not so ridiculous by lucifuge31337 · · Score: 2, Insightful

    This sounds like a case of poor network infrastructure management. That being said, you can't pin it all on IT. Organizations like this have networks that grow out of necessity, and are often nearly impossible to make large changes to.

    Perhaps the seemingly ridiculous "secondary" parallel network can be put in place not for redundancy, but as a tool to migrate the existing devices to a properly configured and routed network. If STP brought the whole thing down to begin with, they are probably flat. VLANs and subnetting at closets with appropriate L1 redundancy and L3 routing is most likely the modern network design their IT staff has known for years that they should have, but never had the convincing argument they needed to get management to foot the bill and allow the service disruptions required to make the switch.

    --
    Do not fold, spindle or mutilate.
  112. Take a cue from the Ramans by UV_Haze · · Score: 0

    The Ramans did everything in threes for a reason. So in response to whether or not building a second identical network is a good idea... I think a third should be implemented also! Especially in a situation where lives are at stake.

  113. Having Worked at Boston Medical Center by Ratfor77 · · Score: 1

    I can say that hospital networks are a nightmare. You have dozens of departments, all with different combos of hardware, software, requirements, & operating systems. Workstations for patient entry. Workstations for patient tracking. VT300's to access legacy VAX/VMS databases that NOBODY knows exactly how to port to a newer platform. Besides, convincing the powers that be that they need to spend BIG $$$$ to modernize and streamline is an endless battle. Kudos to them for having a workable system for 6 years, but they never should have abandoned the paper backup. Just my 1 cent.

    1. Re:Having Worked at Boston Medical Center by iggymanz · · Score: 1

      VT300's to access legacy VAX/VMS databases that NOBODY knows exactly how to port to a newer platform

      Oh, there's still plenty of us former VAX VMS admin/system/application programmers around (who now do Unix based solutions) that would be DELIGHTED to do this.....anyone with legacy systems that need integration/conversion/migration is welcomed to visit my website, read my resume, and contact me.

  114. Any particular bank? by Anonymous Coward · · Score: 0

    Maybe Washington Mutual?

  115. Nope by mnmn · · Score: 1


    do you think the answer to having a massive and unreliable network is to build a second identical network?"

    The answer is to build reliable networks in the first place. From each computer to the other there should be multiple routes. Firewalls should be kept between departments to stop NetBIOS and ICMP broadcast storms and Linux be used to replace M$ systems. All DBASE 5 apps should be replaced with mysql/ncurses equivalents on RAID 1/XFS filesystems. A central computer with daily backups be kept, with multiple power sources for each department.

    Having a full-time network administrator, and shielding him from Sys admin tasks while he keeps a list of network analysers/ monitors the servers and keeps extra routers and cables, helps.

    Quantity is still no alternative to quality. Install 4 networks in parallel and a DDoS attack will still take them out. Something like the Slapper worm, or even unplugging an important server, will still break the system. Install good quality hardware and don't be understaffed in the IT sector.

    --
    "Give orange me give eat orange me eat orange give me eat orange give me you." -Nim Chimpsky
  116. How about P2P? by cberetz · · Score: 1

    Could a completely decentralized network, e.g. P2P, solve the problem? redundancy is built-in, so to speak.

    _____________________________________
    2 + 2 = 5 for very large values of 2

    1. Re:How about P2P? by a3d0a3m · · Score: 1

      Here is a clue... please use it wisely.

    2. Re:How about P2P? by cberetz · · Score: 1

      Thanks for that constructive addition, script kiddie.

    3. Re:How about P2P? by a3d0a3m · · Score: 1

      hey, that's a step up from kazaa kiddie.

      seriously though, you know not what you talk about. you should read a book before you open your mouth.

    4. Re:How about P2P? by cberetz · · Score: 1

      HA! I was right! The fact that you equate P2P with Kazaa PROVES your experience is limited to that of a script kiddie. Check it out, then we'll talk: http://www.groove.net/

      Chuck

  117. Now we can screw up twice as many times! by watchful.babbler · · Score: 1
    Do you think the answer to having a massive and unreliable network is to build a second identical network?

    According to my former employers at WorldCom, yes!

    Okay, maybe not the best example. But it was always fun to be on conference calls when we had to explain to the customer why their backup network had gone down at the exact same time as the primary ... assuming your idea of fun correlates with the deeper circles of Dante's hell.

    --
    "Freedom is kind of a hobby with me, and I have disposable income that I'll spend to find out how to get people more."
  118. Re:That's why I hate automatic routing by Swannie · · Score: 3, Interesting
    Can you make a case why spanning tree is bad? Beyond "It's old", or "I've been burned before?" I've never, personally, heard a good argument as to why spanning tree is bad.


    As for why it's good, it can provide layer two redundancy at a very small cost (basically the cost of an additional cable). While the same can be provided with a routed network, at layer 3, the cost is much higher, and a properly configured spanning tree based network will fail over very quickly and provide lots of trouble-free operation.


    Beyond that, spanning tree can often protect people from themselves. What happens when that intern plugs a cable in the wrong place and creates a bridging loop? You guessed it: no spanning tree, no protection from bridging loops, and you can kiss all or part (depending on the design) of your network goodbye. Oh, and good luck finding that cable, especially if it's a big place. Don't think that intern is going to admit his error and get fired...
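    To make the loop-protection point concrete, here is a simplified sketch of what spanning tree accomplishes: given a set of switch-to-switch links, keep one loop-free path to every switch and block the rest. This is not the real protocol (no BPDUs, timers, or path costs, and all links are assumed equal); the switch names and the spanning_tree helper are hypothetical.

```python
from collections import deque

def spanning_tree(links, root):
    """Return (forwarding, blocked) link sets for an undirected switch graph.

    Breadth-first search from the root keeps one loop-free path to every
    switch; any link not on that tree is treated as blocked.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    visited = {root}
    forwarding = set()
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for nxt in sorted(neighbors.get(current, [])):
            if nxt not in visited:
                visited.add(nxt)
                forwarding.add(frozenset((current, nxt)))
                queue.append(nxt)

    blocked = {frozenset(link) for link in links} - forwarding
    return forwarding, blocked

# Hypothetical topology: two closets uplinked to the core, plus the
# intern's stray cable directly between the closets.
links = [("core", "closetA"), ("core", "closetB"), ("closetA", "closetB")]
forwarding, blocked = spanning_tree(links, root="core")
print(sorted(tuple(sorted(link)) for link in blocked))
```

    In this toy topology the stray closet-to-closet cable ends up blocked instead of forming a loop, which is the protection described above.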


    Swannie

    --
    :q!
  119. Yes, if I'm selling the network ;) by dnoyeb · · Score: 2, Funny

    Of course the answer is to build a completely separate network if I am the one who you will pay to build it ;)

    This is obvious.

    In truth the network problem was not a physical one, so the solution should not be a physical one.

    1. Re:Yes, if I'm selling the network ;) by kaoshin · · Score: 2, Funny

      You can never be TOO safe when lives are at stake. I think at least 4 networks would really be needed.

  120. Globe Article is not entirely correct by Anonymous Coward · · Score: 0

    I work at the hospital (not in IT however.) In reality, the IT department is reworking/fixing the existing infrastructure including much of the hardware, _and_ adding a new redundant network. It doesn't look like this will be a complete standalone parallel network, but more likely a limited one that serves only clinical applications.

    While the data the mentioned researcher dumped into the network caused the crash, it was merely the proverbial straw. The amount of data the network shuffles around is astronomical - for example all imaging is online and images need to be passed all around (different clinicians, backups, etc). These images are huge (a CT scan for example may consist of the equivalent of 100 regular x rays), and need to be stored and transferred in a lossless format.

  121. Redundant Systems by Big_Daddy_CBT · · Score: 1

    We set up redundant systems at an airline's training centers to make sure the pilot training side wouldn't fail (apparently the cost of having pilots come back was huge - go figure...). In the end the training center was actually more redundant and reliable than the actual reservation systems at the airline and we enjoyed 98.5% uptime (save for small things like the power company killing power to the entire building without notifying us). To make matters more interesting this system was spread over two locations that are 600 miles apart.

    Essentially we had redundant routers put in place in each center so that if one failed the traffic would kick over to the second. In addition we had developed a small application that resided on the classroom computers that would check the application servers holding the training material. If the primary server was down it simply switched to the secondary server (there were three application servers in one center and two in another).
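    The client-side check described here can be sketched roughly as follows. This is not the original application, just an assumed shape for it: the host names, port, and the first_reachable helper are hypothetical, and the probe is a plain TCP connect.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

Server = Tuple[str, int]  # (host, port)

def tcp_probe(server: Server, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the server succeeds."""
    try:
        with socket.create_connection(server, timeout=timeout):
            return True
    except OSError:
        return False

def first_reachable(servers: Sequence[Server],
                    probe: Callable[[Server], bool] = tcp_probe) -> Optional[Server]:
    """Try servers in priority order and return the first that answers."""
    for server in servers:
        if probe(server):
            return server
    return None

# Hypothetical host names; the primary is listed first.
SERVERS = [("app1.example.internal", 8080), ("app2.example.internal", 8080)]

# Simulate the primary being down without touching the network.
down = {SERVERS[0]}
chosen = first_reachable(SERVERS, probe=lambda s: s not in down)
print(chosen)
```

    Passing the probe in as a parameter keeps the selection logic testable without a live network; in production the default TCP probe would be used.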

    Furthermore, our database servers (two in one location, one in another; the primary server was located in the larger center and all machines went there first) had a product called DoubleTake installed which would cause the backup server to assume the identity of the primary server in the event of a failure. DoubleTake also allowed us to mirror image the data on our servers for consistency in the event of a failure. This was important because if we had a WAN failure the database server in the smaller facility would activate and act as the database server for that facility (this actually happened - we had our IT work farmed out to a large support company, which I shall not name, that actually once failed to notice a T1 line had failed for OVER A MONTH!!!).

    There were a few glitches, such as the need to wait until afterhours to bring back the primary server in the event of a failure (if you didn't you would be bringing up another server with a duplicate IP address due to the DoubleTake software which caused all sorts of problems so both actually had to be brought down), but for the most part it works very well.

    Heck, even if all that failed we had stand alone machines that could run off of a CD. I think that may be a little difficult for a hospital to do though.

    Kris

  122. Duplicate Network? by sjlutz · · Score: 1

    The problem with the duplicate network is that it can fall victim to the same problems the original had. Say the first network goes down because of this problem. Ok.. first you have to re-patch all the network nodes into the new network (probably not an easy task). But if the new network is designed the same way, the professor replugs into the new network and starts his number crunching again. Now the new (2nd) network is down..
    Worse, with the 2nd network as a backup, they may never know what caused the problems, and therefore it wouldn't get fixed.
    It's kinda like putting a "backup" engine on a plane because the fuel is dirty and kills the engine.. it will kill both engines.. cleaning the fuel is a better fix..

  123. No, stupid by Anonymous Coward · · Score: 0

    The answer is to keep all life-critical systems on one completely separate network. Keep all research on another completely separate network.

    If another researcher brings down the research network, that's fine. No one is going to die. But the life-critical network would be untouched, and that is the whole point of having a parallel network.

    They should have done this in the first place. To not have done it was irresponsible. I would sue their asses off if I were a patient or a family member of a patient that died during those 4 days.

  124. I think they need a less reliable network by og_sh0x · · Score: 1

    I used to work at a county teaching hospital. They had a really ancient, parts-a-muffin, mixed-topology network. Each department had its own separate, incompatible system. These systems were chosen by the department heads, not the IT staff. They then had to use Siemens OPENLink to tie them all together. They had downtimes all of the time. So all the staff was prepared for downtime procedures, because they had to use them once or twice a week, at least for the four months I worked there. So maybe a less reliable network is in order?

  125. "Parallel Network" by Megane · · Score: 2

    The story I heard was that they had already approved the new network and it was still a few months away from being implemented when the old chewing-gum-and-baling-wire network prematurely fell apart.

    --
    #naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
  126. 2 pieces of a solution by WPIDalamar · · Score: 2

    Way I see it, there are 2 things that need to get done.
    1) Policy change. Only production machines on a production network.
    2) Topology change. Make it easy to get a non-production network connection so people don't violate #1

  127. Re:Reliability is inverse to the number of compone by Anonymous Coward · · Score: 0

    This is a moot question. No one in the USA takes trains. Pure fiction!

  128. Re:Reliability is inverse to the number of compone by Medieval · · Score: 1

    -----
    0.1 * 0.1 = 1
    -----

    Someone failed math...

    0.1 * 0.1 = 0.01 .....

  129. bad /. summary by BalloonMan · · Score: 1

    I must agree with all those folks complaining about the snippy zingers in /. articles recently. Please read the article *carefully* before you post a summary, and try to be more objective, "Michael".

    The hospital was already in the process of overhauling their network with the help of a consultant. Now they're going to accelerate that work (doh!).

    NOWHERE does it say they're going to build a "duplicate" network! They're going to add twice the amount of wire, but that's the only real detail cited, and that's hardly enough information to justify the petty jab.

  130. A redundant flaw is still flawed by sc0nway · · Score: 1

    Having a backup network based on the same design will not solve anything. The Ariane rocket disaster, caused by an arithmetic overflow, had duplicate processing, and when the first one crashed the redundant processor took over and crashed too because of the same overflow. The solution (IMHO) would be to limit the resources a single individual/process/workstation can have so a single user cannot flood the system, causing a crash. I also believe there should be a development database/system set up that the user could test against before exposing critical production systems (corporations have them - why can't hospitals?).
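    One common way to cap what a single user or workstation can consume is a token bucket. The sketch below is only an illustration of that general technique, not anything the hospital or its vendors are known to use; the class name, the limits, and the "units" (which could be packets, queries, or megabytes) are all assumptions.

```python
class TokenBucket:
    """Allow bursts up to `capacity` units, refilled at `rate` units per second."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, cost: float, now: float) -> bool:
        """Spend `cost` tokens if available; otherwise refuse the request."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False

# Hypothetical limit: 10 units/second sustained, bursts of up to 50 units.
bucket = TokenBucket(rate=10, capacity=50)
print(bucket.allow(50, now=0.0))   # the initial burst fits in a full bucket
print(bucket.allow(20, now=1.0))   # only 10 tokens have been refilled by now
print(bucket.allow(20, now=3.0))   # 30 tokens are available by now
```

    The clock is passed in explicitly so the behaviour is easy to reason about; a real limiter would read a monotonic clock and keep one bucket per user or port.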

  131. Healthsouth is building... by mtec · · Score: 1

    ...a completely digital, paperless

    *echo chamber effect* Hospital of the Future *end effect*

    here in Birmingham, AL (we have a big medical presence). I hope Scrushy's reading this...

    --
    Cake or Death? Cake Please!
  132. Redundant Networks for Patient Care by jcm · · Score: 2, Informative

    I spent three years (1995-1998) at Perot Systems as a consultant designing and implementing hospital networks for Tenet Healthcare (2nd largest hospital chain in the US). There was at least one hospital that had the budget and the foresight to see that reliance on the network would do nothing but increase.

    For that hospital, my network design was one that incorporated as much redundancy as possible at the time. For each patient care area, such as nurse's stations and ancillary areas such as radiology, cardiology, surgical theaters, etc., it was decided that each of the two network jacks would terminate in separate closets. This meant doubling the number of closets required in order to meet distance limitations, but the hospital had already started working on allocating that space for the closets. Also, for any important ancillary areas such as the lab and central supply, there were also two separate networks. For the server farms themselves, the Patient Care systems all had redundant connections to the primary and backup networks as well.

    As each wall jack terminated into a different closet, each closet had two separate networks as well. Each closet would house the primary network for half of the jacks served, and the backup network for the other half of the jacks served. The fiber paths from each closet took disparate paths back to separate data center rooms, one external to the main building of the campus and one inside the main building. At the time layer 3 switches, or switch routers such as the Foundry Big Irons or Cisco 6500s, were not available. So as much as I dislike using Spanning Tree, I had used it at the time. All priorities were manually set, though, so there was no doubt where the root was and where it would move to in case of failure.
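    The reason manually set priorities remove the doubt: spanning tree elects as root the bridge with the lowest bridge ID, which is the priority followed by the MAC address as a tie-breaker. The sketch below models only that comparison; the switch names, priorities, and MAC addresses are hypothetical.

```python
def elect_root(bridges):
    """Return the name of the bridge with the lowest (priority, MAC) bridge ID."""
    return min(bridges, key=lambda name: bridges[name])

# Hypothetical switches: (priority, MAC address). Lower priority wins; MAC breaks ties.
bridges = {
    "dc1-core": (4096, "00:00:0c:aa:00:01"),    # intended root
    "dc2-core": (8192, "00:00:0c:aa:00:02"),    # intended backup root
    "closet-3a": (32768, "00:00:0c:00:00:03"),  # default priority, lowest MAC
}

print(elect_root(bridges))

# If the primary fails, the root moves to the configured backup.
survivors = {name: bid for name, bid in bridges.items() if name != "dc1-core"}
print(elect_root(survivors))

# With every switch left at the default priority, the lowest MAC wins,
# which here would be a closet switch rather than a core switch.
defaults = {name: (32768, mac) for name, (_, mac) in bridges.items()}
print(elect_root(defaults))
```

    The last case is the one to avoid: with default priorities the root can land on whichever switch happens to have the lowest MAC address.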

    So, the switches terminated on another switch which was partitioned into several segments. Switch connections were made between the two data centers as well. Each segment had a connection to a Cisco 7507 Fast Ethernet port local to that computer room, and another in the second computer room. Forming the core were two sets of two Cisco 7507s. In order to prevent one OSPF network from affecting the other OSPF network, static routes were used (I would use BGP if I had to do it over again). Outside WAN connections were terminated redundantly on the two patient care networks as well.

    While the primary network in the hospital also supported the non-patient care areas (such as administration), the backup network was only for the patient care areas. That was just to prevent the type of thing that happened in the article, where something non-patient care related ends up taking everything down.

    Reverting to backup paper systems would be nearly impossible once the "tube" systems were sealed up. Much like the movie Brazil, hospitals used to have pneumatic tubes running all over the place, especially between the lab and the nurse stations. Running samples and results back and forth would definitely introduce a LOT of delay for a doctor trying to make a life and death decision.

    I am sure that I would design things differently these days (for one, Layer 3 would go all the way to every single edge switch and collapse on a fast switch router) but I think the design probably held together well. I should check back in someday and see how long and well it lasted, if they did replace it.

    Jay

  133. Mis-Managed or Not Managed at all? by Anonymous Coward · · Score: 0

    There really isn't enough information in the story to "ass-ume" any intelligent discussion. The answer to a fail-over network isn't really in the building of a second identical infrastructure, but really in a redundant design that lends itself to automatic failover. HSRP, STP, redundant core and distribution layers are all excellent tools to perform this type of redundancy, but if not set up properly, or managed properly are no good at all. If not constantly monitored for performance and faults, a network is only as good as the hardware itself - Budget/Finance and Administrators often bypass the ongoing expense of maintaining a network infrastructure once it's built (monitoring software/personnel). It will be interesting to see if Cisco issues a case study on the problems that caused this failure.

  134. Re:Reliability is inverse to the number of compone by Anonymous Coward · · Score: 0

    I guess that it was you who failed maths.

    1% = 0.01

  135. There comes a time in troubleshooting.... by AlphaInsight · · Score: 1

    when the time spent debugging the problem surpasses the time you would spend just doing it over. The hard part is determining when to give up on fixing it and moving forward with a new plan. Would any of you like to trace a network fault on some of the "Most dangerous server rooms in the world" (see The Register)?

  136. Re:Reliability is inverse to the number of compone by falzer · · Score: 1

    And 0.01 = 1%.

  137. Their Network Admins Don't know Shit from Shinola by Anonymous Coward · · Score: 0

    I can offer my services to straighten out their problem for $165.00/hr.

    Their Network admins have no clue! First of all...
    Why the hell did they design their network around
    spanning tree? Poor design leads to failures like this.

    Shut Spanning Tree Off. Lay out a plan, use routers
    where needed (yes ethernet routers too). Or VLAN
    if you must.

    Segment the network. Man these people have no clue.

    This is what happens when you have poorly laid out network.

  138. that last .001% is a bugger by briancnorton · · Score: 1

    This is of course redundant, but your webserver having 99.999% uptime is GREAT. A hospital having 99.999% uptime is a disaster. The ONLY way to responsibly manage a network like this is to build a redundant system. Fix what's broken of course, but have the backup. You do your best to make sure your company's database works all the time, but you still make back-ups, don't you?
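    For a sense of scale, uptime percentages convert to downtime with simple arithmetic. The sketch below assumes a 365-day year; the 98.5% figure is the one quoted for the airline training centers elsewhere in this thread, and 99.9% is added only for comparison.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for pct in (98.5, 99.9, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% uptime -> {minutes:.1f} minutes down per year")
```

    Five nines works out to a little over five minutes of downtime a year, while 98.5% is more than five days.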

    --

    People who think they know everything really piss off those of us that actually do.

  139. Contribution to causality responsibility by hey! · · Score: 5, Insightful

    Suppose you have footbridge crossing a stream that takes heavy traffic. One day, it collapses with many people on it. One of the people on the bridge weighed 300 lb.

    Would it be fair to say that the bridge collapsed because a 300 lb man was on it? It is completely clear that he contributed to the collapse of the bridge, in the sense that he contributed to the stresses on the structure. One might even say he is more responsible than a 100 lb woman who was also on the structure at the time.

    But, we'd generally expect that a footbridge be engineered to support a 300lb man. Or if not, to isolate the failure (e.g. the planks under him might fall out, but the bridge as a whole should not collapse). It's part of the designer's job to anticipate this.

    I've done a lot of troubleshooting in my time, of networks and other systems. One thing I've learned is that in the case of failure you just can't fasten on one thing that is out of the ordinary. At any given time, in a big enough system, something's bound to be out of the ordinary. Even if you can trace, step by step, the propagation of a problem from a single anomalous event, it is the capacity of the system to propagate the problem that is the real issue, at least if you take a conservative, defensive stance in design.

    --
    Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
  140. campus sized spanning tree in this day & age? by puzzled · · Score: 1

    I don't know much about this stuff, merely having the Cisco Certified Network and Design Professional certifications, and not yet having the CCIE, but here goes.

    In the bad old days before layer three switches became inexpensive networks were either routed or bridged. Spanning tree is a tool used on redundant layer two networks to detect and eliminate loops.

    If these guys were a hardcore Cisco shop and they used Cisco's Inter Switch Link (ISL) VLAN technology it is possible they might have a very complex topology with multiple spanning tree roots. That can't be done with the IEEE 802.1Q VLANs more commonly used today, but this sort of thing was deployed for campus redundancy in the mid 90s.

    The right solution in a case like this is ... messy. If they've really got a situation where they've got one big a$$ subnet with 1024 or 2048 IP addresses in it they're pretty much going to have to *build a parallel network* with proper L3 equipment and an IP address allocation plan, then go floor by floor and convert users to that scheme.
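    An IP address allocation plan of the kind described can be sketched with Python's standard ipaddress module. The address range and the one-subnet-per-floor naming are assumptions for illustration; the only figure taken from the comment is the 2048-address flat network.

```python
import ipaddress

# Hypothetical flat campus network: one /21 holds 2048 addresses.
flat = ipaddress.ip_network("10.20.0.0/21")
print(flat.num_addresses)

# Carve it into /24 subnets, one per floor or wiring closet.
subnets = list(flat.subnets(new_prefix=24))
plan = {f"floor-{i + 1}": net for i, net in enumerate(subnets)}

# List each subnet with a conventional first-address gateway.
for floor, net in plan.items():
    print(floor, net, "gateway", net.network_address + 1)
```

    Each /24 then becomes its own routed segment, so a broadcast storm or spanning tree problem on one floor stays on that floor while users are migrated over.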

    It sounds crazy, but I've been responsible for a campus with 800 MAC addresses in the core switch's CAM table and it is the easiest, safest route to take.

    --
    I am very easy to get along with, but I don't have time to waste being nice to people who are being stupid. -Theo
  141. CWRU by Ececheira · · Score: 1

    Gee, you must be at CWRU. As an alum, no other school's network could have been as poorly designed (then).

  142. Redundancy is alway by Anonymous Coward · · Score: 0

    Hi guys,

    I work as a system integrator for one of the biggest manufacturers of telecom equipment, and I don't understand how something that crucial could not have a redundant network. Whenever we sell something to any telco, the first question they ask is always "What's the backup plan if this or this or this fails?" and the second question is always "What is the backup to the backup?".

    I can't believe that a network built for something as important as a hospital could not have a redundant network, or at least redundant nodes (switches, routers,..). If that's the case, then the guy who designed this network should be shot :)

    K

  143. Generic Slashdot response to this question by Anonymous Coward · · Score: 0

    If only the hospital had been using computers with Linux (+1 for insightful) and used an open-source model (+1 for interesting) then it would've been more stable. Plus the open-source community (+1 for underrated) would have it back up and running minutes afterwards if it did crash.

    The only thing Microsoft I want to see in that hospital is Bill Gates with the AIDS (+1 for funny).

  144. Fraternal Twins by SEWilco · · Score: 5, Interesting
    I hope the "second redundant network" uses equipment by a different manufacturer and has at least one network technician whose primary duty is that network. That person's secondary duty should be to monitor the primary network and look for problems there. Someone in the primary network staff should have a secondary duty to monitor and check the backup network.

    The ideal would be to actually use both networks, such as by using each on alternating weeks. This ensures that both networks can handle full normal operations and are both operational.

    1. Re:Fraternal Twins by SEWilco · · Score: 1

      (I forgot to point out that one of the reasons to have a dedicated staff is to avoid having the same design mistakes done to both networks. The staff of the second network should participate in all-network meetings which discuss status and problems, but not necessarily the problem-solving meetings. If the two network staff try to solve problems independently then design errors are less likely to be duplicated. The two groups should compare their solutions so they can teach each other and perhaps find even better solutions, be aware that one solution may be better for one set of equipment, and if two solutions are about equally good then each network use their own best solution.)

  145. Re:Reliability is inverse to the number of compone by secolactico · · Score: 1

    HA! They'll both fail when they meet in the middle!!!


    MUAHAHAHAHAHA!

    --
    No sig
  146. Re:Reliability is inverse to the number of compone by gorf · · Score: 4, Informative

    No.

    You can only multiply them together like you have done if the two variables are independent.

    Here this is clearly not the case; if the networks are identical and one fails, it is more likely that the second will fail because the cause might be identical.

  147. Unreliable? Eh? by Psiren · · Score: 2

    Seeing as these paper forms hadn't been used for 6 years, I'd have to assume that the network was very reliable. Problems do occur from time to time, but it doesn't mean that the whole thing should be replaced. Just fix the issue and move on.

  148. Ignorance is bliss, I guess. by Mordant · · Score: 1, Flamebait

    Dude, you so don't know what you're talking about; Cisco is the #1 supplier of layer-3 switching gear in the world:

    http://www.cisco.com/en/US/products/hw/switches/index.html

    Nor is it true that 'Cisco equipment runs a new instance of spanning tree each time a new VLAN is created'. You have to know what you're doing, of course, but it's very easy to create a very large layer-2 spanning-tree domain with a good-sized ST diameter. With good network design principles (read more on http://www.cisco.com, attend their Networkers sessions) and an understanding of how the equipment works, this sort of problem should never occur.

    1. Re:Ignorance is bliss, I guess. by xaoslaad · · Score: 1

      I prefaced my message by saying that I'm not up to snuff on spanning tree. We don't use it. Period.

      I have seen examples of spanning tree in classrooms, etc. and some persons say 'greatest thing in the world'
      By far though, I see many many more steering clear of it. I also made my ignorance about Cisco layer 3 equipment clear ;)
      Was only based on talking around as I said. And we all know how much talking around gets you. =)

      I suppose that I should read about (and should I be interested in actually implementing a new network I would have vendors bidding and coming in to show me the benefits of their networking equipment) etc.

      I do have complaints beyond that from my VERY limited use of Cisco switches, but it is not relevant to the story at hand (I will share if you want to enlighten me in my perhaps ignorant beliefs; but not to start a flame war); I merely brought up Extreme because I do use the switches and I do like them.

      The simple truth is that I'm not interested _at this time_ though. The chances of me getting my boss to let me get even one cisco switch in here to play with and really learn is slim.

      That is not to say Cisco does not make a nice product. I do very much like their routers. They are king of the routing world for a reason. I have used Bay, Olicoms, and Cisco, probably one or two others... but nothing in any way compares to Cisco.

      It was not my point to start a flame war about whose network gear is better. Apologies if that is how it came across.

    2. Re:Ignorance is bliss, I guess. by Anonymous Coward · · Score: 0

      You must have a very small (read: one segment) network, if you aren't using spanning-tree; elsewise, you're likely to bring the whole thing crashing down around your ears if you plug the wrong switch into the wrong port on another switch.
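      To make the loop risk concrete, here is a minimal sketch. It is not 802.1D itself: the switch names and links are hypothetical, the root is handed in rather than elected by bridge ID, and every link is assumed to cost the same. It builds a breadth-first tree from the root and reports which links would have to be blocked to keep the topology loop-free.

```python
from collections import deque

def spanning_tree(links, root):
    """Return (forwarding, blocked) link sets for an undirected switch graph.

    Simplified illustration: real STP elects the root by bridge ID and
    compares path costs; here the root is given and every link costs 1.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    visited = {root}
    forwarding = set()
    queue = deque([root])
    while queue:
        switch = queue.popleft()
        # Sorted so the result is deterministic for a given topology.
        for peer in sorted(neighbors[switch]):
            if peer not in visited:
                visited.add(peer)
                forwarding.add(frozenset((switch, peer)))
                queue.append(peer)

    all_links = {frozenset(link) for link in links}
    return forwarding, all_links - forwarding

# Hypothetical topology: three switches plus one patch cable closing a loop.
links = [("sw1", "sw2"), ("sw2", "sw3"), ("sw3", "sw1")]
forwarding, blocked = spanning_tree(links, root="sw1")
print(sorted(sorted(link) for link in blocked))
```

      Without something blocking that third link, a broadcast frame entering the triangle would circulate indefinitely, which is the "wrong switch into the wrong port" case.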

    3. Re:Ignorance is bliss, I guess. by Anonymous Coward · · Score: 0

      Look.. and this should be a WRITTEN rule on slashdot instead of unwritten.

      If you don't know what you're talking about, don't talk about it.

      You were even aware of the fact that you didn't know what you were talking about, but proceeded to make outrageously false statements. It would have been so much better to just lurk in the comments and maybe learn something, rather than potentially mis-educate people.

  149. Why not power-cycle whole complex? by dpbsmith · · Score: 2

    The Globe was indeed short on technical details. What puzzles me is that they say the network was down for four days.

    NOT a rhetorical question:

    Why didn't they power-cycle the whole complex? Maybe even literally? Presumably a hospital should be able to handle a short interruption in AC power... and presumably the network equipment wouldn't preserve the "I'm-broken-state" in nonvolatile memory. Why wouldn't a scheduled power outage for 10 minutes at 2 a.m. in the morning have been less disruptive than the network being down for four days?

    Less drastically, couldn't they have called every operator and system administrator in and said "Synchronize your watches... at 2 a.m. power off every piece of computer gear within a hundred feet of your chair, then at 2:10 a.m. power them on again?"

    1. Re:Why not power-cycle whole complex? by rave77 · · Score: 1

      even tho they have downtime procedures, hospitals hate using them. They lose thousands of dollars an hour in lost labor when the nurses have to call the lab/radiology departments rather than placing the orders online.

      Downtimes that last for days result in lawsuits for millions of dollars.

      I think most of this could be avoided with more IS training on downtime procedures to get things back up, and basic troubleshooting skills (example, dont panic).

      Most sysadmins brag to me about how long their unix servers have been running. Then when the UPS fails (and they all do) they have no idea who to call or how to restart databases. Most dont know what half of the machines in the computer room do. How are they backed up? No idea.

      Then on the other end of the spectrum are the ones that insist on customizing every bit of their systems, thus breaking support's tools and increasing work/time to fix matters.

      My point is this. Planned downtime typically helps more than it hurts. Ask questions before you assume changing something wont matter.

      This sounds awfully obvious to me but it needed to be said.

  150. Interplatform connectivity is a specialty of mine. by MrJerryNormandinSir · · Score: 1

    Spanning Tree is not the answer. Especially when you
    have multiple platforms speaking multiple protocols.
    You need to segment the network, each department
    should be on its own segment, its own network address. As for other protocols... yes they
    should be segmented as well. Each department should have its own ethernet router. In a hospital.. preferably a fiber router.
    Shut Spanning tree off! Damn!
    Ok.. I'd go with a higher end central router.
    A Cisco 7000 series if you are going fiber.
    A 3600 series if you are just running 100 BaseT cat 5. And here's where my expertise comes in.
    Program the routers properly. Do not use any autodiscovering protocols. That goes for all your
    protocols! And if they have Novell... don't SAP
    every minute, SAP maybe every 10 minutes or so.
    Static routes should be used for IP, don't use RIP. And a poorly managed network can
    come crashing down if Spanning Tree is used.
    In college I called this failure "Packet Avalanche". I bet if I put my Linux based laptop
    on the network and analyzed the traffic there would be collisions up the wazoo.

  151. When You Tune to Channel 9 at 8 O'Clock... by RobotRunAmok · · Score: 2

    ...the TV show you intend to watch is there. It may begin a few seconds late, on purpose or as a result of some discrepancy, but the TV show you want to watch is there.

    For the past few years, networks on the national and local levels have all been switching over to server-based content play-out. TV from Computers! How Exciting! How Wonderful! How... frickin' scary, for those whose jobs it has been to ensure that Buffy plays down at 8, and not 8:02, or 8:15, or - Powers-That-Be Forbid! - Wednesday morning.

    Professional TV Master Control operations traditionally operate (often contractually) to "five 9's" of reliability, 24x7, assessed monthly. Full Stop, Period, End-of-Story. TV Master Control geeks, their supervisors, and the maintenance engineers who support them have ever been a priesthood apart when it comes to worship at the Uptime Altar.

    So what has their industry done, to ensure that all this "new wave" server and automation technology provides them with the same reliability as manual control and tape-based playback? Why, buy two of everything, of course! EV-ER-Y THING!

    The server industry is only getting around to understanding that now, and is beginning to price their wares accordingly. I've attended dozens of vendor meetings over the past ten years where the salesguys, who six months earlier were selling mailservers to sysAdmins, are now selling their new video servers to Master Control guys. (Chum dished into a shark tank is the only comparable visual I can come up with.) What makes the sale is never the reliability of server over tape or (especially) the quality of server over tape, but desire of management to run more channels with fewer bodies. In the past this has led to management re-assessment of just how "inexpensive" server-based playout technology was and, in many cases I have seen, an increase in the number of channels created or planned as a means to justify the hardware costs.

    The only debate point in most TV Master Controls comes down to what components are in-chassis redundant, which are external-chassis "hot" spares, and which are shelf spares.

    My point (and I do have one...) is how it is unconscionable that a hospital where lives are at stake, lacks the war-room mentality that an entertainment operation has. It's real simple at the end of the day to assess which components in a network --info or video or both - chain are critical, and buy two of them and keep it all lit and tested. Lives are at stake, and your signature is on the shift report? You rent a tertiary back-up system to bring online while you do your regular and frequent preventive maintenance on your primary and secondary.

    The guys who take care of Buffy do it. I would have thought that the guys who take care of sick babies and grandmothers would be playing in the same league.

  152. Doh! by jbmoll · · Score: 1

    Dag-blamed technology always be messin things up!

    --
    J Moll - PC Load Letter - I know what it means!-
  153. Its been coming for a log time by bolix · · Score: 5, Informative

    I've consulted here. No not on the network design! Desktop staff - big hello to the much expanded Research Support team!

    AFAIK the BI network has gradually evolved from the 60/70s and has included several massive growth spurts to incorporate the expansions, refits, windfalls etc. I once participated in an after-hours Cisco cutover where we yanked connections and waited for the data to flow (IPX round/robin servers listing) to find the specific segments affected. Very much a live trial and error process.

    I got the feeling no-one is completely certain where/how all the data flows especially in the older Research segments e.g. Dana Farber. In fact, I'm guessing this is where the failure originated. Heavy duty number crunching and spanning tree errors lead me to some sort of distributed unix process across network segments. I want to blame a certain notorious geek (Dr P's) unix and mac labs but in truth it could be any one of the overworked and underfunded labrats in any of the segments.

    The wiring closets used to look way worse than any posted at the recent Register article. A single Cat 5 cable run to a data jack is sometimes split to host 2 connections: unfortunately as the Research areas are grant funded, this is still bloody cheaper than a hub/switch! There is probably still some localtalk cabling in some labs, coax runs to a DG and Novell serial connections with 1 or 2 Mac Classic and SE holdouts running Dos and DG terminal emulators!!!

    The network team in the Hospital (2 afaik) coped with daily routing failures, buggy failovers, the crappy Novell IPX 802.3 implementation and servers around every corner. Those folks team with a great desktop staff to nursemaid outdated equipment into the 21st century. It stuns me to this day what a superior job these folks did and probably do. They certainly made my job easier.

    I feel this could have happened any time and disaster has been averted one too many times before. Halamka and the exec staff owe these guys more than just a few column inches of chagrined praise.

  154. Problem was with bad Business Practices... by Alyeska · · Score: 2, Insightful
    Yes, the network failed. Good businesses -- including hospitals -- will allow for system failures through contingency planning.

    I develop business practices for large industries (including in the past the Trans-Alaska pipeline, et. al.). These industries rely heavily on computers, and each has developed plans and trained their critical personnel for emergencies like power failures, computer failures, etc. Reliance on a single tool to protect safety & environment is bad, m'kay?

  155. Interconnections by Anonymous Coward · · Score: 0

    The more things change....

    Those confused or interested in a good grounding should be reminded of Radia Perlman and her wonderful seminal book "Interconnections" subtitled something like "The theory of bridges and routers". As the inventor of the spanning tree algorithm and currently a Sun employed networking guru in the Boston area, perhaps a savvy CCIE would have consulted her on this and thus shortened the MTTR. Those reading only "quick start" guides to certification, rather than broader texts, get what they deserve.

    PS: each chapter starts with a humorous quote to enliven serious topics

    1. Re:Interconnections by netwiz · · Score: 2

      To begin with, it's unlikely a CCIE would have required a consultation w/ the inventor of the protocol, as they'd already have a firm understanding of the inner workings of STP. And there is no "quick start" to a CCIE. That's why there's less than 10,000 of them in the world. And why, even in the depressed tech market, CCIEs are still followed by headhunters bearing offers of $100K+/yr jobs...

  156. Re:Contribution to causality responsibility by timeOday · · Score: 5, Informative
    I agree, and let me refer you to a real life example. The USS Yorktown is that very famous Navy ship that was immobilized by a network outage. The whole thing was triggered by some seaman entering a 0 where he shouldn't have, so the Navy made some attempt to pin it on him. But it didn't fly. Operational errors like that are routine. It shouldn't have crashed the app. Having crashed the app, it shouldn't have taken down the whole network.

    If one researcher sitting at his desk can take down the whole hospital system accidentally just by "overusing" the network, it's just a matter of time.

  157. Sure doesn't look like hardware was the big issue by Anonymous Coward · · Score: 0

    When Cisco was called on for help, they didn't redirect their customer to a 900 number, they didn't shuffle them off to a service contract salesperson. They just rolled up their sleeves and solved the problem. It may have been Boston area Cisco engineers in the trenches but there were Cisco engineers in San Jose, RTP and probably elsewhere involved in this.

  158. Mission Critical Networks 101 by rhoads · · Score: 5, Interesting

    One of the fundamental concepts in building mission critical networks is what is referred to as "A/B Diversity" -- also sometimes called "salt and peppering". The idea is that you build two or more physically and logically separate network infrastructures and distribute the user population evenly across them. Thus, when a catastrophic failure occurs in one of the network "domains", the other will continue to function and business can continue in "degraded" mode.
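    A rough sketch of the "salt and pepper" idea, assuming a simple round-robin assignment (the desk names and the four domain labels are made up for illustration). It spreads users across domains and then checks what share of them is hit when one domain goes down.

```python
from itertools import cycle

def assign_domains(users, domains):
    """Round-robin users across domains so neighbors land on different networks."""
    return {user: domain for user, domain in zip(users, cycle(domains))}

def fraction_affected(assignment, failed_domain):
    """Share of users who lose service when one domain fails."""
    hit = sum(1 for domain in assignment.values() if domain == failed_domain)
    return hit / len(assignment)

# Hypothetical floor of eight desks split across four network domains.
users = [f"desk-{n}" for n in range(8)]
assignment = assign_domains(users, ["A", "B", "C", "D"])
print(fraction_affected(assignment, "A"))
```

    With four evenly loaded domains, losing one takes out a quarter of the users while the rest keep working in "degraded" mode.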

    We have a long way to go before data networks reach the stability of, for example, the public telephone system. The modern reality is that these networks are susceptible to a host of trivial yet potentially catastrophic failure scenarios. Spanning Tree (STP) is a very unreliable protocol. It has the potential to fail under many conditions such as the presence of physical errors, very high load, or as a consequence of a bug in the OS of one or many network devices.

    Broadcast storms will occur. ARP storms will occur. OS bugs will crop up. Facilities personnel will play jump rope with your cable plant.

    These problems can be mitigated, but not eliminated, by good network design. Thus, in environments such as hospitals and banks, where the cost of network downtime is too great to bear, it is common practice to build one or several parallel infrastructures as I have described.

    FUNNY NETWORK TRICKS

    I used to be in charge of the NOC at a large investment bank in New York. One of our buildings had six floors each housing 1,000 equities traders -- and this was during the stock market boom. Network downtime was not tolerated during trading hours. Therefore, the building was divided into four separate network domains connected to each other, server farms, and the WAN/MAN environment via a layer-3 core.

    -- One time a printer became wedged and proceeded to send out ARP requests at the rate of thousands per second. The flood of messages pegged the CPUs of the routers servicing that domain and brought network services to a halt. Time To Resolution: 20 minutes (proud to say) to deploy sniffer, identify offending host, and rip its cable out of the wall with extreme prejudice. % of building affected: 25.

    -- Over the course of several months, the Novell/NT team progressively decommissioned Novell servers and replaced them with W2K servers. Unfortunately, nobody thought to turn off the Netware services in the roughly 1,000 printers deployed throughout the building. On one glorious day, the very last Netware server was decommissioned in a particular domain leaving the printers in that domain with no server to "attach" to. The resultant flood of SAP messages became so great that the Cisco routers could not service them in a timely manner and they became cached in memory. The routers would gradually run out of memory, spontaneously reboot, and repeat the cycle. Time To Resolution: ONE FULL DAY. % of building affected: 25. Number of hours spent in postmortem meetings: ~15.

    -- On several occasions, Spanning Tree failed resulting in loss of network services for the affected domain. Time To Resolution: 15 minutes to identify problem and perform coordinated power cycle of Distribution switches. % of building affected: 25.

    And the list of stories goes on. You get the point.
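    For the wedged-printer case, the "identify offending host" step amounts to counting ARP requests per source over a short window. A minimal sketch, assuming the capture has already been exported as (timestamp, source MAC) pairs; that record format, the threshold, and the MAC addresses are all hypothetical.

```python
from collections import Counter

def find_arp_flooders(packets, window_start, window_seconds, threshold_per_second):
    """Return source MACs whose ARP request rate in the window exceeds the threshold.

    `packets` is an iterable of (timestamp_seconds, source_mac) tuples, one per
    ARP request. The format is an assumption, not any particular tool's output.
    """
    window_end = window_start + window_seconds
    counts = Counter(
        mac for ts, mac in packets if window_start <= ts < window_end
    )
    limit = threshold_per_second * window_seconds
    return sorted(mac for mac, n in counts.items() if n > limit)

# Hypothetical capture: one wedged printer sending 2,000 requests in a second,
# and a normal host sending three.
capture = [(i / 2000.0, "00:00:5e:00:53:01") for i in range(2000)]
capture += [(t, "00:00:5e:00:53:02") for t in (0.1, 0.5, 0.9)]
print(find_arp_flooders(capture, window_start=0.0, window_seconds=1.0,
                        threshold_per_second=100))
```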

    1. Re:Mission Critical Networks 101 by Anonymous Coward · · Score: 0

      I'm constantly amazed at how crappy networking hardware is. The cases you presented just make it more clear to me.

      - Why in the hell would a router become completely clogged because of a flood of ARP messages? That doesn't make any sense. How hard would it be to make the router throttle itself?

      - Why in the hell would a router "run out of memory". Damn, I mean didn't they test those conditions before selling the router? Again, it should manage its memory efficiently and throttle when needed.

      I mean, is this old technology or something? Cheap hardware? What is the deal? I would think you could get decent hardware from somewhere? Why is this hardware so crappy? I mean, it's not rocket science, its just a hardware device for routing/switching or whatever, should be a very simple OS running it with a equally simple network stack (*).

      * I say "simple" in relative terms. Compared to say the Linux kernel these routers would be simple. Fixed known hardware, etc...

    2. Re:Mission Critical Networks 101 by Anonymous Coward · · Score: 0

      If you think it's so damn easy to make perfect software/hardware that never breaks down under any circumstances, may I suggest that you start a business and get rich by doing so?

      You know maybe 1% of the situation and yet you feel the need to spout off as if you are an expert on the situation.

      Welcome to /.! You'll fit in nicely here.

    3. Re:Mission Critical Networks 101 by Anonymous Coward · · Score: 0

      Where did I say I was an expert? In fact, that's why I posted because I'm wondering why it's so crappy. I'm not "spouting off" as you say, I was repeating the parent poster's issues.

      You are correct, I only know 1% of the situation. If you know so much, then help me out. Why is it so hard?

      If I knew that much I wouldn't have posted anything about it. I imagine you don't know much more either otherwise you wouldn't be so quick to flame.

    4. Re:Mission Critical Networks 101 by GiMP · · Score: 2

      > Why in the hell would a router "run out of
      > memory". Damn, I mean didn't they test those
      > conditions before selling the router? Again, it
      > should manage its memory efficiently and throttle
      > when needed.

      Yes, routers can run out of memory.. just like any other device. Your router should have enough memory to perform well for its situation.. however, it is unavoidable that under an attack (intentional or non-intentional) your router can run out of memory...

    5. Re:Mission Critical Networks 101 by Anonymous Coward · · Score: 0

      Yes, routers can run out of memory.. just like any other device. Your router should have enough memory to perform well for its situation.. however, it is unavoidable that under an attack (intentional or non-intentional) your router can run out of memory...

      I don't get this. A "simple" device like a router should never reboot or go down. Its goal should be to keep performing its duty, even if it can't do everything asked of it.

      If memory is filling up then the router needs to take action and operate at a reduced rate even if that means dropping packets (or whatever; there are other things that could be done too). Running completely out of memory and then rebooting would be stupid and shows very poor design. I know that it's just about impossible to have a 100% perfect system, but in the case listed above the router appeared to have run out of memory then crashed and rebooted.... bad design.
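      Something like the following is what I have in mind. It is only a sketch of tail drop on a fixed-size queue, not how any real router OS manages its buffers; the class name and packet labels are made up.

```python
from collections import deque

class BoundedPacketQueue:
    """Fixed-capacity FIFO that drops new arrivals when full (tail drop)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.packets = deque()
        self.dropped = 0

    def enqueue(self, packet):
        """Queue the packet, or count it as dropped if there is no room."""
        if len(self.packets) >= self.capacity:
            self.dropped += 1
            return False
        self.packets.append(packet)
        return True

    def dequeue(self):
        """Return the oldest queued packet, or None if the queue is empty."""
        return self.packets.popleft() if self.packets else None

# Five arrivals into a queue with room for three: two get dropped, memory
# use stays bounded, and the device keeps running.
queue = BoundedPacketQueue(capacity=3)
for n in range(5):
    queue.enqueue(f"sap-{n}")
print(len(queue.packets), queue.dropped)
```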

  159. no, identical networks crash in identical ways by Anonymous Coward · · Score: 2, Insightful

    Interesting how even an army of Cisco engineers couldn't fix the problem. Perhaps a testament to how overly(and needlessly) complex cisco's equipment is...and/or, how bad their certification/training is.

    As for "identical separate network", at my old company, we had a pair of Cisco PIX units that were configured in stateful failover; this means they share enough information that if one keels over, not a single connection is dropped.

    Unfortunately, the PIX OS release had a bug that would cause a crash every so often, and guess what?

    One would crash, then the second would crash immediately.

    As mentioned, the issue here was completely improper network structure, with research and production networks one and the same. Does this mean someone can walk in with a laptop and start spewing data and/or false routing info and crash the entire hospital? The responsible parties should be FIRED, given today's labor market; absolutely inexcusable.

    I'd also guess improper change control procedures were involved here as well.

    Whoever handles the hospital's emergency preparedness should also be fired for not keeping staff familiar with alternative methods(gasp, PAPER!) What if they had a power failure? Happens all the time, and not always because of external causes..."keeping the power on" is not as simple as "install a big backup power plant for the place." As Exodus discovered once at their CA datacenter, backup generators don't always work.

    1. Re:no, identical networks crash in identical ways by Large+Green+Mallard · · Score: 2

      CCNA = Can Crap - Not Assist :)

  160. Why Not? by Anonymous Coward · · Score: 0

    "Do you think the answer to having a massive and unreliable network is to build a second identical network?"

    The answer to having a massive and unreliable operating system was to build a second, more reliable operating system named Linux. If we can do it with an OS, why not do the same with a network?

  161. Counterexamples by hey! · · Score: 3, Interesting

    As pointed out elsewhere, the key assumption is independence -- that breakdowns are like rolling dice. You have to consider the causes of the failure. Virtually every realistic scenario you can think of has a dependent aspect which links the possible failure of trains.

    Here are some examples of the ways in which failures can occur that have implied linkages:

    (1) Both trains are damaged by an earthquake.

    (2) New instructions for routine maintenance were printed incorrectly (e.g. causing the mechanics to under torque a critical bolt).

    (3) The firm has cut the maintenance budget and is neglecting routine maintenance.

    (4) The train is sabotaged by disgruntled employees or terrorists.

    (5) Fuel filters delivered by manufacturer are faulty or incorrectly manufactured.

    (6) Design flaw means trains do not meet expected performance specifications.

    In reality, failures tend to be linked rather than independent. You can't use simple multiplicative logic; you have to use Bayesian logic. P(B|A) ≠ P(B): the probability of B given A is different from the probability of B in the absence of any other information. The FAA and military know this. If an aircraft crashes, then all aircraft of the same type are typically grounded for a period while the problem is analyzed to eliminate some kind of systematic flaw.
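    A small worked comparison, with made-up numbers purely for illustration: suppose each train (or network) fails on a given day with probability 0.01, and suppose that once one has failed, the shared causes listed above push the other's failure probability up to 0.5. Neither figure comes from any real data.

```python
def both_fail_independent(p_a, p_b):
    """P(A and B) under the independence assumption: P(A) * P(B)."""
    return p_a * p_b

def both_fail_dependent(p_a, p_b_given_a):
    """P(A and B) in general: P(A) * P(B|A)."""
    return p_a * p_b_given_a

p_fail = 0.01          # hypothetical daily failure probability of one unit
p_second_given_first = 0.5  # hypothetical: shared cause makes a second failure likely

naive = both_fail_independent(p_fail, p_fail)
linked = both_fail_dependent(p_fail, p_second_given_first)
print(f"independent estimate: {naive:.6f}")
print(f"with linked failures: {linked:.6f}")
print(f"ratio: {linked / naive:.0f}x")
```

    Under these assumed numbers the multiplicative estimate understates the chance of losing both by a factor of fifty, which is the whole point about redundancy built from identical parts.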

    --
    Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
  162. WELL DUH!!!! by Anonymous Coward · · Score: 0

    Slashdot is all about personal agendas.

  163. Downtime Procedures by Kraegar · · Score: 5, Insightful
    Posting this kind of late, but it needs to be said.

    I work at a hospital, on the networking side of things. It's a fairly large hospital, and we've got some pretty amazing tech here that runs this place. But BY LAW we have downtime procedures. ALL STAFF MUST KNOW THEM. We have practice sessions monthly in which staff uses downtime procedures (pen and paper) to ensure that if our network were to be completely lost, we could still help patients. It's the friggin law. Whoever fucked up and hadn't looked at downtime procedures in 6 years should be fired. That's just bullshit.

    I don't know how that hospital was able to pass inspections.

    1. Re:Downtime Procedures by Anonymous Coward · · Score: 0

      I'm betting this affected research and the lab and the like. Hospital staff (esp nurses) are competent folk who will bloody chase you down so that they know what to do in an emer. I'm sure the staff moved smoothly, and they all knew what to do. I'm also pretty sure that JCHAO or jco or however you spell it requires bi-annual review of downtime procedures.

  164. Backup Network sounds right... by Anonymous Coward · · Score: 0

    Simple solution: get another network to back up the production one...it does not have to be as fast since this is going to be used temporarily until the full production network is up again

    Extra solution: make at least a couple of environments for work...development and production

    Extra extra solution: back up production with a fail over environment (ie. when production falls this fail over will pick up where production left off)

    Ack...just my 2 cents...

  165. Re: Thick Coax links by Ashurbanipal · · Score: 2, Informative

    Etherhose (10b5 thick coax) is a useable networking technology. It has very good resistance to RFI/EMF. Lots of hospitals still run it, on links where 10 Mb/sec is sufficient.

    Etherhose is no longer a good investment because it is labor-intensive to work with (vampire taps, and thick, heavy cabling) and because nobody is developing the technology any more.

    Today, fiber optics might seem a better choice for noise isolation, since the cost has come down to a reasonable level.

    However, glass has the same potential for future obsolescence as etherhose - I have a half-dozen mutually incompatible fiber links here. And termination, splicing, and interconnection of fiber is at least as difficult as working with etherhose... having done both, I'd say drilling for a vampire tap is easier.

    In short, don't replace a working piece of infrastructure needlessly (wait until you project a need for additional bandwidth) and for noise isolation cat 5e in a grounded metal conduit is probably your best bet. Large diameter, professional quality conduit runs through electrically noisy areas are costly but also a very safe investment.

    I wouldn't knock that old etherhose - it does its job quite well, far better than the 10b2 thin coax that replaced it ever did. And it's far more physically sturdy than anything else outside of conduit.

  166. Dr. Network? by jcknox · · Score: 1

    From the Globe article:

    "It was Dr. John Halamka, the former emergency-room physician who runs Beth Israel Deaconess Medical Center's gigantic computer network."

    So the network admin is a former ER doctor? Since when did:

    1. Network admins make more than ER doctors?
    2. Med schools teach Cisco?

    Sounds like a case of acute administratia to me.

  167. Obviously Not by RAMMS+EIN · · Score: 1

    ``do you think the answer to having a massive and unreliable network is to build a second identical network?''
    Obviously having two, or any number, of unreliable networks doesn't build one reliable one. If some user can take down the first net, he can also take down the second or nth net. If this user has bad intentions, he likely will. If the taking down was due to a program running wild (apparently the case here), it might happen again. More backup does increase reliability here, but never makes it really reliable. What does make a network reliable? Nothing does. They can always be trashed. I think they know that in Twente. At best, a network can be reasonably reliable, and what makes a network reasonably reliable depends on what is reasonable.

    --
    Please correct me if I got my facts wrong.
  168. Oh come on by Flamesplash · · Score: 2

    I was hoping for at least a funny. :)

    --
    "Not knowing when the dawn will come, I open every door." - Emily Dickinson
  169. offtopic? riiight by uberred · · Score: 1

    oh, forget it.

    --
    Time is an illusion, lunchtime doubly so. --Ford Prefect
  170. stp, failure, applications... by zerodvyd · · Score: 1

    all sensationalization of every multi-site net admin's worst nightmares aside:

    fundamental problems exist with the picture this article painted.

    1) a researcher's "data" brought down the network ?

    first off, critical hospital functions should be separate. Their own VLAN at a minimum. This is stability we're talking about here.
    Second, when told that his data crunching was hurting performance, he should have done what he could to stop the application gracefully, not just "pull the plug"

    2) network design.
    This is the VLAN issue. a properly designed multi-campus network has separate networks for separate functions. if they were one big flat network, then yea, him pulling the plug would cause all sorts of hell...as each and every switch flooded layer 2 broadcast frames out every port trying to find his station. layer 2 broadcasts (broadcasts in general) are _BAD_.

    3) ER physician turned network addict.

    I'm not going to bag on anyone, really. However, the article fails to mention his network administration qualifications. How many years experience does he have configuring network gear? did he do this in his spare time? Seriously, seems to me that they need a bona fide Network Engineer or two.

    To answer the question: backup network?
    many NOCs have redundant networks. Some companies do the same for mission critical network gear. VLANs should be sufficient if admin'd correctly.

    As far as Spanning Tree Protocol 'failing' goes, I've not heard of such; please point me to concrete examples! As far as Cisco's implementation being 'boogered,' I don't think so. It works the way it is supposed to. Yes, their switches offer the option of turning off part of spanning tree for end nodes (spanning-tree portfast), but properly used this doesn't present a problem (see above, design/qualifications).

    A previous poster noted that stability should be of paramount importance in a hospital. They are absolutely correct. However, stability does come at a price, with HIPAA looming over all things EMR (Electronic Medical Records) you have to keep on your toes. Stable may not mean secure, and since that is one of HIPAA's stipulations, you have to go with secure (or relatively so).

    That being said: a Layer-3 switched network should be more than adequate for a multi-campus network with segregated NOC. A fully redundant near-line or off-line network may be overkill, but not altogether unnecessary. With a heavy-iron Cisco Catalyst driving the network at the core and Catalyst 3500 series switches at IDFs this should prove to be a very manageable and strong network.

  171. Data storms... by SwedishChef · · Score: 2

    This outage was caused by a researcher's data creating a storm of data which outpaced the network's ability to cope. The problem was allowing the research data to flow unimpeded across vital systems. The solution is to implement methods of controlling bandwidth, not just routing.

    In order to prevent this from happening again, engineers should analyze the system to determine where to put data storage. In this case, almost certainly (although the article is unclear) data was stored in a central location but spanned across several servers and then backed up in another location. One part of the solution is to have distributed data storage spread across the institution and then that data backed up (across a separate network) to a central location.

    The data storm itself could be prevented by using QoS bandwidth management. Of course, every network user believes that he/she should have unfettered access to all the bandwidth available, but quietly implementing some well-known techniques for limiting bandwidth usage would have at least mitigated the damage.
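
    To make the bandwidth-limiting idea concrete, here is a minimal sketch of a token bucket, one of the well-known techniques for capping how fast a single sender can push data. It is an illustration only, not how any particular switch or router implements QoS; the class name, the byte-based units and the injectable clock are all assumptions made for the example.

```python
import time


class TokenBucket:
    """Allow `rate` bytes per second on average, with bursts up to `capacity` bytes."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # refill speed, bytes per second
        self.capacity = float(capacity)  # largest burst allowed, bytes
        self.tokens = float(capacity)    # start with a full bucket
        self.clock = clock               # injectable so the sketch can be tested
        self.last = clock()

    def allow(self, nbytes):
        """Return True if a packet of `nbytes` may be sent right now."""
        now = self.clock()
        # Refill according to elapsed time, never beyond the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False  # over the limit: the caller drops or queues the packet
```

    A sender that stays under its rate never notices the limiter; one that tries to flood gets refused as soon as its burst allowance is spent, which is the "mitigate the damage" behaviour described above.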

    Finally, routing protocols other than spanning-tree or OSPF should be used. Creative implementation of internal addressing schemes (10.0.0.0 IP addresses) and a combination of BGP and last-resort static routes would certainly help to avoid these sorts of problems. I'm also wondering whether a *nix box running Zebra in critical locations might not reduce the problems. Certainly Zebra can remove the routing load from the Ciscos and, with plenty of RAM and processing speed available on PCs nowadays, could probably improve routing efficiency when a circuit goes down.

    But the key to this problem is bandwidth management not routing management. Of course, the next problem could be routing. One seldom has the budget to solve everything.

    --
    No one ever had to evacuate a city because the solar panels broke!
  172. Cisco STP implementation may have a bug by MECC · · Score: 1

    We have had similar problems with networks 'going down'. We have many vlans, so just one vlan went down, but it seemed to be a problem with how Cisco does STP for vlans on their newer equipment. Each vlan gets its own spanning tree, but the root identifiers are all the same, and the ethernet addresses for the vlans on our central switch are all the same. Older Cisco equipment had a different MAC address for each vlan. Thus, the root bridge identifiers were all unique, and when two vlans got bridged, loops didn't happen. Now, however, if two vlans get bridged (a computer with a wire in one vlan, and a wireless card in another vlan), the forwarding tables on the switches get confused because there are multiple paths to the same stp root.
    This is really confusing to work through, but it really does look like cisco isn't implementing vlans the right way. We can't turn off stp on our whole network, so we turned on bpduguard on as many switch ports as possible. That way, if someone starts bridging, the port gets shut off as soon as a switch sees a bpdu packet. The down side is that nobody can plug in a hub or switch to our network.

    It's worth noting that our problem arose when we installed a new central switch, and ran it redundantly. The new switch confused stp root identifiers wherever a bridge occurred.

    We have many wireless laptops on our campus, and someone plugged a wireless laptop into a wired connection, which had a different vlan, and turned on windows network sharing, which started bridging the two interfaces.
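
    For anyone trying to follow the root identifier business, here is a rough sketch of how a spanning tree root gets picked: the bridge with the lowest bridge ID wins, comparing priority first and MAC address second. This is a toy model under stated assumptions, not Cisco's code; `BridgeId`, `elect_root` and `roots_distinguishable` are names made up for the illustration, and the MAC addresses are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BridgeId:
    priority: int  # lower wins
    mac: str       # e.g. "00:00:0c:00:00:01"; lower wins on a priority tie

    def key(self):
        # Compare priority first, then the MAC address as a number.
        return (self.priority, int(self.mac.replace(":", ""), 16))


def elect_root(bridge_ids):
    """Return the bridge ID that would become root: the lowest one."""
    return min(bridge_ids, key=BridgeId.key)


def roots_distinguishable(root_per_vlan):
    """True if every vlan's spanning tree advertises a different root ID."""
    roots = list(root_per_vlan.values())
    return len(set(roots)) == len(roots)
```

    With one MAC per vlan, `roots_distinguishable` comes back True. With the same priority and MAC reused for every vlan it comes back False, which is the situation described above: once two vlans are bridged, nothing in the root ID tells the two trees apart.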

    --
    "We are all geniuses when we dream"
    - E.M. Cioran
    1. Re:Cisco STP implementation may have a bug by netwiz · · Score: 2

      Now, however, if two vlans get bridged (a computer with a wire in one vlan, and a wireless card in another vlan), the forwarding tables on the switches get confused because there are multiple paths to the same stp root.

      Excuse me? Since when do end hosts forward BPDUs? Since when do end hosts forward _anything_, for that matter?

      Unless you're going the el cheapo route, there's no reason that individual computers should be forwarding traffic. Okay, I'm sure some of you could show me valid scenarios, but I'll bet that none of them are realistic production environments (unless management has been incredibly stupid).

    2. Re:Cisco STP implementation may have a bug by MECC · · Score: 1

      End hosts will forward BPDUs if they turn on bridging between two interfaces, which Windows and virtually every Unix can do. We actually plugged a wireless Dell running XP into a hub and sniffed it. When windows network sharing was turned on, out came BPDUs.

      I realize it sounds messed up, and the reason you don't see it very often is that it's not easy to set up, except in windows xp, which will do it for you if you turn on network sharing. I think it does this because XP supports unroutable protocols.

      It's probably not common to need to bridge two interfaces, but we have a 'wireless laptop' initiative, and many professors have wireless laptops that they roam around with, and then bring back to their offices. For whatever reason, they dutifully plug into their wall jack, and sometimes turn on 'network sharing', which actually does bridge two interfaces, and sends and listens to BPDUs, just like any ethernet bridge should do.

      Windows network sharing gets even more interesting. Apparently, windows also advertises itself as a gateway to the Internet for other windows systems, which start using the bridging computer as a gateway as soon as they see it, without telling you. This has the effect of slowing up the subscribing computers, because they are now going through another computer to get to the network. Remember that a wireless access point is just a hub, so there's really no stopping this behaviour.

      I think I know why Windows XP does this. I think msoft wants people at home to be able to have one computer connected to the internet, and seamlessly provide wireless access to all the other computers in the home. So, when you click one button, XP bridges the wireless card and whatever other interface is available, like a dsl connection. It then also advertises itself as a gateway to the internet, and other windows computers in your home will start using it as a gateway, automatically, without asking you. This makes sense if you want every windows system in the place to go through one computer automatically. In a wireless laptop environment, it has undesirable results.

      --
      "We are all geniuses when we dream"
      - E.M. Cioran
  173. Reduced to? by tomdarch · · Score: 2
    Senior executives were reduced to errand runners.

    What do you mean 'reduced to'? What else are they good for?

  174. I hope they "Fixed the Glitch" by Rascalson · · Score: 1
    One of "The Bobs": It seems this employee (looks down at list), Milton? Yeah, it seems there was a glitch in the payroll system. He was let go during the last cutback and it appears no one ever told him, but because of the glitch he continued to receive a paycheck.

    Lumbergh: So you fired him?

    The other "Bob": No, we "Fixed the Glitch"

    --
    prisoner# msce18xxxxx. Currently planning my escape.
  175. I can top that! by Ashurbanipal · · Score: 5, Funny

    There was an electrician named Joe at the place I used to work who was counting the days to retirement. He never did a lick of work he didn't absolutely have to, and he never cared if his work would last 24 hours after his retirement.

    The NEC (National Electrical Code) was the first casualty of his attitude. But not the last!

    I discovered that he carried a heavy-duty plug in his pocket with the two hot leads wired directly together. He called it his "pigtail".

    When Joe needed to find what circuit breaker controlled an outlet, he jammed in the pigtail (with an audible *snap* of electric arc) and then calmly walked down to the nearest breaker box to see what had tripped.

    You could tell he was working in a building because you'd see scientists running down the hallways tearing their hair and screaming "My research!!! My research!! Ten years of research ruined!!" as the voltage spikes destroyed their equipment...

    1. Re:I can top that! by rangek · · Score: 1

      It is amazing that this is so common. We had a similar situation in our lab here. Our espresso machine had tripped a circuit breaker in our lab. No biggie, but we have to call a union guy to flip it back for us. So he comes down and doesn't even know where the box is. I tell him and he goes off. A minute later, all of our computers go off. Apparently he wasn't sure what was what and just flipped every breaker in the box to make sure...

  176. Use The Hammer With The Right Shape by SEWilco · · Score: 1
    If Linux is the solution, a Linux-powered network box is better suited to a wiring closet than a big hot PC is. In most situations, whether a router has Linux inside doesn't matter. But there may be situations where the Linux network kernel design is better or worse for the task at hand. Although if Cisco and Linux are what the network staff know best, having Linux on the secondary network could mean that more staff already know two networking technologies and configuration methods.

    Their network staff should be looking at all solutions. They know better than we do what their bandwidth and connectivity problems are. I only hope they don't make the same mistakes on both networks.

  177. MOD PARENT UP, INFORMATIVE by Anonymous Coward · · Score: 0

    Pleeeease.

  178. My backup network... by 100MHzperhour · · Score: 0

    If my network fails, I can always rely on my vast web of fishing string with paper cups attached to the ends for reliable, secure data transmissions to any hospital room via voice communication. Actually I have found the hospital urine cups to transmit at 100mbps rather than the 10mbps the paper cups get.

  179. dependency vs. use by Anonymous Coward · · Score: 0
    this is in no way a new (TM) problem and the sad part is the people are NOT LEARNING. There seems to be a sad epidemic in which people go from using technology to aid and assist and instead depend upon it like an addict (or worse). I have been in supermarkets that had their main computer records die and then laughed as the staff vainly attempted to salvage the process using a paper method that obviously did not sync well with the information, much less the processes they used. The wise thing would be to learn from this and create a backup method. NO, not just backing the data up (it goes well beyond that) but in fact creating a "Manual" process that will integrate with the more automated, electronic only process causing a minimum of downtime. Sure it is foolish to expect that the backup policy would run as smoothly as the tried and true (and accepted) electronic method but if an obvious pattern and flow exists between the two then things will go a lot smoother for ops and the eventual data entry. It would also be wise to set up a "fail safe" method that relies on a periodic snapshot and is completely independent of whatever primary data storage, organization, or retrieval methods are in place and then apply a similar method towards the actual network.

    What this means is that if say some error or catastrophe strikes and destroys the data access (the actual data, the database, the interface or logic, etc) then you can have a backup copy that is read only and statically "linked" for lack of a better term. A standalone printer can then start chugging away the pertinent records (in the event of a total network takedown) or even better the various departments and nursing stations (for this instance) could be uploaded with the pertinent areas from the snapshot, assuming even the electronic devices are working. If they are not, then you better walk across the street and start printing up those records paying attention to priority based on condition (patient), scheduled events (operations, lab work due, etc), and most importantly the "flash override" of requested records. Since not every detail of the record is usually needed, merely the more relevant areas pertaining to the matter at hand (e.g. history of lung scans for the oncologist) then only those can be printed out and delivered.

    Hand held devices utilizing a backup wireless network could be a method of shooting this data to various departments. If however there is a total disaster, like an EMP attack or more physical attack disabling the entire system, then really you are more screwed by that than anything else so backup is a luxury at that time. In that case you better go by the existing printed reports (you did make hard copies of current procedures and the latest status, didn't you?) and then interview the patient and family... that's what EMS has done, so you can too.

    Fix the PROBLEM, not just one particular manifestation or symptom from it.

    On a related but different note, ever wonder why the suited monkeys at many large government agencies and companies fail to understand the true meaning of redundant processes, data and systems? That's their job to think about things like that yet they are too busy being used car salesmen and patting their golfing buddies on the back. In war we call them Frag-bait... DO YOUR JOB!

  180. Where was the non-computer backup solution? by RevDobbs · · Score: 1

    Sure, install a second network... but what about power failures? Or if both networks go down?

    The company I work for is too small for redundant networks & servers, but I make sure that there is a manual fallback: fax server doesn't pick up? Fax machines will. db server down? All the forms you need to do your job are available near your desk, and there are tons of extras in on-site storage.

  181. How was this allowed? by Anonymous Coward · · Score: 0

    I'm *very* surprised patient data was allowed to ride on a network that had multiple single points of failure.

    Hopefully the network engineers no longer work there, or are being properly trained on how to do their job. What would be scary is if they were properly trained already and didn't have the funding to do the proper maintenance.

    There should be enough fear of just this type of failure that enough redundant infrastructure is made available for critical data to ride on.

  182. Oh, is it Bitchin'-About-Cisco day? by Threed · · Score: 1

    Three different versions of Cisco IOS in three different locations. All talking (trying to, anyway) to a Linux box running FreeS/WAN. In order from OLDEST to NEWEST IOS release: One VPN works fine. One works if you keep something pinging across it. The third doesn't work at all.

    Someone suggests using a Cisco as the "hub", instead of the Linux box. Now NONE of it works. Fancy that...

  183. Re:Of course it can help (not a smartass reply) by proverbialcow · · Score: 1

    It's called a dead man's switch. Just have a simple ping going out every couple of seconds across the network from vital nodes, and if ten of them fail in a row, or a hundred, or whatever, then you know someone needs to take a look at it.
    Hell, have it go out every couple of microseconds. That's nothing compared to the volume of traffic a network of this size must be expected to handle.
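
    A rough sketch of the "N failures in a row" logic, for what it's worth. It only does the counting; how you actually ping (ICMP, a TCP connect, whatever) is left to the caller, and the class name and threshold are made up for the example.

```python
class DeadMansSwitch:
    """Raise the alarm after `threshold` consecutive failed probes."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.misses = 0  # failed probes since the last success

    def record(self, probe_ok):
        """Feed in one probe result; return True if someone should take a look."""
        if probe_ok:
            self.misses = 0   # any success resets the count
        else:
            self.misses += 1
        return self.misses >= self.threshold
```

    Resetting on every success is what keeps a single dropped ping from paging anybody; only a sustained run of failures trips it.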

    --
    The only surefire protection against Microsoft infections is abstinence. - The Onion
  184. Applications causing the outage by tgrossner · · Score: 1

    I have personally seen applications that used so much bandwidth accessing data across the network that a completely stable network was reduced to a non-functioning state. Case in point is one of my customers (I work for SBC Datacomm) who built the WAN/LAN they are now running on approx 3 years ago, with the knowledge that an outside software company would write a web-based application to run across it. The app uses DB2 data from the mainframe and then uses XML to present the data and let the end user edit it. Long story short the application was WAY bloated and was found (by me when the network was reduced to jelly) to pull about 4 megabits per second when it was processing data....with something like 2000 users all accessing it at the same time you can do the math and see that is a recipe for disaster.

    Tim Grossner
    Field Engineer, SBC Datacomm
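
    Spelling out "the math" with the figures quoted above (about 4 Mbit/s per user, about 2000 users at once). The link speeds used for comparison are assumptions for illustration (the nominal 100 Mbit/s and 1000 Mbit/s Ethernet rates), not the customer's actual links, and the function names are made up.

```python
def aggregate_demand_mbps(per_user_mbps, concurrent_users):
    """Total offered load if every user pulls data at the same moment."""
    return per_user_mbps * concurrent_users


def oversubscription(demand_mbps, link_mbps):
    """How many times over capacity a single link would be."""
    return demand_mbps / link_mbps


demand = aggregate_demand_mbps(4, 2000)    # 4 * 2000 = 8000 Mbit/s
print(demand)                              # 8000
print(oversubscription(demand, 1000))      # 8.0  (assumed gigabit link)
print(oversubscription(demand, 100))       # 80.0 (assumed 100 Mbit/s link)
```

    That is the worst case where everybody is processing at the same instant, but even a fraction of it would swamp a shared uplink.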

  185. Not true. by Anonymous Coward · · Score: 0

    I don't buy this BS about scrambling for paper forms not used in 6 years. Having worked for a world-renowned hospital (name withheld) I know for a fact that each hospital must have in place a manual paper system in the event of a computer failure. These processes are most dreaded as they result in errors and a great deal of lost revenue, but are required for certification.

    Of course, if the hospital isn't certified in a state or requires only the 'B.J. Clinton' certification of finger pointing...

  186. Offtopic by InadequateCamel · · Score: 2, Informative

    I read in a book about the number zero that I mentioned here before that the real cause was someone accidentally left a zero in a line of code, rather than a person pressing zero and crashing the entire network. Perhaps someone tried to execute a command that led to this faulty code being used by the ship's computers?
    Maybe this was proven to be false later, I dunno.
    Kind of funny though...

  187. To make an analogy to another redundant system. by Inoshiro · · Score: 2

    Yes, there is always the possibility you might be born blind, but most people don't have that genetic defect. They have two eyes which work very well, even if one of them happens to be broken by a random toothpick accident.

    Redundancy is always good in a system where uptime is king. That is why so much of nature has organisms based around semi-redundant designs.

    --
    Internet Explorer (n): Another bug -- that is, a feature that can't be turned off -- in Windows.
    1. Re:To make an analogy to another redundant system. by cellocgw · · Score: 1

      Well, not eyes (redundancy in nature). Animals developed two eyes - or eye clusters, in the case of insects -- to allow stereovision and thus depth perception.
      Dual kidneys may be for backup and may be happenstance. There's one liver, one gall bladder, one pancreas, 5 lungs (more or less), and so on.

      --
      https://app.box.com/WitthoftResume Code: https://github.com/cellocgw
    2. Re:To make an analogy to another redundant system. by protonman · · Score: 1

      Animals developed two eyes - or eye clusters, in the case of insects -- to allow stereovision and thus depth perception.

      You're wrong. Just because 2 eyes are used for stereovision, doesn't mean they were "developed" only for stereovision.

      It's evolution baby, things don't get developed for a purpose, things appear and happen to have one or more purposes, and the useful/better (in the evolutionary sense) things stay...

      --
      The man of knowledge must be able not only to love his enemies but also to hate his friends.
    3. Re:To make an analogy to another redundant system. by cellocgw · · Score: 1

      Well, I wrote an excessively compact comment. Yes, eyes may have followed other development paths, and yes, two eyes turned out to be necessary and sufficient.
      How evolution "works" is a large and interesting topic which probably should be covered in some other thread.

      --
      https://app.box.com/WitthoftResume Code: https://github.com/cellocgw
  188. Re:Reliability is inverse to the number of compone by SEWilco · · Score: 1

    No. His notation is confusing, but his math is correct.

  189. Re:Reliability is inverse to the number of compone by Xugumad · · Score: 1

    I got the impression that the secondary network would be inactive, unless the primary failed. Therefore an event that brought the first down, would not affect the second.

    Unless of course, whatever broke the first, took the second down when it came online...

    On a similar note, who wants to bet they'd put both networks on the same power source?

  190. A Case History by Baldrson · · Score: 3, Interesting
    A major corporation wanted to go paperless. They had all sorts of IDEF graphs and stuff like that to go with. I was frightened for them and suggested that maybe a better route was to start by just going along the paper trails and, instead of transporting paper, transport physical digital media -- sneaker-net -- to workstations where digital images of the mail could be browsed. Then after they got that down they could put into place an ISDN network to the phone company which would allow them to go from sneaker-net to a network maintained by TPC. If TPC's ISDN support fell apart they could fall back to sneaker-net with physical digital media. Only after they had such a fail-safe "network" in place -- and deliberately fell back on it periodically and randomly to make it robust -- would the IDEF graphs start being generated from the actual flow of images/documents. By then of course there would be a general attitude toward networks and computers that is quite different from that of the culture that typically surrounds going paperless.

    Unfortunately more 'radical' minds prevailed and the project was eventually abandoned after $100M.

  191. In my opinion... by freebase · · Score: 2, Interesting

    First, I don't have all the details of what happened, nor do I have any idea of what the network looked like prior to the outage. However, I have a general design philosophy based on my experience with teaching hospitals and telco networks.

    The concept is that of "a network of networks", much like Cisco's DCN solution for telco operators. This is a series of interconnected networks that are capable of standing alone in an emergency. These networks are normally oriented around particular application/traffic/usage patterns. An example would be a research network for research workstations, a lab network, a cardiac care network, and so on.

    All of these networks could exist as separate layer 2 vlans trunked back to the facility data center, if bandwidth is available. Within the data center, layer 3 routing could handle traffic that needed to cross between these networks. The data center would also have separate networks for each application group so that applications aren't able to interfere with each other, generally.

    Obviously this is an overly broad synopsis and leaves out many details; it is also just as obvious that I'm talking about a campus environment here and not a WAN, where the same theory will work, but with different implementation.

    --
    Sig??? I don't need no stinkin Sig!
  192. Why not fix spanning tree? by m1a1 · · Score: 3, Insightful

    If the problem is with spanning tree protocol then they already have redundant connections in place (or they wouldn't need spanning tree). From my experience spanning tree works really well on its own, and is even a little robust to people fucking with it. So the question is, why not deny everyone access to the switches and routers except for one or two administrators? It sounds to me like if they kept people from screwing with the network it would be fine.

  193. Identical? by AB3A · · Score: 1
    Let's not confuse identical with redundancy or diversity.

    Identical networks may or may not offer any backup depending on how they're managed. If there is a strict policy regarding how each network is tested and debugged before changes are implemented on the other, then it might help. Otherwise, in a large network such as this, you'll perpetually be scratching your head as to why the two are different.

    Redundancy merely implies redundant functionality, a backup link. This helps only if the backup links use different infrastructure to get from one place to another. But, again, if the two networks are bridging traffic with spanning trees then I still don't see how this helps the situation much.

    Diversity is the solution. Use separately powered routers, different links, even separate wiring closets. It's not cheap. It's not easy to manage. But it will provide a connection with far more reliability than the others.

    This Hospital network seems to me to be something that just plain grew without much planning. Somehow, it became the great big switched network of everything. This works until someone makes a short circuit link from one node to another and then the spanning tree falls in to its belly-button.

    I've seen the people with all the right certifications dive right in to that recursive problem and run themselves in to testing circles. The problem is that we don't teach diagnostic thinking in schools or in training classes. I'm not even sure that we can. Problems like this demand a scientific method approach (as outlined so nicely in Robert Pirsig's book "Zen and the Art of Motorcycle Maintenance"). It's slow. It's tedious. And in really tough problems such as this, it's the only method left that will repair the situation. I know of very few people who know how to do diagnostic thinking this way.

    It's sort of like the difference between hacking a bunch of code together, doing limited testing and then saying "It Works" --or thinking of a concept, carefully planning the code around it, planning all the testing of each segment of the code, demonstrating that the final assembly of the software works, and then tentatively calling the product "Functional."

    I feel sorry for the hospital staff who had to endure this. I hope their misfortune serves as an object lesson to pointy haired bosses about giant switched networks where everybody can see everything. But somehow, I'm almost certain the object lesson from this will be lost on them as they blame a black box rather than the people maintaining it.

    --
    Nearly fifty percent of all graduates come from the bottom half of the class!
  194. too true by oliverthered · · Score: 1

    Remember the phone network outage from maybe 10 years ago?
    A fault in the initial system caused the network to go down, and the backup was switched on.

    Unfortunately the backup had exactly the same fault, the software had to be corrected before the network could be brought back online.

    --
    thank God the internet isn't a human right.
    1. Re:too true by Anonymous Coward · · Score: 0

      >Unfortunatly the backup had exactly the same fault, the software had to be corrected before the network could be brought back online.

      Close, but the actual error was in both the code for the "repair" routine, and the logic of the repair routine in and of itself.

      The phone switches were designed to reboot if an error occurred, and before reboot tell all the other switches to handle the present calls. So, one day some sort of error occurs for no good reason (who knows, maybe an alpha particle hit a memory cell?) and a switch offloads its calls and reboots.

      The problem is, the other switches get flooded with calls and hit the subroutine that tells them to pass the calls that are too many to handle to another switch. Inside that code there's a switch with a missing break, and they reboot after erroring out. So the calls get passed on again, rebooting more switches, and so on, and so on.

      The entire problem could have been solved quicker than the (many) hours it took by simply cutting off all of the phone service on the eastern seaboard for a few minutes while the switches settled down.

      The problem with the logic was making the other switches handle all the calls, and not just the priority (911, fire, police, hospital) calls. Non-priority calls should have been dropped flat on the floor.
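
      Roughly what that alternative would look like, as a sketch only: when a neighbour hands over its calls, take the priority ones that fit and drop the rest instead of passing them along. This has nothing to do with the real switch software; the call records, the `kind` labels and the function name are invented for the example.

```python
PRIORITY_KINDS = {"911", "fire", "police", "hospital"}


def accept_overflow(calls, spare_capacity):
    """Split handed-over calls into (accepted, dropped).

    Only priority calls are accepted, and only while spare capacity remains.
    Everything else is dropped rather than forwarded to yet another switch.
    """
    accepted, dropped = [], []
    for call in calls:
        if call["kind"] in PRIORITY_KINDS and len(accepted) < spare_capacity:
            accepted.append(call)
        else:
            dropped.append(call)
    return accepted, dropped
```

      The point of dropping instead of forwarding is that the overload stops at the first neighbour rather than rippling on to the next switch.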

      Oh well... :-/

  195. Re:Reliability is inverse to the number of compone by SEWilco · · Score: 1
    The chance of more than one disk failing at once is low, unless you are using disks by the same manufacturer which might have been manufactured on the same day by the same machines which made identical mistakes in all the drives so all the drives have a large chance of failing at the same time.

    You need a RAID controller which can handle slightly different drives, and have at least one different drive in each row. Even better if you're using a configuration where two drives by different manufacturers have whole copies of the data, so failure of two drives is not fatal.
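
    A back-of-the-envelope sketch of why identical drives hurt. If failures are independent, losing every copy needs every drive to fail on its own; a shared defect adds a chance that they all go together. The numbers and the `p_common_cause` parameter are purely illustrative assumptions, not measured failure rates.

```python
def p_all_copies_fail(p_each, copies, p_common_cause=0.0):
    """Probability that every copy is lost in the same period.

    p_each:          chance one drive fails on its own
    copies:          number of drives holding a full copy
    p_common_cause:  chance a shared defect takes out all drives at once
    """
    independent = p_each ** copies
    return p_common_cause + (1.0 - p_common_cause) * independent
```

    With p_each = 0.1 and two copies, the independent case works out to 0.1 * 0.1 = 0.01. Add even a 0.05 chance of a shared manufacturing fault and it becomes 0.05 + 0.95 * 0.01 = 0.0595, so the common cause dominates. Mixing manufacturers is an attempt to push that second term toward zero.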

  196. Re:Reliability is inverse to the number of compone by ceejayoz · · Score: 2

    Someone failed their vision test...

    See that percent sign? The little "%" thingy?

  197. Go Wireless, Use copper for Backup by randomErr · · Score: 2

    Go Wireless, Use copper for Backup

    I'm not talking 802.11, but military grade Spread Spectrum. It would cost a lot less than laying new copper. And if some a$@hole inadvertently starts a DOS attack you could just flip off the main antenna array at your NOC for 10 minutes and let the network reset itself. Also throttle your nodes to say 10 mbit. That way one node can't take down your entire network.

    If a storm or other activity takes out the antenna array you still have the old copper. Keep a switch (physical switch, not hub like switch) so that you could walk over to a panel and switch your node over to copper in a jiff. If they both fail then go carrier pigeon, CB's, or cellphones. Nothing like a good old analog message in a pinch.

    --
    You say things that offend me and I can deal with it. Can you?
  198. Physical Redundancy - Hell Yeah by pipsqueak · · Score: 1

    Yes... if you need as close to 100% redundancy as possible, the only answer is complete physical resiliency.

    Start thinking about the OSI model and its relevance to this. Yes, you were taught about it for a reason. You can create resiliency at a higher level (e.g. IP) but if you're relying on a single physical or datalink structure the network will always be prone to failure from physical or datalink issues.

    I work for an ISP. We would never give an SLA over 99% (e.g. 99.999%) unless physical redundancy was included.

    The only problem is, it's so expensive it's hard to convince anyone the extra 0.999% is worth it... until they experience what happens when you ignore it.
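
    To put numbers on what that extra 0.999% buys, here is a quick sketch converting an availability percentage into allowed downtime per year. It assumes a 365-day year and counts all downtime equally; the function name is made up for the example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def downtime_minutes_per_year(availability_percent):
    """Minutes of downtime per year permitted by an availability figure."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for sla in (99.0, 99.9, 99.999):
    print(f"{sla}% -> {downtime_minutes_per_year(sla):.1f} minutes/year")
```

    A 99% SLA allows roughly 5,256 minutes a year, which is about 87.6 hours. Five nines allows a little over 5 minutes.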

  199. Multiple Problems and Multiple Solutions by SuicidalSquirrel · · Score: 2, Insightful

    First of all, this was apparently a flat layer-2 network. From the information I have seen, it was a very large network. Spanning tree is a wonderful protocol and layer-2 networks are not bad things, BUT spanning tree is very complex in a large network, and latency is going to be an issue if there are no routed boundaries to control traffic. I have experience in designing networks for hospitals (and financial institutions and universities and gov't institutions), so I am aware that implementing layer-3 to the edge is not necessarily feasible for many reasons - financial, legacy setups, etc. That being said, however, there should be some layer-3 at some point to segregate traffic and protect the critical pieces of the network. Identify the critical points of the networks and put redundancy there - i.e. the server farm, critical care monitoring systems, WAN connection. All network equipment vendors have some type of redundancy feature that would take care of automatic failover for these devices.

    Full redundancy is impossible - are you really going to have dual NICs in every workstation and expect that everything would just work in the event of a failover? First of all, the expense would be incredible, and the maintenance would be a nightmare. If they are like most institutions, they are already understaffed and overworked - they wouldn't be able to keep something like that together. Dual-home closet switches to redundant routers/switches that are in turn dual-homed to a redundant core. Servers should have multiple NICs that are attached to multiple switches specifically to provide redundancy.

    The worst problem here, though, was not the network itself. This is probably the most prevalent common problem to all institutions - they had no test environment. As multiple other posters have pointed out, this experimental database should never have been attached to a production network, regardless of the expected impact it might have. The key word about it is EXPERIMENTAL - you don't know how it might impact anything. As long as there is no separate environment for testing, there's really no such thing as redundancy no matter how the network is configured.

    Say, for example, that the application took down the primary network, so the secondary comes up and takes over. Did anyone realize what caused the failover? Probably not, since a properly configured network will failover in a matter of seconds. So, the application is still running. How long until the secondary network fails as well? Then all of the expense and reconfiguration that went into building the redundant network were for nothing.

    If this hospital is like most, they have an extremely diverse hodgepodge of equipment - some incredibly old stuff that they keep around because it works and some really cool cutting edge gadgets that everyone can see the benefit of. They've also expanded the network as needed and tried not to take anything down when they did it, so what they've ended up with is a logical rat's nest. VLANs probably have been created, but they're probably trunked everywhere, because the goal of the expansion was to connect more devices, not to segregate by function. Hospitals don't get down time, so it's not a simple thing to say that things have to be reconfigured. Odds are that the workstations may not all even be on DHCP, so changing an IP may require a person (back to that understaffed thing again) touching possibly hundreds of workstations. Yes that needs to be done, and I don't know a single network admin who wouldn't agree, but when you have to have outages cleared by upper management who are going to be chewed out by the board if the time frame goes longer than you expected, it turns into a lot more than just what is actually best for the network.

    The solution: use down time wisely. Stage implementations and keep them within the allotted time frames. And DOCUMENT. I know - nobody likes to do the documentation, but I think we can all say that it's a lot easier to plan migrations if you have documentation of what is currently there. Realize that no matter what you do, it's not going to last forever. Your cable plant probably has a lifespan of 10 years (not to say that you may not get 20 or even 30 years out of it, as long as you're willing to stay slow), but your network devices will probably only be there for 5 years. Are you still going to be there for the next change? Probably not, so be nice to the company and to the people who follow after you and document.

    Just my $0.02, and I'm just that blond chick, so what do I know anyway...

    --
    So what are you going to do? Bleed on me?
  200. Re:Reliability is inverse to the number of compone by ceejayoz · · Score: 2

    0 train fails = 0.9 * 0.9 = 0.81
    1 train fails = 2 * 0.1 * 0.9 = 0.18
    2 train fails = 0.1 * 0.1 = 0.01

    which means that the probability of having at least one train going from NY -> LA is ... 98%, much better than the previous 90%.


    Erm... to quote you, "I think you made some mistakes."

    100% - 1% = 99%.
    81% + 18% = 99%.

    How'd you get 98% out of those numbers?
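
    The arithmetic can be checked with a short sketch. It assumes, as the quoted numbers do, that each train fails independently with probability 0.1; the function name is made up for illustration.

```python
def p_at_least_one_arrives(p_fail, trains=2):
    """Probability that at least one of several independent trains arrives."""
    return 1 - p_fail ** trains


p_fail = 0.1
p_none_fail = (1 - p_fail) ** 2        # 0.9 * 0.9
p_one_fails = 2 * p_fail * (1 - p_fail)  # 2 * 0.1 * 0.9
p_both_fail = p_fail ** 2              # 0.1 * 0.1

print(round(p_none_fail + p_one_fails, 4))       # sum of the "good" cases
print(round(p_at_least_one_arrives(p_fail), 4))  # 1 minus the "both fail" case
```

    Both routes give the same answer, 0.99, because the three cases have to add up to 1.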

  201. perfect time to... by Anonymous Coward · · Score: 0

    ...upgrade to gigabit.

    I suggest Foundry equipment.

    Really... I mean, who needs proprietary layer 2 and 3 spanning tree/routing protocols? Anyone caught out using them deserves the pain they suffer.

  202. Probability versus Reality by SEWilco · · Score: 1
    You're playing with probabilities of failure of trains.

    Don't forget that in the real world some train failures cause derailments which destroy the track which goes in the other direction.

    Defensive design requires considering both probabilities and physical reality. Lightning is less likely to damage fiber than copper, but copper might be better in a very hot environment (not that I'd like to run the network of a steel mill). The chance of two identical Cisco networks failing is small, unless the failure involves behavior of Cisco equipment which even Cisco engineers can't change.

  203. Interesting response by jhines · · Score: 3, Insightful

    That this happened in a teaching hospital, rather than a large corporation, makes their response much different.

    They have been open about the problem, in a way that a for profit corporation could never be. This allows the rest of the world to learn from the experience.

  204. Not a parallel network, but a parallel process by Anonymous Coward · · Score: 0

    I'd put the focus into refining the paper system.
    It's the simplest form of communication, and
    The most flexible when responding to a crisis.

    If al-q, for instance, takes out your power source,
    your 3 meg parallel system is pointless... just how
    long are you going to run on battery backup?

  205. Hmm... by cyt0plas · · Score: 1

    I wonder who developed their systems. Can we get them to work on palladium?

    I suppose that if the problem was in a microsoft application, they already are ;)

    --
    Contact Me (got tired of viruses emailing me).
  206. Absolutely a redundant network by Fastolfe · · Score: 2

    I don't really understand all of the comments saying a redundant network infrastructure is bad/stupid/etc.

    If your network is critical to your business, you should absolutely consider backing up every bit of that network with one (or more?) redundant components. This means every router should have a redundant pair, every physical network link should be redundant (including how it's routed through the building), every firewall, switch, etc. If you have mission-critical servers, they should have two NIC cards. Upgrades should never occur on both "sides" of the infrastructure at the same time, and both sides should be capable of running alone.

    Not only does this type of configuration resist failures, but upgrades or configuration changes to the A or B side should never impact the other side, and if it does, you should be able to shut down the offending sections without impacting availability.

    If your network staff doesn't understand these concepts, you desperately need to train them better. If the expense cannot be justified by management, then that's a business decision and when failures like this occur, they should not be surprised.

    1. Re:Absolutely a redundant network by Anonymous Coward · · Score: 0

      as I understand it, the second network is being recommended as a SOLUTION to the problem. Adding redundancy AFTER the disaster is pointless, and reeks of cheerleading on the part of their Cisco engineers.

    2. Re:Absolutely a redundant network by Fastolfe · · Score: 1

      From what I've read, the second network was already in the planning stages.

  207. Two is better than one? by mr_z_beeblebrox · · Score: 2

    Slashdot network engineers (armchair and professional): do you think the answer to having a massive and unreliable network is to build a second identical network?"

    Since Michael asked it like that I will leave behind my network engineer role (professional) and pick up my role as armchair mathematician.
    The item to be doubled is a network. Unreliability and massiveness are qualities of that network. So, using the distributive property of multiplication, this would give us the equivalent of one network that is twice as large and twice as unreliable as the original.

  208. The root cause by Anonymous Coward · · Score: 0
    The root cause in a nutshell: "They believed that they lived in a perfect world where their network couldn't crash." In reality ANY production system or process that is depended upon MUST have a disaster recovery plan in place BEFORE the disaster happens, and it should be considered a "living document" that is dusted off, updated and tested on a routine basis.

    This recovery plan should include all the P's - People, Process, Place. If the plan doesn't account for all three it won't work. Why build a plan to keep the network up if the building no longer exists or the workstations don't have power to operate? Why build a redundant process if a flu bug can take out the people? Why have a plan in place if no one is trained to implement it? Why why why?

    Techs fall into the trap of looking at those bits and bytes, while failing to take into account the entire BUSINESS process.

  209. RE: "...do you think the answer ..." by Coreigh · · Score: 0
    Yes.
    But only if you make sure that both networks are connected as mirrors to each other with a single non-redundant router so one network can bring the other one down.


    date; gunzip; strip -v; touch -c; finger; mount -s; fsck -V; more; yes ;umount -r ;curl -connect-timeout 600 ;sleep

    --



    "Waitress I need two more boat-drinks..."
  210. I lived in Boston until 1999 by Newer+Guy · · Score: 2

    I lived in Boston until 1999 and had my (ruptured) appendix removed at that hospital. That place is absolutely HUGE, many city blocks in size. Its network must be huge too and that's the problem. A LAN that size HAS to be sub-netted into smaller segments! Now, I'm not a whiz bang Network engineer, but I do know when something's done WRONG, and it sure seems like this is the case here. Building a parallel WRONG network won't solve the problem, it'll DOUBLE the problem! There are many gifted people here....why not come up with a solution for them here? Consider it a public service to a very public oriented hospital.

  211. Network metastasis as an art form by Anonymous Coward · · Score: 0

    Healthcare networks (at least the ones I've built) require extreme amounts of failover and a high tolerance for error. As an example, a 'small' hospital radiology department (200 studies/hour) has recently gone all digital, removing all save one film processing unit. If proper controls are not in existence and a single point of failure exists in the department, then an entire hospital could be without diagnostic imaging. Hence it is essential to develop not one failsafe but three (four counting a reversion to manual procedures w/ triage for critical situations). From telemetry being broadcast from a patient's room to a central nursing station, to LIS (laboratory information systems) moving data to HIS (hospital information systems), failover and failure planning is key. Build a parallel network, build five; it matters not if you have not done critical assessments and failure planning.

  212. Wrong type of network infrastructure by induhvidual · · Score: 1

    I have been through this dance before (I design large mission-critical computer systems for a living). The words "spanning tree" caught my immediate attention, since I have faced similar issues while trying to build Ethernet networks into an approximation of a mesh topology. It can be done, but it tends to be fragile, and it is REALLY easy to introduce loops if you are not careful. The solution: ATM. (I know - insert derisive laughter here) ATM was designed for mesh topologies, and incorporates a least-cost-routing algorithm to help traffic negotiate the multiple paths between network nodes efficiently. It is a great solution to form the core section of a campus backbone, with edge devices to translate between Ethernet and ATM for traffic to and from the network clients. It will never happen though. ATM is not even on people's radar screens, much less actively considered for deployment. I have had no luck suggesting it as a solution in my network designs either. *SIGH*

    1. Re:Wrong type of network infrastructure by chris1howell · · Score: 0, Redundant

      I agree.... It seems everybody has fallen under the spell of Ethernet. There ARE other networking technologies out there which have not had to be "patched" over the years to remain viable today. Ethernet was never designed to be redundant; spanning tree is merely a band-aid, as is almost every technology available for Ethernet. Traffic management could have saved this network; Cisco's attempt at Quality of Service (really Class of Service) may have made a difference. To build two redundant Ethernet networks is ridiculous. If you are going to spend the money, do it right: use a technology which was designed for very, very large networks. Build a carrier class network. Use a technology like ATM and build a redundant mesh. ATM was designed from the ground up to allow for redundancy and Quality of Service, true Quality of Service. Redundant links will NOT be disabled; they will be used in a load sharing manner, increasing backbone availability and capacity. The problems are inherent with Ethernet. An enterprise network of this scale should not be built with a cookie cutter. Ethernet is great for a home network and small enterprise. But very large networks should look for alternative technologies.

  213. Re:What is spanning tree protocol? (google whoring by jerde · · Score: 2, Interesting

    Well, mostly transparent to end stations.

    Some workstations turn up their ethernet link by software, and then try to use the port right away to, for instance, obtain a DHCP lease.

    Spanning tree starts doing its work as soon as it sees ethernet link. So, there's a delay between the time the link comes up and when traffic starts to pass.

    Apple's DHCP implementation was bitten by this on some of their machines, affecting the startup of the Appletalk stack, which unlike DHCP, will not retry its initial auto-configuration and address discovery.

    I've always been skeptical of "intelligence" added to layers below 3. There are always unforeseen interactions and consequences to ANY variance from a set standard.

    - Peter

    --
    INsigNIFICANT
  214. We Don't Have The Details by SEWilco · · Score: 1
    The odd thing is the mention of what the researcher was doing being "compute-intensive". That should slow down its use of the network, if the processor has work to do.

    • Perhaps the researcher was actually monitoring the network so he had to look at the live network -- but perhaps something about the monitoring affected the network.
    • Maybe he was running all network traffic through a device which couldn't quite keep up with the data rate (but then network staff would have pulled it out).
    • Maybe he forced the routers to feed copies of all data to his equipment -- but the link to his equipment wasn't fast enough.
    • Maybe his monitor link worked fine, Cisco staff knew it worked fine so ignored it as a source of problems, but some quirk (writing to disk while a large data burst appeared?) happened every six hours which caused an unexpected problem in the network equipment.
    • Maybe he was gathering data from a maintenance program inside all PCs, so he was actually slowing down all networked PCs (or crashing some of them).
    • Maybe he's the president's son and the president had ordered that the research must be done.
  215. A step back by Anonymous Coward · · Score: 0

    Reading this thread perfectly illustrates the largest hurdle to clear when troubleshooting any major network issue. I have no way of knowing how many people were engaged in the resolution of this issue, but in my past experience with situations like this there are always way more hands reaching for the cookie jar than the jar can handle. Imagine trying to get everyone that's posted here to agree on a singular next step. Difficult at best, and we haven't even talked to management yet!

  216. Ethernet and Spanning-tree by chris1howell · · Score: 1

    It seems everybody has fallen under the spell of Ethernet. There ARE other networking technologies out there which have not had to be "patched" over the years to remain viable today. Ethernet was never designed to be redundant; spanning tree is merely a band-aid, as is almost every technology available for Ethernet. Traffic management could have saved this network; Cisco's attempt at Quality of Service (really Class of Service) may have made a difference. To build two redundant Ethernet networks is ridiculous. If you are going to spend the money, do it right: use a technology which was designed for very, very large networks. Build a carrier class network. Use a technology like ATM and build a redundant mesh. ATM was designed from the ground up to allow for redundancy and Quality of Service, true Quality of Service. Redundant links will NOT be disabled; they will be used in a load sharing manner, increasing backbone availability and capacity. The problems are inherent with Ethernet. An enterprise network of this scale should not be built with a cookie cutter. Ethernet is great for a home network and small enterprise. But very large networks should look for alternative technologies.

  217. Redundancy redundancy redundancy by visionsofmcskill · · Score: 1

    while i agree that the root of the problem should be fixed... lesson number one in network management is BACKUP EVERYTHING WITH A DUPLICATE..

    While this may be a patch-over-problems way of handling things, it handles one VERY important aspect of doing business... FAST EMERGENCY RECOVERY...

    truth is, if one protocol didn't cause the disaster then maybe a central server would have gone down in a few months causing another similar disaster... or maybe a top level switch begins to malfunction causing trickle-down network problems, or maybe two hard drives in a RAID unit fail simultaneously... all of these are pretty bad scenarios....

    Solution? double them all up within reason.... and then back them all up... mirror your RAIDs... and have several backup servers... have a secondary bank of switches to swap in during an emergency so you can fix the first bank...

    i wouldn't go so far as to back up the client equipment, but realistically, if possible, everything in the server room down to your T1-T3 connections/routers should be in at least 2's.

    how many serious companies do you know that operate with only 1 T1 in house?

    --Enter the sig--

    --
    --Idiots, Every single one of YOU, A flaming mass of conglomerated morons, hey wait a second, isnt that how RAID works?
  218. Re:That's why I hate automatic routing by Anonymous Coward · · Score: 0

    You're awful smug. Most small businesses, high schools, etc. are not going to invest a lot of money in the network no matter what. They look at it as another sales gimmick and come-on. As far as badly designed networks: sometimes you just make the best of what you are given, and if it turns out badly, well, you do the best you can. If the business says I have $1200 to spend on the two IDFs serving 400 users, my choice of switches and gear is pretty much decided for me, and it will suck, and you will come in and complain about poor planning and badly designed networks. Doesn't mean you have a duck's fart of an idea what went on, but you get to whine, which is fun for you I guess.

  219. Doh! by SEWilco · · Score: 1
    But their digital copier couldn't talk to its printers through the network...

    Oh, you think I'm joking? There are copiers which are basically a scanner -- and they can make large numbers of copies very quickly by using several printers simultaneously.

    (I don't know if that was actually a problem in this situation)

  220. Obligatory barely-related Microsoft reference by Infonaut · · Score: 2
    Remember reading about the Microsoft-driven Hospital of the Future(tm) a couple of years back? I was trying to find info about it by doing a Google search. Amazing what * Microsoft hospital * brought up. MS is definitely making a concerted push in the health care industry.

    Let your imaginations wander, and ponder a point in the future when all of our health care facilities will be run on Microsoft... .

    --
    Read the EFF's Fair Use FAQ
  221. Re:Open souce Healthcare Information System Exists by midgley · · Score: 1

    You do actually have the VA system - VistA, which is free software and source under the US FOIA (I'd like one of those here). I was in LA earlier this month, at the meeting of OSHCA, the Open Source Health Care Alliance, which has been working for three years on this and similar ideas and practice. Help is gratefully received... http://www.oshca.org/ VistA is maintained by the Hardhats, http://www.hardhats.org/ and has recently been ported to run on Sanchez's Open Source (GPL) reimplementation of MUMPS, or M as it is now called. So it is possible to have, and in fact I have on my laptop here, an Open Source hospital information system including the physician order entry system, running on a GPL'd database management system of long pedigree and industrial stability, on top of a GPL'd Operating System. You can get GT.M from Sanchez or SourceForge http://sourceforge.net/projects/sanchez-gtm , VistA from the VA or WorldVistA, and then merely face a cliff-like learning curve for the domain knowledge, the programming language M, and the huge and complex system itself. But that is just work, no philosophical problems at all. The problems I am looking at are less concerned with the actual technical programming or the development of Knowledge Service and decision support components (although the latter is a wicked problem and the former non-trivial and not finished yet) than with the socio-political side. Realistically we need healthy companies to make a living by aggregating, installing, supporting, developing, and generally looking after our systems. What we need to get rid of is the lock-in to a vendor whose expected lifespan is an order of magnitude shorter than the lifespan of the data, the organisation that depends on it, and the patient - I bang on about "Ars Longa, Vita Brevis" on that.

  222. "The Formula" by Alyeska · · Score: 1
    If they have that level of redundancy for the electrics then I see no reason why they shouldn't for the network.

    There's always "The Formula" (a la Fight Club) to consider. Cost of a.) installing/maintaining said redundancy vs. b.) losses/liabilities incurred by primary system failure without redundancy. Work in the likelihood of failure and the value of Public Relations as a factor. If A > B, you don't make the redundant system. You simply accept the losses or downtime.
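
    A rough sketch of that comparison, under the assumption that "working in the likelihood of failure" means weighting the loss by its probability. Every figure below is invented for illustration and has nothing to do with the hospital's real costs.

```python
def redundancy_pays_off(cost_of_redundancy, failure_probability,
                        loss_per_failure, pr_cost=0.0):
    """Return True if the expected loss exceeds the cost of redundancy.

    cost_of_redundancy  -- A: installing/maintaining the redundant system
    failure_probability -- chance of a primary failure over the same period
    loss_per_failure    -- losses/liabilities from one unprotected failure
    pr_cost             -- hypothetical price put on the bad publicity
    """
    expected_loss = failure_probability * (loss_per_failure + pr_cost)  # B
    return cost_of_redundancy < expected_loss


# Made-up numbers: a $500k parallel system vs. a $20M outage.
print(redundancy_pays_off(500_000, 0.05, 20_000_000))  # 5% chance per period
print(redundancy_pays_off(500_000, 0.01, 20_000_000))  # 1% chance per period
```

    With these invented inputs the redundant system is worth building at a 5% failure chance and not at 1%, which is the whole point of the formula: the answer depends on numbers only the hospital has.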

    In this instance, the hospital needs to thoroughly investigate how the downtime impacted patient care. If the access to records proved to be just an inconvenience, well... who cares. Paper systems might be slow, but they worked for centuries before computers came along.

    But if there were serious lapses caused by the outage, they need (at minimum) an isolated workstation that can access and print those records for distribution by hand. Parallel systems alone cannot guarantee 100% up-time. They'll apply the formula based on their own risks and loss control policies and make the decision.

  223. And on an unrelated note... by Radical+Rad · · Score: 3, Funny

    Mail any lucrative^h^h^h^h^h^h^h^h^h job offers to:

    Former MIS Director,
    Beth Israel Deaconess hospital
    Boston, MA 02215

  224. Re:Reliability is inverse to the number of compone by cerberusti · · Score: 1

    Don't worry, you will pass it eventually. (.01 = 1%)

    --
    I'm a signature virus. Please copy me to your signature so I can replicate.
  225. WRONG!: Re:Problem was with an application, by fanatic · · Score: 5, Informative

    No application can cause a spanning tree loop. It is simply impossible.

    A spanning tree loop causes broadcast frames - correctly used in small numbers in many different circumstances - to loop endlessly about the network (clogging it up), using paths that are provided for redundancy, but which are normally stopped from passing traffic by the "spanning tree protocol".

    There are 2 likely causes:

    Unidirectional link failure. If a connection between switches passes traffic in only one direction (normally they are bi-directional), then spanning tree can be 'fooled' into allowing traffic on a path that creates a loop and lets frames loop endlessly.

    Misconfiguration of switches, possibly combined with erroneous cabling. If spanning tree is configured off on a port, (or, maybe, put into a mode called portfast), it's possible for interconnection of switch ports (through a crossover cable or other means) to cause this to occur.

    A third possible cause is that the spanning tree software itself screws up and allows a loop when it shouldn't have. This was known to occasionally happen in Cisco switches some years ago. I haven't heard of it lately.

    This all happens way below the application layer. Unless the application is specifically written to send huge numbers of broadcast frames (there is no legitimate reason for an app to do this), it couldn't bring down the network. And even if it did, this would not be a 'spanning tree loop' and disconnecting the offending station would immediately fix the problem.

    Probably, the network should be using routers to partition it into smaller LANs. But this can still happen to any single LAN so created, and if it happens to the one your servers are on, you're still cooked.

    --
    "that's not encryption - it's a new perl script that I'm working on..." - from some Matrix parody
    1. Re:WRONG!: Re:Problem was with an application, by Anonymous Coward · · Score: 1, Informative

      Actually, broadcasts are not the only type of traffic that is flooded by a bridge. Multicasts in general are flooded, as well as unicasts for which the destination MAC address has not been learned.

      Building a separate infrastructure for "mission-critical" apps might be tough...is this only life-critical, or would that apply to the administrative functions, too? Besides the problem of deciding which functions the network should support, you have the problem that it is easy for someone to accidentally connect both networks together (i.e., if there is a person who has systems on both networks, and is re-wiring their cubicle and inadvertently connects the two networks to a common switch).

      Any large infrastructure like this should be subdivided at layer 3, on at least a building-by-building level, and perhaps floor-by-floor. If a subnet is larger than 2000 nodes, the likelihood of trouble rises quickly.
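
      As a sketch of what per-building subdivision looks like on paper, Python's standard `ipaddress` module can carve a campus block into smaller subnets. The 10.20.0.0/16 block and the /22 size are assumptions chosen for illustration, not anything known about the hospital's addressing.

```python
import ipaddress

# Hypothetical campus block; one /22 per building keeps each subnet
# well under the 2000-node mark.
campus = ipaddress.ip_network("10.20.0.0/16")
building_subnets = list(campus.subnets(new_prefix=22))

print(len(building_subnets), "subnets available")
for subnet in building_subnets[:3]:
    usable_hosts = subnet.num_addresses - 2  # minus network and broadcast
    print(subnet, "->", usable_hosts, "usable addresses")
```

      Each subnet then sits behind a router interface, so a broadcast storm or spanning tree failure in one building stays in that building.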

      Another issue with Spanning Tree is that if you have a new bridge plugged in to the network that manages to convince the other bridges that it is the root (through a poor selection of default values on the part of the vendor, or a pre-existing config that isn't applicable to this network, or a mis-configuration by the end-user), then it will be in the forwarding path of *all* the flooding-based traffic (see the list in my first paragraph above). In such a scenario, many network applications will fail: broadcast-based discovery protocols like ARP will probably break, because certain traffic will never even reach this switch when it can't make it onto the clogged links running upstream toward the root. And if ARP ain't happy, ain't nobody happy.
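
      The root takeover is easy to see in a small sketch. In classic 802.1D the bridge with the lowest bridge ID (priority first, then MAC address) becomes root, and the default priority is 32768. The switch names and MAC addresses below are made up.

```python
from typing import NamedTuple


class Bridge(NamedTuple):
    priority: int  # lower wins; 32768 is the 802.1D default
    mac: str       # tie-breaker; lower wins
    name: str


def elect_root(bridges):
    """Pick the root bridge: lowest (priority, MAC) pair."""
    return min(bridges, key=lambda b: (b.priority, b.mac))


# Hypothetical network where everything was left at the default priority.
existing = [
    Bridge(32768, "00:1a:2b:00:00:10", "core-1"),
    Bridge(32768, "00:1a:2b:00:00:20", "closet-7"),
]
newcomer = Bridge(32768, "00:00:0c:00:00:01", "old-lab-switch")

print(elect_root(existing).name)               # core-1
print(elect_root(existing + [newcomer]).name)  # old-lab-switch

# Giving the core a deliberately low priority keeps it as root.
pinned = [Bridge(4096, "00:1a:2b:00:00:10", "core-1"), existing[1], newcomer]
print(elect_root(pinned).name)                 # core-1
```

      With every switch at the default, whichever box happens to have the lowest MAC wins, and that is often the oldest and slowest one.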

    2. Re:WRONG!: Re:Problem was with an application, by khafre · · Score: 4, Informative

      Actually, it is possible for an application to cause Spanning Tree to fail. Most switches have a management port that allows remote access (via telnet, ssh, SNMP, etc.) to the switch. This management port is normally connected to its own VLAN isolated behind a router so user broadcasts & multicasts in another VLAN can't affect the switch CPU. This port can be overrun with broadcasts and multicasts from user applications provided both the user and the switch are on the same VLAN. If the switch CPU is consumed by processing broadcasts, it may not have enough CPU time available to process and forward spanning tree BPDUs. If a blocked port becomes opened, a switch loop could form and, BINGO, network meltdown.

    3. Re:WRONG!: Re:Problem was with an application, by Cramer · · Score: 1

      Except when the switch doesn't understand the layer 3 (and specific layer 2) protocol being used... I live in a Novell network; most of the switches (feeder switches) in the network treat IPX traffic as broadcast traffic. (Yes, Novell generates a lot of true broadcast stuff, but every single packet isn't a broadcast.)

    4. Re:WRONG!: Re:Problem was with an application, by Anonymous Coward · · Score: 4, Informative

      Third possiblity - and what I'd be confident is the initial cause.

      The amount of traffic the researcher was putting onto the network caused spanning tree hello BPDUs to be dropped.

      After a period of not receiving hello messages (20 seconds if memory serves), downstream devices believe the upstream device has failed, and decide to re-converge the spanning tree.
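
      For reference, a small sketch of the timer arithmetic using the classic 802.1D defaults (hello 2 s, max age 20 s, forward delay 15 s). The function names are only illustrative, and whether this network ran the defaults is an assumption.

```python
# Default IEEE 802.1D spanning tree timers, in seconds.
HELLO_TIME = 2
MAX_AGE = 20
FORWARD_DELAY = 15


def missed_hellos_before_timeout(hello=HELLO_TIME, max_age=MAX_AGE):
    """How many consecutive hello BPDUs must be lost before info ages out."""
    return max_age // hello


def worst_case_reconvergence(max_age=MAX_AGE, forward_delay=FORWARD_DELAY):
    """Age out the old root info, then pass through listening and learning."""
    return max_age + 2 * forward_delay


print(missed_hellos_before_timeout())  # 10
print(worst_case_reconvergence())      # 50
```

      So with defaults, about ten hellos in a row have to be dropped, and a port can then take up to 50 seconds before it forwards again.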

      During this re-convergence, the network can become partitioned. It is preferable to partition the network to prevent loops in the layer 2 infrastructure. Datalink layer frames eg ethernet, don't have a hop count, so they will loop endlessly - potentially causing further failures of the spanning tree protocol.

      Once the bulk traffic source is removed from the network, STP should stabilise within a fairly short period - 5 minutes or so - so there may also have been a bug in Cisco's IOS, which was triggered by this STP event.

      Alternatively, the network admins may have played with traffic priorities, giving this researcher's traffic a higher priority than STP messages and causing STP to fail.

      Radia Perlman has a good description of STP in her book "Interconnections, 2nd ed" - but then she should - she invented it.

    5. Re:WRONG!: Re:Problem was with an application, by Smoke_One · · Score: 1

      Can we say packet shaper and QoS! Then a rogue app could never have done this. Only poor network design can cause this. If you design the network correctly you do not need spanning tree protocol anyways. Only Cisco uses this troublesome protocol. Many poorly designed networks fall victim to it.

    6. Re:WRONG!: Re:Problem was with an application, by Anonymous Coward · · Score: 0

      You don't understand why STP is used, do you?

      You don't know about other vendors' equipment either. Every medium to high end switch, from any vendor, supports STP. If it didn't, it would be unlikely to make it onto anybody's purchase short list.

      As soon as you have at least one redundant layer 2 link in your network, you need STP to prevent forwarding loops. Forwarding loops are endless at layer 2 because layer 2 frames don't have a hop limit / TTL count field.
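
      A toy simulation makes the point. It floods one broadcast frame through three switches wired in a triangle, then repeats with one link blocked the way STP would block it. This is a deliberately simplified model (no MAC learning, no end stations), and the topology is invented.

```python
def flood(links, source, max_steps=10):
    """Flood one broadcast frame; return frames in flight after each step.

    Every switch forwards a received frame out of all ports except the one
    it arrived on. There is no hop count, so nothing ever expires a frame.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Each in-flight frame is recorded as (sender, receiver).
    in_flight = [(source, n) for n in sorted(neighbors.get(source, ()))]
    history = [len(in_flight)]
    for _ in range(max_steps):
        next_frames = []
        for sender, receiver in in_flight:
            for n in sorted(neighbors[receiver]):
                if n != sender:
                    next_frames.append((receiver, n))
        in_flight = next_frames
        history.append(len(in_flight))
        if not in_flight:
            break
    return history


triangle = [("A", "B"), ("B", "C"), ("A", "C")]  # redundant link, no STP
tree = [("A", "B"), ("A", "C")]                  # same cabling, B-C blocked

looped = flood(triangle, "A")
blocked = flood(tree, "A")
print("loop:   ", looped)
print("blocked:", blocked)
```

      In the triangle the frame count never drops to zero, while with the B-C link blocked the broadcast dies out after a single hop.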

      Many poorly designed networks actually work because of STP. A poorly designed network is likely to have un-intentional loops in it - STP prevents these loops.

      Poor network design should not have caused this - presuming it was acceptable for the researcher's traffic to be on the production network (which it shouldn't be).

      There is definitely a fault in this network - whether it is because of bad design, misconfiguration or a bug in Cisco's equipment is the question.

    7. Re:WRONG!: Re:Problem was with an application, by vawlk · · Score: 1

      so does 3com and with their latest switches, it is on by default.

      Just hope you don't run Appletalk on a network with spanning tree enabled. Not a good result.

    8. Re:WRONG!: Re:Problem was with an application, by Anonymous Coward · · Score: 0

      What a load of rubbish, how would you suggest that Packet shaper and QoS would resolve two layer two paths to the same destination?

      Hint they don't.

      If you do not have two paths to the same destination, how do you provide resiliency against a single point of failure?

      Hint you can't.

      I agree that STP can be a pain in the arse but unfortunately, there is nothing else that can perform a similar function.

      As for your assertion that only Cisco uses STP, I would suggest that you look at any other manufacturer of switches and check out their product specs. Nearly all will use STP, because as previously stated there is no alternative.

  226. IMHO by Anonymous Coward · · Score: 0

    Networks go down.
    You cannot always define the root cause.
    It is never one person's fault entirely.

    Every network will go through some type of major outage in its lifetime. That is unavoidable.
    When your network does go down the thing that matters most is what you learn from that outage. Hopefully afterwards you will use what you have learned to improve.

    It is not for us to judge the network as it existed, as we have no way of knowing all the variables contributing to its design. However, if nothing is learned or there is no improvement in the infrastructure, then you have most definitely left yourself open for comment.

  227. redundancy/complexity by to-vie-for · · Score: 1

    As a network engineer who works on spanning-tree daily, I can certainly appreciate the complexity of the situation. Spanning Tree in itself is a fairly easy protocol to understand. But when combining it with HSRP, VTP, trunking protocols, Ether(or gigE)channels, inter-VLAN routing, etc., things can quickly get out of hand. The problem most certainly could have been resolved quicker with proper documentation of the as-built network. Cisco is no slouch at solving these things. With all the manpower they put behind this, I'd have to say the team of StormTroopers sent by Cisco actually had to first document the as-is network before they could really pinpoint the problems. I don't really agree with the second additional network. Now we can have two broken things instead of one. Any time you increase your redundancy you also increase complexity. I think that this will serve as a good lesson learned for the IT staff there.

  228. Of course it is the best solution by mongre · · Score: 1

    Of course it is the best solution to make a fully redundant identical network. After all how else is Cisco going to maximize the profits?

    Here is a suggestion: why not contract some consultants who do not tie their paycheck to how much product they manage to convince you to buy from their employer?

    An independent consultant

  229. No.... by AriesGeek · · Score: 1

    You're referring to alternate physical paths. They are talking about a completely separate network. A very silly idea.

    --
    Insert offensive troll-style sig here. Please mod or respond appropriately.
  230. Here is a novel concept by Glonoinha · · Score: 1

    Between the Yorktown being lamed by a 0, to the hypothetical bridge with a 300 lb guy on it, to the Hospital's network being brought down by whatever ... somebody ... Somebody ... SOMEBODY knows the truth. The guy that did it. Somebody did something, and BANG! the system got fuxored.

    Instead of spending DAYS letting the corpse recovery crew autopsy the network - just say something. Admit that you screwed the pooch, admit it early and admit it often. Be eager to accept and admit that you fuxored the system and be eager to explain exactly what you did. (*)

    This does two wonderful things for you:
    1. Because they don't have to spend days finding someone to blame (because you eagerly accept the blame) and because they already know what the problem is (because you told them what you did) they can get it fixed in about 1/4th to 1/10th the time (because they already know what the problem is and don't have to dick around trying to figure out who to blame it on.)

    2. When something really, really, really bad happens (think the battleship Iowa main gun explosion, or the $1.3B lost in derivative trading by that banker in England, or Apollo 13) you have already established a history of eagerly admitting when you screwed up and eagerly accepting responsibility for your mistakes, so you basically get one 'get out of jail free' card. Just say 'hey I always eagerly admit it when I blow it - if this one was me I would have already said something.'

    (*) - Note : this only works in places where they let people make mistakes and don't destroy your future for them.

    --
    Glonoinha the MebiByte Slayer
  231. uh wtf by ealar+dlanvuli · · Score: 2

    Staff had to scramble to find old paper forms that hadn't been used in six years so they could transfer vital patient records and prescriptions.

    I call sensationalist bullshit. It takes at most 15 minutes to switch over to a fully paper hospital here.

    Either that or their hospital is really really shitty.

    --
    I live in a giant bucket.
  232. what other solution by Anonymous Coward · · Score: 0

    would you suggest? They could try and get some of that cool fairy dust the IBM commercial talks about but I am betting it is really hard to find

  233. Ahhh... that's it, you see! by sconeu · · Score: 2

    Well, that's it you see! Alan Ralsky thought it said spamming tree protocol and tried to use the network!

    --
    General Relativity: Space-time tells matter where to go; Matter tells space-time what shape to be.
  234. A second network? Maybe. by Brad_Silva · · Score: 1

    Having a second network available as a backup has several problems:

    One) How do you connect it to the systems? Move the wires? Impractical. The edge devices that actually connect to the systems have to be part of both networks.

    Two) Who's to say that it will not have the same design flaw as the primary? Have the second network be designed by a different individual, AND with different design goals. Similar goals can produce similar results. Have the second network be designed strictly as a backup.

    Anyway, in the vast majority of cases, rather than having a second network, you're probably better off having that second person review the work of the main person. Having and MAINTAINING a second network is only valid under a narrow range of situations. OTOH, perhaps a hospital is a high risk enough environment to warrant that.

    Personally, I think having a backup paper/whiteboard/people system (which they appeared to have), is the right solution, as this is also useful under emergency situations (earthquake, extended power outage, war, etc).

    My two-bits worth (was there ever a coin in America called a bit?)

    Brad

  235. I was stuck there by drwho · · Score: 2

    Well, this explains what happened when I was there after being hit by a truck. The doctors were great but the place was very disorganized. Hrm.

  236. Re:Contribution to causality responsibility by Anonymous Coward · · Score: 1, Funny

    It would be the fault of the fat person. You always blame anything on the fat person, because they're always the ones screwing the rest of us up.

  237. A common logical fallacy... by The+Ape+With+No+Name · · Score: 3, Insightful

    ... And one that is hard to argue with because it seems to make so much sense is post hoc, ergo propter hoc. For something to be a valid proposition, it must meet two conditions, necessity and sufficiency. When someone pulls an "It happened after that happened" trick to pin blame, they are meeting the necessary condition with the apparent causal relation of actions. This is the stronger condition intuitively for people. But under the sufficient condition, we must show that there is evidence to support the causal relationship. Supporting a claim is counterintuitive. Just ask any foreign policy maker in the US...

    --
    Comparing it to Windows will be a moot point, since El Dorado is going to have a 40% larger code base than XP.
  238. Re:The Israeli Way by Anonymous Coward · · Score: 0

    I don't know what they do with their M-16 rifles in Israel but if I had one... I'd stove it up your ass so your head could some company.

    Stupid fuck...

  239. sounds like they need a better network! by epidemic99 · · Score: 1

    I would say it sounds like they need a better network, not necessarily another network. To me it is silly and unnecessary to build a whole other "backup" network. I would simply upgrade their current infrastructure. I would upgrade servers, get more bandwidth (multiple T1's or T3 if necessary), get better routers/switches, etc.

  240. Hello? Anyone out there know what WiFi is? by ejoe_mac · · Score: 1

    While it wouldn't have allowed all workstations back online, throwing down a Cisco WiFi network within the buildings to create an "emergency network" would have taken a number of hours, and gotten enough of the net back to allow for patient tracking and record keeping.

    First one to say security would have been broken in a short time evidently hasn't used the automated rolling WEP implementation Cisco has.

    I'm surprised Cisco didn't have a LAN/WAN setup in a crate, complete with servers to handle the authentication, sitting somewhere ready to deploy in an emergency (think 9/11).

  241. In SOVIET RUSSIA by Anonymous Coward · · Score: 0

    ...networking glitch is brought down by hospital!!!

  242. No contingency planning by CormacJ · · Score: 2

    I was an operations manager for a large hospital for several years, and planning for things such as this should be a number one goal for IT staff.

    The first rule in anything to do with hospitals is to ensure that they have disaster plans in place and that these are tested on a regular basis. The disaster plans should include scenarios such as total power outage, failures of vital equipment etc.

    The second rule I used was to ensure that in critical areas there was a second independent network path that if needed could be isolated from the rest of the network. Usually this meant putting in a run of fibre that bypassed buildings etc.

    The third rule is to ensure that vital equipment can be run without need for a network. Nothing should be so dependent on networking that a failure will stop it from working. If networking is a requirement (e.g. Medical Imaging) that network should be independent from the main network.

    The fourth rule is to ensure that there is a secondary method of accessing electronic patient records in the event of an extended downtime. I wrote an application that would dump the most needed patient information and leave it available on PC's in critical areas in query only mode. This allowed access to most of the patient details for using the patient forms.
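
    A minimal sketch of what such a "query only" local dump could look like, assuming a SQLite file on each critical-area PC. This is not the application described above. The `patients` table, the `FIELDS` column list and both function names are hypothetical, chosen only for illustration.

```python
import sqlite3
from pathlib import Path

# Hypothetical column list: the "most needed" fields for filling in paper forms.
FIELDS = ("patient_id", "name", "allergies", "current_meds")


def write_snapshot(live_conn, snapshot_path):
    """Copy only FIELDS from the live `patients` table into a local file."""
    columns = ", ".join(FIELDS)
    placeholders = ", ".join("?" for _ in FIELDS)
    snap = sqlite3.connect(snapshot_path)
    try:
        with snap:  # commit on success, roll back on error
            snap.execute("DROP TABLE IF EXISTS patients")
            snap.execute(f"CREATE TABLE patients ({columns})")
            rows = live_conn.execute(f"SELECT {columns} FROM patients")
            snap.executemany(
                f"INSERT INTO patients VALUES ({placeholders})", rows
            )
    finally:
        snap.close()


def open_query_only(snapshot_path):
    """Open the snapshot read-only so ward staff can query but not modify it."""
    uri = Path(snapshot_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

    The point of the design is that the snapshot lives on the local disk, so it stays readable when the network is down, and opening it with `mode=ro` makes any write attempt fail rather than silently diverge from the real records.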

  243. beat him by Anonymous Coward · · Score: 0

    beat him, harshly and thoroughly

  244. what ever happened to TTL? by pastorBernie · · Score: 1

    While it is true that an application could not have caused this problem, is it possible that a poorly designed application allowed the problem to continue, where a well-written application might have prevented it from being triggered? For example, by setting the TTL too high. If this had been taken into account correctly, wouldn't that have prevented the whole problem in the first place?

    1. Re:what ever happened to TTL? by zzyrc · · Score: 1

      Switching doesn't alter the packets

    2. Re:what ever happened to TTL? by Ian+Peon · · Score: 2

      To elaborate on what zzyrc said, TTL won't decrement when a packet passes through a typical layer 2 switch - only through a router or other layer 3 device.
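
      A rough illustration of that point, with a made-up `forward` helper and device labels that are purely hypothetical: the TTL in the IP header only changes at routers, so no TTL value chosen by an application limits how long a frame can circulate among switches.

```python
def forward(path, ttl):
    """Walk one IP packet along `path`, a list of "switch" / "router" labels.

    Routers decrement the TTL and drop the packet when it reaches zero.
    Layer 2 switches forward the frame without touching the IP header.
    Returns ("dropped", index of the router that dropped it) or
    ("delivered", remaining TTL).
    """
    for index, device in enumerate(path):
        if device == "router":
            ttl -= 1
            if ttl <= 0:
                return ("dropped", index)
    return ("delivered", ttl)


# A thousand switch hops cost nothing, even with the smallest possible TTL.
print(forward(["switch"] * 1000, ttl=1))                         # ('delivered', 1)
# Two router hops exhaust a TTL of 2.
print(forward(["switch", "router", "switch", "router"], ttl=2))  # ('dropped', 3)
```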

  245. Re:What is spanning tree protocol? (google whoring by Anonamused+Cow-herd · · Score: 1
    Note to poster:

    Google whoring works better when you log in, wussy =).

    Cheers,
    ~Tris.

    --
    -----[0_o]-----
    We are not amused.
  246. low budget by Anonymous Coward · · Score: 0

    I am also planning a hospital network and what we have is network problems because we are out of money. We can not afford one decent network !!!!

    You guys are talking about production environments with big budgets ...

    I think our patients should pray often ;))

  247. Redundant Network by merky1 · · Score: 1

    Just use twin co-ax... there should be plenty of that lying around these days.

    --
    --WooooHoooo--
  248. 9/11 example by queensac · · Score: 1

    I work for a company that supplies real-time market data information. Our office is just 3 blocks from the World Trade Center. The data center is actually just across the street from the WTC. On 9/11 this data center was damaged and unusable, but since we had a duplicate data center in NJ, we were able to be up and running a week later. We could have started sooner, but the markets were closed, so there was no data to send out. We also have three separate environments: development, staging and production. Only QA approved software is allowed to run in the production environment. A hospital should have a backup network in case of catastrophe.

  249. Short answer: Yes. by slasher999 · · Score: 1

    I'm sure someone has already pointed this out, but I didn't feel like reading - or even scanning - the 400+ posts. Sorry, it's a lazy day.

    Anyhow, yes, having a second identical network would make sense if it is affordable. This would be a test lab. However, would you want to recreate every node that is on the production net? Probably not.

    In this case it wasn't the network that failed, but a single application that generated a ton of network traffic when it was opened. Reminds me of that old poem about computers not doing what the user wants, but only what it's told. Don't blame the net for bad software.

  250. Re:Contribution to causality responsibility by dusanv · · Score: 1

    Would it be fair to say that the bridge collapsed because a 300 lb man was on it?

    CowboyNeal struck again... Of course it's his fault, he does it just for fun, the sadistic bastard.

  251. In other news... by GarryOwen · · Score: 1

    BIDMC just recently announced they had job openings in the field of networking...

  252. These guys got off easy! by raehl · · Score: 3, Funny

    The last time I had a problem with a spanning tree algorithm I lost 12 points on my CS final!

    Ok, so seriously, I'd be embarrassed if I screwed up a spanning tree algorithm on a test. If it took Cisco engineers 6 days to fix it, it musta been something really quirky, most likely the software not configuring something right. I can't imagine an application problem that would hose a network past a power toggle.

  253. People get what they pay for. by nikpieX · · Score: 1

    This is what happens when people do not want to pay the money for quality network engineers. If you're not willing to invest in your network, it ends up becoming a kludge. Education is the key. If your network engineers aren't knowledgeable enough to solve an STP problem and have to rely on Cisco's TAC (many of whom are just as unknowledgeable), then you're walking a thin line. I realize hospitals are low on money, but any mid-level engineer should have been able to solve this problem within a few hours.

  254. Sounds right. by twitter · · Score: 2
    The "backup" network should look different from the first so that it is not susceptible to common mode failure. It should be simpler, learning from the last accident, backing up the most important and difficult to replace segments. The Boston article mentions lab results. One way to back up the network is to have a simplified link from the lab to several key locations. "Non essential" functions and other less heavy stuff might just have to do without the backup. It might be inconvenient to walk down a hall or a flight of steps to get info, but that beats everyone having to go to a different building.

    The above is specious. I know nothing about the network or campus in question. I'm sure the folks on hand know what to do. Good luck.

    --

    Friends don't help friends install M$ junk.

  255. Manual failover is always better in my book. by NinjaWorm · · Score: 1

    This article is the very reason why I argue against using spanning tree. I have seen many similar outages. It does not help when the very thing that is there to help prevent the system from going down can cause it to go down.

    I always opt to have an identical switch that I can fail key systems over to manually.

    Spanning tree caused them to be down for days.
    Why not build a switched network with no loops? Then if a switch fails it only affects the systems on that switch. And if you have the budget there will be a second identical switch powered up right under it in the rack. You unplug all the patches and plug them into the other switch. Down time = way less than days. The network becomes far simpler, ergo much simpler to maintain and fix.

    Just my 2 Cents.

  256. Sure, and while we're at it!! by cybercomm · · Score: 3, Funny

    Why not buy M$ wireless 802.11b install W2K/XP on every computer and set up an MS exchange server. Who needs BSD when you have M$ :)

    just kiddin', the uptime of the above mentioned network would be measured in nanoseconds, and then they would have to switch to the MS paper'n'pen method

    --
    Live for the present, learn from the past, and dream of the future!
  257. There is a question about redundancy? by Steepe · · Score: 1

    First off.. They should hire me.

    Whoever designed a network of that size without redundancy in the first place is just stupid beyond compare.

    If they say.. Oh.. it was the finance people who said we could not have the money for redundancy.. Then you jump your prices in your quote in the first place and tell them it's single homed and build it redundant anyway.

    Spanning tree has no forgiveness in it at all, probably someone put in a bad route or something and everything exploded.

    I have designed many networks, and ALL of them have at least SOME level of redundancy.. most are complete hardware mirrors, but some are just extra paths or just extra cards in the switches to move cables to in case of x y or z problem.

    Most likely someone entered bad data into one single switch somewhere and it took Cisco forever to find it.. and of course the guilty party didn't want to admit to doing it because he knows it's his job.

    --
    Just three more hours seapeople and you can finally take me away from this crappy God Damned planet full of hippies
  258. It's all about the Benjamins by sjbe · · Score: 5, Insightful

    My wife is a doctor. From what I've observed hospitals tend to be penny wise and pound foolish, particularly with regard to their computer systems. Largely for financial reasons they are generally unwilling to hire the IT professionals and spend the $ they need to do the job right.

    The computer systems at my wife's medical school were apparently run by a herd of poorly trained monkeys. Systems would crash constantly, admin policies were absurd, and very little was done to fix anything. At her current hospital, the residents in her department are stuck with machines that literally crash 10+ times daily. Nothing is done to fix them because that would take expertise, time and $, all of which are either in short supply or withheld.

    Hospitals really need serious IT help and it is a very serious problem. This article just illustrates how pathetically bad they do the job right now. I wish I could say I was surprised by this but I'm not.

    1. Re:It's all about the Benjamins by passion · · Score: 2

      Well, then to take the optimistic view, I guess that crashing 10+ a day isn't that bad an occurrence... that way, they don't develop an ultimate dependence on a system, and when it crashes, it's an annoyance instead of a mission-critical failure.

      --
      - passion
  259. Yes, but.. by Inoshiro · · Score: 2

    The kidneys are internally redundant. You only need 10% kidney function to continue to survive. Ditto for liver and other organs (aside from heart). They take years of abuse via smoking or drinking before they finally start to wear out to the point of causing system collapse.

    --
    --
    Internet Explorer (n): Another bug -- that is, a feature that can't be turned off -- in Windows.
  260. Obligatory Spanning Tree Poem by crotherm · · Score: 2

    Algorhyme

    I think that I shall never see
    A graph more lovely than a tree.
    A tree whose crucial property
    Is loop-free connectivity.
    A tree that must be sure to span
    So packets can reach every LAN.
    First, the root must be selected.
    By ID, it is elected.
    Least-cost paths from root are traced.
    In the tree, these paths are placed.
    A mesh is made by folks like me,
    Then bridges find a spanning tree.

    ---Radia Perlman
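
    The steps in the poem (elect the root by lowest ID, trace least-cost paths from it, keep those links and block the rest) can be sketched in a few lines. This is a centralized simplification for illustration only: real bridges work this out in a distributed way by exchanging BPDUs, and real tie-breaking also involves port IDs. The function name, the `(a, b, cost)` link format, and the assumption that bridge IDs are non-negative integers on a connected network are all made up here.

```python
import heapq


def spanning_tree(links):
    """Return (root, forwarding links, blocked links) for a bridged network.

    `links` is a list of (bridge_a, bridge_b, cost) tuples. Parallel links
    between the same pair of bridges are treated as one link.
    """
    adjacency = {}
    for a, b, cost in links:
        adjacency.setdefault(a, []).append((b, cost))
        adjacency.setdefault(b, []).append((a, cost))

    root = min(adjacency)  # "By ID, it is elected": lowest bridge ID wins

    # Dijkstra from the root. Ties on cost go to the lower upstream bridge ID
    # because heap entries compare as (cost, upstream, bridge).
    upstream_of = {}
    heap = [(0, -1, root)]
    while heap:
        cost, upstream, bridge = heapq.heappop(heap)
        if bridge in upstream_of:
            continue
        upstream_of[bridge] = upstream
        for neighbour, link_cost in adjacency[bridge]:
            if neighbour not in upstream_of:
                heapq.heappush(heap, (cost + link_cost, bridge, neighbour))

    forwarding = {
        frozenset((bridge, upstream))
        for bridge, upstream in upstream_of.items()
        if upstream != -1
    }
    blocked = {frozenset((a, b)) for a, b, _ in links} - forwarding
    return root, forwarding, blocked


# Three bridges in a triangle: the link between 2 and 3 ends up blocked.
root, forwarding, blocked = spanning_tree([(1, 2, 4), (1, 3, 4), (2, 3, 4)])
print(root, sorted(map(sorted, forwarding)), sorted(map(sorted, blocked)))
```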

    --
    "Those who make peaceful revolution impossible, make violent revolution inevitable" - JFK
  261. Re:That's why I hate automatic routing by Swannie · · Score: 1
    We're not talking about a small company here, we're talking about a hospital that probably charges its clients (read: their health plans) tens of thousands if not hundreds of thousands of dollars for procedures and operations. I think the least they can do is dump some money into their network, and the resources that support their network. Especially since they appear to rely on it for life-or-death transactions like patient records.


    I certainly agree with you, and I don't expect small companies to hire a team of network engineers with 6 figure salaries to handle their network. The small to medium sized company probably uses the network for file sharing, email, and internet access; certainly they could get by for a few days if they had to. On the other hand, this hospital couldn't access patient records because their network failed. So, if I'm a nurse and my patient is in pain, how do I find out if I can give him or her morphine? Has he/she had some already? What if he/she is allergic? All this (I assume) is in their patient file, on the network, which they can't access.


    Swannie


    Moderation totals: -1:smug ;)

    --
    :q!
  262. Re:Reliability is inverse to the number of compone by Anonymous Coward · · Score: 0

    If I might be so bold as to pose an alternative probability of failure.... given that if one train uses the track and it's P=10%, then if a 2nd train is added going the opposite direction, and if both trains use the same track, then the probability of failure is 100%, as they will collide.

  263. Executives working? by wandernotlost · · Score: 3, Funny
    Senior executives were reduced to errand runners as the hospital struggled with moving information around the campus.

    It's always nice to see those people doing useful work for a change.

  264. Terminal Servers may save them by urbieta · · Score: 1

    Instead of making parallel networks, they can simply use parallel servers with the Linux Terminal Server Project; if a server dies, the other one will take over operations 8)

  265. Standard UPS by Bios_Hakr · · Score: 2

    Sounds like a standard UPS system to me. You have the grid feeding banks of batteries. The batteries feed the hospital. The generators are between the grid and the batteries, but they are not wired in such a way as to allow a generator failure to disrupt power from the grid. If the grid fails, no one notices because the batteries are what feed the hospital. After a few minutes, the generators start and they keep the batteries full. Once the grid is back on, the generators shut down.

    --
    I'd rather you do it wrong, than for me to have to do it at all.
  266. Why Spanning Tree can be bad by haruchai · · Score: 1

    Because Cisco can't fucking get it right, especially where multiple VLANs are concerned.
    Search cisco.com for "spanning tree caveats"; filter the results by IOS release versions and check the number of open or unresolved caveats for which there is no workaround.
    It shouldn't take you more than a week to go through them all.

    --
    Pain is merely failure leaving the body
  267. So true by ipjohnson · · Score: 1

    So true. Redundancy is king.

    And as for testing ... well lets just say my job is only 20-30% code and the rest is test and requirement.

    Oh and if that wasn't enough, if all else fails with our system there is a separate fall back system (written by another contractor) that will step in and take over the displays.

    The only nitpick thing I have is that a sub system in Standby mode quite often will actually do its own processing of the data because if one machine corrupts the data you still have one good box. Only when you first bring a redundant box to standby ready will you actually see a data synch.

    Your post reminded me of someone on /. a while back talking about open source ATC. I laughed at him then and I'm still laughing now :)

    1. Re:So true by mekkab · · Score: 2

      I wonder if that open source ATC comment was for that UK airspace shutdown on May 17th...

      It's nothing open source could fix...
      he shouldn't worry though, we've put a fix in for that (Works damn well, too!)

      --
      In the future, I would want to not be isolated from my friends in the Space Station.
  268. FYI by ipjohnson · · Score: 1

    When ATC systems go down they route traffic around the down sectors because 1500 tracks in a small airspace is impossible to control safely without computer systems.

    1. Re:FYI by gorf · · Score: 1

      I don't know about military operations, but in the UK, AIUI, they don't use computer systems in the first place.

      They're busy putting one in, but AFAIK it isn't operational yet, and has been plagued with problems.

  269. Sad but true by sjbe · · Score: 2

    Actually there is more truth to that than you know. They can't keep any files locally and simply have to not rely on the systems for anything critical. Recently they had their computers taken away for 3 weeks (refurbishing offices), which was a terrible inconvenience, but it didn't bring work to a halt. Just made everyone's lives harder than they had to be.

  270. Union "help" by ces · · Score: 3, Insightful

    Most union tradespeople I've encountered do actually take pride in doing their jobs right and well. You just have to realize that even the best ones won't generally work any harder than the work rules require them to.

    My advice is to get to know any tradespeople you may have to deal with on a regular basis for things like electrical work, moving furniture, etc. It's amazing how far just treating them as fellow skilled professionals will get you. Resorting to bribery (aka "gifts") can also help. If you give the union electrician a bottle of nice scotch or a box of cigars when he adds some new circuits in the server room he is much more likely to come out at 3am on a Sunday morning when you need him NOW.

    --
    Happy Fun Ball is for external use only.
    1. Re:Union "help" by anjrober · · Score: 1

      This is one of the many problems with Unions. If they expect to be treated like professionals they have to stop acting like children. Professionals do their best whether their clients are treating them like shit or like saints. Professionals stay late and don't get paid a cent more, when the job calls for it. Professionals do a good job without being bribed. Unions are economic disrupters and should be dissolved. These are not skilled professionals, these are thugs.

    2. Re:Union "help" by ces · · Score: 2

      That may be true, but if you treat the tradespeople like shit they will act like cretins.
      The problem is often the only way to get decent, prompt, and/or after-hours service from union trades is by getting them to want to help you. This is accomplished by making friends with them and bribery.

      Unfortunately life often requires you to go out of your way to be nice to people who really don't deserve it.

      --
      Happy Fun Ball is for external use only.
  271. Testing the backup network by Skapare · · Score: 2

    And how will you know if the backup network even works? Of course you could test it. But will it work under the kind of extreme live stress that would take down the primary network? And what if the issue is simply load that neither network can fully handle? Could you run both networks in tandem correctly? It sounds to me like the original problem was that the network was designed by someone who thinks of the switches as magical black boxes that will take care of everything ... someone that assumes perfect abstraction. That 3 million dollars to build a parallel network I think could be better spent by hiring competent people to build a correct network that includes redundancies structured in the right places. No matter what you do, there will be some single points of failure, such as the very logic used to switch over to the backup network if that's what you have (which would be a big waste if it sat there idle). The network engineering people need to know and understand those single points of failure and have plans to deal with failures at those points.

    --
    now we need to go OSS in diesel cars
  272. Could it have been... by Anonymous Coward · · Score: 0

    802.1x? If they were running old CatOS code 802.1x packets from an XP box or other OS running Port Based Network Access Control could have killed the network. The MAC address used falls into the range that the switch thinks is Spanning Tree. It gets forwarded out of all ports, and the levels build up until the network grinds to a halt.

    It sounds like they need to put a number of routers in and break the Spanning Tree domain into small chunks - and ensure they're running code that copes with 802.1x, or put in the known workarounds.
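
    For reference, the address overlap being described can be checked in a few lines. STP BPDUs are sent to 01:80:C2:00:00:00 and 802.1X EAPOL frames to 01:80:C2:00:00:03, and both sit in the block 01:80:C2:00:00:00 through 01:80:C2:00:00:0F that IEEE 802.1D reserves and that compliant bridges are not supposed to forward. Whether old CatOS code mishandled these frames in the way described above is the poster's claim, not something this sketch shows. The helper name is made up.

```python
STP_BPDU_ADDRESS = "01:80:c2:00:00:00"   # Bridge Group Address used by STP
EAPOL_PAE_ADDRESS = "01:80:c2:00:00:03"  # 802.1X port access entity address

RESERVED_PREFIX = [0x01, 0x80, 0xC2, 0x00, 0x00]


def is_reserved_bridge_address(mac):
    """True if `mac` lies in 01:80:C2:00:00:00 - 01:80:C2:00:00:0F."""
    octets = [int(part, 16) for part in mac.replace("-", ":").split(":")]
    return (
        len(octets) == 6
        and octets[:5] == RESERVED_PREFIX
        and octets[5] <= 0x0F
    )


print(is_reserved_bridge_address(STP_BPDU_ADDRESS))     # True
print(is_reserved_bridge_address(EAPOL_PAE_ADDRESS))    # True
print(is_reserved_bridge_address("ff:ff:ff:ff:ff:ff"))  # False
```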

  273. Typical cisco by Anonymous Coward · · Score: 0

    sounds like the typical cisco response....

  274. Umh, He Has a Masters In Med. Informatics From MIT by Anonymous Coward · · Score: 0


    Bullshit. John Halamka is exceptionally qualified. He has written Books named Real World UNIX and Best of CP/M.

  275. Should they build a identical parallel network? by ndnet · · Score: 1

    While building an identical network is a nice idea, it's silly. Instead, start using WiFi. Also, compartmentalize this network, i.e., separate nodes so that if inventory ordering has a problem then personnel and radiology don't go down.

    If a bad app comes up or a virus infestation occurs, have a duplicate server ready with the latest safe backup data. Also, have all clients off until a technician can make sure that each individual client is safe to bring back on. Start with mission critical systems, like radiology, patient records, etc.

    The benefit of this approach is that first, you can set it up so that a client only connects when said connection is needed, rather than persistently.

    Second, it's pretty easy to kill wireless access even against backdoors. There are no passwords, no need to unplug each server - all you do is cut power to each access point. Since it's compartmentalized, you may not even have to kill every system.

    Third, you have an excuse to transition to WiFi, which, if you manually add another software layer of security, is a Killer App for hospitals, provided it doesn't have cell phone-like interference.

    Furthermore, you can keep the ethernet up as a backup solution. Set up a separate honeypot on each to help keep records secure. The WiFi honeypot will deter wardrivers, and the wired one will deter malicious people from using the wired connections in hidden locations - which will be plugged into the wall but disabled at the regular server level and in each client.

    Obviously, this isn't practical for my local Northwest Medical Center which has 200 beds at the most, but for a large urban hospital this type of flexibility, simplicity and redundancy shouldn't be considered handy, it should be considered the rule, if not even the rule of law.

  277. Says it all in the article... by mrmud · · Score: 1

    In fact, on Oct. 1, hospital officials had approved a consultant's plan to overhaul the network - just not quite in time.

    They knew there was a problem, but as with anything, they decided to wait. Case Closed. Don't blame the engineers, blame the people who decided it wasn't important enough to overhaul.

    --
    -- MrMud
  278. Re:Just a prediction by Anonymous Coward · · Score: 0

    Just a prediction. After spending countless millions on this super backup network, the same scenario will occur again. The now-prepared admins will transfer the old network to the new, only to find that the application that brought down the old network works just as well on the new one. Can anyone say resource limits? Why can one user put such a heavy load on the network that it brings the network down? Why can one network segment put such a heavy load on another segment as to bring down the entire network? And even more importantly, how will increasing the bandwidth or adding a backup network resolve this problem? Perhaps a better approach would be to look at methods of controlling bandwidth usage.
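
    One common way to "control bandwidth usage" per user or per segment is a token bucket. The sketch below is only an illustration of that idea, not anything the hospital or Cisco is known to have deployed; the class name, the rate, and the burst size are all made up for the example, and time is passed in explicitly so the behaviour is easy to follow.

```python
class TokenBucket:
    """Minimal token-bucket limiter (hypothetical names and numbers)."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens (e.g. bytes) refilled per second
        self.capacity = capacity  # largest burst the bucket will allow
        self.tokens = capacity    # start with a full bucket
        self.last = 0.0           # time of the previous check, in seconds

    def allow(self, size, now):
        """Return True if `size` units may be sent at time `now`."""
        elapsed = now - self.last
        # Refill for the time that has passed, but never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False  # over the limit: drop or queue the traffic


bucket = TokenBucket(rate=100, capacity=200)
print(bucket.allow(150, now=0.0))  # fits inside the initial burst
print(bucket.allow(100, now=0.0))  # only 50 tokens left, so refused
print(bucket.allow(100, now=1.0))  # one second of refill makes room
```

    The point of the design is that a single heavy sender runs out of tokens and gets throttled, instead of starving everyone else on the segment.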

  279. Basic Network design by pwalenta · · Score: 1

    I've been doing network implementations long enough to realize one very important thing: the less spanning tree in a network, the more stable the network. This is one of the reasons layer 3 switching has become so cheap. Most people just don't take the time to use it. The largest network I've built (over 12,000 ports) hasn't lost a day of uptime in the past two years because it's all layer 3! Admittedly, network administration has a part to play in this, in that most IT departments think they're being sold a layer 3 switch just so a vendor can sell a more expensive switch. In reality, layer 3 = stability.

  280. Redundant Network... by IOOOOOI · · Score: 1

    ... yeah I think it's a fabulous idea. STP would prevent loops and... oh... never mind.

  281. Sounds like the Windows XP "Network Bridge" by Anonymous Coward · · Score: 0

    This sounds like the same thing that has been going around, although the inability to recover is astounding.

    Windows XP (Home and Professional) includes a feature called the "Network Bridge". Many people think this is nothing new, since NT could do IP forwarding (basic routing with RIP), but XP includes an 802.1d transparent bridge with the spanning tree algorithm. This has been bringing down dorm nets, because a student with XP on a laptop, with ethernet and an 802.11b WiFi adapter, can easily and inadvertently create a bridge and cause a bridge loop. Although XP supposedly includes support for the spanning tree algorithm, the number of problems out there suggests that it is either buggy, or the wireless access points don't support it properly.

    IMHO, NO ONE needs a transparent bridge, certainly not as a default option when adding a second adapter through the "Network Setup Wizard". At the least, there should be a popup that says "Are you sure, this might bring down your campus network..."

    If you manage Windows XP machines, your only recourse is to add the following registry key to your laptops to disable forwarding if a bridge gets configured:

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\BridgeMP]

    "DisableForwarding"=dword:00000001
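
    To see why a bridge loop is so destructive, and what spanning tree is supposed to do about it, here is a toy simulation. It is a rough sketch under simplifying assumptions: the three-bridge topology and all names are invented, and the tree is built with a plain breadth-first search from a chosen root rather than by real 802.1d BPDU exchange with bridge IDs and path costs.

```python
from collections import deque


def build_neighbors(links):
    """Turn a list of (bridge, bridge) links into an adjacency map."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors


def flood(links, source, max_steps=20):
    """Flood one broadcast frame; return frames in flight after each step."""
    neighbors = build_neighbors(links)
    in_flight = [(source, None)]  # (bridge holding the frame, arrived from)
    history = []
    for _ in range(max_steps):
        nxt = []
        for bridge, came_from in in_flight:
            # A bridge forwards a broadcast out of every port except the
            # one it arrived on. There is no TTL at layer 2.
            for peer in neighbors.get(bridge, ()):
                if peer != came_from:
                    nxt.append((peer, bridge))
        in_flight = nxt
        history.append(len(in_flight))
        if not in_flight:
            break
    return history


def spanning_tree(links, root):
    """Keep only the links of a breadth-first tree rooted at `root`."""
    neighbors = build_neighbors(links)
    seen = {root}
    tree = []
    queue = deque([root])
    while queue:
        bridge = queue.popleft()
        for peer in sorted(neighbors[bridge]):
            if peer not in seen:
                seen.add(peer)
                tree.append((bridge, peer))
                queue.append(peer)
    return tree


triangle = [("A", "B"), ("B", "C"), ("C", "A")]  # hypothetical looped bridges
print(flood(triangle, "A", max_steps=10))        # never drains
print(flood(spanning_tree(triangle, "A"), "A"))  # drains once the loop is cut
```

    In the looped case the frame count never reaches zero, which is the storm. With the redundant link blocked, the same broadcast dies out after reaching every bridge once.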

  282. failover by Anonymous Coward · · Score: 0

    Slashdotters would already know that the technology's already there to have the failover for this. The problem has to do with planning and ongoing management.

    Obviously, during the planning stages of a network, it needs to be decided whether a service is critical enough to need some sort of failover. This needs to be revisited as the network grows and is depended on for more services.

  283. Stop PRETENDING by Anonymous Coward · · Score: 0

    I know that if this happened to my place of work then...

    1. It'd be an IT failure, not a technology failure or a user error.

    2. My CIO would be on the street. Period.

    Sorry, technology be damned - our jobs are technology and production quality. The fact that "one user" could be the straw that breaks the network PROVES major flaws in their network design. Flaws that MUST have been well known by the CIO.

    Of course, CIOs are busy, and they can't have all the answers. That's why CIOs should hire RESPONSIBLE PEOPLE and have THIRD PARTY ACCOUNTING of their critical systems - which, of course, includes their network design.

    Stop pretending to be a CIO and be RESPONSIBLE. Stop blaming users. Stop blaming technology. Stop blaming your vendor's questionable firmware. Take responsibility and build quality in your network design.

    Pretenders are EVERYWHERE. Even CIOs are pretenders.

  284. Corrections From John Halamka by good+soldier+svejk · · Score: 1

    John asked me to dispel a few misconceptions about the Caregroup/BIDMC network. He wrote and I quote:


    Since there is so much chatter about a parallel network that is just wrong, could you post an entry (feel free to use my name) that states:

    1. Yes, our network had a flat topology which was required by multiple legacy apps. Our network downtime was caused by having a layer 2 based enterprise network with VLANs that crossed the core.

    2. We restored the network by moving portions of the network from switched to routed and eliminating all loops.

    3. Over the next 6 months we will retire all the offending legacy apps and move to a network that is routed at the distribution and core layers.

    4. There was never an attempt or design to build a parallel network. Cisco flew in a few spare 6509's and configured them just in case our existing 5509's were unable to handle the traffic of our new topology. Since the 6509's were configured by a different team than the team fixing the existing 5509's, this was work done in parallel. The press described this as a parallel network, which it was not.

    I hope this resolves the controversy. CIO magazine will devote its February issue to an in-depth look at the architectural flaws in the network. NPR will have a major story next week about the human side. Our hope is that we can broadly share our lessons learned so that other institutions that grew by merger and acquisition will examine their networks and eliminate any flat topologies, preventing the kind of downtime we experienced.
    --
    It is cowardly, and a betrayal of whatever it means to be a Jew, to act as a white man

    -James Baldwin
  285. The blame game by Anonymous Coward · · Score: 0
    Q. Why does this network depend on spanning tree?

    A. SMB

  286. No wonder! by PerryMason · · Score: 2

    Meanwhile, the hospital was figuring out how to run at its usual pace without the 100,000 e-mails it usually sends a day.

    So that's where they're doing all those penis enlargements!

    --
    "I'm tired of all this 'Aren't humanity great' bullshit. We're a virus with shoes" - Bill Hicks
  287. talk about repeating the problem by stinky+wizzleteats · · Score: 2

    Build a second parallel network because the network designers didn't know wtf they were doing? How are you going to fail over to this network? STP? (insert obnoxious chortle here)

    10 bridged hops = big flat network = they needed layer 3 switching in the first place, ergo, the network was badly designed. The very fact that a root bridge STP reconverge occurred indicates a poorly framed implementation plan and obviously no backout plan.

    Find somebody who knows what the hell they are doing and have them do a network audit.
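
    As a small taste of what such an audit might check: the commonly cited 802.1D guideline is that, with default timers, no two bridges should be more than seven bridges apart. The sketch below counts bridges along the longest shortest path of a topology and compares that to the guideline. The ten-switch chain and all names are invented for illustration, and the limit of 7 is my assumption about the standard's recommendation, not a figure from the article.

```python
from collections import deque

# Commonly cited 802.1D guideline for default timers (assumed here).
RECOMMENDED_MAX_DIAMETER = 7


def build_neighbors(links):
    """Turn a list of (bridge, bridge) links into an adjacency map."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors


def bridges_on_longest_path(links):
    """Largest number of bridges on any shortest path, endpoints included."""
    neighbors = build_neighbors(links)
    longest = 0
    for start in neighbors:
        dist = {start: 1}  # count the starting bridge itself
        queue = deque([start])
        while queue:
            bridge = queue.popleft()
            for peer in neighbors[bridge]:
                if peer not in dist:
                    dist[peer] = dist[bridge] + 1
                    queue.append(peer)
        longest = max(longest, max(dist.values()))
    return longest


# Hypothetical flat network: ten switches daisy-chained, sw1 through sw10.
chain = [(f"sw{i}", f"sw{i + 1}") for i in range(1, 10)]
diameter = bridges_on_longest_path(chain)
print(diameter, "bridges; over the guideline:",
      diameter > RECOMMENDED_MAX_DIAMETER)
```

    A real audit would also look at timers, root placement and VLAN spans, but even this crude count flags a flat network that has grown past what default spanning tree settings were designed for.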

  288. I found the problem by stinky+wizzleteats · · Score: 2

    Cisco Systems, the hospital's network provider...

  289. Re:Reliability is inverse to the number of compone by dago · · Score: 2

    There's always some error in the calculations; in this case, the train driver forgot to lace his shoes.

    --
    #include "coucou.h"
  290. Human Life vs. Bottom Line by bgillham · · Score: 1

    You would think that network redundancy would be something the net engineers would try to get done each year...I can only speak from my experience as a rural school telecom director...we are doing a complete campus rebuild and in the overall scheme the cost to add complete redundancy was maybe 4% of the overall cost...that is not much...I think it was both the engineers' and the administrators' fault...engineers should have pushed harder and the administrators should have got their heads out of their collective asses!!! bg

    --
    --|gillham|--
  291. Re:Open souce Healthcare Information System Exists by FreaKBeaNie · · Score: 1

    Do you know of any ASPs using this as their backend software?

  292. comments from a BIDMC physician (long) by calm_rising · · Score: 1

    OK, this is my first post, which I know demotes my relevance in this forum. What's more, I don't know much about the technical details of networks. But I am a physician, I was there, and I think I might be able to provide some perspective. I tried to read the large part of the posts made before writing this, and I hope that my post is not too irritating to the locals. So, here are some things to think about.

    As at least one commenter suggested, the reason that the network wasn't better prepared is that the IT budget is woefully underfunded. You might be unaware of the ridiculously poor fiscal situation that most academic medical centers are in. Caught between increasing costs for diagnostic tests and pharmaceuticals, and decreasing reimbursements from the government (N.B. this post is in no way intended to take ANY position on government funding of health care), all hospitals are in an increasingly difficult position. When you throw in that academic centers will not turn away indigent / uninsured patients, and that there are ~40 million uninsured Americans, it is nigh impossible to even break even.

    I am in a position to know personally just how underfunded the BIDMC IT department is. Without going into details, let me just say that BIDMC can barely replace 5-year-old desktop platforms. It is casually miraculous that BIDMC has been able to computerize laboratory reporting, medical records, and physician ordering, not to mention supply, billing, bed management, and dozens of other things we don't even think of when we think about running the hospital.

    The IT department, led by John Halamka, has been turning straw into gold for years. Every year, they cut the IT budget, and every year, the computer system gets better. (Of course, I say this from an end-user perspective.) Don't let his MD and his emergency med training fool you, as it did one poster, into thinking that John is a duffer. If his description of the reason for the network crash, which I didn't understand, didn't convince you that he knows his stuff, let me add my bias. I know John personally. He works ridiculously long hours to keep his ship running. He is constantly on the lookout for ways to improve patient care with the computer network, and constantly soliciting the advice of parties in all specialties. To my knowledge, he holds the distinction of being the only CIO in history that the Hunter Group (a well-known health care consulting group) has not recommended firing.

    So, if the CIO's so good, why did the network crash? I think that an agglomeration of your posters have already figured this one out. BIDMC already knew that the risk of network failure was increasing. For several months before the disaster, the IT department was upgrading network hardware and software as fast as their budget could allow. They were trying to prevent this, and their luck ran out before their ship came in.

    Redundancy is expensive. New equipment is expensive. New software is expensive. Personnel are expensive. Look at the financials of BIDMC sometime. The hospital lost $26 million this year, and that was considered a victory, because at the beginning of the year, BIDMC was projected to lose $40 million. The hospital hopes to be breaking even by the end of 2004, without compromising quality of patient care. All the prophylaxis that you've suggested "should have been there" needs to be put into context with the larger financial picture. Should BIDMC have fired nurses to pay for routers? Cut back lab services to buy newer software?

    So it happened. Next question: were patients endangered? Obviously I'm bound by all sorts of privacy concerns, but it's fair to say, probably, a little, but:

    First of all, think about how the network impacts patient care. It mostly DOES impact data retrieval, in the following ways:
    * Getting diagnostic test results quickly, rather than having to call, or go to the lab
    * Getting old patient data off of the online medical record, rather than waiting for the patient's old charts (ALL data is duplicated in the paper record)
    * Entering patient care orders without handwriting them and making sure that a nurse sees them

    But it DOESN'T have an impact in the most important ways. The following things work without a network:
    * The computers that monitor vital signs of sick patients and patients in operations.
    * The computers inside emergency medical equipment such as defibrillators and respirators

    Nobody drops dead instantly because of a network outage.

    The network didn't crash all at once: it was up and down intermittently for about 24 hours. After several attempts to get the network running without shutting it down, they finally decided that they needed to shut the whole thing down and start it up again, piece by piece.

    I'm not sure who got the idea that a "scramble" to restart paper ordering implies that BIDMC didn't have a plan in place. The hospital has paper backup systems prepared for everything. But you try to orchestrate a quick return to paper on dozens of inpatient wards, with a thousand patients, in short order. Good luck; that's a system involving hundreds of health care providers and separate physical locations. Suggesting that BIDMC ought to be able to throw the railroad switch and just do it easily is rather unrealistic.

    That all being said, once BIDMC gave up on keeping the network up while fixing it, we had the whole hospital switched to paper in a matter of hours. This made the system slower and more error-prone, which is why we switched to computers in the first place!

    In theory, such a situation could endanger patient care. Slower data retrieval, and the possibility of missing relevant data, could both cause medical errors and patient injury. But, lest you did not realize, medical errors and patient injury are part and parcel of daily healthcare. So many decisions are made on so many patients in a day, that errors happen all the time. Due to the multisystem nature of health care and the multiple levels of safeguards and error-checking, no patient injury happens for one reason alone.

    Don't fool yourself into thinking that when the network's up, nothing ever goes wrong, and once the network's down, scores of patients are unjustly injured. Any difference would be incremental. In any theoretical particular case, it would be virtually impossible to prove that the network outage was the crucial component that caused the error.

    Is BIDMC at fault? Well, if there were a snowstorm, would BIDMC be at fault if they didn't have enough snowplows on hand? If someone slipped on a banana peel, would BIDMC be at fault for not hiring enough janitors? If there were a fire, would BIDMC be at fault for not having appropriate fire safety?

    Was BIDMC at fault? No more than for any other disaster; you can't be 100% prepared for everything, ever. Will they be sued? Probably. Will the suits be just? Probably not. Will they win? I hope not. Hopefully you agree with me.

    And, by the way, the computerization of our hospital is multifaceted, and has taken place slowly over 6 years. It's not like we've had our current network in place for 6 years with no changes. Rather, it has grown geometrically with added functionalities as time goes by.

    OK, let's end with some responses to comments that I think are informative, but which qualify as "personal agenda," so if you're not interested, you can stop reading here with my compliments.

    First issue.
    > Also, it is very common for doctors to reject
    > any spending on IT because it will bring their
    > 8 figure salaries down to 7 figures and that is
    > totally unacceptable!!!

    If you're going to pillory doctors, perhaps you should actually know what you're talking about. The average physician makes $180,000 a year. The most well-paid MD in Rochester, NY (a city of .5M) makes just under seven figures.

    Academic centers pay less than average; many grown-up MDs at my hospital don't even make 6 figures. Nobody who's in it for the money works at an academic center like BIDMC. These hospitals lose money, and every expense, yes, including doctor's salaries, suffers from it. Those who stay perceive intangible benefits beyond the monetary compensation.

    Believe me, doctors are not cutting the IT budget to line their pockets. I can't speak for the administrators, some of whom are MDs and some of whom aren't, but BIDMC is a not-for-profit institution, and nobody is walking away with fat profits.

    Next issue. Whoever suggested that we were unable to play Quake for 4 days: Probably you were just trying to be clever, but it's worth noting that we can't install software on any of the hospital computers.

    And finally, whoever made fun of senior management for "running around like errand boys": Good for them! This was truly a crisis, and all hands pitched in to try to prevent any patients from being hurt. Laugh at them if you like; they could have stayed in their offices, but like the rest of us, they did whatever they could.

    Hopefully you have found this informative. A disclaimer should not be necessary, but since it is, let me say that my opinions are in no way intended to reflect those of BIDMC, its administration or employees, the federal government, John Halamka, you, your dog, or anyone else other than me. Have a nice day.

  293. Re:comments from a BIDMC physician (O/T) by good+soldier+svejk · · Score: 1


    John H. really appreciated your comments and asked that you give him a call. Hope you have e-mail notification turned on in your /. prefs. BTW, I think you put way too much faith in the moderation system by browsing at +3. For instance you missed John's own comments here.

    --
    It is cowardly, and a betrayal of whatever it means to be a Jew, to act as a white man

    -James Baldwin
  294. O/T reply (Re:comments...) by calm_rising · · Score: 1

    Thanks for the note. I do have e-mail notification turned on for replies to my comments, at a much lower threshold than +3. :)

    My browsing at +3 is not an indication of my "faith in the moderation system" so much as it is an indication of my limited time. I can only afford to read these comments in so much depth. When I get really interested in a thread, I turn down the threshold to take a closer look.

    I actually skimmed these comments at a threshold of 0. I did miss your posted corrections from John (sorry!), but that's only because there were >400 comments and I was moving pretty fast.

    There is an eternal tradeoff between efficiency and fidelity. In medicine, we refer to the tradeoff between sensitivity (finding something important) and specificity (not finding something unimportant). It's kind of the same here, and I have chosen specificity over sensitivity.

    I'll contact John.
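
    For anyone who wants the sensitivity/specificity tradeoff above in concrete terms, here is a small worked sketch applied to a browsing threshold. The scores and the "important" labels are invented purely for illustration; they are not real moderation data from this thread.

```python
def rates(comments, threshold):
    """Return (sensitivity, specificity) for reading at a score threshold.

    `comments` is a list of (score, is_important) pairs. A comment is
    "found" when its score is at or above the threshold.
    """
    tp = sum(1 for s, imp in comments if s >= threshold and imp)
    fn = sum(1 for s, imp in comments if s < threshold and imp)
    tn = sum(1 for s, imp in comments if s < threshold and not imp)
    fp = sum(1 for s, imp in comments if s >= threshold and not imp)
    sensitivity = tp / (tp + fn)  # share of important comments that were seen
    specificity = tn / (tn + fp)  # share of unimportant comments filtered out
    return sensitivity, specificity


# Hypothetical thread: (moderation score, whether the comment mattered).
comments = [(5, True), (4, True), (3, False), (2, True),
            (1, False), (0, False), (0, True), (-1, False)]

print(rates(comments, threshold=3))  # fewer comments read, some missed
print(rates(comments, threshold=0))  # nothing missed, lots of noise
```

    Raising the threshold trades sensitivity for specificity, and lowering it does the reverse; no threshold maximizes both on this made-up data.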

  295. Archiving Radiology Materials by Anonymous Coward · · Score: 0

    There is no simple answer. Chest radiographs for asbestos cases, even if only suspected asbestosis, have to be kept 30 years. The US Air Force keeps films for 5 years after the last year in which any film is taken. The community in which I worked kept films for 10 years. In Washington State one should keep films on children until they are 22, possibly longer depending upon how your lawyer interprets the state regulations and what the courts say. Some people think you should keep mammography films for the life of the patient or even a few years beyond.

    Certainly any images involved in known litigation need to be kept till the case is settled.

  296. Last Post! by alpg · · Score: 1

    The following quote is from page 4-27 of the MSCP Basic Disk Functions
    Manual which is part of the UDA50 Programmers Doc Kit manuals:

    As stated above, the host area of a disk is structured as a vector of
    logical blocks. From a performance viewpoint, however, it is more
    appropriate to view the host area as a four dimensional hyper-cube, the
    four dimensions being cylinder, group, track, and sector.
    . . .
    Referring to our hyper-cube analogy, the set of potentially accessible
    blocks form a line parallel to the track axis. This line moves
    parallel to the sector axis, wrapping around when it reaches the edge
    of the hyper-cube.

    - this post brought to you by the Automated Last Post Generator...