Slashdot Mirror


Hospital Brought Down by Networking Glitch

hey! writes "The Boston Globe reports that Beth Israel Deaconess hospital suffered a major network outage due to a problem with spanning tree protocol. Staff had to scramble to find old paper forms that hadn't been used in six years so they could transfer vital patient records and prescriptions. Senior executives were reduced to errand runners as the hospital struggled with moving information around the campus. People who have never visited Boston's Medical Area might not appreciate the magnitude of this disaster: these teaching hospitals are huge, with campuses and staff comparable to a small college, and many, many computers. The outage lasted for days, despite Cisco engineers from around the region rushing to the hospital's aid. Although the article is short on details, the long term solution proposed apparently is to build a complete parallel network. Slashdot network engineers (armchair and professional): do you think the answer to having a massive and unreliable network is to build a second identical network?"

215 of 569 comments

  1. Problem was with an application, by Anonymous Coward · · Score: 5, Insightful

    according to the coverage in the printed 11/25/02 Network World magazine I read yesterday. My immediate reaction was that this person who brought down the net using his research tool should not have been using a production network.

    Large campus networks hosting extremely critical live applications may need to be subdivided by more than a switch, yes.

    1. Re:Problem was with an application, by cryptowhore · · Score: 5, Insightful

      Agreed, I work for a bank and we have several environments to work in, including multiple UAT, SIT, and Performance Testing Environments. Poor infrastructure management.

      --
      Happiness is a slider variable
    2. Re:Problem was with an application, by sugrshack · · Score: 5, Interesting
      That's a good initial assumption; however, my experience with similar issues tells me that you can't pin all of this on one person.

      Yes, this person should have been using an adhoc database (assuming one is set up), however access to various things like this tends to get tied up due to "odd" management practices.

      realistically a backup network sounds good, however there are other ways around this... it could have been prevented with correct administration of the network itself; for instance, in Sybase systems, there are procedures set up to handle bottlenecks like this. (of course, I could be talking out of my a$$, as I'm one of those people without real access anyway... far from root... more like a leaf).

      --
      I can't believe it's not lard!
    3. Re:Problem was with an application, by Anonymous Coward · · Score: 4, Insightful

      So a researcher with a workstation isn't allowed to use the network to do his job? No, this stems from incompetence on the part of the network engineering team.

    4. Re:Problem was with an application, by GoofyBoy · · Score: 2

      How could one application, which they could shutdown/control, take down an entire network?

      I admit I'm mostly clueless when it comes to network hardware but shouldn't a massive reset/buffer clear have returned the network to a working state? Am I missing something here?

      --
      The surprise isn't how often we make bad choices; the surprise is how seldom they defeat us.
    5. Re:Problem was with an application, by rppp01 · · Score: 2, Offtopic

      Well.....I guess you could look at most of the sites we slashdot.....one application (IE, Mozilla, Opera, etc) takes down an entire site for hours and days and sometimes longer.

      --
      They stuck me in an institution, said it was the only solution, to...protect me from the enemy, myself
    6. Re:Problem was with an application, by nolife · · Score: 5, Interesting

      Not only that, but they gave the impression no one had problems using the old paper method. They actually noted that at times the network was fine, but they decided to stick with the backup method until the issue was resolved, because switching back and forth whenever the network was working was harder. All in all, though, they made a point that no appointments were missed, no surgeries were cancelled, etc. Meaning business was as usual, but using a backup manual method.

      I have not read Network World enough to form an impression of their style, is it watered down to favor advertisers and the general IT purchasing people or is it really a nuts and bolts down to earth mag?

      --
      Bad boys rape our young girls but Violet gives willingly.
    7. Re:Problem was with an application, by ipstacks · · Score: 2, Interesting

      Routing is the solution. Anyone that runs a layer two network beyond one switch should be fired. Routing convergence is much faster than spanning-tree (even with the Cisco tweaks). Why would I want layer two when layer three routers are capable of wire-speed routing?!

      --
      Which distro does Linus use?
    8. Re:Problem was with an application, by GoofyBoy · · Score: 2


      First sentence says it wasn't the software, but how he/she was using it (uploading a huge amount of data).

      Why not effectively "kill" the upload and wouldn't that clear the problem?

      --
      The surprise isn't how often we make bad choices; the surprise is how seldom they defeat us.
    9. Re:Problem was with an application, by GarryOwen · · Score: 2, Informative

      You sound a bit old school. Routing nowadays can be as fast as a switch; of course, routers that fast will cost a hell of a lot more. The reason is that most routers nowadays don't actually do per-packet inspection and routing. They route the first packet of a stream and then switch all following packets in that stream. Also, your statement that the lower on the 7 layer model you are, the faster you go, is wrong; otherwise hubs would be faster than switches (layer 1 vs layer 2).
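
      The "route the first packet, switch the rest" idea can be sketched roughly as below. This is only an illustration under assumptions: the routing table, the interface names, and the FlowCache class are hypothetical, and real hardware keys a flow on more fields than source and destination.

```python
import ipaddress

# Hypothetical routing table: prefix -> outgoing interface.
ROUTES = {
    ipaddress.ip_network("10.0.0.0/8"): "eth1",
    ipaddress.ip_network("10.1.0.0/16"): "eth2",
    ipaddress.ip_network("0.0.0.0/0"): "eth0",
}


def slow_path_lookup(dst):
    """Full longest-prefix-match lookup, done once per flow."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in ROUTES if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]


class FlowCache:
    """Route the first packet of a flow, then reuse the decision."""

    def __init__(self):
        self.cache = {}
        self.slow_lookups = 0

    def forward(self, src, dst):
        key = (src, dst)
        if key not in self.cache:
            # First packet of this flow: take the expensive routing path.
            self.slow_lookups += 1
            self.cache[key] = slow_path_lookup(dst)
        # Every later packet is a plain dictionary hit.
        return self.cache[key]


router = FlowCache()
ports = [router.forward("192.168.1.5", "10.1.2.3") for _ in range(1000)]
print(ports[0], router.slow_lookups)
```

      A thousand packets in the same flow cost one routing decision here; everything after that is a table hit, which is the part that can run at switching speed.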

    10. Re:Problem was with an application, by aheath · · Score: 5, Informative

      I contacted Dr. John D. Halamka to see if he could provide more detail on the network outage. Dr. Halamka is the chief information officer for CareGroup Health System, the parent company of the Beth Israel Deaconess medical center. His reply is as follows: "Here's the technical explanation for you. When TAC was first able to access and assess the network, we found the Layer 2 structure of the network to be unstable and out of specification with 802.1d standards. The management vlan (vlan 1) had in some locations 10 Layer 2 hops from root. The conservative default values for the Spanning Tree Protocol (STP) impose a maximum network diameter of seven. This means that two distinct bridges in the network should not be more than seven hops away from one another. Part of this restriction comes from the age field that Bridge Protocol Data Units (BPDUs) carry: when a BPDU is propagated from the root bridge towards the leaves of the tree, the age field is incremented each time it goes through a bridge. Eventually, when the age field of a BPDU goes beyond max age, it is discarded. Typically, this will occur if the root is too far away from some bridges of the network. This issue will impact convergence of the spanning tree. A major contributor to this STP issue was the PACS network and its connection to the CareGroup network. To eliminate its influence on the CareGroup network we isolated it with a Layer 3 boundary. All redundancy in the network was removed to ensure no STP loops were possible. Full connectivity was restored to remote devices and networks that were disconnected in troubleshooting efforts prior to TAC's involvement. Redundancy was returned between the core campus devices. Spanning Tree was stabilized and localized issues were pursued. Thanks for your support. CIO Magazine will devote the February issue to this event and Harvard Business School is doing a case study."
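
      For anyone wondering where the "diameter of seven" comes from, here is a rough sketch of the timer arithmetic. The two formulas are the ones usually quoted in Cisco's STP timer tuning guidance, not something stated in Dr. Halamka's reply, so treat them as an assumption; the function names are made up for illustration.

```python
import math


def max_age(hello, diameter):
    """Max age (seconds) needed for a given hello time and diameter."""
    return 4 * hello + 2 * diameter - 2


def forward_delay(hello, diameter):
    """Forward delay (seconds), rounded up to a whole second."""
    return math.ceil((4 * hello + 3 * diameter) / 2)


# Default hello time of 2 seconds with the default diameter of 7,
# then with the 10 hops reported on the management VLAN.
for diameter in (7, 10):
    print(diameter, max_age(2, diameter), forward_delay(2, diameter))
```

      With a 2 second hello, a diameter of 7 gives the familiar 20 second max age and 15 second forward delay. Plugging in 10 hops gives 26 and 19, so a network that deep is running on timers that were sized for something smaller.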

    11. Re:Problem was with an application, by darkonc · · Score: 2
      My immediate reaction was that this person who brought down the net using his research tool should not have been using a production network.

      RTFA: (from the Globe article)

      The crisis had nothing to do with the particular software the researcher was using . . . . . . The large volume of data the researcher was uploading happened to be the last drop that made the network overflow.
      The essential problem was that the network was (almost) overloaded. The data from the researcher was simply enough to complete the overload. This probably either caused an overload in a fixed-sized table (oops) or it caused a router/switch to run out of memory. This caused the data loop. Shutting down the segment with the data loop caused a large chunk of dataflow to be re-routed along a secondary path --- overloading that path. "and they caused two more, and they caused two more, and so on, and so on....".

      For all we know, this researcher could have been doing an FTP transfer (but my {blind} guess is that he was doing some sort of multi-system collaborative computing). His problem was that he put a bit more load onto an already groaning network, and broke its back{bone}.

      Now, as to preventing research work on a 'production' system: this is a teaching (read research) hospital. Research and production work go hand in hand. From reading the article, it appears that the reason why they're adding a second parallel network isn't because they want redundant connections. It's because they need the extra bandwidth (and knew that they needed it before this happened).

      In fact, on Oct. 1, hospital officials had approved a consultant's plan to overhaul the network - just not quite in time. ''Now,'' he said, ''we're going to do it faster.''
      In a sentence, preventing research on a production network would have been a PHB reaction. The only way that that sort of reaction would have had the required effect would have been to apply bandwidth/connectivity quotas to everybody on the campus. (which would have placed extra load on the routers which would, of course, have made the underlying problem worse, which......)
      --
      Sometimes boldness is in fashion. Sometimes only the brave will be bold.
    12. Re:Problem was with an application, by pyite · · Score: 3, Informative

      Technically, hubs are faster than switches for N endpoints when N = 2. The reason is hubs do not have to look at the frame being sent and either store-and-forward or cut-through like a switch does. Your total possible collision locations on a hub is N * (N - 1) / 2 (Gauss' formula for the sum of 1 to N - 1, coincidentally), where once again N is the number of endpoints. In a switch, your collision domain always has two endpoints, therefore your total possible collisions is 1, thus you get increased speed.
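
      A quick sanity check of the N * (N - 1) / 2 count, as a small sketch: it just enumerates every pair of endpoints on a shared segment and compares against the closed form, which equals 1 + 2 + ... + (N - 1). The function names are made up for illustration.

```python
from itertools import combinations


def hub_collision_pairs(n):
    """Closed form: number of endpoint pairs sharing one collision domain."""
    return n * (n - 1) // 2


def count_pairs_by_enumeration(n):
    """Count the same pairs the slow way, by listing them."""
    return sum(1 for _ in combinations(range(n), 2))


for n in (2, 4, 8, 24):
    assert hub_collision_pairs(n) == count_pairs_by_enumeration(n)
    print(n, hub_collision_pairs(n))
```

      At N = 2 the hub and the switch port are on equal footing with a single pair each; the hub's count grows quadratically from there while each switch port stays at one.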

      --

      "Nature doesn't care how smart you are. You can still be wrong." - Richard Feynman

  2. This is what you call... by Anonymous Coward · · Score: 2, Funny

    ... "an old boys' network"

  3. No. by Clue4All · · Score: 5, Interesting

    do you think the answer to having a massive and unreliable network is to build a second identical network?

    No, the answer is to fix what is broken. This might be a new concept to some people, but things don't break on their own. If you're doing network upgrades and something stops working, REVERT THE CHANGES AND FIGURE IT OUT. This is reckless and irresponsible behavior.

    --

    Is your browser retarded?
    1. Re:No. by passion · · Score: 2

      good idea, the problem is that most institutions don't do enough regression testing to see if *absolutely everything* is working. Oh sure, my cat's webpage with the 3-d rotating chrome logo still loads, but what about the machine that goes ping keeping Mr. Johnson alive just down the hall?

      --
      - passion
    2. Re:No. by Anonymous Coward · · Score: 5, Informative

      As an employee at BIDMC (the Beth Israel Deaconess Medical Center) I can tell you that they did not just install a parallel network. The first network was completely redesigned to be more stable and once it proved its stability, then a second redundant network was put in place to ensure that if the network ever became unstable again for any reason there was a backup that was known to work immediately instead of having to wait to fix the original again. Most of the housestaff at BIDMC were already familiar with the paper system as the transition to paperless had only occurred over the last two years and in stages. The real problem was obtaining lab and test results as these have been on computer for years.

    3. Re:No. by barberio · · Score: 5, Insightful

      The problem here is that it will take days, maybe weeks to do this. Hospitals want the data flowing *Now*.

      So the answer is - Yes. In a situation where 100% uptime is demanded, the only solution is redundant systems.

    4. Re:No. by ostiguy · · Score: 5, Insightful

      If a network problem breaks down network #1, what is going to stop it from breaking network #2? If the problem was with the firmware in device #23a, the problem will reoccur on network #2 with device #23b.

      ostiguy

    5. Re:No. by pubjames · · Score: 5, Interesting

      I spoke to an electrician at our local hospital recently. He told me the hospital had three separate electricity systems - one connected to the national grid, one connected to an onsite generator which is running all the time, and a third connected to some kind of highly reliable battery system (sorry can't remember the details) for life support and operating theatres in case both the national grid and the on-site generator fail simultaneously.

      If they have that level of redundancy for the electrics then I see no reason why they shouldn't for the network.

    6. Re:No. by hey! · · Score: 2

      Sounds good. Unfortunately, details didn't make it into the Globe article.

      A few questions, if I may. Is the design and scope of the redundant network the same as the original network? Personally I'd consider a smaller network to carry just the most critical information so that efforts to diagnose and recover that network, should that become necessary, will be more concentrated.

      Secondly, have the contingency plans considered the possibility of deliberate subversion, such as a buffer overflow attack on the equipment or DDOS on hosts? Again, this is where I'd consider a restricted network useful, as well as contingency plans to move data by paper or other media.

      --
      Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
    7. Re:No. by dirk · · Score: 3, Interesting

      No, the answer is to fix what is broken. This might be a new concept to some people, but things don't break on their own. If you're doing network upgrades and something stops working, REVERT THE CHANGES AND FIGURE IT OUT. This is reckless and irresponsible behavior.

      While in the short term the answer is to fix what is broken, they should have had an alternative network set up long ago. When you are dealing with something as important as a hospital, you should have redundancy for everything. That means true redundancy. There should be 2 T1 lines coming in from 2 different vendors from opposite directions if that is something that will endanger lives if it breaks. If something is truly mission critical, it should be redundant. If it is life-threatening critical, every single piece should be redundant.

      --

      "Information wants to be expensive" - Stewart Brand, the same guy who said "Information wants to be free"
    8. Re:No. by Openadvocate · · Score: 2

      "things don't break on their own"
      mkaeeyy, I'd like some of that hardware.

      I have never seen unbreakable network hardware.
      I have seen network hardware with redundancy to prevent loss of services in case of a breakdown and I have seen the redundancy fail also.
      :)

      --
      my sig
    9. Re:No. by HamNRye · · Score: 2

      This whole thing makes no sense....

      They state that the problem was application, workstation level. The solution, install a second network. WTF?? If it really was a researcher at his workstation, disconnecting his station and possibly a reset of his hub and go. Problem solved, I'm back to finding out what shows are playing at Harvard Square this weekend.

      Now, Fast forward 5 years when the network goes out like this again... If their past maintenance performance is any judge, I'll just assume they did not maintain quarterly testing of the secondary network (It's a pain in the butt, and Hospitals are 24 hour operations) and I'll bet it doesn't work when they need it. The extra switches and such might come in handy, but I'm positive that they could have achieved better reliability for the same money by spending in other areas.

      For our 1,000 person operation, installing a second network would involve about 50-60 hubs, routers, switches, etc. Involve the extra telco racks, and running that cable, that's mighty frikken' expensive. We do have a backup for our backbone, but the entire thing?? Ewww....

      ~Hammy

    10. Re:No. by Idarubicin · · Score: 2
      If they have that level of redundancy for the electrics then I see no reason why they shouldn't for the network.

      Well, in a modern hospital, being without network access for a few minutes doesn't kill people. Losing power in an operating theatre can make someone very dead, very quickly. Yes, procedures exist to handle such a situation, but there really isn't a good backup to, say, a heart-lung machine.

      I know, there are /.ers that would die without their DSL lines, but most of them don't live in hospitals.

      --
      ~Idarubicin
  4. Of course it can help by Anonymous Coward · · Score: 2, Insightful

    Yes, a second, fully redundant network would be "good" from a stance of giving better fail-over potential.

    But will anyone know when one network fails? If not, then how will they fix it? If they don't fix it, then doesn't that mean that they really only have one network?

    Which puts them right back to where they were.

    Of course, if they put a redundant network in, then fix their problems to try to prevent this issue happening in future, then they'll be in much better shape the next time their network gets flushed with the medical waste.

  5. Major American Bank Outage by MS_leases_my_soul · · Score: 5, Informative

    A Bank in America [;)] had an outage back in 1998 where all their Stratacoms went down for similar reasons. The Gateway/Network Engineering group had been saying for a couple years that we needed more redundancy but senior executives just saw the expenses and not the liability ... until every single Stratacom went down.

    We had to rebuild the entire network ... it took a week. All non-critical traffic had to be cut-off as we pushed everything through the backup T1s and ISDN lines. It cost the bank MILLIONS of dollars.

    Suddenly, that backup network was real cheap. They are now quite proud to tout their redundancy.

    1. Re:Major American Bank Outage by Pig+Hogger · · Score: 2

      Well, for a banker (and any ignorant bean-counting type), a pound of cure is better than an ounce of prevention...

    2. Re:Major American Bank Outage by passion · · Score: 3, Informative

      If triple-redundancy is good enough for San Francisco's BART, and this "major bank", then why can't it be good enough for a hospital, where there are most likely many people on life support, or who need instant access to drug reactions, etc?

      --
      - passion
  6. Leading question by Junks+Jerzey · · Score: 4, Insightful

    do you think the answer to having a massive and unreliable network is to build a second identical network?

    Am I the only person getting tired of story submitters using Slashdot to support their personal agendas?

    1. Re:Leading question by enkidu55 · · Score: 4, Interesting

      Isn't that the whole point in posting a story? To foster your own personal agendas? What would be the point in making a contribution to /. then if everything was vanilla in format and taste. You would think that the members of the /. community would feel a certain sense of pride knowing that their collective knowledge could help another business/community out with some free advice.

      IMHO if you don't like it then stop reading the damn thing. It's just like TV... If you don't like the channel you're watching then turn it, or turn it off and do something else, but don't bitch because you don't like the content.

    2. Re:Leading question by hey! · · Score: 2

      I'm sorry if this kind of thing strikes you as cliche. You are correct in characterizing the question as a "leading" question. However what I was trying to lead people to is not a conclusion, but an area of inquiry. Everyone knows techies don't always get the resources or time they need to do things right. If you had the opportunity presented by this kind of disaster, what would you do with it?

      I admit the question has a tone of disparagement which was perhaps unwarranted: the layman's article may not have accurately characterized the proposed solution. However, if the solution is as represented, it raises many important design strategy issues that apply not just to networks, but to any kind of mission critical, or in this case life critical system. Redundancy is an easy sell because it is easy for non-technical people to understand. However, underlying the concept of redundancy is an assumption of independence of one component from another's problems that may not be warranted.

      In my opinion, it is the concept of independence rather than redundancy that is key, and it is a concept that underlies many design principles.

      The direction I hope to lead the discussion in is more abstract and general, and it applies to the design of any system from a computer network to a nuclear power plant.
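
      To put rough numbers on independence versus redundancy, here is a back-of-the-envelope sketch. Everything in it is hypothetical: the 1% downtime figure, the 10% shared-cause share, and the simple shared/private split are assumptions for illustration, not measurements from any real network.

```python
def both_down_independent(p):
    """Chance both networks are down if their failures are unrelated."""
    return p * p


def both_down_common_cause(p, beta):
    """Chance both are down when a share `beta` of each network's
    failure probability comes from a cause they have in common
    (same firmware bug, same design flaw, same operator error)."""
    shared = beta * p
    private = (1 - beta) * p
    # Either the shared cause hits, or it doesn't and both fail privately.
    return shared + (1 - shared) * private * private


p = 0.01  # hypothetical: each network is down 1% of the time
print(both_down_independent(p))
print(both_down_common_cause(p, beta=0.1))
```

      Under these made-up numbers, a shared cause behind just a tenth of the failures makes the "redundant" pair roughly ten times more likely to be down together than the independent case suggests.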

      --
      Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
    3. Re:Leading question by Alsee · · Score: 2

      >do you think the answer to having a massive and unreliable network is to build a second identical network?

      Am I the only person getting tired of story submitters using Slashdot to support their personal agendas?


      Fine. Just submit a duplicate story and end it with:

      "Shouldn't all life-critical systems like hospitals have identical backup systems in case the primary goes down?"

      -

      --
      - - You can't take something off the Internet! That's like trying to take pee out of a swimming pool.
  7. Re:Well! Woopsy! by Iamthefallen · · Score: 5, Funny

    Yes, I believe we should rush to conclusions and blame it on foreign terrorists since there is nothing suggesting terrorism, and that just proves that they're extremely sneaky.

    You may now begin to panic in an orderly fashion, thank you.

    --
    Wax-Museum Fire Results In Hundreds Of New Danny DeVito Statues
  8. Spanning tree by skinfitz · · Score: 2, Interesting

    do you think the answer to having a massive and unreliable network is to build a second identical network?

    I think the answer is to disable spanning tree.

    We had a similar problem here (large academic installation, hundreds of workstations, several sites) with things (before my time I hasten to add) being one Big Flat Network (shudder) using IPX primarily and Novell. Needless to say this was not good. I've since redesigned things using IP and multiple VLANs, however there is still the odd legacy system that needs access to the old net.

    My solution was to tap the protocols running in the flat network and to put these into VLANs that can be safely propagated around the layer 3 switched network and presented wherever we wish. The entire "flat" network is tapped into a VLAN and the IP services that are running on it are routed into it. If there are any problems with either network, we just pull the routes linking the two together if it were to get that bad.

    1. Re:Spanning tree by zyglow · · Score: 2

      Adding on to the VLAN idea, I'd also change the routing protocol to OSPF. They would be squandering a lot of money to run two networks side by side.

      --
      http://www.forum-addicts.com
    2. Re:Spanning tree by GLX · · Score: 5, Interesting

      This would imply that either:

      A) A campus could afford to do Layer 3 at every closet switch

      or

      B) Live without Layer 2 redundancy back to the Layer 3 core.

      I'm sure in a healthcare environment, neither is an option. The first is too expensive (unless you buy cheap, and hence unreliable equipment) and the second is too risky.

      Spanning tree didn't cause the problem here. Mismanagement of spanning tree sounds like it caused the problem.

      Spanning tree is our friend, when used properly.

      --
      Sig (appended to the end of comments you post, 120 chars)
    3. Re:Spanning tree by TheMidget · · Score: 3, Insightful
      I think the answer is to disable spanning tree.

      On a network as complex and messy as theirs? That's basically the situation where you need spanning tree, or else it just crumbles to dust once they do produce a loop...

    4. Re:Spanning tree by AKnightCowboy · · Score: 3, Insightful
      I think the answer is to disable spanning tree.

      Are you talking about a different spanning tree protocol than I think you're talking about? Spanning tree is a very good thing to run to stop loops exactly like this. More than likely one of the hospital network techs misconfigured something and ended up disabling it (portfast on two access points linked into another switch accidentally or a rogue switch?).

    5. Re:Spanning tree by hey! · · Score: 2

      Hmmm. But what happens in the rare instance (as here) that you have to bring up a large LAN from a dead stop? IIRC, once the network collapsed, they couldn't get the spanning tree to converge for days. All the equipment was operating correctly.

      Spanning tree is a remarkable protocol, but there are limits to its upward scalability, at least if you don't want problems like this.

      --
      Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
    6. Re:Spanning tree by stilwebm · · Score: 5, Interesting

      I don't think disabling spanning tree would help at all, especially on a network with two campuses with redundant connections between buildings, etc. This is just the type of network spanning tree should help. But it sounds to me like they need to do some better subnetting and trunking, not necessarily using Layer 3 switches. They might consider hiring a network engineer with experience on similar campuses, even large university campuses, to help them redesign the underlying architecture. Spanning tree wasn't the problem, the architecture and thus the way spanning tree was being used was the problem.

    7. Re:Spanning tree by Chanc_Gorkon · · Score: 4, Insightful

      Egads no! Dedicated hardware designed for this is the only solution in this kind of case. A PC simply is not. You CAN'T use a hack in a hospital. You should not use a hack like this in a business either, but I understand if it's done this way. Hacks like this can become rather problematic once they're asked to grow. Also most PCs do not have redundant power supplies and probably don't have a RAID array (although I have seen a vpr Matrix machine at Best Buy with a RAID array... your standard Adaptec type included in a lot of MBs now). If I were to do something similar, I would rather do something with AIX or, if using Linux, use a server class machine. By the time you do that, you have already spent the money you'd spend on the dedicated stuff.

      --

      Gorkman

    8. Re:Spanning tree by rakslice · · Score: 2

      Using general purpose hardware for routing may be slow, but that doesn't make it 'a hack'.

      Maybe I'm missing something obvious, but what do you need good mass storage on a router for?

    9. Re:Spanning tree by jroysdon · · Score: 3, Informative
      Disabling spanning tree on a network of any size is suicide waiting to happen. Without spanning tree you'll be instantly paralyzed by any layer two loops.

      For instance: Bonehead user wants to connect 2-3 more PCs at his desk, so he brings in a cheap hub or switch. Say it doesn't work for whatever reason, so he leaves the cable in and connects a second port from the wall (or say later on it stops working so he connects a second port to test). When both of those ports go active and you don't have spanning tree, you've just created a nice loop for that little hub or switch to melt your network. Just be glad it's going to be a cheap piece of hardware and not a large switch, or you'd never be able to even get into your production switches using a console connection until you find the connection and disable it (ask me how I know). How long does this take to occur? Not even a second.
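
      A toy simulation of why a layer two loop melts things so fast, offered as a sketch only. The four-switch mesh is hypothetical, and the breadth-first tree is a stand-in for real STP, which elects a root and picks paths by bridge ID and path cost.

```python
from collections import deque


def build_neighbors(links):
    """Turn a list of (a, b) links into an adjacency map."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors


def flood(links, source, max_hops=8):
    """Return how many copies of one broadcast are in flight at each hop.

    Every switch repeats the frame out of all links except the one it
    arrived on, which is how plain layer 2 flooding behaves.
    """
    neighbors = build_neighbors(links)
    # Each in-flight frame is (sender, receiver).
    in_flight = [(source, n) for n in sorted(neighbors.get(source, ()))]
    history = []
    for _ in range(max_hops):
        if not in_flight:
            break
        history.append(len(in_flight))
        next_round = []
        for sender, receiver in in_flight:
            for n in sorted(neighbors[receiver]):
                if n != sender:
                    next_round.append((receiver, n))
        in_flight = next_round
    return history


def spanning_tree(links, root):
    """Keep only the links of a breadth-first tree rooted at `root`."""
    neighbors = build_neighbors(links)
    seen = {root}
    tree = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for n in sorted(neighbors[node]):
            if n not in seen:
                seen.add(n)
                tree.append((node, n))
                queue.append(n)
    return tree


# Hypothetical topology: four switches, every pair cabled together.
MESH = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")]
print(flood(MESH, "A"))                      # loops present
print(flood(spanning_tree(MESH, "A"), "A"))  # redundant links blocked
```

      With the loops in place, one broadcast doubles at every hop and never dies. With the redundant links blocked, it reaches each switch once and stops.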

      Spanning tree is your friend. If you're a network technician/engineer, learn how to use it. Learn how to use root guard to protect your infrastructure from rogue switches (or even evil end-users running "tools"). A simple search on "root guard" at Cisco.com returns plenty of useful hits.

      At my present employer, we're actually overly strict and limit each port to a single MAC address and know what every MAC address in any company hardware is. We know where every port on our switches go to patch panels. If anything "extra" is connected, or a PC is moved, we're paged. If a printer is even disconnected, we're paged. The end-users know this, and they know to contact IT before trying to move anything.

      Why do we do this? We've had users bring in wireless access points and hide them under their desks/cubes. We want to know instantly if someone is breaching security or opening us up to such a thing. Before wireless, I'd say this was overly anal, but now, it's pretty much a requirement. The added benefit is knowing if an end-user brings a personal PC from home, etc., onto the network (which means they possibly don't have updated MS-IE, virus scanners/patterns, may have "hacking tools", etc.). This isn't feasible on a student network or many other rapidly changing networks, but on a stable production network it's a very good idea. Overhead seems high at first, but it's the same as having to go patch a port to a switch for a new user - you just document the MAC address and enable port-level security on the switch port:
      interface FastEthernet0/1
      port security action trap
      port sec max-mac-count
      With Syslogging enabled, you'll know when this occurs and if you've got expect scripts to monitor and page you when another mac address is used on that port, and if you've got your network well documented, you can stop by the end-user while they're still trying to dink around hooking up their laptop and catch 'em in the act.
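
      The bookkeeping side of this (documented MAC per port, alert on anything else) can be sketched in a few lines. This is not the expect script mentioned above and it does not parse any real switch log format; the port names, MAC addresses, and function names are all hypothetical, and feeding it sightings from syslog or traps is left out.

```python
# Hypothetical documentation of which MAC address belongs on which port.
ALLOWED = {
    "Fa0/1": "00:11:22:33:44:55",
    "Fa0/2": "00:11:22:33:44:66",
}


def normalize(mac):
    """Lower-case a MAC and strip separators so notations compare equal."""
    return "".join(ch for ch in mac.lower() if ch in "0123456789abcdef")


def check_sighting(port, mac, allowed=ALLOWED):
    """Return an alert string if `mac` is not the documented one for `port`."""
    expected = allowed.get(port)
    if expected is None:
        return f"ALERT: traffic on undocumented port {port} from {mac}"
    if normalize(mac) != normalize(expected):
        return f"ALERT: unexpected MAC {mac} on {port} (expected {expected})"
    return None


sightings = [
    ("Fa0/1", "0011.2233.4455"),     # documented PC, dotted notation
    ("Fa0/2", "de:ad:be:ef:00:01"),  # something new plugged in
]
for port, mac in sightings:
    alert = check_sighting(port, mac)
    if alert:
        print(alert)
```

      Normalizing the address first matters because the same MAC shows up in colon, dash, or dotted notation depending on which device reported it.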

      Yes, I know all about MAC address spoofing. Do my end-users? Probably not, and by the time they find out, they're on my "watch list" and their manager knows. Of course, that's where internal IDS is needed and things start to get much more complicated, but at least you're not getting flooded with odd-ball IDS reports if you manage your desktops tight so users can't install any ol' app they want. Higher upfront maintenance cost? Perhaps, but we've never had any end-user caused network issue.

      I'm fairly certain that if someone was running a "bad" application like what hosed the network in this story, I'd find it in under 30 minutes with our current network documentation. Would it require a lot of foot traffic? Yes, as the network would possibly be hosed so management protocols wouldn't work, but I could isolate it fairly fast with console connections and manually pulling uplink ports.
    10. Re:Spanning tree by Cramer · · Score: 2, Interesting

      That's handled by "partitioning" on the same switch. Most switches are smart enough to tell they've been plugged into themselves. And even if they don't, broadcast suppression will catch such setups really well -- all it takes is one broadcast packet to flood both ports. STP prevents loops between switches. In this case, that'd be plugging ports from multiple switches into the same hub.

      There's an even easier way to fix the problem in your example... don't give the idiots access to multiple ports in the same network. :-)

      And I would submit it's not very wise to create a city sized switched ethernet network.

    11. Re:Spanning tree by Chanc_Gorkon · · Score: 2

      Yeah. It's called a router. Yeah you don't need mass storage for it, but what else are you going to store your code on and have it be reliable if you use a pc? Flash memory? Running on what kind of BUS? How many PC's have integrated bootable flashram?

      Like I said, you don't do this kind of stuff especially in a hospital. You should not do it in business either. Yes, to me, this qualifies as a hack. First off performance would be dog slow. It's just too much stuff to put on the one PCI bus most PC's have especially if you run all of the slots full. Second, cost and setup time would just make it cheaper (and safer) to go with a real router.

      --

      Gorkman

    12. Re:Spanning tree by skinfitz · · Score: 2

      It's been interesting reading the replies here to my "drastic" suggestion of disabling spanning tree. Allow me to elaborate...

      We've had some very odd issues in the past with spanning tree, and it's for this reason we normally disable it. I do run it on some segments, but there are other segments that literally cannot have it enabled, otherwise things stop working. For example, Apple Macs really don't like spanning tree. (Plugging a Mac server into a spanning tree enabled switch can break it).

      On the rare occasion that we have had a loop, we only lose one segment. As when this happens it's noticed, and it could only have happened from one of several locations, we can easily track down the problem.

      VLANs have proven to be quite good at isolating segments from problems on other segments.

      Still think I'm crazy? ;)

  9. Hospital Systems by charnov · · Score: 4, Informative

    I also used to work at a teaching hospital (Wishard for Indiana University) and I learned more there about networking and systems support than in years of college. I remember one day we found a still used piece of thick-net (you know...old firehose). It was connecting the ambulance office's systems to the rest of the hospital. The rest of the hospital ran on DEC VAX clusters and terminals. To be fair, they have gotten much better (I don't work there anymore either), but this wasn't the first hospital network I had seen that truly terrified me, and it hasn't been the last.

    --
    [RIAA] says its concern is artists. That's true, in just the sense that a cattle rancher is concerned about its cattle.
    1. Re:Hospital Systems by gorf · · Score: 5, Insightful

      To be fair, they have gotten much better...

      You seem to have forgotten to explain why they were worse.

      If they are running thick ethernet and VAX machines, it is probably because nobody has looked at the system recently, presumably because it hasn't failed. This is how things should be.

      ...truly terrified me...

      What terrifies me is that places like hospitals (where things really need to keep working) run systems which have only been around for a few years, and in that time proved themselves to be extremely unreliable, in general.

      New features should not be added at the cost of stability, and this is what people seem to be doing all the time. People are perfectly capable of carrying on using paper, and should be trained and have a procedure to do so at a moment's notice. If the job is so complex that paper is simply not an option (this seems unlikely; even air traffic controllers can manage without computers), then computers should have a ridiculous amount of redundancy built in to them, something I've only heard of NASA even approaching.

    2. Re:Hospital Systems by laughing_badger · · Score: 2, Funny
      computers should have a ridiculous amount of redundancy built in to them, something I've only heard of NASA even approaching

      Yeah, that ability to compute using both metric and imperial units in parallel really comes in useful ;-)

      --
      Help children born unable to swallow - www.tofs.org.uk
    3. Re:Hospital Systems by passthecrackpipe · · Score: 2
      "To be fair, they have gotten much better (I don't work there anymore either)"

      Is this a rather unfortunate juxtaposition of words, or an intentional statement of cause and effect?

      --
      People who think they know everything are a great annoyance to those of us who do.
    4. Re:Hospital Systems by gorf · · Score: 2, Interesting

      That wasn't a manned flight :-)

      I've heard stories about NASA having completely different teams of programmers in different cities being given the same specs. Of multiple computers running different programs independently controlling separate hydraulics, to the point where if one decides to move something one way, the others can physically force it correct. Now that's redundancy.

      I'll bet that people designing new computerized air traffic control systems have never even heard of a real-time system, never mind know what one is.

    5. Re:Hospital Systems by Hast · · Score: 2

      The point of going to university/college isn't to learn the details of how to maintain a specific network. The point is to learn the basics and learn how to learn new material and adapt quickly.

      There will never be a college which teaches you exactly how to do your work at a specific workplace (at least not one worth going to); that's called job experience.

      Sometimes you might need to get someone with a lot of experience. One potential benefit of getting newly graduated people is that they are already accustomed to learning. So training one of them to suit your needs might prove a lot cheaper than trying to convert someone who already knows how to do things "best".

    6. Re:Hospital Systems by Hast · · Score: 2

      Oh my, we are cranky aren't we?

      Perhaps you should just try to find your applicants from other universities. I know that I have had to reverse engineer production code (from companies around where I study) and e.g. implement TCP/IP and webservers on custom hardware in C/C++ as part of one course. Many other courses I've taken also required similar skills, i.e. for me to take an existing system and extend it in different ways to do new things. And in a variety of languages.

      We are also required to have spent a couple of weeks out "in the real world", 12 weeks as of now. And that sure taught me a lot of things. Mainly company politics and how many ways you can spend your day trying to start solving your problem. (For all the normal Dilbert-esque reasons.)

      And I don't really get what you mean about the "if a biologist didn't know how to identify a microscope..." analogy. Are you insinuating that your new recruits didn't know what a compiler and similar was? Then, as I stated above, recruit from a different place. Or get rid of the HR person who interviewed and hired them and get someone competent on that job.

    7. Re:Hospital Systems by lucifuge31337 · · Score: 2, Interesting

      They aren't over-educated for a damn thing. They are under-educated for everything. Don't give out credit where it's not deserved.

      CS programs are supposed to teach both the theory AND the operations of current technology. This should allow CS grads to quickly learn new technology incrementally. That's the point of these programs.

      People coming out of tech schools are fine, but they often have no idea how things REALLY work (just "if "a" happens then I'm supposed to do "b" type of knowledge).

      OK...I'm pretty bored with the thread now.

      --
      Do not fold, spindle or mutilate.
  10. A second (unreliable) network? by shrinkwrap · · Score: 4, Insightful

    Or as was said in the movie "Contact" -

    "Why buy one when you can buy two at twice the price?"

  11. Disaster recovery by laughing_badger · · Score: 4, Interesting
    do you think the answer to having a massive and unreliable network is to build a second identical network?

    No. They did everything right. Falling back to paper and runners is the best they could do to safeguard patients' lives. An 'identical' network would be susceptible to the same failure modes as the primary.

    That said, hopefully it wasn't really six years since they had run a disaster exercise where they pretended that the computers were unavailable...

    --
    Help children born unable to swallow - www.tofs.org.uk
  12. Um.. by acehole · · Score: 4, Insightful

    In six years they never thought to have a backup/redundant system in place in case of a failure like this?

    Even the best networks will come unglued sooner or later. It's surprising to see that most businesses' networks need prime operating conditions to function properly.

    --
    Be you Admins? nay, we are but lusers!
    1. Re:Um.. by nolife · · Score: 2

      That is an issue with a lot of aspects of IT and in the real world. It is hard to justify the cost of a backup, redundancy, plan "B", virus software, firewall, faster network, more printers, wireless security, network intrusion detection, blah blah until you are burned by one or more of them.

      Normally a consultant will try to justify your need for these things to you but of course they are always selling the $perfect_product for that job also so naturally you take the suggestions with a grain of salt.

      The US may have needed a Department of Homeland Security years ago but no one wanted to jump on it until the WTC's.

      --
      Bad boys rape our young girls but Violet gives willingly.
    2. Re:Um.. by Anonymous Coward · · Score: 5, Interesting

      They're called "accountants". My father is a netadmin by trade, and the thing that stresses him most about his job is how, quote, "fucking bean counters" make the purchasing decisions for him.

      Example: They want to replace Netware fileservers (they've something around four years uptime, and that's including them having their RAIDs expanded. All that's going to stop them is a man with a sledgehammer) with Windows ones. While Windows servers, if configured correctly, are really stable, they are not stable enough for truly mission-critical jobs (in this case, dealing with insurance companies and medical evacuation. Time is not just money, it's life) yet the idiots in charge have been suckered by Microsoft's marketing.

      In this case, staying with netware has saved lives.

      Accountants have too much control. They do not understand that if something is vital, you do NOT give it anything less than the very best money can buy. So it'll cut into your profit margins. So what? At least you will still have the margins.

  13. 2nd network by Rubbersoul · · Score: 4, Insightful

    Yes I think having a 2nd network for a vital system is a good idea. This sort of thing is used all the time for things like fiber rings where you have the work and protect path. If the primary work path goes down (cut, maintenance, whatever) then you roll to the protect. Yes it is a bit more expensive but in a case like this maybe it is needed.

    --
    man .sig
    No manual entry for .sig.
  14. Re:Well! Woopsy! by hey! · · Score: 4, Interesting

    I don't think that deliberate malicious action is a very likely cause. The article wasn't for technical folk, so it's anyone's guess; mine is that the network grew gradually to the point where it couldn't be restarted. You can always add a few nodes to a large network, but it isn't necessarily possible to start such a network from a dead stop. Probably a handful of well placed routers would have prevented this.

    However, a network like this could be life-critical, and there probably should be contingencies for a variety of circumstances, including deliberate subversion.

    --
    Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
  15. Re:That's why I hate automatic routing by parc · · Score: 3, Insightful

    And your change in routing policy is going to affect spanning tree how?

    How do you handle mobile users? What about dialup static IP addresses from multiple RAS devices?
    Hand-editing of routing tables works only in the most simple of networks.

  16. Short answer? No. by krinsh · · Score: 2

    Should there be a few replacement devices on hand for failures? Yes. Should there be backups of the IOS and configurations for all of the routers? Yes. Should this stuff be anal-retentively documented in triplicate by someone who knows how to write documentation that is detailed yet at the same time easy to understand? Yet another yes.

    If it is so critical, it should be done right in the first place. If a physically damaged or otherwise down link is ESSENTIAL to the operation or is responsible for HUMAN LIFE, then there should be duplicate circuits in place throughout the campus to be used in the event of an emergency; just like certain organizations have special failover or dedicated circuits to other locations for emergencies.

    Last but absolutely certainly not least; the 'researcher', regardless of their position at the school, should be taken severely to task for this. You don't experiment on production equipment at all. If you need switching fabric; you get it physically separated from the rest of the network or if you really need outside access you drop controls in place like a firewall, etc. to severely restrict your influence on other fabric areas.

    --
    I think with the interesting people, their lives can't possibly be wrapped up into a nice little package.
    1. Re:Short answer? No. by 42forty-two42 · · Score: 2, Insightful

      The researcher was just entering data in. Not experimenting with the network. Where do you expect him to store his experimental results? On a ZIP disk?

  17. What is spanning tree protocol? (google whoring) by Anonymous Coward · · Score: 5, Informative

    Spanning-Tree Protocol is a link management protocol that provides path redundancy while preventing undesirable loops in the network. For an Ethernet network to function properly, only one active path can exist between two stations.

    Multiple active paths between stations cause loops in the network. If a loop exists in the network topology, the potential exists for duplication of messages. When loops occur, some switches see stations appear on both sides of the switch. This condition confuses the forwarding algorithm and allows duplicate frames to be forwarded.

    To provide path redundancy, Spanning-Tree Protocol defines a tree that spans all switches in an extended network. Spanning-Tree Protocol forces certain redundant data paths into a standby (blocked) state. If one network segment in the Spanning-Tree Protocol becomes unreachable, or if Spanning-Tree Protocol costs change, the spanning-tree algorithm reconfigures the spanning-tree topology and reestablishes the link by activating the standby path.

    Spanning-Tree Protocol operation is transparent to end stations, which are unaware whether they are connected to a single LAN segment or a switched LAN of multiple segments.
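    As a rough illustration of the description above, here is a Python sketch that picks one loop-free set of links from a small switch topology, marks the rest as blocked, and then recomputes after a link fails. This is a heavy simplification and not the real protocol: actual STP elects the root by bridge ID, weighs path costs and exchanges BPDUs, whereas this sketch just does a breadth-first walk from a root you name. The switch names are hypothetical, and it assumes at most one link between any pair of switches.

```python
from collections import deque


def spanning_tree(links, root):
    """Split `links` into (active, blocked) using a breadth-first tree from `root`."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    visited = {root}
    forwarding = set()
    queue = deque([root])
    while queue:
        switch = queue.popleft()
        for peer in sorted(neighbors.get(switch, [])):
            if peer not in visited:
                # First path found to this switch stays active.
                visited.add(peer)
                forwarding.add(frozenset((switch, peer)))
                queue.append(peer)

    active = [link for link in links if frozenset(link) in forwarding]
    blocked = [link for link in links if frozenset(link) not in forwarding]
    return active, blocked


def fail_link(links, dead):
    """Return the topology with one link removed, ignoring endpoint order."""
    return [link for link in links if frozenset(link) != frozenset(dead)]


if __name__ == "__main__":
    triangle = [("sw1", "sw2"), ("sw2", "sw3"), ("sw1", "sw3")]
    print(spanning_tree(triangle, "sw1"))
    # Lose an active link and the formerly blocked one takes over.
    print(spanning_tree(fail_link(triangle, ("sw1", "sw3")), "sw1"))
```

    In the triangle, one link sits in standby until an active link fails, which is the redundancy-without-loops behaviour the text describes.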

    See this page for more info.

  18. Of course they need another network by virtual_mps · · Score: 5, Insightful

    Why on earth would a researcher be plugged into the same network as time-sensitive patient information? Yes it's expensive, but critical functions should be separated from non-critical functions. And the critical network needs to be fairly rigidly controlled (i.e., no researchers should "accidentally" plug into it.) Note further information in http://www.nwfusion.com/news/2002/1125bethisrael.html

    1. Re:Of course they need another network by ces · · Score: 2

      Not always true, most tech firms tend to have reasonably well designed networks, as do most companies that do a lot of OLTP such as airlines, banks, and brokerages.

      Large universities seem to have well designed redundant networks as well despite the difficulty of securing funds in that environment.

      --
      Happy Fun Ball is for external use only.
  19. How many domain controllers? by Hairy_Potter · · Score: 2

    If you're just using a Primary Domain Controller, that could be your problem. I'd recommend adding a backup PDC, as well as a Tertiary Domain Controller, and add an X.25 backup network layer to give you hot-swappability and real-time rollover capabilities.

  20. Comment removed by account_deleted · · Score: 2

    Comment removed based on user account deletion

  21. All Layer 2? by CatHerder · · Score: 5, Informative

    If Spanning Tree is what brought them down, and it had campus wide effect, then they're running their production networks as one big flat layer 2 network. This is almost definitely the root of the problem. Modern network design would divide the campus (and often individual buildings) into multiple subnets, using routing to get between nets. That way if something like STP goes wrong in one spot, it doesn't affect the others.

    Building a parallel identical net is almost definitely the wrong answer. Especially if it uses the same design and equipment!

    Unfortunately, often older networks grow in a piecemeal way and end up like this, commonly having application level stuff that requires it to be flat. The job of a good network engineer (and diplomat) is to slowly have all the apps converted to being routable and then subnet the net.
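    To make the subnetting idea concrete, here is a minimal Python sketch using the standard ipaddress module. The 10.20.0.0/16 block and the building names are invented for illustration; nothing here reflects how the hospital actually addresses its network.

```python
import ipaddress


def carve_subnets(campus_block, buildings, new_prefix):
    """Give each building its own routed subnet out of one campus block."""
    campus = ipaddress.ip_network(campus_block)
    subnets = campus.subnets(new_prefix=new_prefix)
    plan = {}
    for building in buildings:
        try:
            plan[building] = next(subnets)
        except StopIteration:
            raise ValueError("campus block too small for that many buildings") from None
    return plan


def same_broadcast_domain(plan, ip_a, ip_b):
    """True if both hosts fall in the same subnet, i.e. would share a layer 2 segment."""
    a = ipaddress.ip_address(ip_a)
    b = ipaddress.ip_address(ip_b)
    return any(a in net and b in net for net in plan.values())


if __name__ == "__main__":
    plan = carve_subnets("10.20.0.0/16", ["east", "west", "research"], 24)
    for building, net in plan.items():
        print(building, net)
    # Hosts in different buildings only talk through a router.
    print(same_broadcast_domain(plan, "10.20.0.5", "10.20.2.9"))
```

    Once traffic between buildings has to cross a router, a layer 2 problem in the research subnet stays in the research subnet.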

    1. Re:All Layer 2? by swb · · Score: 2

      We do some of this, although the logic and rationale most of the time for being able to do "any vlan to any port" has proven in my small environment (500 users, 6 floors, 6 VLANs) to be of somewhat limited value.

      I've trunked the DMZ to a port in our studio, kept closet switches on the core VLAN, and put a port in my office on the core network, but beyond that devices generally "belong" to the network they're on, and being able to dynamically move a given machine between ports and have it auto-home to the subnet it "belongs" to sounds like a lot of work and investment in time/software/record-keeping.

      We had a huge flat, shared-media (not switched) network when I started, now it's 100% 100MB switched with a Layer 3 switch at the core. I still get the willies when I think of the legwork alone required for fault isolation.

    2. Re:All Layer 2? by isdnip · · Score: 2

      BIDMC is a big place, too; two adjacent campuses (the old Beth Israel and Deaconess hospitals) and a lot of legacy stuff from pre-merger days. The articles are shy on details but from what I can tell, they had a mix of routable IP and non-routable protocols. The old ones (like LAT, or IPX if you don't route it) depend on bridging, and the routers try to be bridges too, and that's just not something they're good at.

      Indeed, bridging does not scale well. Campus-wide (both campuses, actually) support for any non-routing protocol is hazardous to a network's health. It's tempting to have a little bridged network and just add a little more, and a little more, but when it tips, it tips fast.

  22. Re:That's why I hate automatic routing by Swannie · · Score: 5, Interesting
    Routing has nothing to do with this, spanning tree is a layer two function, and is responsible for allowing multiple links and redundancy between switches in a network. A properly set-up network running properly set-up spanning tree works wonderfully. Unfortunately, many, many people play with things they don't understand (on a production network no less).


    This whole situation arises from poor training and poor design. Having several friends that work in hospitals, I know that they typically don't offer a lot of money for IT/Network jobs, and this is what happens when underpaid (read: inexperienced) people are allowed to run such a network.


    Done ranting now, can you tell I was laid off a while ago and now stuck in a contract with a network designed by a bunch of inexperienced people? :)


    Swannie

    --
    :q!
  23. OMG! by jmo_jon · · Score: 2, Funny

    The crisis began on a Wednesday afternoon, Nov. 13, and lasted nearly four days.

    Did that mean the doctors couldn't play Quake for four days!?

  24. Flat networks. by zerofoo · · Score: 2

    Do your VLANS share the same physical cable? If so, how are they connected? Do you use a one-armed router?

    -ted

    1. Re:Flat networks. by skinfitz · · Score: 2

      Do your VLANS share the same physical cable? If so, how are they connected? Do you use a one-armed router?

      The whole point of VLANS is so you can put multiple networks along the same cable. We distribute sets of VLANS to edge switches over fibre (and dark fibre to the remote sites at gigabit speed) where they are then separated out into 100Mbit ports on the switches.

  25. Complexity brings bugs by stevens · · Score: 5, Interesting

    The network at my company is quickly becoming so complex that neither I nor the admins can troubleshoot it.

    We have redundant everything -- firewalls, routers, load balancers, app servers, etc. The idea is to have half of everything offsite, so either the main site or the co-lo can go down, and we still rock.

    But with all the zones and NATs and rules and routing oddities, the network is less reliable than before. It takes days for them to fix routing problems or firewall problems. Every little problem means we need three people troubleshooting it instead of one admin.

    Developers suspect that there's a simpler way to do it all, but since we're not networking experts, it's just a suspicion.

    1. Re:Complexity brings bugs by Mr+Guy · · Score: 2

      neither I nor the admins
      Developers suspect that there's a simpler way to do it all, but since we're not networking experts

      Sounds like he's a developer, not an IT guy. It's none of his business what the problem is, he's just screwed when it doesn't work.

  26. Re:Why fly equipment from california?? by marklyon · · Score: 2, Interesting

    They have a huge hot lab in California where they have pre-configured switches, routers, etc. running and ready to go at a moment's notice. When my ISP went down, they sent (same day) three new racks of modems configured with our last known "good" configuration so all we had to do was unplug, pull, connect.

    It would be redundant to have one on each coast, because they were able to get our stuff to us the same day in rural Mississippi.

    --
    -- Mark Lyon http://www.marklyon.org
  27. Re:Reliability is inverse to the number of compone by Xugumad · · Score: 4, Insightful

    However, the probability of both failing at the same time is:

    0.1 * 0.1 = 1%

    So as long as it can run on just one out of two, you get a ten-fold increase in stability.

  28. Re:Why fly equipment from california?? by GLX · · Score: 2

    Because Cisco is very California-centric, and the fact is that when it comes to their switching and routing gear, there is very little troubleshooting "hardware" you can bring in that's more than commodity software loaded onto a commodity PC.

    The best thing they had was the input of (hopefully) knowledgeable Cisco engineers. God knows if they relied on Cisco TAC Level 1 support they'd still be down today.

    --
    Sig (appended to the end of comments you post, 120 chars)
  29. Re:Reliability is inverse to the number of compone by pknoll · · Score: 2, Informative
    Sure, but that's not the point of redundancy. The question you want to ask is: How likely is it that both redundant components will fail at the same time?

    That's how mirrored RAID arrays work: you increase your chances of a disk failure by adding more disks to the system due to probability; but your chances of recovering the data in the event of a crash go up, since more than one disk failing at once is unlikely.
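    The two numbers being argued about in this subthread can be put side by side with a few lines of Python. This is only a sketch under a big assumption, namely that failures are independent; the 10% figure is the one used in the parent comment, not a measured value, and the function names are my own.

```python
def prob_any_fails(p_single, copies):
    """Chance that at least one of `copies` independent components fails."""
    return 1 - (1 - p_single) ** copies


def prob_all_fail(p_single, copies):
    """Chance that every one of `copies` independent components fails together."""
    return p_single ** copies


if __name__ == "__main__":
    p = 0.1
    for copies in (1, 2, 3):
        print(
            f"{copies} copies: "
            f"something breaks {prob_any_fails(p, copies):.1%}, "
            f"everything breaks {prob_all_fail(p, copies):.2%}"
        )
```

    More parts means something breaks more often, but total loss gets rarer. The independence assumption is exactly what a second identical network would violate, since a shared design flaw takes out both copies at once.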

  30. Obviously not. by buss_error · · Score: 2
    do you think the answer to having a massive and unreliable network is to build a second identical network?

    Obviously, if something fails due to design, then duplicating the design duplicates the problem. While this can be a useful troubleshooting tool, it makes somewhat less sense for production environments.

    I would be willing to guess that the network was one giant collision domain, and that the trouble springs from that. But it is just a guess.

    --
    Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves.
  31. STP by netwiz · · Score: 2

    isn't that hard to troubleshoot. You look at the device ID that most recently made a Topology Change Notification, and then start looking at the hardware diagnostics for that system. If they're showing clean, reboot the switch. If, while the device is rebooting, the network stabilizes, you've found the problem. When the system finishes its boot, check the hardware diagnostics again (Ciscos only run H/W diags at POST, and a reset is the only way to re-run them); odds are that you'll see there's a failed component.
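    The first step (find whoever most recently reported a topology change) is easy to sketch once you have the numbers written down. The record layout, switch names and values below are hypothetical; how you collect them from each switch depends on your gear, and the sketch only does the ranking.

```python
from typing import NamedTuple


class TopologyChange(NamedTuple):
    switch: str
    seconds_ago: int   # time since this switch last reported a topology change
    change_count: int  # total topology changes it has reported


def suspects(records, limit=3):
    """Rank switches by most recent topology change, ties broken by higher change count."""
    ranked = sorted(records, key=lambda r: (r.seconds_ago, -r.change_count))
    return [r.switch for r in ranked[:limit]]


if __name__ == "__main__":
    collected = [
        TopologyChange("core-1", 86400, 4),
        TopologyChange("closet-3b", 12, 950),
        TopologyChange("closet-2a", 12, 40),
    ]
    # The first name is where to start checking hardware diagnostics.
    print(suspects(collected))
```

    A switch with a very recent change and a huge change count is the obvious place to start pulling diagnostics.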

    A previous poster nailed it too, simply back out the changes you made (obviously the problem you were fixing is of a lower magnitude than a total outage), and things should start working again.

  32. My best hospital glitch by eaddict · · Score: 5, Informative

    was a human error. We were a smallish hospital (270 beds). I was the new IS Manager. I was looking for power outlets in the computer room for all the new equipment I had ordered. Well, there were a lot of dead plugs. Also, I was told to stop since electricity based things like that were left up to the union guys. No big deal. I called them and asked them to locate and label the outlets under the raised floor. While I was sitting at my desk later that day the power went off for a sec then on.... I got up and looked toward the data center. The lights AND the equipment went off then on. I ran in to find the union guys flipping switches on the UPS (on/off). They had stuck a light bulb w/plug in each of the open outlets and were flicking the power on and off to see what bulb was affected. They were on the equipment side of the UPS! All of our servers, network gear, and such took hard downs that day! Ahhh!!! Who needs technology to make things not work! This was the same union that wrote me up for moving a cube wall to get at an outlet. Moving furniture was a union duty!

    --
    "If you are on fire you can just stop, drop, and roll. If you fall into Lava you are just dead." - my 5yr old daughter
  33. Lawsuit by Gary+Franczyk · · Score: 2

    There will probably be many lawsuits after this.

    The line of thinking will be something like this:

    How many people died or will die, or get improper treatment because of this networking glitch? If the hospital is as large as described, certainly a number of persons were given inadequate healthcare while they were there.

    Some may have a good case.

    1. Re:Lawsuit by Waab · · Score: 2

      I'm afraid in our lawsuit-oriented society the line of thinking will be something more like:
      How many people happened to be within 2 blocks of the hospital during this glitch and how many of them feel an overwhelming sense of entitlement that might motivate them to join a class-action suit?

      I fear a fairly large number of people will see this as an opportunity to sue, regardless of the quality of care they received during the network outage. I'm sure there are plenty of people who feel their lives weren't saved fast enough or at least weren't saved with the quality of service they feel they deserve.

      Oh, and IANAL.

    2. Re:Lawsuit by JohnnyBolla · · Score: 2

      That's true, many people are litigious asses. In fact, the people that can have that line of reasoning should be lined up and shot.

      --
      Carpe Deez
  34. Cisco implemenatation of Spanning Tree sucks by xaoslaad · · Score: 4, Interesting

    I am not up to speed on spanning tree, but speaking with a coworker after reading this article it is my understanding that Cisco equipment runs a new instance of spanning tree each time a new VLAN is created. As you can imagine in such a large campus environment there can be many tens if not hundreds of VLANS. In a short time you turn your network into a spanning tree nightmare. I'd much rather use some nice Extreme Networks (or Foundry or whatever) Layer 3 switching equipment at the core and turn off spanning tree. Use tagged VLANS from the closets to the core and voila no need for spanning tree... Use Cisco edge devices for WAN links. Building out a second rat's nest out of the same equipment seems foolish.

    I'm not even sure how much Layer 3 switching equipment Cisco has; not much at all from my talking around in the past. It may not be possible to turn around and re-engineer it with the existing equipment; but I think that I would much rather throw out the vendor and reengineer the entire thing correctly before putting in a second shabby network.

    I speak from having assisted on something like this on a very small campus environment (1,500 nodes maybe) and we basically tore out a disgusting mess of a LAN and implemented a fully switched, beautifully laid out network with redundant links to all closets, an 8 GB trunk between two buildings, etc. in the space of one weekend. Obviously there was tons of planning involved, cabling run in preparation and so on, but what a fantastic move it was.

    Sure there were hiccups Monday morning, but everything was perfectly fine by the end of the week.

    Two wrongs don't make a right.

    1. Re:Cisco implemenatation of Spanning Tree sucks by netwiz · · Score: 4, Informative

      Cisco only runs per-VLAN spanning tree if you're using ISL as your trunking protocol. The reason you don't get it on Extreme Networks stuff is because they use 802.1q. In fact, Cisco devices trunking w/ the IEEE protocol run single instances, just like the Extreme product.

      There are tradeoffs, of course. STP recalculations (when running) can be kind of intensive, and if you've got to run them for each of your 200 VLANs, it can take a while. However, for my particular environment, per-VLAN STP is a better solution.

    2. Re:Cisco implemenatation of Spanning Tree sucks by photon317 · · Score: 2


      Putting layer-3 switching only (no pure L2 devices) all the way out to the workstations is prohibitively expensive. Anytime you've got multiple L2 switches in a segment, you should have spanning tree turned on. Turning it off will seem like a gain, till some dumb user plugs two of your network connections into a 4-port hub under his desk and you start getting broadcast storms. Spanning Tree saves you from these types of disasters and a myriad of other possibilities.
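      If you want to see why a loop is so nasty without spanning tree, here is a toy Python simulation of one broadcast being flooded. It is a deliberately crude model (every switch forwards a frame out every link except the one it arrived on, one hop per step, no MAC learning, no timing), and the three-switch topologies are invented for the example.

```python
def broadcast_frames_in_flight(links, origin, steps):
    """Flood one broadcast from `origin`; return the frame count in flight at each step."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    # Each in-flight frame is (sending switch, receiving switch).
    in_flight = [(origin, peer) for peer in neighbors.get(origin, [])]
    counts = []
    for _ in range(steps):
        counts.append(len(in_flight))
        next_flight = []
        for sender, receiver in in_flight:
            for peer in neighbors[receiver]:
                if peer != sender:  # never send back out the arrival link
                    next_flight.append((receiver, peer))
        in_flight = next_flight
    return counts


if __name__ == "__main__":
    chain = [("A", "B"), ("B", "C")]
    looped = chain + [("C", "A")]  # somebody adds the extra cable
    print("no loop:", broadcast_frames_in_flight(chain, "A", 6))
    print("loop:   ", broadcast_frames_in_flight(looped, "A", 6))
```

      Without the loop the broadcast dies out after it reaches the far end. With the loop the same frame circulates forever, and that is from a single broadcast; a real network sends them constantly.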

      --
      11*43+456^2
    3. Re:Cisco implemenatation of Spanning Tree sucks by PatJensen · · Score: 2
      There are a few Cisco-related features in both CatOS and IOS that can improve spanning-tree convergence on large networks - but they have to be engineered at all layers from the get go. (core, distribution and access) All of your switches must have versions of software that support them as well.

      Spanning tree backbonefast lets your core layer switches reconverge after a link/switch failure quite rapidly. Used in connection with spanning tree uplinkfast, your distribution and access layer switches can switch over to another redundant copper or gigabit fiber link quickly without waiting for full spanning tree convergence.

      Another feature that seems to be widely used (and probably the most dangerous), is spanning-tree portfast - this gives access layer switches the capability to immediately begin forwarding a workstation's packets on the network. portfast should NOT however be used on trunk, channel or hub links as it can create a bridge loop by a user/site support mistakenly plugging in a crossover cable.

      Hope this helps!

      -Pat

  35. Are you crazy? by AriesGeek · · Score: 2, Insightful

    Disable STP? And create, or at least take the risk of creating bridging loops? That will bring the network right back down to its knees!

    No, disabling STP is NOT an option. Learning how to use STP properly is the option.

    --
    Insert offensive troll-style sig here. Please mod or respond appropriately.
  36. The real problem by Enry · · Score: 4, Insightful

    There was no central organization that handled the networking for the associated hospitals, so more networks just got bolted on until it couldn't handle the load.

    So what are the lessons?

    1) Make sure your solution scales, and be ready in case it doesn't.
    2) Make sure some overall organization can control how networks get connected.

  37. I don't buy it by hey! · · Score: 5, Insightful

    The same explanation was floated in the Globe, but I don't buy it.

    People doing debugging tend to fasten onto some early hypothesis and work with it until it is proven definitively false. Even if jobs aren't on the line, people often hold onto their first explanation too hard. When jobs are on the line nobody wants to say the assumptions they were working under for days were wrong, and some people will start looking for scapegoats.

    The idea that one researcher was able to bring the network down doesn't pass the sniff test. If this researcher was able to swamp the entire campus network from a single workstation it would suggest to me bad design. The fact that the network did not recover on its own and could not be recovered quickly by direct intervention pretty much proves to me the design was faulty.

    One thing I would agree with you on is that the hospital probably needs a separate network for life critical information.

    --
    Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
    1. Re:I don't buy it by anonymous+loser · · Score: 2
      The idea that one researcher was able to bring the network down doesn't pass the sniff test. If this researcher was able to swamp the entire campus network from a single workstation it would suggest to me bad design. The fact that the network did not recover on its own and could not be recovered quickly by direct intervention pretty much proves to me the design was faulty.

      I had to reread this a couple of times. It looks to me like you're saying that it couldn't be a single application because that would indicate a poorly designed network, then go on to say the network was poorly designed.

    2. Re:I don't buy it by NecroPuppy · · Score: 2, Interesting

      I think he's laying more of the fault at
      the bad network design than any app that
      was run on it.

      I.e., the app was only able to do as much
      damage as it did because the network was
      so bad; if the network had been set up
      'properly', then the app could have only
      done localized damage.

      Does that make sense?

      --
      I like you, Stuart. You're not like everyone else, here, at Slashdot.
    3. Re:I don't buy it by DaveV1.0 · · Score: 5, Informative
      Actually, if you read the article carefully, they say that the application the researcher was running was the straw that broke the camel's back.

      "The crisis had nothing to do with the particular software the researcher was using."
      "The large volume of data the researcher was uploading happened to be the last drop that made the network overflow. "

      While it is never said directly, the implication is that the network was in bad shape to begin with, and when this guy started doing whatever he was doing, it just pushed things over the edge.

      --
      There is no "-1 offended" or "-1 you don't agree with me" mod options for a reason.
    4. Re:I don't buy it by Alsee · · Score: 2, Funny

      Someone moded Kathleen's "Yes" as Offtopic. He can kiss those moderator privileges goodbye.

      If "offtopic" results in a loss of moderation rights I'd hate to see what the consequences would have been for calling her a troll :)

      -

      --
      - - You can't take something off the Internet! That's like trying to take pee out of a swimming pool.
    5. Re:I don't buy it by ScuzzMonkey · · Score: 2

      Exactly. That's also implied by the fact the article mentions that an outside consultant had previously recommended a network overhaul and that it had already been approved--just not yet implemented, unfortunately.

      --
      No relation to Happy Monkey
    6. Re:I don't buy it by patter · · Score: 2, Interesting

      While it is never said directly, the implication is that the network was in bad shape to begin with, and when this guy started doing whatever he was doing, it just pushed things over the edge

      Makes a lot of sense actually. I've been doing a bit of a campaign for a while to have a separate domain or the ability to connect my test machines (in complete isolation of course) to only each other and maintain my OWN PDC... of course no one thinks this is a good idea, but some of the tests I need to run can bog down when the network's busy, and they of course are not helping the rest of the network be happy.

      Our network's reasonable, but people should give software folks what they need, not force them to work under the constraints the sales folks do (for example).

      Sure, we have to respect the 'rules' when joining the normal network for email and such, but testing of network applications should almost be on a smaller completely isolated network (to prevent dragging down the whole system when an automated test goes awry).

      Infinite loops don't just happen to stupid people ;). Anyone can get too tired to realise they're sending a billion packets a second because they reversed a conditional or something.

      I know a developer who had to leave one job because the IT folks didn't understand why he couldn't develop windows services without admin equivalence on his local machine (duh).

      --
      -- If at first you do succeed, try to hide your astonishment. -- Harry F. Banks
    7. Re:I don't buy it by John+Hasler · · Score: 2

      > Infinite loops don't just happen to stupid people
      > ;). Anyone can get too tired to realise they're
      > sending a billion packets a second because they
      > reversed a conditional or something.

      That would account for a temporary slowdown, but a robust network would have recovered as soon as he pulled the plug. This one didn't. Are they going to actually fix it, or just throw more hardware at the problem?

      --
      Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
  38. Re:Reliability is inverse to the number of compone by Alranor · · Score: 2

    I'm a little confused here:-

    Prob train A fails = 0.1
    Prob train B fails = 0.1

    Prob train A doesn't = 0.9
    Prob train B doesn't = 0.9

    So Prob neither fail = 0.9 * 0.9 = 0.81

    So prob at least one fails = 0.19 = 19%

    One of us has got the maths wrong.
    Can someone who's not trying to remember his stats courses from years back tell me if it's me :)

  39. done right in the first place by wiredog · · Score: 3, Interesting
    You've never worked in the Real World, have you? It is very rare for a network to be put in place, with everything attached in its final location, and then never ever upgraded until the entire thing is replaced.

    In the Real World, where you can't shut everything down at upgrade time, a PDP-11 connected to terminals was put in 25 years ago. The PDP-11 was replaced with a VAX, which ran in parallel with the PDP-11 while it was brought online. A few years later a couple of PC's (running DOS 3.0) were hooked up to each other via a Novell network, which was connected to the VAX. Ten years ago the VAX was replaced with a few servers, which ran in parallel with the VAX until they were trusted. Along the way various hubs, switches, and routers were installed. And upgraded as the need arose. The cables were upgraded, also as the need arose, and not all at once.

    1. Re:done right in the first place by krinsh · · Score: 2

      Yes I have worked in the Real World before but I won't claim to be a super expert on any of this. It's just my opinion. And it should be documented. I've worked in a couple of places where the place closes down if their regulatory agency comes in and doesn't find all the proper documentation for everything, and that includes data processing.

      --
      I think with the interesting people, their lives can't possibly be wrapped up into a nice little package.
  40. Fix it the first way that works. by tomblackwell · · Score: 3, Insightful

    If you have something that's broken, and you need its functionality soon, and don't have a fucking clue as to what's wrong with it, you might want to replace it.

    It may not be the right way to do it, but they're running a hospital, and might not have the time to let their network people puzzle it out.

  41. Network Utilization Analysis not run yet by chopkins1 · · Score: 2, Interesting

    In the article, it also states that they had just approved a contractor to do a network analysis: "on Oct. 1, hospital officials had approved a consultant's plan to overhaul the network - just not quite in time." If the article summary gives the correct information, I'll bet that large parts of their network were overburdened and hadn't been upgraded in years.

    They were probably running at around 30-35% capacity and most networks get REAL funny at around that point. The following comment is rather telling: "The large volume of data the researcher was uploading happened to be the last drop that made the network overflow."

    Another telling comment about the situation was: "network function was fading in and out".

  42. Re:Reliability is inverse to the number of compone by dago · · Score: 2

    I don't know what SAT is, but I think you made some mistakes.

    If your 10% is the probability that 1 train will fail during the NY -> LA trip, then you've got the following probabilities:

    0 trains fail = 0.9 * 0.9 = 0.81
    1 train fails = 2 * 0.1 * 0.9 = 0.18
    2 trains fail = 0.1 * 0.1 = 0.01

    which means that the probability of having at least one train going from NY -> LA is 0.81 + 0.18 = 99%, much better than the previous 90%.
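
    For anyone who wants to check the arithmetic, here is a minimal Python sketch. It assumes the two trains fail independently, each with probability 1/10; the function name is made up for illustration.

```python
from fractions import Fraction
from itertools import product

# Assumption: each train fails independently with probability 1/10.
P_FAIL = Fraction(1, 10)

def failure_distribution(p_fail, trains=2):
    """Map 'number of failed trains' to its exact probability."""
    dist = {}
    for outcome in product([True, False], repeat=trains):  # True = failed
        prob = Fraction(1)
        for failed in outcome:
            prob *= p_fail if failed else 1 - p_fail
        count = sum(outcome)
        dist[count] = dist.get(count, Fraction(0)) + prob
    return dist

dist = failure_distribution(P_FAIL)
p_at_least_one_fails = 1 - dist[0]    # both trains are needed
p_at_least_one_arrives = 1 - dist[2]  # one surviving train is enough

for failed, prob in sorted(dist.items()):
    print(f"{failed} failed: {float(prob):.2f}")
print(f"at least one fails:   {float(p_at_least_one_fails):.2f}")
print(f"at least one arrives: {float(p_at_least_one_arrives):.2f}")
```

    The two headline numbers answer different questions: "at least one fails" matters when you need both trains, "at least one arrives" matters when either one will do.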

    --
    #include "coucou.h"
  43. Re:the sad part by krinsh · · Score: 3, Insightful

    While paper-based may seem like the best solution to you; what you don't realize is that paper-based is just a single phrase for the rest of these 'bases':

    sneaker-based when everyone must run throughout passing paper;

    warehouse-based when rows upon rows of storage are now required to keep all these bits of paper;

    administrative overhead based when you realize that it takes two minimum-wage file clerks for every one form per desk - not functional area - to file and find and that takes a LOT of time;

    and Mexican-based (yes, I said Mexican - who do you think most major businesses pay to do this? I know for a fact they ship things like this there by the truckload.) when you need cheap data entry and "error checking" [which is very unreliable when they can't read your language!] to enter information that could not be read from handwriting and then index them with a reasonable filing code.

    Having spent a considerable amount of time as an admin assistant myself; and later as a document imaging and workflow support person, I can tell you that the cost and manpower savings far outweigh any perception or consideration for robustness or reliability.

    The PHBs - or very likely the 'managed care' people (and that should have been put in quotes too) that provide a lot of the funding for the hospitals likely decided to save a few thousand since it wasn't lifesaving equipment or blood products/pharmaceuticals/etc.

    --
    I think with the interesting people, their lives can't possibly be wrapped up into a nice little package.
  44. Re:Well! Woopsy! by Ken+Dods'+dad's+dog' · · Score: 2, Interesting

    I have seen this happen before in an organisation I have worked for. It happened when a second Cisco network (installed by a large well known company) was joined to an existing one and the routing protocol problems of the new network corrupted the existing one. Solution was to disconnect the two and force the external company to rebuild the new network from scratch.

  45. This assumes.. by nurb432 · · Score: 5, Informative

    That it was a network upgrade, sometimes it's not, and you have no clue what was changed, by *someone else*...

    As far as a parallel network, that's a tad overkill.. proper redundant pathways should be enough.. and plenty of packet filtering/shaping/monitoring.

    and keep a tighter rein on what is allowed to be attached to the PRODUCTION network..

    --
    ---- Booth was a patriot ----
  46. YES- air traffic management experience... by mekkab · · Score: 5, Interesting

    Yes. You do things in parallel and you make things redundant. You are fabricating reliability out of unreliable components vis-a-vis TCP over IP.

    Let's talk about real-time systems. No, not "Voice over IP" or "streaming video" crap, I mean REAL human grade real-time systems.

    How do they get 99.99999% reliability? The components they use may be good, but nothing is that good! They get it by 1) removing single points of failure and 2) rigorously analyzing common mode failures (a sequence of failures that brings everything down).

    How is this done? You put things in parallel. Machines are multi-homed. Critical applications are Hot-standby, as are their critical servers. You have the nightmare of constant Standby-Data Management (the Primary sending a copy of its every transaction to the secondary and to the tertiary) but when the power on one side goes out (of course your primary and standby are in different buildings connected to different power supplies, right?!) the secondary steps right up.
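
    A rough Python sketch of why putting things in parallel gets you there. The 99.9% per-component figure is a made-up example, and the math assumes failures are independent, which is exactly what the common mode analysis has to establish.

```python
from fractions import Fraction

def parallel_availability(component_availability, copies):
    """Availability of `copies` redundant components when any one suffices.

    Assumes failures are independent, i.e. no common mode failures.
    """
    unavailability = 1 - component_availability
    return 1 - unavailability ** copies

def copies_needed(component_availability, target):
    """Smallest number of redundant copies that meets the target."""
    copies = 1
    while parallel_availability(component_availability, copies) < target:
        copies += 1
    return copies

# Hypothetical figures: each component is up 99.9% of the time,
# and the goal is 99.99999% for the system as a whole.
component = Fraction(999, 1000)
target = Fraction(9999999, 10000000)

print(copies_needed(component, target))
```

    A shared power feed or a shared bug breaks the independence assumption, which is why the primary and standby go in different buildings.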

    --
    In the future, I would want to not be isolated from my friends in the Space Station.
  47. Redundancy, Redundancy, Redundancy by ChaosMt · · Score: 2
    If it's critical, YES! When something is life or death, such as a hospital, it is worth it to be prepared. N+1 redundancy.


    The sad thing is I've seen this so many times before in different medical environments I've been in. They usually aren't very motivated to spend money on *any* infrastructure costs. Hospitals may spend some, but it's usually with the motivation to increase donations; "Oh look! It's shiny!"


    Just like any other critical service, it costs big bucks to be prepared. How much you want to bet they 1) didn't have version control, 2) didn't have change control and ... I could go on. The point is everyone plans for system redundancy and recovery, but just assumes the network is resilient. The network is the computer - i.e., the system is the network.


    I am proud of them for one thing in particular. IMHO, your last line of redundancy, backups and recovery, etc. should ALWAYS be tangible. When you are involved with something life, death or riches, dead tree backups are the most reliable form I know. I am glad not everyone has lost their common sense to electron envy.

  48. Been there done that, got the ass beating by nt2UNIX · · Score: 3, Insightful

    In a large switched network spanning tree can save your butt and burn it. We try to test our switch changes before they are implemented. ON A TEST NETWORK.

    I know it's hard for everyone to believe, but vendors lie and those whiz bang network tools can screw you over.

    We have several thousand users on our campus with several thousand computers. We run about a half a dozen 6500 series Cisco Switches. Spanning tree re-calculations take about a second or 2. This is no big deal. And your traffic is re-routed nicely when something goes wrong. But if an interface (which is an uplink into the other switches) is freaking out and going up or down, the whole network will grind to a halt with spanning tree.

    Test Network GOOD (if you have the money).

  49. The Solutoin by Shishak · · Score: 5, Insightful

    Is to not bother with a second network. They need to break the spanning tree up a bit with some layer 3 routers. Sometimes it is fun to have a nice big layer 2 network. It makes life easy. It sucks to debug it when one half of a leg goes down and you get spanning-tree loops. The switches go down in a ball of flames that way.

    The solution is to put some edge routers in every building (Cisco 6509's with MSFC cards). Segment each building into different IP networks. Route between the networks. That way you may lose a building if the spanning-tree goes futzed but you won't lose the whole campus.

    Sure you'll be a touch slower routing between the segments but you'll have much more reliability.
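
    As a sketch of the addressing side of that, here is how a campus block could be carved into one routed subnet per building. The 10.20.0.0/16 block, the /24 size, and the building names are all assumptions for illustration, not anything from the hospital.

```python
import ipaddress

# Hypothetical campus block and building names, for illustration only.
CAMPUS = ipaddress.ip_network("10.20.0.0/16")
BUILDINGS = ["east-campus", "west-campus", "research", "clinical-labs"]

def plan_subnets(campus, buildings, new_prefix=24):
    """Assign each building its own routed subnet carved from the campus block."""
    subnets = campus.subnets(new_prefix=new_prefix)
    return {name: next(subnets) for name in buildings}

plan = plan_subnets(CAMPUS, BUILDINGS)
for name, net in plan.items():
    # Convention assumed here: the first address is the building's router.
    gateway = net.network_address + 1
    print(f"{name:14} {net}  gateway {gateway}")
```

    Each building's spanning tree then ends at its router, so a loop in one building stays in that building.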

    --
    Now I hope and pray that I will But today I am still, just a bill
    1. Re:The Solutoin by Large+Green+Mallard · · Score: 2

      I'm a network admin for a university department.. I think the smartest thing my department ever did was have all our subnets routed. Almost every other department is switched, so the thing with the default gateway for client machines is a switch up to several kilometres away ;)

      This was of course after my current manager with a clue about networking came along and saw the hub serving as a network core that then had 10 bridges hanging off it for segmenting the network into each subnet... Of course, he then bought a Nortel Accelar to use as the network core.. but he's seen the folly of his ways now, and we have a Cisco 3550 doing that now ;)

  50. Simple Answer by DarkZero · · Score: 2

    I'm surprised I'm not seeing the really simple, obvious answer here to the question that's posed in the story.

    do you think the answer to having a massive and unreliable network is to build a second identical network?

    Don't build a second identical network. Just set it up so that whenever a file is saved, it's dumped onto a secondary network that's locked down so tightly that it doesn't run programs, search for documents, or anything like that. It just provides documents and that's it. For instance, it could be just a bare bones, huge-ass listing of links to patient data in a single document, and you would just use Ctrl+F or some such to find the name, and then click through it to see a TXT or HTML document with the patient's data in it. That way, you can have fancy programs and extensive information and such on the normal network without risking the network instability that comes with them.

    1. Re:Simple Answer by gorilla · · Score: 4, Interesting
      Having worked in a hospital, I'll tell you that's not acceptable.

      Medical records are probably the most sensitive records there are, and therefore it's essential that any access to them is both authenticated and audited. The first ensures that only authorized people can access them. The second ensures that in the event of misuse of the records, this can be detected - eg if someone who has authorization to access records decides to look up their neighbours without good reason.

  51. Add a second network? Not likely to help by markwelch · · Score: 5, Insightful
    > Do you think the answer to having an massive and unreliable network is to build a second identical network? <

    Of course not. Two solutions are more obvious:

    1. Fix or replace the existing network with a more reliable one (probably one that is less centralized so outages would not affect the entire campus); or
    2. If a second network is going to be added to provide reliable backup, then the second network should certainly not use the same technology as the first.
    A third, and somewhat obvious, solution would be to make sure that
    • crucial data is kept on the local server farm, but also copied in real time to a remote server; and
    • a backup access mode (such as a public dial-up internet connection, with strong password protection and encryption) is provided for access to either or both servers, in the event of a crippling "local" network outage.

    This might also be a good reminder to get very aggressive "liquidated damages" clauses in contracts like this, or to buy insurance. If a patient dies because of the network outage, I am sure that everyone in the supply chain will be named in the lawsuit.

    The liquidated damage clause is intended to provide an unambiguous motivation for the technology provider to fix the problem quickly, while the insurance would cover all or a portion of the losses if there is a failure.

    I would be extremely surprised if a huge campus like this one did not have a substantial number of different technologies in use, including wireless, and clearly networking them all into the same patient-records database is a challenge.

    --
    -- http://www.MarkWelch.com/ Pleasanton California
  52. I have the solution... by FleshWound · · Score: 4, Funny

    I live in the Boston area, and I have the perfect solution: they should hire me. I'll make sure their network never fails.

    Well, maybe not. But I still need a job... =)

  53. Networks are fragile. by XPisthenewNT · · Score: 3, Interesting
    I am an intern in a networking department where we use all cisco stuff. Spanning tree and some other protocols are very scary because once one switch declares itself a server of a given protocol, other switches "fall for it" and believe the new switch over the router. Getting the network back is not as easy as turning off the offender, because the other switches are now set for a different switch server. Power outages are also very scary because if switches use any type of dynamic protocol, they have to come back up in the right order; which Murphy's Law seems to indicate would never happen.
    Networks are fragile, I'm surprised there aren't more massive outages.
    The answer might be to hire competent network staff, and perhaps train some other IT employees with basic knowledge to help in emergencies. A second network seems a little extreme--both cost and management wise.

    KISS: Keep it simple, stupid!

    1. Re:Networks are fragile. by Mr.+KaryHead · · Score: 2, Informative

      Networks can be fragile and spanning tree can certainly cause some of the problems. That is why one must design the spanning tree topology. When you say "one switch declares itself a server of a given protocol", I assume you mean "declares itself the root of a VLAN." The root is determined by the lowest advertised bridge ID from each switch. The bridge ID is the bridge priority plus the bridge address. Cisco switches have a default bridge priority. So then it boils down to whichever switch has the lowest bridge address becomes the root, which could be any switch anywhere in your network. The network admin should decide which switch will be the root for a given VLAN and set the bridge priority lower. And then he/she selects another switch to be a backup root and sets its priority to be lower than the default but higher than the root's priority. So if you don't manually set the root then a new switch plugged into the network could very well become the root if all the switches have a default priority and the new switch has a lower bridge address than the current root.

      If this happens, you can just turn off the offender to get your root back. In STP only the root talks. If the other switches don't hear from the root in something like 20 seconds, then they'll elect a new root.
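
      The election rule is easy to sketch in Python. This is only a toy model of the comparison (priority first, then bridge address), not the protocol itself; the switch names and MAC addresses are invented, and 32768 is the 802.1D default priority.

```python
from dataclasses import dataclass

DEFAULT_PRIORITY = 32768  # the 802.1D default bridge priority

@dataclass(frozen=True)
class Bridge:
    name: str
    mac: str                      # bridge address, colon-separated hex
    priority: int = DEFAULT_PRIORITY

    @property
    def bridge_id(self):
        # Priority is compared first, then the MAC address breaks ties.
        return (self.priority, bytes.fromhex(self.mac.replace(":", "")))

def elect_root(bridges):
    """The bridge with the numerically lowest bridge ID becomes root."""
    return min(bridges, key=lambda b: b.bridge_id)

# Hypothetical switches: names and addresses are made up.
core = Bridge("core-1", "00:30:aa:00:00:01")
access = Bridge("access-7", "00:50:bb:00:00:07")
lab_spare = Bridge("lab-spare", "00:01:cc:00:00:99")

# All defaults: the newly plugged-in spare has the lowest MAC and takes over.
print(elect_root([core, access, lab_spare]).name)

# Pinning the intended root (and a backup) with lower priorities prevents that.
core_pinned = Bridge("core-1", "00:30:aa:00:00:01", priority=4096)
backup = Bridge("access-7", "00:50:bb:00:00:07", priority=8192)
print(elect_root([core_pinned, backup, lab_spare]).name)
```

      With every switch left at the default, the winner is decided by whatever MAC address the vendor happened to burn in, which is the situation described above.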

      -Kary

  54. Was it OSPF? by Anonymous Coward · · Score: 2, Interesting

    The article is a little light on technical details, but does anyone know what internal routing protocol they were using? We've got a network with 11 cisco routers running OSPF. The routing changes happen very often, because there's a bunch of dial-ups and a few dozen routes that come and go with short-term connections (like backups from a remote office or running a CC authorization from a remote office). Everything works perfectly if none of our three newest routers are the first powered up. Those three are running IOS 11.0. After several calls to cisco (we buy all cisco internally and for our customer ends, so we get very good support from them) over the past three years, cisco is still stumped as to what the problem could be. The lines in the config file for OSPF are only five lines long, so we (and cisco) are sure there's no problem there. The hospital's problem sounds like it's of the same sort.

  55. Previous lack of funding for IT? by quark2universe · · Score: 2

    If this hospital is like any of the medical institutions I've worked for, then it's not unreasonable to expect that the IT group has been begging for more money to upgrade the infrastructure because they knew this kind of thing could happen. This usually falls on deaf ears at the doctor and senior administration level of the hospital because they see computers and networks as "magic" and don't take any time to understand the kind of reliance that is now placed on those systems. Also, it is very common for doctors to reject any spending on IT because it will bring their 8 figure salaries down to 7 figures and that is totally unacceptable!!! The story did say they are looking at $3 million for future upgrades, but that ONLY happened after this disaster.

    --

    Believe in things of which no person has ever learned
  56. Redundancy and death by FearUncertaintyDoubt · · Score: 2, Insightful
    Of course, as open as they were about the whole incident, the hospital did not disclose whether any patients were affected or even died due to the breakdown (nurses having wrong information, staffing problems caused critical situations to wait too long, etc.).

    A lot of people here have said "build a 2nd network," to which some have basically said, "that's stupid, make your first network run right." I think that if we're talking about life and death of patients, a second network would be a good idea. It's like the high factors of safety built into things like, say, an elevator -- a failure can cause death, so you overbuild it. Remember that you don't have to make everything redundant, just those critical parts of the system. Maybe all the administrators can only use the primary network, but the blood testing labs and nurses' stations and such can use either primary or secondary. Cutting off non-critical traffic during an outage also helps keep the whole system more stable.

  57. Life threatening? by saider · · Score: 3, Insightful

    I hope "The machine that goes ping" does not require the network to run. My guess is that much of that equipment is plugged into the red outlets and can run on its own for a fair amount of time. If it is hooked up to the network it is to report the machine status, which is independent of machine operation.

    The network crash in question screwed up the document process, slowed everything down, and made life inconvenient, but I doubt anyone's life was at risk.

    --


    Remember, You are unique...just like everyone else.
    1. Re:Life threatening? by benwb · · Score: 5, Insightful

      Test results and labs come back on computer these days. More and more hospitals are moving to filmless radiology, where all images are delivered over the network. I don't know that much about this particular hospital, but I do know that hospitals en masse are rapidly approaching the point where a network outage is life threatening. This is not because the machine that goes ping is going to go off line, but because doctors won't have access to the diagnostic tools that they have now.

  58. Down here in North Carolina by LWolenczak · · Score: 2

    I used to work for a systems integrator. Just by general practice, anything that was mission critical was on a separate network.... if not two different networks. This is most likely a WinXP machine on which somebody played with the STP/VLAN settings.

    Speaking of teaching hospitals... Yes, they are large..... I live just a few miles from Wake Forest/Baptist Hospital. They add, or renovate a wing a year.... There are always large cranes over the building... and since I'm looking for work... I applied there... Even though they had a plethora of positions open for Network Techs, and since I'm well overqualified, and cheap... you would have thought they would have hired me... they did not... they seem to go for bottom barrel regarding techs... cheapest... most likely they think A+ is the best cert you can get.

    1. Re:Down here in North Carolina by buss_error · · Score: 2
      they did not... they seem to go for bottom barrel regarding techs

      I know at some places I've worked, the question is "Well, if they're that good, then they wouldn't settle for our wage. They'll just leave when a better paying job rolls around. Better to hire someone that will stay."

      --
      Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves.
  59. QoS and network boundaries by pangur · · Score: 5, Informative
    There are several non-exclusive answers to the Beth Israel problem:

    1) introduction of routed domains to separate groups of switches

    2) ensure that more than one redundant switching loop does not terminate in a switch. I've had a single switch be the linchpin between two loops, had the switch go down and back up, and spanning-tree would not converge. If you want redundancy in your switches, spread out the loops.

    3) Put QoS on the network. Identify mission-critical traffic and give it priority and guaranteed bandwidth (Cisco uses LLQ and CBWFQ using DiffServ, CoS, and IP precedence). That way even if someone puts loads of traffic on mission critical paths, the effect should be limited to the local switch port or router, depending how it is implemented.

    4) lastly try a redundant network. You would still want QoS to stop a jabbering NIC from hosing your local bandwidth, and you might want to run diagnostics with your pocket PC or laptop, so you would still need to plug into that isolated net anyway. I would recommend that last due to cost, space, and connectivity issues.

    Thank you.

    1. Re:QoS and network boundaries by caluml · · Score: 2

      Have a look at this device. Not quality of service, but guarantee of service. Very cool.

      The FlowFusion 2M and 5M are U4EA's first commercial hardware products to simultaneously manage all three of the factors affecting multi-service networks - throughput, loss and delay.

      While others have addressed bandwidth, U4EA has developed the GoS solution that allows network managers to manage packet delay and device buffers, as well as to isolate problematic streams to avoid random packet loss. Critical applications are guaranteed bandwidth - up to 2 Mbit/s (FlowFusion 2M) and up to 5 Mbit/s (FlowFusion 5M) at the WAN interface - and Quality of Service (QoS), even during extended periods of network overload.

      The FlowFusion units are typically installed between an office LAN and the WAN access equipment via two fast Ethernet ports, and can stand alone or be rack mounted.

      The network administrator is able to define treatment parameters for each application so that mission-critical applications get the exact resources when needed, while maintaining the WAN resource at near 100% utilisation. For the first time, mixed networks can achieve maximum efficiency through a single connection, accelerating the deployment of converged services like VoIP and online videoconferencing.


      http://www.u4eagroup.com/pdf/data%20sheet;%206844.pdf

  60. It's HIPAA by mrneutron · · Score: 3, Informative

    Health Insurance Portability and Accountability Act.

    Most health care organizations are far from clueless, believe me. Your average healthcare IT manager is well aware of HIPAA, as they've been working on the transaction and privacy aspects for quite a while.

    The techs in the trenches may know less, mostly because the data security regulations (the 3rd, and largest portion of the HIPAA work) are not yet finalized. The real work doesn't begin until then: probably sometime later this year.

  61. Enterprise applications need Enterprise CYA by Matey-O · · Score: 2

    They need a smaller test environment that ALL changes have to be checked off on before implementing. They need images of all router configs they can roll back to if necessary, and they need a diff comparison tool (mantrap or somesuch) to see what's changed between their known good configuration and what exists now.

    Oh yeah, and they need a signed piece of paper with the moron's signature saying the change wouldn't impact the network. (A paper trail, as archaic as that seems.)
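
    For the "what's changed since known good" part, a minimal sketch using Python's standard difflib. The config text is a made-up fragment and `config_drift` is a hypothetical helper, not part of any router tooling.

```python
import difflib

def config_drift(known_good: str, running: str) -> list[str]:
    """Return unified-diff lines between a saved config and the current one."""
    return list(difflib.unified_diff(
        known_good.splitlines(), running.splitlines(),
        fromfile="known-good", tofile="running", lineterm=""))

# Hypothetical config snippets, for illustration only.
known_good = "hostname closet-3\nspanning-tree vlan 1 priority 4096\n"
running = "hostname closet-3\nspanning-tree vlan 1 priority 32768\n"
drift = config_drift(known_good, running)
print("\n".join(drift))
```

    An empty result means the running config still matches the image you would roll back to.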

    --
    "Draco dormiens nunquam titillandus."
  62. Simple...apply the formula by liquidsin · · Score: 2

    do you think the answer to having a massive and unreliable network is to build a second identical network?"

    Take the number of patients in the hospital, A, multiply by the probable rate of death should the network fail, B, multiply by the average out-of-court settlement, C. A times B times C equals X. If X is less than the cost of a redundant network, we don't build one.
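
    Spelled out as code, with every number invented purely for illustration (the function name is hypothetical too):

```python
def should_build_redundant_network(patients, death_rate, settlement, network_cost):
    """Apply the A * B * C = X rule: build only if X is not less than the cost."""
    x = patients * death_rate * settlement
    return x >= network_cost, x

# All figures below are invented for illustration.
build, exposure = should_build_redundant_network(
    patients=500, death_rate=0.001, settlement=2_000_000, network_cost=3_000_000)
```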

    --
    do not read this line twice.
  63. Re:CCNP/CCIEs not what they are cracked up to be? by JohnnyBolla · · Score: 2, Interesting

    True. For the most part, having a Cisco cert means you studied hard on how to pass the cert; it really has little bearing on whether or not you can do the work. Not to say that a chimp can pass them, but I have met some people with CCNPs who couldn't troubleshoot a toaster problem.
    Yes, I have some Cisco certs.

    --
    Carpe Deez
  64. I work at a teaching hospital... by pacsman · · Score: 5, Insightful

    The network isn't too bad, but the incompetence of the people that run it astounds me. I've had large segments of it go out unnoticed by them because a UPS failed in a closet somewhere. Took them forever to track it down, too. In the end it's not the routers/switches that scare me, but the tons of old, outdated, unpatched Solaris machines that exist on this network. There are so many manufacturers out there that use crappy installations to run their MRI and CAT scanners that it terrifies me. It's really only a matter of time until all the doomsaying from me and my company (we're a third-party vendor that supports a medical image archive) comes true. Unfortunately, I think it will collapse on us because the IS people will be unable to handle it.

    1. Re:I work at a teaching hospital... by DNS-and-BIND · · Score: 2

      Hey, it's better than the X-ray machines controlled by Visual Basic apps. As you get ready to be irradiated, you watch the technician click through several dialog boxes of errors as she reassures you "it's OK, this always happens".

      --
      Shutting down free speech with violence isn't fighting fascism. It IS fascism!
  65. Maybe not so ridiculous by lucifuge31337 · · Score: 2, Insightful

    This sounds like a case of poor network infrastructure management. That being said, you can't pin it all on IT. Organizations like this have networks that grow out of necessity, and are often nearly impossible to make large changes to.

    Perhaps the seemingly ridiculous "secondary" parallel network can be put in place not for redundancy, but as a tool to migrate the existing devices to a properly configured and routed network. If STP brought the whole thing down to begin with, the network is probably flat. VLANs and subnetting at closets with appropriate L1 redundancy and L3 routing is most likely the modern network design their IT staff has known for years that they should have, but never had the convincing argument they needed to get management to foot the bill and allow the service disruptions required to make the switch.

    --
    Do not fold, spindle or mutilate.
  66. Re:That's why I hate automatic routing by Swannie · · Score: 3, Interesting
    Can you make a case why spanning tree is bad, beyond "It's old" or "I've been burned before"? I've never, personally, heard a good argument as to why spanning tree is bad.


    As for why it's good, it can provide layer 2 redundancy at a very small cost (basically the cost of an additional cable). While the same can be provided with a routed network, at layer 3, the cost is much higher, and a properly configured spanning-tree-based network will fail over very quickly and provide lots of trouble-free operation.


    Beyond that, spanning tree can often protect people from themselves. What happens when that intern plugs a cable in the wrong place and creates a bridging loop? You guessed it: no spanning tree, no protection from bridging loops, and you can kiss all or part (depending on the design) of your network goodbye. Oh, and good luck finding that cable, especially if it's a big place. Don't think that intern is going to admit his error and get fired...
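
    To make the loop-protection point concrete, here is a rough sketch that finds which links sit outside a tree rooted at one switch. It uses a plain breadth-first search as a stand-in; real STP chooses blocked ports by path cost and bridge/port IDs, and this toy assumes no parallel cables between the same two switches. Switch names are hypothetical.

```python
from collections import deque

def redundant_links(links, root):
    """Return the links left outside a BFS tree rooted at `root`.

    These are the links that would form loops if every port forwarded.
    """
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    seen, tree = {root}, set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                tree.add(frozenset((node, nxt)))
                queue.append(nxt)
    return [link for link in links if frozenset(link) not in tree]

# Hypothetical closet switches; the last cable is the intern's mistake.
links = [("core", "closet-a"), ("core", "closet-b"), ("closet-a", "closet-b")]
blocked = redundant_links(links, root="core")
```

    With spanning tree running, the stray closet-to-closet cable ends up as a blocked standby link instead of a broadcast loop.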


    Swannie

    --
    :q!
  67. Yes, if I'm selling the network ;) by dnoyeb · · Score: 2, Funny

    Of course the answer is to build a completely separate network if I am the one who you will pay to build it ;)

    This is obvious.

    In truth the network problem was not a physical one, so the solution should not be a physical one.

    1. Re:Yes, if I'm selling the network ;) by kaoshin · · Score: 2, Funny

      You can never be TOO safe when lives are at stake. I think at least 4 networks would really be needed.

  68. "Parallel Network" by Megane · · Score: 2

    The story I heard was that they had already approved the new network and it was still a few months away from being implemented when the old chewing-gum-and-baling-wire network prematurely fell apart.

    --
    #naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
  69. 2 pieces of a solution by WPIDalamar · · Score: 2

    Way I see it, there are 2 things that need to get done.
    1) Policy change. Only production machines on a production network.
    2) Topology change. Make it easy to get a non-production network connection so people don't violate #1

  70. Re:Sure it was STP? by jefftp · · Score: 4, Informative

    The most common reason spanning tree problems occur is that no one tells the spanning tree domain who the root of the network is. This leads to the switches deciding among themselves who gets to be the root. In most implementations of spanning tree, with priorities left at their defaults, the lowest MAC address wins.

    Because Cisco switches come with spanning tree enabled by default, and because most network "engineers" don't know what spanning tree is, many corporate networks have a random switch serving as the root of the spanning tree. And so when spanning tree tries to do its job (fail over to a redundant link), it doesn't do a very good job, because the humans who set up the network were either lazy or ignorant.
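
    A small sketch of the election, assuming the usual 802.1D rule that the lowest bridge ID (priority first, then MAC address) wins. The switch names and MAC addresses are hypothetical.

```python
def elect_root(bridges):
    """Pick the root bridge: the lowest (priority, MAC) pair wins."""
    # MACs are compared as strings, which works here because they all
    # use the same lowercase, zero-padded format.
    return min(bridges, key=lambda b: (b["priority"], b["mac"]))

# Hypothetical switches. 32768 is the 802.1D default priority.
bridges = [
    {"name": "core-1", "priority": 32768, "mac": "00:d0:58:aa:00:01"},
    {"name": "old-closet", "priority": 32768, "mac": "00:00:0c:11:22:33"},
]
accidental_root = elect_root(bridges)["name"]   # lowest MAC wins by default

bridges[0]["priority"] = 4096                    # admin pins the core as root
intended_root = elect_root(bridges)["name"]
```

    With everything at defaults the oldest closet switch can end up as root simply because of its MAC address; lowering the priority on the core fixes that.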

    Laziness and ignorance are the villains of most network problems.

    Now if Cisco implemented the follow-up to spanning tree, rapid spanning tree protocol (802.1w), like the rest of the industry, you'd eliminate the problem of impatient network admins trying to "tune" their network convergence times. Sadly, at most, you're going to shave 8 seconds off the 30 to 50 seconds of convergence time of STP unless you have a very small network. So tuning STP timers is an exercise in navel-meditation. RSTP (802.1w) solves a lot of the convergence time problems with original STP (802.1d) and is nicely backwards compatible.
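
    For reference, the 30 to 50 second figure falls out of the 802.1D default timers (max age 20 s, forward delay 15 s). A quick sketch of the arithmetic, with a hypothetical helper name:

```python
# 802.1D default timers, in seconds.
MAX_AGE = 20
FORWARD_DELAY = 15

def convergence_seconds(direct_failure: bool) -> int:
    """Time for a blocked port to start forwarding after a failure."""
    listening_and_learning = 2 * FORWARD_DELAY
    if direct_failure:
        return listening_and_learning           # link loss seen locally
    return MAX_AGE + listening_and_learning     # must first age out stale BPDUs
```

    `convergence_seconds(True)` gives the 30-second case and `convergence_seconds(False)` the 50-second case.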

  71. Redundant Networks for Patient Care by jcm · · Score: 2, Informative

    I spent three years (1995-1998) at Perot Systems as a consultant designing and implementing hospital networks for Tenet Healthcare (2nd largest hospital chain in the US). There was at least one hospital that had the budget and the foresight to see that reliance on the network would do nothing but increase.

    For that hospital, my network design was one that incorporated as much redundancy as possible at the time. For each patient care area, such as nurses' stations, and ancillary areas such as radiology, cardiology, surgical theaters, etc., it was decided that each of the two network jacks would terminate in separate closets. This meant doubling the number of closets required in order to meet distance limitations, but the hospital had already started working on allocating that space for the closets. For any important ancillary areas such as the lab and central supply, there were also two separate networks. For the server farms themselves, the Patient Care systems all had redundant connections to the primary and backup networks as well.

    As each wall jack terminated into a different closet, each closet had two separate networks as well. Each closet would house the primary network for half of the jacks served, and the backup network for the other half of the jacks served. The fiber paths from each closet took disparate paths back to separate data center rooms, one external to the main building of the campus and one inside the main building. At the time, layer 3 switches, or switch routers such as the Foundry Big Irons or Cisco 6500s, were not available. So as much as I dislike using Spanning Tree, I used it at the time. All priorities were manually set, though, so there was no doubt where the root was and where it would move to in case of failure.

    So, the switches terminated on another switch which was partitioned into several segments. Switch connections were made between the two data centers as well. Each segment had a connection to a Cisco 7507 Fast Ethernet port local to that computer room, and another in the second computer room. Forming the core were two sets of two Cisco 7507s. In order to prevent one OSPF network from affecting the other OSPF network, static routes were used (I would use BGP if I had to do it over again). Outside WAN connections were terminated redundantly on the two patient care networks as well.

    While the primary network in the hospital also supported the non-patient care areas (such as administration), the backup network was only for the patient care areas. That was just to prevent the type of thing that happened in the article, where something non-patient care related ends up taking everything down.

    Reverting to backup paper systems would be nearly impossible once the "tube" systems were sealed up. Much like the movie Brazil, hospitals used to have pneumatic tubes running all over the place, especially between the lab and the nurse stations. Running samples and results back and forth would definitely introduce a LOT of delay for a doctor trying to make a life and death decision.

    I am sure that I would design things differently these days (for one, Layer 3 would go all the way to every single edge switch and collapse on a fast switch router), but I think the design probably held together well. I should check back in someday and see how long and how well it lasted, if they did replace it.

    Jay

  72. Contribution to causality responsibility by hey! · · Score: 5, Insightful

    Suppose you have a footbridge crossing a stream that takes heavy traffic. One day, it collapses with many people on it. One of the people on the bridge weighed 300 lb.

    Would it be fair to say that the bridge collapsed because a 300 lb man was on it? It is completely clear that he contributed to the collapse of the bridge, in the sense that he contributed to the stresses on the structure. One might even say he is more responsible than a 100 lb woman who was also on the structure at the time.

    But, we'd generally expect that a footbridge be engineered to support a 300lb man. Or if not, to isolate the failure (e.g. the planks under him might fall out, but the bridge as a whole should not collapse). It's part of the designer's job to anticipate this.

    I've done a lot of troubleshooting in my time, of networks and other systems. One thing I've learned is that in the case of failure you just can't fasten on one thing that is out of the ordinary. At any given time, in a big enough system, something's bound to be out of the ordinary. Even if you can trace, step by step, the propagation of a problem from a single anomalous event, it is the capacity of the system to propagate the problem that is the real issue, at least if you take a conservative, defensive stance in design.

    --
    Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
  73. Fraternal Twins by SEWilco · · Score: 5, Interesting
    I hope the "second redundant network" uses equipment by a different manufacturer and has at least one network technician whose primary duty is that network. That person's secondary duty should be to monitor the primary network and look for problems there. Someone in the primary network staff should have a secondary duty to monitor and check the backup network.

    The ideal would be to actually use both networks, such as by using each on alternating weeks. This ensures that both networks can handle full normal operations and are both operational.

  74. Re:Reliability is inverse to the number of compone by gorf · · Score: 4, Informative

    No.

    You can only multiply them together like you have done if the two variables are independent.

    Here this is clearly not the case; if the networks are identical and one fails, it is more likely that the second will fail because the cause might be identical.
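
    A rough sketch of how much the independence assumption matters. The model is a simple one of my choosing (a shared cause that takes out both networks, plus independent per-network failures), and all probabilities are invented for illustration.

```python
def p_both_fail_independent(p):
    """Naive estimate: just multiply the two failure probabilities."""
    return p * p

def p_both_fail_common_cause(p_common, p_own):
    """Both fail if a shared cause hits, or if each fails on its own."""
    return p_common + (1 - p_common) * p_own * p_own

# Invented probabilities; each network alone fails about 1% of the time
# in both models.
naive = p_both_fail_independent(0.01)
linked = p_both_fail_common_cause(p_common=0.005, p_own=0.005)
```

    Here the naive product is 1 in 10,000, while the linked estimate is about 1 in 200, roughly fifty times worse, even though each network looks equally reliable on its own.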

  75. Unreliable? Eh? by Psiren · · Score: 2

    Seeing as these paper forms hadn't been used for 6 years, I'd have to assume that the network was very reliable. Problems do occur from time to time, but it doesn't mean that the whole thing should be replaced. Just fix the issue and move on.

  76. Why not power-cycle whole complex? by dpbsmith · · Score: 2

    The Globe was indeed short on technical details. What puzzles me is that they say the network was down for four days.

    NOT a rhetorical question:

    Why didn't they power-cycle the whole complex? Maybe even literally? Presumably a hospital should be able to handle a short interruption in AC power... and presumably the network equipment wouldn't preserve the "I'm-broken" state in nonvolatile memory. Why wouldn't a scheduled power outage for 10 minutes at 2 a.m. have been less disruptive than the network being down for four days?

    Less drastically, couldn't they have called every operator and system administrator in and said "Synchronize your watches... at 2 a.m. power off every piece of computer gear within a hundred feet of your chair, then at 2:10 a.m. power them on again"?

  77. When You Tune to Channel 9 at 8 O'Clock... by RobotRunAmok · · Score: 2

    ...the TV show you intend to watch is there. It may begin a few seconds late, on purpose or as a result of some discrepancy, but the TV show you want to watch is there.

    For the past few years, networks on the national and local levels have all been switching over to server-based content play-out. TV from Computers! How Exciting! How Wonderful! How... frickin' scary, for those whose jobs it has been to ensure that Buffy plays down at 8, and not 8:02, or 8:15, or - Powers-That-Be Forbid! - Wednesday morning.

    Professional TV Master Control operations traditionally operate (often contractually) to "five 9's" of reliability, 24x7, assessed monthly. Full Stop, Period, End-of-Story. TV Master Control geeks, their supervisors, and the maintenance engineers who support them have ever been a priesthood apart when it comes to worship at the Uptime Altar.
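
    For a sense of scale, a quick calculation of what "five 9's, assessed monthly" allows, assuming a 30-day month (the function name is just for illustration):

```python
def allowed_downtime_seconds(availability, days=30):
    """Downtime budget for one period at the given availability."""
    return (1 - availability) * days * 24 * 60 * 60

five_nines_month = allowed_downtime_seconds(0.99999)   # about 26 seconds
```

    That is roughly 26 seconds of dead air per month, which is why nobody in that world argues about buying two of everything.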

    So what has their industry done, to ensure that all this "new wave" server and automation technology provides them with the same reliability as manual control and tape-based playback? Why, buy two of everything, of course! EV-ER-Y THING!

    The server industry is only getting around to understanding that now, and is beginning to price their wares accordingly. I've attended dozens of vendor meetings over the past ten years where the salesguys, who six months earlier were selling mailservers to sysAdmins, are now selling their new video servers to Master Control guys. (Chum dished into a shark tank is the only comparable visual I can come up with.) What makes the sale is never the reliability of server over tape or (especially) the quality of server over tape, but the desire of management to run more channels with fewer bodies. In the past this has led to management re-assessment of just how "inexpensive" server-based playout technology was and, in many cases I have seen, an increase in the number of channels created or planned as a means to justify the hardware costs.

    The only debate point in most TV Master Controls comes down to what components are in-chassis redundant, which are external-chassis "hot" spares, and which are shelf spares.

    My point (and I do have one...) is that it is unconscionable that a hospital, where lives are at stake, lacks the war-room mentality that an entertainment operation has. It's real simple at the end of the day to assess which components in a network chain (info or video or both) are critical, and buy two of them and keep it all lit and tested. Lives are at stake, and your signature is on the shift report? You rent a tertiary back-up system to bring online while you do your regular and frequent preventive maintenance on your primary and secondary.

    The guys who take care of Buffy do it. I would have thought that the guys who take care of sick babies and grandmothers would be playing in the same league.

  78. Re:friggin windoze users by b1t+r0t · · Score: 5, Funny
    I'll let my doctor worry about curing what's wrong with my brain rather than dealing with high-order complex networking issues, thank you very much.

    "Dammit, Jim, I'm a doctor, not a CCIE!"

    --
    "Open source is good." - Steve Jobs
    "Open source is evil." - Microsoft
  79. It's been coming for a long time by bolix · · Score: 5, Informative

    I've consulted here. No not on the network design! Desktop staff - big hello to the much expanded Research Support team!

    AFAIK the BI network has gradually evolved from the 60s/70s and has included several massive growth spurts to incorporate the expansions, refits, windfalls etc. I once participated in an after-hours Cisco cutover where we yanked connections and waited for the data to flow (IPX round-robin servers listing) to find the specific segments affected. Very much a live trial-and-error process.

    I got the feeling no-one is completely certain where/how all the data flows especially in the older Research segments e.g. Dana Farber. In fact, I'm guessing this is where the failure originated. Heavy duty number crunching and spanning tree errors lead me to some sort of distributed unix process across network segments. I want to blame a certain notorious geek (Dr P's) unix and mac labs but in truth it could be any one of the overworked and underfunded labrats in any of the segments.

    The wiring closets used to look way worse than any posted at the recent Register article. A single Cat 5 cable run to a data jack is sometimes split to host 2 connections: unfortunately as the Research areas are grant funded, this is still bloody cheaper than a hub/switch! There is probably still some localtalk cabling in some labs, coax runs to a DG and Novell serial connections with 1 or 2 Mac Classic and SE holdouts running Dos and DG terminal emulators!!!

    The network team in the Hospital (2 afaik) coped with daily routing failures, buggy failovers, the crappy Novell IPX 802.3 implementation and servers around every corner. Those folks team with a great desktop staff to nursemaid outdated equipment into the 21st century. It stuns me to this day what a superior job these folks did and probably do. They certainly made my job easier.

    I feel this could have happened any time, and disaster has been averted one too many times before. Halamka and the exec staff owe these guys more than just a few column inches of chagrined praise.

  80. Problem was with bad Business Practices... by Alyeska · · Score: 2, Insightful
    Yes, the network failed. Good businesses -- including hospitals -- will allow for system failures through contingency planning.

    I develop business practices for large industries (including in the past the Trans-Alaska pipeline, et al.). These industries rely heavily on computers, and each has developed plans and trained their critical personnel for emergencies like power failures, computer failures, etc. Reliance on a single tool to protect safety & environment is bad, m'kay?

  81. Re:Contribution to causality responsibility by timeOday · · Score: 5, Informative
    I agree, and let me refer you to a real life example. The USS Yorktown is that very famous Navy ship that was immobilized by a network outage. The whole thing was triggered by some seaman entering a 0 where he shouldn't have, so the Navy made some attempt to pin it on him. But it didn't fly. Operational errors like that are routine. It shouldn't have crashed the app. Having crashed the app, it shouldn't have taken down the whole network.

    If one researcher sitting at his desk can take down the whole hospital system accidentally just by "overusing" the network, it's just a matter of time.

  82. Mission Critical Networks 101 by rhoads · · Score: 5, Interesting

    One of the fundamental concepts in building mission critical networks is what is referred to as "A/B Diversity" -- also sometimes called "salt and peppering". The idea is that you build two or more physically and logically separate network infrastructures and distribute the user population evenly across them. Thus, when a catastrophic failure occurs in one of the network "domains", the other will continue to function and business can continue in "degraded" mode.
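
    A minimal sketch of the "salt and pepper" assignment, assuming adjacent ports simply alternate between domains. Port names and the helper are hypothetical.

```python
def assign_domains(ports, domains):
    """Alternate adjacent ports across domains so neighbours differ."""
    return {port: domains[i % len(domains)] for i, port in enumerate(ports)}

# Hypothetical desk ports on one floor, split across two domains.
ports = [f"desk-{n}" for n in range(8)]
plan = assign_domains(ports, ["A", "B"])
survivors = [p for p, d in plan.items() if d != "A"]   # domain A goes dark
```

    When domain A fails, every other desk still has a working connection, so the floor keeps running in degraded mode rather than going dark.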

    We have a long way to go before data networks reach the stability of, for example, the public telephone system. The modern reality is that these networks are susceptible to a host of trivial yet potentially catastrophic failure scenarios. Spanning Tree (STP) is a very unreliable protocol. It has the potential to fail under many conditions such as the presence of physical errors, very high load, or as a consequence of a bug in the OS of one or many network devices.

    Broadcast storms will occur. ARP storms will occur. OS bugs will crop up. Facilities personnel will play jump rope with your cable plant.

    These problems can be mitigated, but not eliminated, by good network design. Thus, in environments such as hospitals and banks, where the cost of network downtime is too great to bear, it is common practice to build one or several parallel infrastructures as I have described.

    FUNNY NETWORK TRICKS

    I used to be in charge of the NOC at a large investment bank in New York. One of our buildings had six floors each housing 1,000 equities traders -- and this was during the stock market boom. Network downtime was not tolerated during trading hours. Therefore, the building was divided into four separate network domains connected to each other, server farms, and the WAN/MAN environment via a layer-3 core.

    -- One time a printer became wedged and proceeded to send out ARP requests at the rate of thousands per second. The flood of messages pegged the CPUs of the routers servicing that domain and brought network services to a halt. Time To Resolution: 20 minutes (proud to say) to deploy sniffer, identify offending host, and rip its cable out of the wall with extreme prejudice. % of building affected: 25.

    -- Over the course of several months, the Novell/NT team progressively decommissioned Novell servers and replaced them with W2K servers. Unfortunately, nobody thought to turn off the Netware services in the roughly 1,000 printers deployed throughout the building. On one glorious day, the very last Netware server was decommissioned in a particular domain leaving the printers in that domain with no server to "attach" to. The resultant flood of SAP messages became so great that the Cisco routers could not service them in a timely manner and they became cached in memory. The routers would gradually run out of memory, spontaneously reboot, and repeat the cycle. Time To Resolution: ONE FULL DAY. % of building affected: 25. Number of hours spent in postmortem meetings: ~15.

    -- On several occasions, Spanning Tree failed resulting in loss of network services for the affected domain. Time To Resolution: 15 minutes to identify problem and perform coordinated power cycle of Distribution switches. % of building affected: 25.

    And the list of stories goes on. You get the point.
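
    The "deploy sniffer, identify offending host" step from the wedged-printer story boils down to counting frames per source. A sketch, assuming the capture has already been parsed into (source, frame type) pairs; the host names, frame labels, and threshold are made up.

```python
from collections import Counter

def top_arp_talkers(frames, threshold):
    """Return sources that sent more than `threshold` ARP requests.

    `frames` is an iterable of (source_mac, frame_type) tuples, a
    stand-in for whatever a sniffer capture would be parsed into.
    """
    counts = Counter(src for src, kind in frames if kind == "arp-request")
    return [(mac, n) for mac, n in counts.most_common() if n > threshold]

# Hypothetical one-second capture: one wedged printer, two normal hosts.
capture = ([("printer-17", "arp-request")] * 3000
           + [("pc-1", "arp-request")] * 2
           + [("pc-2", "ip")] * 500)
offenders = top_arp_talkers(capture, threshold=100)
```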

    1. Re:Mission Critical Networks 101 by GiMP · · Score: 2

      > Why in the hell would a router "run out of
      > memory". Damn, I mean didn't they test those
      > conditions before selling the router? Again, it
      > should manage its memory efficiently and throttle
      > when needed.

      Yes, routers can run out of memory, just like any other device. Your router should have enough memory to perform well for its situation. However, it is unavoidable that under an attack (intentional or unintentional) your router can run out of memory...

  83. no, identical networks crash in identical ways by Anonymous Coward · · Score: 2, Insightful

    Interesting how even an army of Cisco engineers couldn't fix the problem. Perhaps a testament to how overly (and needlessly) complex Cisco's equipment is... and/or how bad their certification/training is.

    As for "identical separate network", at my old company, we had a pair of Cisco PIX units that were configured in stateful failover; this means they share enough information that if one keels over, not a single connection is dropped.

    Unfortunately, the PIX OS release had a bug that would cause a crash every so often, and guess what?

    One would crash, then the second would crash immediately.

    As mentioned, the issue here was completely improper network structure, with research and production networks one and the same. Does this mean someone can walk in with a laptop and start spewing data and/or false routing info and crash the entire hospital? The responsible parties should be FIRED, given today's labor market; absolutely inexcusable.

    I'd also guess improper change control procedures were involved here as well.

    Whoever handles the hospital's emergency preparedness should also be fired for not keeping staff familiar with alternative methods (gasp, PAPER!). What if they had a power failure? Happens all the time, and not always because of external causes... "keeping the power on" is not as simple as "install a big backup power plant for the place." As Exodus discovered once at their CA datacenter, backup generators don't always work.

    1. Re:no, identical networks crash in identical ways by Large+Green+Mallard · · Score: 2

      CCNA = Can Crap - Not Assist :)

  84. Counterexamples by hey! · · Score: 3, Interesting

    As pointed out elsewhere, the key assumption is independence -- that breakdowns are like rolling dice. You have to consider the causes of the failure. Virtually every realistic scenario you can think of has a dependent aspect which links the possible failure of trains.

    Here are some examples of the ways in which failures can occur that have implied linkages:

    (1) Both trains are damaged by an earthquake.

    (2) New instructions for routine maintenance were printed incorrectly (e.g. causing the mechanics to under torque a critical bolt).

    (3) The firm has cut the maintenance budget and is neglecting routine maintenance.

    (4) The train is sabotaged by disgruntled employees or terrorists.

    (5) Fuel filters delivered by manufacturer are faulty or incorrectly manufactured.

    (6) Design flaw means trains do not meet expected performance specifications.

    In reality, failures tend to be linked rather than independent. You can't use simple multiplicative logic; you have to use Bayesian logic. P(B|A) != P(B): the probability of B given A is different from the probability of B in the absence of any other information. The FAA and military know this. If an aircraft crashes, then all aircraft of the same type are typically grounded for a period while the problem is analyzed to eliminate some kind of systematic flaw.
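
    A worked example of the conditional-probability point, using exact fractions. The scenario (two units of the same type that may share one design flaw) and every number in it are assumptions for illustration.

```python
from fractions import Fraction

def p_b_and_p_b_given_a(p_flaw, p_fail_if_flawed, p_fail_if_sound):
    """Return (P(B), P(B|A)) for two units A and B sharing one possible flaw."""
    p_sound = 1 - p_flaw
    # Chance that any single unit fails.
    p_single = p_flaw * p_fail_if_flawed + p_sound * p_fail_if_sound
    # Chance that both fail: given the flaw state, failures are independent.
    p_both = p_flaw * p_fail_if_flawed**2 + p_sound * p_fail_if_sound**2
    return p_single, p_both / p_single

# Invented numbers: a 1-in-100 chance that the design is flawed.
p_b, p_b_given_a = p_b_and_p_b_given_a(
    Fraction(1, 100), Fraction(1, 2), Fraction(1, 1000))
```

    With these numbers a unit fails about 0.6% of the time in general, but about 42% of the time once its twin has already failed, which is the logic behind grounding the whole fleet.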

    --
    Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
  85. Downtime Procedures by Kraegar · · Score: 5, Insightful
    Posting this kind of late, but it needs to be said.

    I work at a hospital, on the networking side of things. It's a fairly large hospital, and we've got some pretty amazing tech here that runs this place. But BY LAW we have downtime procedures. ALL STAFF MUST KNOW THEM. We have practice sessions monthly in which staff uses downtime procedures (pen and paper) to ensure that if our network were to be completely lost, we could still help patients. It's the friggin law. Whoever fucked up and hadn't looked at downtime procedures in 6 years should be fired. That's just bullshit.

    I don't know how that hospital was able to pass inspections.

  86. Re: Thick Coax links by Ashurbanipal · · Score: 2, Informative

    Etherhose (10b5 thick coax) is a useable networking technology. It has very good resistance to RFI/EMF. Lots of hospitals still run it, on links where 10 Mb/sec is sufficient.

    Etherhose is no longer a good investment because it is labor-intensive to work with (vampire taps, and thick, heavy cabling) and because nobody is developing the technology any more.

    Today, fiber optics might seem a better choice for noise isolation, since the cost has come down to a reasonable level.

    However, glass has the same potential for future obsolescence as etherhose - I have a half-dozen mutually incompatible fiber links here. And termination, splicing, and interconnection of fiber is at least as difficult as working with etherhose... having done both, I'd say drilling for a vampire tap is easier.

    In short, don't replace a working piece of infrastructure needlessly (wait until you project a need for additional bandwidth) and for noise isolation cat 5e in a grounded metal conduit is probably your best bet. Large diameter, professional quality conduit runs through electrically noisy areas are costly but also a very safe investment.

    I wouldn't knock that old etherhose - it does its job quite well, far better than the 10b2 thin coax that replaced it ever did. And it's far more physically sturdy than anything else outside of conduit.

  87. Oh come on by Flamesplash · · Score: 2

    I was hoping for at least a funny. :)

    --
    "Not knowing when the dawn will come, I open every door." - Emily Dickinson
  88. Data storms... by SwedishChef · · Score: 2

    This outage was caused by a researcher's data creating a storm of data which outpaced the network's ability to cope. The problem was allowing the research data to flow unimpeded across vital systems. The solution is to implement methods of controlling bandwidth, not just routing.

    In order to prevent this from happening again, engineers should analyze the system to determine where to put data storage. In this case, almost certainly (although the article is unclear) data was stored in a central location but spanned across several servers and then backed up in another location. One part of the solution is to have distributed data storage spread across the institution and then that data backed up (across a separate network) to a central location.

    The data storm itself could be prevented by using QoS bandwidth management. Of course, every network user believes that he/she should have unfettered access to all the bandwidth available, but quietly implementing some well-known techniques for limiting bandwidth usage would have at least mitigated the damage.
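
    As a rough illustration of the kind of bandwidth limiting meant here, below is a minimal token-bucket sketch. It is not what any particular switch or router implements; the class name, the rate, and the burst size are all assumptions chosen for the example.

```python
class TokenBucket:
    """Admit traffic at a sustained `rate` (bytes/s) with bursts up to `capacity` bytes."""

    def __init__(self, rate, capacity):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start with a full bucket
        self.last = 0.0                # time of the previous call, in seconds

    def allow(self, size, now):
        """Return True if a frame of `size` bytes may be sent at time `now`."""
        # Refill for the time elapsed since the previous call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False  # over the limit: drop or queue the frame


# Hypothetical numbers: 1000 bytes/s sustained, one 1500-byte frame of burst.
bucket = TokenBucket(rate=1000, capacity=1500)
results = [bucket.allow(1500, t) for t in (0.0, 0.5, 1.5)]
```

    The first frame uses up the burst allowance, the second arrives too soon and is refused, and the third is admitted once the bucket has refilled. A heavy sender gets slowed down instead of drowning everyone else.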

    Finally, routing protocols other than spanning-tree or OSPF should be used. Creative implementation of internal addressing schemes (10.0.0.0 IP addresses) and a combination of BGP and last-resort static routes would certainly help to avoid these sorts of problems. I'm also wondering whether a *nix box running Zebra in critical locations might not reduce the problems. Certainly Zebra can remove the routing load from the Ciscos and, with plenty of RAM and processing speed available on PCs nowadays, could probably improve routing efficiency when a circuit goes down.

    But the key to this problem is bandwidth management not routing management. Of course, the next problem could be routing. One seldom has the budget to solve everything.

    --
    No one ever had to evacuate a city because the solar panels broke!
  89. Reduced to? by tomdarch · · Score: 2
    Senior executives were reduced to errand runners.

    What do you mean 'reduced to'? What else are they good for?

  90. I can top that! by Ashurbanipal · · Score: 5, Funny

    There was an electrician named Joe at the place I used to work who was counting the days to retirement. He never did a lick of work he didn't absolutely have to, and he never cared if his work would last 24 hours after his retirement.

    The NEC (National Electrical Code) was the first casualty of his attitude. But not the last!

    I discovered that he carried a heavy-duty plug in his pocket with the two hot leads wired directly together. He called it his "pigtail".

    When Joe needed to find what circuit breaker controlled an outlet, he jammed in the pigtail (with an audible *snap* of electric arc) and then calmly walked down to the nearest breaker box to see what had tripped.

    You could tell he was working in a building because you'd see scientists running down the hallways tearing their hair and screaming "My research!!! My research!! Ten years of research ruined!!" as the voltage spikes destroyed their equipment...

  91. Offtopic by InadequateCamel · · Score: 2, Informative

    I read in a book about the number zero that I mentioned here before that the real cause was someone accidentally left a zero in a line of code, rather than a person pressing zero and crashing the entire network. Perhaps someone tried to execute a command that led to this faulty code being used by the ship's computers?
    Maybe this was proven to be false later, I dunno.
    Kind of funny though...

  92. To make an analogy to another redundant system. by Inoshiro · · Score: 2

    Yes, there is always the possibility you might be born blind, but most people don't have that genetic defect. They have two eyes which work very well, even if one of them happens to be broken by a random toothpick accident.

    Redundancy is always good in a system where uptime is king. That is why so much of nature has organisms based around semi-redundant designs.

    --
    --
    Internet Explorer (n): Another bug -- that is, a feature that can't be turned off -- in Windows.
  93. Re:CLARIFICATION by ChimChim · · Score: 2, Interesting

    Yes, i'm not the wizard of words (or apparently math ;) this morning am i?

    My main reason for posting was to appease my instinctual reaction to the (somewhat intuitive) mistake sometimes made that having twice the stuff makes it twice as good/reliable, etc. Which holds true for availability (10-fold in fact), but you'll get less in the case of reliability, and manageability is also a concern since you'll have to constantly check the backup network (if it's not in active use, failures are harder to find or predict for that matter). Also, failures aren't always randomly dispersed throughout the network, as the model might imply. You have to figure out how much failure each part of the network can sustain.

    So, throwing more hardware, developers, or whatever at the problem isn't a real solution. Figuring out what was wrong in the first place will let them spend their money more wisely, rather than letting all that hardware go to waste, doing nothing. They could possibly get all the redundancy they want with less than twice the hardware and maybe even increase performance of the network during regular usage.

    ok, i've totally over spent my $0.02.

  94. A Case History by Baldrson · · Score: 3, Interesting
    A major corporation wanted to go paperless. They had all sorts of IDEF graphs and stuff like that to go with. I was frightened for them and suggested that maybe a better route was to start by just going along the paper trails and, instead of transporting paper, transport physical digital media -- sneaker-net -- to workstations where digital images of the mail could be browsed. Then after they got that down they could put into place an ISDN network to the phone company which would allow them to go from sneaker-net to a network maintained by TPC. If TPC's ISDN support fell apart they could fall back to sneaker-net with physical digital media. Only after they had such a fail-safe "network" in place -- and deliberately fell back on it periodically and randomly to make it robust -- would the IDEF graphs start being generated from the actual flow of images/documents. By then of course there would be a general attitude toward networks and computers that is quite different from that of the culture that typically surrounds going paperless.

    Unfortunately more 'radical' minds prevailed and the project was eventually abandoned after $100M.

  95. In my opinion... by freebase · · Score: 2, Interesting

    First, I don't have all the details of what happened, nor do I have any idea of what the network looked like prior to the outage. However, I have a general design philosophy based on my experience with teaching hospitals and telco networks.

    The concept is that of "a network of networks", much like Cisco's DCN solution for telco operators. This is a series of interconnected networks that are capable of standing alone in an emergency. These networks are normally oriented around particular application/traffic/usage patterns. An example would be a research network for research workstations, a lab network, a cardiac care network, and so on.

    All of these networks could exist as separate layer 2 vlans trunked back to the facility data center, if bandwidth is available. Within the data center, layer 3 routing could handle traffic that needed to cross between these networks. The data center would also have separate networks for each application group so that applications aren't able to interfere with each other, generally.

    Obviously this is an overly broad synopsis and leaves out many details; it is also just as obvious that I'm talking about a campus environment here and not a WAN, where the same theory will work, but with different implementation.

    --
    Sig??? I don't need no stinkin Sig!
  96. Re:the sad part by ceejayoz · · Score: 2

    If the hospital had been paper-based, this tragedy would not have occurred.

    Tragedy? It sounds like they handled it quite well, and nobody died because of it.

    The advantage of a paperless hospital is that you don't have to wait an hour for the lab results or X-rays to get to you (or longer, if they get lost). That saves time, letting the hospital save more patients.

  97. Why not fix spanning tree? by m1a1 · · Score: 3, Insightful

    If the problem is with spanning tree protocol then they already have redundant connections in place (or they wouldn't need spanning tree). From my experience spanning tree works really well on its own, and is even a little robust to people fucking with it. So the question is, why not deny everyone access to the switches and routers except for one or two administrators. It sounds to me like if they kept people from screwing with the network it would be fine.

  98. Re:Reliability is inverse to the number of compone by ceejayoz · · Score: 2

    Someone failed their vision test...

    See that percent sign? The little "%" thingy?

  99. Go Wireless, Use copper for Backup by randomErr · · Score: 2

    Go Wireless, Use copper for Backup

    I'm not talking 802.11, but military grade Spread Spectrum. It would cost a lot less than laying new copper. And if some a$@hole inadvertently starts a DOS attack you could just flip off the main antenna array at your NOC for 10 minutes and let the network reset itself. Also throttle your nodes to say 10 mbit. That way one node can't take down your entire network.

    If a storm or other activity takes out the antenna array you still have the old copper. Keep a switch (physical switch, not hub-like switch) so that you could walk over to a panel and switch your node over to copper in a jiffy. If they both fail then go carrier pigeon, CB's, or cellphones. Nothing like a good old analog message in a pinch.

    --
    You say things that offend me and I can deal with it. Can you?
  100. Multiple Problems and Multiple Solutions by SuicidalSquirrel · · Score: 2, Insightful

    First of all, this was apparently a flat layer-2 network. From the information I have seen, it was a very large network. Spanning tree is a wonderful protocol and layer-2 networks are not bad things, BUT spanning tree is very complex in a large network, and latency is going to be an issue if there are no routed boundaries to control traffic. I have experience in designing networks for hospitals (and financial institutions and universities and gov't institutions), so I am aware that implementing layer-3 to the edge is not necessarily feasible for many reasons - financial, legacy setups, etc. That being said, however, there should be some layer-3 at some point to segregate traffic and protect the critical pieces of the network. Identify the critical points of the networks and put redundancy there - i.e. the server farm, critical care monitoring systems, WAN connection. All network equipment vendors have some type of redundancy feature that would take care of automatic failover for these devices.

    Full redundancy is impossible - are you really going to have dual NICs in every workstation and expect that everything would just work in the event of a failover? First of all, the expense would be incredible, and the maintenance would be a nightmare. If they are like most institutions, they are already understaffed and overworked - they wouldn't be able to keep something like that together. Dual-home closet switches to redundant routers/switches that are in turn dual-homed to a redundant core. Servers should have multiple NICs that are attached to multiple switches specifically to provide redundancy.

    The worst problem here, though, was not the network itself. This is probably the most prevalent common problem to all institutions - they had no test environment. As multiple other posters have pointed out, this experimental database should never have been attached to a production network, regardless of the expected impact it might have. The key word about it is EXPERIMENTAL - you don't know how it might impact anything. As long as there is no separate environment for testing, there's really no such thing as redundancy no matter how the network is configured.

    Say, for example, that the application took down the primary network, so the secondary comes up and takes over. Did anyone realize what caused the failover? Probably not, since a properly configured network will failover in a matter of seconds. So, the application is still running. How long until the secondary network fails as well? Then all of the expense and reconfiguration that went into building the redundant network were for nothing.

    If this hospital is like most, they have an extremely diverse hodgepodge of equipment - some incredibly old stuff that they keep around because it works and some really cool cutting edge gadgets that everyone can see the benefit of. They've also expanded the network as needed and tried not to take anything down when they did it, so what they've ended up with is a logical rat's nest. VLANs probably have been created, but they're probably trunked everywhere, because the goal of the expansion was to connect more devices, not to segregate by function. Hospitals don't get down time, so it's not a simple thing to say that things have to be reconfigured. Odds are that the workstations may not all even be on DHCP, so changing an IP may require a person (back to that understaffed thing again) touching possibly hundreds of workstations. Yes that needs to be done, and I don't know a single network admin who wouldn't agree, but when you have to have outages cleared by upper management who are going to be chewed by the board if the time frame goes longer than you expected, it turns into a lot more than just what is actually best for the network.

    The solution: use down time wisely. Stage implementations and keep them within the allotted time frames. And DOCUMENT. I know - nobody likes to do the documentation, but I think we can all say that it's a lot easier to plan migrations if you have documentation of what is currently there. Realize that no matter what you do, it's not going to last forever. Your cable plant probably has a lifespan of 10 years (not to say that you may not get 20 or even 30 years out of it, as long as you're willing to stay slow), but your network devices will probably only be there for 5 years. Are you still going to be there for the next change? Probably not, so be nice to the company and to the people who follow after you and document.

    Just my $0.02, and I'm just that blond chick, so what do I know anyway...

    --
    So what are you going to do? Bleed on me?
  101. Re:Reliability is inverse to the number of compone by ceejayoz · · Score: 2

    0 train fails = 0.9 * 0.9 = 0.81
    1 train fails = 2 * 0.1 * 0.9 = 0.18
    2 train fails = 0.1 * 0.1 = 0.01

    which means that the probability of having at least one train going from NY -> LA is ... 98%, much better than the previous 90%.


    Erm... to quote you, "I think you made some mistakes."

    100% - 1% = 99%.
    81% + 18% = 99%.

    How'd you get 98% out of those numbers?
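
    For anyone who wants to check the arithmetic, here is a small sketch. It assumes, as the quoted post does, that the two trains fail independently with probability 0.1 each; the function names are made up for the example.

```python
from math import comb


def failure_distribution(p_fail, n):
    """Probability that exactly k of n independent components fail, for k = 0..n."""
    return [comb(n, k) * p_fail**k * (1 - p_fail) ** (n - k) for k in range(n + 1)]


def at_least_one_works(p_fail, n):
    """Probability that not all n independent components fail."""
    return 1.0 - p_fail**n


dist = failure_distribution(0.1, 2)      # roughly [0.81, 0.18, 0.01]
print([round(p, 4) for p in dist])
print(round(at_least_one_works(0.1, 2), 4))
```

    The only case where nothing gets through is both trains failing, so the answer is 1 - 0.01, i.e. 99%. Real failures are often correlated, which is an assumption this sketch ignores.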

  102. Interesting response by jhines · · Score: 3, Insightful

    That this happened in a teaching hospital, rather than a large corporation, makes their response much different.

    They have been open about the problem, in a way that a for profit corporation could never be. This allows the rest of the world to learn from the experience.

  103. Absolutely a redundant network by Fastolfe · · Score: 2

    I don't really understand all of the comments saying a redundant network infrastructure is bad/stupid/etc.

    If your network is critical to your business, you should absolutely consider backing up every bit of that network with one (or more?) redundant components. This means every router should have a redundant pair, every physical network link should be redundant (including how it's routed through the building), every firewall, switch, etc. If you have mission-critical servers, they should have two NIC cards. Upgrades should never occur on both "sides" of the infrastructure at the same time, and both sides should be capable of running alone.

    Not only does this type of configuration resist failures, but upgrades or configuration changes to the A or B side should never impact the other side, and if it does, you should be able to shut down the offending sections without impacting availability.

    If your network staff doesn't understand these concepts, you desperately need to train them better. If the expense cannot be justified by management, then that's a business decision and when failures like this occur, they should not be surprised.

  104. Two is better than one? by mr_z_beeblebrox · · Score: 2

    Slashdot network engineers (armchair and professional): do you think the answer to having a massive and unreliable network is to build a second identical network?"

    Since Michael asked it like that I will leave behind my network engineer role (professional) and pick up my role as armchair mathematician.
    The item to be doubled is a network. Unreliability and massiveness are qualities of that network. So, using the distributive property of multiplication this would give us the equivalence of one network that is twice as large and twice as unreliable as the original.

  105. I lived in Boston until 1999 by Newer+Guy · · Score: 2

    I lived in Boston until 1999 and had my (ruptured) appendix removed at that hospital. That place is absolutely HUGE, many city blocks in size. Its network must be huge too and that's the problem. A LAN that size HAS to be sub-netted into smaller segments! Now, I'm not a whiz bang Network engineer, but I do know when something's done WRONG, and it sure seems like this is the case here. Building a parallel WRONG network won't solve the problem, it'll DOUBLE the problem! There are many gifted people here....why not come up with a solution for them here? Consider it a public service to a very public oriented hospital.

  106. Re:What is spanning tree protocol? (google whoring by jerde · · Score: 2, Interesting

    Well, mostly transparent to end stations.

    Some workstations turn up their ethernet link by software, and then try to use the port right away to, for instance, obtain a DHCP lease.

    Spanning tree starts doing its work as soon as it sees ethernet link. So, there's a delay between the time the link comes up and when traffic starts to pass.

    Apple's DHCP implementation was bitten by this on some of their machines, affecting the startup of the Appletalk stack, which unlike DHCP, will not retry its initial auto-configuration and address discovery.

    I've always been skeptical of "intelligence" added to layers below 3. There are always unforeseen interactions and consequences to ANY variance from a set standard.

    - Peter

    --
    INsigNIFICANT
  107. Obligatory barely-related Microsoft reference by Infonaut · · Score: 2
    Remember reading about the Microsoft-driven Hospital of the Future(tm) a couple of years back? I was trying to find info about it by doing a Google search. Amazing what * Microsoft hospital * brought up. MS is definitely making a concerted push in the health care industry.

    Let your imaginations wander, and ponder a point in the future when all of our health care facilities will be run on Microsoft... .

    --
    Read the EFF's Fair Use FAQ
  108. And on an unrelated note... by Radical+Rad · · Score: 3, Funny

    Mail any lucrative^h^h^h^h^h^h^h^h^h job offers to:

    Former MIS Director,
    Beth Israel Deaconess hospital
    Boston, MA 02215

  109. WRONG!: Re:Problem was with an application, by fanatic · · Score: 5, Informative

    No application can cause a spanning tree loop. It is simply impossible.

    A spanning tree loop causes broadcast frames - correctly used in small numbers in many different circumstances - to loop endlessly about the network (clogging it up), using paths that are provided for redundancy, but which are normally stopped from passing traffic by the "spanning tree protocol".

    There are 2 likely causes:

    Unidirectional link failure. If a connection between switches passes traffic in only one direction (normally they are bi-directional), then spanning tree can be 'fooled' into allowing traffic on a path that creates a loop and lets frames loop endlessly.

    Misconfiguration of switches, possibly combined with erroneous cabling. If spanning tree is configured off on a port, (or, maybe, put into a mode called portfast), it's possible for interconnection of switch ports (through a crossover cable or other means) to cause this to occur.

    A third possible cause is that the spanning tree software itself screws up and allows a loop when it shouldn't have. This was known to occasionally happen in Cisco switches some years ago. I haven't heard of it lately.

    This all happens way below the application layer. Unless the application is specifically written to send huge numbers of broadcast frames (there is no legitimate reason for an app to do this), it couldn't bring down the network. And even if it did, this would not be a 'spanning tree loop' and disconnecting the offending station would immediately fix the problem.

    Probably, the network should be using routers to partition it into smaller LANs. But this can still happen to any single LAN so created and if it happens to the one your servers are on, you're still cooked.
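
    To make the looping concrete, here is a toy sketch of flooding one broadcast frame through a few switches. It is a deliberate simplification (no MAC learning, no timing), and the switch names and the `flood` function are invented for the example.

```python
def flood(links, source, max_hops):
    """Count transmissions of one broadcast flooded for at most max_hops steps.

    Each switch forwards a copy out of every port except the one it arrived on.
    Ethernet frames carry no hop count, so max_hops only bounds the simulation.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    in_flight = [(source, None)]  # (switch holding a copy, switch it came from)
    transmissions = 0
    for _ in range(max_hops):
        next_flight = []
        for node, came_from in in_flight:
            for peer in neighbors.get(node, ()):
                if peer != came_from:
                    next_flight.append((peer, node))
                    transmissions += 1
        in_flight = next_flight
        if not in_flight:
            break  # the flood died out on its own
    return transmissions


looped = [("A", "B"), ("B", "C"), ("A", "C")]  # redundant link left forwarding
tree = [("A", "B"), ("B", "C")]                # same switches, A-C blocked
print(flood(looped, "A", 50), flood(tree, "A", 50))
```

    With the A-C link blocked the broadcast is sent twice and stops. With the loop in place, copies keep circulating for as long as you let the simulation run, and in a denser mesh they can multiply as well.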

    --
    "that's not encryption - it's a new perl script that I'm working on..." - from some Matrix parody
    1. Re:WRONG!: Re:Problem was with an application, by khafre · · Score: 4, Informative

      Actually, it is possible for an application to cause Spanning Tree to fail. Most switches have a management port that allow remote access (via telnet, ssh, SNMP, etc.) to the switch. This management port is normally connected to its own VLAN isolated behind a router so user broadcasts & multicasts in another VLAN can't affect the switch CPU. This port can be overrun with broadcasts and multicasts from user applications providing both the user and the switch are on the same VLAN. If this CPU is consumed by processing broadcasts, it may not have enough CPU time available to process and forward spanning tree BPDUs. If a blocked port becomes opened, a switch loop could form and, BINGO, network meltdown.

    2. Re:WRONG!: Re:Problem was with an application, by Anonymous Coward · · Score: 4, Informative

      Third possibility - and what I'd be confident is the initial cause.

      The amount of traffic the researcher was putting onto the network caused spanning tree hello BPDUs to be dropped.

      After a period of not receiving hello messages (20 seconds if memory serves), downstream devices believe the upstream device has failed, and decide to re-converge the spanning tree.

      During this re-convergence, the network can become partitioned. It is preferable to partition the network to prevent loops in the layer 2 infrastructure. Datalink layer frames eg ethernet, don't have a hop count, so they will loop endlessly - potentially causing further failures of the spanning tree protocol.
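
      A rough sketch of the timeout described above, assuming the usual 802.1D defaults of a 2 second hello and a 20 second max age. Real STP tracks message age per port and does a good deal more; the function name and the delivery lists are invented for the example.

```python
def time_of_expiry(delivered, hello=2.0, max_age=20.0):
    """Return when a downstream bridge gives up on its upstream, or None.

    delivered[i] is True if the hello BPDU sent at time i * hello arrived.
    The stored information is assumed fresh at time 0.
    """
    last_heard = 0.0
    for i, arrived in enumerate(delivered):
        now = i * hello
        if arrived:
            last_heard = now            # refresh the stored root information
        elif now - last_heard >= max_age:
            return last_heard + max_age  # aged out: reconvergence starts
    return None


# Hypothetical traffic patterns:
light_loss = [True, False] * 10          # every other hello dropped
saturated = [True] * 5 + [False] * 15    # link saturated after t = 8 s
print(time_of_expiry(light_loss), time_of_expiry(saturated))
```

      Losing every other hello never triggers a timeout. Losing all of them for 20 seconds does, at which point the downstream bridge decides the upstream one is gone.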

      Once the bulk traffic source is removed from the network, STP should stabilise within a fairly short period - 5 minutes or so - so there may also have been a bug in Cisco's IOS, which was triggered by this STP event.

      Alternatively, the network admins may have played with traffic priorities, causing this researcher's traffic to have a higher priority over STP messages, causing the STP to fail.

      Radia Perlman has a good description of STP in her book "Interconnections, 2nd ed" - but then she should - she invented it.

  110. uh wtf by ealar+dlanvuli · · Score: 2

    Staff had to scramble to find old paper forms that hadn't been used in six years so they could transfer vital patient records and prescriptions.

    I call sensationalist bullshit. It takes at most 15 minutes to switch over to a fully paper hospital here.

    Either that or their hospital is really really shitty.

    --
    I live in a giant bucket.
  111. Ahhh... that's it, you see! by sconeu · · Score: 2

    Well, that's it you see! Alan Ralsky thought it said spamming tree protocol and tried to use the network!

    --
    General Relativity: Space-time tells matter where to go; Matter tells space-time what shape to be.
  112. I was stuck there by drwho · · Score: 2

    Well, this explains what happened when I was there after being hit by a truck. The doctors were great but the place was very disorganized. Hrm.

  113. A common logical fallacy... by The+Ape+With+No+Name · · Score: 3, Insightful

    ... And one that is hard to argue with because it seems to make so much sense is post hoc, ergo propter hoc. For something to be a valid proposition, it must meet two conditions, necessity and sufficiency. When someone pulls a "It happened after that happened" trick to pin blame, they are meeting the necessary condition with the apparent causal relation of actions. This is the stronger condition intuitively for people. But under the sufficient condition, we must show that there is evidence to support the causal relationship. Supporting a claim is counterintuitive. Just ask any foreign policy maker in the US...

    --
    Comparing it to Windows will be a moot point, since El Dorado is going to have a 40% larger code base than XP.
  114. No contigency planning by CormacJ · · Score: 2

    I was an operations manager for a large hospital for several years, and planning for things such as this should be a number one goal for IT staff.

    The first rule in anything to do with hospitals is to ensure that they have disaster plans in place and that these are tested on a regular basis. The disaster plans should include scenarios such as total power outage, failures of vital equipment etc.

    The second rule I used was to ensure that in critical areas there was a second independent network path that if needed could be isolated from the rest of the network. Usually this meant putting in a run of fibre that bypassed buildings etc.

    The third rule is to ensure that vital equipment can be run without need for a network. Nothing should be so dependent on networking that if there is a failure it will stop it from working. If networking is a requirement (eg Medical Imaging) that network should be independent from the main network.

    The fourth rule is to ensure that there is a secondary method of accessing electronic patient records in the event of an extended downtime. I wrote an application that would dump the most needed patient information and leave it available on PC's in critical areas in query only mode. This allowed access to most of the patient details for using the patient forms.

  115. Re:Interconnections by netwiz · · Score: 2

    To begin with, it's unlikely a CCIE would have required a consultation w/ the inventor of the protocol, as they'd already have a firm understanding of the inner workings of STP. And there is no "quick start" to a CCIE. That's why there's less than 10,000 of them in the world. And why, even in the depressed tech market, CCIEs are still followed by headhunters bearing offers of $100K+/yr jobs...

  116. Re:Cisco STP implementation may have a bug by netwiz · · Score: 2

    Now, however, if two vlans get bridged (a computer with a wire in one vlan, and a wireless card in another vlan), the forwarding tables on the switches get confused because there are multiple paths to the same stp root.

    Excuse me? Since when do end hosts forward BPDUs? Since when do end hosts forward _anything_, for that matter?

    Unless you're going the el cheapo route, there's no reason that individual computers should be forwarding traffic. Okay, I'm sure some of you could show me valid scenarios, but I'll bet that none of them are realistic production environments (unless management has been incredibly stupid).

  117. These guys got off easy! by raehl · · Score: 3, Funny

    The last time I had a problem with a spanning tree algorithm I lost 12 points on my CS final!

    Ok, so seriously, I'd be embarrassed if I screwed up a spanning tree algorithm on a test. If it took Cisco engineers 6 days to fix it, it musta been something really quirky, most likely the software not configuring something right. I can't imagine an application problem that would hose a network past a power toggle.

  118. Sounds right. by twitter · · Score: 2
    The "backup" network should look different from the first so that it is not susceptible to common mode failure. It should be simpler, learning from the last accident, backing up the most important and difficult to replace segments. The Boston article mentions lab results. One way to back up the network is to have a simplified link from the lab to several key locations. "Non essential" functions and other less heavy stuff might just have to do without the backup. It might be inconvenient to walk down a hall or a flight of steps to get info, but that beats everyone having to go to a different building.

    The above is specious. I know nothing about the network or campus in question. I'm sure the folks on hand know what to do. Good luck.

    --

    Friends don't help friends install M$ junk.

  119. Sure, and while we're at it!! by cybercomm · · Score: 3, Funny

    Why not buy M$ wireless 802.11b, install W2K/XP on every computer, and set up an MS Exchange server? Who needs BSD when you have M$ :)

    just kiddin', the uptime of the above-mentioned network would be measured in nanoseconds, and then they would have to switch to the MS paper'n'pen method

    --
    Live for the present, learn from the past, and dream of the future!
  120. It's all about the Benjamins by sjbe · · Score: 5, Insightful

    My wife is a doctor. From what I've observed hospitals tend to be penny wise and pound foolish, particularly with regard to their computer systems. Largely for financial reasons they are generally unwilling to hire the IT professionals and spend the $ they need to do the job right.

    The computer systems at my wife's medical school were apparently run by a herd of poorly trained monkeys. Systems would crash constantly, admin policies were absurd, and very little was done to fix anything. At her current hospital, the residents in her department are stuck with machines that literally crash 10+ times daily. Nothing is done to fix them because that would take expertise, time and $, all of which are either in short supply or withheld.

    Hospitals really need serious IT help and it is a very serious problem. This article just illustrates how pathetically bad they do the job right now. I wish I could say I was surprised by this but I'm not.

    1. Re:It's all about the Benjamins by passion · · Score: 2

      Well, then to take the optimistic view, I guess that crashing 10+ a day isn't that bad an occurrence... that way, they don't develop an ultimate dependence on a system, and when it crashes, it's an annoyance instead of a mission-critical failure.

      --
      - passion
  121. Yes, but.. by Inoshiro · · Score: 2

    The kidneys are internally redundant. You only need a 10% kidney function to continue to survive. Ditto for Liver and other organs (aside from heart). They take years of abuse via smoking or drinking before they finally start to wear out to the point of causing system collapse.

    --
    --
    Internet Explorer (n): Another bug -- that is, a feature that can't be turned off -- in Windows.
  122. Obligatory Spanning Tree Poem by crotherm · · Score: 2

    Algorhyme

    I think that I shall never see
    A graph more lovely than a tree.
    A tree whose crucial property
    Is loop-free connectivity.
    A tree that must be sure to span
    So packets can reach every LAN.
    First, the root must be selected.
    By ID, it is elected.
    Least-cost paths from root are traced.
    In the tree, these paths are placed.
    A mesh is made by folks like me,
    Then bridges find a spanning tree.

    ---Radia Perlman
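
    For the curious, the poem's steps (elect the root by lowest ID, trace least-cost paths from it, block everything else) can be sketched centrally like this. Real bridges work it out in a distributed way by exchanging BPDUs; the function name, bridge IDs, and link costs here are invented for the example.

```python
import heapq


def spanning_tree(links):
    """links: list of (bridge_a, bridge_b, cost). Returns (root, tree, blocked)."""
    adj = {}
    for a, b, cost in links:
        adj.setdefault(a, []).append((b, cost))
        adj.setdefault(b, []).append((a, cost))

    root = min(adj)          # "By ID, it is elected."
    parent = {}
    done = set()
    heap = [(0, root, root)]  # (cost to root, bridge, neighbor used to reach it)
    while heap:
        cost, bridge, via = heapq.heappop(heap)
        if bridge in done:
            continue          # already reached by a path at least as cheap
        done.add(bridge)
        if bridge != root:
            parent[bridge] = via  # least-cost path from root goes through `via`
        for peer, link_cost in adj[bridge]:
            if peer not in done:
                heapq.heappush(heap, (cost + link_cost, peer, bridge))

    tree = {frozenset((b, p)) for b, p in parent.items()}
    blocked = {frozenset((a, b)) for a, b, _ in links} - tree
    return root, tree, blocked


# A three-bridge mesh with equal link costs (the value 4 is arbitrary).
root, tree, blocked = spanning_tree([(1, 2, 4), (2, 3, 4), (1, 3, 4)])
print(root, sorted(map(sorted, tree)), sorted(map(sorted, blocked)))
```

    Bridge 1 wins the election, both of its links stay in the tree, and the 2-3 link is the one that gets blocked to keep the topology loop-free.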

    --
    "Those who make peaceful revolution impossible, make violent revolution inevitable" - JFK
  123. Executives working? by wandernotlost · · Score: 3, Funny
    Senior executives were reduced to errand runners as the hospital struggled with moving information around the campus.

    It's always nice to see those people doing useful work for a change.

  124. Standard UPS by Bios_Hakr · · Score: 2

    Sounds like a standard UPS system to me. You have the grid feeding banks of batteries. The batteries feed the hospital. The generators are between the grid and the batteries, but they are not wired in such a way as to allow a generator failure to disrupt power from the grid. If the grid fails, no one notices because the batteries are what feed the hospital. After a few minutes, the generators start and they keep the batteries full. Once the grid is back on, the generators shut down.

    --
    I'd rather you do it wrong, than for me to have to do it at all.
  125. Sad but true by sjbe · · Score: 2

    Actually there is more truth to that than you know. They can't keep any files locally and simply have to not rely on the systems for anything critical. Recently they had their computers taken away for 3 weeks (refurbishing offices), which was a terrible inconvenience, but it didn't bring work to a halt. Just made everyone's lives harder than they had to be.

  126. Union "help" by ces · · Score: 3, Insightful

    Most union tradespeople I've encountered do actually take pride in doing their jobs right and well. You just have to realize that even the best ones won't generally work any harder than the work rules require them to.

    My advice is to get to know any tradespeople you may have to deal with on a regular basis for things like electrical work, moving furniture, etc. It's amazing how far just treating them as fellow skilled professionals will get you. Resorting to bribery (aka "gifts") can also help. If you give the union electrician a bottle of nice scotch or a box of cigars when he adds some new circuits in the server room, he is much more likely to come out at 3am on a Sunday morning when you need him NOW.

    --
    Happy Fun Ball is for external use only.
    1. Re:Union "help" by ces · · Score: 2

      That may be true, but if you treat the tradespeople like shit they will act like cretins.
      The problem is that often the only way to get decent, prompt, and/or after-hours service from union trades is by getting them to want to help you. This is accomplished by making friends with them and by bribery.

      Unfortunately life often requires you to go out of your way to be nice to people who really don't deserve it.

      --
      Happy Fun Ball is for external use only.
  127. Testing the backup network by Skapare · · Score: 2

    And how will you know if the backup network even works? Of course you could test it. But will it work under the kind of extreme live stress that would take down the primary network? And what if the issue is simply load that neither network can fully handle? Could you run both networks in tandem correctly? It sounds to me like the original problem was that the network was designed by someone who thinks of the switches as magical black boxes that will take care of everything ... someone who assumes perfect abstraction. I think the 3 million dollars to build a parallel network could be better spent hiring competent people to build a correct network that includes redundancies structured in the right places. No matter what you do, there will be some single points of failure, such as the very logic used to switch over to the backup network, if that's what you have (which would be a big waste if it sat there idle). The network engineering people need to know and understand those single points of failure and have plans to deal with failures at those points.

    --
    now we need to go OSS in diesel cars
  128. Re:what ever happened to TTL? by Ian+Peon · · Score: 2

    To elaborate on what zzyrc said, TTL won't decrement when it passes through a typical layer 2 switch - only a router or other layer 3 device.
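
    To make the point concrete, here is a small toy simulation (the function name and the 1000-hop cutoff are invented for illustration): a packet goes round and round a loop of devices, and only devices marked as routers touch the TTL.

```python
def hops_survived(ttl, devices, max_hops=1000):
    """Send a packet around a loop and count how many times it is forwarded.

    devices:  list of "switch" or "router", visited cyclically
    max_hops: arbitrary cutoff so a never-ending loop still returns
    Returns the number of forwards before a drop, or max_hops if the
    packet is still circulating when we give up.
    """
    hops = 0
    while hops < max_hops:
        device = devices[hops % len(devices)]
        if device == "router":
            ttl -= 1          # layer 3 device: decrement TTL
            if ttl <= 0:
                return hops   # TTL expired, packet discarded here
        # A layer 2 switch forwards the frame without looking at the TTL.
        hops += 1
    return max_hops

print(hops_survived(64, ["switch"] * 3))   # loop of switches
print(hops_survived(64, ["router"] * 3))   # loop of routers
```

    In the all-switch loop the TTL never changes, so nothing in this model ever drops the packet; in the all-router loop it dies once the TTL runs out. That is why a bridging loop needs spanning tree to break it rather than TTL.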

  129. Re:the sad part by alizard · · Score: 2
    So who gets priced out of medical care with your "solution"?

    Probable results:

    • a new army of minimum wage clerks
    • you might die when that utterly necessary record with the info required to treat you is delayed, screwed up, or lost. Not sure how much of a loss that would be in your case, of course. People involved with technology who don't understand what it's really for should listen carefully to the call of Darwin.
    • increased costs and reduced efficiency. Remember why they went to computerized records to begin with? It wasn't because of a passionate love for unglamorous back-office technology.
  130. No wonder! by PerryMason · · Score: 2

    Meanwhile, the hospital was figuring out how to run at its usual pace without the 100,000 e-mails it usually sends a day.

    So that's where they're doing all those penis enlargements!

    --
    "I'm tired of all this 'Aren't humanity great' bullshit. We're a virus with shoes" - Bill Hicks
  131. talk about repeating the problem by stinky+wizzleteats · · Score: 2

    Build a second parallel network because the network designers didn't know wtf they were doing? How are you going to fail over to this network? STP? (insert obnoxious chortle here)

    10 bridged hops = big flat network = they needed layer 3 switching in the first place, ergo, the network was badly designed. The very fact that a root bridge STP reconverge occurred indicates a poorly framed implementation plan and obviously no backout plan.

    Find somebody who knows what the hell they are doing and have them do a network audit.

  132. I found the problem by stinky+wizzleteats · · Score: 2

    Cisco Systems, the hospital's network provider...

  133. Re:So true by mekkab · · Score: 2

    I wonder if that open source ATC comment was for that UK airspace shutdown on May 17th...

    It's nothing open source could fix...
    He shouldn't worry though, we've put a fix in for that (works damn well, too!)

    --
    In the future, I would want to not be isolated from my friends in the Space Station.
  134. Re:Reliability is inverse to the number of compone by dago · · Score: 2

    There's always some error in the calculations; in this case, the train driver forgot to lace his shoes.

    --
    #include "coucou.h"