Slashdot Mirror


Dublin Air Traffic Control Brought Down By Faulty NIC

Not so very long after passengers were left hanging by a similar glitch at LAX, Gilby4mPuck writes with another story of NIC failure leading to a disruption of air traffic, this time in Ireland, excerpting: "Data showing the location, height and speed of approaching planes disappeared from screens for 10 minutes each time. ... Thales ATM stated that in 10 similar air traffic control Centres worldwide with over 500,000 flight hours (50 years), this is the first time an incident of this type has been reported. ... '[They] confirmed the root cause of the hardware system malfunction as an intermittent malfunctioning network card which consequently overcame the built-in system redundancy,' said an IAA spokeswoman."

27 of 203 comments

  1. Re:testing and QA by Thanshin · · Score: 4, Insightful

    Testing doesn't confer prescience.

  2. There's only one way to solve this by anomnomnomymous · · Score: 5, Funny

    Put all those NICs on the terror watchlist!

    --
    When you shoot a mime, do you use a silencer?
  3. Re:testing and QA by MortenLJ · · Score: 4, Informative

    The possibility of failure can be reduced, but never completely removed. It's a simple matter of probabilities. E.g. a certain component fails on any day with probability p; if we add n redundant fail-overs, the total system will fail with probability 1-p^n, an equation which will never be one, but it can get close.
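
    Working the numbers under the assumption that failures are independent: the system goes down only when every copy fails on the same day, which happens with probability p^n, so 1 - p^n is the chance it stays up. That is the figure that gets close to one without ever reaching it. A small Python sketch (the function names are made up for illustration, and n here counts the total number of copies):

```python
def system_failure_probability(p: float, n: int) -> float:
    """Probability that all n independent copies fail on the same day."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if n < 1:
        raise ValueError("need at least one component")
    return p ** n


def system_availability(p: float, n: int) -> float:
    """Probability that at least one of the n copies is still working."""
    return 1.0 - system_failure_probability(p, n)


# Each extra copy shrinks the failure probability, but never to zero.
for copies in (1, 2, 3):
    print(copies, system_failure_probability(0.01, copies),
          system_availability(0.01, copies))
```

    The independence assumption is the weak spot: a fault that affects all copies at once is not covered by this arithmetic.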

  4. Re:testing and QA by tinkertim · · Score: 5, Funny

    Whatever happened to testing of installed hardware? You'd think they might consider that sort of thing important when it involves the lives of thousands of people. Then again, maybe they were drunk at the time.

    Well, when we set up some cheap NAS boxes with redundant nics .. some load balancers and other goodies .. we tested it by yanking cables on the bonded nics and making sure everything still worked.

    This was for an e-commerce site.. I would agree in hoping more testing with real failures would be done on systems that monitor air traffic.

    Also, we were very drunk when yanking cables during our test .. so I don't think intoxication is really a factor. In fact, turning a drunken monkey loose in a data center with a clearance to pull cables is _very_ good fail over testing :)

  5. Re:testing and QA by mcrbids · · Score: 5, Insightful

    Even the best-planned redundant system can be brought to its knees by hardware that "mostly works".

    It's not hard to have system B check that system A is on/off line, and step in if the latter is the case. But what happens when A is *mostly* or *sorta* online? Does system B check that ALL functionality done by A is being done appropriately? Almost never.
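
    To make the difference concrete, here is a rough Python sketch of a liveness check versus a check on whether A is actually doing its job. Everything in it (the Probe fields, the thresholds, the function names) is hypothetical and not taken from any real failover product:

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Probe:
    responded: bool     # did system A answer the heartbeat at all?
    checksum_ok: bool   # did the reply pass an end-to-end integrity check?
    latency_ms: float   # how long the reply took


def liveness_only(probes: Sequence[Probe]) -> bool:
    """Naive check: A counts as healthy if it answered at least once."""
    return any(p.responded for p in probes)


def functional_health(probes: Sequence[Probe],
                      max_bad_ratio: float = 0.05,
                      max_latency_ms: float = 200.0) -> bool:
    """Stricter check: A is healthy only if nearly all replies are usable."""
    if not probes:
        return False
    bad = sum(
        1 for p in probes
        if not (p.responded and p.checksum_ok and p.latency_ms <= max_latency_ms)
    )
    return bad / len(probes) <= max_bad_ratio


# A "mostly works" system: every probe is answered, but 3 in 10 are corrupt.
mostly_working = [Probe(True, True, 5.0)] * 7 + [Probe(True, False, 5.0)] * 3
print(liveness_only(mostly_working), functional_health(mostly_working))
```

    The naive check keeps A in service here, while the stricter one would hand over to B. Even the stricter check only covers the failure modes someone thought to probe for.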

    And that's why, even in the best, most carefully designed, fully redundant high-availability systems, you never, ever see 100% uptime. It's just not possible to anticipate everything that can go wrong.

    So design a system that fails gracefully! That's what nature did.

    Take a look at your own body. It's a gorgeous example of a high-availability, high-redundancy system. There are literally BILLIONS of cells in your body, each operating as a semi-independent unit, such that any of them can fail without bringing down the whole, or even affecting it noticeably. Your body is an excellent example of a cheap, redundant, high-availability system.

    Yet catastrophic failures still occur. Whether by cancer, diabetes, or heart disease, even a well-designed, tested-for-millions-of-years high-redundancy system with billions of individual, replaceable parts fails catastrophically from time to time.

    It's the nature of the beast.

    Mother nature has compensated by making not only the system redundant, but the need for the system also redundant. Rapid reproduction is nature's friend! Not just redundancy, but redundant redundancy.

    High availability - it's much, much, MUCH harder than you thought.

    --
    I have no problem with your religion until you decide it's reason to deprive others of the truth.
  6. NICtzche by cornjchob · · Score: 3, Funny

    if this piece of hardware was capable of "overc[oming] the built-in system redundancy", perhaps its ilk ought to be patrolling the transistorized wunderplatz of interconnected morsels governing our most hubristic means of transportation? I, for one, would certainly feel safer.

    --
    We now have confirmed reports from an informed Orange County minister that Ethel is still an active communist.
  7. Re:testing and QA by HungryHobo · · Score: 5, Insightful

    I'm inclined to trust the card which has been working fine for 5 years over a card which was put in yesterday.

  8. Re:testing and QA by kitgerrits · · Score: 5, Insightful

    The problem is not that redundancy wasn't implemented.
    The problem is that redundancy doesn't handle 'flapping' hardware very well.
    The NIC intermittently failed, causing the redundancy to switch cards several times.
    This can play havoc with systems that work on a LAN and assume the MAC address stays the same.
    Also, a NIC that does not report an error, doesn't fail completely and simply swaps a few bits around can be nigh-on impossible to diagnose.
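
    One common way to cope with flapping is to count state changes in a time window and stop trusting a link that changes too often. A minimal Python sketch, assuming link state is sampled periodically; the class name and thresholds are invented for illustration and this is not how any particular bonding driver does it:

```python
from collections import deque


class FlapDetector:
    """Flags a link as flapping when it changes state too often."""

    def __init__(self, max_transitions: int = 3, window_s: float = 60.0):
        self.max_transitions = max_transitions
        self.window_s = window_s
        self._transitions = deque()   # timestamps of recent up/down changes
        self._last_state = None

    def observe(self, link_up: bool, now: float) -> bool:
        """Record one link-state sample; return True if the link is flapping."""
        if self._last_state is not None and link_up != self._last_state:
            self._transitions.append(now)
        self._last_state = link_up
        # Forget transitions that have aged out of the window.
        while self._transitions and now - self._transitions[0] > self.window_s:
            self._transitions.popleft()
        return len(self._transitions) >= self.max_transitions


detector = FlapDetector()
for t, up in [(0.0, True), (1.0, False), (2.0, True), (3.0, False)]:
    flapping = detector.observe(up, t)
print(flapping)
```

    A failover controller could then keep a flapping card out of service until it has been stable for a while, instead of switching back every time it recovers. This does nothing for the card that silently swaps bits.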

    This could have been caught with real-time hardware and log-monitoring, but I have to confess even I only check the logs daily, not real-time. While some monitoring systems can mail the admin in the event of failure, not all systems are usually configured that way ('workstations' being a prime candidate).

    There is a line you draw between monitoring and cost-effectiveness. Every company takes a calculated risk in this and they got bitten.

    --
    "I was in love with a beautiful blonde once, dear. She drove me to drink. It's the one thing I am indebted to her for."
  9. Re:ten minutes by wintermute000 · · Score: 5, Informative

    there are plenty of examples of 10 minute failover

    Older Cisco ATAs take 10 minutes to swing onto SRST if keepalives are lost to the CallManager cluster.

    a complex routing protocol refresh (big BGP networks) can take many minutes

    a faulty NIC can easily bring down a LAN segment, with or without redundant switching paths - and it makes it look like a router failure as the router overloads trying to deal with the broadcast storm

  10. Re:testing and QA by zach_d · · Score: 3, Interesting

    in a high noise/vibration/dust environment?

  11. Re:testing and QA by Hal_Porter · · Score: 5, Funny

    Only The Spice confers prescience.

    --
    echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
  12. It's a success story. by Farmer+Tim · · Score: 4, Funny

    "...an intermittent malfunctioning network card which consequently overcame the built-in system redundancy"

    But it's one of the lucky ones.

    Every year, thousands of NICs fall victim to built-in system redundancy; if you know a card whose activity indicators are darkened and lifeless, it may have a redundancy problem. With your support and donations, we at Ethernetics Anonymous can help more network cards beat the scourge of built-in system redundancy, and make them feel like a useful part of society again.

    --
    Blank until /. makes another boneheaded UI decision.
  13. Re:testing and QA by DaedalusHKX · · Score: 3, Insightful

    The article says "it overcame the built in system redundancy"... how the hell does ONE failing card in a redundant setup "overcome" the redundant backup parts/systems ??

    I call "CYA kissass excuse maker" to the stand!

    Someone screwed up big, and they're Covering Their Asses now.

    --
    " What luck for rulers that men do not think" - Adolf Hitler
  14. In the queue by davew · · Score: 3, Funny

    I was due to fly the evening it all went wrong. Here's a lesson: if you're standing in a three-hour queue for the Ryanair desk, and they tell people to rebook on the web, and you take out a laptop and 3G modem, be prepared for a stampede.

  15. One card "overcame the redundancy"??? by gweihir · · Score: 3, Insightful

    If they have good redundancy, they have two separate networks and two independent, preferably different network cards, in all systems. Then they would do fail-over. Seems to me that if one card can bring this down, then the people that designed the redundancy screwed up badly.

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  16. Was it running windows? by iwein · · Score: 3, Funny
    --
    Show a man some news, distract him for an hour. Show a man some mod points, distract him for the rest of his life.
  17. Re:testing and QA by Hognoxious · · Score: 5, Funny

    if we add n redundant fail-overs, the total system will fail with probability 1-p^n

    Any number raised to the power 0 is 1. So if you don't install anything, hence n is 0, it will always work since the probability of failure is 1-1 = 0.

    --
    Confucius say, "Find worm in apple - bad. Find half a worm - worse."
  18. Why!? by damburger · · Score: 4, Funny

    I am flying to Florida tomorrow, it will only be my fifth plane flight in total and my first transatlantic flight. Despite being a rational scientist, who knows how safe it is statistically, I am having trouble suppressing my anxiety.

    And at this point, fate sees fit to bombard me with horror stories about flying. This news about air traffic control comes on the heels of a headline I just saw on the front page of the Independent about pilots not reporting faults on aircraft and thus unsafe ones still flying about. I can't remember the exact wording because my brain parsed it as "TOMORROW YOU WILL DIE IN FLAMES"

    --
    If we can put a man on the moon, why can't we shoot people for Apollo-related non-sequiturs?
    1. Re:Why!? by FrostedWheat · · Score: 3, Funny

      A long time ago I went on a school trip to London, and it was the first time I had ever been on a plane so I was a bit nervous. In the airport shop there was a magazine (can't remember which now) with a plane in flames on the front cover, with the large headline "Why Planes Crash". Whoever put them out must have had an evil streak too, they had spread them out to fill the entire top shelf.

  19. Confusing terminology by ddrichardson · · Score: 5, Informative

    I work in aviation and wonder if the terminology being used by the newspaper articles is correct.

    It appears to be talking about Mode S IFF (Identification Friend or Foe) or SIFF radar systems, which identify aircraft and append height data. The speed is the only thing that needs calculating, as it isn't encoded in the pulse train.

    This is weird because this data is normally carried over much older bus technologies, such as MIL-STD-1553, rather than current network technology.

    This makes me wonder if it was one of two things - a system inputting to an Ethernet PC system that calculates and displays the information or, more likely, they are talking about a DLTU type stub connector (or remote terminal) used in such typical buses. This is unlikely because, on the bus systems they are employed on, the bus controller would have picked up on the failure during continuous built-in test and pulled in an alternative.

    If it's the former then someone needs shooting. ATC is a realtime application and the overhead involved here would be unacceptable. I'm not even sure of the benefit of a network; multiple self-contained individual terminals would be safer.

    --
    A thistle is a fat salad for an ass's mouth...
  20. Re:testing and QA by mowall · · Score: 4, Insightful

    The article says "it overcame the built in system redundancy"... how the hell does ONE failing card in a redundant setup "overcome" the redundant backup parts/systems ??

    I suspect it's because, as mentioned in the summary, it was "an intermittent malfunctioning network card". i.e. the failover system must have thought the card was functioning.

  21. Re:testing and QA by jimicus · · Score: 4, Interesting

    Yes but if _one_ NIC can bring the entire system down what other single failures in a component could bring the entire system down? Obviously the system with the malfunctioning NIC can do any number of things that may result in a similar failure mode. Or what happens if the network switch it is attached to fails (I assume they use multiple paths... but if one nic can nuke it all, imagine if a switch went bonkers).

    You don't need to bring the entire system down to cause havoc. What if there's a hitherto unknown bug in one of the CPUs which under some very specific set of circumstances causes aircraft altitude to be misreported on the operator's screen? As the GP said, most redundant systems only ensure that the components appear to be broadly working. They seldom check that all the components are doing something sensible.
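
    Checking that components are "doing something sensible" usually means comparing their outputs with each other rather than just checking that they are alive. A rough Python sketch of a median voter over redundant readings; the names, the tolerance and the three-source setup are assumptions for illustration, not a description of the system in the article:

```python
from statistics import median
from typing import Dict, List, Tuple


def vote(readings: Dict[str, float], tolerance: float) -> Tuple[float, List[str]]:
    """Return the median reading and the sources that disagree with it."""
    if len(readings) < 3:
        raise ValueError("need at least three sources to outvote a bad one")
    agreed = median(readings.values())
    suspects = sorted(
        name for name, value in readings.items()
        if abs(value - agreed) > tolerance
    )
    return agreed, suspects


# Three hypothetical sources reporting one aircraft's altitude in feet.
altitude, suspects = vote({"a": 35000.0, "b": 35010.0, "c": 31000.0}, 100.0)
print(altitude, suspects)
```

    This catches a single component that is confidently wrong, at the cost of running three of everything. It does not help if all sources share the same bug.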

  22. Re:testing and QA by david.given · · Score: 3, Informative

    Actually, it confers "the ability to fold space. That is, travel to any part of universe without moving."

    Actually actually, the space folding is done using the Holtzman drive, which is a perfectly ordinary machine. The Navigator merely navigates, plotting a safe path through the non-space/time foldspace. The spice grants the Navigator the limited prescience required to do this.

    Eventually the Navigators become obsolete, replaced by Ixian semisentient machines known as Compilers that perform the same task without needing melange. A good thing too, because by that point Arrakis is rubble and sandworms are pretty much extinct.

    Details courtesy of Wikipedia (and my lack of a social life).

  23. Irish Examiner, ha! by PinkyDead · · Score: 3, Funny

    Everyone in Ireland knows that the Irish Examiner used to be the Cork Examiner - and they never miss an opportunity to point out how Dublin is doing a bad job.

    This is because Cork thinks that it's the centre of the friggin' universe. The 'Real Capital', my arse! Just a bunch of thunderin' ejits, living in their little Blarney fantasy land. Sure they can't even talk right. What the hell is a 'langer', anyway. They wouldn't even know how to spell NIC.

    The fact that they are right is quite beside the point.

    (For a North American cultural equivalent, please see http://en.wikipedia.org/wiki/South_Park:_Bigger%2C_Longer_%26_Uncut)

    Anyone who mods me down is from Cork - believe it!

    --
    Genesis 1:32 And God typed :wq!
  24. Re:testing and QA by methamorph · · Score: 3, Insightful

    The article says "it overcame the built in system redundancy"... how the hell does ONE failing card in a redundant setup "overcome" the redundant backup parts/systems ??

    If the card had failed completely the redundant one would probably have kicked in. What I think happened is the card malfunctioned in a way causing the system to still think that the card is fine and there is no need for the redundant one to kick in.

  25. Re:testing and QA by Hal_Porter · · Score: 5, Interesting

    One of the odd and very likable things about Dune is that there are occasionally implications that the society we read about is not the most advanced. Maybe their taboos are limiting them. Essentially the world we read about is actually in its own version of the Dark Ages where progress has all but stopped and feudalism is the only system. The Tleilaxu and the Ixians aren't in a Dark Age though. But we don't hear too much about them because they are outside the known world because they violate the taboos that govern the known world.

    Essentially it's a bit like reading the history of Taliban-controlled Afghanistan, or unfortunately anywhere with an Islamic government. And I'm sure it's deliberate - Frank Herbert apparently was inspired by the Islamic uprisings against the British.

    Or if you look at it another way he wanted to write a hallucinogenic, retro sci fi epic, and he came up with a bunch of explanations - the Butlerian Jihad, the necessity of spice-based prescience for interstellar travel, and the incompatibility between directed energy weapons and shields - to explain why his universe was that way and not like conventional sci fi with ray guns, robots and open societies in the Popper sense.

    --
    echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
  26. Re:testing and QA by morgan_greywolf · · Score: 3, Funny

    Yeah. Fortunately I just got back from going outside. OTOH, it was just raining, and I saw all the millions and millions of tiny water drops falling from the sky. Which made me think of Interrupt 80 and all those forked-off processes it would spawn with that code...