Slashdot Mirror


Dublin Air Traffic Control Brought Down By Faulty NIC

Not long after passengers were left hanging by a similar glitch at LAX, Gilby4mPuck writes with another story of a NIC failure disrupting air traffic, this time in Ireland, excerpting: "Data showing the location, height and speed of approaching planes disappeared from screens for 10 minutes each time. ... Thales ATM stated that in 10 similar air traffic control Centres worldwide with over 500,000 flight hours (50 years), this is the first time an incident of this type has been reported. ... '[They] confirmed the root cause of the hardware system malfunction as an intermittent malfunctioning network card which consequently overcame the built-in system redundancy,' said an IAA spokeswoman."

10 of 203 comments

  1. There's only one way to solve this by anomnomnomymous · · Score: 5, Funny

    Put all those NICs on the terror watchlist!

    --
    When you shoot a mime, do you use a silencer?
  2. Re:testing and QA by tinkertim · · Score: 5, Funny

    Whatever happened to testing of installed hardware? You'd think they might consider that sort of thing important when it involves the lives of thousands of people. Then again, maybe they were drunk at the time.

    Well, when we set up some cheap NAS boxes with redundant nics .. some load balancers and other goodies .. we tested it by yanking cables on the bonded nics and making sure everything still worked.

    This was for an e-commerce site.. I would agree in hoping more testing with real failures would be done on systems that monitor air traffic.

    Also, we were very drunk when yanking cables during our test .. so I don't think intoxication is really a factor. In fact, turning a drunken monkey loose in a data center with a clearance to pull cables is _very_ good fail over testing :)

  3. Re:testing and QA by mcrbids · · Score: 5, Insightful

    Even the best-planned redundant system can be brought to its knees by hardware that "mostly works".

    It's not hard to have system B check that system A is on/off line, and step in if the latter is the case. But what happens when A is *mostly* or *sorta* online? Does system B check that ALL functionality done by A is being done appropriately? Almost never.
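    A minimal sketch of the difference between "is A answering?" and "is A doing everything it should?". The names `probe`, `CheckResult`, and `should_fail_over`, and the idea of passing in per-function checks as callables, are assumptions made for illustration; nothing here describes how the Dublin system was actually built.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    alive: bool                                   # did system A answer at all?
    failed_checks: List[str] = field(default_factory=list)  # functional checks that did not pass

    @property
    def healthy(self) -> bool:
        # "Mostly online" (alive but failing a check) counts as unhealthy.
        return self.alive and not self.failed_checks


def probe(is_alive: Callable[[], bool],
          functional_checks: Dict[str, Callable[[], bool]]) -> CheckResult:
    """Run a liveness test, then every functional check, against system A."""
    if not is_alive():
        # Nothing can be verified on a dead system, so every check is failed.
        return CheckResult(alive=False, failed_checks=list(functional_checks))
    failed = []
    for name, check in functional_checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that blows up is treated as a failure
        if not ok:
            failed.append(name)
    return CheckResult(alive=True, failed_checks=failed)


def should_fail_over(result: CheckResult) -> bool:
    """System B steps in when A is down OR only partially working."""
    return not result.healthy
```

    The catch described above still applies: B only notices the failures someone thought to write a check for.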

    And that's why, even in the best, most carefully designed, fully redundant high-availability systems, you never, ever see 100% uptime. It's just not possible to anticipate everything that can go wrong.

    So design a system that fails gracefully! That's what nature did.

    Take a look at your own body. It's a gorgeous example of a high-availability, high-redundancy system. There are literally BILLIONS of cells in your body, each operating as a semi-independent unit, such that any of them can fail without bringing down the whole, or even affecting it noticeably. Your body is an excellent example of a cheap, redundant, high-availability system.

    Yet catastrophic failures still occur. Whether by cancer, diabetes, or heart disease, even a well-designed, tested-for-millions-of-years high-redundancy system with billions of individual, replaceable parts fails catastrophically from time to time.

    It's the nature of the beast.

    Mother nature has compensated by making not only the system redundant, but the need for the system also redundant. Rapid reproduction is nature's friend! Not just redundancy, but redundant redundancy.

    High availability - it's much, much, MUCH harder than you thought.

    --
    I have no problem with your religion until you decide it's reason to deprive others of the truth.
  4. Re:testing and QA by HungryHobo · · Score: 5, Insightful

    I'm inclined to trust the card which has been working fine for 5 years over a card which was put in yesterday.

  5. Re:testing and QA by kitgerrits · · Score: 5, Insightful

    The problem is not that redundancy wasn't implemented.
    The problem is that redundancy doesn't handle 'flapping' hardware very well.
    The NIC intermittently failed, causing the redundancy to switch cards several times.
    This can play havoc with systems that work on a LAN and assume the MAC address stays the same.
    Also, a NIC that does not report an error, doesn't fail completely and simply swaps a few bits around can be nigh-on impossible to diagnose.
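    One common way to cope with flapping is to count link-state changes in a time window and stop switching back to a card that changes state too often. The sketch below assumes a hypothetical `FlapDetector` fed with periodic link-state samples; the thresholds are made up, and it only covers link flapping, not the silent bit-swapping case.

```python
from collections import deque


class FlapDetector:
    """Quarantine a NIC whose link state changes too often (hypothetical helper)."""

    def __init__(self, max_transitions: int = 3, window_s: float = 60.0):
        self.max_transitions = max_transitions  # assumed threshold, tune per site
        self.window_s = window_s                # look-back window in seconds
        self._transitions = deque()             # timestamps of recent state changes
        self._last_state = None
        self.quarantined = False                # stays set until an operator clears it

    def observe(self, link_up: bool, now: float) -> bool:
        """Record one link-state sample; return True if the NIC may carry traffic."""
        if self._last_state is not None and link_up != self._last_state:
            self._transitions.append(now)
        self._last_state = link_up

        # Forget state changes that fell out of the look-back window.
        while self._transitions and now - self._transitions[0] > self.window_s:
            self._transitions.popleft()

        if len(self._transitions) >= self.max_transitions:
            self.quarantined = True

        return link_up and not self.quarantined
```

    With the defaults, a card that goes up, down, up, down within a minute is quarantined and is not used again even when its link comes back, so the redundancy stays on the healthy card instead of bouncing.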

    This could have been caught with real-time hardware and log-monitoring, but I have to confess even I only check the logs daily, not real-time. While some monitoring systems can mail the admin in the event of failure, not all systems are usually configured that way ('workstations' being a prime candidate).

    There is a line you draw between monitoring and cost-effectiveness. Every company takes a calculated risk in this and they got bitten.

    --
    "I was in love with a beautiful blonde once, dear. She drove me to drink. It's the one thing I am indebted to her for."
  6. Re:ten minutes by wintermute000 · · Score: 5, Informative

    there are plenty of examples of 10 minute failover

    Older cisco ATAs take 10 minutes to swing onto SRST if keepalives are lost to the callmanager cluster.

    a complex routing protocol refresh (big BGP networks) can take many minutes

    a faulty NIC can easily bring down a LAN segment, with or without redundant switching paths - and it makes it look like a router failure as the router overloads trying to deal with the broadcast storm

  7. Re:testing and QA by Hal_Porter · · Score: 5, Funny

    Only The Spice confers prescience.

    --
    echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
  8. Re:testing and QA by Hognoxious · · Score: 5, Funny

    if we add n redundant fail-overs, the total system will fail with probability 1-p^n

    Any number raised to the power 0 is 1. So if you don't install anything, hence n is 0, it will always work since the probability of failure is 1-1 = 0.
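    For comparison, a sketch of the usual textbook arithmetic, under assumptions the thread does not state: each of n redundant units fails independently with probability q, and the system fails only when all of them fail. The function name and the symbol q are mine, not the quoted poster's.

```python
def parallel_failure_probability(q: float, n: int) -> float:
    """Probability that all n independent redundant units fail at once.

    q is the assumed per-unit failure probability; independence is an
    assumption, and a flapping NIC that upsets its backup violates it.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a probability between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return q ** n
```

    Under that model n = 0 gives q^0 = 1, so installing nothing fails with certainty rather than never.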

    --
    Confucius say, "Find worm in apple - bad. Find half a worm - worse."
  9. Confusing terminology by ddrichardson · · Score: 5, Informative

    I work in aviation and wonder if the terminology being used by the newspaper articles is correct.

    It appears to be talking about Mode S IFF (Identification Friend or Foe) or SIFF radar systems, which identify aircraft and append height data. The speed is the only thing that needs calculating, as it isn't encoded in the pulse train.

    This is weird because data like this is normally carried over much older bus technologies, such as MIL-STD-1553, rather than current network technology.

    This makes me wonder if it was one of two things: a system inputting to an Ethernet PC system that calculates and displays the information, or, more likely, they are talking about a DLTU-type stub connector (or remote terminal) used in such buses. This is unlikely because, on the bus systems where they are employed, the bus controller would have picked up on the failure during continuous built-in test and pulled in an alternative.

    If it's the former then someone needs shooting. ATC is a realtime application and the overhead involved here would be unacceptable. I'm not even sure of the benefit of a network; multiple self-contained individual terminals would be safer.

    --
    A thistle is a fat salad for an ass's mouth...
  10. Re:testing and QA by Hal_Porter · · Score: 5, Interesting

    One of the odd and very likable things about Dune is that there are occasionally implications that the society we read about is not the most advanced. Maybe their taboos are limiting them. Essentially the world we read about is actually in its own version of the Dark Ages where progress has all but stopped and feudalism is the only system. The Tleilaxu and the Ixians aren't in a Dark Age though. But we don't hear too much about them because they are outside the known world because they violate the taboos that govern the known world.

    Essentially it's a bit like reading the history of Taliban-controlled Afghanistan, or unfortunately anywhere with an Islamic government. And I'm sure it's deliberate - Frank Herbert apparently was inspired by the Islamic uprisings against the British.

    Or if you look at it another way, he wanted to write a hallucinogenic, retro sci fi epic, and he came up with a bunch of explanations - the Butlerian Jihad, the necessity of spice-based prescience for interstellar travel, and the incompatibility between directed energy weapons and shields - to explain why his universe was that way and not like conventional sci fi with ray guns, robots and open societies in the Popper sense.

    --
    echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;