Dublin Air Traffic Control Brought Down By Faulty NIC

Posted by timothy on Thursday July 17, 2008 @07:40PM from the can-go-wrong-can-go-wrong-nothing-can dept.

Not long after passengers were left hanging by a similar glitch at LAX, Gilby4mPuck writes with another story of a NIC failure disrupting air traffic, this time in Ireland, excerpting: "Data showing the location, height and speed of approaching planes disappeared from screens for 10 minutes each time. ... Thales ATM stated that in 10 similar air traffic control Centres worldwide with over 500,000 flight hours (50 years), this is the first time an incident of this type has been reported. ... '[They] confirmed the root cause of the hardware system malfunction as an intermittent malfunctioning network card which consequently overcame the built-in system redundancy,' said an IAA spokeswoman."

Re:testing and QA by Thanshin · 2008-07-17 19:55 · Score: 4, Insightful

Testing doesn't confer prescience.
Re:testing and QA by mcrbids · 2008-07-17 20:11 · Score: 5, Insightful

Even the best-planned redundant system can be brought to its knees by hardware that "mostly works".
It's not hard to have system B check whether system A is online, and step in if it isn't. But what happens when A is *mostly* or *sorta* online? Does system B check that ALL functionality done by A is being done appropriately? Almost never.
And that's why, even in the best, most carefully designed, fully redundant high-availability systems, you never, ever see 100% uptime. It's just not possible to anticipate everything that can go wrong.
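To make the "mostly online" problem concrete, here is a minimal Python sketch, not anything from the actual Thales system. It assumes system B sends periodic end-to-end probes through system A and records whether each one came back correct. The class name, the window size, and the 95% threshold are all hypothetical choices for illustration. The point is only that a check on the last probe can say "up" while a check on the recent success ratio says "not good enough".

```python
from collections import deque


class FunctionalHealthCheck:
    """Judge a peer by recent end-to-end probe results, not just 'did it answer?'."""

    def __init__(self, window=20, min_success_ratio=0.95):
        # Hypothetical tuning values; a real system would choose its own.
        self.results = deque(maxlen=window)  # most recent probe outcomes
        self.min_success_ratio = min_success_ratio

    def record(self, probe_ok):
        """Store the outcome of one probe (True means a correct reply)."""
        self.results.append(bool(probe_ok))

    def success_ratio(self):
        if not self.results:
            return 1.0  # no evidence yet, so assume healthy
        return sum(self.results) / len(self.results)

    def healthy(self):
        """Healthy only if enough of the recent probes succeeded."""
        return self.success_ratio() >= self.min_success_ratio


def link_is_up(check):
    """Naive liveness test: the most recent probe was answered."""
    return bool(check.results) and check.results[-1]


def simulate_flaky_card():
    """Feed in 20 probes where every fourth one fails."""
    check = FunctionalHealthCheck(window=20, min_success_ratio=0.95)
    for i in range(20):
        check.record(i % 4 != 0)
    return check


if __name__ == "__main__":
    check = simulate_flaky_card()
    print("naive liveness check says up:", link_is_up(check))
    print("windowed functional check says healthy:", check.healthy())
```

In this simulation the card answers the last probe, so the naive check reports it as up, while the windowed check sees only 15 of 20 probes succeed and reports it as unhealthy. Even this only covers the one behaviour the probe exercises, which is the "ALL functionality" gap described above.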
So design a system that fails gracefully! That's what nature did.
Take a look at your own body. It's a gorgeous example of a high-availability, high-redundancy system. There are literally BILLIONS of cells in your body, each operating as a semi-independent unit, such that any of them can fail without bringing down the whole, or even affecting it noticeably. Your body is an excellent example of a cheap, redundant, high-availability system.
Yet catastrophic failures still occur. Whether by cancer, diabetes, or heart disease, even a well-designed, tested-for-millions-of-years high-redundancy system with billions of individual, replaceable parts fails catastrophically from time to time.
It's the nature of the beast.
Mother nature has compensated by making not only the system redundant, but the need for the system also redundant. Rapid reproduction is nature's friend! Not just redundancy, but redundant redundancy.
High availability - it's much, much, MUCH harder than you thought.

--
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Re:testing and QA by HungryHobo · 2008-07-17 20:41 · Score: 5, Insightful

I'm inclined to trust the card which has been working fine for 5 years over a card which was put in yesterday.
Re:testing and QA by kitgerrits · 2008-07-17 20:43 · Score: 5, Insightful

The problem is not that redundancy wasn't implemented.
The problem is that redundancy doesn't handle 'flapping' hardware very well.
The NIC intermittently failed, causing the redundancy to switch cards several times.
This can play havoc with systems that work on a LAN and assume the MAC address stays the same.
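One common way to tame flapping is to make failing back much harder than failing over. The Python sketch below is only an illustration under assumed parameters (the class name, the thresholds, and the alternating good/bad sample pattern are all hypothetical, and nothing here is taken from the Dublin system). It compares a naive controller that follows the card's instantaneous state with a damped one that leaves the primary on the first failure but returns only after a run of clean checks.

```python
class DampedFailover:
    """Fail over quickly, but fail back only after a sustained clean streak."""

    def __init__(self, fail_threshold=1, recover_threshold=5):
        # Hypothetical thresholds, counted in consecutive health samples.
        self.active = "primary"
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.bad_streak = 0
        self.good_streak = 0
        self.switches = 0

    def observe(self, primary_ok):
        """Process one health sample for the primary and return the active card."""
        if primary_ok:
            self.good_streak += 1
            self.bad_streak = 0
        else:
            self.bad_streak += 1
            self.good_streak = 0

        if self.active == "primary" and self.bad_streak >= self.fail_threshold:
            self.active = "backup"
            self.switches += 1
        elif self.active == "backup" and self.good_streak >= self.recover_threshold:
            self.active = "primary"
            self.switches += 1
        return self.active


def naive_switch_count(samples):
    """Count switches for a controller that follows every single sample."""
    active, switches = "primary", 0
    for ok in samples:
        wanted = "primary" if ok else "backup"
        if wanted != active:
            active, switches = wanted, switches + 1
    return switches


if __name__ == "__main__":
    flapping = [True, False] * 10  # card alternates between good and bad
    damped = DampedFailover()
    for ok in flapping:
        damped.observe(ok)
    print("naive switches:", naive_switch_count(flapping))
    print("damped switches:", damped.switches, "active:", damped.active)
```

With this alternating pattern the naive controller changes cards on nearly every sample, while the damped one moves to the backup once and stays there, because the primary never strings together enough clean samples to earn its way back.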
Also, a NIC that does not report an error, doesn't fail completely and simply swaps a few bits around can be nigh-on impossible to diagnose.
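For the silent bit-swapping case, one defence is an application-level checksum carried with each message, so the receiver can reject data that was altered somewhere along the path even when no device reports an error. The Python sketch below uses the standard library's `zlib.crc32`; the `seal`, `unseal`, and `flip_bit` helpers and the message layout are hypothetical names made up for this example, not part of any ATC protocol.

```python
import struct
import zlib


def seal(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC32 of the payload."""
    return payload + struct.pack("!I", zlib.crc32(payload))


def unseal(message: bytes) -> bytes:
    """Return the payload, or raise ValueError if the checksum does not match."""
    if len(message) < 4:
        raise ValueError("message too short to contain a checksum")
    payload = message[:-4]
    (expected,) = struct.unpack("!I", message[-4:])
    if zlib.crc32(payload) != expected:
        raise ValueError("checksum mismatch: payload was altered in transit")
    return payload


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate a faulty card by inverting a single bit."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    message = seal(b"FLIGHT=EI123;ALT=35000;SPD=450")
    print("clean message accepted:", unseal(message))
    try:
        unseal(flip_bit(message, 10))
    except ValueError as err:
        print("corrupted message rejected:", err)
```

A check like this only detects the corruption. Something still has to count the rejections and raise an alarm, which is where the monitoring trade-off comes in.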
This could have been caught with real-time hardware and log-monitoring, but I have to confess even I only check the logs daily, not real-time. While some monitoring systems can mail the admin in the event of failure, not all systems are usually configured that way ('workstations' being a prime candidate).
There is a line you draw between monitoring and cost-effectiveness. Every company takes a calculated risk in this and they got bitten.

--
"I was in love with a beautiful blonde once, dear. She drove me to drink. It's the one thing I am indebted to her for."
Re:testing and QA by DaedalusHKX · 2008-07-17 20:55 · Score: 3, Insightful

The article says "it overcame the built in system redundancy"... how the hell does ONE failing card in a redundant setup "overcome" the redundant backup parts/systems ??
I call "CYA kissass excuse maker" to the stand!
Someone screwed up big, and they're Covering Their Asses now.

--
" What luck for rulers that men do not think" - Adolf Hitler
One card "overcame the redundancy"??? by gweihir · 2008-07-17 21:01 · Score: 3, Insightful

If they have good redundancy, they have two separate networks and two independent, preferably different network cards, in all systems. Then they would do fail-over. Seems to me that if one card can bring this down, then the people that designed the redundancy screwed up badly.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:testing and QA by mowall · 2008-07-17 22:06 · Score: 4, Insightful

The article says "it overcame the built in system redundancy"... how the hell does ONE failing card in a redundant setup "overcome" the redundant backup parts/systems ??
I suspect it's because, as mentioned in the summary, it was "an intermittent malfunctioning network card", i.e. the failover system must have thought the card was functioning.
Re:testing and QA by methamorph · 2008-07-17 23:30 · Score: 3, Insightful

The article says "it overcame the built in system redundancy"... how the hell does ONE failing card in a redundant setup "overcome" the redundant backup parts/systems ??
If the card had failed completely, the redundant one would probably have kicked in. What I think happened is that the card malfunctioned in a way that left the system thinking the card was fine and that there was no need for the redundant one to kick in.