Dublin Air Traffic Control Brought Down By Faulty NIC

Posted by timothy on Thursday July 17, 2008 @07:40PM from the can-go-wrong-can-go-wrong-nothing-can dept.

Not so very long after passengers were left hanging by a similar glitch at LAX, Gilby4mPuck writes with another story of NIC failure leading to a disruption of air traffic, this time in Ireland, excerpting: "Data showing the location, height and speed of approaching planes disappeared from screens for 10 minutes each time. ... Thales ATM stated that in 10 similar air traffic control Centres worldwide with over 500,000 flight hours (50 years), this is the first time an incident of this type has been reported. ... '[They] confirmed the root cause of the hardware system malfunction as an intermittent malfunctioning network card which consequently overcame the built-in system redundancy,' said an IAA spokeswoman."

12 of 203 comments

Re:testing and QA by Thanshin · 2008-07-17 19:55 · Score: 4, Insightful

Testing doesn't confer prescience.
Re:testing and QA by mcrbids · 2008-07-17 20:11 · Score: 5, Insightful

The very best planned of redundant systems can be brought to its knees by hardware that "mostly works".
It's not hard to have system B check that system A is on/off line, and step in if the latter is the case. But what happens when A is *mostly* or *sorta* online? Does system B check that ALL functionality done by A is being done appropriately? Almost never.
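To make the "mostly online" case concrete, here is a minimal sketch of the difference between a liveness check and a functional check. Everything in it is hypothetical: `flaky_send` stands in for a NIC that still shows link but corrupts longer payloads, and the probe payloads are made up. It is not how the Dublin system worked, only an illustration of why "is A up?" is a weaker question than "is A doing its job?".

```python
def liveness_check(link_up: bool) -> bool:
    """Naive check: system B only asks whether A's link is up at all."""
    return link_up


def functional_check(send, probes) -> bool:
    """Push known payloads through the path and verify each one comes back intact."""
    for payload in probes:
        echoed = send(payload)
        if echoed is None or echoed != payload:
            return False
    return True


def flaky_send(payload: bytes) -> bytes:
    """Hypothetical 'mostly working' NIC: flips one bit in payloads longer than 4 bytes."""
    if len(payload) > 4:
        return bytes([payload[0] ^ 0x01]) + payload[1:]
    return payload


def healthy_send(payload: bytes) -> bytes:
    """Hypothetical healthy path: echoes the payload unchanged."""
    return payload


PROBES = [b"ping", b"radar-track-0001"]

# The link is "up", so the naive check passes even though data is being corrupted.
naive_ok = liveness_check(True)
# The functional check catches the corruption on the longer probe.
flaky_ok = functional_check(flaky_send, PROBES)
healthy_ok = functional_check(healthy_send, PROBES)
```

Even this only covers the failure modes the probes happen to exercise, which is the point of the paragraph above: B almost never checks everything A does.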
And that's why, even in the best, most carefully designed, fully redundant high-availability systems, you never, ever see 100% uptime. It's just not possible to anticipate everything that can go wrong.
So design a system that fails gracefully! That's what nature did.
Take a look at your own body. It's a gorgeous example of a high-availability, high-redundancy system. There are literally BILLIONS of cells in your body, each operating as a semi-independent unit, such that any of them can fail without bringing down the whole, or even affecting it noticeably. Your body is an excellent example of a cheap, redundant, high-availability system.
Yet catastrophic failures still occur. Whether by cancer, diabetes, or heart disease, even a well-designed, tested-for-millions-of-years high-redundancy system with billions of individual, replaceable parts fails catastrophically from time to time.
It's the nature of the beast.
Mother nature has compensated by making not only the system redundant, but the need for the system also redundant. Rapid reproduction is nature's friend! Not just redundancy, but redundant redundancy.
High availability - it's much, much, MUCH harder than you thought.

--
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Re:testing and QA by HungryHobo · 2008-07-17 20:41 · Score: 5, Insightful

I'm inclined to trust the card which has been working fine for 5 years over a card which was put in yesterday.
Re:testing and QA by kitgerrits · 2008-07-17 20:43 · Score: 5, Insightful

The problem is not that redundancy wasn't implemented.
The problem is that redundancy doesn't handle 'flapping' hardware very well.
The NIC intermittently failed, causing the redundancy to switch cards several times.
This can play havoc on systems that work on a LAN and assume the MAC address to stay the same.
Also, a NIC that does not report an error, doesn't fail completely and simply swaps a few bits around can be nigh-on impossible to diagnose.
This could have been caught with real-time hardware and log-monitoring, but I have to confess even I only check the logs daily, not real-time. While some monitoring systems can mail the admin in the event of failure, not all systems are usually configured that way ('workstations' being a prime candidate).
There is a line you draw between monitoring and cost-effectiveness. Every company takes a calculated risk in this and they got bitten.
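One common way to deal with flapping is to count link state changes in a sliding time window and quarantine the card once it changes state too often, instead of failing back to it every time it reappears. A rough sketch follows; the class name `FlapDetector`, the threshold of 3 transitions and the 60 second window are all assumptions picked for illustration, not anything from the article.

```python
from collections import deque


class FlapDetector:
    """Quarantine a link that changes state too often within a time window."""

    def __init__(self, max_transitions: int = 3, window_s: float = 60.0):
        self.max_transitions = max_transitions  # illustrative threshold
        self.window_s = window_s                # illustrative window length
        self._transitions = deque()             # timestamps of recent state changes
        self._last_state = None
        self.quarantined = False

    def observe(self, timestamp: float, link_up: bool) -> bool:
        """Record one link-state sample; return True once the link is quarantined."""
        if self._last_state is not None and link_up != self._last_state:
            self._transitions.append(timestamp)
        self._last_state = link_up

        # Forget transitions that have aged out of the window.
        while self._transitions and timestamp - self._transitions[0] > self.window_s:
            self._transitions.popleft()

        # Quarantine is sticky: a person has to clear it, the card cannot un-flap itself.
        if len(self._transitions) >= self.max_transitions:
            self.quarantined = True
        return self.quarantined


# A card that goes up/down/up/down within 15 seconds gets quarantined.
flapping = FlapDetector()
flapping_results = [flapping.observe(t, up) for t, up in
                    [(0, True), (5, False), (10, True), (15, False)]]

# A card that changes state only every 100 seconds does not.
slow = FlapDetector()
slow_results = [slow.observe(t, up) for t, up in
                [(0, True), (100, False), (200, True), (300, False)]]
```

This only helps if the card actually reports link changes. It does nothing for the silent bit-swapping case mentioned above.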

--
"I was in love with a beautiful blonde once, dear. She drove me to drink. It's the one thing I am indebted to her for."
Re:testing and QA by DaedalusHKX · 2008-07-17 20:55 · Score: 3, Insightful

The article says "it overcame the built in system redundancy"... how the hell does ONE failing card in a redundant setup "overcome" the redundant backup parts/systems ??
I call "CYA kissass excuse maker" to the stand!
Someone screwed up big, and they're Covering Their Asses now.

--
" What luck for rulers that men do not think" - Adolf Hitler
One card "overcame the redundancy"??? by gweihir · 2008-07-17 21:01 · Score: 3, Insightful

If they have good redundancy, they have two separate networks and two independent, preferably different network cards, in all systems. Then they would do fail-over. Seems to me that if one card can bring this down, then the people that designed the redundancy screwed up badly.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:My idea of fault tolerance by a_real_bast... · 2008-07-17 21:09 · Score: 2, Insightful

Unfortunately, this NIC's fault showed up as the radar not working. What were they supposed to fail-over to? Binoculars?

--
You're making me think. You won't like me when I'm thinking.
Re:testing and QA by Anonymous Coward · 2008-07-17 21:09 · Score: 2, Insightful

-- "The very best planned of redundant systems can be brought to its knees by hardware that "mostly works"."--
--NO, you are wrong there. What this indicates is that someone skimped. Techniques for processing and getting reliable signals through systems that only mostly work are very well known and used routinely. What this event means is that someone, either explicitly or implicitly, assumed that NICs are binary - they either work or they don't, and designed accordingly.
What should have been used is multiple (more than two) parallel simultaneous communication paths with comparison voting at the far end to determine if the information received can be regarded as valid or not. Assuming that a NIC will fail gracefully is so boneheaded that in a safety critical application like this someone could likely be prosecuted for negligence.
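As a rough sketch of the comparison voting idea: take the same message from several independent paths and accept it only if a strict majority agree. The function name `vote`, the use of `None` for "nothing arrived on this path" and the sample payloads are assumptions for illustration. Real safety critical voters are considerably more involved.

```python
from collections import Counter


def vote(readings, quorum=None):
    """Return the value a strict majority of paths agree on, or None if there is no majority.

    `readings` holds one entry per path; None means that path delivered nothing.
    """
    if quorum is None:
        quorum = len(readings) // 2 + 1  # strict majority of all paths, silent ones included
    valid = [r for r in readings if r is not None]
    if not valid:
        return None
    value, count = Counter(valid).most_common(1)[0]
    return value if count >= quorum else None


# Two good paths outvote one path that delivered a corrupted copy.
agreed = vote([b"track-A", b"track-A", b"track-B"])
# No two paths agree: the data is treated as invalid rather than guessed at.
split = vote([b"track-A", b"track-B", b"track-C"])
# One path alone is not a majority of three.
lonely = vote([b"track-A", None, None])
```

With three paths this tolerates one path that is dead or lying. It needs more than two paths, as the comment says, because with two you can detect a disagreement but not tell which side is right.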
Unfortunately, using the right techniques is expensive. Given that systems like this are provided by tender processes that favour low-bidders, it is not surprising that problems appear.
Of course, somebody may have done a cost-benefit analysis and decided the risk of one (or several) aircraft accidents didn't merit the extra expenditure. That's unlikely to be publicised though - although it would be a correct calculation to run.
As for testing - well running around pulling cables out at random doesn't really do it. Unplugging and plugging cables at various frequencies/intervals, swapping cables, plugging them into incorrect sockets, injecting noise, dropping the voltage on the power supply, overvoltage/spikes are all things that could and should be done - and in some cases mathematical formal proof that the system will work as required. All of this (and more) is done for safety critical applications.
Re:testing and QA by tinkertim · 2008-07-17 21:10 · Score: 2, Insightful

The problem is not that redundancy wasn't implemented. The problem is that redundancy doesn't handle 'flapping' hardware very well.
The NIC intermittently failed, causing the redundancy to switch cards several times.
This can play havoc on systems that work on a LAN and assume the MAC address to stay the same.
That's what got me curious, it looked like they were using takeover instead of bonding devices.
The most well engineered system in the world can not hope to escape a ~9 minute ARP cache upstream, which makes me wonder why it was designed the way that it was.
I'm not thinking in an antagonistic sense, I'm more wondering what changed in the network _after_ the system was deployed.
Re:testing and QA by TheThiefMaster · 2008-07-17 21:13 · Score: 2, Insightful

If one fails with probability p, and you have n of them, a total system failure is probability p^n, not 1-p^n. Well, technically it's the product p1 × p2 × ... × pn, where p1 is the probability of the first failing, p2 the probability of the second, etc., multiplying them all together to get the chance of a total system failure (assuming the failures are independent).
The probability of at least one device in a redundant system failing is 1-(1-p)^n. This rapidly approaches 1 as n grows, so in larger setups failures will be a common occurrence, but they'll largely be harmless due to redundancy.
Of course this all assumes the failure mode of the device is "off" or "non-functioning". If it fails in a way which routes 15A of mains power into a network cable, redundancy might not help a whole lot.
Obviously that's not what happened, but it's not outside possibility for one device to take down an entire redundant system.
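For anyone who wants to plug numbers in, the formulas above can be sketched in a few lines. The function names and the example values (p = 0.01, n = 2 and n = 100) are made up for illustration, and all of it assumes failures are independent and of the plain "off" kind.

```python
from math import prod


def p_all_fail(p: float, n: int) -> float:
    """Chance that all n identical devices fail: p^n (independent failures assumed)."""
    return p ** n


def p_all_fail_mixed(ps) -> float:
    """Same idea with a different failure probability per device: p1 * p2 * ... * pn."""
    return prod(ps)


def p_any_fail(p: float, n: int) -> float:
    """Chance that at least one of n identical devices fails: 1 - (1 - p)^n."""
    return 1 - (1 - p) ** n


# Two redundant cards, each with a 1% chance of failing: total outage is rare.
both_down = p_all_fail(0.01, 2)
# A hundred such cards: seeing at least one failure somewhere is more likely than not.
some_down = p_any_fail(0.01, 100)
# Mixed hardware: multiply the individual probabilities.
mixed_down = p_all_fail_mixed([0.01, 0.02, 0.05])
```

None of this covers a device that fails by misbehaving rather than going quiet, which is the case the thread is actually about.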
Re:testing and QA by mowall · 2008-07-17 22:06 · Score: 4, Insightful

The article says "it overcame the built in system redundancy"... how the hell does ONE failing card in a redundant setup "overcome" the redundant backup parts/systems ??
I suspect it's because, as mentioned in the summary, it was "an intermittent malfunctioning network card". i.e. the failover system must have thought the card was functioning.
Re:testing and QA by methamorph · 2008-07-17 23:30 · Score: 3, Insightful

The article says "it overcame the built in system redundancy"... how the hell does ONE failing card in a redundant setup "overcome" the redundant backup parts/systems ??
If the card had failed completely the redundant one would probably have kicked in. What I think happened is the card malfunctioned in a way causing the system to still think that the card is fine and there is no need for the redundant one to kick in.