Dublin Air Traffic Control Brought Down By Faulty NIC
Not long after passengers were left hanging by a similar glitch at LAX, Gilby4mPuck writes with another story of a NIC failure disrupting air traffic, this time in Ireland, excerpting: "Data showing the location, height and speed of approaching planes disappeared from screens for 10 minutes each time. ...
Thales ATM stated that in 10 similar air traffic control Centres worldwide with over 500,000 flight hours (50 years), this is the first time an incident of this type has been reported. ...
'[They] confirmed the root cause of the hardware system malfunction as an intermittent malfunctioning network card which consequently overcame the built-in system redundancy,' said an IAA spokeswoman."
Testing doesn't confer prescience.
Even the best-planned redundant system can be brought to its knees by hardware that "mostly works".
It's not hard to have system B check whether system A is online, and step in if it isn't. But what happens when A is *mostly* or *sorta* online? Does system B check that ALL of the work A is supposed to do is being done correctly? Almost never.
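To make the "mostly online" problem concrete, here is a minimal sketch of a health check that separates "A answers a ping" from "A is doing its job". Everything in it is hypothetical: `check_primary`, `HealthReport`, and the check names are made up for illustration and have nothing to do with the actual Thales system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class HealthReport:
    alive: bool                # did the primary answer at all?
    failed_checks: List[str]   # functional checks that did not pass

    @property
    def healthy(self) -> bool:
        # "Alive" alone is not enough; every functional check must pass too.
        return self.alive and not self.failed_checks


def check_primary(ping: Callable[[], bool],
                  functional_checks: Dict[str, Callable[[], bool]]) -> HealthReport:
    """Run a liveness probe, then each named functional check (all hypothetical)."""
    if not ping():
        # Dead primary: treat every functional check as failed.
        return HealthReport(alive=False, failed_checks=list(functional_checks))
    failed = []
    for name, check in functional_checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that blows up counts as a failure
        if not ok:
            failed.append(name)
    return HealthReport(alive=True, failed_checks=failed)


# Example: the primary answers pings but its (made-up) radar feed check fails.
report = check_primary(
    ping=lambda: True,
    functional_checks={"radar_feed": lambda: False, "flight_plans": lambda: True},
)
print(report.alive, report.healthy, report.failed_checks)
```

A failover that only looks at `alive` would leave this primary in charge. The catch, of course, is that somebody has to think of every functional check in advance.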
And that's why, even in the best, most carefully designed, fully redundant high-availability systems, you never, ever see 100% uptime. It's just not possible to anticipate everything that can go wrong.
So design a system that fails gracefully! That's what nature did.
Take a look at your own body. It's a gorgeous example of a high-availability, high-redundancy system. There are literally BILLIONS of cells in your body, each operating as a semi-independent unit, such that any of them can fail without bringing down the whole, or even affecting it noticeably. Your body is an excellent example of a cheap, redundant, high-availability system.
Yet catastrophic failures still occur. Whether by cancer, diabetes, or heart disease, even a well-designed, tested-for-millions-of-years high-redundancy system with billions of individual, replaceable parts fails catastrophically from time to time.
It's the nature of the beast.
Mother nature has compensated by making not only the system redundant, but the need for the system also redundant. Rapid reproduction is nature's friend! Not just redundancy, but redundant redundancy.
High availability - it's much, much, MUCH harder than you thought.
I'm inclined to trust the card which has been working fine for 5 years over a card which was put in yesterday.
The problem is not that redundancy wasn't implemented.
The problem is that redundancy doesn't handle 'flapping' hardware very well.
The NIC intermittently failed, causing the redundancy to switch cards several times.
This can play havoc with systems that work on a LAN and assume the MAC address stays the same.
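One common way to cope with flapping is to count state changes in a time window and stop trusting a link that changes too often, rather than failing over on every change. A rough sketch, assuming the link state is sampled periodically; the class name and thresholds are invented for illustration, not taken from any real failover product:

```python
from collections import deque


class FlapDetector:
    """Flags a link as flapping if it changes state too often within a window."""

    def __init__(self, max_transitions: int = 3, window_seconds: float = 60.0):
        self.max_transitions = max_transitions
        self.window_seconds = window_seconds
        self._transitions = deque()   # timestamps of recent up/down changes
        self._last_state = None

    def observe(self, is_up: bool, now: float) -> bool:
        """Record one link-state sample; return True if the link is flapping."""
        if self._last_state is not None and is_up != self._last_state:
            self._transitions.append(now)
        self._last_state = is_up
        # Forget transitions that have aged out of the window.
        while self._transitions and now - self._transitions[0] > self.window_seconds:
            self._transitions.popleft()
        return len(self._transitions) >= self.max_transitions


detector = FlapDetector(max_transitions=3, window_seconds=60.0)
samples = [True, False, True, False]          # up, down, up, down
flags = [detector.observe(state, now=float(t)) for t, state in enumerate(samples)]
print(flags)  # [False, False, False, True]
```

Once the detector returns True, the sensible move is to pin traffic to the other card and page a human instead of switching back and forth.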
Also, a NIC that does not report an error, doesn't fail completely and simply swaps a few bits around can be nigh-on impossible to diagnose.
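Silent bit-swapping is the kind of fault an application-level checksum can at least detect. Ethernet frames already carry a CRC, but it is handled at the NIC, so corruption inside a faulty card may not be caught by it. A minimal sketch using `zlib.crc32` from the Python standard library; the `frame`/`unframe` helpers and the sample payload are hypothetical:

```python
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC32 of the payload."""
    return payload + struct.pack("!I", zlib.crc32(payload))


def unframe(data: bytes) -> bytes:
    """Verify the trailing CRC32 and return the payload, or raise ValueError."""
    if len(data) < 4:
        raise ValueError("frame too short")
    payload = data[:-4]
    (expected,) = struct.unpack("!I", data[-4:])
    if zlib.crc32(payload) != expected:
        raise ValueError("checksum mismatch: payload corrupted in transit")
    return payload


message = b"EI123 FL350 450kt"           # made-up track report
wire = frame(message)
assert unframe(wire) == message           # clean round trip

corrupted = bytearray(wire)
corrupted[3] ^= 0x01                      # flip a single bit, as a bad NIC might
try:
    unframe(bytes(corrupted))
except ValueError as err:
    print("detected:", err)
```

This only tells you that something is mangling data, not which component is doing it, so diagnosis is still the hard part.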
This could have been caught with real-time hardware and log monitoring, but I have to confess even I only check the logs daily, not in real time. Some monitoring systems can mail the admin in the event of a failure, but not every system is configured that way ('workstations' being a prime candidate).
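For what it's worth, the real-time version doesn't have to be elaborate. A sketch of the idea: scan log lines as they arrive and call an alert hook on anything that looks like NIC trouble. The regex and the sample log lines are invented for illustration; real driver messages vary by card and OS, so the pattern would need adjusting.

```python
import re
from typing import Callable, Iterable

# Assumed pattern: NOT taken from any real driver's log format.
NIC_TROUBLE = re.compile(r"link is (down|up)|crc error|tx timeout", re.IGNORECASE)


def watch(lines: Iterable[str], alert: Callable[[str], None]) -> int:
    """Call alert() for each suspicious line; return how many were flagged."""
    flagged = 0
    for line in lines:
        if NIC_TROUBLE.search(line):
            alert(line.rstrip("\n"))
            flagged += 1
    return flagged


sample_log = [                       # made-up log lines
    "10:00:01 eth0: link is down\n",
    "10:00:02 cron: job finished\n",
    "10:00:03 eth0: link is up\n",
    "10:00:09 eth0: CRC error on receive\n",
]
alerts = []
count = watch(sample_log, alerts.append)
print(count, alerts)
```

Here `alert` just appends to a list; in practice it would be whatever mails or pages the admin, and `lines` would be any iterable fed by a log follower.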
There is a line you draw between monitoring and cost-effectiveness. Every company takes a calculated risk here, and they got bitten.
The article says it "overcame the built-in system redundancy"... how the hell does ONE failing card in a redundant setup "overcome" the redundant backup parts/systems?
I suspect it's because, as mentioned in the summary, it was "an intermittent malfunctioning network card". i.e. the failover system must have thought the card was functioning.