Dublin Air Traffic Control Brought Down By Faulty NIC
Not so very long after passengers were left hanging by a similar glitch at LAX, Gilby4mPuck writes with another story of NIC failure leading to a disruption of air traffic, this time in Ireland, excerpting: "Data showing the location, height and speed of approaching planes disappeared from screens for 10 minutes each time. ...
Thales ATM stated that in 10 similar air traffic control Centres worldwide with over 500,000 flight hours (50 years), this is the first time an incident of this type has been reported. ...
'[They] confirmed the root cause of the hardware system malfunction as an intermittent malfunctioning network card which consequently overcame the built-in system redundancy,' said an IAA spokeswoman."
Whatever happened to testing of installed hardware? You'd think they might consider that sort of thing important when it involves the lives of thousands of people. Then again, maybe they were drunk at the time.
Caesar si viveret, ad remum dareris.
Makes it sound like the NIC was fighting against The Man! Go NIC!
Task Mangler
Put all those NICs on the terror watchlist!
When you shoot a mime, do you use a silencer?
if they were really smart they would have two separate machines dedicated to this information for more redundancy.
People - I am trying to collect airport related scary stories. I haven't got many yet but if you have some then please let me know - you can email me at admin@scareports.com or just visit the site (blatant pimping) here.
I'd have to have some sympathy that it was an intermittent problem. They can really cause confusion to automated systems that are designed to cope with hard failures. I've had many occasions in my latter career in Service Delivery and support where it's taken human conviction to sort out issues caused by the cluster software trying to cope with intermittent connections
if this piece of hardware was capable of "overc[oming] the built-in system redundancy", perhaps its ilk ought to be patrolling the transistorized wunderplatz of interconnected morsels governing our most hubris means of transportation? I, for one, would certainly feel safer.
We now have confirmed reports from an informed Orange County minister that Ethel is still an active communist.
When I was administering a small network in Marin, every time we had a small earthquake, all of the AppleTalk connectors would come loose. Took hours to find the faults and push them together. I guess we should have used duct tape.
I suppose at an airport as each jet came in creating vibrations, those same connectors would have dislodged.
Ten minutes at a time? That doesn't sound like a "mostly broken" problem to me, that sounds like a 10 minute fail-over time. Shit happens, but if it takes you 10 minutes for your stuff to automatically start working again you're doing it wrong, especially since it's all in one data center. And whatever happened to redundant off-site systems?
New law: As a conversation progresses, the chance of someone saying "terrorist" approaches 100%
If video games influenced behavior the Pac Man generation would be eating pills and running away from their problems.
we don't have a prime minister and I'm fairly sure customs don't wear green.
And I am not entirely sure what a Lucky Charm is ...
Cruise TT
"...an intermittent malfunctioning network card which consequently overcame the built-in system redundancy"
But it's one of the lucky ones.
Every year, thousands of NICs fall victim to built-in system redundancy; if you know a card whose activity indicators are darkened and lifeless, it may have a redundancy problem. With your support and donations, we at Ethernetics Anonymous can help more network cards beat the scourge of built-in system redundancy, and make them feel like a useful part of society again.
Blank until
in this case would be the ability to run air traffic control without all those fancy computrons, should the need arise.
Swedish plasma phys. PhD student; MSc EE; knows maths, programming, electronics; finance interest; seeks opportunities
I was due to fly the evening it all went wrong. Here's a lesson: if you're standing in a three-hour queue for the Ryanair desk, and they tell people to rebook on the web, and you take out a laptop and 3G modem, be prepared for a stampede.
"Thales ATM stated that in 10 similar air traffic control Centres worldwide with over 500,000 flight hours (50 years), this is the first time an incident of this type has been reported."
Is the LAX incident not of the same type, then?
If they have good redundancy, they have two separate networks and two independent, preferably different network cards, in all systems. Then they would do fail-over. Seems to me that if one card can bring this down, then the people that designed the redundancy screwed up badly.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
http://www.networkworld.com/community/node/29644?page=2&ts=
Show a man some news, distract him for an hour. Show a man some mod points, distract him for the rest of his life.
Lordy, someone's lost the craic! Quick get this man a pint of the black stuff and a plate of annoying stereotype! /me dances a jig to up-beat folk music.
More seriously: had I said Taoiseach the joke would have flown over everyone's heads. Secondly, Taoiseach is just Irish for prime minister, so get off your high horse. Thirdly, I can exploit Irish stereotypes for a cheap joke, because I am Irish. Personally my favourite examples of Irish stereotypes are the Monkey Dust sketches: Diary of Anne Frank and The Crusades, always worth a plug.
I am flying to Florida tomorrow, it will only be my fifth plane flight in total and my first transatlantic flight. Despite being a rational scientist, who knows how safe it is statistically, I am having trouble suppressing my anxiety.
And at this point, fate sees fit to bombard me with horror stories about flying. This news about air traffic control comes on the heels of a headline I just saw on the front page of the Independent about pilots not reporting faults on aircraft and thus unsafe ones still flying about. I can't remember the exact wording because my brain parsed it as "TOMORROW YOU WILL DIE IN FLAMES"
If we can put a man on the moon, why can't we shoot people for Apollo-related non-sequiturs?
Customs do wear green, but it is hard to prove since the camera does not manage to capture them.
Besides that, it is not an area where photographing is allowed. If you do so anyway the X-ray machine will wipe your camera (It is not supposed to do that, but it will do anyway!)
was not 'built in' properly, the system should have been isolated after the first fault and not put back into service until the fault was diagnosed and fixed.
pretty sloppy if you ask me...
"'[They] confirmed the root cause of the hardware system malfunction as an intermittent malfunctioning network card which consequently overcame the built-in system redundancy,' said an IAA spokeswoman."
And when we edit a little bit, can we have the truth?:
They confirmed the root caused the hardware system malfunction using an intermittent malfunctioning network card which consequently overcame the built-in system redundancy.
I work in aviation and wonder if the terminology being used by the newspaper articles is correct.
It appears to be talking about Mode S IFF (Identification Friend or Foe) or SIFF radar systems, which identify aircraft and append height data. The speed is the only thing that needs calculating, as it isn't encoded in the pulse train.
This is weird because this data is normally carried over much older bus technologies, such as MIL-STD-1553, rather than over current network technology.
This makes me wonder if it was one of two things: a system inputting to an Ethernet PC system that calculates and displays the information or, more likely, they are talking about a DLTU-type stub connector (or remote terminal) used in such typical buses. This is unlikely, because on the bus systems where they are employed the bus controller would have picked up on the failure during continuous built-in test and pulled in an alternative.
If it's the former then someone needs shooting. ATC is a realtime application and the overhead involved here would be unacceptable. I'm not even sure of the benefit of a network; multiple self-contained individual terminals would be safer.
A thistle is a fat salad for an ass's mouth...
This is the first time in 50 years that this has happened. And the first time they had accurate information on screens was 10-15 years ago.
Outside ireland, lucky charms is an american-style (sickly sweet and artificial, featuring sort of freeze-dried marshmallow lumps) breakfast cereal, with a hollywood-irish-accented mascot "lucky the leprechaun" (akin to "tony the tiger" from "frosties"), with a surrounding advertising campaign that could be considered vaguely offensive (on grounds of nauseating cutesyness if nothing else), at least if irish people were excessively thin-skinned (fortunately, they're generally not, and since they're also white-skinned people in america probably wouldn't care if they were upset anyway). It's not hugely offensive, it's not "Bloody Sunday Breakfast Snacks" or something, but all the same, it's not sold in the Republic of Ireland, and was withdrawn from the UK market fairly rapidly upon introduction.
As part of that campaign, the leprechaun obsessively worries about people trying to take his lucky charms, or at least used to, now he just seems to be resigned to thieving kids running off with it.
Everyone in Ireland knows that the Irish Examiner used to be the Cork examiner - and they never miss an opportunity to point out how Dublin is doing a bad job.
This is because Cork thinks that it's the centre of the friggin' universe. The 'Real Capital', my arse! Just a bunch of thunderin' ejits, living in their little Blarney fantasy land. Sure they can't even talk right. What the hell is a 'langer', anyway. They wouldn't even know how to spell NIC.
The fact that they are right is quite beside the point.
(For a North American cultural equivalent, please see http://en.wikipedia.org/wiki/South_Park:_Bigger%2C_Longer_%26_Uncut)
Anyone who mods me down is from Cork - believe it!
Genesis 1:32 And God typed
G-GP:
if we add n redunndant[sic] fail-overs, the total system will fail with probability 1-p^n
GP:
Any number raised to the power 0 is 1. So if you don't install anything, hence n is 0, it will always work since the probability of failure is 1-1 = 0.
P:
Sometimes, pure intuition can be more handy than maths.
Only if you're not good at the math.
The way the G-GP described the system, the number of redundant fail-overs includes the primary system. With n=0, you have no system in place. No system, no possibility of system failure.
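For anyone who wants to poke at the arithmetic in this sub-thread, here is a minimal sketch. It assumes independent units (which a flapping NIC that takes the failover logic down with it clearly is not), and the function names are made up for illustration. The quoted 1-p^n matches the "everything must work" case if p is the per-unit probability of working; for redundant units that each fail with probability q, the usual textbook figure is q^n.

```python
def series_failure(p_works: float, n: int) -> float:
    """Probability that at least one of n independent units fails,
    where each unit works with probability p_works (the quoted 1 - p^n)."""
    return 1 - p_works ** n


def parallel_failure(q_fails: float, n: int) -> float:
    """Probability that all n independent redundant units fail at once,
    where each unit fails with probability q_fails."""
    return q_fails ** n


if __name__ == "__main__":
    for n in range(0, 4):
        print(n, series_failure(0.9, n), parallel_failure(0.1, n))
```

Note that with n = 0 the first formula gives a failure probability of 0 while the second gives 1, which is more or less the disagreement being had above.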
It's a cover-up for Zing Zang Zoom rolling out a rootkit protection
...it is not so black and white.
I administer a network like that. Pharmaceutical plant to be precise.
All machines on the production network have 2 independent PCI nics, connecting to 2 identical but separate networks, using separate routers and switches. The critical servers are stratus high availability servers which have dual redundant everything, driving all components in lock-step and correcting errors on the fly.
If something happens to cause a network switch over, there is a bulk of network traffic to deal with it, because sockets have to be opened and closed, state has to be transferred, system control message flow has to be restarted so that all controllers go back to the normal state, ... And at application level, everything is RPC and DCOM based, so this will cause a significant disruption for the running services, since COM objects and RPC marshalling have to be destroyed and recreated, reinitialized, ...
The whole thing is very complex from a systemwide point of view, and causes a significant disruption.
Now, switchovers can be triggered by different things, like a maintenance technician replacing a controller and causing a timeout halfway through a message flow, network cable that has to be unplugged for some reason, ...
When that happens, performance of the system sucks for the next couple of minutes.
Critical signals will work within defined tolerances, but anything else will simply stop responding during the switchover.
I can perfectly imagine that IF you have a nic which is failing in just such a way that makes the network switch back and forth, it'll disrupt and eventually kill the entire network.
Unless of course the switches are smart enough to detect this and disable the physical port. But even then, we are not talking about DDoSing, but just minor errors at exactly the right time to trigger network failover.
Network redundancy of complex systems (and air traffic is much more complex than a pharma plant) can protect very well against single or predictable failures.
But there will always be very specific failures, depending on dozens of variables, which have the power to bring down the system. Since this is the first time such an event occurred with that air traffic control system, it is likely to be one of those corner cases.
It's the same with other procedures in aviation. Everything is supposed to be double and triple redundant, but still National Geographic has enough material to create the series 'Seconds from disaster' where one unlikely error amplifies another, and in the end, the plane hits the building / swamp / ground.
I'll have you know, my good man, that not all of slashdot's readers are American.
It's true I tell you, feller at work's next door neighbour read it in the paper.
What is a "contol" and why is this so important?
http://www.dieblinkenlights.com
Ever heard of a nice thing called Spanning Tree Protocol and the Campus Model for network design? With proper network design, this would not be an issue.
I think this is a case of Sales guy vs web dude.
Jumpstart the tartan drive.
I prefer Bobby Sands Diet Bars myself.
No system, no possibility of system failure.
No possibility of success either. Brilliant! Let's all just go back to bed and forget about trying anything that might possibly fail...
If something happens to cause a network switch over, there is a bulk of network traffic to deal with it, because sockets have to be opened and closed, state has to be transferred, system control message flow has to be restarted so that all controllers go back to the normal state, ... And at application level, everything is RPC and DCOM based, so this will cause a significant disruption for the running services, since COM objects and RPC marshalling have to be destroyed and recreated, reinitialized, ...
That strikes me as the wrong approach. Use the two physical NICs as interfaces of a router and route everything from/to a logical NIC. Then there is no failover disruption. I think that today there is no need to do this so that the applications actually notice or (worse) need to be able to deal with this type of failure.
I can perfectly imagine that IF you have a nic which is failing in just such a way that makes the network switch back and forth, it'll disrupt and eventually kill the entire network.
The well established way to deal with that is lockout periods, where only a failure of the second network and a re-availability of first will cause a switch-back before a certain time has passed. Note that this is needed regardless of switchover delay, since the system can ''oscillate'' faster for faster switchovers.
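A rough sketch of the lockout idea, for illustration only. The class name, the 60 second default and the "A is the preferred network" convention are all assumptions of mine, not anything from the system in the article. A switch away from a dead active network always goes through; a voluntary switch back to the preferred network waits until the lockout has expired, so an up/down/up/down NIC cannot drag everything back and forth.

```python
from dataclasses import dataclass


@dataclass
class FailoverController:
    lockout_s: float = 60.0          # hypothetical hold-down period
    active: str = "A"                # "A" is assumed to be the preferred network
    last_switch: float = float("-inf")

    def _switch(self, target: str, now: float) -> None:
        self.active = target
        self.last_switch = now

    def update(self, now: float, a_up: bool, b_up: bool) -> str:
        """Feed in current link states; returns the network to use."""
        up = {"A": a_up, "B": b_up}
        standby = "B" if self.active == "A" else "A"
        if not up[self.active] and up[standby]:
            # Active network is down and the other is up: switch, lockout or not.
            self._switch(standby, now)
        elif (self.active == "B" and a_up
              and now - self.last_switch >= self.lockout_s):
            # Preferred network is back and the lockout has expired: switch back.
            self._switch("A", now)
        return self.active
```

With a flapping NIC on A, the controller fails over to B once and then stays there for the whole lockout, however often A reappears. It only goes back early if B itself dies while A is up.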
I see three possible problems: Either this system was not state-of-the-art and has serious issues simply because of its old and inflexible design, or the designers screwed up, or it was indeed some very complex failure that had several unexpected side effects. Avionics people are very good at learning from past disasters. Their history of prevention and prediction is not nearly as good.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Sounds like they spilled Guinness on the servers again! Either that, or it's those damned pesky Leprechauns! :-)
That's why intermittent failures are so bad, because they introduce a dynamic element into an already very complicated equation, and usual testing strategies don't go as far as this.
For example, just try to see how any routing algorithm reacts to a router that's up and down, up and down, up and down...
[Pruneau
I was on my first flight on an airplane and was excited and a bit nervous. I decide to read the book my wife gave me at the last minute for some reading material. The first line was, "I was going down fast..." regarding a crashing plane. My wife claims she had no idea of course. After the fact, I can appreciate the irony at least.
I only look human.
My mother is a halfling and my dad is an ogre, so that makes me an Ogreling
I work in a company that makes stuff for ATC. Our systems have 2 networks (2 NICs in each box, 2 sets of cables, switches etc)
Every packet is sent on both networks and alarms are set off if a packet turns up somewhere without being received on the other network.
Iffy connectors etc. are discovered pretty quickly, but it is still possible for the system to fail if, for example, a switch on the 'red' network failed and a box had a dodgy NIC on the 'green' network -- this kind of thing happens when somebody chooses to ignore an alarm because "it's working fine, we'll get round to replacing that card next week"
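To make the "send on both, alarm on a missing twin" scheme concrete, here is a toy receiver-side sketch. It is not our actual code: the sequence numbers, the 0.5 second window and the class name are assumptions for illustration. The first copy of a packet is delivered, the second is dropped as a duplicate, and an alarm is raised if the second copy never arrives.

```python
class DualNetMonitor:
    """Tracks packet copies arriving on the 'red' and 'green' networks."""

    def __init__(self, window_s: float = 0.5):
        self.window_s = window_s      # hypothetical wait for the twin copy
        self.pending = {}             # seq -> (network of first copy, arrival time)

    def receive(self, seq: int, network: str, now: float) -> bool:
        """Record one copy; return True if it should be delivered."""
        first = self.pending.get(seq)
        if first is None:
            self.pending[seq] = (network, now)
            return True               # first copy: deliver it
        if first[0] != network:
            del self.pending[seq]     # twin arrived on the other network
        return False                  # duplicate: drop it

    def check(self, now: float) -> list:
        """Return (seq, missing_network) alarms for overdue twins."""
        alarms = []
        for seq, (network, t) in list(self.pending.items()):
            if now - t > self.window_s:
                missing = "green" if network == "red" else "red"
                alarms.append((seq, missing))
                del self.pending[seq]
        return alarms
```

The point of the alarm is that traffic keeps flowing on the healthy network, so nobody notices the fault unless something shouts about it. Ignoring that alarm is how you end up one failure away from a blank screen.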
Lucky Charms never pissed me off so much as Trix did. I remember one commercial where the rabbit actually *bought his own cereal* and the kids took it because "trix are for kids". I wanted to see him mow the little fuckers down for home invasion or something...
Silly rabbit, supersonic lead is for thieving, speciesist little pricks.
I swear I've heard about something like this happening before.
Can't a Network Access Control (NAC) enabled switch with some kind of "reasonable NIC operation metrics" shut down the port the bad NIC card is on? (While notifying the network admin, of course)
...just in the NIC of time.
"Open source" is not equivalent to "source code is available to the end-user for inspection" even though that is one of the consequences of making software open source.
"Open source" means that the author places little-to-no restrictions on what the end-user may do with the software.
There are many, MANY other ways to supply a software's source code to its end-users including under contracts which explicitly place heavy restrictions on what an end-user may do with the supplied source code. I've received (from Sun) the source code to Solaris 2.6, but that was certainly never open source!
If you'd developed actual flight control systems (or even air traffic control systems such as the one which failed here) you'd know that such systems are always subjected to review by 'many eyes' -- they're just drawn from a restricted set. Think about all the havoc experienced when some hacker is the first to find a 0-day exploit in IIS or Apache and then imagine how much worse it would be if hacker-Ivan was the first to find an exploitable bug in a nuclear reactor control system or air traffic radar processing.
No. Odds are exceptionally high that whoever built this system built it to exactly the level of system availability the Irish ATC asked them to provide and the folks at-fault are still Ireland's ATC for failing to specify a more reliable system.
I know of one NIC (that was never recalled) that mysteriously stops taking interrupts every so often. I am prohibited by NDA from mentioning that manufacturer's name. Anybody want to guess who it is?
An engineer who ran for Congress. http://herbrobinson.us