Software Bug Caused Qantas Airbus A330 To Nose-Dive

Posted by Unknown on Monday December 19, 2011 @05:12PM from the bugs-on-a-plane dept.

pdcull writes "According to Stuff.co.nz, the Australian Transport Safety Bureau found that a software bug was responsible for a Qantas Airbus A330 nose-diving twice while at cruising altitude, injuring 12 people seriously and causing 39 to be taken to the hospital. The event, which happened three years ago, was found to be caused by an airspeed sensor malfunction, linked to a bug in an algorithm which 'translated the sensors' data into actions, where the flight control computer could put the plane into a nosedive using bad data from just one sensor.' A software update was installed in November 2009, and the ATSB concluded that 'as a result of this redesign, passengers, crew and operators can be confident that the same type of accident will not reoccur.' I can't help wondering just how a piece of code, which presumably didn't test its input data for validity before acting on it, could become part of a modern jet's onboard software suite?"

26 of 603 comments

Re:What about Google driverless car? by Anonymous Coward · 2011-12-19 17:16 · Score: 5, Insightful

sure, but the number of accidents will likely still be fewer than those caused by human drivers.
Bad software by Hadlock · 2011-12-19 17:19 · Score: 5, Funny

I can't help wondering just how could a piece of code, which presumable didn't test its' input data for validity before acting on it, become part of a modern jet's onboard software suit?"
This, from the same company that, while building the A380 megajet, decided to upgrade half of its facilities to plant software version 5 while the other half stuck with version 3/4, and did not make the file formats compatible between the two versions, resulting in multi-month production delays.

Point being, in huge projects, simple things get overlooked (with catastrophic results). My favorite is when we slammed a $20 million NASA/ESA probe into the surface of Mars at high speed because some engineer forgot to convert mph into kph (or vice-versa).

--
moox. for a new generation.
1. Re:Bad software by RealGene · 2011-12-19 18:03 · Score: 5, Informative
  
  My favorite is when we slammed a $20 million NASA/ESA probe into the surface of Mars at high speed because some engineer forgot to convert mph into kph (or vice-versa).
  No, it was when two different pieces of software were used to calculate thrust. The spacecraft software calculated thrust correctly in newton-seconds.
  The ground software calculated thrust in pounds force-seconds. This was contrary to the software interface specification, which called out newton-seconds.
  The result was that the ground-calculated trajectory was more than 20 kilometers too close to the surface.
  The engineers didn't "forget to convert", they failed to read and understand the specifications.
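To make the units point concrete, here is a minimal sketch of an interface that refuses bare numbers, so a value computed in pound-force seconds cannot be passed off as newton-seconds. The `Impulse` class and its constructors are hypothetical names for illustration, not anything taken from the actual mission software.

```python
from dataclasses import dataclass

# 1 pound-force second expressed in newton-seconds.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """Hypothetical value type that always stores newton-seconds internally."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # The conversion happens once, at the boundary, where the unit is known.
        return cls(value * LBF_S_TO_N_S)


def total_impulse(burns: list[Impulse]) -> float:
    """Sum a list of burns, in newton-seconds."""
    return sum(b.newton_seconds for b in burns)


# The same raw number means very different things depending on the unit.
raw = 100.0
as_si = Impulse.from_newton_seconds(raw)
as_imperial = Impulse.from_pound_force_seconds(raw)
ratio = as_imperial.newton_seconds / as_si.newton_seconds  # about 4.45
```

The design choice is simply that the unit is named at the point where the number enters the system, which is where a spec mismatch like the one described above would otherwise go unnoticed.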
  
  --
  Mission: To provide products that consume time and energy as entertainingly as permitted by the laws of thermodynamics.
Re:What about Google driverless car? by timeOday · 2011-12-19 17:19 · Score: 5, Interesting

But the best part is that once you fix a bug in an automated system, it's fixed forever, whereas a fresh new crop of novices hits the roads/skies every day.
There were people against airbags, too, because they killed some people who otherwise wouldn't have died. You work on fixing those things. But whether the system as a whole is worthwhile is judged on whether it saves more than it kills.
don't just wonder, learn by fche · 2011-12-19 17:22 · Score: 5, Interesting

"I can't help wondering just how could a piece of code, which presumable didn't test its' input data for validity before acting on it, become part of a modern jet's onboard software suit?""
How about reading the darned final report, conveniently linked in your own blurb? There was lots of validity checking. In fact, some of it was relatively recently changed, and that accidentally introduced this failure mode (the 1.2-second data spike holdover). (Also, how about someone spell-checking submissions?)
1. Re:don't just wonder, learn by inasity_rules · 2011-12-19 17:33 · Score: 5, Interesting
  
  Mod parent up. Anyhow, information from a sensor may be valid but inaccurate. I deal with these types of systems regularly (not in aircraft, but control systems in general), and it is sometimes impossible to tell without extra sensors. It's one thing to detect a "broken wire" fault, and a completely different thing to detect a 20% calibration fault, for example, so validity checking can only take you so far. It's actually impressive the failure mode in this case caused so little damage.
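A rough sketch of that distinction, assuming a 4-20 mA current-loop sensor purely as an example (the function names and numbers are made up for illustration): a range check catches the broken wire, but a reading that is 20% off still looks perfectly valid unless there are redundant sensors to compare against.

```python
import statistics


def range_check(reading_ma: float, low: float = 4.0, high: float = 20.0) -> bool:
    """True if the reading is inside the electrically valid range."""
    return low <= reading_ma <= high


def cross_check(readings: list[float], tolerance: float) -> list[int]:
    """Return indexes of readings that deviate from the median by more than tolerance."""
    median = statistics.median(readings)
    return [i for i, r in enumerate(readings) if abs(r - median) > tolerance]


broken_wire = 0.0        # open circuit: no current at all
true_value = 10.0
miscalibrated = true_value * 1.2  # 20% calibration fault, reads 12.0

broken_wire_ok = range_check(broken_wire)        # False: detectable on its own
miscalibrated_ok = range_check(miscalibrated)    # True: looks valid in isolation

# Only with extra sensors does the miscalibrated one stand out.
suspects = cross_check([10.0, 10.1, miscalibrated], tolerance=1.0)
```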
  
  --
  I have determined that my sig is indeterminate.
2. Re:don't just wonder, learn by Animats · 2011-12-19 18:50 · Score: 5, Informative
  
  How about reading the darned final report.
  
  I highly recommend that. It's a good read. This was not a sensor problem. The problem actually occurred in the message output queue of one of the CPUs, and resulted in sending data with the label for one data item with the data from another. The same hardware unit had demonstrated similar symptoms two years earlier, but the problem could not be replicated. This time, they tried really hard to induce the problem, with everything from power noise to neutron bombardment, and were unable to do so.
  There are several thousand identical hardware units in use, and one of the others demonstrated a similar problem, once. No other unit has ever demonstrated this problem. The investigators are still puzzled. The unit which produced the errors has been tested extensively and the problem cannot be reproduced. They considered 24 different failure causes and eliminated all of them. It wasn't a stuck bit. It wasn't program memory corruption. (The code gets a CRC check every few seconds.) The code in ROM was what it was supposed to be. Thousands of other units run exactly the same software. It wasn't a single flipped bit. It wasn't a memory timing error. It wasn't a software fault. It looked like half of one 32-bit word was combined with half of another 32-bit word during queue assembly on at least some occasions. But there are errors not explained by that.
  Very frustrating.
Re:What about Google driverless car? by Pikoro · 2011-12-19 17:24 · Score: 5, Insightful

Even on the road today this is an issue. Doesn't matter how good of a driver you are. If one other idiot on the road is driving crazy, you could get killed no matter how you drive. Weakest link and all that...

--
"Freedom in the USA is not the ability to do what you want. It is the ability to stop others from doing what THEY want"
Re:What about Google driverless car? by Anonymous Coward · 2011-12-19 17:25 · Score: 5, Insightful

Which is actually why Airbus relies on sensor input over the "pilot". Boeing believes in the opposite. I'm inclined to believe Airbus in that the majority of accidents are human error over computer error.
The problem with aviation accidents is the relatively small sample size. With cars there will be much better data (i.e. more data points).
If anything computer driven cars will be better - since due to the safety "fears" like the OP, they will be programmed to be cautious. They have to be better at handling conditions than human operators, otherwise it's instant blame. They have to be better to the degree that you can blow the stats out of the water. e.g. When the first computer driven car hits a person, they need to say "well based on hours on the road, if it was human driving this it would have hit 30 kids by now".
Re:What about Google driverless car? by Geldon · 2011-12-19 17:29 · Score: 5, Interesting

It's so interesting to see people's reaction to the whole driver-less car thing. It's incredible to see the kind of ethical thought-experiment that must necessarily go through everyone's mind when they come to this conclusion: How many lives must be saved before I will tolerate someone being brutally slain by a malfunctioning computer?
Every day, children are run down by drivers who are not paying attention, tired, drunk, or just plain don't have time to react. Since a driver-less car is incapable of being drunk, tired, or distracted, then it's a safe bet that they'll be much better at avoiding those accidents that can be avoided. But the reality is that the latter scenario (no time to react) would still lead to the deaths of many children (and others!).
At what point does it become "worth it"? When the driver-less car causes 1/10th as many fatalities? 1/100th? 1/1,000th? How many human deaths must be prevented by letting computers drive cars before we're willing to accept 1 single death by those same computers?
It's a real-life example of the "Trolley Problem"
http://en.wikipedia.org/wiki/Trolley_problem
What? by Spikeles · 2011-12-19 17:37 · Score: 5, Informative

"I can't help wondering just how could a piece of code, which presumable didn't test its' input data for validity before acting on it, become part of a modern jet's onboard software suit?"" - pdcull

What are you? some kind of person that doesn't read the actual articles or documents? Oh wait.. this is slashdot. Here let me copy paste some text for you

If any of the three values deviated from the median by more than a predetermined threshold for more than 1 second, then the FCPC rejected the relevant ADR for the remainder of the flight.

The FCPC compared the three ADIRUs’ values of each parameter for consistency. If any of the values differed from the median (middle) value by more than a threshold amount for longer than a set period of time, then the FCPC rejected the relevant part of the associated ADIRU (that is, ADR or IR) for the remainder of the flight.
So there you go, there actually really was validity checking performed. Multiple times per second in fact, by three separate, redundant systems. Unfortunately all 3 systems had the bug. Here is the concise summary for you:

The FCPC’s AOA algorithm could not effectively manage a scenario where there were multiple spikes such that one triggered a memorisation period and another was present 1.2 seconds later. The problem was that, if a 1.2-second memorisation period was triggered, the FCPCs accepted the next values of AOA 1 and AOA 2 after the end of the memorisation period as valid. In other words, the algorithm did not effectively handle the transition from the end of a memorisation period back to the normal operating mode when a second data spike was present.
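The transition problem described in that summary can be sketched with a toy spike filter. This is a simplified model under stated assumptions, not the actual FCPC algorithm: the function name, the sample-based hold period, and the `recheck_on_exit` switch are all hypothetical. With the switch off, the first sample after a hold period is trusted unconditionally, which is the failure mode quoted above.

```python
def filter_with_holdover(samples, threshold, hold_samples, recheck_on_exit):
    """Toy spike filter that repeats the last good value for a hold period.

    samples         -- list of raw sensor values
    threshold       -- jump from the last good value that counts as a spike
    hold_samples    -- how many following samples to ignore after a spike
    recheck_on_exit -- if False, the first sample after the hold is accepted blindly
    """
    output = []
    last_good = samples[0]
    hold_left = 0
    just_exited = False
    for value in samples:
        if hold_left > 0:
            # Inside the hold ("memorisation") period: ignore the input.
            hold_left -= 1
            just_exited = hold_left == 0
            output.append(last_good)
            continue
        is_spike = abs(value - last_good) > threshold
        if just_exited and not recheck_on_exit:
            # Flawed transition: whatever arrives right after the hold is trusted.
            is_spike = False
        just_exited = False
        if is_spike:
            hold_left = hold_samples
            output.append(last_good)
        else:
            last_good = value
            output.append(value)
    return output


# Two spikes, the second one landing exactly when the hold period ends.
raw = [2.0, 2.1, 50.0, 2.0, 2.1, 50.0, 2.0]
flawed = filter_with_holdover(raw, threshold=5.0, hold_samples=2, recheck_on_exit=False)
checked = filter_with_holdover(raw, threshold=5.0, hold_samples=2, recheck_on_exit=True)
```

In the flawed run the second spike passes straight through and then becomes the "last good" value; in the checked run it is rejected like the first. Note that this toy version would also hold back a genuine, persistent change indefinitely, so it is only meant to show the transition bug, not a complete design.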

--
I don't need to test my programs.. I have an error correcting modem.
Re:it's more complicated than that by holophrastic · 2011-12-19 17:37 · Score: 5, Interesting

yup. all the while forgetting that while the altimeter shows altitude, it rarely actually measures distance to the ground, it measures air pressure, and then assumes an awful lot.
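For anyone curious what "assumes an awful lot" looks like in numbers, here is a minimal sketch using the standard-atmosphere pressure/altitude relation for the lower atmosphere. The function name is made up, and a real altimeter is of course not implemented like this; the point is only that the output depends on an assumed reference pressure and an assumed temperature profile, not on the ground.

```python
# Standard-atmosphere constants (valid for the troposphere only).
SEA_LEVEL_PRESSURE_PA = 101325.0
SEA_LEVEL_TEMP_K = 288.15
LAPSE_RATE_K_PER_M = 0.0065
GAS_CONSTANT = 8.31446       # J/(mol*K)
GRAVITY = 9.80665            # m/s^2
MOLAR_MASS_AIR = 0.0289644   # kg/mol


def indicated_altitude_m(static_pressure_pa, reference_pressure_pa=SEA_LEVEL_PRESSURE_PA):
    """Altitude implied by a pressure reading, given an assumed reference pressure."""
    exponent = GAS_CONSTANT * LAPSE_RATE_K_PER_M / (GRAVITY * MOLAR_MASS_AIR)
    ratio = static_pressure_pa / reference_pressure_pa
    return (SEA_LEVEL_TEMP_K / LAPSE_RATE_K_PER_M) * (1.0 - ratio ** exponent)


# Sitting at sea level on a low-pressure day (1003 hPa):
wrong_setting = indicated_altitude_m(100300.0)             # reference left at standard
right_setting = indicated_altitude_m(100300.0, 100300.0)   # reference matches the day
```

With the reference left at the standard value, the instrument would show somewhere around 85 m while still sitting at sea level; with the reference matched to the actual pressure it shows zero.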
we already fixed it. its called 'trains'. by decora · 2011-12-19 17:54 · Score: 5, Insightful

the idea that a bunch of automatically piloted vehicles is somehow a better solution to city transport than mass-transit, it boggles my mind.
real people do not have money to maintain their cars properly. things are going to break. there are not going to be 'system administrators' to fix all the glitches that come up when cars start breaking down after a few years.
there will be problems. do i know which problems? no, but i know the main problem.
arrogance amongst revolutionaries. it is historically a pattern of the human species. declaring that nothing could go wrong is usually a precursor to a lot of things going wrong. not because the situation was unpredictable, but because human beings in an arrogant mindset tend to make a lot of mistakes, be reckless, and try to cover their asses when things go wrong.
but successful engineering is the antithesis of arrogance. nobody worth his salt is going to say 'what could go wrong?' they are going to have a list of 500 things that could go wrong, and all the ways they have tried to counter-act those wrong things happening.
Re:What about Google driverless car? by slew · 2011-12-19 18:22 · Score: 5, Insightful

Which is actually why Airbus relies on sensor input over the "pilot". Boeing believes in the opposite. I'm inclined to believe Airbus in that the majority of accidents are human error over computer error.
Sometimes in a flight like AF447 the computer doesn't know jack either and gives up the ghost. In the AF447 flight (equipment: Airbus A330), apparently, the pitot sensors gave inconsistent readings and the autopilot disengaged. What ensued was apparently what can happen when you have pilots that are error prone and a computer that doesn't know what the hell to do to help them. In these situations, I think it's prudent to still have a system that defaults to the pilot as if they knew what to do when they know the sensors have crapped out, and apparently even Airbus agrees with this. Unfortunately, it appears that the AF447 pilots were not up to the challenge in this circumstance.
Re:What about Google driverless car? by BeShaMo · 2011-12-19 18:27 · Score: 5, Funny

Clearly the solution is that I, as the only decent driver around, will be the only person allowed to drive myself, while everybody else must be using these Google cars.
Re:What about Google driverless car? by Anonymous Coward · 2011-12-19 18:56 · Score: 5, Insightful

A good driver, by definition, mitigates the bad driver by taking appropriate actions to reduce the risk. It is not how you drive, it's how you manage the drivers around you that makes you a good driver.
Re:What about Google driverless car? by hairyfeet · 2011-12-19 18:56 · Score: 5, Insightful

And that would make it different than today when I nearly got run over by a moron playing with his cell....how exactly? When I was a kid we were taught "This is a 2000 pound weapon, you treat it like a weapon and respect it or someone could die, maybe you, maybe someone else" and even then we still liked to drive fast but today? Jesus tap dancing christ I've not seen a bigger bunch of dipshits in my entire life than what I see on the road every damned day! Dipshit men playing with their phones, dipshit women putting on makeup AND playing with their phones, it's like moron bumper cars out there pal!
That is why the other day when I saw my oldest a couple of car lengths ahead of me (and I knew he couldn't see me from where I was at) and saw him pull over into a lot and get out I just had to pull in behind him. I just knew why he had pulled in but when I asked him and he said "Somebody called me so I was pulling over so I could return the call" I immediately pulled out a twenty and handed it to him, saying "Having a brain is a damned rare thing in this world, smarts should be rewarded".
Frankly I'm all for Google car because at its worst it can't be as dangerous as the braintrusts on our roads. With my oldest taking 18 hours next semester it's not HIS driving I worry about every day, it's the dipshits with too many toys and not enough functional brain cells. If the Google car takes the keys away from even 20% of these numbnuts frankly the accidents will plummet, and that can only be of the good.

--
ACs don't waste your time replying, your posts are never seen by me.
Yes it was a software problem, but .... by oneblokeinoz · 2011-12-19 19:00 · Score: 5, Informative

DISCLAIMER: I hate air travel, but do it most weeks.

I have worked in and around the safety critical software industry for over 20 years. The level of testing and certification that the flight control software for a commercial aircraft is subjected to far exceeds any other industry I'm familiar with. (I'm willing to be educated on nuclear power control software however.)

The actual problem on the Qantas jet was a latent defect that was exposed by a software upgrade to another system. So the bug was there for a long time and I'm sure there are still others waiting to be found. But this doesn't stop me getting on a jet at least twice a week.

As a software professional and nervous flyer, do problems with the aircraft software scare me? No, not really. What scares me is the airline outsourcing maintenance to the lowest bidder in China, the pilots not getting enough break time, the idiotic military pilot who ignores airspace protocol, and the lack of English language skills in air traffic controllers and cockpit crew across the region where I fly (English is the international standard for Air Traffic Control).

A good friend is a senior training captain on A330s, and in all the stories he tells software is barely mentioned. What gets priority in the war stories is the human factors and general equipment issues - dead nav aids, dodgy radios, stupid military pilots. One software story was an Airbus A320 losing 2 1/2 out of 3 screens immediately after takeoff from the old Hong Kong airport. The instructions on how to clear the alarm condition and perform a reset were on the "dead" bottom half of one of the screens.

A great example of software doing its job is the TCAS system - Traffic Collision Avoidance System (http://en.wikipedia.org/wiki/Traffic_collision_avoidance_system). To quote my friend "If it had lips, he'd kiss it". It's saved his life, and the lives of 100s of passengers, at least twice. Both times through basic human error on the part of the pilot of the other aircraft.

One final thought - on average about 1000 people die in commercial aviation incidents each year worldwide (source: aviation-safety.net). In the USA, over 30,000 people die in vehicle accidents every year.
Re:What about Google driverless car? by Belial6 · 2011-12-19 19:20 · Score: 5, Insightful

Yep, my wife got hit by a semi while sitting stopped at a red light.
I had the problem once by Anonymous Coward · 2011-12-19 19:33 · Score: 5, Interesting

Posting anon because I moderated.
I had a very similar problem once with firmware on a TI DSP. The symptom was that a Peltier element for controlling laser temperature would sometimes freak out and start burning so hot that the solder melted. After some debugging, it turned out that somewhere between the EEPROM holding the setpoint, and the AD converter, the setpoint value got corrupted.
The cause turned out to be a 32-bit variable that was un-initialized, but always set to 0 by the stack initialization code.
Only the first 16 bits were filled in because that was the value stored in the EEPROM. The programming bug was that the other 16 bits were left as is. More than 99% of the time, this was not a problem. But if a specific interrupt happened at exactly the wrong moment during initialization of the stack variable, that variable was filled with garbage from an interrupt register value. Since the calculations for the setpoint used the entire 32 bits (it was integer math) it came out with a ridiculously high setpoint.
Having had to debug that, I know how hard it can be if your bug depends on what is going on inside the CPU or related to interrupts.
There may only be a window of less than a microsecond for this bug to happen, so reproduction could be nigh on impossible.
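The half-filled variable is easy to model with plain bit operations. This is only an illustration of the mechanism described above, not the original DSP firmware; the function names and the example values are made up.

```python
def load_setpoint_buggy(stale_word: int, eeprom_value_16: int) -> int:
    """Overwrite only the low 16 bits; the high 16 bits keep whatever was there."""
    return (stale_word & 0xFFFF0000) | (eeprom_value_16 & 0xFFFF)


def load_setpoint_fixed(stale_word: int, eeprom_value_16: int) -> int:
    """Assign the whole 32-bit word, so stale contents cannot leak through."""
    return eeprom_value_16 & 0xFFFF


eeprom_setpoint = 0x1234

# Usual case: the word happened to be zeroed beforehand, so the bug is invisible.
usual = load_setpoint_buggy(0x00000000, eeprom_setpoint)

# Rare case: an interrupt left garbage in the word just before the load.
unlucky = load_setpoint_buggy(0xBEEF0000, eeprom_setpoint)

# The fixed version gives the same answer either way.
safe = load_setpoint_fixed(0xBEEF0000, eeprom_setpoint)
```

Since the two cases differ only in what was in memory beforehand, testing the load routine on its own would pass every time, which fits with how hard the bug was to reproduce.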
Re:What about Google driverless car? by andyn · 2011-12-19 20:57 · Score: 5, Interesting

Back in my Finnish Air Force days I talked to a captain who had flown the F-18C in his last three active flight years. He told me that when you're straight and level in the Hornet and peek over your shoulder you probably see the ailerons swaying back and forth as the computer tries to keep the plane stable.
Re:What about Google driverless car? by ewanm89 · 2011-12-19 22:00 · Score: 5, Insightful

Okay, a few facts, the A330 is fly by wire, this means between pilot and control surfaces everything must go through the avionics, if the avionics totally fails then that plane is by definition little more than a glorified missile.
That said, it seems the backups and pilot responded exactly as they should have in this case. The plane pitched, enough to throw the passengers around and cause injuries, pilot disengaged autopilot and corrected, declared an emergency and safely landed at the nearest big enough airport.
Please tell me how he did anything wrong? Please tell me how the rest of the computer systems failed so as to cause an actual crash? Nope, neither: the plane was left in one piece on the ground.
The only thing I say is, why did it take Airbus 2 years to find and fix that major bug?
Re:What about Google driverless car? by michelcolman · 2011-12-19 22:34 · Score: 5, Informative

Except that the designers of the software didn't take all possible situations into account. For example, any Fly By Wire Airbus will automatically pitch up if speed increases too far above the maximum airspeed, even when flown manually. This may be a good idea when the airplane is diving (the most likely cause for overspeed), but not when it's straight and level with other traffic immediately above! This has already made several Airbus planes in heavy turbulence suddenly start to climb violently due to a sudden change in airspeed or temperature and overriding the pilot's MANUAL inputs while he's trying to avoid flying into other traffic! That's insane, and it's only one of many reasons why I can't wait to get off the Airbus fleet onto a more sensibly designed plane. (I'm currently an A320 pilot).
Re:What about Google driverless car? by robi5 · 2011-12-20 00:09 · Score: 5, Funny

Except that the designers of the software didn't take all possible situations into account. For example, any Fly By Wire Airbus will automatically pitch up if speed increases too far above the maximum airspeed, even when flown manually. This may be a good idea when the airplane is diving (the most likely cause for overspeed), but not when it's straight and level with other traffic immediately above!

Except if that other traffic is also an Airbus.
Re:What about Google driverless car? by ultranova · 2011-12-20 01:02 · Score: 5, Insightful

A good driver, by definition, mitigates the bad driver by taking appropriate actions to reduce the risk.

So how will you reduce the risk of someone next to you suddenly deciding to switch lanes without checking that you're there? How do you reduce the risk of someone deciding he just has to pass the car in front of him even when there's oncoming traffic? How do you reduce the risk of someone deciding to test his engine and losing control?
It doesn't matter how good a driver you are; if someone else screws up bad enough, you're dead.

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
Re:What about Google driverless car? by Walter+White · 2011-12-20 01:36 · Score: 5, Insightful

So how will you reduce the risk of someone next to you suddenly deciding to switch the lanes without checking that you're there? ...
You reduce that risk by not staying next to another driver any longer than you have to.
You watch the drivers around you and anticipate what stupid things they might do that would endanger you. Then you decide what actions you need to take to minimize that risk. Then you take those actions. That's what defensive driving is all about.
It's not easy and can't really be done while jabbering on the phone. And it's not very satisfying to the ego to drop behind another driver who is a little more aggressive than you, but it can pay off in a reduction of accidents caused by others.
Yes, I'm sure one can point out situations where there is little to no opportunity to avoid the actions of others, but in far more situations there is plenty of opportunity to minimize the risks due to other drivers' stupidity.