Slashdot Mirror


Software Bug Caused Qantas Airbus A330 To Nose-Dive

pdcull writes "According to Stuff.co.nz, the Australian Transport Safety Board found that a software bug was responsible for a Qantas Airbus A330 nose-diving twice while at cruising altitude, injuring 12 people seriously and causing 39 to be taken to the hospital. The event, which happened three years ago, was found to be caused by an airspeed sensor malfunction, linked to a bug in an algorithm which 'translated the sensors' data into actions, where the flight control computer could put the plane into a nosedive using bad data from just one sensor.' A software update was installed in November 2009, and the ATSB concluded that 'as a result of this redesign, passengers, crew and operators can be confident that the same type of accident will not reoccur.' I can't help wondering just how a piece of code, which presumably didn't test its input data for validity before acting on it, could become part of a modern jet's onboard software suite?"

5 of 603 comments (clear)

  1. What? by Spikeles · · Score: 5, Informative

    "I can't help wondering just how could a piece of code, which presumable didn't test its' input data for validity before acting on it, become part of a modern jet's onboard software suit?"" - pdcull

    What are you? some kind of person that doesn't read the actual articles or documents? Oh wait.. this is slashdot. Here let me copy paste some text for you

    If any of the three values deviated from the median by more than a predetermined threshold for more than 1 second, then the FCPC rejected the relevant ADR for the remainder of the flight.

    The FCPC compared the three ADIRUs’ values of each parameter for consistency. If any of the values differed from the median (middle) value by more than a threshold amount for longer than a set period of time, then the FCPC rejected the relevant part of the associated ADIRU (that is, ADR or IR) for the remainder of the flight.

    So there you go, there actually really was validity checking performed. Multiple times per second in fact, by three separate, redundant systems. Unfortunately all 3 systems had the bug. Here is the concise summary for you:

    The FCPC’s AOA algorithm could not effectively manage a scenario where there were multiple spikes such that one triggered a memorisation period and another was present 1.2 seconds later. The problem was that, if a 1.2-second memorisation period was triggered, the FCPCs accepted the next values of AOA 1 and AOA 2 after the end of the memorisation period as valid. In other words, the algorithm did not effectively handle the transition from the end of a memorisation period back to the normal operating mode when a second data spike was present.

    --
    I don't need to test my programs.. I have an error correcting modem.
  2. Re:Bad software by RealGene · · Score: 5, Informative

    My favorite is when we slammed a $20 million NASA/ESA probe in to the surface of mars at high speed because some engineer forgot to convert mph in to kph (or vice-versa).

    No, it was when two different softwares were used to calculate thrust. The spacecraft software calculated thrust correctly in newton-seconds.
    The ground software calculated thrust in pounds force-seconds. This was contrary to the software interface specification, which called out newton-seconds.
    The result was that the ground-calculated trajectory was more than 20 kilometers too close to the surface.
    The engineers didn't "forget to convert", they failed to read and understand the specifications.

    --
    Mission: To provide products that consume time and energy as entertainingly as permitted by the laws of thermodynamics.
  3. Re:don't just wonder, learn by Animats · · Score: 5, Informative

    How about reading the darned final report.

    I highly recommend that. It's a good read. This was not a sensor problem. The problem actually occurred in the message output queue of one of the CPUs, and resulted in sending data with the label for one data item with the data from another. The same hardware unit had demonstrated similar symptoms two years earlier, but the problem could not be replicated. This time, they tried really hard to induce the problem, with everything from power noise to neutron bombardment, and were unable to do so.

    There are several thousand identical hardware units in use, and one of the others demonstrated a similar problem, once. No other unit has ever demonstrated this problem. The The investigators are still puzzled. They unit which produced the errors has been tested extensively and the problem cannot be reproduced. They considered 24 different failure causes and eliminated all of them. It wasn't a stuck bit. It wasn't program memory corruption. (The code gets a CRC check every few seconds.) The code in ROM was what it was supposed to be. Thousands of other units run exactly the same software. It wasn't a single flipped bit. It wasn't a memory timing error. It wasn't a software fault. It looked like half of one 32-bit word was combined with half of another 32-bit word during queue assembly on at least some occasions. But there are errors not explained by that.

    Very frustrating.

  4. Yes it was a software problem, but .... by oneblokeinoz · · Score: 5, Informative

    DISCLAIMER: I hate air travel, but do it most weeks.

    I have worked in and around the safety critical software industry for over 20 years. The level of testing and certification that the flight control software for a commercial aircraft is subjected to far exceeds any other industry I'm familiar with. (I'm willing to be educated on nuclear power control software however.)

    The actual problem on the Qantas jet was a latent defect that was exposed by a software upgrade to another system. So the bug was there for a long time and I'm sure there are still others waiting to be found. But this doesn't stop me getting on a jet at least twice a week.

    As a software professional and nervous flyer, do problems with the aircraft software scare me? No not really. What scares me is the airline outsourcing maintenance to the lowest bidder in China, the pilots not getting enough break time, the idiotic military pilot who ignores airspace protocol, and the lack of english language skills in air traffic controllers and cockpit crew across the region where I fly (English is the international standard for Air Traffic Control).

    A good friend is a senior training captain on A330's, and in all the stories he tells software is barely mentioned. What get's priority in the war-stories is the human factors and general equipment issues - dead nav aids, dodgy radios, stupid military pilots. One software story was an Airbus A320 losing 2 1/2 out of 3 screens immediately after takeoff from the old Hong Kong airport. The instructions on how to clear the alarm condition and perform a reset were on the "dead" bottom half of one of the screens.

    A great example of software doing it's job is the TCAS system - Traffic Collision Avoidance System (http://en.wikipedia.org/wiki/Traffic_collision_avoidance_system). To quote my friend "If it had lips, he'd kiss it". It's saved his life, and the lives of 100's of passengers, at least twice. Both times through basic human error on the part of the pilot of the other aircraft.

    One final thought - on average about 1000 people die in commercial aviation incidents each year world wide (source: aviation-safety.net) . In the USA, over 30,000 people die in vehicle accidents every year.

  5. Re:What about Google driverless car? by michelcolman · · Score: 5, Informative

    Except that the designers of the software didn't take all possible situations into account. For example, any Fly By Wire Airbus will automatically pitch up if speed increases too far above the maximum airspeed, even when flown manually. This may be a good idea when the airplane is diving (the most likely cause for overspeed), but not when it's straight and level with other traffic immediately above! This has already made several Airbus planes in heavy turbulence suddenly start to climb violently due to a sudden change in airspeed or temperature and overriding the pilot's MANUAL inputs while he's trying to avoid flying into other traffic! That's insane, and it's only one of many reasons why I can't wait to get off the Airbus fleet onto a more sensibly designed plane. (I'm currently an A320 pilot).