Software Bug Caused Qantas Airbus A330 To Nose-Dive
pdcull writes "According to Stuff.co.nz, the Australian Transport Safety Board found that a software bug was responsible for a Qantas Airbus A330 nose-diving twice while at cruising altitude, injuring 12 people seriously and causing 39 to be taken to the hospital. The event, which happened three years ago, was found to be caused by an airspeed sensor malfunction, linked to a bug in an algorithm which 'translated the sensors' data into actions, where the flight control computer could put the plane into a nosedive using bad data from just one sensor.' A software update was installed in November 2009, and the ATSB concluded that 'as a result of this redesign, passengers, crew and operators can be confident that the same type of accident will not reoccur.' I can't help wondering just how a piece of code, which presumably didn't test its input data for validity before acting on it, could become part of a modern jet's onboard software suite?"
What are you? some kind of person that doesn't read the actual articles or documents? Oh wait.. this is slashdot. Here let me copy paste some text for you
So there you go, there actually really was validity checking performed. Multiple times per second in fact, by three separate, redundant systems. Unfortunately all 3 systems had the bug. Here is the concise summary for you:
I don't need to test my programs.. I have an error correcting modem.
No, it was when two different softwares were used to calculate thrust. The spacecraft software calculated thrust correctly in newton-seconds.
The ground software calculated thrust in pounds force-seconds. This was contrary to the software interface specification, which called out newton-seconds.
The result was that the ground-calculated trajectory was more than 20 kilometers too close to the surface.
The engineers didn't "forget to convert", they failed to read and understand the specifications.
Mission: To provide products that consume time and energy as entertainingly as permitted by the laws of thermodynamics.
How about reading the darned final report.
I highly recommend that. It's a good read. This was not a sensor problem. The problem actually occurred in the message output queue of one of the CPUs, and resulted in sending data with the label for one data item with the data from another. The same hardware unit had demonstrated similar symptoms two years earlier, but the problem could not be replicated. This time, they tried really hard to induce the problem, with everything from power noise to neutron bombardment, and were unable to do so.
There are several thousand identical hardware units in use, and one of the others demonstrated a similar problem, once. No other unit has ever demonstrated this problem. The The investigators are still puzzled. They unit which produced the errors has been tested extensively and the problem cannot be reproduced. They considered 24 different failure causes and eliminated all of them. It wasn't a stuck bit. It wasn't program memory corruption. (The code gets a CRC check every few seconds.) The code in ROM was what it was supposed to be. Thousands of other units run exactly the same software. It wasn't a single flipped bit. It wasn't a memory timing error. It wasn't a software fault. It looked like half of one 32-bit word was combined with half of another 32-bit word during queue assembly on at least some occasions. But there are errors not explained by that.
Very frustrating.
DISCLAIMER: I hate air travel, but do it most weeks.
I have worked in and around the safety critical software industry for over 20 years. The level of testing and certification that the flight control software for a commercial aircraft is subjected to far exceeds any other industry I'm familiar with. (I'm willing to be educated on nuclear power control software however.)
The actual problem on the Qantas jet was a latent defect that was exposed by a software upgrade to another system. So the bug was there for a long time and I'm sure there are still others waiting to be found. But this doesn't stop me getting on a jet at least twice a week.
As a software professional and nervous flyer, do problems with the aircraft software scare me? No not really. What scares me is the airline outsourcing maintenance to the lowest bidder in China, the pilots not getting enough break time, the idiotic military pilot who ignores airspace protocol, and the lack of english language skills in air traffic controllers and cockpit crew across the region where I fly (English is the international standard for Air Traffic Control).
A good friend is a senior training captain on A330's, and in all the stories he tells software is barely mentioned. What get's priority in the war-stories is the human factors and general equipment issues - dead nav aids, dodgy radios, stupid military pilots. One software story was an Airbus A320 losing 2 1/2 out of 3 screens immediately after takeoff from the old Hong Kong airport. The instructions on how to clear the alarm condition and perform a reset were on the "dead" bottom half of one of the screens.
A great example of software doing it's job is the TCAS system - Traffic Collision Avoidance System (http://en.wikipedia.org/wiki/Traffic_collision_avoidance_system). To quote my friend "If it had lips, he'd kiss it". It's saved his life, and the lives of 100's of passengers, at least twice. Both times through basic human error on the part of the pilot of the other aircraft.
One final thought - on average about 1000 people die in commercial aviation incidents each year world wide (source: aviation-safety.net) . In the USA, over 30,000 people die in vehicle accidents every year.
Except that the designers of the software didn't take all possible situations into account. For example, any Fly By Wire Airbus will automatically pitch up if speed increases too far above the maximum airspeed, even when flown manually. This may be a good idea when the airplane is diving (the most likely cause for overspeed), but not when it's straight and level with other traffic immediately above! This has already made several Airbus planes in heavy turbulence suddenly start to climb violently due to a sudden change in airspeed or temperature and overriding the pilot's MANUAL inputs while he's trying to avoid flying into other traffic! That's insane, and it's only one of many reasons why I can't wait to get off the Airbus fleet onto a more sensibly designed plane. (I'm currently an A320 pilot).