Slashdot Mirror


Missing Files Blamed For Deadly A400M Crash

An anonymous reader writes: Think you had a bad day when your software drivers go missing? Rejoice, you get to live! A fatal A400M crash was linked to a data-wipe mistake during an engine software update. A military plane crash in Spain was probably caused by computer files being accidentally wiped from three of its engines, according to investigators. Plane-maker Airbus discovered anomalies in the A400M's data logs after the crash, suggesting a software fault. And it has now emerged that Spanish investigators suspect files needed to interpret its engine readings had been deleted by mistake. This would have caused the affected propellers to spin too slowly, causing loss of power and, eventually, a crash.

32 of 253 comments

  1. Good god. by Anonymous Coward · · Score: 5, Insightful

    Is it so hard to have an integrity check and diagnostic set run as part of the preflight checks? If you can place hundreds of miles of wire and know what's what, surely they have computer engineers competent enough to make something like this to catch such glaring errors.
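
    A minimal sketch of the kind of preflight integrity check asked for above. Everything here is an assumption made for illustration: the file names, the manifest layout, and the choice of SHA-256 are hypothetical and say nothing about how the A400M's engine software is actually organized.

```python
import hashlib


def sha256_hex(data):
    """Return the SHA-256 digest of a byte string as hex."""
    return hashlib.sha256(data).hexdigest()


def preflight_check(manifest, files):
    """Compare installed files against a manifest of expected digests.

    manifest: dict mapping file name -> expected SHA-256 hex digest
    files:    dict mapping file name -> installed file contents (bytes)
    Returns a list of problems; an empty list means the check passed.
    """
    problems = []
    for name, expected in manifest.items():
        if name not in files:
            problems.append(f"missing: {name}")
        elif sha256_hex(files[name]) != expected:
            problems.append(f"corrupt: {name}")
    return problems


# Hypothetical calibration files for four engines.
calibration = {f"engine{i}.cal": f"torque-table-{i}".encode() for i in range(1, 5)}
manifest = {name: sha256_hex(data) for name, data in calibration.items()}

# Simulate an update that wiped the files for three of the four engines.
installed = {"engine1.cal": calibration["engine1.cal"]}
problems = preflight_check(manifest, installed)
print(problems)
```

    The point of returning a list rather than a single pass/fail flag is that a ground crew would see every missing or corrupt file at once, before anyone starts an engine.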

    1. Re:Good god. by Anonymous Coward · · Score: 5, Insightful

      We've lost that kind of 'slow down and make sure it's right' attitude that engineers really need to have. In this fast-paced road of cutting costs and letting the marketing group run the show, the pressure to get product out the door as quickly as possible no matter what is unstoppable for software in particular, but really almost anything that is able to be 'patched' later. Making consumers into your beta testers is douche-y enough, but doing it when lives are at stake should be punished as criminal and in an extremely harsh and public way.

    2. Re:Good god. by fuzzyfuzzyfungus · · Score: 4, Insightful

      What's surprising to me is not merely that; but if the calibration data are so important that the engine shuts down without them, how did the aircraft take off?

      If the calibration data are nice, good for fuel economy, improve reliability, etc. you'd expect things to continue working without them, albeit possibly not as well as the manual specifies.

      If the calibration data are Absolutely Vital Lest The Engine Throw A Propeller Right Through The Cockpit, or something of that nature, how did the aircraft allow you to take off with 75% of the data missing? An actual error handling arrangement would, of course, be in good taste; but even without one I would have (naively, apparently) expected the situation to take one of two courses: if the data are semi-optional, things would work, if perhaps not well. If they are Vital, attempting to get off the ground would have failed. Successful takeoff, followed by shutdown and fiery death, though, seems weird.

    3. Re:Good god. by ChumpusRex2003 · · Score: 3, Insightful

      My guess is that rather than "files" per se, these are look-up tables which were statically linked into the binary.

      On this type of safety critical application, it's a key design aim to avoid code which might fail or throw an exception at runtime. So, rather than load data from a file, which could fail due to a memory allocation failure, a file system failure, etc. the relevant data is static linked, so if the executable successfully launches, it cannot fail to have the data available.

      I don't know what these tables might have been mapping, but conceivably if they were torque tuning parameters, the engine might still have run if the data was all NULLs, but delivered the incorrect torque in response to control inputs. Of course, if the missing data was things like fueling data, then the engine may have failed to start.

    4. Re:Good god. by Spy+Handler · · Score: 4, Informative

      if the calibration data are so important that the engine shuts down without them, how did the aircraft take off?

      One engine delivering full power and 3 engines running at low RPM would be enough to take off, since the plane was empty and probably had a small fuel load as well.

      Wiki has an article on the crash: http://en.wikipedia.org/wiki/2...

      Looks like they took off, but noticed a problem with the engines, turned around to do an emergency landing, but hit an electrical pylon and crashed. So it's not like they lost all power and fell out of the sky, they had some power and were doing an emergency landing when they hit an object on the ground just before touchdown. 2 of the 6 people on the plane survived.

    5. Re:Good god. by TWX · · Score: 4, Insightful

      Case in point, the Toyota vehicle acceleration problem.

      --
      Do not look into laser with remaining eye.
    6. Re:Good god. by tomhath · · Score: 4, Informative

      As I read it, the files weren't used until the plane was 400 feet off the ground. So takeoff wasn't a problem.

    7. Re:Good god. by Anonymous Coward · · Score: 5, Insightful

      Read the article... the warning was not designed to kick in until the aircraft was at an altitude of 400ft (120m).

      Not only do you not know you have a problem until you are in the air. You don't know you have a catastrophic problem until you are at an unsurvivable altitude. Too low to effectively use a parachute. Too high to just 'jump out' or belly-land it.

      The worst thing is... a committee signed off that this was an 'acceptable risk'. Members of that committee should be brought up on criminal negligence and manslaughter charges.

      Not a Luddite, but give me my bicycle back...

    8. Re:Good god. by ArcadeMan · · Score: 3, Funny

      I heard the first patch recommendation came from the marketing department, but management refused their idea of cutting one leg of each Toyota owner.

    9. Re:Good god. by joh · · Score: 4, Insightful

      We've lost that kind of 'slow down and make sure it's right' attitude that engineers really need to have. In this fast-paced road of cutting costs and letting the marketing group run the show, the pressure to get product out the door as quickly as possible no matter what is unstoppable for software in particular, but really almost anything that is able to be 'patched' later. Making consumers into your beta testers is douche-y enough, but doing it when lives are at stake should be punished as criminal and in an extremely harsh and public way.

      As far as I know aerospace software is far away from what you describe. Of course you're right if you say that these things are a reason for problems, but THIS is very well understood and usually software for planes is nothing like a consumer product.

      They screwed up, yes, but if they would be "punished as criminal and in an extremely harsh and public way" nobody would ever do anything useful anymore. The problems leading to this crash have to be analyzed and understood and then they have to make sure that the same thing can't happen again.

      But of course: If this was due to someone not following procedures or messing around with maintenance this can (and will) have consequences. I'm also pretty sure that one or more people will lose their job over that.

      But if you really think you can make shit never happen and things working 100% all the time by "hard punishment" you're just wrong.

    10. Re:Good god. by TWX · · Score: 5, Insightful

      If I remember right, Toyota got into trouble in court when the firmware provided to the investigators did not match the firmware in the vehicles. Toyota never did provide the real code if memory serves.

      --
      Do not look into laser with remaining eye.
    11. Re:Good god. by Casper0082 · · Score: 3, Informative

      As others have mentioned, limp mode is not just a transmission control feature. It is a fail safe built into the Engine Control Unit. When all sensors are operating correctly, the car has a map to determine the appropriate air/fuel mixture taking into consideration temperature, pressure, exhaust etc to ensure your car runs optimally. When in limp mode, the ECU cannot trust the sensors to determine the optimal air/fuel ratio. There is a base map that will allow your car to run with a rich fuel mixture (safer than lean) to prevent damage to the motor. Besides being worse for emissions and fuel economy, you can drive the car normally until you can get the issue repaired.
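
      A rough sketch of the fallback logic described above, with made-up numbers. The map values, the plausibility ranges, and the function names are all assumptions for illustration; a real ECU map has many more dimensions.

```python
# Hypothetical values; real ECU maps are far larger and vendor specific.
NORMAL_MAP = {"cold": 13.5, "warm": 14.7}  # target air/fuel ratio by temperature band
BASE_MAP_AFR = 12.5                        # deliberately rich fallback ratio


def plausible(reading, low, high):
    """A reading is trusted only if it exists and lies in a sane range."""
    return reading is not None and low <= reading <= high


def target_afr(coolant_temp_c, manifold_kpa):
    """Return (air/fuel ratio, limp_mode) for the current sensor readings."""
    sensors_ok = plausible(coolant_temp_c, -40, 150) and plausible(manifold_kpa, 10, 300)
    if not sensors_ok:
        # Sensors cannot be trusted: fall back to the rich base map.
        return BASE_MAP_AFR, True
    band = "cold" if coolant_temp_c < 60 else "warm"
    return NORMAL_MAP[band], False


print(target_afr(90, 100))    # healthy sensors
print(target_afr(None, 100))  # coolant sensor has failed
```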

    12. Re:Good god. by Thelasko · · Score: 4, Informative

      ... you'd think the A400M engine software would have a *baked in* "go home without crashing" dataset.

      From how I read the article, it does have a default dataset that it switches to when it detects a problem. From TFA:

      The automatic response is to hunker down and prevent what would usually be a single engine problem causing more damage.

      Limiting the speed of a ground vehicle is safe. However, limiting the speed of an aircraft causes a crash. It sounds like they need to reevaluate their "limp home" calibration, as we call it in the industry.

      --
      One of our competitors trademarked the term "hypothesis". From now on, we will call them "boneheaded ideas".
    13. Re:Good god. by 0123456 · · Score: 4, Interesting

      You mean, people accidentally mashing both pedals at the same time?

      Possibly. But there was a published third-party analysis of Toyota's ECU software which made me reluctant to buy one:

      http://embeddedgurus.com/barr-...

      I was glad to see that my new SUV automatically cuts the gas if it detects you pressing both pedals at the same time, even if due to a bad sensor or crashed throttle-monitoring process (yeah, I know, that means no left-foot braking, but if you're doing that in an SUV, you're probably doing it wrong).
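
      The brake-override behavior described above can be sketched in a few lines. This is an illustration under assumptions, not any manufacturer's logic: the function name and the convention that a failed pedal sensor reads as None are hypothetical.

```python
def commanded_throttle(accel_pedal, brake_pressed):
    """Return the throttle command in the range 0.0 to 1.0.

    accel_pedal is the pedal position (0.0 to 1.0), or None when the
    sensor reading is unavailable. The brake always wins.
    """
    if brake_pressed or accel_pedal is None:
        return 0.0
    # Clamp out-of-range readings instead of passing them through.
    return max(0.0, min(1.0, accel_pedal))


print(commanded_throttle(0.4, False))  # normal driving
print(commanded_throttle(1.0, True))   # both pedals pressed: throttle is cut
```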

    14. Re:Good god. by tlhIngan · · Score: 5, Insightful

      They screwed up, yes, but if they would be "punished as criminal and in an extremely harsh and public way" nobody would ever do anything useful anymore. The problems leading to this crash have to be analyzed and understood and then they have to make sure that the same thing can't happen again.

      You know, when an accident happens, the safety board (NTSB, TSB, BEA, etc) interviews are actually privileged information. As in, if you're being interviewed by the safety board, anything you say cannot be used as evidence against you.

      It's a privilege that the safety boards all fight for.

      The reason for this is the safety board's goal is to not find fault, but to find solutions to preventing it from happening again. Doesn't matter if someone hit a button that said "Crash this plane" and pushed it on purpose. They know that if the interviews were not privileged communications, no one would speak to them for fear of self-incrimination. And when that happens, everyone clams up, and you can't figure out why an accident happened or make recommendations to prevent the issue the next time it happens.

      This is especially so when most complex accidents are a chain of events - this happened, then that happened, then this next thing, plus X, Y and Z and if any of them didn't occur, the accident wouldn't have happened. Almost never is it the result of one definitive action.

    15. Re:Good god. by Anonymous Coward · · Score: 3, Insightful

      You clearly have not done the SAAB slide to get around a corner faster. You do not press the clutch or change gears during the maneuver as the whole point is to apply power during the turn. Although heel-toe, or fat-footing works to apply both throttle and brake, it is not as controllable.

      Sadly many use the left foot for braking under normal circumstances as they are unfamiliar with a manual transmission.

    16. Re:Good god. by Anonymous Coward · · Score: 3, Informative

      That was Toyota's excuse. In reality it actually *was* a software error (actually several).

    17. Re:Good god. by mjwx · · Score: 4, Informative

      >(yeah, I know, that means no left-foot braking, but if you're doing that in an SUV, you're probably doing it wrong).

      Sooooooo... no offroading for your SUV?

      SUVs aren't built to go off road.

      They don't have locking diffs, a low range gearbox and often, not even underside protection. Most SUVs don't even have full time AWD as they don't have a centre diff; they use systems like the Haldex Traction to transfer power from a latitudinally mounted engine (transverse mounted, AKA: east-west) that drives the front wheels 99% of the time.

      Most SUVs are no more suited to going off road than your average Camry and get stumped by the first slightly damp grassy slope they come across.

      And yes, if you're left foot braking you're doing things horribly, horribly wrong. Doubly so for heel-toe. There are very few times when you need to left foot brake or heel-toe and none of them are on the road. Keep the fancy foot work for the track and dance floor, drive properly on the road.

      --
      Calling someone a "hater" only means you can not rationally rebut their argument.
  2. So, how did ... by PPH · · Score: 4, Interesting

    ... the engines even start. Or throttle up to take-off power?

    Come on, folks. Turn the power on to the engine controllers at the flight line and the status display should have been flashing warnings. Nobody should have even started this thing.

    --
    Have gnu, will travel.
    1. Re:So, how did ... by Anonymous Coward · · Score: 5, Informative

      The story seems to massively simplify how the ECUs work. Each engine needs to be calibrated after production so that the sensor data it hands to each ECU is actually meaningful due to the way it's actually acquired in the engine. The parameter set isn't stored in the engine, but in the associated ECU. To prevent them from getting out of sync, the engine itself contains a little register with the checksum of the parameter set. If that checksum doesn't match, the ECU shouldn't power up the engine. However, the register and the ECU are initially loaded with a default parameter set used in testing scenarios. Looks like that one might have been untouched for the engines on that flight. Now, this is bad because the ECU now misreads the true engine status in various ways and can even think that an engine which is otherwise running fine is seemingly in some critical condition - e.g. power output too high, which causes an immediate shutdown to prevent engine damage. A jet engine that fails by disintegration has a high chance of slicing other airplane parts with ripped off fan blades. This is why hard engine shutdowns do make sense. But when putting the pieces of this puzzle together, this is starting to look similar to how Murphy's law came to be: an exceptionally unlikely chain of human errors ruining everyone's day.
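
      To make the failure mode described above concrete, here is a hedged sketch. It assumes the scheme works as this comment speculates (a checksum register on the engine, a parameter set in the ECU); the extra test against the default set is an assumption about how the gap could be closed, not a description of the real system, and all names and byte strings are hypothetical.

```python
import hashlib

# Hypothetical placeholder loaded into both ECU and engine at production time.
DEFAULT_TEST_PARAMS = b"default-test-calibration"


def checksum(params):
    """Checksum of a parameter set (SHA-256 here, chosen for illustration)."""
    return hashlib.sha256(params).hexdigest()


def may_start_engine(ecu_params, engine_register):
    """Allow a start only for a matching, non-default parameter set."""
    if checksum(ecu_params) != engine_register:
        return False  # ECU and engine are out of sync
    if ecu_params == DEFAULT_TEST_PARAMS:
        return False  # checksums agree only because nothing was ever calibrated
    return True


calibrated = b"engine-3-calibration"
print(may_start_engine(calibrated, checksum(calibrated)))
print(may_start_engine(DEFAULT_TEST_PARAMS, checksum(DEFAULT_TEST_PARAMS)))
```

      A bare checksum comparison passes when both sides still hold the default test set, which is the scenario the comment suspects; the second test is what rejects it.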

    2. Re:So, how did ... by TubeSteak · · Score: 3, Interesting

      A jet engine that fails by disintegration has a high chance of slicing other airplane parts with ripped off fan blades.

      It's actually exceedingly rare for there to be an uncontained failure.

      That engine shroud is intended to handle catastrophic failures at full throttle.
      This video is a test of the Rolls-Royce Trent 900 engine that went into the Airbus A380. The test starts ~3:25 in.
      https://www.youtube.com/watch?v=j973645y5AA

      Then again, this is the same engine after an oil leak led to an internal engine fire
      https://www.atsb.gov.au/media/2891294/vh-oqa-fig7.jpg
      https://www.atsb.gov.au/media/4173628/ao-2010-089_vh-oqa.jpg

      The Australian Transport Safety Bureau (ATSB) found that a number of oil feed stub pipes within the High Pressure / Intermediate pressure (HP/IP) hub assembly were manufactured with thin wall sections that did not conform to the design specifications. These non-conforming pipes were fitted to Trent 900 engines, including the No. 2 engine on VH-OQA. The thin wall section significantly reduced the life of the oil feed stub pipe on the No. 2 engine so that a fatigue crack developed, ultimately releasing oil during the flight that resulted in an internal oil fire. That fire led to the separation of the intermediate pressure turbine disc from the drive shaft. The disc accelerated and burst with sufficient force that the engine structure could not contain it, releasing high-energy debris.

      Most of the shroud's strength is focused around the main fan blades instead of the turbine blades that are much deeper in the engine.

      --
      [Fuck Beta]
      o0t!
  3. Re:This is what happens when you use Luddite softw by fuzzyfuzzyfungus · · Score: 5, Interesting

    Depressingly, that might actually be true.

    Not because of 'apps' of course; but because no self-respecting consumer OS would fail to cryptographically verify the execution environment (lest some precious 'premium content' be absconded with by pirates) and an entire missing file probably would have caused the aircraft to refuse to move until taken back to Airbus HQ for re-blessing by the vendor.

    They don't succeed against motivated pirates, of course; but this is one area where consumer software vendors do actually give a fuck. If people believed that a sabotaged voting machine or a defective ECU could pirate Blu-rays, we'd live in a safer world.

  4. BIST - Built In Self Test by presidenteloco · · Score: 4, Insightful

    My printer at home does it every time it starts up.

    Too bad the airplane doesn't.

    I guess production delays are more expensive than debugging-by-crash. Sad.

    --

    Where are we going and why are we in a handbasket?
  5. Big fail from the software engineering standpoint. by Frosty+Piss · · Score: 5, Interesting

    Just my take as a software engineer and current DoD employee that works with C17...

    There should have been some process on firing up the jet / avionics / computers that ran checks to see that even if software was not latest, was it CONSISTENT?

    Big fail from the software engineering standpoint.

    --
    If you want news from today, you have to come back tomorrow.
  6. Return codes? by cfalcon · · Score: 5, Insightful

    This is a tragedy, but since we're on a tech site, let's talk tech.

    Return values are handled oddly in pretty much every major language. Many API calls want to return something simple- int or bool- and if anything is more complex than that, generally require an actual data structure to be returned, often as a reference. This means that the "I didn't do this" action has a variety of ways to be passed back- none of them even close to standard.

    If something returns a distance, magnitude, or size, "0" normally means "Error, nothing happened" which is often the same as "Sure, I wrote 0 bytes. Really."
    If something needs to distinguish between success ("I did the thing 0 times as requested") and failure ("I couldn't do the thing because of an error condition"), then sometimes a -1 is returned, or an exception thrown, or something else.

    In this plane, something was, at some point, responsible for getting data about the engines. Likely, this happened in layers, each one having access to the results of the lower pieces. One of those pieces had the task of parsing those files.

    So EITHER someone (process, program, whatever) meant to say "This is a problem" and instead said "Here's some default data", OR someone ELSE in that chain of commands (process, program, whatever) has a default for a "This is a problem" result to use as a failsafe, and it was never tested or never communicated up.
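
    One way to picture the distinction drawn above, as a sketch only. The layer names, the `ReadResult` type, and the calibration dictionary are invented for illustration and are not taken from any real engine software.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadResult:
    """Explicit result type: a valid zero and a failure are never confused."""
    ok: bool
    value: Optional[float] = None
    error: Optional[str] = None


def read_torque(calibration, raw):
    """Lowest layer: convert a raw sensor value, or report why it cannot."""
    if calibration is None:
        return ReadResult(ok=False, error="calibration missing")
    return ReadResult(ok=True, value=raw * calibration["scale"])


def engine_status(calibration, raw):
    """Upper layer: pass the failure up instead of substituting a default."""
    result = read_torque(calibration, raw)
    if not result.ok:
        return f"FAULT: {result.error}"
    return f"torque={result.value:.1f}"


print(engine_status({"scale": 2.0}, 0.0))  # a genuine reading of zero
print(engine_status(None, 0.0))            # no calibration: reported as a fault
```

    With a bare numeric return, both calls would hand back 0 and the caller could not tell them apart.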

    We probably won't get the technical details that go from "files missing" to "engines don't work". Certainly, several levels of software or hardware could allow for any number of workarounds in this case, and I'm sure they have a complex system and this was some eventuality that was hard to test for.

    Still, interesting to think about the error return methodology, and how it's so different everywhere in CS.

  7. Does the Therac-25 ring a bell for anyone? by dav1dc · · Score: 4, Informative

    + http://en.wikipedia.org/wiki/T...

    The first computer controlled X-ray machine.... which accidentally irradiated some people to death...
    due to *gasp* software faults! (say it ain't so!)

    I first heard about the Therac-25 during my "Ethics in Computer Science" class many years ago - it made an excellent case study... about problems just like this one.
    Once the textbooks get updated, Therac-25 will be replaced with a case study about the A400M rollout. ^_^

  8. Re: FMEA by TWX · · Score: 4, Informative

    You would be sadly mistaken.

    I've seen software writers follow RFC and ONLY RFC for communications protocols, to the point that anything not explicitly expected per the newest standard of RFC will cause the daemon to crash hard. Whether it's garbage by accident, garbage on purpose to try to cause a buffer overflow, or even deprecated commands from previous RFCs, the daemon should handle unexpected input gracefully, even if it throws a 500 and closes the connection. To do otherwise (as was done) is irresponsible, but all too common.
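
    A small sketch of what "handle it gracefully" could look like for a line-based protocol. The command set, the length limit, and the reply strings are hypothetical and only loosely modeled on SMTP-style numeric replies; this is not an implementation of any particular RFC.

```python
KNOWN_COMMANDS = {"HELO", "NOOP", "QUIT"}  # hypothetical command set
MAX_LINE = 512                             # hypothetical length limit in bytes


def handle_line(raw):
    """Return (reply, keep_connection_open) for one input line; never raise."""
    if len(raw) > MAX_LINE:
        return "500 line too long", False
    try:
        text = raw.decode("ascii").strip()
    except UnicodeDecodeError:
        return "500 unrecognized input", False
    parts = text.split()
    command = parts[0].upper() if parts else ""
    if command not in KNOWN_COMMANDS:
        # Covers garbage, empty lines, and deprecated commands alike.
        return "500 unknown command", False
    if command == "QUIT":
        return "221 bye", False
    return "250 ok", True


print(handle_line(b"NOOP\r\n"))
print(handle_line(b"\xff\xfe\x00garbage"))
```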

    --
    Do not look into laser with remaining eye.
  9. The last word... by Unknown74 · · Score: 4, Insightful

    " The more they overthink the plumbing, the easier it is to stop up the drain. " - Montgomery Scott, Star Trek III

  10. Re:This is what happens when you use Luddite softw by schlachter · · Score: 4, Insightful

    WTF? No automated system check to determine if all needed files are present before flying??!

    --
    My God can beat up your God. Just kidding...don't take offense. I know there's no God.
  11. Re:This is what happens when you use Luddite softw by Anonymous Coward · · Score: 3, Funny

    WTF? No automated system check to determine if all needed files are present before flying??!

    Sure there is.

    We call it 'gravity'.

  12. Re:This is what happens when you use Luddite softw by x0ra · · Score: 3, Insightful

    So it would probably have worked, and not crashed because someone was using tr(1) to parse some output in an overly complicated shell startup system...

  13. Re: This is what happens when you use Luddite soft by sycodon · · Score: 3, Funny

    This is why Dr. McCoy didn't trust the transporter.

    --
    When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.