Slashdot Mirror


Long Uptime Makes Boeing 787 Lose Electrical Power

jones_supa writes: A dangerous software glitch has been found in the Boeing 787 Dreamliner. If the plane is left turned on for 248 days, it will enter a failsafe mode that will lead to the plane losing all of its power, according to a new directive from the US Federal Aviation Administration. If the bug is triggered, all the Generator Control Units will shut off, leaving the plane without power, and the control of the plane will be lost. Boeing is working on a software upgrade that will address the problems, the FAA says. The company is said to have found the problem during laboratory testing of the plane, and thankfully there are no reports of it being triggered on the field.

45 of 250 comments

  1. Have you tried turning it off and on again? by Anonymous Coward · · Score: 5, Funny

    Finally!

    IT support advice that's useful!

    1. Re:Have you tried turning it off and on again? by rjniland · · Score: 4, Interesting

      Yes, but perform a clean systems shut down BEFORE turning off power.

      I was on an airliner once that crashed at the gate, prior to departure.

      Ground power was disconnected before they had spun up the APU. Lights out. Lights on. ... Several minutes later we get an announcement that we'd have to wait for a backup plane, which took 45 minutes to arrange.

      They were unable to reboot the airliner.
      Robust systems design wasn't a phrase that came to mind.

  2. This is Boeing Tech Support by mikeabbott420 · · Score: 4, Funny

    "have you tried turning it off and then back on?"

    --
    This program was made possible by a grant from the Ultra-Humanite, and viewers like you.
    1. Re:This is Boeing Tech Support by fuzzyfuzzyfungus · · Score: 3, Funny

      NTSB investigators reported the cause of the crash as "Controlled reboot into terrain".

  3. Very unlikely to be triggered in the field by Brandano · · Score: 2, Informative

    A commercial plane will most probably undergo several maintenance events and checks during that sort of time frame, where cycling the power is part of the procedure.

    1. Re:Very unlikely to be triggered in the field by hawguy · · Score: 4, Insightful

      A commercial plane will most probably undergo several maintenance events and checks during that sort of time frame, where cycling the power is part of the procedure.

      It's very reassuring to know that it probably won't happen.

    2. Re:Very unlikely to be triggered in the field by compro01 · · Score: 2

      You will probably not be struck by lightning, but I can't guarantee that it won't happen.

      Actually, when talking about airliners, getting struck by lightning is a fairly common occurrence. A typical airliner experiences a lightning strike about once a year.

      --
      upon the advice of my lawyer, i have no sig at this time
    3. Re:Very unlikely to be triggered in the field by confused+one · · Score: 5, Interesting

      If it ever happened on a plane, then it means that the maintenance was intentionally skipped. If they reach 248 days of continuous operation then a number of significant maintenance cycles have been skipped (some 23-25 inspection / maintenance cycles that generally require shutting down the electrical system). The generators in question are attached to the engines. The engines have an overhaul schedule that is shorter than 248 days of continuous operation. If they managed to reach this point, then the major maintenance cycles have been skipped and the engines are long overdue for a tear down inspection and overhaul. Any plane that could reach this point, 248 days of continuous operation while missing all of the required maintenance, is not a plane (or an airline, for that matter) that anyone should be flying on.

    4. Re:Very unlikely to be triggered in the field by kthreadd · · Score: 2

      If it ever happened on a plane, then it means that the maintenance was intentionally skipped.

      And that would of course never happen.

    5. Re:Very unlikely to be triggered in the field by Anonymous Coward · · Score: 2, Funny

      You must not fly United.

    6. Re:Very unlikely to be triggered in the field by hawguy · · Score: 2

      If it ever happened on a plane, then it means that the maintenance was intentionally skipped. If they reach 248 days of continuous operation then a number of significant maintenance cycles have been skipped (some 23-25 inspection / maintenance cycles that generally require shutting down the electrical system). The generators in question are attached to the engines. The engines have an overhaul schedule that is shorter than 248 days of continuous operation. If they managed to reach this point, then the major maintenance cycles have been skipped and the engines are long overdue for a tear down inspection and overhaul. Any plane that could reach this point, 248 days of continuous operation while missing all of the required maintenance, is not a plane (or an airline, for that matter) that anyone should be flying on.

      You would think that if this situation was unlikely to ever happen in practice that the FAA wouldn't have deemed it necessary to issue an AD requiring that the GCUs be power cycled at intervals no longer than 120 days. You'd think they'd already be aware of required maintenance intervals that require powercycling the GCUs, and they waived the usual comment period before issuing the AD due to the perceived imminent danger.

  4. Control unit runs at 100 Hz? by photonic · · Score: 5, Insightful

    I guess this might be due to a 32-bit signed integer being incremented at 100 Hz: 2^31 / 24 / 3600 / 100 = 248.5 days.
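
As a rough check of this arithmetic, the sketch below computes how long a free-running tick counter lasts before it runs past its largest value. The function name `days_until_overflow` and its parameters are made up for illustration. The counter width and tick rate of the real GCU firmware are only being guessed at in this thread, so treat the inputs as assumptions.

```python
SECONDS_PER_DAY = 24 * 3600


def days_until_overflow(bits, tick_hz, signed=True):
    """Days until a counter that starts at zero and is incremented
    tick_hz times per second runs out of range.

    A signed counter loses one bit to the sign, so it holds 2**(bits-1)
    non-negative values; an unsigned counter holds 2**bits.
    """
    value_bits = bits - 1 if signed else bits
    ticks = 2 ** value_bits
    return ticks / tick_hz / SECONDS_PER_DAY


# Assumed case from this comment: signed 32-bit counter at 100 Hz.
print(round(days_until_overflow(32, 100, signed=True), 2))    # 248.55
# Same counter declared unsigned: the range doubles.
print(round(days_until_overflow(32, 100, signed=False), 2))   # 497.1
# Unsigned 32-bit millisecond counter (1000 Hz).
print(round(days_until_overflow(32, 1000, signed=False), 2))  # 49.71
```

The first result matches the 248 days in the story, which is why a signed 32-bit count of 10 ms ticks is a plausible guess, though the directive as quoted in this thread does not confirm it.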

    --
    karma police: arrest this man, he talks in maths; he buzzes like a fridge, he's like a detuned radio. [radiohead]
    1. Re:Control unit runs at 100 Hz? by Anonymous Coward · · Score: 2, Funny

      I call BS. No Windows 98 machine could possibly stay up for 7 weeks, so this was a non-issue.

    2. Re:Control unit runs at 100 Hz? by bosef1 · · Score: 2

      That makes a lot of sense. A lot of aviation power systems run with 400 Hz AC current (the higher frequency lets them use smaller transformers). They could be dividing down the power signal to 100 Hz, and using that to increment a counter.

      The other option is that many operating systems use 10 ms = 100 Hz for their internal interrupt timers. So it could just be a counter that is being incremented every interrupt cycle, and doesn't care what frequency of electricity is being used.
      (cf. the jiffy http://en.wikipedia.org/wiki/Jiffy_(time) )

    3. Re:Control unit runs at 100 Hz? by TheRealHocusLocus · · Score: 5, Funny

      I guess this might be due to a 32-bit signed integer being incremented at 100 Hz: 2^31 / 24 / 3600 / 100 = 248.5 days.

      Yes, the moment the big bird would shut down was correctly prognosticated by the Connecticut Yankee in King Arthur's Court. While testing a crowbar circuit he ran out of time and came to while munching on phattened feasant at Medieval Times, in a daze of King Arthur. He noticed an unused carrion bit, and realized that birds of prayer who managed the King's affairs were hard-sinewed to pluck quills for signing and always discarded the carrion bit. He caught the underflow was heralded by the people and befriended by the King, who set him to work hacking the Code of Chivalry and cracking the Y1K problem. In that time there were only punch cards and knights on horseback only had a resolution of 1 bit, so tournaments were long the fields were full of snakes, to avoid spooking the horses the knights would dismount and cleave them with sword, leaving half-adders strewn about. It was Pendragon who had built the famous Round Table with 12 seats, two complete I Chings, where Arthur and the knights would drop in and punch out binary sums in a rudimentary form of patty-cake, which inspired the mechanical circular adder of later years. The Yankee's refinement was a 13th chair left unoccupied to mark the betrayal of Judas, and also to serve as a carrion bit.

      There is a great deal more about gum-powder and 99 cent gamut of Steampunk-driven micro commerce, a Debian release called 'Guinevere' and a whole lotta Lancelot, but time is fun when you're having flies.

      --
      <blink>down the rabbit hole</blink>
  5. Re:Oh come on. by IndigoZulu · · Score: 5, Interesting

    It could be the overflow of a counter of 10ms intervals. There are 86400 seconds per day, so 8640000 10ms intervals per day ... 2147483648 / 8640000 = 248.55

  6. If Boeing believed in software QA.... by Bomarc · · Score: 2

    For all of the QA at Boeing, they don't believe in software QA. Take a look at their job openings some time: In years of searching, I've seen only one software QA position, and it wasn't dealing with aircraft. Any such search results will return developers that are to write their own tests against the spec. Developers are not Testers.... and I'll ask: How many more such bugs are out there?

    I know of two other software "bugs" ... that can be attributed to a lack of QA. How many people will die due to a bad management decision on the part of Boeing?
    Disclosure: Yes, I'm a software QA / Test professional.

    1. Re:If Boeing believed in software QA.... by Anonymous Coward · · Score: 2, Informative

      The Primary Flight Computer software for the 777 was written in England by GEC. Indeed the hardware for the PFC was designed and built by GEC.

      I was on the software QA team for the PFC code. There were tens of us working three shifts 24 hours per day devising tests of the PFC against its requirement spec. There were even more doing unit tests on all the Ada code.

      That is perhaps why you don't see Boeing advertising for QA engineers. They outsource the hardware and software.

    2. Re:If Boeing believed in software QA.... by Anonymous Coward · · Score: 2, Insightful

      Actually I took my work there testing the 777 software very seriously.

      On at least two occasions I escalated what I thought was a problem in the specification all the way back to Boeing. One of them turned out to be a "real-world" issue in the spec.

      I believe the rest of the team took the same attitude. We used to talk about that a lot.

      At the end of the day what you are asking for is impossible. The spec we worked to was a stack of paper 2 yards high when printed out. How many QA engineers know enough about flight dynamics to question if any of it is correct or not?
       

    3. Re:If Boeing believed in software QA.... by Required+Snark · · Score: 5, Informative

      You have no idea what you are talking about. All FAA certified aircraft software has to conform to the DO-178B / DO-178C standard. The standard imposes design, testing, process and documentation standards that are extremely demanding.

      QC isn't just a department or a step in the release process, it is built into the full life cycle of the software. Safety is the goal, and the requirement for good practice starts at the beginning of the process, with the requirement documents.

      For example, there are five levels of error severity defined from A to E. E has no impact on safety and A is catastrophic, where a crash could occur. The level of software test and validation depends on the severity level.

      The number of objectives to be satisfied (eventually with independence) is determined by the software level A-E. The phrase "with independence" refers to a separation of responsibilities where the objectivity of the verification and validation processes is ensured by virtue of their "independence" from the software development team. For objectives that must be satisfied with independence, the person verifying the item (such as a requirement or source code) may not be the person who authored the item and this separation must be clearly documented. In some cases, an automated tool may be equivalent to independence. However, the tool itself must then be qualified if it substitutes for human review.

      Your inability to find a "QC" position is because you don't know the structure of aerospace software development and have no idea of the job titles or terminology used to describe the standards used. You are projecting your lack of knowledge into an inconceivable lapse of competence on the part of Boeing and the FAA. In what universe would there be no software safety requirements for the civilian aircraft industry? All you have shown is that you are ignorant and have a basic lack of common sense.

      --
      Why is Snark Required?
  7. Re:Lesson Here by Megane · · Score: 2

    Also, use the difference of the current time minus the start time, instead of computing the end time and using a simple less than/greater than comparison. This properly handles wraparounds, and only has a problem with differences more than half of the full range. (so don't keep comparing the time after it's ended!)
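
A minimal sketch of that advice, assuming a 32-bit unsigned tick counter. Python integers never overflow, so the wraparound is emulated with a mask, and the names `elapsed`, `timed_out` and `timed_out_naive` are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # emulate a 32-bit unsigned tick counter


def elapsed(now, start):
    """Ticks from start to now, correct across one counter wraparound."""
    return (now - start) & MASK32


def timed_out(now, start, timeout):
    """Difference-based check: compare elapsed ticks against the timeout."""
    return elapsed(now, start) >= timeout


def timed_out_naive(now, start, timeout):
    """End-time check: precompute a deadline and compare directly.
    Breaks when the deadline wraps past zero but 'now' has not yet."""
    deadline = (start + timeout) & MASK32
    return now >= deadline


start = 0xFFFFFFF0             # 16 ticks before the counter wraps
timeout = 100
now = (start + 5) & MASK32     # only 5 ticks later

print(timed_out(now, start, timeout))        # False
print(timed_out_naive(now, start, timeout))  # True (spurious timeout)
```

The difference-based check is only trustworthy while the real elapsed time is smaller than the counter range, which is the caveat about not comparing long after the interval has ended.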

    --
    #naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
  8. Graceful degradation by thisisauniqueid · · Score: 2

    The plane's control systems should have several levels of degraded-mode operation, so if one system stops working, the plane still hobbles along the best it can without the non-working system. Google's self-driving cars have something like 7 layers of nested failure modes, each with slightly degraded functions relative to the next higher level. It's almost impossible to trigger enough failures to completely shut the system down, which is a good thing if you're traveling at highway speeds. It's very concerning that a company like Boeing didn't catch this before product release, but even more concerning that they didn't design the system to be resilient against this sort of failure.

  9. It is probably a non-issue. by 140Mandak262Jamuna · · Score: 5, Funny

    The company is said to have found the problem during laboratory testing of the plane, and thankfully there are no reports of it being triggered on the field.

    The spokesman continued, "The battery would have caught fire long before that integer overflow."

    --
    sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
  10. Re:Oh come on. by Mirar · · Score: 2

    Oh, so they can make it fine for 497.10 days by changing the type to unsigned!

  11. Re:3 shifts? by Anonymous Coward · · Score: 3, Informative

    The reason for the three shifts was that we were using actual PFC computers connected to hardware that could simulate all the inputs and read all the outputs.

    That hardware was a big complicated rack of electronics and there were maybe 8 or 10 such units in a lab.

    As such, to optimize use of the facilities it was necessary to have three shifts 24 hours per day. This went on for a year or more.

    Very good planning in fact.

    Now I could tell you stories of the real corners cut to meet the schedule. But that's a complicated story.

     

  12. Re:Oh come on. by SJHillman · · Score: 3, Informative

    Which is apparently what Windows does:

    https://www.ctm-it.com/it-supp...

    You'd think they would have learned since Windows 95/98 did the same thing.

    https://support.microsoft.com/...

    But hey, at least it goes 10 times as long now.

  13. Re:queue the.. by jones_supa · · Score: 4, Informative

    As a sidenote, there exists a somewhat famous bug in Windows 95 and 98 (later patched) that caused these operating systems to stop functioning after 49.7 days of uptime.

  14. Re:Failsafe by X0563511 · · Score: 2

    ... not when they would all have nearly the exact same runtime - they would all hit the failsafe at around the same time.

    Not that this should ever happen in the air - as others have said, if the thing manages to run for this long, someone hasn't been doing maintenance.
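
To illustrate why identical runtimes matter, here is a small sketch under the thread's assumption of a signed 32-bit counter ticking at 100 Hz. The function names and the power-on times are invented for the example and say nothing about how real GCUs are started.

```python
OVERFLOW_TICKS = 2 ** 31   # assumed: signed 32-bit counter
TICKS_PER_HOUR = 100 * 3600  # assumed: 100 Hz tick rate


def overflow_ticks(power_on_ticks):
    """Absolute tick at which each unit's counter overflows."""
    return [t + OVERFLOW_TICKS for t in power_on_ticks]


def overflow_spread(power_on_ticks):
    """Gap between the first and last unit to overflow, in ticks."""
    times = overflow_ticks(power_on_ticks)
    return max(times) - min(times)


# Four units powered on together: they all overflow on the same tick.
same = [0, 0, 0, 0]
# Four units powered on an hour apart: overflows are an hour apart too.
staggered = [0, TICKS_PER_HOUR, 2 * TICKS_PER_HOUR, 3 * TICKS_PER_HOUR]

print(overflow_spread(same))       # 0
print(overflow_spread(staggered))  # 1080000
```

The overflow moments simply mirror the power-on moments, so redundant units that start together also fail together.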

    --
    For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
  15. Re:What idiot doesn't know what "failsafe"means? by Anonymous Coward · · Score: 2, Insightful

    If you actually read the AD it will say "We are issuing this AD to prevent loss of all AC electrical power, which could result in loss of control of the airplane."

    COULD lose control, not WILL. The 787 has at least 3 additional backup systems against this sort of failure, the APU, DC battery backup, and Ram Air Turbine.

  16. Enough of this by confused+one · · Score: 5, Informative

    This story is being way overblown. Yes, it's a bug. Yes, it should be fixed. However...

    248 days of continuous operation is well past the scheduled major maintenance for the aircraft. By this point, a 787 would have to go through many minor maintenance cycles which would have required shutting down the electrical system. In addition, loss of all 4 generators would not result in a loss of vehicle because there are batteries, an APU (a backup generator) and Ram Air Turbines (RATs), generators that deploy from the wing if the APU won't start. To have to rely on any of these would not make for a good day for the pilots; but, they would certainly provide the necessary power to safely land the aircraft at the nearest airport. They might even be able to continue on and finish their flight if they successfully reset the generators.

    This is not the OMG Planes Are Going to Fall From The Sky! event the media is making it out to be.

    1. Re:Enough of this by PPH · · Score: 2

      This is not the OMG Planes Are Going to Fall From The Sky!

      No. This is a "What the f* were you goofballs thinking when you wrote this code? And if this is all the better you can do, what other gotchas are hiding in there?"

      --
      Have gnu, will travel.
    2. Re:Enough of this by NicBenjamin · · Score: 2

      Dude, this is a for-profit company, not a research university. It's not written by people whose entire job is to prove to the world they write the most robustest code ever designed with zero bugs. If it doesn't kill people or delay flights it doesn't cost them money and nobody, except computer geeks, gives a shit.

      In this case the Dreamliner's designed to have all the relevant systems turned off for routine maintenance once every two weeks. Which means if they go more than 248 days without being restarted the airline has skipped several dozen (25 or 26 according to another slashdotter) routine maintenance cycles, which is likely a much bigger problem than the pilot needing to a) restart the computers mid-flight, or b) needing to glide to an emergency landing.

      Given that there've been something on the order of 400 plane-years of actual flight performance, and nobody noticed this bug until now, the software design seems to be about right. Not perfect, but if the planes are even being given 10% of the maintenance the specs call for this bug is a non-issue.

      OTOH, the problems with various batteries were dumb engineering. Altho those also seem to be solved.

    3. Re:Enough of this by ArylAkamov · · Score: 2

      and Ram Air Turbines (RATs), generators that deploy from the wing if the APU won't start.

      Holy shit. That is cool, though looking at the pictures I can't stop laughing at how comical it looks.

      http://en.wikipedia.org/wiki/R...

    4. Re:Enough of this by joe_frisch · · Score: 2

      Even though this bug isn't a direct threat, it could interact with other future software changes. If it is a counter overflow there is a risk that the counter would run at a higher rate in some future version where more functionality is needed. If 248 days went to 2.48 days, it might not be caught in testing, but could (rarely) happen in real life.

    5. Re:Enough of this by ray-auch · · Score: 2

      Bingo.

      If this was only spotted recently in "lab testing" (and why was it being tested now, and not before flight... what prompted the testing...) then it was known / not documented that overflow of this counter would cause shutdown. Some future revision could easily be to increase the precision, at the expense of range, or persist the counter across reboots, and that might not be considered a problem because the system was thought to handle the counter overflowing because no one documented that it didn't.

      That is why I think the AD is there - to ensure this issue is known when this software is messed with in future.

  17. Re:Oh come on. by jones_supa · · Score: 2

    I am not completely familiar with the matter, but I remember hearing that using signed types in some situations can be a better choice, even when the value would normally be used to represent only a non-negative value. It could make overflows more obvious and calculating deltas might be easier? If someone actually knows about this stuff, feel free to chime in.

  18. Re:Oh come on. by fisted · · Score: 3, Informative

    In C, overflowing a signed integer type is undefined behaviour; unsigned types wrap around to zero in a defined manner.
    Of course, either is often undesired, but the latter at least doesn't allow basically anything to happen.
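
The difference can be emulated in Python, whose integers never overflow on their own. This is a sketch only, with hypothetical helper names. `wrap_uint32` models the modulo 2**32 result that C defines for unsigned 32-bit arithmetic. `as_int32` shows the two's complement value that many machines happen to produce for signed overflow, which C does not promise, since a compiler may assume the overflow never occurs.

```python
def wrap_uint32(value):
    """Reduce modulo 2**32, the result C defines for 32-bit unsigned math."""
    return value & 0xFFFFFFFF


def as_int32(value):
    """Reinterpret the low 32 bits as a two's complement signed value.
    This is a common hardware outcome, NOT something C guarantees."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


UINT32_MAX = 0xFFFFFFFF
INT32_MAX = 0x7FFFFFFF

print(wrap_uint32(UINT32_MAX + 1))  # 0
print(as_int32(INT32_MAX + 1))      # -2147483648
```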

  19. Re:Oh come on. by plopez · · Score: 2

    It doesn't matter what country programmers come from, in my experience too many programmers have no clue about reality outside of their cube. They are building software for things they do not understand. I am going to rant about this in another thread so I will leave it at that for now.

    --
    putting the 'B' in LGBTQ+
  20. Re:Oh come on. by fuzzyfuzzyfungus · · Score: 2

    Man, if only we could afford to use 64 bit values for things. I realize that transistors are simply too expensive right now; but perhaps, in the future, the miracles of science will make this possible...

  21. Re:queue the.. by dunkelfalke · · Score: 5, Informative

    Only theoretical, though. Windows 9x would crash long before reaching this uptime.

    --
    "It's such a fine line between stupid and clever" -- David St. Hubbins, Spinal Tap
  22. They should run OpenVMS by thedavidcathey · · Score: 2

    OpenVMS systems have had many systems up for several years without rebooting. Their equivalent of the "ps" utility had to be fixed one time because systems were exceeding 9999 days uptime.

  23. Re:Oh come on. by dunkelfalke · · Score: 4, Funny

    And this is why C should never be used for mission critical software.

    --
    "It's such a fine line between stupid and clever" -- David St. Hubbins, Spinal Tap
  24. Re:queue the.. by roc97007 · · Score: 2

    "(psshsquawk)This is the Captain speaking, we are cruising at 30,000 feet, have a bit of a tail wind and will be in San Francisco a little ahead of schedule. ...Ummm... Ah.... I'm putting the seatbelt sign on now. Please return to your seats as we reboot the airplane.(pssshsquawk)"

    --
    Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
  25. Re:Keeps Living Up To It's Name by Ethanol · · Score: 2

    All three of them.

    Hey, 248 days is five dog-years.

  26. Re:Maybe they should have used Rust. by TheRealHocusLocus · · Score: 2

    This is a prime example of why we need to use the Rust programming language ... blazingly ... eliminates data races ... guaranteed memory ... threads ... greatest minds ... the great ... the superb ... the glorious ... the mightiest ... Git ... Hub ... ... properly ... where it's at ... what we need ... It's what [the world] need[s] now.

    Oh yeah? Sheeeit.
    Pump it up! (endorsed by M.I.A.).

    Ericsson Calling!
    Speak the Erlang now (Seattle boys say Wha? Penguin Girls say Wha-What [x2]

    Use Erlang Erlang Erlang, Ga la ga la ga la Land ga Lang ga Lang
    Con-currency get you down?
    Stack em flat, get down get down
    Too late you down D-down D-down D-down
    Ta na ta na ta na Ta na ta na ta

    Bench mark a-blaze Erlang a lang a lang lang
    Eager evaluation Erlang a lang a lang lang
    Single assignment Erlang a lang a lang lang
    Dynamic typing Erlang a lang a lang lang

    Who the hell is huntin' you?
    Distributed, fault-tolerant,
    In the BMW
    How the hell they find you?
    hot swapping,
    Feds gonna get you
    non-stop applications
    Pull the strings on the hood
    soft-real-time
    concurrency explicit
    message passing, Erlang a lang a lang lang
    Nah explicit locks Erlang a lang a lang lang
    open source Erlang a lang a lang lang.

    CHORUS:
    fib(1) -> 1; % If 1, then return 1, otherwise (note the semicolon ; meaning 'else')
    fib(2) -> 1; % If 2, then return 1, otherwise
    fib(N) -> fib(N - 2) + fib(N - 1).

    Needs some work though.
    An AIRPLANE would make a good sandbox. The price of failure is so high no one will make a mistake.

    --
    <blink>down the rabbit hole</blink>