Slashdot Mirror


Programming Error Doomed Russian Mars Probe

astroengine writes "So it turns out U.S. radars weren't to blame for the unfortunate demise of Russia's Phobos-Grunt Mars sample return mission — it was a computer programming error that doomed the probe, a government board investigating the accident has determined." According to the Planetary Society Blog's unofficial translation and paraphrasing of the incident report, "The spacecraft computer failed when two of the chips in the electronics suffered radiation damage. (The Russians say that radiation damage is the most likely cause, but the spacecraft was still in low Earth orbit beneath the radiation belts.) Whatever triggered the chip failure, the ultimate cause was the use of non-space-qualified electronic components. When the chips failed, the on-board computer program crashed."

16 of 276 comments

  1. Excuse me... not a programmer's fault. by LostCluster · · Score: 5, Insightful

    We've got a contradictory summary here. Chip failure isn't a programming fault, it's a hardware problem. Stop confusing hardware and software, you insensitive clod.

    1. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 5, Funny

      sure, it missed:

      if(cpu_melted)
            abort();

    2. Re:Excuse me... not a programmer's fault. by Cochonou · · Score: 5, Informative

      Well... if you read TFA (or actually the first TFA linked), it is clearly written:
      In a report to be presented to Russian Deputy Prime Minister Dmitry Rogozin on Tuesday, investigators concluded that the primary cause of the failure was "a programming error which led to a simultaneous reboot of two working channels of an onboard computer" [...] Likewise, cosmic rays and/or defective electronics are not the leading suspects behind Phobos-Grunt’s demise.
      The summary is clearly bolting together two contradicting reports.

    3. Re:Excuse me... not a programmer's fault. by MSesow · · Score: 5, Funny

      That could throw a ProcessorNotFoundException, be sure to code accordingly.

    4. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 5, Funny

      This has nothing to do with reading TFA. It has everything to do with the summary

      You just defined all of slashdot. What was your point again?

    5. Re:Excuse me... not a programmer's fault. by icebike · · Score: 5, Interesting

      Obviously the error handling routine was poorly written.

      I'll assume your tongue was firmly planted in your cheek, and suggest a +1 Funny mod.

      But on the chance you were serious, depending on where that chip was, it may have been beyond something manageable by software.

      A chip in a power controller could take down any or all of the processor components, or render access to control circuits impossible.

      The linked article also states

      Everything was working well with the spacecraft immediately after launch, including deployment of the solar panels, until the command to start the engines was issued. When that did not happen, the spacecraft went into a safe mode, keeping the solar panels pointed to the Sun to maintain power.

      How many times do you suppose they actually tested engine start IN THE SPACE CRAFT? I'm guessing ZERO.

      non-space qualified parts being used in some of the electronics circuits. This is a design failure by the spacecraft engineers that might have been caught had they performed adequate component and system testing prior to flight. But they did not.

      So design failure, due to radiation, prior to the craft getting near the strongest radiation belts. Unbelievable. Occam would be skeptical.

      This sounds to me like some on-board internal source of radiation, or induction, or simple overload, fried a chip somewhere in some unspecified circuitry, most probably in the engine controls. This seems far more likely than an external radiation source given the shielding the physical design would provide.

      I doubt space qualification made any difference at all. The window for space radiation in the brief time it was operational was small.
      Rather, I suspect under-spec parts, overvoltage or high current draw, or internal shielding oversights.

      --
      Sig Battery depleted. Reverting to safe mode.
    6. Re:Excuse me... not a programmer's fault. by K.+S.+Kyosuke · · Score: 5, Informative

      I'm not a satellite engineer, but wouldn't it be easy enough to just install a lead shield around the PCB to protect from most radiation? As long as the shield's not too thick, it shouldn't add too much weight, especially compared to using older-technology chips that'll take up more board space.

      Well, that depends. Even on Earth's surface, we have to use ECC in more demanding applications. In LEO, you lose the protection of the atmosphere but you still have Earth's rather strong and large magnetosphere. But this was an interplanetary probe. Once you get out of the radiation belts, interstellar and intergalactic particles start hitting you. You can't protect from those with a lead shield of any reasonable size. Pretty much the only way is simply to make the chip simple, rugged and design it with components (transistors) large enough that a particle flying through won't bother you much. Or add redundancy. Or both, if possible (that's the usual case).
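
      A minimal sketch of the redundancy idea, assuming three copies of the same word are kept and a bitwise two-out-of-three vote picks the result (triple modular redundancy). The name `majority_vote` and the sample values are made up for illustration; nothing here is taken from the Phobos-Grunt design.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 vote: each output bit follows at least two of the inputs."""
    return (a & b) | (a & c) | (b & c)


# Three copies of the same 8-bit word; one copy has a single flipped bit.
good = 0b1011_0010
upset = good ^ 0b0000_1000  # simulate a single-event upset in one copy
voted = majority_vote(good, upset, good)
```

      A vote like this only masks an upset in one copy at a time. If two copies are hit in the same bit position, the wrong value wins, which is why redundancy is usually combined with rugged parts rather than used instead of them.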

      --
      Ezekiel 23:20
    7. Re:Excuse me... not a programmer's fault. by pixelpusher220 · · Score: 5, Funny

      Except no one knows for certain the computers crashed at all.

      I'm quite sure that the computers crashed. Right along with the spacecraft ;-)

      --
      People in cars cause accidents....accidents in cars cause people :-D
    8. Re:Excuse me... not a programmer's fault. by bughunter · · Score: 5, Informative

      As another EE with experience in rad hard space qualified design, he's not being self-contradictory. He's spot on.

      If your CMOS structures are prone to latchup in the presence of single high energy events, then shielding does you no good. The amount of shielding necessary would more than consume the entire payload mass budget. Adding insufficient shielding just creates showers of secondary particles, each with more than enough energy to cause latchup alone, therefore rendering you at a statistical loss compared to no shielding whatsoever.

      Designing with this in mind means designing the CMOS structure so that shielding is unnecessary. For example, build your circuits on bulk insulators instead of bulk semiconductor.

      Just because you can't understand it doesn't mean he's self-contradictory. You just missed his point. And then attacked him.

      --
      I can see the fnords!
    9. Re:Excuse me... not a programmer's fault. by hairyfeet · · Score: 5, Interesting

      Which makes me think of something I've been wondering about for a while: now that Intel has quit making the 386, are we gonna be seeing more failures like this in the future? From what I understand, Intel kept making the 386 rev for so damned long (last chip rolled out in '09 IIRC) because its large die area and primitive but functional design made it trivial to harden for military and aerospace use. Again, from what I've been told, the die shrinks mean that a modern chip, even something as old as the P3 or P4, would be hell to harden, simply because its smaller die and tighter tolerances make it hard to protect from bit flips caused by cosmic rays, not to mention the chip being outright fried by radiation exposure.

      So are there any modern chips that would be easy to harden without being insanely expensive? Atom? AMD Geode? I'm sure with its GPU and dual cores Bobcat would be right out, maybe Via C3s? While ARM would be a good guess, its die shrinks to fit in mobile phones would probably make it insanely expensive to harden, yes? So while I'm sure the military probably bought a warehouse full of 386s before Intel shut down production, what happens when they are gone? Do we have a viable modern chip that can withstand the rigors of space without costing insane amounts of money?

      --
      ACs don't waste your time replying, your posts are never seen by me.
  2. Programming error? by mehrotra.akash · · Score: 5, Funny

    the ultimate cause was the use of non-space-qualified electronic components

    Programming error?
    Perhaps in the software used to order the parts

  3. Contradictions by Aladrin · · Score: 5, Informative

    The summary is so contradictory because it quotes from 2 articles, and each of them is completely different. One says that the parts were space-tested and fine, and the other says they were never space-certified and were definitely bad. The first one says instead that a software bug caused parts of the system to reboot. The second doesn't know what happened and just blames faulty hardware.

    --
    "If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM
  4. Sounds like an editor failure to me by kbob88 · · Score: 5, Funny

    In other news, U.S. radars were not responsible for the highly confusing and contradictory summary posted this morning to a Slashdot story about Russia's Phobos-Grunt probe. A thorough investigation has determined that the story's chips should have been able to withstand the radiation received when the story was transmitted through the intertubes and routed over northern Alaska. Instead, investigators blamed a typing failure on the story editors. "A series of tests showed that the editing was lousy and sloppy, and disciplinary action will be taken on those responsible," a spokesman said.

  5. Re:How is "chip failure" a "programming error"? by Hognoxious · · Score: 5, Funny

    A 4 digit ID and never heard of microcode.

    Seriously Gramps, the distinction between hardware and software isn't as clear cut as it was when shit was all powered by steam.

    --
    Confucius say, "Find worm in apple - bad. Find half a worm - worse."
  6. Fun to read the comments by vlm · · Score: 5, Insightful

    Fun to read the comments here. I've done embedded stuff and you need to be defensive. You can see at a glance who here has never done defensive programming before, or embedded or safety critical programming, all blaming the hardware. There's 3 states so you got 2 bits of input and a disallowed state comes in. Deal with it, don't just curl up and die and blame the hardware designer. There's a 12 bit A/D conversion result stored in two bytes, and there's a 14 bit number found there, deal with it, don't just curl up and die and blame the ... . There's a cycle start button and an emergency stop button and both are simultaneously on. Deal with it. You reboot a mission critical (or safety critical!) CPU and a minor auxiliary input A/D doesn't initialize, do you burn the plant down in a woe is me pity party because one out of 237 sensors isn't coming online, or do you deal with it?
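
    A minimal sketch of what "deal with it" could look like for the four cases above, written in Python for brevity rather than in whatever a flight computer would actually run. The state names (`IDLE`, `RUN`, `FAULT`), the function names and the fallback policies are all assumptions chosen for illustration; a real system would pick its own safe defaults.

```python
from enum import Enum


class Mode(Enum):
    IDLE = 0b00
    RUN = 0b01
    FAULT = 0b10
    # 0b11 is not a defined state, but two bits can still deliver it.


ADC_MAX = 0x0FFF  # largest value a 12-bit converter can legitimately report


def decode_mode(raw: int) -> Mode:
    """Map a 2-bit field to a Mode; any undefined pattern becomes FAULT."""
    try:
        return Mode(raw & 0b11)
    except ValueError:
        return Mode.FAULT


def read_adc(raw16: int, last_good: int) -> int:
    """Reject a reading wider than 12 bits and fall back to the last good one."""
    if 0 <= raw16 <= ADC_MAX:
        return raw16
    return last_good


def may_start(cycle_start: bool, emergency_stop: bool) -> bool:
    """Emergency stop always wins, even if both buttons read as pressed."""
    return cycle_start and not emergency_stop


def usable_sensors(init_ok: dict) -> list:
    """Carry on with whichever sensors initialised instead of aborting."""
    return [name for name, ok in init_ok.items() if ok]
```

    The common thread is that every input has a defined response, including the inputs that "cannot happen". Whether the right response is a fault state, the last good value or a degraded sensor set is a design decision; the sketch only shows that the decision gets made in code rather than left to chance.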

    Finally, radiation is a statistical phenomenon. There is no such thing as radiation free. If they used non-rad hardened parts, it's gonna crash maybe 10000 times more often. That's OK, you program around that, assuming you know what you're doing. Radiation hardened does not equal radiation-proof. If there was a single bit error, or a latchup on a rad-hardened unit, with a poorly programmed control system it would have failed just as well; it's just that a rad hardened chip would have made it a couple orders of magnitude less likely. A shitty design that has a 1 in 20000 failure rate due to better hardware instead of 1 in 2 is still a shitty programming design, even if the odds are "good enough" that it makes it most of the time with the better hardware.

    --
    "Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
  7. Re:So how much? by jd · · Score: 5, Interesting

    Space Micro doesn't list the prices of their components or systems, nor can I find any from anyone else. Honeywell don't list their prices either. Atmel seem to have dropped out of the field. Linear don't list the prices for their space-hardened stuff. Don't see any for BAE either, or Intersil. Empire Magnetics require a lot of personal data before they give you access to even the price classification information. Not the prices, just how they're classified.

    You've got to allow for a year's worth of traveling outside of an atmosphere and then operating on Mars for the duration of the mission. This analysis of radiation for manned missions suggests you're looking at 3.5 mSv per day, then 20 rems per year in most of the places of interest.

    Converting everything to rads, it's 0.1 rads per mSv and 1 rad per rem, so that's 127.75 rads to get to Mars if you assume a year-long trip (3.5 mSv × 365 days × 0.1), plus 20 rads for the mission, so anything with a rating of less than 147.75 rads is pretty much guaranteed to fail. However, over the course of two years, the odds of there being a solar flare are not insignificant. To be safe, you want resistance to a further 400 rad. 547.75 rad is within the tolerance of most of the space-hardened components (some components can be taken up to 1000 rad, others up to 10,000). However, the cheapest space components would NOT survive. You're talking high-end on the space scale.
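
    A small worked version of the dose budget, taking the quoted rates at face value: 3.5 mSv per day in transit, 0.1 rad per mSv, 1 rad per rem, a 365-day trip, one year at 20 rem on the surface, and a 400 rad flare margin. With those inputs the transit term is 3.5 × 365 × 0.1 = 127.75 rad. The constant names are made up for illustration, and the rates themselves are the post's assumptions, not measured mission data.

```python
MSV_PER_DAY_IN_TRANSIT = 3.5  # transit rate quoted in the post
RAD_PER_MSV = 0.1             # conversion used in the post
RAD_PER_REM = 1.0             # conversion used in the post
TRANSIT_DAYS = 365            # assumed year-long trip
SURFACE_REM = 20.0            # one year at the quoted surface rate
FLARE_MARGIN_RAD = 400.0      # solar flare margin quoted in the post

# Dose accumulated on the way to Mars, in rad.
transit_rad = MSV_PER_DAY_IN_TRANSIT * TRANSIT_DAYS * RAD_PER_MSV

# Dose accumulated during one year of surface operations, in rad.
mission_rad = SURFACE_REM * RAD_PER_REM

# Minimum rating before any margin, then with the flare margin added.
baseline_rad = transit_rad + mission_rad
budget_rad = baseline_rad + FLARE_MARGIN_RAD

print(f"transit {transit_rad:.2f} rad, baseline {baseline_rad:.2f} rad, "
      f"with flare margin {budget_rad:.2f} rad")
```

    Keeping each term as a named constant makes it easy to rerun the budget with a different trip length or a different flare margin.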

    I'm going to figure that the top-line components will cost 100x that of their conventional counterparts, due to the higher level of precision and QA required. It might well be a good deal more. In Russia, you've also got to pay for smuggling decent-grade hardware out of the US, as all of this stuff will be under massive amounts of regulation.

    My guess is that the cuts would have saved enough that those doing the cost-cutting could buy second homes in Switzerland.

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)