Slashdot Mirror


Programming Error Doomed Russian Mars Probe

astroengine writes "So it turns out U.S. radars weren't to blame for the unfortunate demise of Russia's Phobos-Grunt Mars sample return mission — it was a computer programming error that doomed the probe, a government board investigating the accident has determined." According to the Planetary Society Blog's unofficial translation and paraphrasing of the incident report, "The spacecraft computer failed when two of the chips in the electronics suffered radiation damage. (The Russians say that radiation damage is the most likely cause, but the spacecraft was still in low Earth orbit beneath the radiation belts.) Whatever triggered the chip failure, the ultimate cause was the use of non-space-qualified electronic components. When the chips failed, the on-board computer program crashed."

276 comments

  1. Excuse me... not a programmer's fault. by LostCluster · · Score: 5, Insightful

    We've got a contradictory summary here. Chip failure isn't a programming fault, it's a hardware problem. Stop confusing hardware and software you insensitive clod.

    1. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 3, Insightful

      Obviously the error handling routine was poorly written.

    2. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 5, Funny

      sure, it missed:

      if(cpu_melted)
            abort();

    3. Re:Excuse me... not a programmer's fault. by Tsingi · · Score: 0

      I concur. I didn't RTFA, TFS is contradictory.

    4. Re:Excuse me... not a programmer's fault. by Cochonou · · Score: 5, Informative

      Well... if you read TFA (or actually the first TFA linked), it is clearly written:
      In a report to be presented to Russian Deputy Prime Minister Dmitry Rogozin on Tuesday, investigators concluded that the primary cause of the failure was "a programming error which led to a simultaneous reboot of two working channels of an onboard computer [...] Likewise, cosmic rays and/or defective electronics are not the leading suspects behind Phobos-Grunt’s demise.
      The summary is clearly bolting together two contradicting reports.

    5. Re:Excuse me... not a programmer's fault. by MSesow · · Score: 5, Funny

      That could throw a ProcessorNotFoundException, be sure to code accordingly.

    6. Re:Excuse me... not a programmer's fault. by Rary · · Score: 1

      The second link makes the following claim:

      In a report to be presented to Russian Deputy Prime Minister Dmitry Rogozin on Tuesday, investigators concluded that the primary cause of the failure was "a programming error which led to a simultaneous reboot of two working channels of an onboard computer," the Russian state-owned news agency RIA Novosti reported.

      However, the third link says nothing of the sort. It sounds like TFS is just a mishmash of conflicting theories from different articles.

      --

      "You cannot simultaneously prevent and prepare for war." -- Albert Einstein

    7. Re:Excuse me... not a programmer's fault. by Rary · · Score: 3, Interesting

      To follow up, the article saying that it was a chip failure is dated yesterday, while the article claiming it was a programming failure is dated today. Presumably, this is new information to shoot down the previous claims, but TFS (in typical Slashdot "editorial" style) fails to actually make that distinction, and puts both claims together as part of a single summary.

      --

      "You cannot simultaneously prevent and prepare for war." -- Albert Einstein

    8. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 1

      You wouldn't need those chips if the probe's software was processed by The Cloud(tm)!

    9. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      Except that the OP said the SUMMARY was contradictory. And it is. This has nothing to do with reading TFA. It has everything to do with the summary.

    10. Re:Excuse me... not a programmer's fault. by MindStalker · · Score: 1

      Chip failure, but it was a software error that led to not handling the chip failure gracefully. Space-qualified stuff has to be much more redundant and capable of handling failures of multiple components.

    11. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 5, Funny

      This has nothing to do with reading TFA. It has everything to do with the summary

      You just defined all of slashdot. What was your point again?

    12. Re:Excuse me... not a programmer's fault. by icebike · · Score: 5, Interesting

      Obviously the error handling routine was poorly written.

      I'll assume your tongue was firmly planted in your cheek, and suggest a +1 Funny mod.

      But on the chance you were serious, depending on where that chip was, it may have been beyond something manageable by software.

      A chip in a power controller could take down any or all of the processor components, or render access to control circuits impossible.

      The linked article also states

      Everything was working well with the spacecraft immediately after launch, including deployment of the solar panels, until the command to start the engines was issued. When that did not happen, the spacecraft went into a safe mode, keeping the solar panels pointed to the Sun to maintain power.

      How many times do you suppose they actually tested engine start IN THE SPACE CRAFT? I'm guessing ZERO.

      non-space qualified parts being used in some of the electronics circuits. This is a design failure by the spacecraft engineers that might have been caught had they performed adequate component and system testing prior to flight. But they did not.

      So design failure, due to radiation, prior to the craft getting near the strongest radiation belts. Unbelievable. Occam would be skeptical.

      This sounds to me like some on-board internal source of radiation, or induction, or simple overload, fried a chip somewhere in some un-specified circuitry, most probably in the engine controls. This seems far more likely than an external radiation source given the shielding the physical design would provide.

      I doubt space qualification made any difference at all. The window for space radiation in the brief time it was operational was small.
      Rather I suspect under-spec parts, over voltage or high current draw, or internal shielding oversights.

      --
      Sig Battery depleted. Reverting to safe mode.
    13. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 3, Funny

      The Linux kernel throws an error about unsupported CPUs; how that code should execute in the first place is a mystery.

    14. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      This one time, I almost agree, but why didn't you software retards code in redundancy?

    15. Re:Excuse me... not a programmer's fault. by icebike · · Score: 2

      The second link in summary leads to an article that is internally contradictory. That page from Discovery News is all over the place.
      Which is not surprising given the bio of the author:

      Klotz came to Brevard County, Fla. (aka The Space Coast) as a copy editor for the local paper 24 years ago. She switched to writing because it was obvious the reporters were having way more fun than the editors for the same money. After a year or so of writing for the business section,
      Journalism major trying to wear the big girl shoes.

      The link to the Planetary Society page seems much more reliable.

      --
      Sig Battery depleted. Reverting to safe mode.
    16. Re:Excuse me... not a programmer's fault. by smcdow · · Score: 3, Funny

      You can't possibly call yourself a programmer if your code can't recover from a hardware fault.

      --
      In the course of every project, it will become necessary to shoot the scientists and begin production.
    17. Re:Excuse me... not a programmer's fault. by tripleevenfall · · Score: 4, Funny

      In Soviet Russia, code executes you!

    18. Re:Excuse me... not a programmer's fault. by 0123456 · · Score: 2

      A while back I read some interesting discussions between satellite engineers about the tradeoffs between space qualified and not space qualified chips. From what I remember you gain resistance to radiation, but lose in other areas such as resistance to physical damage (e.g. a solder joint coming loose due to launch vibrations) because they're so far behind the state of the art that you may have to put a lot more chips on the same circuit board.

      So it doesn't seem a clear-cut choice... rebooting the computer when it crashes is typically easier than fixing a solder joint when it's fifty million miles from Earth.

    19. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      So something like:

      try { // anything
      } catch (ProcessorNotFoundException e) {
              System.out.println(e.printJTAGTrace());
      }

      That was easy... lazy Russian programmers.

    20. Re:Excuse me... not a programmer's fault. by Squidlips · · Score: 1
    21. Re:Excuse me... not a programmer's fault. by geekoid · · Score: 1

      If a software failover fails, and the currently used chip fails, then it's both.

      Please, do some low-level software/hardware work before opening your mouth.
      This isn't one of your slapped-together VB3 front ends.

      Yeah, YOU Herd me.

      --
      The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
    22. Re:Excuse me... not a programmer's fault. by wjsteele · · Score: 4, Funny

      Actually, that code worked perfectly!!!

      Bill

      --
      It's my Sig and you can't have it. Mine! All Mine!
    23. Re:Excuse me... not a programmer's fault. by Grishnakh · · Score: 1

      I'm not a satellite engineer, but wouldn't it be easy enough to just install a lead shield around the PCB to protect from most radiation? As long as the shield's not too thick, it shouldn't add too much weight, especially compared to using older-technology chips that'll take up more board space.

    24. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      correct.
      The most likely cause of the restart of the two half-sets of the TSVM22 (digital computer) of the IOO (onboard computer system) is the local impact of heavy charged particles (TZCH) from space, which led to a fault in the RAM (random access memory) of the TSVM22 computational modules during the second orbit of the “Phobos-Grunt” spacecraft. The RAM failure could have been caused by a short-term malfunction of the ERI (elektroradioizdely) due to TZCH striking cells of the TSVM22 computational modules, which contain two chips of the same type, WS512K32V20G24M, located in a single case in parallel with each other. The exposure distorted the code, which caused the “restart” of the two half-sets of the TSVM22.

    25. Re:Excuse me... not a programmer's fault. by gatkinso · · Score: 1

      Space rated hardware, software, and (more relevantly) firmware is designed to handle this type of problem (to the fullest extent possible).

      --
      I am very small, utmostly microscopic.
    26. Re:Excuse me... not a programmer's fault. by alienzed · · Score: 3, Insightful

      On the other hand, this demonstrates so aptly why they failed in the first place. "Yep, it's a software problem, because the hardware failed to run any after it was damaged."

      --
      Never say never. Ah!! I did it again!
    27. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      Surface-mounted discrete parts are less vibration-damage-prone, but are more prone to radiation because of their smaller size. The opposite is true for through-hole parts. As for integrated circuits, I don't know what transistors they use in space vehicles, but the same radiation limitations probably apply to newer and smaller FET transistors.

    28. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 3, Informative

      In that case, the primary CPU is already up and running; it's booting additional processors.

    29. Re:Excuse me... not a programmer's fault. by Rakishi · · Score: 2

      How many times do you suppose they actually tested engine start IN THE SPACE CRAFT? I'm guessing ZERO.

      I'm sure they tested the engine multiple times. I'd figure the stress of the launch (vibrations, etc, etc.) causes something to fail either due to shoddy construction or small debris falling onto something.

      I doubt space qualification made any difference at all. The window for space radiation in the brief time it was operational was small.

      Exactly. I doubt all those laptops on the ISS are radiation hardened but they last quite a while anyway.

    30. Re:Excuse me... not a programmer's fault. by jamstar7 · · Score: 2

      At least they didn't fuck up a meters-to-feet conversion.

      --
      Understanding the scope of the problem is the first step on the path to true panic.
    31. Re:Excuse me... not a programmer's fault. by crutchy · · Score: 3, Interesting

      to my knowledge, only the Apollo Guidance Computer has ever truly achieved hardware failure tolerance. the Apollo 11 LM radar fault overloaded the computer, but it was able to continue thanks to restart logic built into the AGC that picked up critical tasks from where they were when the computer restarted and dropped non-critical tasks, all with a very small fraction of the capabilities of current technology (although I think from memory they were able to fit 2 transistors on a single chip!). the AGC is really a marvel of (past) engineering and computer science. the reliability problem alone would be insurmountable with today's garbage. probably part of the reason why we haven't been back there since.
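
      A minimal sketch of that kind of restart logic, assuming each task saves a checkpoint as it runs and carries a "critical" flag. The task names and the `restart` and `run_step` helpers are hypothetical illustrations, not the actual AGC design:

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    critical: bool
    checkpoint: int = 0  # last completed step, saved as the task runs


def run_step(task):
    """Run one unit of work and record progress so a restart can resume here."""
    task.checkpoint += 1
    return task.checkpoint


def restart(tasks):
    """Simulate a restart: critical tasks survive with their checkpoint, the rest are dropped."""
    return [t for t in tasks if t.critical]


# Hypothetical workload: one critical task and two that can be shed under overload.
tasks = [
    Task("landing_guidance", critical=True),
    Task("display_update", critical=False),
    Task("radar_polling", critical=False),
]
for t in tasks:
    run_step(t)
    run_step(t)

# After the restart only the critical task remains, and it resumes at step 3
# instead of starting over.
survivors = restart(tasks)
```

      The point of the design is that a restart sheds load instead of losing state: whatever is not marked critical simply does not come back.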

    32. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      The Linux kernel throws an error about unsupported CPUs; how that code should execute in the first place is a mystery.

      Easy, the boot loader passed the wrong processor info to the Kernel.

    33. Re:Excuse me... not a programmer's fault. by ajlitt · · Score: 0

      In Soviet Russia, fault tolerates YOU!

    34. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      We've got a contradictory summary here. Chip failure isn't a programming fault, it's a hardware problem. Stop confusing hardware and software you insensitive clod.

      perhaps they meant the error was in how the engineers were programmed...

    35. Re:Excuse me... not a programmer's fault. by K.+S.+Kyosuke · · Score: 5, Informative

      I'm not a satellite engineer, but wouldn't it be easy enough to just install a lead shield around the PCB to protect from most radiation? As long as the shield's not too thick, it shouldn't add too much weight, especially compared to using older-technology chips that'll take up more board space.

      Well, that depends. Even on Earth's surface, we have to use ECC in more demanding applications. In LEO, you lose the protection of the atmosphere but you still have Earth's rather strong and large magnetosphere. But this was an interplanetary probe. Once you get out of the radiation belts, interstellar and intergalactic particles start hitting you. You can't protect from those with a lead shield of any reasonable size. Pretty much the only way is simply to make the chip simple, rugged and design it with components (transistors) large enough that a particle flying through won't bother you much. Or add redundancy. Or both, if possible (that's the usual case).
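
      One common form of redundancy is to run several copies of a channel and take a majority vote. A rough sketch, assuming three channels that each report the same 8-bit word; `majority_vote` and `flip_bit` are made-up names for illustration, not anything from the probe's software:

```python
from collections import Counter


def majority_vote(readings):
    """Return the value reported by a strict majority of channels, or None if there is none."""
    value, count = Counter(readings).most_common(1)[0]
    return value if count > len(readings) / 2 else None


def flip_bit(word, bit):
    """Model a single-event upset by toggling one bit of a word."""
    return word ^ (1 << bit)


good = 0b10110010

# One of three channels takes a bit flip; the other two outvote it.
channels = [good, flip_bit(good, 5), good]
result = majority_vote(channels)
```

      Voting only helps when the channels fail independently. If two identical chips take the same hit, the wrong value wins the vote.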

      --
      Ezekiel 23:20
    36. Re:Excuse me... not a programmer's fault. by VortexCortex · · Score: 1

      You can't possibly call yourself a programmer if your code can't recover from a hardware fault.

      I agree.
      "Beware of programmers who carry screwdrivers." - Leonard Brandwein

    37. Re:Excuse me... not a programmer's fault. by icebike · · Score: 2

      How many times do you suppose they actually tested engine start IN THE SPACE CRAFT? I'm guessing ZERO.

      I'm sure they tested the engine multiple times. I'd figure the stress of the launch (vibrations, etc, etc.) causes something to fail either due to shoddy construction or small debris falling onto something.

      I'm sure they tested the engines too. It's probably a tried and true engine. The Russians tend to make very good motors.

      But I seriously doubt they tested it in the space craft using the space craft's wiring harness. They used the harness on the test bed platform.

      --
      Sig Battery depleted. Reverting to safe mode.
    38. Re:Excuse me... not a programmer's fault. by Colourspace · · Score: 0

      An interesting point. I have been a satellite HW engineer (ground technology demonstrator), and sold hi-rel rad hard silicon into aerospace. And it's a bit embarrassing for me to admit, but I don't know why they just don't do that. Bit obvious though, so there must be a good reason it isn't done. Someone remind me? I forgot.

    39. Re:Excuse me... not a programmer's fault. by Colourspace · · Score: 0

      absolute bullshit.

    40. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      I wish I had mod points.. It never occurred to me that simply making stuff bigger had the side effect of hardening against random particles (I guess because higher capacity means lower change per particle, and higher working voltage means that a particle has to be more charged to change a signal from 0 to 1)

    41. Re:Excuse me... not a programmer's fault. by Colourspace · · Score: 1

      Actually... having thought a bit, probably down to launch weight. Lead is heavy, and I'm not sure how much you would effectively need to stop *all* types of ionizing radiation getting through to the circuitry.

    42. Re:Excuse me... not a programmer's fault. by Colourspace · · Score: 1

      Sorry - probably IS down to launch weight...

    43. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 1

      I'm not a satellite engineer either. But I believe the problem is cosmic rays: when a cosmic ray hits something it creates a shower of radiation, so shielding, unless it's substantial, creates more problems than it solves. Instead of one cosmic ray zipping through and maybe flipping a bit, you get hundreds.

    44. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      If they didn't install radiation hardened chips, that's an incredible piece of mismanagement. Any engineer putting electronics in a radioactive environment would as a matter of course take that into account.

    45. Re:Excuse me... not a programmer's fault. by Beardo+the+Bearded · · Score: 0

      Nope.

      When the fab labs make parts, they try to make everything space-qualified. (Aerospace-rated). Those that fail some of the tests (operating temperature limits, g-shock resistance, environmental ruggedness, etc.) go into the mil-spec bin. Those that fail those tests get dumped into the automotive bin. Fail the automotive tests, and the parts go into the industrial bin. The chips that fail that set of tests go into the consumer bin. If you buy consumer-grade chips and put them into a case that's going into space, it's pretty much a guaranteed failure.

      I'm not sure where the idea that the parts are older comes from. That used to be the case because we couldn't make high-frequency parts in solid-state and had to use vacuum tubes, but that got fixed a decade ago.

      --

      ---
      ECHELON is a government program to find words like bomb, jihad, plutonium, assassinate, and anarchy.
    46. Re:Excuse me... not a programmer's fault. by Beardo+the+Bearded · · Score: 3, Funny

      Amateur. My software is so good it doesn't even NEED hardware.

      --

      ---
      ECHELON is a government program to find words like bomb, jihad, plutonium, assassinate, and anarchy.
    47. Re:Excuse me... not a programmer's fault. by Hadlock · · Score: 1

      This sounds to me like some on-board internal source of radiation, or induction, or simple overload, fried a chip somewhere in some un-specified circuitry, most probably in the engine controls. This seems far more likely than an external radiation source given the shielding the physical design would provide.

      If you send the signal to start the engines, and they can't/won't start, does it matter if the computer crashes because there was no error code for that? I guess at that point we start running into chicken-or-the-egg problems soon afterwards.

      --
      moox. for a new generation.
    48. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 1

      This seems far more likely than an external radiation source given the shielding the physical design would provide.

      It isn't practical to shield spacecraft against cosmic rays: the shielding would need to be tens of metres thick. Adding small amounts of shielding actually makes the problem worse, because instead of a single cosmic ray, you get a shower of secondary particles knocked loose by the original cosmic ray from your shielding material.

      For comparison, note that cosmic rays are posited as a source of occasional bit-flip errors in computers on Earth's surface, which are shielded by tens of kilometres of atmosphere (equivalent in mass to ~10m of water).

      Space is a high-radiation environment. Hardware that needs to work there needs to be designed with this in mind. If you put generic hardware in space, it will fail with random bit-flip errors, probably as soon as it's turned on.

    49. Re:Excuse me... not a programmer's fault. by icebike · · Score: 1

      Except no one knows for certain the computers crashed at all.
      The second link is the only thing mentioning a computer restart, but that is a one line sentence thrown in with a bunch of other possible causes.

      A major short circuit in the engine controls could instantaneously trip both computers off line. Is that a programming error?

      --
      Sig Battery depleted. Reverting to safe mode.
    50. Re:Excuse me... not a programmer's fault. by icebike · · Score: 1

      Excuse me but your post is self contradictory.

      What happened to the tens of meters thick shielding? You throw that out there, then wave it away by implying you can make it all unnecessary by designing "with this in mind".

      You totally overlook the fact that off the shelf laptops are being used daily on the ISS with no problems. Have been up there for years.

      Random bit flip errors as soon as it's turned on you say?

      I can see why you post as AC.

      --
      Sig Battery depleted. Reverting to safe mode.
    51. Re:Excuse me... not a programmer's fault. by ChrisMaple · · Score: 3, Informative

      There are many aspects to radiation hardness. Radiation can flip one or more bits, resulting in bad data or program crash. Radiation can cause latchup, which will last until power is cycled; if the design is bad, latchup can fry a part. Rad hard parts are designed to be resistant to latchup. Really bad radiation can damage a part that isn't even powered.

      A laptop can live through bit flips, and with luck it can live through latchup, and be functional after power cycling. Spacecraft control generally has to be always on; power cycling is not an option. Thus the design requirements for spacecraft control must be much stricter.

      --
      Contribute to civilization: ari.aynrand.org/donate
    52. Re:Excuse me... not a programmer's fault. by LostCluster · · Score: 1

      The software crashed because there was no working processor left... can't fail over when the backup is the same chip with the same problem.

    53. Re:Excuse me... not a programmer's fault. by pixelpusher220 · · Score: 5, Funny

      Except no one knows for certain the computers crashed at all.

      I'm quite sure that the computers crashed. Right along with the spacecraft ;-)

      --
      People in cars cause accidents....accidents in cars cause people :-D
    54. Re:Excuse me... not a programmer's fault. by pixelpusher220 · · Score: 1

      to shoot down

      Wait it was shot down????? Quick update the summary!

      --
      People in cars cause accidents....accidents in cars cause people :-D
    55. Re:Excuse me... not a programmer's fault. by icebike · · Score: 1

      Well played sir!

      --
      Sig Battery depleted. Reverting to safe mode.
    56. Re:Excuse me... not a programmer's fault. by Samantha+Wright · · Score: 1

      Right. That's the difference between "caused" and "doomed". The chip failure proximally started the chain of events that led to the mission's death, whereas the programming failure spelled ultimate doom. (Dooooooooooooom!)

      ...or maybe it should be the other way around. Hmm.

      --
      Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
    57. Re:Excuse me... not a programmer's fault. by DarwinSurvivor · · Score: 1

      I actually read somewhere that the main reason NASA uses old laptops and computers is that the new hardware (nanometer CPUs etc.) is made so small that it can't tolerate the radiation. Sort of like how SNES's live for 2 decades and PS3's last 2 years.

    58. Re:Excuse me... not a programmer's fault. by pixelpusher220 · · Score: 2

      Bah. My software turns hardware INTO software! Mostly molten pools....

      --
      People in cars cause accidents....accidents in cars cause people :-D
    59. Re:Excuse me... not a programmer's fault. by pixelpusher220 · · Score: 1

      That's one planetary entry I would have loved to see live....

      --
      People in cars cause accidents....accidents in cars cause people :-D
    60. Re:Excuse me... not a programmer's fault. by ChrisMaple · · Score: 4, Insightful

      Many chips are never designed to meet military or space specifications: the extra certification is very, very expensive and there are design compromises between performance and ruggedness. Furthermore, the testing you suggest for space qualification, if failed, results not in a mil-spec component but a component that has been destroyed by the test. In some cases, samples of a given batch are heavily tested to verify the batch, but those devices are considered damaged and not sold.

      Some rad hard type devices are of no interest to consumer design due to the poor performance caused by the compromises involved in achieving hardness. Rad hard devices aren't designed as often due to the small market, and the design is more difficult and takes longer, and certification takes time, too. Thus, the devices are older technology. Additionally, rad-hard parts (the actual transistors inside the ICs) are bigger physically than conventional devices, which also means they can be fabricated on older technology equipment. Thus, with respect to current commercial technology, space-qualified devices are often older technology.

      --
      Contribute to civilization: ari.aynrand.org/donate
    61. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      weight and the capability of some particles to still penetrate basically any material. You have to design the hardware and software to compensate, because errors will occur. Redundancy is better than rad hard though.

    62. Re:Excuse me... not a programmer's fault. by Grishnakh · · Score: 1

      I don't think you're going to stop *all* radiation with a lead shield of any size; the idea is to get a good amount of attenuation for a certain size/weight. Obviously, this is absolutely no replacement for redundancy and other such measures to mitigate the problems seen in a high-radiation environment, I was only proposing it as a first step to help reduce the radiation levels so that you wouldn't see so many radiation-related errors. I mean, if a thin lead shield could reduce your radiation levels by a couple orders of magnitude (not saying it will, just suppose), then wouldn't that be worth it?

    63. Re:Excuse me... not a programmer's fault. by Grishnakh · · Score: 1

      I wonder if the problem is bouncing; I was told by a CT scan tech recently that they don't use lead shielding on patients undergoing head-area CT scans because the radiation ends up bouncing around inside the chest cavity because of the shield, rather than escaping into the environment, causing even more damage.

    64. Re:Excuse me... not a programmer's fault. by Grishnakh · · Score: 1

      Exactly. You're not going to find an Intel Core i7 CPU in a space-qualified version.

    65. Re:Excuse me... not a programmer's fault. by JonySuede · · Score: 1

      "Beware of programmers who carry screwdrivers." - Leonard Brandwein

      "Be afraid of programmers who carry wire-cutters" - Me

      --
      Jehovah be praised, Oracle was not selected
    66. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      Amazing how many people didn't notice that!

    67. Re:Excuse me... not a programmer's fault. by chispito · · Score: 1

      Sort of like how SNES's live for 2 decades and PS3's last 2 years.

      I would have guessed the difference there is due to moving parts, not scale or complexity.

      --
      The Daddy casts sleep on the Baby. The Baby resists!
    68. Re:Excuse me... not a programmer's fault. by garyebickford · · Score: 1

      I dunno about space, but down here on earth, IIRC the probability of a bit flip in computer memory is fairly high. Per This, "back in 1996 IBM estimated you would see one a month for every 256MB of RAM.", so in my 3G laptop that's about 12 per month. And that's at the bottom of the atmospheric well, using much larger scale memory technology that is more robust with respect to this problem. Today's RAM is what - 100 times smaller in area per bit? Which makes it 100 times more susceptible, all other things being equal. If 100 is the right number, that's about 40 per day on my laptop.

      You may be experiencing bit errors all the time, but many/most are occurring in memory blocks that are not in use just at that moment, or in data or code that your program doesn't happen to access for various reasons, or maybe that bit just isn't important. The more memory you have, the more likely it is that the bit error doesn't matter. But if you only have 16KiB RAM, one bit is much more likely to make a huge difference.
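
      The back-of-envelope numbers above can be checked with a few lines. This is only a sketch of the arithmetic: the one-upset-per-256MB-per-month figure is the quoted 1996 estimate, while the 100x density factor and the 30-day month are assumptions, and `upsets_per_month` is a made-up helper name:

```python
def upsets_per_month(ram_mb, base_rate_per_256mb=1.0, density_factor=1.0):
    """Scale the quoted rate (upsets per month per 256 MB) to a given RAM size.

    density_factor is a guessed multiplier for smaller memory cells,
    not a measured value.
    """
    return (ram_mb / 256) * base_rate_per_256mb * density_factor


laptop_mb = 3 * 1024  # the 3G laptop from the comment

per_month_1996 = upsets_per_month(laptop_mb)
per_month_scaled = upsets_per_month(laptop_mb, density_factor=100)
per_day_scaled = per_month_scaled / 30  # assuming a 30-day month

print(per_month_1996, per_month_scaled, per_day_scaled)
```

      That reproduces the 12 per month and roughly 40 per day figures, so the arithmetic holds. Whether 100 is the right factor is a separate question.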

      --
      It's easier to be a result of the past, but more fun to be a cause of the future! http://www.spacefinancegroup.com/
    69. Re:Excuse me... not a programmer's fault. by garyebickford · · Score: 1

      A comment higher up noted that when a cosmic ray strikes the lead, it will cause an avalanche of scattered secondary particles. So some shielding may be worse than no shielding.

      --
      It's easier to be a result of the past, but more fun to be a cause of the future! http://www.spacefinancegroup.com/
    70. Re:Excuse me... not a programmer's fault. by bughunter · · Score: 5, Informative

      As another EE with experience in rad hard space qualified design, he's not being self-contradictory. He's spot on.

      If your CMOS structures are prone to latchup in the presence of single high energy events, then shielding does you no good. The amount of shielding necessary would more than consume the entire payload mass budget. Adding insufficient shielding just creates showers of secondary particles, each with more than enough energy to cause latchup alone, therefore rendering you at a statistical loss compared to no shielding whatsoever.

      With this in mind means designing the CMOS structure to make shielding unnecessary. For example, build your circuits on bulk insulators instead of bulk semiconductor.

      Just because you can't understand it doesn't mean he's self contradictory. You just missed his point. And then attacked him.

      --
      I can see the fnords!
    71. Re:Excuse me... not a programmer's fault. by icebike · · Score: 2

      100 times smaller in area per bit? Which makes it 100 times more susceptible,

      Or 100 times less susceptible assuming a random dispersal of cosmic rays. Smaller targets.
      Depends on the density of the rays I suppose.

      But in any case, that many errors WOULD be noticed if they were in fact occurring and going undetected and uncorrected
      by the hardware. Just about zero memory goes unused in the modern computer. They strive to use it all in one way or
      another. Unused memory is wasted memory.

      Computers correct for these errors. Parity checking either in hardware or software. You can compare the content
      of files that have been sitting on disk or have been moving thru memory for years, and you never see unexplained
      changes to those files, even when such changes would be very evident (such as plain text files).

      So it's either not happening as much as the article suggests, or it's already handled via error detection
      and correction and redundancy.

       

      --
      Sig Battery depleted. Reverting to safe mode.
    72. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      So they should have used TWO Arduinos then.

    73. Re:Excuse me... not a programmer's fault. by OhSoLaMeow · · Score: 2, Funny

      I wonder if the chips were code named "Moose" and "Squirrel"...

      --
      They can take my LifeAlert pendant when they pry it from my cold dead fingers.
    74. Re:Excuse me... not a programmer's fault. by robot256 · · Score: 3, Interesting

      Actually, darwin is kind of right. The difference between 120nm transistors and 45nm transistors is quite substantial. Between random radiation, natural wear due to thermal cycling, and periodic electrostatic discharges from handling and plugging in connectors, it is not surprising that the older chips are sturdier in general.

      But he may have just invoked the "They don't make them like they used to" logical fallacy, because sure there are some 20-year-old SNES machines, but how many of them died 2 years after production? Compare that percentage to the figure for PS3's and you have your answer.

    75. Re:Excuse me... not a programmer's fault. by Grishnakh · · Score: 2

      Maybe they should try magnetic shielding. For a human spacecraft, it'd be quite an undertaking, but for protecting a small electronics module, maybe it wouldn't be so difficult.

    76. Re:Excuse me... not a programmer's fault. by SEWilco · · Score: 1

      We noticed it, we just didn't bother to process the notification.

    77. Re:Excuse me... not a programmer's fault. by SEWilco · · Score: 1

      I'm not a satellite engineer, but wouldn't it be easy enough to just install a lead shield around the PCB to protect from most radiation? As long as the shield's not too thick, it shouldn't add too much weight, especially compared to using older-technology chips that'll take up more board space.

      To cosmic rays, a lead shield is just a bigger impact target and a source for more secondary particles. Unless your lead shield is larger than current spacecraft.

    78. Re:Excuse me... not a programmer's fault. by ankhank · · Score: 1

      > a lead shield
      Nope, look up "secondary radiation"

    79. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      It wasn't a fancy restart logic or "true hardware failure tolerance," but just simple task prioritization. Today's "garbage" should do the same thing with any OS written in the last several decades, otherwise your computer would crash every time the cpu load tried to go over 100%.

    80. Re:Excuse me... not a programmer's fault. by hairyfeet · · Score: 5, Interesting

      Which makes me think of something I've been wondering for awhile, now that Intel has quit making the 386 are we gonna be seeing more failures like this in the future? Because from what i understand Intel kept making the 386 rev for so damned long (last chip rolled out in 09 IIRC) because its large die area and primitive but functional design made it trivial to harden for military and aerospace use. Now again from what I've been told due to the die shrinks that a modern chip, even something as old as the P3 or P4 would be hell to harden simply because its smaller dies and tighter tolerances would make it hell to protect from bit flips caused by cosmic rays, not to mention outright frying the chip from radiation exposure.

      so are there any modern chips that would be easy to harden without being insanely expensive? Atom? AMD Geode? I'm sure with its GPU and dual cores Bobcat would be right out, maybe Via C3s? While ARM would be a good guess its die shrinks to fit in mobile phones would probably make it insanely expensive to harden yes? So while i'm sure the military probably bought a warehouse full of 386s before intel shut down what happens when they are gone? do we have a viable modern chip that can withstand the rigors of space without costing insane amounts of money?

      --
      ACs don't waste your time replying, your posts are never seen by me.
    81. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      Why would you doubt space qualification making a difference? Space qualified parts are actually designed differently -- per wikipedia, the different design elements include:

                Insulating substrate (silicon on insulator or silicon on sapphire, for instance) instead of normal semiconductor substrate.
                Physical shielding
                Using SRAM instead of DRAM, DRAM is essentially a bunch of capacitors and so is highly susceptible to "bit flipping".
                Redundant components, so if a stage disagrees the CPU knows there is a problem and can potentially rerun that stage; ECC memory to detect and correct errors; a watchdog timer to reset the whole thing if it hangs.
                Not mentioned on wikipedia, I've read radiation hardened CPUs tend to use a larger process too -- larger traces and such are less likely to be physically damaged than some trace that is like 5 or 10 atoms wide.

                Per wikipedia, a standard commercial-grade chip can withstand about 50-100 grays (which is 5-10 krads), while the IBM RAD750 can withstand 2000-10000 grays (and the actual RAD750 motherboard is normally set up to withstand 1000 grays.) How much radiation is there in space? I don't know, but they wouldn't bother with these costly radiation-hardened chips if they weren't needed!
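
                The unit conversion behind those figures is simple enough to sketch. It relies on the standard definition 1 gray = 100 rad; the function name is made up for illustration, and the dose figures are just the ones quoted above.

```python
# 1 gray (Gy) is defined as 100 rad, so 1 krad = 10 Gy.
RAD_PER_GRAY = 100

def gray_to_krad(gray):
    """Convert an absorbed dose in grays to kilorads."""
    return gray * RAD_PER_GRAY / 1000

# Figures quoted in the comment above.
commercial_krad = (gray_to_krad(50), gray_to_krad(100))
rad750_krad = (gray_to_krad(2000), gray_to_krad(10000))

print(commercial_krad, rad750_krad)
```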

    82. Re:Excuse me... not a programmer's fault. by khallow · · Score: 1

      to my knowledge, only the Apollo Guidance Computer has ever truly achieved hardware failure tolerance.

      [...]

      the reliability problem alone would be insurmountable with today's garbage. probably part of the reason why we haven't been back there since.

      I don't buy that claim. My suspicion is that almost everything that gets thrown into space has some degree of hardware failure tolerance. Maybe it doesn't reach the "no true Scotsman" level of "true" hardware failure tolerance, but this isn't a magical idea that was ignored for the past 40 years.

    83. Re:Excuse me... not a programmer's fault. by icebike · · Score: 1

      Why would you doubt space qualification making a difference?

      Because NASA has been testing off the shelf laptops and Android devices on the ISS for the last 3 years and has found no problems
      at all. Off the shelf. Go to Dell, buy a laptop. Launch. Use for over a year. No reported problems.

      --
      Sig Battery depleted. Reverting to safe mode.
    84. Re:Excuse me... not a programmer's fault. by Tastecicles · · Score: 1

      Be wary of programmers who speak fluent C.
        - Me.

      --
      Operation Guillotine is in effect.
    85. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      Everyone here is forgetting that this was Russia. Engineers used cheap components and pocketed the difference.
      That's how they roll around here.

    86. Re:Excuse me... not a programmer's fault. by Sulphur · · Score: 1

      But it worked great in the simulator.

    87. Re:Excuse me... not a programmer's fault. by Sulphur · · Score: 1

      This one time, I almost agree, but why didn't you software retards code in redundancy?

      Ada FTW

    88. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      It seems like a modern microcontroller could substitute for a 386. For example, an Atmel AVR generally runs at 1 clock per instruction which is a hell of a lot faster, clock for clock, than even a 386.

      The die and form factor is large like a 386 too. Larger even.

      However, the memory management and such is a lot more primitive. Running a full blown, memory protected, operating system on an AVR would be an exercise in futility.

    89. Re:Excuse me... not a programmer's fault. by int19 · · Score: 2

      Other industries are starting to be hit by a similar problem as flash manufacturers ever-increase the density of the chips and start EOL-ing their lower density models. This comparatively extreme density makes them unreliable for certain high-integrity, critical data logging applications. One technique the manufacturers seem to employ (I have seen this first hand but am not an EE) is to stack multiple dies within a single IC with some type of very thin metallic(?) padding material between them. This padding in turn wreaks havoc on the IC when subjected to high temperatures (>200C), which other chips could handle just fine in terms of not losing data.

    90. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      "should" is not the same as "will"... the difference between these two is what reliability is all about

      you're right that there was nothing "fancy", but only by today's standards.

      it might seem like "simple" task prioritization, but can you identify any other computer that does this? the marvel of the AGC wasn't merely task priorities, but its ability to restart and continue from where it left off without the LM crashing on the moon, without losing track of where the spacecraft was or its orientation, all while being barraged by garbage data from a fault in a radar.

      it would seem in this day and age where server availability is critical for business operations that this feature would be prevalent in server hardware. I've never heard of a web server restart and commence from where it left off without losing track of sessions active before the restart (not to be confused with live migration). The problem with today's operating systems is that they aren't task-specific like the AGC, so they are bloated to the point where it takes at least 30-odd seconds for a restart which is enough for a web browser to treat as a timed out connection.

    91. Re:Excuse me... not a programmer's fault. by crutchy · · Score: 1

      Would you fly in an aircraft controlled by a Dell PC running Windows 7?

      The flight attendant would tell you that you should arrive safely at your destination.

    92. Re:Excuse me... not a programmer's fault. by EETech1 · · Score: 4, Interesting

      I asked one of the main AVR designers from Norway if it was ok to set a configuration, or a constant in RAM during initialization and trust with 100% certainty that it would not change during operation. He said that even on the world's cleanest power supply, and absent the presence of any EMI, he would still NOT recommend it.

      If you run 10 AVRs for 1000 hours you will see bits flipped. Many times it only affects a RAM variable that is constantly being recalculated anyways, so it causes little if any disruption to the operation of the device.

      It really sucks when it's something critical like a timer counter control register.

      If anyone would like to duplicate my testing, I'd be glad to send code, but all you have to do is set everything to a known value, and then read it over and over til it changes. It doesn't take as long as you think (or hoped) it would! It also gives you a good idea on how well your PCB takes care of your Micro.

      Always check, and if necessary, reset your hardware configs during runtime! Those "all of a sudden it started acting up, so I turned it off and back on again and it was fine" problems just disappear!
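
      The "check and reset your configs at runtime" advice can be sketched as a periodic scrub against a known-good copy. This is a simulation in Python rather than real AVR firmware, and it is an assumption about how such a check might look, not the poster's code: the register names and values are invented, and on real hardware the golden copy would have to live somewhere more trustworthy than RAM (flash or ROM, say).

```python
import random

# Hypothetical register map: names and values are made up for illustration.
GOLDEN_CONFIG = {"TIMER_CTRL": 0x05, "ADC_CTRL": 0x81, "CON_0": 0x00}

def inject_bit_flip(registers, rng):
    """Simulate a single upset: flip one random bit in one 8-bit register."""
    name = rng.choice(sorted(registers))
    registers[name] ^= 1 << rng.randrange(8)
    return name

def scrub(registers, golden=GOLDEN_CONFIG):
    """Compare live registers to the golden copy; rewrite and report mismatches."""
    repaired = []
    for name, expected in golden.items():
        if registers.get(name) != expected:
            registers[name] = expected
            repaired.append(name)
    return repaired

rng = random.Random(0)
live = dict(GOLDEN_CONFIG)      # configuration as written at initialization
hit = inject_bit_flip(live, rng)
fixed = scrub(live)             # would be called periodically from the main loop
print("upset in", hit, "repaired", fixed)
```

      Returning the list of repaired registers, rather than silently fixing them, keeps a record of how often the hardware is actually being upset.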

      I still remember the time my CON_0 register read 8! Although I'm sure it'll happen again, you'll never notice it!

      Cheers

    93. Re:Excuse me... not a programmer's fault. by EETech1 · · Score: 1

      ON ERROR:
      JMP Reset

      Is the absolute worst way to keep a program running:)

    94. Re:Excuse me... not a programmer's fault. by rwv · · Score: 1

      power cycling is not an option

      Power cycling becomes an option when redundancy is part of the design. If there are 2, 3, or 4 computers doing the same calculations it decreases the likelihood that a hardware problem will compromise the mission.

    95. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      Except there are quite a few programs these days that will restart where they left off when they crash and restart. Important programs like web servers that don't do so likely skip it because their authors judged the performance hit of keeping and updating data about the current state in an easy to re-grab format as not worth it. Of course, a large part of this is having error handling routines that can catch most errors without requiring a restart of the software. If programmers expected the server to restart every time it had to issue a 404 error to someone, they could make it recover the state upon restart as needed.

    96. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      That was not saying all modern computers are going to have the same fault tolerance, especially if you are going to try to compare an early, industrial real-time operating system to a modern, consumer, non-RT operating system. Nonetheless, the example of the AGC recovering from that issue isn't a good reason to argue it is more fault tolerant than modern computers, as it was just an antiquated error handling method. Difference in fault tolerance comes from it being a small, narrow task-specific, heavily reviewed code base.

    97. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      The official statement is probably very political in nature and may not reflect reality so much as spin.

    98. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      to my knowledge, only the Apollo Guidance Computer has ever truly achieved hardware failure tolerance. the Apollo 11 LM radar fault overloaded the computer ...

      That wasn't a hardware fault. The rendezvous radar wasn't intended to be turned on during landing, and the crew was trained not to do so, but Aldrin left it turned on because it seemed a good idea to him in case they had to do an abort ... and because he didn't know that leaving it on would swamp the computer with useless data. But your point that the (good) software design was able to recover from this problem is a good one.

    99. Re:Excuse me... not a programmer's fault. by hutsell · · Score: 1

      In Soviet Russia, code executes you!

      In Soviet Russia, doomed programming error has Mars probe you!!

      --
      Yesterday's Weirdness is Tomorrow's Reason Why
    100. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      apparently they knew about the fault in the radar too. something to do with frequency phase shift of the power supply to the LM radar, but they opted to stick with the radar as tested rather than fix the problem and potentially introduce new problems.

      ~ crutchy

    101. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      This is why the good lord gave us SBC/DBD ECC circuits. And why all high-end, mission critical systems use them. And why it's so ill-conceived that home computers almost always do not have them.
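
      For anyone wondering what such a circuit computes: the scheme is usually called SEC-DED (single error correction, double error detection), which appears to be what "SBC/DBD" refers to. Below is a minimal sketch using an extended Hamming (8,4) code on a 4-bit value. Real ECC memory typically protects much wider words in hardware, so this toy version, including its function names, is only an assumption-laden illustration of the idea.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                       # index 0: overall parity, 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]    # parity over positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]    # parity over positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]    # parity over positions with bit 2 set
    code[0] = sum(code[1:]) % 2              # overall parity enables double detection
    return code

def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'double'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos              # XOR of set positions is 0 for a valid word
    overall = sum(code) % 2              # 0 when overall parity is consistent
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1              # single flip; syndrome 0 means parity bit 0
        status = "corrected"
    else:
        return None, "double"            # two flips: detectable but not correctable
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status

word = encode(0b1011)
word[6] ^= 1                             # simulate one flipped bit
print(decode(word))
```

      The overall parity bit is what separates "one flip, fix it" from "two flips, give up and report", which is the double-detect half of the scheme.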

    102. Re:Excuse me... not a programmer's fault. by hairyfeet · · Score: 1

      Oh I believe you friend and what's more i can tell you why, it's cosmic rays. Even on earth there are enough cosmic rays that you WILL get bit flips even in the best shielded chips, you just get fewer flips the better shielded the chip is. I have a friend that is a retired NASA engineer, he designed and helped build most of the models and mockups and he said even with the hardened chips they had a formula to figure how much ECC to build into the system based on worst case scenarios for cosmic rays. You should look up the Google study on ECC and bit flips, they figured there was something like 3000+ bit flips a month in your average PC and the OS simply compensates for the error or reruns the code. Of course in space you can't really afford to do that which is why they have to have everything extra hardened because even with ECC there are so many rays out there that your code would be salad.

      Anyway if you are interested in that kind of thing, look up the Google paper as they did a multi year study on their server farms and came up with all kinds of interesting data. But I believe you, my friend still makes rockets and robots for NASA competition entries for the local college and I've seen bit flip with those small fast embedded chips they use for the instruments. You just have to code in a "fudge factor" as he put it and if you win the competition you can always use more hardened chips.

      --
      ACs don't waste your time replying, your posts are never seen by me.
    103. Re:Excuse me... not a programmer's fault. by theArtificial · · Score: 1

      Because Nasa has been testing off the shelf laptops and Android devices on the ISS for the last 3 years and have found no problems at all. Off the shelf. Go to Dell, buy a laptop. Launch. Use for over a year. No reported problems.

      Isn't the ISS shielded where the sensitive humans are using the non shielded devices?

      --
      Man blir trött av att gå och göra ingenting.
    104. Re:Excuse me... not a programmer's fault. by K.+S.+Kyosuke · · Score: 1

      You totally over look the fact that off the shelf laptops are being used daily on the ISS with no problems. Have been up there for years.

      First, ISS is well within the radiation belts and thus it has a partial protection from most of the charged particles. This does not hold for interplanetary probes.

      Second, those laptops are not mission-critical, much less life-critical. If an astronaut's email package crashes or its data gets corrupted, the station's O2 valve won't close suddenly and refuse to reopen. As far as I know, ISS is using rad-hardened Astrium computers for its critical systems, not a bunch of laptops.

      --
      Ezekiel 23:20
    105. Re:Excuse me... not a programmer's fault. by icebike · · Score: 1

      The probe was also lost inside the radiation belts.
      And nasa sent the laptops and ipads specifically to test for radiation problems.

      Not happening.

      --
      Sig Battery depleted. Reverting to safe mode.
    106. Re:Excuse me... not a programmer's fault. by K.+S.+Kyosuke · · Score: 1

      The probe was also lost inside the radiation belts.

      Non sequitur. You seem to have a problem with logical thinking. GGP was talking about the design of the probe for the whole trip and of the inadvisability of any thicker shielding; he wasn't even talking about the failure modes of this particular probe in LEO. (I'm inclined to think that your post was right about those non-rad-related failure modes as being the more likely cause, though, I'll give you that.) The fact that this particular probe failed (for whatever reason) before leaving LEO does not change the game with respect to rad protection.

      Then, he mentioned space (in general, without qualification) to which you reacted with some laptops-on-board-of-ISS nonsense. Again, that hardly refutes what he said.

      To sum it up, he was simply arguing that your notion of the probe designers (any probe designers) trying to put a strong shield around the probe ("the shielding the physical design would provide", you wrote) to increase rad protection is flawed because it is counterproductive to attempt to provide one in the first place; you save weight, money and nerves by doing a HW design that survives the rad exposure instead of preventing the exposure.

      He didn't explicitly mention that such HW designs preclude rad-caused system failure by both internal and external radiation sources to exactly the same degree and therefore, your hypothesis about a failure from an internal radiation source being more likely than one from an external source (because of your alleged "internal shielding oversights") is completely without merit. He probably thought you'd be able to figure out on your own what he was talking about. Seems that he was overly optimistic about that.

      Oh, and by the way...the major internal rad source on any spacecraft is its RTG source, if the probe has any. But you forgot that these use Pu238 which is an alpha radiation source and extremely easily shielded, as opposed to high-energy interstellar particles. So, no, the risks from internal rad sources are lower than the ones from external radiation even in LEO, not to mention heliocentric trajectories outside radiation belts.

      And nasa sent the laptops and ipads specifically to test for radiation problems.

      And here I was thinking that they have the laptops to run experiments and collect useful data. Silly me...

      --
      Ezekiel 23:20
    107. Re:Excuse me... not a programmer's fault. by Lord_Jeremy · · Score: 1

      I'm also going to suggest that a NES chipset produces a great deal less heat (and thus undergoes significantly less thermal stress) than a PS3 GPU.

    108. Re:Excuse me... not a programmer's fault. by EETech1 · · Score: 1

      Thanks for adding that. I missed that in my reply. It was the topic of the parent post, but I started dozing off, so I called it quits, and hit submit. 3:30 Bedtime:)

      I asked the question after failing validation with an AVR when they first came out. I wrote a little assembly MemTest loop, and let them run, I knew there were no bugs in that loop, yet it still failed occasionally on multiple p# devices.

      I cringe when I look at the code that is running so many things, and at the failure modes present in so many of the standard libraries used everywhere in the embedded world.

      I've read the Google paper, and it's very interesting BTW.

      seen?: www.amazon.com/Embedded-Systems-Firmware-Demystified-CD-ROM/dp/1578200997

      Or?: www.bookf.net/p/6920-math-toolkit-for-real-time-programming

      Cheers

    109. Re:Excuse me... not a programmer's fault. by Anonymous Coward · · Score: 0

      Especially when power cycling the CPU, say, need not include power cycling the RAM, or even power cycling one block of RAM need not include power cycling all of it.

  2. Programming error? by mehrotra.akash · · Score: 5, Funny

    the ultimate cause was the use of non-space-qualified electronic components

    Programming error?
    Perhaps in the software used to order the parts

    1. Re:Programming error? by Anonymous Coward · · Score: 0

      they probably made the mistake of ordering parts from America

    2. Re:Programming error? by Anonymous Coward · · Score: 0

      "Corruption detected in supply chain; continue?" "Da."

    3. Re:Programming error? by rubycodez · · Score: 1

      indeed, that's a surefire way to acquire a pile of poor quality chinese-manufactured crap

  3. Typical by Anonymous Coward · · Score: 0

    The electronic engineers here are always trying to blame programmers for their design faults too.

  4. headline fail by jamessnell · · Score: 3, Informative

    "the ultimate cause was the use of non-space-qualified electronic component" != "programming error" hardware fail.

    1. Re:headline fail by X0563511 · · Score: 1

      Even better... a design fail! The hardware worked (or not) as per its specifications. It's not the hardware's fault you put it where it wasn't meant to go!

      --
      For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
    2. Re:headline fail by jamessnell · · Score: 1

      Pretty much. Though in a sense, it probably wasn't a design fail necessarily.. They probably just had someone ordering parts that didn't know to order mil spec (I'm assuming mil spec is fine for space stuff). Seems to me like most ICs are available implemented in mil spec packages - so the part seems the same in basically every way, but it costs a lot more and resists environmental crap better. It's a sad story, really.

    3. Re:headline fail by Anonymous Coward · · Score: 2, Informative

      They probably just had someone ordering parts that didn't know to order mil spec (I'm assuming mil spec is fine for space stuff)

      No, not even close. "Mil spec" is basically industrial grade with a little bit extended temperature range. Radiation hardened stuff is completely different ballpark.

    4. Re:headline fail by Tastecicles · · Score: 2

      mil spec isn't proofed against hard radiation; it does some soft radiation and EM not quite up to airburst-strength pulse. Space spec has to withstand high energy radiation such as Cosmic, X- and Gamma rays way beyond what you'd encounter 5 miles below a thermonuclear burst, otherwise it'll get outside the VA belts and simply die.

      --
      Operation Guillotine is in effect.
    5. Re:headline fail by smitty97 · · Score: 4, Funny

      (I'm assuming mil spec is fine for space stuff)

      You don't happen to work at the Russian Space Agency purchasing department, do you?

      --
      mod me funny
    6. Re:headline fail by jamessnell · · Score: 1

      ha ha ha ha, no. Though I am a junior in R&D for aeronautically deployed survey equipment.. So I'm a little familiar with hardening systems.. Do you happen to know from practice (or some substantial experience) if mil spec is insufficient for that application?

    7. Re:headline fail by sjames · · Score: 1

      Sometimes mil spec isn't even extended at all, but just has more rigorous testing to make sure it's within the standard specs.

    8. Re:headline fail by geekoid · · Score: 2

      A) Some hardware has software embedded into it, yeah shocking.

      B) Parts fail in spacecraft. If the software failed to detect a failed part and roll over to a backup, the software has its role in the incident as well.

      C) If it jumps to the wrong mode after the error, that's also a software error.

      I'm not saying one way or another in the specific incident. The idea that there is a hard line between all software and hardware is false, and technical people should know better.

      --
      The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
    9. Re:headline fail by jamessnell · · Score: 1

      Obviously software and hardware are more conceptual partitions used to help divide and conquer the overall challenge at hand. That said, picking a part that from a digital logic level up was the right part, and it merely failed due to improper shielding (etc), sounds fairly deep in the hardware camp to me. I think your "B" point is pretty solid. But in terms of jumping to the wrong point of code due to physical error, that's getting to be kind of unreasonable to demand of software - as that case is indicative of the software not executing correctly. You can probably do some neat things in software to help mitigate a little of that, but for the most part, that seems unreasonable to me. I suppose that the key to a meaningful conversation here is understanding what actually caused the failure, as we're all kind of spinning off into speculation, which means we're basically trolling ourselves.

    10. Re:headline fail by jd2112 · · Score: 1

      Programmer didn't include if($component.SpaceRating != TRUE) {throw "INITIALIZATION ERROR: NON SPACE RATED COMPONENT!"}

      --
      Any insufficiently advanced magic is indistinguishable from technology.
    11. Re:headline fail by yurtinus · · Score: 1

      In an embedded system - particularly a critical system - you usually have software aware of the state of interfacing hardware. Additionally, you should have some redundant systems so you can handle a hardware fault on one of them. The article says "two chips failed," with no further details. I'd assume the guys calling it a software error are doing so for a reason - likely those chips were part of some databus interface, D/A or A/D converter, or something that the software *talks* to (as opposed to runs on). These are all things that the software can detect and report faults on - and if it's a redundant system, fail over to a different channel. I'm making a big assumption that if it is a programming error, it's a hardware fault that should have been detected and handled instead of causing the software to crash. Possibly both channels of a redundant system (from the article quote "a programming error which led to a simultaneous reboot of two working channels of an onboard computer").

      Admittedly all of us here (especially me) are talking out of our asses and none of the listed links seem to have any details (not that we'd read them anyway...). But you know that, and I know that - so we can hopefully speculate without trolling ourselves! We don't know the system, we don't know what failed, but we can certainly use it as a talking point to discuss when software can be responsible for not handling a hardware fault.

      --
      +1 Disagree
    12. Re:headline fail by EETech1 · · Score: 1

      If I can use 2 magic numbers to increase software reliability, why can't they?

      If this 1/65535 chance of random error causing correct value didn't happen, AND all configs are still correct, AND the program counter has not fallen into one of the many traps that would keep it from getting here on its own, AND The stack pointer is where it should be based on how I think I can only get here, AND the second 1/65535 chance of random error causing correct values ALL didn't happen, AND this is my Nth time trying this exact same thing without success, THEN DO whatever.

      ELSE determine WTF is wrong piece by piece, and report the failure before attempting to work around it.

      Never let it blindly execute a subroutine just because that's where the program counter is! You have to know that when something executes, that it really should execute to prevent bad from becoming really really bad.

      The normal what to do, is the easy (20%) part, the should I do it, and what if I can't, is what makes a system reliable.

      If your bootloader is at the end of your program space, it better not execute just because the program counter got there!
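
      The two-magic-number idea might look roughly like the sketch below. It is a Python model rather than firmware, and everything in it is hypothetical: the token values, the Guard class, and the erase_flash stand-in are invented to illustrate the pattern, and the program counter and stack pointer checks mentioned above are left out because they have no direct Python equivalent.

```python
# Hypothetical 16-bit "magic" tokens; any two fixed, unrelated values would do.
MAGIC_A = 0xA5C3
MAGIC_B = 0x3C5A

class GuardError(Exception):
    """Raised when a critical routine is entered without proper authorization."""

class Guard:
    def __init__(self):
        self.token_a = 0
        self.token_b = 0

    def arm(self):
        """Called only on the legitimate path, just before the critical call."""
        self.token_a = MAGIC_A
        self.token_b = MAGIC_B

    def check_and_disarm(self, config_ok):
        """Verify both tokens and the configuration, then clear the tokens."""
        a, b = self.token_a, self.token_b
        self.token_a = self.token_b = 0      # one-shot: never leave the guard armed
        if a != MAGIC_A:
            raise GuardError("first token wrong")
        if not config_ok:
            raise GuardError("configuration check failed")
        if b != MAGIC_B:
            raise GuardError("second token wrong")

def erase_flash(guard, config_ok=True):
    """Stand-in for a dangerous routine, for example a bootloader entry point."""
    guard.check_and_disarm(config_ok)
    return "erased"

guard = Guard()
guard.arm()
print(erase_flash(guard))                    # legitimate, armed call succeeds
```

      A stray jump into erase_flash without arm() having run fails the first check, and in this sketch the caller would report the fault instead of proceeding.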

      Cheers

    13. Re:headline fail by Whillowhim · · Score: 1

      As someone who is currently writing up a dissertation dealing with this topic, I can assure you that mil spec is not sufficient. Hardening chips for radiation is completely different from hardening them for other hostile environments, especially when you look at the heavy ion strikes you can get in space.

      Radiation effects are generally split into two basic categories, Total Ionizing Dose (TID) effects and Single Event Effects (SEEs). TID results from lots of little ion strikes, which gradually build up charge and/or defects and screw with transistor characteristics. Often the result is that transistors leak a lot more current when off, reducing your margins. Since this takes time to build up, it is highly unlikely that this caused the issues with the probe. Since mil spec chips often have a bit more tolerance for this, mil spec does help, but it does not help enough for long exposures.

      SEEs are the result of a single, high energy particle hitting the chip. The area of effect varies greatly depending on the energy of the particle, but the typical result of a strike is that a logic gate or cluster of nearby logic gates ends up forced to output the wrong value. Essentially, one or more of your "0"s just became "1"s, and vice versa. If these values happened to be important to the current state of the machine or OS running on it, then congratulations, you just got screwed. The two most common ways to harden a chip against this are temporal redundancy and logic redundancy. Temporally redundant circuits assume that any ion will only upset the logic for a short period of time, and wait for the signal to become stable before storing values. This has been the staple of custom hardened chips for a while now, because it is relatively easy to convert all your flip flops into hardened flip flops, and thus harden the entire circuit.

      Logically redundant circuits essentially have 3 copies of the logic that vote to determine the correct value. This was often used in the early days of hardening, since you could just stick 3 chips in there and add some basic voting circuits outside the chips to correct the values. However, as processors got more complex, it became harder and harder to restore their state properly in a reasonable amount of time, so people tended to move to temporal hardening for custom chips, and only used logic hardening for things like FPGAs.
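
      The voting step can be sketched in a few lines. This is a software illustration of 2-of-3 majority logic, not a description of any particular chip; the function name and the sample values are made up.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit matches at least two of the inputs."""
    return (a & b) | (a & c) | (b & c)


good = 0b1011_0010
upset = good ^ 0b0000_1000  # one copy takes a single-bit upset

# The two intact copies outvote the upset one.
print(bin(majority_vote(good, upset, good)))
```

      The same expression is what a hardware voter computes per bit; it masks a fault in any one copy but not the same bit flipped in two copies at once.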

      Currently, however, temporal hardening is breaking down, since it doesn't scale well with smaller processes. A heavy ion deposits a fixed amount of charge, but smaller processes have less current flow per transistor, so it takes longer to remove that charge and restore proper operation. Thus, the length of time temporal designs have to wait for the signal to stabilize keeps increasing. This is one of the main reasons why hardened chips lag behind in terms of transistor size and the processes they can use. My graduate research has created a method to do high speed, logically redundant circuits that are highly scalable, meaning that you can automatically create three circuits that vote on the same chip, using commercial synthesis and APR tools to automate the process. I firmly believe that this is going to be the standard once people realize how much faster they can make chips run on new processes.

    14. Re:headline fail by jamessnell · · Score: 1

      That's extremely interesting information, thank you very much! I've been wondering about this for a while - as I'd honestly (yes, it's crazy) like to build my own rover and deliver it to the moon. I believe that out-of-the-box thinking could possibly achieve this on a remotely modest budget. The whole thing is just a puzzle project that floats around my brain, but I would LOVE to do it. And the alien (to me) EMR characteristics of the environment involved have been a subject of great internal conjecture. Can you tell me, why can't one simply "faraday cage" the shit out of their electronics? I can think of many reasons why just encasing everything in shielding would be difficult - mostly in terms of a solar array and other purposefully external devices. Though for that, I've been toying with ideas regarding levels of optical isolation and the like. I'll add, I hate solar power. It's cool in many ways and it completely sucks in others. RTGs make me hot in geeky and in R rated ways. Regardless, thanks for your excellent comment. More input from you would be awesome.

    15. Re:headline fail by Anonymous Coward · · Score: 0

      A Faraday cage stops an external electromagnetic field. It does not stop a charged particle.

      Think Star Wars-style particle vs. energy shields.

  5. Gamma rays by Anonymous Coward · · Score: 1

    Gamma rays, X-rays and the products of their collisions are attenuated by the upper atmosphere, not the Van Allen belts. This is why you get more exposure at altitude in an airplane.

  6. Translation Fail? by DemonicMember · · Score: 0

    Maybe this is all just a translation error, could have been either or both?

  7. So how much? by cvtan · · Score: 2

    How much did they save by using Radio Shack parts in a Mars probe? $5.00 even?

    --
    Sorry, but gray text on gray background is making my eyes bleed.
    1. Re:So how much? by Spykk · · Score: 4, Funny

      Not even the government could save money by buying something at Radio Shack.

    2. Re:So how much? by Dasuraga · · Score: 1

      Space-qualified microchips can cost something around 5000 euros. Equivalent chips that are "only" rated for automobile usage(for example), cost 10 cents.

    3. Re:So how much? by stewbee · · Score: 3, Informative

      If only. The reason ICs cost so little is that the cost is spread out over millions of parts. As my analog circuits Prof would say, "Your very first IC off the line is going to cost a million dollars. Everything else after that is free." So to buy one or two ICs that are radiation hardened is probably going to cost that much since it will most likely be custom. Now that's not to say they can't reuse some of the masks for an existing IC to make it cheaper, but it won't be that much cheaper. My guess is that they would want to redesign the part anyway if it is going to be in a radiation intense environment. The radiation could cause some weird quantum effects in the IC that might mean they want the transistors to be larger for reliability purposes. But that last part is just a guess since I am not an IC designer and thought my electronic materials class was nothing short of voodoo.

      Long story short, they probably saved more than $5 for using a COTS part, but they probably lost the probe by the part not being radiation hardened.

    4. Re:So how much? by John+Bresnahan · · Score: 1

      How much did they save by using Radio Shack parts in a Mars probe? $5.00 even?

      Based on my last visit to Radio Shack, I don't think their parts are any cheaper than the special-purpose, radiation-hardened parts they should have used.

      But when you can't wait until tomorrow for a part for your space probe, Radio Shack is convenient.

    5. Re:So how much? by systemeng · · Score: 2

      When I worked in the test equipment industry, we had a term for the lowest grade of parts that still worked when binning components: The radio shack bin. I once built part of an emergency prototype for a test equipment cooling system with radio shack parts. The prototype was sent to Taiwan where it failed prematurely due to the marginal components. Never Again!

    6. Re:So how much? by K.+S.+Kyosuke · · Score: 2

      How much did they save by using Radio Shack parts in a Mars probe? $5.00 even?

      This is not the first time something like this happened to the Russians. In the 1970's, the Soviet Mars 4 probe failed in flight. The reason? Due to cost savings, the transistors used had had their gold parts replaced with aluminium ones, which were prone to chemical degradation (a.k.a. corrosion). The Soviets then realized that they had manufactured three more probes of the same series using the same (unfit) transistors. Now what did they do? Of course they launched them! Guess what happened? Mars 5 failed two weeks after reaching the target orbit. Mars 6 first stopped sending its telemetry, but it operated autonomously just fine and launched a transmitting lander...which stopped working before touching the surface. Mars 7 failed again in flight and launched a lander onto an interplanetary trajectory instead of the surface of Mars.

      See, when you're Russian and know that a probe as designed might fail, you just build more of them until one succeeds. :D

      --
      Ezekiel 23:20
    7. Re:So how much? by jd · · Score: 5, Interesting

      Space Micro doesn't list the prices of their components or systems, nor can I find any from anyone else. Honeywell don't list their prices either. Atmel seem to have dropped out of the field. Linear don't list the prices for their space-hardened stuff. Don't see any for BAE either, or Intersil. Empire Magnetics require a lot of personal data before they give you access to even the price classification information. Not the prices, just how they're classified.

      You've got to allow for a year's worth of traveling outside of an atmosphere and then operating on Mars for the duration of the mission. This analysis of radiation for manned missions suggests you're looking at 3.5 mSv per day, then 20 rems per year in most of the places of interest.

      Converting everything to rads, it's 0.1 rads per mSv and 1 rad per rem, so that's 12.75 rads to get to Mars if you assume a year-long trip, plus 20 rads for the mission, so anything with a rating of less than 32.75 rads is pretty much guaranteed to fail. However, over the course of two years, the odds of there being a solar flare are not insignificant. To be safe, you want resistance to a further 400 rad. 432.75 rad is within the tolerance of most of the space-hardened components (some components can be taken up to 1000 rad, others up to 10,000). However, the cheapest space components would NOT survive. You're talking high-end on the space scale.
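
      A quick check of that arithmetic, taking the inputs exactly as stated (3.5 mSv per day, 0.1 rad per mSv, 20 rad on the surface, 400 rad of flare margin). The 365-day transit and the variable names are assumptions. With these inputs the transit dose multiplies out to roughly 127.75 rad rather than 12.75, so either the daily rate or the transit figure is off by a factor of ten; the total stays under 1000 rad either way.

```python
MSV_PER_DAY = 3.5         # transit dose rate quoted above
RAD_PER_MSV = 0.1         # conversion used above (assumes a quality factor of 1)
TRANSIT_DAYS = 365        # assumed length of a "year-long trip"
SURFACE_RAD = 20.0        # 20 rem per year at 1 rad per rem
FLARE_MARGIN_RAD = 400.0  # extra margin for a solar flare


def transit_dose_rad(msv_per_day: float, days: int,
                     rad_per_msv: float = RAD_PER_MSV) -> float:
    """Dose accumulated in transit, converted from mSv to rad."""
    return msv_per_day * days * rad_per_msv


transit = transit_dose_rad(MSV_PER_DAY, TRANSIT_DAYS)
total = transit + SURFACE_RAD + FLARE_MARGIN_RAD
print(f"transit: {transit:.2f} rad, total with margin: {total:.2f} rad")
```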

      I'm going to figure that the top-line components will cost 100x that of their conventional counterparts, due to the higher-level of precision and QA that are required. It might well be a good deal more. In Russia, you've also got to pay for smuggling decent-grade hardware out of the US, as all of this stuff will be under massive amounts of regulation.

      My guess is that the cuts would have saved enough that those doing the cost-cutting could buy second homes in Switzerland.

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
    8. Re:So how much? by autophile · · Score: 3, Interesting

      For want of a rad-hard chip, the board died.

      For want of a board, the software couldn't cope.

      For want of good software, the engine start failed.

      For want of engine start, the probe died.

      For want of a probe, the human race didn't detect the slimy aliens from Phobos and all perished in a hot and somewhat greasy fireball.

      --
      Towards the Singularity.
    9. Re:So how much? by Anonymous Coward · · Score: 0

      Probably Best Buy

      Most of the budget went for Monster cables.

    10. Re:So how much? by yurtinus · · Score: 1

      I dunno, seems to me it'd be quicker just to order your parts from Digikey instead of going to Radio Shack, buying a cell phone and contract, then dismantling the phone to desolder the part you need (and hope you didn't bust the part in the process)... Sure, Radio Shack is convenient for a lot of things, as long as all of those things are cell phones and expensive Ethernet cables.

      --
      +1 Disagree
    11. Re:So how much? by Anonymous Coward · · Score: 2, Informative

      I have worked (not long) as an electrical engineer in a team developing electronics for scientific instruments mounted aboard space probes, rovers, etc. This means interplanetary travel and operation, so this is the kind of place where you definitely want to use rad-hard components, unlike low orbit where you are still well within the magnetosphere. Phobos-Grunt orbit-boosting stage had no good reason to use hardened components.

      Concerning prices: I have done some design/prototyping but I wasn't involved with the procurement process of flight-qualified rad-hard components, so what I know is from discussion with colleagues. First, lead times can reach one year, even for quite basic components. Then, the cheapest rad-hard discrete MOSFET from International Rectifier (which is basically the only rad-hard MOSFET manufacturer - there is no room for competition in such a small market as rad-hard components) is in the vicinity of 400 €. And this is no high-power transistor, but the closest equivalent (although with higher specs most often not needed) to the 2N2222, the most basic low-power, logic-level MOSFET ever that you can buy for a few cents. The price ratio is more around 1000 here...

    12. Re:So how much? by Anonymous Coward · · Score: 0

      For want of a probe, the human race didn't detect the slimy aliens from Phobos

      Surely you mean leather goddesses?

    13. Re:So how much? by Anonymous Coward · · Score: 0

      Woops, I meant 2N7000/2N7002. The 2N2222 is the "basic" bipolar junction transistor, I guess I'm much more tired than what I thought...

    14. Re:So how much? by garyebickford · · Score: 1

      a hot and somewhat greasy fireball.

      I knew her!

      --
      It's easier to be a result of the past, but more fun to be a cause of the future! http://www.spacefinancegroup.com/
    15. Re:So how much? by garyebickford · · Score: 1

      good info, thanks!

      --
      It's easier to be a result of the past, but more fun to be a cause of the future! http://www.spacefinancegroup.com/
    16. Re:So how much? by garyebickford · · Score: 1

      As I am working on a space-related proposal, even though it's not directly HW related, this info and the links will be very useful to me in the near future. Thanks! :)

      --
      It's easier to be a result of the past, but more fun to be a cause of the future! http://www.spacefinancegroup.com/
    17. Re:So how much? by jd · · Score: 3, Interesting

      The links for International Rectifier, for those *#$% off with Congress and wanting to build their own damn Rover:

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  8. Huh? Its hardware failure by Anonymous Coward · · Score: 0

    doesn't sound like programming, if the part did not fail, then the mission should have continued as planned.

  9. Always Blame Software by invid · · Score: 4, Insightful

    Is it just me, or is it the responsibility of all software engineers to find the hardware problem in order to prove to people that the cause isn't software?

    --
    The Moore-Murphy Law: The number of things that will go wrong will double every 2 years.
    1. Re:Always Blame Software by Hognoxious · · Score: 1

      is it the responsibility of all software engineers to find the hardware problem in order to prove to people that the cause isn't software?

      Find someone else, I'm busy.

      In any case, it's usually orders of magnitude easier to blame the spec. It's written by management/users, after all...

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    2. Re:Always Blame Software by Anonymous Coward · · Score: 0

      Then, the software engineer is expected to code around the HW problem, since fixing the hardware is too expensive.

      This is followed by everyone blaming the SW. A new SW release fixed the problem so it must have been a SW problem right?

    3. Re:Always Blame Software by rwv · · Score: 2

      In my experience... hardware problems are acceptable if there's a software work-around. Special acknowledgement isn't given to software for fixing hardware bugs... it's just expected since hardware is arguably more expensive to change.

    4. Re:Always Blame Software by Anonymous Coward · · Score: 0

      You know what? In my experience, hardware engineers have to jump through hoops to prove if there's a software problem, otherwise people are too quick to blame hardware in a sufficiently complex design/system. Maybe where you work it's the opposite -- in my organization there's 3:1 or 4:1 software to hardware.

      Personally, I think no one should be blaming each other, the idea is problems (hardware and/or software) are solved through cooperation of both sides. Unfortunately, people in management positions set absurd schedules and lean on engineers (HW or SW) to 'make it so' and as a result individuals become sensitive to problem in their area and/or become defensive and thus we have one side blaming the other.

  10. How is "chip failure" a "programming error"? by vleo · · Score: 1

    I'm not first to ask... but still wonder how that's possible on Slashdot that is *supposed* to be technologically literate.

    --
    Vassili Leonov ...it is the actions that affect us, not the motive...RMS
    1. Re:How is "chip failure" a "programming error"? by Hognoxious · · Score: 5, Funny

      A 4 digit ID and never heard of microcode.

      Seriously Gramps, the distinction between hardware and software isn't as clear cut as it was when shit was all powered by steam.

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    2. Re:How is "chip failure" a "programming error"? by Anonymous Coward · · Score: 0

      Context. You fail it.

    3. Re:How is "chip failure" a "programming error"? by Capt.DrumkenBum · · Score: 2

      Stop dissing Steam, it is the power source of the future. :)
      Also, get off my lawn.

      --
      If I were God, wouldn't I protect my churches from acts of me?
    4. Re:How is "chip failure" a "programming error"? by Anonymous Coward · · Score: 0

      In pseudocode: let's say you had two redundant hardware devices with interfaces DevA and DevB. You might code this:

      Try
          DevA.Write(InputWord)
      Catch
          Try
              DevB.Write(InputWord)
          Catch
              WarningFlag.Raise
              Continue

      Now, suppose you coded this instead:

      DevA.Write(InputWord)
      DevB.Write(InputWord)

      That can be a case where a chip failure (in DevA) becomes a programming error.
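
      The first version of the pseudocode above, as a runnable sketch. The `Device` class and `DeviceError` are stand-ins invented for the example; a real driver would look nothing like this.

```python
class DeviceError(Exception):
    """Raised by the hypothetical device driver when a write fails."""


class Device:
    """Stand-in for DevA / DevB; `healthy=False` simulates a dead chip."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy
        self.written = []

    def write(self, word: int) -> None:
        if not self.healthy:
            raise DeviceError(f"{self.name} failed")
        self.written.append(word)


def redundant_write(primary: Device, backup: Device, word: int) -> bool:
    """Try the primary, fall back to the backup; False means both failed."""
    for dev in (primary, backup):
        try:
            dev.write(word)
            return True
        except DeviceError:
            continue
    return False  # caller raises the warning flag and carries on


dev_a = Device("DevA", healthy=False)
dev_b = Device("DevB")
print(redundant_write(dev_a, dev_b, 0x1234), dev_b.written)
```

      In the unguarded version, the exception from the dead DevA would propagate before DevB was ever tried, which is how the hardware fault turns into a software one.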

    5. Re:How is "chip failure" a "programming error"? by geekoid · · Score: 1

      Problem came on board during the first SW:EP1 discussion, not any of the technical ones. Not that there was any real technical ones at the time.

      --
      The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
    6. Re:How is "chip failure" a "programming error"? by invid · · Score: 1

      The turbines in a nuclear power plant are run by steam, sonny.

      --
      The Moore-Murphy Law: The number of things that will go wrong will double every 2 years.
    7. Re:How is "chip failure" a "programming error"? by ceoyoyo · · Score: 1

      You can't blame the chip dying because it wasn't radiation hardened on the microcode. "Gramps" probably knows that. Do you?

    8. Re:How is "chip failure" a "programming error"? by garyebickford · · Score: 1

      But there is an infinite recursion of possible failures. For instance, what if one of the memory cells that contain the machine instructions for 'Try' is hit, and the jump address is now off by one, or 64, or 65536? That's a one bit error.
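
      To put numbers on that: flipping bit k of a stored address moves it by exactly 2^k. A tiny sketch, with an arbitrary made-up address:

```python
def flip_bit(value: int, bit: int) -> int:
    """Return `value` with a single bit inverted, as a particle strike might do."""
    return value ^ (1 << bit)


addr = 0x0002_0000  # arbitrary example jump target
for bit in (0, 6, 16):
    corrupted = flip_bit(addr, bit)
    print(f"bit {bit}: {addr:#x} -> {corrupted:#x} (off by {abs(corrupted - addr)})")
```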

      --
      It's easier to be a result of the past, but more fun to be a cause of the future! http://www.spacefinancegroup.com/
    9. Re:How is "chip failure" a "programming error"? by gmhowell · · Score: 1

      Stop dissing Steam, it is the power source of the future. :)

      Also, get off my lawn.

      Besides, how would I be able to download new copies of Portal and HL2 without it?

      --
      Jesus was all right but his disciples were thick and ordinary. -John Lennon
    10. Re:How is "chip failure" a "programming error"? by Anonymous Coward · · Score: 0

      Steam's just a transmission fluid isn't it?

  11. Obligatory Armageddon quote by Kinthelt · · Score: 1

    Components. American components, Russian Components, ALL MADE IN TAIWAN!

    http://www.imdb.com/title/tt0120591/quotes?qt=qt0459113

    --

    "Evil will always triumph over good, because good is dumb." - Dark Helmet (Spaceballs)

    1. Re:Obligatory Armageddon quote by Tastecicles · · Score: 1

      Ob. Clancy (mis?)quote:

      "See? We have the best technology in our missiles, Tovarisch."
      "What does it say?"
      "Texas Instruments."

      --
      Operation Guillotine is in effect.
    2. Re:Obligatory Armageddon quote by K.+S.+Kyosuke · · Score: 1

      Hey, my movie beats your movie every day!

      No wonder this circuit failed. It says "Made in Japan".

      --
      Ezekiel 23:20
  12. Description Fail by Anonymous Coward · · Score: 0

    The OP is stating from 2 different sources. One saying it was a programming error while the other was a seemingly earlier report about the defective or off-spec components

    1. Re:Description Fail by expatriot · · Score: 4, Interesting

      The Planetary Society entry says that two modules failed and then the main computer crashed. Probably irrelevant if the computer crashed or not if there were significant failures in the electronics. Perhaps if the computer had kept going there would have been some communication of what had gone wrong.

      One of the commenters wrote "It is rather unlikely radiation caused the failure. Russians said the failure was due to an SRAM WS512K32V20G24M from White Electronics. This part is a module containing 4 CY7C1049 chips from Cypress and is actually screened. While the Cypress part is very susceptible to Latchup," No idea if this is true or not.

    2. Re:Description Fail by Johann+Lau · · Score: 1

      this might be interesting for you and others (it's pretty much gibberish to me :D)

      http://russianspaceweb.com/phobos_grunt_aftermath.html

    3. Re:Description Fail by EXrider · · Score: 1

      I like the table describing possible failure causes at the bottom, most of them are officials accusing the US of directly or indirectly causing the satellite's failure. Conspiracy theories alive and well.

      --
      grep -iw skynet /etc/services
    4. Re:Description Fail by Donwulff · · Score: 1

      What we glean from this rather old article, together with other common knowledge... apparently the flight-control computer had two identical processors, presumably for redundancy, that according to Roscosmos both rebooted at the same time, possibly due to "heavy particles" in space. This is not unthinkable, especially as the rebooting of such robust processors could take significant time, during which another one could encounter failure.

      There is also reference to a watchdog procedure, which muddles waters somewhat - I'm wondering if the watchdog procedures could have triggered on some other condition than total unresponsiveness of the unit in question, and if it could have led to rebooting them both at the same time, for example due to checking them at the same time on an interval. Regardless, after both redundant processors booted at the same time, the probe interrupted flight program, and - quite correctly - entered into "safe mode" awaiting further instructions and diagnostics.

      Then comes up the further engineering SNAFU, and where a software-specification error most likely comes into play: In safe-mode, the probe switched to its X-band radio, which was never intended to be operated on orbit, but only in deep space on way to Mars. The problem with this was two-fold. First of all, the bulky Russian deep space antennas could not track the probe at orbital speeds long enough to receive let alone transmit data. And secondly, as the probe was orbiting Earth it was spending long times with its solar panels in Earth's shadow, while the high power interplanetary radio was draining its batteries. And so the probe was doomed.

    5. Re:Description Fail by garyebickford · · Score: 2

      It's worth noting that the Space Shuttle's navigation system had three identical computers who all 'voted' on the result, and if one disagreed it took itself out of the system. And there was a fourth computer made by a different company, using a different architecture and different programming language, that monitored the three. In retrospect, I think that's a pretty good idea. Having two different architectures makes having the same programming error occur in two different systems very unlikely.

      Of course, as you add nodes to such a system, it gets more 'interesting' to figure out how to handle the set of possible differences. What constitutes a failure? What constitutes agreement?
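
      One way to make those two questions concrete, as a sketch only: treat agreement as an exact match and require a strict majority. Both choices are assumptions made for the example, not a description of how the Shuttle's computers actually voted, and the node names are made up.

```python
from collections import Counter


def vote(outputs: dict):
    """Return (agreed_value, dissenters).

    `outputs` maps node name -> reported value. Without a strict majority
    the agreed value is None and every node is listed as suspect.
    """
    counts = Counter(outputs.values())
    value, n = counts.most_common(1)[0]
    if n * 2 <= len(outputs):
        return None, sorted(outputs)
    return value, sorted(name for name, v in outputs.items() if v != value)


print(vote({"A": 10, "B": 10, "C": 11}))
print(vote({"A": 1, "B": 2, "C": 3}))
```

      With floating-point outputs, exact match would have to become "within a tolerance", and at that point agreement is no longer transitive, which is where it gets interesting.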

      --
      It's easier to be a result of the past, but more fun to be a cause of the future! http://www.spacefinancegroup.com/
    6. Re:Description Fail by Tastecicles · · Score: 1

      I think in a **perfect world**, the chances of catastrophic failure in the collective hardware (or relevant to this discussion, the decisionmaking process) of such a system are zero. What I've experienced in terms of hardware is that the chance of an individual component failing does not change the more you add to the system. What does change is that the chance of any single component failing, resulting in the total failure of the system, is multiplied by the number of similar components in the system.

      As an example:

      A hard disk has an MTBF of say, 100,000 hours.

      You build a RAID array of ten drives. The MTBF of each component drive is still 100,000 hours, but the MTBF of the array (the system) is 100,000/10 or 10,000 hours.

      You build an identical array of ten drives and mirror the two to try and mitigate against data loss in case of failure of an array. Here's where the numbers get interesting.
      The two arrays have an individual MTBF of 10,000 hours. Taken as a single system their combined MTBF is 5,000 hours. Since the system is composed of two mirrored arrays, all you have done is halve the MTBF (so it's 10,000 hours again), halve the data capacity and double your power consumption.

      So every 416 days, you should expect one drive in the 10-disk array or two disks in the 20-disk array to fail.

      -
      Can I do a lightbulb analogy?

      Say a lightbulb has an MTBF of 100 hours. You have an array of 1,000 similar bulbs on a display board. You're replacing a bulb every six minutes.
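
      The division used in both examples, as a sketch. It assumes identical, independent components with constant failure rates, so the rates add and the time to the first failure in the group is the component MTBF divided by the count; the function name is made up.

```python
def series_mtbf_hours(component_mtbf_hours: float, count: int) -> float:
    """Mean time to the first failure among `count` identical, independent
    components with constant failure rates (rates add, so MTBF divides)."""
    return component_mtbf_hours / count


array_hours = series_mtbf_hours(100_000, 10)    # ten-drive array
bulb_minutes = series_mtbf_hours(100, 1000) * 60  # thousand-bulb board

print(f"array: {array_hours:.0f} h, about {array_hours / 24:.0f} days")
print(f"bulbs: one failure every {bulb_minutes:.0f} minutes")
```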

      --
      Operation Guillotine is in effect.
    7. Re:Description Fail by garyebickford · · Score: 1

      Yep.
      I don't recall the exact numbers, but in the early ENIAC vacuum-tube-relay computer I think the mean time to failure was something like 20 minutes. I'm not sure how they could tell though - looking for tubes that weren't lit? maybe they had a sensing circuit that noticed when the current through the tube dropped.

      And I read somewhere that at Google or one of the other zillion-computer facilities, there were folks who worked full time just walking round and replacing dead computing nodes.

      --
      It's easier to be a result of the past, but more fun to be a cause of the future! http://www.spacefinancegroup.com/
    8. Re:Description Fail by Donwulff · · Score: 1

      Actually, while it's possible to design a system to be "virtually" fault-tolerant, in engineering that always comes down to a cost-benefit analysis. Also this naturally does not entirely eliminate so-called "human error" and other freak incidents, but with enough resources tossed into it, you can get very close. It's obvious the safety requirement, and thus the allowed cost, for a manned mission is set much higher; for an unmanned probe it is accepted that some of them will inevitably be lost, and the design target is set to, for example, 1 unrecoverable failure out of 100 missions (pulled that example out of my ass, and in practice Russia of course has 100% failure rate on Mars-probes, which I'm sure is nowhere near design target).
      Also we do not know the number of redundant processors of the kind that were in Phobos-Grunt. If there are three and a monitoring unit, going into "safe-mode" in case two of the processors failed at the same time would be an entirely reasonable response - there would be no redundant processor left to compare the results to. But only Roscosmos knows the design for sure; I'm even guessing the redundancy just from the reported facts that there were (at least) two identical processors and that both rebooting at the same time was somehow a problem.
      It is of course kind of confirmation bias, as that's generally the main way a redundant system can fail, but the way these stories generally seem to go, there is some unthought-of issue causing all redundant units to fail at the same time, and the control logic responds in some unexpected way that makes matters worse, because nobody ever thought the redundant systems could fail at the same time, let alone bothered to test it. I work in the automotive industry, and we have an unwritten in-house rule that whenever an engineer says "But just what are the odds that..." we HAVE to make the design hardened against just that possibility.
      RAID is actually a good example of the redundancy failure. You may be led to assume that with 100,000 hours MTBF per drive the odds of losing two drives at the same time are practically non-existent. In practice, as the hard-drives are from the same manufacturing batch and subject to identical operating-conditions and usage patterns (including external dangers like somebody dropping it etc.), it will actually be unlikely for the hard-drives to fail at significantly different times. If it were up to me, I'd randomly swap around drives between RAID arrays, preferably acquired at different times, for just that reason.

    9. Re:Description Fail by Tastecicles · · Score: 1

      That's what I do. I never use two drives from the same batch in an array*, because a physical fault on one is more likely to be present on another - and that's a guaranteed fail. I think this is why they not only use several different systems to check each other on the Shuttle, they use different //architectures// so a physical flaw that affects one/of a batch/of a series/of a type is less likely to affect the others. I guess it would be like using a Z80, a Motorola 68k and a 80386 to check each other - well blow me, it's old tech, but what fries a 68k a Z80 would most likely survive.

      *it's actually rare that I use drives with the same capacity in my arrays! In my current scratch array, built for very high throughput, I have 80GB Hitachi, 80GB Seagate, 120GB Seagate, 160GB Seagate, 200GB Maxtor, in a RAID0 for 400GB. OK there's lots of space wasted, but hey - I built it for throughput not capacity.

      --
      Operation Guillotine is in effect.
    10. Re:Description Fail by Anonymous Coward · · Score: 0

      On a multi-million/billion dollar project why not use a redundant 2 system approach? If one side fails the other keeps going.
      I can't imagine that weight is the problem as most systems are relatively light and having the backup system is more than worth the cost in weight and money (weight is money when hefting something into space).

    11. Re:Description Fail by Tastecicles · · Score: 1

      no, your first guess was right - the valves were mounted on pegboards (literally) with walk-through access that a tech could visually inspect the valves and replace any that weren't lit. It was a full time job.

      --
      Operation Guillotine is in effect.
  13. It wasn't the programming... by afabbro · · Score: 0, Offtopic

    ...it was the name. Phobos Grunt sounds like a porn star.

    Male, female, or transgendered, I'm not sure.

    --
    Advice: on VPS providers
    1. Re:It wasn't the programming... by vlm · · Score: 1

      Mythologically, which is where the moon got its name, Phobos is a dude. He's got a twin brother Deimos. Given that datapoint, guess the name of another Martian satellite...

      --
      "Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
    2. Re:It wasn't the programming... by crawling_chaos · · Score: 1

      Steve?

      --
      You can only drink 30 or 40 glasses of beer a day, no matter how rich you are.
      -- Colonel Adolphus Busch
    3. Re:It wasn't the programming... by geekoid · · Score: 1

      What we do know for sure: Bottom.

      --
      The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
    4. Re:It wasn't the programming... by Anonymous Coward · · Score: 0

      In Soviet Russia, you'll become transgendered on short notice for this kind of trash talk.

      (if you feel like commenting on a Russian name, why not, you know, learn Russian first? "grunt" - and that reads "gr-oo-nt" by the way, not "gr-ah-nt" means "soil", as the probe was meant to take soil samples and deliver them back)

    5. Re:It wasn't the programming... by Tastecicles · · Score: 1

      you made me spray coffee!

      --
      Operation Guillotine is in effect.
  14. Contradictions by Aladrin · · Score: 5, Informative

    The summary is so contradictory because it quotes from 2 articles, and each of them is completely different. One says that the parts were space-tested and fine, and the other says they were never space-certified and were definitely bad. The first one says instead that a software bug caused parts of the system to reboot. The second doesn't know what happened and just blames faulty hardware.

    --
    "If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM
    1. Re:Contradictions by mbone · · Score: 1

      The summary is so contradictory because it quotes from 2 articles, and each of them is completely different.

      "A foolish consistency is the hobgoblin of little minds." (Emerson)

    2. Re:Contradictions by David+Gould · · Score: 1

      The summary is so contradictory because it quotes from 2 articles, and each of them is completely different.

      "A foolish consistency is the hobgoblin of little minds." (Emerson)

      "Look Ye not unto Slashdot for Answers, for Ye shall be told both Yea and Nay." (seen in a sig some years ago)

      --
      David Gould
      main(i){putchar(340056100>>(i-1)*5&31|!!(i<6)<< 6)&&main(++i);}
  15. Sounds like a editor failure to me by kbob88 · · Score: 5, Funny

    In other news, U.S. radars were not responsible for the highly confusing and contradictory summary posted this morning to a Slashdot story about Russia's Phobos-Grunt probe. A thorough investigation has determined that the story's chips should have been able to withstand the radiation received when the story was transmitted through the intertubes and routed over northern Alaska. Instead, investigators blamed a typing failure on the story editors. "A series of tests showed that the editing was lousy and sloppy, and disciplinary action will be taken on those responsible," a spokesman said.

    1. Re:Sounds like a editor failure to me by Anonymous Coward · · Score: 0

      Those. Responsible have been sacked. Lamas!!!!

    2. Re:Sounds like a editor failure to me by gmhowell · · Score: 1

      Those. Responsible have been sacked. Lamas!!!!

      Unfortunately, the way slashdot runs, tomorrow they'll have to sack those responsible for sacking those who were responsible.

      --
      Jesus was all right but his disciples were thick and ordinary. -John Lennon
  16. In Soviet Russia by Anonymous Coward · · Score: 1

    The chips program you.

  17. Blame it on software? by Anonymous Coward · · Score: 0

    Wow, us software folks get blamed for everything....

    So they picked the wrong components, had a hardware failure, and it's software's fault for not anticipating the failure? I know they always say "we'll fix it in software" but this is ridiculous.

    I can say at NASA, when we needed 2 fault tolerance, we had 3 CPUs....

  18. Obligatory... by Cruciform · · Score: 1

    In Soviet Russia probe causes programming bug!

    They have very strict security measures. It can be traumatic.

  19. What is it with Mars and probes? by g0bshiTe · · Score: 1

    What's with Mars and probes? Seriously, how many have been lost either going or coming from?

    --
    I am Bennett Haselton! I am Bennett Haselton!
    1. Re:What is it with Mars and probes? by Squidlips · · Score: 1

      There seems to be a Mars curse for Russian probes. They have sent 4-5 probes, and they have all failed (two are at the bottom of the Pacific right now). However the Ruskies have done very well with other probes; it is just Mars. It is like the Patriots versus the Giants... NASA (actually JPL) has done better. Off the top of my head I would say that only two out of the last 5-6 have failed. [The failures spelled the doom of NASA's new mantra "Better, Cheaper, Faster", although one lander did make it using better-cheaper-faster (using off the shelf electronics).] The last 3 JPL Mars probes have been spectacular successes (MRO and the two MER rovers).

    2. Re:What is it with Mars and probes? by geekoid · · Score: 1

      It's HARD.
      I mean, we have pretty much mapped every spot on the planet, yet airplanes still crash.

      --
      The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
    3. Re:What is it with Mars and probes? by gmhowell · · Score: 1

      It's HARD.
      I mean, we have pretty much mapped every spot on the planet, yet airplanes still crash.

      NP hard, or....

      --
      Jesus was all right but his disciples were thick and ordinary. -John Lennon
  20. Considering the cost of a launch by Anonymous Coward · · Score: 0

    I'm really surprised to hear that it's a programming error, but considering what was done with SCADA I wonder if there isn't something else afoot here.

  21. Staffing Error Doomed American Tech News Site by billcopc · · Score: 4, Insightful

    Okay, we still have a respectable though dwindling community of commenters, so can we please get rid of these editors who can't even be bothered to read four lines of summary text before posting?

    The headline and summary do not make sense. Come on, we're supposed to be nerds, aka intelligent, focused, attentive knowledge aggregators.

    the fuck is wrong with this goddamned site?! These failures are starting to make Digg look good!

    --
    -Billco, Fnarg.com
    1. Re:Staffing Error Doomed American Tech News Site by Anonymous Coward · · Score: 0

      I think the more important point is that it's hard to put together the usual aggregate news summary about the lost mission investigation, because the official Russian response has been chaotic. The hardware people are probably pointing fingers at the software engineers, and vice versa; and procurement is blaming the testers, etc. Meanwhile, some of their bosses may still suspect American sabotage.

      Good luck to those guys trying to move the program forward.

    2. Re:Staffing Error Doomed American Tech News Site by geekoid · · Score: 1

      No. You are welcome to go to the times and pay for a subscription that uses actual editors.

      --
      The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
    3. Re:Staffing Error Doomed American Tech News Site by gmhowell · · Score: 1

      Then the summary should indicate this. Something along the lines of: "there are conflicting reports coming out regarding the cause of the Russian Mars probe failure..."

      --
      Jesus was all right but his disciples were thick and ordinary. -John Lennon
  22. Fun to read the comments by vlm · · Score: 5, Insightful

    Fun to read the comments here. I've done embedded stuff and you need to be defensive. You can see at a glance who here has never done defensive programming before, or embedded or safety critical programming, all blaming the hardware. There's 3 states so you got 2 bits of input and a disallowed state comes in. Deal with it, don't just curl up and die and blame the hardware designer. There's a 12 bit A/D conversion result stored in two bytes, and there's a 14 bit number found there, deal with it, don't just curl up and die and blame the ... . There's a cycle start button and an emergency stop button and both are simultaneously on. Deal with it. You reboot a mission critical (or safety critical!) CPU and a minor auxiliary input A/D doesn't initialize, do you burn the plant down in a woe is me pity party because one out of 237 sensors isn't coming on line, or do you deal with it?
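
    The three cases above (a disallowed 2-bit state, a 12-bit reading with stray high bits, and two contradictory buttons) can be sketched roughly as follows. Every name here is hypothetical, and the chosen fallbacks (stop as the safe mode, reuse of the last good reading, emergency stop winning) are assumptions for illustration, not anything taken from a real flight or plant system:

```python
from enum import Enum


class Mode(Enum):
    IDLE = 0b00
    RUN = 0b01
    STOP = 0b10
    # 0b11 is deliberately not a member: it is the disallowed state.


SAFE_MODE = Mode.STOP   # assumed safe fallback
ADC_MAX = 0x0FFF        # largest value a 12-bit converter can produce


def decode_mode(raw_bits):
    """Map a 2-bit field to a Mode, falling back to the safe mode on garbage."""
    try:
        return Mode(raw_bits & 0b11)
    except ValueError:
        return SAFE_MODE


def read_adc(high_byte, low_byte, last_good):
    """Return (value, trusted). Out-of-range readings are rejected, not used."""
    raw = ((high_byte & 0xFF) << 8) | (low_byte & 0xFF)
    if raw > ADC_MAX:
        return last_good, False
    return raw, True


def resolve_buttons(cycle_start, emergency_stop):
    """Emergency stop always wins, including when both inputs are asserted."""
    if emergency_stop:
        return "stop"
    return "start" if cycle_start else "hold"
```

    The point of the sketch is only that each impossible input has a defined outcome; which outcome is right depends on the system.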

    Finally, radiation is a statistical phenomenon. There is no such thing as radiation free. If they used non-rad hardened parts, it's gonna crash maybe 10000 times more often. That's OK, you program around that, assuming you know what you're doing. Radiation hardened does not equal radiation-proof. If there was a single bit error, or a latchup on a rad-hardened unit, with a poorly programmed control system it would have failed just as well, it's just that a rad hardened chip would have made it a couple orders of magnitude less likely. A shitty design that has a 1 in 20000 failure rate due to better hardware instead of 1 in 2 is still a shitty programming design, even if the odds are "good enough" that it makes it most of the time with the better hardware.

    --
    "Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
    1. Re:Fun to read the comments by Anonymous Coward · · Score: 0

      Most comments on Slashdot are from software-oriented people.

    2. Re:Fun to read the comments by Anonymous Coward · · Score: 0

      RAM failure in this case. Your code is corrupted. Now how do you deal with it?

    3. Re:Fun to read the comments by invid · · Score: 1

      When you're given a month to finish a project you calculated at three months, it's kinda hard to compensate for every possibility. But then, I don't program for aircraft, I hope their schedules allow for programming proper error recovery. As it so happens, I'm designing an error handling system right now. This article reminds me I've got to put in the "fried by radiation" routine.

      --
      The Moore-Murphy Law: The number of things that will go wrong will double every 2 years.
    4. Re:Fun to read the comments by systemeng · · Score: 2

      You checksum memory with all processor cycles that are not dedicated to a specific task. If you detect a failure, you reload the system from read-only memory. . .
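
      A rough sketch of that scrub-and-reload loop, assuming a CRC32 as the checksum and using a Python object to stand in for ROM and RAM (the class and attribute names are invented for this example):

```python
import zlib


class ScrubbedImage:
    """Keeps a working copy of an image and restores it from a pristine copy."""

    def __init__(self, rom_image):
        self._rom = bytes(rom_image)              # stands in for read-only memory
        self._golden_crc = zlib.crc32(self._rom)  # checksum computed once at build time
        self.ram = bytearray(self._rom)           # working copy that can be corrupted
        self.reloads = 0

    def scrub(self):
        """Run from idle time. Returns True if the RAM copy was intact."""
        if zlib.crc32(self.ram) == self._golden_crc:
            return True
        self.ram[:] = self._rom                   # reload from the pristine copy
        self.reloads += 1
        return False


image = ScrubbedImage(b"flight software image")
image.ram[3] ^= 0x10     # simulate a single-bit upset
first = image.scrub()    # detects the mismatch and reloads
second = image.scrub()   # clean again
```

      This only detects corruption between scrubs; whatever ran on the corrupted copy in the meantime is not undone.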

    5. Re:Fun to read the comments by vlm · · Score: 1

      And/or you have multiple copies of the software or other data in memory, and spend your wasted cycles looking for mismatches and keeping the best 2 outta 3.
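
      The best-2-outta-3 idea can be sketched as a bitwise majority vote over three copies, with the losing copy repaired in place. This is a simplified illustration with made-up function names, assuming the three copies are the same length and that at most one copy is wrong at any given bit:

```python
def vote_byte(a, b, c):
    """Bitwise majority: each output bit is set if at least two inputs have it set."""
    return (a & b) | (a & c) | (b & c)


def vote_and_repair(copies):
    """copies: three equal-length bytearrays. Returns (voted bytes, bytes repaired)."""
    a, b, c = copies
    if not (len(a) == len(b) == len(c)):
        raise ValueError("copies must be the same length")
    voted = bytes(vote_byte(x, y, z) for x, y, z in zip(a, b, c))
    repaired = 0
    for copy in copies:
        for i, good in enumerate(voted):
            if copy[i] != good:
                copy[i] = good      # overwrite the outvoted byte
                repaired += 1
    return voted, repaired


copies = [bytearray(b"GO"), bytearray(b"GO"), bytearray(b"GO")]
copies[1][0] ^= 0x04                # flip one bit in one copy
voted, repaired = vote_and_repair(copies)
```

      Voting per bit rather than per byte means two copies can each carry a different flipped bit in the same byte and the original value still wins.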

      --
      "Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
    6. Re:Fun to read the comments by Anonymous Coward · · Score: 0

      twice

      And if the checksum is different, you do it again...

    7. Re:Fun to read the comments by trout007 · · Score: 1
      --
      I love Jesus, except for his foreign policy.
  23. TFS - obviously written by a hardware guy by Thud457 · · Score: 2

    "Cosmic rays?"
    "That's a software problem...

    They're lucky those chips they bought from China weren't made of lead, or contain deadly melamine!!!

    --

    the preceding comment is my own and in no way reflects the opinion of the Joint Chiefs of Staff

    1. Re:TFS - obviously written by a hardware guy by sconeu · · Score: 4, Interesting

      You laugh, but how many of you low level guys had to work around buggy hardware?

      I once sent a memo to my boss that I was doing the equivalent of "working around a burnt out lightbulb in software".

      E.g.: How many hardware guys does it take to change a lightbulb? None, we'll just have the software work around it.

      --
      General Relativity: Space-time tells matter where to go; Matter tells space-time what shape to be.
    2. Re:TFS - obviously written by a hardware guy by jd2112 · · Score: 1

      "Cosmic rays?" "That's a software problem...

      They're lucky those chips they bought from China weren't made of lead, or contain deadly melamine!!!

      If they were made of lead they might have blocked enough radiation to prevent them from crashing.

      --
      Any insufficiently advanced magic is indistinguishable from technology.
    3. Re:TFS - obviously written by a hardware guy by mevets · · Score: 4, Informative

      Try this one on your hardware guys:
      "The main purpose of software is to make hardware reliable".

      Drives them nuts...

    4. Re:TFS - obviously written by a hardware guy by Anonymous Coward · · Score: 1

      Wanhhhhh - it's the network, it's the hardware, it's not me, mommyyyyyyyyyyy!!!

    5. Re:TFS - obviously written by a hardware guy by Anonymous Coward · · Score: 0

      NAND flash proves this. Just look at the hoops for NAND booting...

    6. Re:TFS - obviously written by a hardware guy by Anonymous Coward · · Score: 0

      I agree with you, this drives me nuts! But I am also the software guy... is that OK? I don't know, but it's great fun!

    7. Re:TFS - obviously written by a hardware guy by garyebickford · · Score: 4, Interesting

      Not even necessarily low level. I once had a weird intermittent problem in a PHP driven web system. After a couple of weeks of diagnosing (largely trying to find a case that could more-or-less reliably tickle the bug), it turned out to be an interaction of a bug in the Redhat version of that day (2001) with a bug in the particular CPU we were using. PHP code just happened to trigger it under certain conditions. Since the box was at Level 3, we had to drive an hour down there and replace the machine.

      And long ago I worked on Perq workstations, which had a stack-machine CPU (the CPU was a 15x15 inch board filled with TTL). The expression stack was four chips. The system was designed around the chip spec - NEVER DO THAT!!! Chips cannot be depended on to run at exactly the design spec - some are slow, some are fast. As a result, every CPU had to be tested at installation with those four chips inserted in different locations, essentially in order of speed. If a fast one came after a slow one in the slots, the CPU would barf. Basically someone just kept swapping chips around until it worked.

      We were just discussing some of the remarkable repairs done in software to accommodate problems in various interplanetary probes - truly amazing stuff.

      --
      It's easier to be a result of the past, but more fun to be a cause of the future! http://www.spacefinancegroup.com/
    8. Re:TFS - obviously written by a hardware guy by Anonymous Coward · · Score: 0

      Sounds like every VIA chipset ever made. Software to patch over the shitty hardware.

    9. Re:TFS - obviously written by a hardware guy by sconeu · · Score: 1

      It was a f***ing data request interrupt with only two ways to reset the interrupt:

      1. strobe data out to the device (thereby triggering another interrupt)
      2. Master reset.

      That sounds like a f***ing hardware issue to me.

      --
      General Relativity: Space-time tells matter where to go; Matter tells space-time what shape to be.
  24. Baloney by mbone · · Score: 4, Interesting

    What are the chances chips would fail in a 20-30 minute period just after launch but before Mars transfer orbit insertion?

    No, I bet this was a programming error, coupled with a near total failure to test the software.

     

    1. Re:Baloney by DerekLyons · · Score: 1

      What are the chances chips would fail in a 20-30 minute period just after launch but before Mars transfer orbit insertion?

      Small, but decidedly non-zero. So I should point out that "improbable" != "impossible".

    2. Re:Baloney by Anonymous Coward · · Score: 0

      The chance of failing in the first 20-30 minutes are the same as failing during the second 20-30 minutes, which are the same as failing during ANY 20-30 minute period of the mission. If it happened at some arbitrary point in space, would you ask "what are the odds of it happening exactly THERE???"

    3. Re:Baloney by mbone · · Score: 1

      Yes, of course, but think of this from an engineering sense. You try and do something for the very first time with a new device and it instantly fails. Now, it is possible that some random error happened to happen just then, but it is much more probable that the failure is connected with the use of new capabilities.

      Now, as it happens, the failure is software related, so it is natural to blame a software error. Could be wrong, but that (IMO) is the way to bet.

    4. Re:Baloney by fgodfrey · · Score: 1

      That's only true if conditions are the same in every 20 to 30 minute chunk, which in this case, they aren't. While the spacecraft is on the ground, it has the earth's magnetosphere and then the atmosphere protecting it from radiation. The further away from the ground it gets, the less protection it has. So, in theory then, the likelihood of a failure due to radiation is *lower* in the initial phases of flight than it is during cruise when it is out of range of any type of natural radiation protection.

      Meantime, you've also got the high vibration and g-force environment of a launch going on, so if the parts were not soldered well (hopefully they used x-ray testing for that, but....) you have the possibility of failure being higher there than during cruise when it isn't shaking at all.

      So no, in space flight, any random time is not created equal.

      --
      Go Badgers! -- #include "std/disclaimer.h"
    5. Re:Baloney by ChrisMaple · · Score: 1

      You need to read up on failure mechanisms in electronic systems. Pay particular attention to "infant mortality", "wearout", and "bathtub".

      --
      Contribute to civilization: ari.aynrand.org/donate
    6. Re:Baloney by DerekLyons · · Score: 1

      Yes, of course, but think of this from an engineering sense.

      I am - but thinking of it from an engineering sense means looking at all plausible causes (which you aren't) and reviewing all the available evidence (which you haven't, unless you read Russian).
       

      You try and do something for the very first time with a new device and it instantly fails. Now, it is possible that some random error happened to happen just then, but it is much more probably that the failure is connected with the use of new capabilities.

      However, "more probable" does not mean "actually did happen". (I repeat this because it doesn't seem to have sunk in the first time.)
       

      Now, as it happens, the failure is software related, so it is natural to blame a software error.

      Now, as it happens, where the failure occurred depends on which (rough and fast) translation of the report you read. Thanks, but I'll stick with engineering and leave the bias to you.

    7. Re:Baloney by DCFusor · · Score: 1

      Or, someone thought they could get away with COTS stuff that requires air to carry away its own heat. NASA's even forgotten that one. Even a relatively low power chip in a vacuum will burn itself out from its own heat, no radiation required. Could just be a gap in the heatsink mate up...that's all it takes.

      --
      Why guess when you can know? Measure!
  25. Oh come on. by JustAnotherIdiot · · Score: 1

    I read the title and I was going to make a joke about forgetting a ;, or something of the like.
    But this wasn't a programming error, it was a hardware failure |:
    Did the editor even read what he wrote?

    --
    What do I know, I'm just an idiot, right?
    1. Re:Oh come on. by Fnord666 · · Score: 1

      Did the editor even read what he wrote?

      The editors no longer write or read anything. They just cut and paste. Submitters no longer write anything, they just copy the first paragraph or two of an article. I swear that some days all of the articles are probably just submitted by a very short perl script.

      --
      'The tyrant will always find pretext for his tyranny.' - Aesop's Fables
    2. Re:Oh come on. by Anonymous Coward · · Score: 0

      Did the editor even read what he wrote?

      The editors no longer write or read anything. They just cut and paste. Submitters no longer write anything, they just copy the first paragraph or two of an article. I swear that some days all of the articles are probably just submitted by a very short perl script.

      Or... maybe the POSTS THEMSELVES are REALLY LONG PERL SCRIPTS?!

      SOMEBODY COMPILE SLASHDOT!

    3. Re:Oh come on. by garyebickford · · Score: 1

      Random quote:

      no longer write or read anything

      Random reply: I think so too!
      Random caveat: But on the other hand, maybe not.
      Random embellishment: Eliza knows!

      --
      It's easier to be a result of the past, but more fun to be a cause of the future! http://www.spacefinancegroup.com/
    4. Re:Oh come on. by Tastecicles · · Score: 1

      My summary (first time submitter today!) was typed by my very own fingers. No cut/paste involved.

      --
      Operation Guillotine is in effect.
  26. how long does it take YOU to walk a mile? by Thud457 · · Score: 2

    Mars is 60,000,000 miles away.
    Phobos Grunt would have taken three years to get there.
    If it didn't die of dysentery on the journey there.

    --

    the preceding comment is my own and in no way reflects the opinion of the Joint Chiefs of Staff

    1. Re:how long does it take YOU to walk a mile? by PlatyPaul · · Score: 1

      Radiation bites Phobos Grunt.
      Radiation bites Phobos Grunt.
      Phobos Grunt dies.

      --
      Misery loves company. Online misery loves unsuspecting random strangers.
    2. Re:how long does it take YOU to walk a mile? by Anonymous Coward · · Score: 0

      its even farther away if you count by kilometers. since that was one of the problems with probes, why not count in larger units, so it take less time. maybe if we measure it in centons, the time in light years will be less. God, im so proud of my US physics education. thanks to all the TV i watched, i can solve technical problems with ease. Beam me up!

  27. Top Ten reasons for failure of Mars Probe. by walterbyrd · · Score: 3, Funny

    Ripped from old David Letterman "Top Ten List"

    10. "Mars probe? What Mars probe?"
    9. Forgot to use The Club
    8. Those lying weasels at Radio Shack
    7. Too much Tang
    6. Made by G.E.
    5. Them Martians musta shot it down with a ray gun
    4. Heh, heh, heh ... Our space probe sucks -- heh, heh, heh
    3. At least we didn't blow all our money on some dork screwing around with a car phone
    2. Remember Watergate? Well, Nixon's up to his old tricks again!
    1. Space monkeys

  28. ...programming? by Anonymous Coward · · Score: 0

    Right, so, if I throw my PC into the fire and it shuts down, is that a programming error too?

  29. Worse than on the ground... by larys · · Score: 0

    is a BSOD in space... -- once it exits the atmosphere, you can't hit the reset button anymore... :/

    1. Re:Worse than on the ground... by Panaflex · · Score: 3, Informative

      There's hardware to deal with that - a watchdog timer can reboot the system quickly.

      Assuming the system comes back up with a working CPU and RAM, then the main computer should be able to work around bad peripheral or components on the bus. I think that's what the article is getting at.

      On military aircraft, they use VMs to run the OS and software. Communication between systems is passed synchronously and requires that each module know the state of the other modules. There is never an assumption that the other system will just work - all messages require acknowledgement and verification of results.
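
      The watchdog idea can be sketched as a timer the software has to keep resetting; if it goes quiet for too long, the timer forces a reboot. A real watchdog is a hardware peripheral, so the class below is only a simulation with invented names, and time is passed in explicitly to keep the example deterministic:

```python
class Watchdog:
    """Simulated watchdog: fires on_expire if it is not kicked within timeout_s."""

    def __init__(self, timeout_s, on_expire):
        self.timeout_s = timeout_s
        self.on_expire = on_expire
        self._last_kick = 0.0

    def kick(self, now):
        """Called by healthy software on every pass through its main loop."""
        self._last_kick = now

    def poll(self, now):
        """Called by the (simulated) timer. Returns True if a reset was forced."""
        if now - self._last_kick > self.timeout_s:
            self.on_expire()
            self._last_kick = now   # the reboot restarts the countdown
            return True
        return False


resets = []
dog = Watchdog(timeout_s=1.0, on_expire=lambda: resets.append("reboot"))
dog.kick(now=0.0)
healthy = dog.poll(now=0.5)   # kicked recently, nothing happens
hung = dog.poll(now=2.0)      # no kick for 2 seconds, reset fires
```

      The watchdog only gets the system restarted; what the software does with a half-working bus after the reboot is a separate problem.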

      --
      I said no... but I missed and it came out yes.
    2. Re:Worse than on the ground... by garyebickford · · Score: 1

      This is on the order of oblig:
      I just learned that there is a special version of Windows, Windows for Warships.

      --
      It's easier to be a result of the past, but more fun to be a cause of the future! http://www.spacefinancegroup.com/
    3. Re:Worse than on the ground... by Tastecicles · · Score: 1

      no... on military aircraft they use hardcoded RTOS embedded systems. Layers of interfacing and the associated lag can mean the difference between a missile flying through the correct window and blowing an ammo dump or flying through the wrong window and blowing up a school full of kids.

      --
      Operation Guillotine is in effect.
    4. Re:Worse than on the ground... by Panaflex · · Score: 1

      You are (mostly) correct, sir. However, message passing is certainly being used on all new developments rather than IPC. Certainly there has been a long-term adherence to RTOS development in military avionics, but commercial avionics has moved strongly to VM based systems as the recovery is faster and debugging critical software components is easier. Additionally, hardware can be allowed to advance without requiring total rewrites for software.

      I've already seen java run on the F-35 platform - and I'm pretty sure you'll see much more as time goes.

      --
      I said no... but I missed and it came out yes.
  30. Russian spacecraft... American spacecraft.... by Anonymous Coward · · Score: 0

    All made in CHINA.

  31. Cited Blog Says Otherwise by boddhisatva · · Score: 1

    The cited Planetary Society blog with translated explanation describes hardware failure and not programming failure.

  32. Told them not too by Anonymous Coward · · Score: 0

    Even though it seemed like a really cool idea at the time, we warned them not to use iPhones as the onboard CPUs on the spacecraft.

  33. Radiation Damage? by funkboy · · Score: 2

    Well, if there was an RTG onboard, then maybe the radiation damage was from inside the spacecraft.

    It seems strange to me that they'd blame radiation damage as they have a separate institution dedicated to developing rad-hard SPARC chips for space applications that has a very successful track record.

    Question: how do they know it was radiation damage if they never heard back from the probe?

    1. Re:Radiation Damage? by XrayJunkie · · Score: 1

      The chips sent an S.O.S. before going down!
      "Houston, I am feeling dizzy"

  34. Top 10 reasons for failure of Mars Probe. by EnsilZah · · Score: 3, Funny

    01 Hardware
    10 Software

    And it seems the article opted for 11 which is an undefined state.
    (Monospace used for effect)

  35. Mission went as expected by alexmin · · Score: 1

    After reading the open letter from one of the designers of Fobos, Nikolai Morozov, to Russian vice-premier Sergei Ivanov from 03/08/2011, it's hard to believe that the Fobos-Grunt launch was anything but a success.

    The goal was not to send something to Mars as officially stated but to get rid of material evidence of gross incompetence and graft going on in KB Lavochkin for many years.

    Link (in Russian): http://apervushin.livejournal.com/179226.html

  36. Comment removed by account_deleted · · Score: 1

    Comment removed based on user account deletion

  37. Comment removed by account_deleted · · Score: 1

    Comment removed based on user account deletion

  38. Darn you Id Software! by Darth+Hubris · · Score: 2

    Who saw "Doom", "Mars", and "Phobos" and reached for your shotgun?

    --
    The party's over ... the drink ... and the luck ... ran out
  39. Oblig. Armageddon by Anonymous Coward · · Score: 0

    *said in an overtly Russian sounding dialect of English* "Russian components, American components, it's all made in Taiwan!"

  40. Mars Reconnaissance Orbiter by rk · · Score: 1

    Has/had (don't know if it's been patched) a nifty bug where a 4-bit group identifies the state the spacecraft is supposed to be in. The problem is when the spacecraft reboots, that value starts off uninitialized, so whatever value just happens to be sitting at that point in memory gets used. Not a huge problem, because when the spacecraft reboots (it happens) we can just telecommand it to the right state. Except for one problem: One of those states is "I'm on the launching pad and shouldn't listen to any radio telecommands, but only commands from the hardwire interface." Which means we can't remotely command it out of that state anymore, and it will at that point be a dead orbiter.
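
    One way to defend against that kind of uninitialized state field is to validate it at boot and refuse any value that cannot be true after launch. This is a hypothetical sketch only: the state names, the encodings, and the has_launched flag (imagined here as something latched reliably at liftoff) are assumptions, not a description of the actual orbiter software or of any fix that was applied:

```python
from enum import IntEnum


class CraftState(IntEnum):
    LAUNCH_PAD = 0x1   # hardwire commands only, radio ignored
    CRUISE = 0x2
    SCIENCE = 0x3
    SAFE = 0xF


# States that make sense to wake up in once the craft has left the pad.
BOOT_ALLOWED = {CraftState.CRUISE, CraftState.SCIENCE, CraftState.SAFE}


def state_after_reboot(raw_nibble, has_launched):
    """Turn whatever the 4-bit field holds into a state the craft can survive."""
    try:
        state = CraftState(raw_nibble & 0xF)
    except ValueError:
        return CraftState.SAFE          # undefined encoding: go to safe mode
    if has_launched and state not in BOOT_ALLOWED:
        return CraftState.SAFE          # defined, but impossible in flight
    return state
```

    The design choice is that a garbage value costs a trip through safe mode rather than a craft that no longer listens to the radio.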

    Space software is exciting!

  41. Which frickin chip by Anonymous Coward · · Score: 0

    This entire conversation is moot, because we don't know which "chip" failed.

    Also, it's the popular media, so "chip" could mean anything from a wire to an electro-mechanical actuator or power supply component, or firmware, or even software.
    It's only one step above "computer glitch" which tends to mean "something went wrong somewhere in the system", even if there's no actual computer in the system.

  42. Irrelevant, really by Anonymous Coward · · Score: 0

    People have been successfully using non space qualified components on LEO spacecraft since at least the late seventies (8086, 80186,80386 and others). These do not need much, if any, screening in low earth orbit. Of course, some other non space qualified components are not that robust, but there still are options and there are relatively simple and cheap tests you can do yourself to determine if it will work, even if it is not a full set of qualification tests.

    And there is no way ANY radar should affect onboard electronics.

    Any way (US radar, bad components, programming, or whatever other excuse Roskosmos comes up with), it was bad design on the part of the Russians.

    These things happen. The Russians are good engineers. It is a pity that their leaders are using weak excuses to cover what seems to be mostly leadership failings.

  43. reliability calculation fail by Anonymous Coward · · Score: 0

    You are doing it wrong.

    1. Re:reliability calculation fail by Tastecicles · · Score: 1

      Please, AC, do enlighten us: where am I in error?

      --
      Operation Guillotine is in effect.
  44. what's the over-under.... by slashmydots · · Score: 1

    I'm taking bets right now. We're paying 10:1 odds that it wasn't a Via chip. In other words, I bet it was a Via chip lol. They probably pulled an ECS/Foxconn and said, "weeeeeeeell, for $2 cheaper, we can skip the Realtek chips and put on Via ones. Yeah, let's do that." That's right, I'm implying that the sound card and ethernet controller chips crashed it, lol. You try to turn on the subwoofer channel on that rover to let the martians know you're coming (and you're totally riced out too) and then BOOM, sorry, that's not supported by this version of the driver - CRASH! Yeah, that's what really happened and they're just covering it up.

  45. Seen It Before .. Even Worse by Anonymous Coward · · Score: 0

    The NASDA/JAXA ADEOS-II was rendered useless because a "genius technician" on the ground forgot to upload a very small script to override the fail-safe script whose purpose is to turn off all systems in the event of communication loss.

    On about the 6th month ADEOS-II started turning off all sub-systems and in the end turned off the main power source thus rendering it useless.

    Wonderful.

  46. Boogeymen by Anonymous Coward · · Score: 0

    First it was US sabotage and now it is cosmic rays... At this point the only excuse I believe is systemic incompetence.

  47. Self-Repair by XrayJunkie · · Score: 1

    There are several institutes that research self-repair features of chips.
    To cope with (space) radiation, they use chips that can restructure themselves to avoid damaged parts. Self-repair is an alternative to various shielding layers. A combination of both - in the right mix - would improve reliability by factors.
    See also the Fukushima crisis, a similar application scenario (http://tech.slashdot.org/story/12/01/08/1420254/where-were-the-robots-in-fukushima-crisis), where robots could not be used or simply failed in the field.

  48. Why Not Blame the Papist Heretics? by Anonymous Coward · · Score: 0

    I - I'd have believed it more easily if they had blamed it on a Gregorian x Traditional calendar programming conflict.
    II - Radiation strong enough to tickle a Russian craft's insides probably can melt spark plugs as well.

  49. is there any bugfix? by Anonymous Coward · · Score: 0

    Is there any bugfix for correcting the programmer's failure? (AKA a service pack for the buggy operating system in deep space.)

    A possible solution could be sending a second rescue probe that does the following steps: [1] eject the failed chipset's motherboard (like the tray of a CD drive), [2] insert a recovery chipset base (like inserting a recovery CD), and [3] retrieve the ejected motherboard and return it to Earth for a forensic space C.S.I. (Space Crime Scene Investigation).

    Are they interested in discovering how the chipset got fried, and what or who caused it?

    Next lesson: don't let software control the signals on shared wires (aka buses). Why did the hardware engineers use shared wires to carry software-controlled signals instead of unshared wires with more logic gates?

    Don't point to the programmers as the ones most responsible for what happened; the hardware engineers came before them.

    In space, the ISS and the satellites could be tools for this kind of rescue mission "on the fly".

    JCPM: i don't like the Mars missions that could be used for human colonizing purposes and violating the godsent Earth's prophecies.