Programming Error Doomed Russian Mars Probe
astroengine writes "So it turns out U.S. radars weren't to blame for the unfortunate demise of Russia's Phobos-Grunt Mars sample return mission — it was a computer programming error that doomed the probe, a government board investigating the accident has determined."
According to the Planetary Society Blog's unofficial translation and paraphrasing of the incident report, "The spacecraft computer failed when two of the chips in the electronics suffered radiation damage. (The Russians say that radiation damage is the most likely cause, but the spacecraft was still in low Earth orbit beneath the radiation belts.) Whatever triggered the chip failure, the ultimate cause was the use of non-space-qualified electronic components. When the chips failed, the on-board computer program crashed."
We've got a contradictory summary here. Chip failure isn't a programming fault; it's a hardware problem. Stop confusing hardware and software, you insensitive clod.
the ultimate cause was the use of non-space-qualified electronic components
Programming error?
Perhaps in the software used to order the parts
"the ultimate cause was the use of non-space-qualified electronic component" != "programming error" hardware fail.
How much did they save by using Radio Shack parts in a Mars probe? $5.00 even?
Sorry, but gray text on gray background is making my eyes bleed.
Is it just me, or is it the responsibility of all software engineers to find the hardware problem in order to prove to people that the cause isn't software?
The Moore-Murphy Law: The number of things that will go wrong will double every 2 years.
The summary is so contradictory because it quotes from 2 articles, and each of them is completely different. One says that the parts were space-tested and fine, and the other says they were never space-certified and were definitely bad. The first one says instead that a software bug caused parts of the system to reboot. The second doesn't know what happened and just blames faulty hardware.
"If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM
In other news, U.S. radars were not responsible for the highly confusing and contradictory summary posted this morning to a Slashdot story about Russia's Phobos-Grunt probe. A thorough investigation has determined that the story's chips should have been able to withstand the radiation received when the story was transmitted through the intertubes and routed over northern Alaska. Instead, investigators blamed a typing failure on the story editors. "A series of tests showed that the editing was lousy and sloppy, and disciplinary action will be taken on those responsible," a spokesman said.
A 4 digit ID and never heard of microcode.
Seriously Gramps, the distinction between hardware and software isn't as clear cut as it was when shit was all powered by steam.
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
The Planetary Society entry says that two modules failed and then the main computer crashed. Whether the computer crashed is probably irrelevant if there were significant failures in the electronics. Perhaps if the computer had kept going there would have been some communication of what had gone wrong.
One of the commenters wrote "It is rather unlikely radiation caused the failure. Russians said the failure was due to an SRAM WS512K32V20G24M from White Electronics. This part is a module containing 4 CY7C1049 chips from Cypress and is actually screened. While the Cypress part is very susceptible to Latchup," No idea if this is true or not.
Okay, we still have a respectable though dwindling community of commenters, so can we please get rid of these editors who can't even be bothered to read four lines of summary text before posting?
The headline and summary do not make sense. Come on, we're supposed to be nerds, aka intelligent, focused, attentive knowledge aggregators.
the fuck is wrong with this goddamned site?! These failures are starting to make Digg look good!
-Billco, Fnarg.com
Fun to read the comments here. I've done embedded stuff and you need to be defensive. You can see at a glance who here has never done defensive programming before, or embedded or safety-critical programming: they're all blaming the hardware. There are 3 states, so you've got 2 bits of input, and a disallowed state comes in. Deal with it; don't just curl up and die and blame the hardware designer. There's a 12-bit A/D conversion result stored in two bytes, and there's a 14-bit number found there. Deal with it; don't just curl up and die and blame the ... . There's a cycle start button and an emergency stop button and both are simultaneously on. Deal with it. You reboot a mission-critical (or safety-critical!) CPU and a minor auxiliary input A/D doesn't initialize. Do you burn the plant down in a woe-is-me pity party because one out of 237 sensors isn't coming online, or do you deal with it?
Finally, radiation is a statistical phenomenon. There is no such thing as radiation-free. If they used non-rad-hardened parts, it's gonna crash maybe 10000 times more often. That's OK, you program around that, assuming you know what you're doing. Radiation-hardened does not equal radiation-proof. If there was a single-bit error, or a latchup on a rad-hardened unit, with a poorly programmed control system it would have failed just as well; it's just that a rad-hardened chip would have made it a couple orders of magnitude less likely. A shitty design that has a 1 in 20000 failure rate due to better hardware instead of 1 in 2 is still a shitty programming design, even if the odds are "good enough" that it makes it most of the time with the better hardware.
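For anyone who hasn't written this kind of code, here is a minimal sketch of the three checks described in the comment above: the disallowed 2-bit state, the 14-bit value in a 12-bit A/D slot, and both buttons pressed at once. Everything named here (Mode, SAFE_MODE, read_adc, may_start_cycle) is hypothetical and made up for illustration; it is not anything from the probe's actual software.

```python
from enum import Enum


class Mode(Enum):
    # Three legal states carried in two bits (names are illustrative only).
    IDLE = 0b00
    RUN = 0b01
    FAULT = 0b10
    # 0b11 is deliberately unassigned: the disallowed fourth pattern.


SAFE_MODE = Mode.FAULT  # assumed fallback when the input makes no sense
ADC_MAX = 0x0FFF        # largest value a 12-bit converter can produce


def decode_mode(raw_bits: int) -> Mode:
    """Map a 2-bit input to a mode; a disallowed pattern yields SAFE_MODE."""
    try:
        return Mode(raw_bits & 0b11)
    except ValueError:
        return SAFE_MODE


def read_adc(high_byte: int, low_byte: int, last_good: int) -> int:
    """Combine two bytes into a sample, rejecting words wider than 12 bits."""
    word = ((high_byte & 0xFF) << 8) | (low_byte & 0xFF)
    if word > ADC_MAX:
        # A 14-bit number in a 12-bit slot: keep the last plausible reading.
        return last_good
    return word


def may_start_cycle(start_pressed: bool, estop_pressed: bool) -> bool:
    """Emergency stop always wins, even if both buttons read as pressed."""
    return start_pressed and not estop_pressed
```

Holding the last good A/D reading is only one possible policy. Masking the top bits or flagging the sensor as failed would be just as defensible, depending on what the sensor feeds.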
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
Stop dissing Steam, it is the power source of the future. :)
Also, get off my lawn.
If I were God, wouldn't I protect my churches from acts of me?
"Cosmic rays?"
"That's a software problem...
They're lucky those chips they bought from China weren't made of lead, or contain deadly melamine!!!
the preceding comment is my own and in no way reflects the opinion of the Joint Chiefs of Staff
What are the chances chips would fail in a 20-30 minute period just after launch but before Mars transfer orbit insertion?
No, I bet this was a programming error, coupled with a near total failure to test the software.
Mars is 60,000,000 miles away.
Phobos Grunt would have taken three years to get there.
If it didn't die of dysentery on the journey there.
the preceding comment is my own and in no way reflects the opinion of the Joint Chiefs of Staff
Ripped from old David Letterman "Top Ten List"
10. "Mars probe? What Mars probe?" ... Our space probe sucks -- heh, heh, heh
9. Forgot to use The Club
8. Those lying weasels at Radio Shack
7. Too much Tang
6. Made by G.E.
5. Them Martians musta shot it down with a ray gun
4. Heh, heh, heh
3. At least we didn't blow all our money on some dork screwing around with a car phone
2. Remember Watergate? Well, Nixon's up to his old tricks again!
1. Space monkeys
There's hardware to deal with that - a watchdog timer can reboot the system quickly.
Assuming the system comes back up with a working CPU and RAM, then the main computer should be able to work around bad peripheral or components on the bus. I think that's what the article is getting at.
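A rough software model of the idea, as a sketch only: a watchdog that fires a reset when the main loop stops kicking it, and a boot step that probes each peripheral and carries on with whatever answers. The Watchdog class, the boot function, and the peripheral names in the usage note are all assumptions for illustration; a real watchdog is a hardware counter, not a Python object.

```python
from typing import Callable, Dict, List


class Watchdog:
    """Toy model of a hardware watchdog: counts down unless kicked."""

    def __init__(self, timeout_ticks: int) -> None:
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.resets = 0

    def kick(self) -> None:
        """Called by healthy software to prove it is still running."""
        self.remaining = self.timeout_ticks

    def tick(self) -> bool:
        """Advance one tick; return True if the watchdog fired a reset."""
        self.remaining -= 1
        if self.remaining <= 0:
            self.resets += 1
            self.remaining = self.timeout_ticks
            return True
        return False


def boot(peripherals: Dict[str, Callable[[], bool]]) -> List[str]:
    """Probe every peripheral after a reset; return the names that respond.

    A peripheral that fails its probe is left out rather than treated
    as a reason to abort the whole boot.
    """
    return sorted(name for name, probe in peripherals.items() if probe())
```

For example, boot({"radio": lambda: True, "aux_adc": lambda: False}) returns ["radio"], and the system keeps flying without the auxiliary A/D instead of crashing on it.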
On military aircraft, they use VMs to run the OS and software. Communication between systems is passed synchronously and requires that each module know the state of the other modules. There is never an assumption that the other system will just work - all messages require acknowledgement and verification of results.
I said no... but I missed and it came out yes.
Well, if there was an RTG onboard, then maybe the radiation damage was from inside the spacecraft.
It seems strange to me that they'd blame radiation damage as they have a separate institution dedicated to developing rad-hard SPARC chips for space applications that has a very successful track record.
Question: how do they know it was radiation damage if they never heard back from the probe?
01 Hardware
10 Software
And it seems the article opted for 11 which is an undefined state.
(Monospace used for effect)
Who saw "Doom", "Mars", and "Phobos" and reached for your shotgun?
The party's over
It's worth noting that the Space Shuttle's navigation system had three identical computers that all 'voted' on the result, and if one disagreed it took itself out of the system. And there was a fourth computer made by a different company, using a different architecture and different programming language, that monitored the three. In retrospect, I think that's a pretty good idea. Having two different architectures makes having the same programming error occur in two different systems very unlikely.
Of course, as you add nodes to such a system, it gets more 'interesting' to figure out how to handle the set of possible differences. What constitutes a failure? What constitutes agreement?
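To make those two questions concrete, here is a minimal majority-vote sketch. It assumes "agreement" means exact equality of outputs and "failure" means disagreeing with a strict majority; both are simplifying assumptions, and the vote function is hypothetical, not the Shuttle's actual scheme.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def vote(outputs: Sequence[Hashable]) -> Tuple[Optional[Hashable], List[int]]:
    """Return (majority value or None, indices of the units that disagreed).

    With no strict majority there is no way to tell which unit is wrong,
    so the result is None and every unit is reported as suspect.
    """
    if not outputs:
        return None, []
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters
```

With three units, vote([42, 42, 41]) returns (42, [2]), so unit 2 would be the one to take out of the system. Exact equality is too strict for floating-point results, where a tolerance band would be needed, and that is exactly where deciding what counts as agreement gets 'interesting'.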
It's easier to be a result of the past, but more fun to be a cause of the future! http://www.spacefinancegroup.com/