Programming Error Doomed Russian Mars Probe
astroengine writes "So it turns out U.S. radars weren't to blame for the unfortunate demise of Russia's Phobos-Grunt Mars sample return mission — it was a computer programming error that doomed the probe, a government board investigating the accident has determined."
According to the Planetary Society Blog's unofficial translation and paraphrasing of the incident report, "The spacecraft computer failed when two of the chips in the electronics suffered radiation damage. (The Russians say that radiation damage is the most likely cause, but the spacecraft was still in low Earth orbit beneath the radiation belts.) Whatever triggered the chip failure, the ultimate cause was the use of non-space-qualified electronic components. When the chips failed, the on-board computer program crashed."
We've got a contradictory summary here. Chip failure isn't a programming fault, it's a hardware problem. Stop confusing hardware and software you insensitive clod.
the ultimate cause was the use of non-space-qualified electronic components
Programming error?
Perhaps in the software used to order the parts
The electronic engineers here are always trying to blame programmers for their design faults too.
"the ultimate cause was the use of non-space-qualified electronic component" != "programming error" hardware fail.
Gamma rays, X-rays and the products of their collisions are attenuated by the upper atmosphere, not the Van Allen belts. This is why you get more exposure at altitude in an airplane.
Maybe this is all just a translation error, could have been either or both?
How much did they save by using Radio Shack parts in a Mars probe? $5.00 even?
Sorry, but gray text on gray background is making my eyes bleed.
Doesn't sound like a programming error. If the part had not failed, the mission should have continued as planned.
Is it just me, or is it the responsibility of all software engineers to find the hardware problem in order to prove to people that the cause isn't software?
The Moore-Murphy Law: The number of things that will go wrong will double every 2 years.
I'm not the first to ask... but I still wonder how that's possible on Slashdot, which is *supposed* to be technologically literate.
Vassili Leonov
Components. American components, Russian Components, ALL MADE IN TAIWAN!
http://www.imdb.com/title/tt0120591/quotes?qt=qt0459113
"Evil will always triumph over good, because good is dumb." - Dark Helmet (Spaceballs)
The OP is quoting from 2 different sources. One says it was a programming error, while the other is a seemingly earlier report about the defective or off-spec components.
...it was the name. Phobos Grunt sounds like a porn star.
Male, female, or transgendered, I'm not sure.
Advice: on VPS providers
The summary is so contradictory because it quotes from 2 articles, and each of them is completely different. One says that the parts were space-tested and fine, and the other says they were never space-certified and were definitely bad. The first one says instead that a software bug caused parts of the system to reboot. The second doesn't know what happened and just blames faulty hardware.
"If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM
In other news, U.S. radars were not responsible for the highly confusing and contradictory summary posted this morning to a Slashdot story about Russia's Phobos-Grunt probe. A thorough investigation has determined that the story's chips should have been able to withstand the radiation received when the story was transmitted through the intertubes and routed over northern Alaska. Instead, investigators blamed a typing failure on the story editors. "A series of tests showed that the editing was lousy and sloppy, and disciplinary action will be taken on those responsible," a spokesman said.
The chips program you.
Wow, us software folks get blamed for everything....
So they picked the wrong components, had a hardware failure, and it's software's fault for not anticipating the failure? I know they always say "we'll fix it in software" but this is ridiculous.
I can say at NASA, when we needed 2 fault tolerance, we had 3 CPUs....
In Soviet Russia probe causes programming bug!
They have very strict security measures. It can be traumatic.
What's with Mars and probes? Seriously, how many have been lost either going or coming from?
I am Bennett Haselton! I am Bennett Haselton!
I'm really surprised to hear that it's a programming error, but considering what was done with SCADA I wonder if there isn't something else afoot here.
Okay, we still have a respectable though dwindling community of commenters, so can we please get rid of these editors who can't even be bothered to read four lines of summary text before posting ?
The headline and summary do not make sense. Come on, we're supposed to be nerds, aka intelligent, focused, attentive knowledge aggregators.
the fuck is wrong with this goddamned site?! These failures are starting to make Digg look good!
-Billco, Fnarg.com
Fun to read the comments here. I've done embedded stuff and you need to be defensive. You can see at a glance who here has never done defensive programming before, or embedded or safety-critical programming: they're all blaming the hardware. There are 3 states, so you've got 2 bits of input, and a disallowed state comes in. Deal with it, don't just curl up and die and blame the hardware designer. There's a 12-bit A/D conversion result stored in two bytes, and there's a 14-bit number found there. Deal with it, don't just curl up and die and blame the ... . There's a cycle start button and an emergency stop button and both are simultaneously on. Deal with it. You reboot a mission-critical (or safety-critical!) CPU and a minor auxiliary input A/D doesn't initialize. Do you burn the plant down in a woe-is-me pity party because one out of 237 sensors isn't coming online, or do you deal with it?
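To make those three cases concrete, here is a minimal Python sketch of what "deal with it" might look like. Everything in it is an assumption made for illustration: the mode names, the choice to treat the disallowed encoding as a fault, and the policy of holding the last good A/D reading are hypothetical, not taken from any real flight or plant software.

```python
from enum import Enum
from typing import Tuple


class Mode(Enum):
    """Three legal states packed into a 2-bit field (names are hypothetical)."""
    IDLE = 0b00
    RUN = 0b01
    FAULT = 0b10
    # 0b11 is deliberately not a member: it is the disallowed encoding.


ADC_MAX = (1 << 12) - 1  # largest value a 12-bit converter can produce (4095)


def decode_mode(raw_bits: int) -> Mode:
    """Map the 2-bit field to a mode; the disallowed pattern becomes FAULT."""
    try:
        return Mode(raw_bits & 0b11)
    except ValueError:
        return Mode.FAULT


def read_adc(high_byte: int, low_byte: int, last_good: int) -> Tuple[int, bool]:
    """Combine two bytes into a reading and reject anything wider than 12 bits.

    Returns (value, ok). On a bad reading the last good value is held and
    ok is False so the caller can count or log the event.
    """
    value = ((high_byte & 0xFF) << 8) | (low_byte & 0xFF)
    if value > ADC_MAX:
        return last_good, False
    return value, True


def may_start_cycle(start_pressed: bool, estop_pressed: bool) -> bool:
    """Emergency stop always wins, including when both inputs read as on."""
    return start_pressed and not estop_pressed
```

The point of the sketch is only that each impossible input has a defined, boring outcome instead of an unhandled one. What the right outcome is depends on the system.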
Finally, radiation is a statistical phenomenon. There is no such thing as radiation-free. If they used non-rad-hardened parts, it's gonna crash maybe 10000 times more often. That's OK, you program around that, assuming you know what you're doing. Radiation-hardened does not equal radiation-proof. If there was a single bit error, or a latchup on a rad-hardened unit, with a poorly programmed control system it would have failed just as well; it's just that a rad-hardened chip would have made it a couple orders of magnitude less likely. A shitty design that has a 1 in 20000 failure rate due to better hardware instead of 1 in 2 is still a shitty programming design, even if the odds are "good enough" that it makes it most of the time with the better hardware.
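One widely used way to "program around" single-bit upsets is to keep three copies of a critical value and take a bitwise majority vote on every read. The Python sketch below illustrates that general technique only. It is not a claim about what Phobos-Grunt did or should have done, and the function names are made up for the example.

```python
from typing import List


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority: each result bit is the value at least two copies agree on."""
    return (a & b) | (a & c) | (b & c)


def read_protected(copies: List[int]) -> int:
    """Vote across three stored copies, then rewrite all three with the result.

    Rewriting ("scrubbing") keeps a second upset in another copy from
    lining up with the first one later.
    """
    value = majority_vote(copies[0], copies[1], copies[2])
    copies[:] = [value, value, value]
    return value
```

This masks an upset as long as no bit position is corrupted in more than one copy. It does nothing for a latchup or a dead chip, which is where redundant hardware and watchdog resets come in.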
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
"Cosmic rays?"
"That's a software problem...
They're lucky those chips they bought from China weren't made of lead, or contain deadly melamine!!!
the preceding comment is my own and in no way reflects the opinion of the Joint Chiefs of Staff
What are the chances chips would fail in a 20-30 minute period just after launch but before Mars transfer orbit insertion ?
No, I bet this was a programming error, coupled with a near total failure to test the software.
I read the title and I was going to make a joke about forgetting a ;, or something of the like.
But this wasn't a programming error, it was a hardware failure |:
Did the editor even read what he wrote?
What do I know, I'm just an idiot, right?
Mars is 60,000,000 miles away.
Phobos Grunt would have taken three years to get there.
If it didn't die of dysentery on the journey there.
Ripped from old David Letterman "Top Ten List"
10. "Mars probe? What Mars probe?" ... Our space probe sucks -- heh, heh, heh
9. Forgot to use The Club
8. Those lying weasels at Radio Shack
7. Too much Tang
6. Made by G.E.
5. Them Martians musta shot it down with a ray gun
4. Heh, heh, heh
3. At least we didn't blow all our money on some dork screwing around with a car phone
2. Remember Watergate? Well, Nixon's up to his old tricks again!
1. Space monkeys
Right, so, if I throw my PC into the fire and it shuts down, is that a programming error too?
is a BSOD in space... -- once it exits the atmosphere, you can't hit the reset button anymore... :/
All made in CHINA.
The cited Planetary Society blog with translated explanation describes hardware failure and not programming failure.
Even though it seemed like a really cool idea at the time, we warned them not to use iPhones as the onboard CPUs on the spacecraft.
Well, if there was an RTG onboard, then maybe the radiation damage was from inside the spacecraft.
It seems strange to me that they'd blame radiation damage as they have a separate institution dedicated to developing rad-hard SPARC chips for space applications that has a very successful track record.
Question: how do they know it was radiation damage if they never heard back from the probe?
01 Hardware
10 Software
And it seems the article opted for 11 which is an undefined state.
(Monospace used for effect)
After reading the open letter from Nikolai Morozov, one of the designers of Fobos, to Russian vice-premier Sergei Ivanov dated 03/08/2011, it's hard to believe that the Fobos-Grunt launch was anything but a success.
The goal was not to send something to Mars as officially stated but to get rid of material evidence of gross incompetence and graft going on in KB Lavochkin for many years.
Link (in Russian): http://apervushin.livejournal.com/179226.html
Who saw "Doom", "Mars", and "Phobos" and reached for your shotgun?
The party's over
*said in an overtly Russian-sounding dialect of English* "Russian components, American components, it's all made in Taiwan!"
Has/had (don't know if it's been patched) a nifty bug where a 4-bit group identifies the state the spacecraft is supposed to be in. The problem is when the spacecraft reboots, that value starts off uninitialized, so whatever value just happens to be sitting at that point in memory gets used. Not a huge problem, because when the spacecraft reboots (it happens) we can just telecommand it to the right state. Except for one problem: One of those states is "I'm on the launching pad and shouldn't listen to any radio telecommands, but only commands from the hardwire interface." Which means we can't remotely command it out of that state anymore, and it will at that point be a dead orbiter.
Space software is exciting!
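The failure mode described above (trusting whatever bits happen to be in memory after a reboot) can be sketched in a few lines of Python. The state names, their numeric values, and the idea of sensing a hardwire "umbilical" connection are all hypothetical, invented for the illustration. Nothing here reflects the actual spacecraft's software.

```python
# Hypothetical 4-bit state codes; only the two used below matter here.
SAFE_MODE = 0x1    # listens to radio telecommands
LAUNCH_PAD = 0x7   # hardwire interface only, ignores radio


def boot_state(leftover_memory: int, umbilical_connected: bool) -> int:
    """Choose the post-reboot state without trusting uninitialized memory.

    leftover_memory is accepted only to show that it is ignored: the
    state is derived from something observable right now instead.
    """
    if umbilical_connected:
        return LAUNCH_PAD
    return SAFE_MODE


def accepts_radio(state: int) -> bool:
    """Every state except the launch-pad state listens to telecommands."""
    return state != LAUNCH_PAD
```

With this shape, no garbage value left in memory can put an in-orbit spacecraft into the one state it cannot be commanded out of, because the garbage is never read.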
This entire conversation is moot, because we don't know which "chip" failed.
Also, it's the popular media, so "chip" could mean anything from a wire to an electro-mechanical actuator or power supply component, or firmware, or even software.
It's only one step above "computer glitch" which tends to mean "something went wrong somewhere in the system", even if there's no actual computer in the system.
People have been successfully using non-space-qualified components on LEO spacecraft since at least the late seventies (8086, 80186, 80386 and others). These do not need much, if any, screening in low Earth orbit. Of course, some other non-space-qualified components are not that robust, but there still are options and there are relatively simple and cheap tests you can do yourself to determine if it will work, even if it is not a full set of qualification tests.
And there is no way ANY radar should affect onboard electronics.
Any way (US radar, bad components, programming, or whatever other excuse Roskosmos comes up with), it was bad design on the part of the Russians.
These things happen. The Russians are good engineers. It is a pity that their leaders are using weak excuses to cover what seems to be mostly leadership failings.
You are doing it wrong.
I'm taking bets right now. We're paying 10:1 odds that it wasn't a Via chip. In other words, I bet it was a Via chip lol. They probably pulled an ECS/Foxconn and said, "weeeeeeeell, for $2 cheaper, we can skip the Realtek chips and put on Via ones. Yeah, let's do that." That's right, I'm implying that the sound card and ethernet controller chips crashed it, lol. You try to turn on the subwoofer channel on that rover to let the martians know you're coming (and you're totally riced out too) and then BOOM, sorry, that's not supported by this version of the driver - CRASH! Yeah, that's what really happened and they're just covering it up.
The NASDA/JAXA ADEOS-II was rendered useless because a "genius technician" on the ground forgot to upload a very small script to override the fail-safe script whose purpose is to turn off all systems in the event of communication loss.
On about the 6th month ADEOS-II started turning off all sub-systems and in the end turned off the main power source, thus rendering it useless.
Wonderful.
First it was US sabotage and now it is cosmic rays... At this point the only excuse I believe is systemic incompetence.
There are several institutes that research self-repair features of chips.
To cope with (space) radiation, they use chips that can restructure themselves to avoid damaged parts. Self-repair is an alternative to various shielding layers. A combination of both - in the right mix - would improve reliability by factors.
See: the same application scenario as the Fukushima crisis (http://tech.slashdot.org/story/12/01/08/1420254/where-were-the-robots-in-fukushima-crisis), where robots could not be used or simply failed in the field.
http://en.wikipedia.org/wiki/VIPER_microprocessor
I - I'd have believed it more easily if they had blamed it on a Gregorian x Traditional calendar programming conflict.
II - Radiation strong enough to tickle a Russian craft's insides probably can melt spark plugs as well.
Is there any bugfix for correcting the programmer's failure? (aka a Service Pack for the buggy operating system in deep space.)
A possible solution could be sending a 2nd rescue probe that does the following steps: [1] eject the failed chipset's motherboard (like the tray of a CD drive), [2] insert a recovery chipset board (like inserting the recovery CD), and [3] retrieve the ejected motherboard for a forensic Space C.S.I. (Space Crime Scene Investigation) by returning it to Earth.
Are they interested in discovering how the chipset got fried, and what or who caused it?
Next lesson: don't let software control the signals on the hardware's shared wires (aka buses). Why did the hw engineers use shared wires for sw-controlled signals instead of dedicated wires with more logic gates?
Don't point to the programmers as the ones most responsible; the hw engineers came before them.
In space, the ISS and the satellites could be tools for this kind of rescue mission "on the fly".
JCPM: I don't like Mars missions that could be used for human colonization purposes, violating the godsent Earth's prophecies.