Programming Error Doomed Russian Mars Probe
astroengine writes "So it turns out U.S. radars weren't to blame for the unfortunate demise of Russia's Phobos-Grunt Mars sample return mission — it was a computer programming error that doomed the probe, a government board investigating the accident has determined."
According to the Planetary Society Blog's unofficial translation and paraphrasing of the incident report, "The spacecraft computer failed when two of the chips in the electronics suffered radiation damage. (The Russians say that radiation damage is the most likely cause, but the spacecraft was still in low Earth orbit beneath the radiation belts.) Whatever triggered the chip failure, the ultimate cause was the use of non-space-qualified electronic components. When the chips failed, the on-board computer program crashed."
We've got a contradictory summary here. Chip failure isn't a programming fault, it's a hardware problem. Stop confusing hardware and software you insensitive clod.
the ultimate cause was the use of non-space-qualified electronic components
Programming error?
Perhaps in the software used to order the parts
"the ultimate cause was the use of non-space-qualified electronic component" != "programming error" hardware fail.
Gamma rays, X-rays and the products of their collisions are attenuated by the upper atmosphere, not the Van Allen belts. This is why you get more exposure at altitude in an airplane.
How much did they save by using Radio Shack parts in a Mars probe? $5.00 even?
Sorry, but gray text on gray background is making my eyes bleed.
Is it just me, or is it the responsibility of all software engineers to find the hardware problem in order to prove to people that the cause isn't software?
The Moore-Murphy Law: The number of things that will go wrong will double every 2 years.
I'm not the first to ask... but I still wonder how that's possible on Slashdot, which is *supposed* to be technologically literate.
Vassili Leonov
Components. American components, Russian Components, ALL MADE IN TAIWAN!
http://www.imdb.com/title/tt0120591/quotes?qt=qt0459113
"Evil will always triumph over good, because good is dumb." - Dark Helmet (Spaceballs)
The summary is so contradictory because it quotes from 2 articles, and each of them is completely different. One says that the parts were space-tested and fine, and the other says they were never space-certified and were definitely bad. The first one says instead that a software bug caused parts of the system to reboot. The second doesn't know what happened and just blames faulty hardware.
"If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM
In other news, U.S. radars were not responsible for the highly confusing and contradictory summary posted this morning to a Slashdot story about Russia's Phobos-Grunt probe. A thorough investigation has determined that the story's chips should have been able to withstand the radiation received when the story was transmitted through the intertubes and routed over northern Alaska. Instead, investigators blamed a typing failure on the story editors. "A series of tests showed that the editing was lousy and sloppy, and disciplinary action will be taken on those responsible," a spokesman said.
The chips program you.
In Soviet Russia probe causes programming bug!
They have very strict security measures. It can be traumatic.
The Planetary Society entry says that two modules failed and then the main computer crashed. It's probably irrelevant whether the computer crashed or not if there were significant failures in the electronics. Perhaps if the computer had kept going there would have been some communication of what had gone wrong.
One of the commenters wrote "It is rather unlikely radiation caused the failure. Russians said the failure was due to an SRAM WS512K32V20G24M from White Electronics. This part is a module containing 4 CY7C1049 chips from Cypress and is actually screened. While the Cypress part is very susceptible to Latchup," No idea if this is true or not.
What's with Mars and probes? Seriously, how many have been lost either going or coming from?
I am Bennett Haselton! I am Bennett Haselton!
Okay, we still have a respectable though dwindling community of commenters, so can we please get rid of these editors who can't even be bothered to read four lines of summary text before posting ?
The headline and summary do not make sense. Come on, we're supposed to be nerds, aka intelligent, focused, attentive knowledge aggregators.
the fuck is wrong with this goddamned site?! These failures are starting to make Digg look good!
-Billco, Fnarg.com
Fun to read the comments here. I've done embedded stuff and you need to be defensive. You can see at a glance who here has never done defensive programming before, or embedded or safety critical programming, all blaming the hardware. There's 3 states so you got 2 bits of input and a disallowed state comes in. Deal with it, don't just curl up and die and blame the hardware designer. There's a 12 bit A/D conversion result stored in two bytes, and there's a 14 bit number found there, deal with it, don't just curl up and die and blame the ... . There's a cycle start button and an emergency stop button and both are simultaneously on. Deal with it. You reboot a mission critical (or safety critical!) CPU and a minor auxiliary input A/D doesn't initialize, do you burn the plant down in a woe is me pity party because one out of 237 sensors isn't coming on line, or do you deal with it?
Finally, radiation is a statistical phenomenon. There is no such thing as radiation free. If they used non-rad hardened parts, it's gonna crash maybe 10000 times more often. That's OK, you program around that, assuming you know what you're doing. Radiation hardened does not equal radiation-proof. If there was a single bit error, or a latchup on a rad-hardened unit, with a poorly programmed control system it would have failed just as well, it's just that a rad hardened chip would have made it a couple orders of magnitude less likely. A shitty design that has a 1 in 20000 failure rate due to better hardware instead of 1 in 2 is still a shitty programming design, even if the odds are "good enough" that it makes it most of the time with the better hardware.
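For anyone who hasn't done this kind of thing, here is a rough sketch of what "deal with it" can look like, in Python for readability rather than anything you'd actually fly. All the names (State, SAFE_STATE, decode_state, read_adc, may_start_cycle) are made up for illustration, and which state counts as "safe" is an assumption.

```python
from enum import Enum


class State(Enum):
    IDLE = 0b00
    RUNNING = 0b01
    STOPPED = 0b10
    # 0b11 is deliberately not defined: two bits, three legal states.


# Hypothetical fallback; which state is "safe" depends on the system.
SAFE_STATE = State.STOPPED

ADC_MAX = 0x0FFF  # largest value a 12-bit converter can legally report


def decode_state(raw_bits: int) -> State:
    """Map a 2-bit input to a state, falling back instead of crashing."""
    try:
        return State(raw_bits & 0b11)
    except ValueError:
        return SAFE_STATE


def read_adc(raw_word: int, last_good: int) -> int:
    """Reject a reading wider than 12 bits and reuse the last good value."""
    if 0 <= raw_word <= ADC_MAX:
        return raw_word
    return last_good


def may_start_cycle(start_pressed: bool, estop_pressed: bool) -> bool:
    """Emergency stop always wins, even if both buttons read as pressed."""
    return start_pressed and not estop_pressed
```

The point is only that every "impossible" input has a defined, boring outcome. What that outcome should be (hold the last value, drop to a safe state, flag the sensor as dead) is a design decision for the system in question.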
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
"Cosmic rays?"
"That's a software problem...
They're lucky those chips they bought from China weren't made of lead, or contain deadly melamine!!!
the preceding comment is my own and in no way reflects the opinion of the Joint Chiefs of Staff
Mythologically, which is where the moon got its name, Phobos is a dude. He's got a twin brother Deimos. Given that datapoint, guess the name of another Martian satellite...
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
What are the chances chips would fail in a 20-30 minute period just after launch but before Mars transfer orbit insertion ?
No, I bet this was a programming error, coupled with a near total failure to test the software.
I read the title and I was going to make a joke about forgetting a ;, or something like that.
But this wasn't a programming error, it was a hardware failure |:
Did the editor even read what he wrote?
What do I know, I'm just an idiot, right?
Mars is 60,000,000 miles away.
Phobos Grunt would have taken three years to get there.
If it didn't die of dysentery on the journey there.
Steve?
You can only drink 30 or 40 glasses of beer a day, no matter how rich you are.
-- Colonel Adolphus Busch
What we do know for sure: Bottom.
The Kruger Dunning explains most post on
this might be interesting for you and others (it's pretty much gibberish to me :D)
http://russianspaceweb.com/phobos_grunt_aftermath.html
Ripped from old David Letterman "Top Ten List"
10. "Mars probe? What Mars probe?" ... Our space probe sucks -- heh, heh, heh
9. Forgot to use The Club
8. Those lying weasels at Radio Shack
7. Too much Tang
6. Made by G.E.
5. Them Martians musta shot it down with a ray gun
4. Heh, heh, heh
3. At least we didn't blow all our money on some dork screwing around with a car phone
2. Remember Watergate? Well, Nixon's up to his old tricks again!
1. Space monkeys
There's hardware to deal with that - a watchdog timer can reboot the system quickly.
Assuming the system comes back up with a working CPU and RAM, then the main computer should be able to work around bad peripheral or components on the bus. I think that's what the article is getting at.
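A toy sketch of both ideas, assuming nothing about the real flight software: a watchdog that has to be kicked before its timeout runs out, and a boot routine that brings up whatever peripherals still work and skips the rest. The class, the function and the peripheral names are all invented for the example, and time is passed in explicitly so the behaviour is easy to follow.

```python
class Watchdog:
    """Toy watchdog: the main loop must call kick() before the timeout runs out."""

    def __init__(self, timeout_s: float, now: float = 0.0):
        self.timeout_s = timeout_s
        self.last_kick = now

    def kick(self, now: float) -> None:
        """Record that the main loop is still alive."""
        self.last_kick = now

    def expired(self, now: float) -> bool:
        """True once more than timeout_s has passed since the last kick."""
        return (now - self.last_kick) > self.timeout_s


def boot(peripherals: dict) -> list:
    """After a reset, bring up what works and skip what does not.

    `peripherals` maps a name to a zero-argument init function that
    returns True on success.
    """
    online = []
    for name, init in peripherals.items():
        try:
            if init():
                online.append(name)
        except Exception:
            # A dead peripheral must not take the whole boot down with it.
            pass
    return online
```

On real hardware the watchdog is a separate timer circuit that resets the CPU on its own, which is the whole point: it still works when the software doesn't.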
On military aircraft, they use VMs to run the OS and software. Communication between systems is passed synchronously and requires that each module know the state of the other modules. There is never an assumption that the other system will just work - all messages require acknowledgement and verification of results.
I said no... but I missed and it came out yes.
The cited Planetary Society blog with translated explanation describes hardware failure and not programming failure.
I like the table describing possible failure causes at the bottom, most of them are officials accusing the US of directly or indirectly causing the satellite's failure. Conspiracy theories alive and well.
grep -iw skynet
Well, if there was an RTG onboard, then maybe the radiation damage was from inside the spacecraft.
It seems strange to me that they'd blame radiation damage as they have a separate institution dedicated to developing rad-hard SPARC chips for space applications that has a very successful track record.
Question: how do they know it was radiation damage if they never heard back from the probe?
01 Hardware
10 Software
And it seems the article opted for 11 which is an undefined state.
(Monospace used for effect)
What we glean from this rather old article, together with other common knowledge... apparently the flight-control computer had two identical processors, presumably for redundancy, that according to Roscosmos both rebooted at the same time, possibly due to "heavy particles" in space. This is not unthinkable, especially as the rebooting of such robust processors could take significant time, during which the other one could encounter a failure.
There is also reference to a watchdog procedure, which muddies the waters somewhat - I'm wondering if the watchdog procedures could have triggered on some other condition than total unresponsiveness of the unit in question, and if it could have led to rebooting them both at the same time, for example due to checking them at the same time on an interval. Regardless, after both redundant processors rebooted at the same time, the probe interrupted its flight program, and - quite correctly - entered into "safe mode" awaiting further instructions and diagnostics.
Then comes the further engineering SNAFU, and this is where a software-specification error most likely comes into play: In safe mode, the probe switched to its X-band radio, which was never intended to be operated in Earth orbit, but only in deep space on the way to Mars. The problem with this was two-fold. First of all, the bulky Russian deep space antennas could not track the probe at orbital speeds long enough to receive, let alone transmit, data. And secondly, as the probe was orbiting Earth it was spending long periods with its solar panels in Earth's shadow, while the high power interplanetary radio was draining its batteries. And so the probe was doomed.
After reading the open letter of 03/08/2011 from Nikolai Morozov, one of the designers of Fobos, to Russian vice-premier Sergei Ivanov, it's hard to believe that the Fobos-Grunt launch was anything but a success.
The goal was not to send something to Mars as officially stated but to get rid of material evidence of gross incompetence and graft going on in KB Lavochkin for many years.
Link (in Russian): http://apervushin.livejournal.com/179226.html
Who saw "Doom", "Mars", and "Phobos" and reached for your shotgun?
The party's over
Has/had (don't know if it's been patched) a nifty bug where a 4-bit group identifies the state the spacecraft is supposed to be in. The problem is when the spacecraft reboots, that value starts off uninitialized, so whatever value just happens to be sitting at that point in memory gets used. Not a huge problem, because when the spacecraft reboots (it happens) we can just telecommand it to the right state. Except for one problem: One of those states is "I'm on the launching pad and shouldn't listen to any radio telecommands, but only commands from the hardwire interface." Which means we can't remotely command it out of that state anymore, and it will at that point be a dead orbiter.
Space software is exciting!
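A minimal sketch of the obvious guard, assuming a made-up mode table (the real values and names are not given above): after a reboot, don't trust whatever is sitting in that 4-bit field, and never come up radio-deaf unless the hardwire interface is actually connected.

```python
from enum import IntEnum


class Mode(IntEnum):
    # Hypothetical mode table, invented for this example.
    LAUNCH_PAD = 0x0   # listens only to the hardwire interface
    SAFE = 0x1         # listens to radio telecommands
    CRUISE = 0x2
    SCIENCE = 0x3


def mode_after_reboot(raw_nibble: int, hardwire_connected: bool) -> Mode:
    """Pick a mode after reboot without trusting leftover memory contents."""
    try:
        mode = Mode(raw_nibble & 0xF)
    except ValueError:
        return Mode.SAFE  # garbage value: not one of the defined modes
    if mode is Mode.LAUNCH_PAD and not hardwire_connected:
        return Mode.SAFE  # never go radio-deaf once the umbilical is gone
    return mode
```

Whether a `hardwire_connected` signal exists on the real spacecraft is an assumption; without one, the simpler fix is to always force a radio-commandable mode on reboot.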
It's worth noting that the Space Shuttle's navigation system had three identical computers who all 'voted' on the result, and if one disagreed it took itself out of the system. And there was a fourth computer made by a different company, using a different architecture and different programming language, that monitored the three. In retrospect, I think that's a pretty good idea. Having two different architectures makes having the same programming error occur in two different systems very unlikely.
Of course, as you add nodes to such a system, it gets more 'interesting' to figure out how to handle the set of possible differences. What constitutes a failure? What constitutes agreement?
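To make the "what constitutes agreement?" question concrete, here is a bare-bones majority voter. It is a sketch, not how any real flight system does it: `vote` and `dissenters` are invented names, and it assumes results can be compared for exact equality, where real sensor values would need a tolerance.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value a strict majority of units agree on, or None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count * 2 > len(results) else None


def dissenters(results: Sequence[Hashable]) -> list:
    """Indices of units whose result differs from the majority value."""
    winner = vote(results)
    if winner is None:
        return list(range(len(results)))  # no agreement: trust nobody
    return [i for i, r in enumerate(results) if r != winner]
```

With three units, `vote([42, 42, 41])` gives 42 and unit 2 is the dissenter. With `[1, 2, 3]` there is no majority at all, which is exactly the case where you need a policy decided in advance.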
It's easier to be a result of the past, but more fun to be a cause of the future! http://www.spacefinancegroup.com/
This is on the order of oblig:
I just learned that there is a special version of Windows, Windows for Warships.
I think in a **perfect world**, the chances of catastrophic failure in the collective hardware (or relevant to this discussion, the decisionmaking process) of such a system are zero. What I've experienced in terms of hardware is that the chance of an individual component failing does not change the more you add to the system. What does change is that the chance of any single component failure bringing down the whole system is multiplied by the number of similar components in the system.
As an example:
A hard disk has an MTBF of say, 100,000 hours.
You build a RAID array of ten drives. The MTBF of each component drive is still 100,000 hours, but the MTBF of the array (the system) is 100,000/10 or 10,000 hours.
You build an identical array of ten drives and mirror the two to try and mitigate against data loss in case of failure of an array. Here's where the numbers get interesting.
The two arrays each have an MTBF of 10,000 hours. Counted as a single 20-drive system, the mean time to the first drive failure is 5,000 hours. Because the two arrays mirror each other, losing one array does not lose the data, but otherwise all you have done is double the number of drive failures you have to deal with, halve the data capacity and double your power consumption.
So every 416 days, you should expect one drive in the 10-disk array or two disks in the 20-disk array to fail.
-
Can I do a lightbulb analogy?
Say a lightbulb has an MTBF of 100 hours. You have an array of 1,000 similar bulbs on a display board. You're replacing a bulb every six minutes.
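The arithmetic in the examples above, as a quick sketch. It assumes independent failures at a constant rate, so the failure rates simply add, and it only gives the mean time to the first component failure; it says nothing about whether a mirror survives that failure. The function name is made up.

```python
HOURS_PER_DAY = 24


def series_mtbf(component_mtbf_hours: float, count: int) -> float:
    """Mean time to the first failure among `count` identical components.

    Assumes independent failures at a constant rate, so rates add.
    """
    return component_mtbf_hours / count


ten_drive = series_mtbf(100_000, 10)     # ten-drive array: 10,000 hours
twenty_drive = series_mtbf(100_000, 20)  # two mirrored arrays: 5,000 hours
bulb_board = series_mtbf(100, 1_000)     # bulb board: 0.1 hours

ten_drive_days = ten_drive / HOURS_PER_DAY  # roughly 416.7 days
bulb_minutes = bulb_board * 60              # roughly 6 minutes
```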
Operation Guillotine is in effect.
you made me spray coffee!
no... on military aircraft they use hardcoded RTOS embedded systems. Layers of interfacing and the associated lag can mean the difference between a missile flying through the correct window and blowing an ammo dump or flying through the wrong window and blowing up a school full of kids.
Yep.
I don't recall the exact numbers, but in the early ENIAC vacuum-tube-relay computer I think the mean time to failure was something like 20 minutes. I'm not sure how they could tell though - looking for tubes that weren't lit? maybe they had a sensing circuit that noticed when the current through the tube dropped.
And I read somewhere that at Google or one of the other zillion-computer facilities, there were folks who worked full time just walking round and replacing dead computing nodes.
Actually, while it's possible to design a system to be "virtually" fault-tolerant, in engineering that always comes down to a cost-benefit analysis. Also this naturally does not entirely eliminate so-called "human error" and other freak incidents, but with enough resources tossed into it, you can get very close. It's obvious the safety requirement, and thus the allowed cost, for a manned mission is set much higher; for an unmanned probe it will be accepted that some of them will inevitably be lost, and the design target set to, for example, 1 unrecoverable failure out of 100 missions (pulled that example out of my ass, and in practice Russia of course has a 100% failure rate on Mars probes, which I'm sure is nowhere near the design target).
Also we do not know the number of redundant processors of the kind that were in Phobos-Grunt. If there are three and a monitoring unit, going into "safe mode" in case two of the processors failed at the same time would be an entirely reasonable response - there would be no redundant processor left to compare the results to. But only Roscosmos knows the design for sure; I'm even guessing the redundancy just from the reported facts that there were (at least) two identical processors and that both rebooting at the same time was somehow a problem.
It is of course a kind of confirmation bias, as that's generally the main way a redundant system can fail, but the way these stories generally seem to go, there is some unthought-of issue causing all redundant units to fail at the same time, and the control logic responds in some unexpected way that makes matters worse, because nobody ever thought the redundant systems could fail at the same time, let alone bothered to test it. I work in the automotive industry, and we have an unwritten in-house rule that whenever an engineer says "But just what are the odds that..." we HAVE to make the design hardened against just that possibility.
RAID is actually a good example of the redundancy failure. You may be led to assume that with 100,000 hours MTBF per drive the odds of losing two drives at the same time are practically non-existent. In practice, as the hard-drives are from the same manufacturing batch and subject to identical operating-conditions and usage patterns (including external dangers like somebody dropping it etc.), it will actually be unlikely for the hard-drives to fail at significantly different times. If it were up to me, I'd randomly swap around drives between RAID arrays, preferably acquired at different times, for just that reason.
That's what I do. I never use two drives from the same batch in an array*, because a physical fault on one is more likely to be present on another - and that's a guaranteed fail. I think this is why they not only use several different systems to check each other on the Shuttle, they use different //architectures// so a physical flaw that affects one/of a batch/of a series/of a type is less likely to affect the others. I guess it would be like using a Z80, a Motorola 68k and a 80386 to check each other - well blow me, it's old tech, but what fries a 68k a Z80 would most likely survive.
*it's actually rare that I use drives with the same capacity in my arrays! In my current scratch array, built for very high throughput, I have 80GB Hitachi, 80GB Seagate, 120GB Seagate, 160GB Seagate, 200GB Maxtor, in a RAID0 for 400GB. OK there's lots of space wasted, but hey - I built it for throughput not capacity.
I'm taking bets right now. We're paying 10:1 odds that it wasn't a Via chip. In other words, I bet it was a Via chip lol. They probably pulled an ECS/Foxconn and said, "weeeeeeeell, for $2 cheaper, we can skip the Realtek chips and put on Via ones. Yeah, let's do that." That's right, I'm implying that the sound card and ethernet controller chips crashed it, lol. You try to turn on the subwoofer channel on that rover to let the martians know you're coming (and you're totally riced out too) and then BOOM, sorry, that's not supported by this version of the driver - CRASH! Yeah, that's what really happened and they're just covering it up.
Please, AC, to enlighten us do explain where I am in error?
no, your first guess was right - the valves were mounted on pegboards (literally) with walk-through access that a tech could visually inspect the valves and replace any that weren't lit. It was a full time job.
There are several institutes that research self-repair features of chips.
To cope with (space) radiation, they use chips that can restructure themselves to avoid damaged parts. Self-repair is an alternative to various shielding layers. A combination of both - in the right mix - would improve reliability by factors.
See: the same application scenario as the Fukushima crisis (http://tech.slashdot.org/story/12/01/08/1420254/where-were-the-robots-in-fukushima-crisis), where robots could not be used or simply failed in the field.
http://en.wikipedia.org/wiki/VIPER_microprocessor
You are (mostly) correct, sir. However, message passing is certainly being used on all new developments rather than IPC. Certainly there has been a long-term adherence to RTOS development in military avionics, but commercial avionics has moved strongly to VM based systems as the recovery is faster and debugging critical software components is easier. Additionally, hardware can be allowed to advance without requiring total rewrites for software.
I've already seen Java run on the F-35 platform - and I'm pretty sure you'll see much more as time goes on.