Debugging The Spirit Rover
icebike writes "eeTimes has a story on how the Mars Rover was essentially reprogrammed from millions of miles away. 'How do you diagnose an embedded system that has rendered itself unobservable? That was the riddle a Jet Propulsion Laboratory team had to solve when the Mars rover Spirit capped a successful landing on the Martian surface with a sequence of stunning images and then, darkness.' The outcome strikes me as an extremely Lucky Hack, and the rover could have just as likely been lost forever. Are there lessons here that we can use here on the third rock for recovery of our messed up machines which we manage from afar via ssh?"
MoFscker
No. You can't make a mechanical device like a car that requires no maintenance. Bearings wear out. Hoses and belts have a limited lifespan even if you never drive the car, etc. This is the real world. We will obey the laws of thermodynamics. Entropy always wins.
What you can do is make it require less maintenance, make that maintenance cheaper to perform, and make the car last until you hit something really hard, so long as you maintain it. You should be able to hand your car down to your kids.
Other than that you're bang on though.
I wonder what we can learn from that about maintaining our computers?
KFG
Great article! This is just the sort of thing that has always impressed me about NASA and the JPL. Just when mere mortals might give it up and walk away, they figure out the problem. I can only imagine how wild the party must have been after they fixed Spirit; the scientists and engineers I've worked with in the past could really put away the booze.
Seriously though, the key lessons to take away from this are:
1) Gather all of the clues you can.
2) Take those clues and build a model.
With luck and care, the model should get you closer to what may have gone wrong. And in this case it apparently did just that. Now that's geek cool!
BTW, I know that generally you want to prevent this sort of thing from happening. But in reality most software ships with bugs and launch windows to Mars are non-negotiable.
To the making of books there is no end, so let's get started
There also needs to be a way to load bootstrap code remotely. For instance, having a TCP/IP enabled BIOS be able to run TFTP or some other protocol to load a netboot floppy image. Then you could give it a LILO command instructing it where to find a boot image, preferably one on a server in the same hosting center.
Actually I remember NASA doing a hardware repair from most of the way across the solar system. One of the deep space probes was starting to have a problem sending signals. Some bright mind at NASA looked at the circuit diagram and figured out that a single component (resistor, cap, can't remember) was starting to fail, and that there was a way to recondition the part. So they came up with a program that intentionally overstressed that component path, and the extra energy heated up the part and reconditioned it so that the unit was back in working condition.
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Never. The rovers are only going to operate for ~100 days, and modern flash RAM is rated for 100K write cycles minimum, over a million typical. So unless they are really screwing something up, that shouldn't be a limitation. Also, distributing file placement shouldn't be a software function; good CF cards do it in the controller logic.
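To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes the figures quoted above (a ~100 day mission, 100K cycles minimum, a million typical) and the pessimistic case where every rewrite lands on the same block with no wear levelling at all. The function name is made up for illustration.

```python
# Back-of-the-envelope flash endurance budget.
# Assumptions (from the comment above, not from any rover spec):
#   - mission length of roughly 100 days
#   - 100,000 write cycles minimum, 1,000,000 typical
#   - worst case: every rewrite hits the same block (no wear levelling)
MISSION_DAYS = 100
MIN_CYCLES = 100_000
TYPICAL_CYCLES = 1_000_000


def rewrites_per_day(endurance_cycles: int, mission_days: int) -> float:
    """How many times per day one block could be rewritten before wearing out."""
    return endurance_cycles / mission_days


if __name__ == "__main__":
    for label, cycles in (("minimum", MIN_CYCLES), ("typical", TYPICAL_CYCLES)):
        per_day = rewrites_per_day(cycles, MISSION_DAYS)
        print(f"{label}: {per_day:.0f} rewrites/day, {per_day / 24:.1f} per hour")
```

Even at the minimum rating that is a thousand rewrites of the same block every day for the whole mission, which is why wear-out looks like an unlikely culprit on those numbers.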
What surprises me is that they don't have a 'twin' of the rover's computer system set up on earth. When commands are run on the rover, the same commands could be run on the computer system on earth. Then, if the rover's software fails (as it did), the software on earth would (theoretically) fail in a similar way, and be MUCH easier to debug. Of course, the systems wouldn't be identical (without building an entire duplicate and expensive rover), and the data gathered wouldn't be identical, but the twin could be carefully planned and fed dummy data that approximately mirrored the data the rover was gathering. For example, the twin could be fed dummy pictures about as often as the rover took a real picture.
From the article "[The] transmission that uploaded the utility was a partial failure: Only one of the utility program's two parts was received successfully. The second part was not received, and so in accordance with the communications protocol it was scheduled for retransmission on sol 19." NASA could have simulated a half-failed transfer on the twin computer on earth, and then watched carefully using traditional debugging tools to make sure the failed transmission didn't cause a software failure (which it did).
Again, from the article "The data management team's calculations had not made any provision for leftover directories from a previous load still sitting in the flash file system." However, if they had a twin computer system to watch, they would have seen the failure occur on earth as it did in space. Debugging a system you can hook a serial debugger to is bound to be much easier than debugging a system a million miles away.
Stupid like a fox!
First wxWindows, now Vx-works?
I have a 90 mustang 5.0 with 168k on the clock.
I've got a '70 Mustang with 190K on the clock. Ran fine before I took the engine out. (It needed new head gaskets and the intake manifold was cracked, but it ran well).
"Alcohol, Tobacco, Firearms, and Explosives" should be a convenience store, not a government agency.
I'm a journalism undergrad at a large university. One of the points I brought up with some of our administrators is that the innumeracy and scientific illiteracy of the graduates of our program is appalling. I think this is one reason why many important stories don't get reported accurately or in depth: the writers simply don't understand the story, and don't want to understand the story. They actually feel that math and science are somehow beneath them, and that the average reader doesn't need to be bothered with the facts. So we get vagueness instead of specifics in the articles we read.
I suggested we allow j-students to substitute math or hard science minors in place of the foreign language requirement. Most graduates of college foreign language programs don't translate at a level any higher than Babelfish. It seems wasteful to force people to spend so much time learning a language that most will never use, when that time could be more productively spent introducing them to the languages of math and science, which they will undoubtedly use in the future. We'd get better reporting that way, and isn't that what going to j-school is all about? Science and technology are too important to our day-to-day lives and governance to be left to illiterates.
I'm serious. http://physics.nist.gov/cuu/Units/binary.html for all the groovy details. If anything, it's a move away from the hd manufacturers lingo.
you are quite aware that software verification is far from being usable for any languages found in the wild?!
first you need a model to verify your software against and as a matter of fact the model will again be written in some pseudo-language so that you not only double the workload but also introduce slight incompatibilities between the implementation language and the model language!
and then somebody still has to prove that there are no errors in your model for which you will need a meta-model, etc, etc, etc.
i'm not saying that verification is not practical or that it wouldn't be nice to have it, but there are obstacles that won't be solved for quite some years to come!
Software verification is essentially mathematically proving the software....
I've been hearing how great formal verification is since I started this gig. Three decades later, it's still not what Yourdon and his buddies thought it would be. When the first computer scientists were budded from mathematics departments, their mathematical discipline allowed them to do wonderful things, some of which we're still catching up with. But it also gave them some disturbing habits, the worst of which is the insistence that formal verification is the best way to write code, and anyone not doing so must be a fool.
Formal verification is a powerful tool, but as you say, it is expensive and applies to only a limited set of problems. If it were so cheap and so widely applicable, we'd be using it everywhere.
We've poured decades of funding into formal verification, but the useful tools keep coming from other avenues of research. I think it's time to stop beating the formal verification drum.
There was also pressure not to drop your stack of punched cards in those days!
(hint - draw a diagonal line across their top edges so you can get them in order again quickly.)
Some people seem not to know why "batch" files were so-called, it seems.
YAW.
Your head of state is a corrupt weasel, I hope you're happy.
Unless you own an early pentium.
You are in a twisty maze of processor lines, all alike.
There is a lot of hype here.
Except that fully mirrored RAM would use way too much power, something that no space probe can afford.
-- *My* journal is more interesting than *yours*...
One word: outsourcing.
When I worked at JPL, every 6 months to a year there'd be talks of layoffs because the headcount was too high; people would leave and return to the same projects as contractors, then get a higher hourly wage for doing the same work with less accountability.
The whole reason for that lost probe (feet vs meters, anyone?) was because of a political squabble between two teams (one JPL-internal, one outside contractors as I recall) who simply failed to cooperate productively. The whole management structure inside that world is screwed. People's project leads are not the same as their section/department leads, so the reporting chain is a mes{h,s}. Time and energy are wasted in contract(or) management, all in the name of "reduced costs" even though having all the work done in-house would eliminate a full layer or two of mid-level management waste.
NASA/JPL are totally hamstrung by beancounters who think they're saving the public's money, but truly can't see the big picture, missing the forest for the trees. (Either that, or they *do* see the big picture, and are busily lining their own pockets with the excess that gets tossed around thru all the churn.)
considering the distance i'd say a while. a couple hours doesn't make much difference when you've got a billion $$ probe on another planet; it surviving is more important than a fast boot time heh. and you can always login and tell it to continue booting
"We discovered a system log in which the problem was documented,"
Those guys are running a very expensive experiment, are logging it and they have no idea what and where they are logging??
Slashdot: stuff for news, nerds that matter, matter for news, stuff that nerd
...and I'm not saying that just because we agree. Yours are good additional insights (hence your "insightful" mods up! :-)
:-( "
:-)
I agree with the reply-post below too, saying that if they'd made their system a bit more fault-tolerant, then the problem might have been more easily recovered from. Sixty reboots in a row in a day seems a little excessive! Don't they have counters to detect that very thing? Don't they have a failsafe/debug OS burned into ROM (not flash) to load automatically in just such an event? Such are the risks when you're reloading a whole new OS remotely!
However, maybe they do have such things, or equivalent. I don't think their method of recovery was "accidental" (or a hack) either, although I'm making assumptions and I haven't seen their spec. The key is that they recovered from the error... and I now assume that they have recovered completely.
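For what it's worth, the reboot counter plus failsafe image idea asked about above could look something like this. It is a minimal Python sketch of the general technique, not anything from the rover's actual flight software; the threshold, the image names and the `BootState` class are all invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical threshold: how many boots in a row may fail before falling back.
MAX_CONSECUTIVE_FAILED_BOOTS = 3


@dataclass
class BootState:
    """Stands in for a small counter kept in storage that survives a reset."""
    failed_boots: int = 0


def choose_image(state: BootState) -> str:
    """Pick the image to boot. Each attempt counts as failed until proven otherwise."""
    if state.failed_boots >= MAX_CONSECUTIVE_FAILED_BOOTS:
        return "failsafe"      # e.g. a minimal debug image kept in ROM
    state.failed_boots += 1    # cleared again only if the boot completes
    return "normal"


def mark_boot_successful(state: BootState) -> None:
    """Called once the normal image is up and healthy."""
    state.failed_boots = 0


if __name__ == "__main__":
    state = BootState()
    # Simulate a system that crashes on every boot and never reports success.
    print([choose_image(state) for _ in range(5)])
```

With a counter like this, a crash loop gets cut off after a handful of attempts instead of sixty, at the cost of a few bytes of storage that the crashing code must not be able to corrupt.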
What I found interesting was NASA's initial assessment that the flash ROM was failing -- a hardware failure. The media jumped all over that and reported it, so the rest of us were thinking, "Great, the rover is crippled and will never be the same."
Now, turns out it was just a software error. Where's the mainstream media now? ("EE Times" is hardly mainstream!) Can the rover's recovery now be considered a "complete recovery"?
If this story goes mainstream, will it make NASA look bad for screwing up... or look good for making a full recovery? I'm not sure. (Of course, smart people make mistakes too, lots of them, but the key to being smart is covering your ass beforehand!)
well, this presupposes that what caused the problem in the first place also didn't mess up the hardware watchdog as well.
Nothing's perfect. It also presupposes that the sun didn't explode and vaporize the Earth and that God didn't get ticked off and squish it with his thumb. So what?
A watchdog is a VERY simple device. A simple countdown timer, a control register with associated address decode, etc. It's quite unlikely to fail. When the timer hits zero, it strobes reset. Any access to the port address resets the countdown timer.
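The behaviour just described is simple enough to model in a few lines. This is only a software simulation in Python of the logic (a real watchdog is a hardware counter), and the class, method names and tick counts are made up for illustration.

```python
class Watchdog:
    """Toy model of a hardware watchdog: a countdown that strobes reset at zero."""

    def __init__(self, timeout_ticks: int) -> None:
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.resets = 0

    def kick(self) -> None:
        """Model of 'any access to the port address': reload the countdown."""
        self.remaining = self.timeout_ticks

    def tick(self) -> bool:
        """Advance one clock tick. Returns True if reset was strobed."""
        self.remaining -= 1
        if self.remaining == 0:
            self.resets += 1
            self.remaining = self.timeout_ticks  # countdown restarts after reset
            return True
        return False


def run(watchdog: Watchdog, ticks: int, kick_every: int = 0) -> int:
    """Run for a number of ticks; software kicks every kick_every ticks (0 = hung)."""
    for t in range(1, ticks + 1):
        if kick_every and t % kick_every == 0:
            watchdog.kick()
        watchdog.tick()
    return watchdog.resets


if __name__ == "__main__":
    print("healthy software:", run(Watchdog(5), 20, kick_every=3), "resets")
    print("hung software:   ", run(Watchdog(5), 20), "resets")
```

Healthy software that touches the port more often than the timeout never sees a reset; software that hangs gets reset every time the countdown runs out.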
Some dual processor boards are even set up to alternate which is the boot processor, so they can come up with a single failed CPU.
There is always some sort of problem that precludes recovery. No amount of software or clever design can help you if the device is destroyed. However, that doesn't mean don't even try.
The en route time for Cassini to get to Saturn was 7 years; rather than push back an already long mission they launched with feature-incomplete code. They knew they had 7 years to get the software fully functional and debugged; they've updated it remotely from millions of miles away a number of times now.
I'm sure the rovers did the same thing... Develop the launch/cruise software before you launch (and of course try to get as much of the entry/landing code done as you can!), and then uplink the final code before it's needed. Therefore it doesn't surprise me one bit that the JPL engineer knew there were shortcomings in the launch software.
Hell, I develop BIOS for servers and we do it all the time. The BIOS image we give the hardware engineers for initial bringup is usually *way* short of features that will be there when it actually gets used by the customers!
--Rob