Debugging The Spirit Rover
icebike writes "eeTimes has a story on how the Mars Rover was essentially reprogrammed from millions of miles away. 'How do you diagnose an embedded system that has rendered itself unobservable? That was the riddle a Jet Propulsion Laboratory team had to solve when the Mars rover Spirit capped a successful landing on the Martian surface with a sequence of stunning images and then, darkness.' The outcome strikes me as an extremely Lucky Hack, and the rover could have just as likely been lost forever. Are there lessons here that we can use here on the third rock for recovery of our messed up machines which we manage from afar via ssh?"
Are there lessons here that we can use here on the third rock for recovery of our messed up machines which we manage from afar via ssh?
As a former co-worker (hi, jwalker!) used to say when people tried to draw ridiculous analogies, "It's exactly like that...only different."
A programmer is a machine for converting coffee into code.
That's the thing that amaze me. Any technology having to do with space seem that much more advanced.
Here on earth we can't even build cars that require no maintainance and last more than 10 years.
I hope they use SSH or something .. who's to say a future mission ..some hax0r doesnt grab control of a space probe and have it send goatse.cx pics back??
.. after all the probe communicates using known frequencies. There may be probs picking up the return signal without an expensive antenna i suppose. But then again maybe some hax0r can build one cheaply and or do what captin midnight did ( www.signaltonoise.net/library/captmidn.htm ).
All it takes is a transmitter out in the middle of nowhere africa or some island
I wouldnt worry about signal jamming though as that will probably be discovered easily.
The Martians are pissed that the repair labor was outsourced to Earth.
Table-ized A.I.
In other news stories, the Microsoft Corporation decided to sue NASA, apparently since the right to crash systems was only theirs. Not to be left behind, SCO insisted that the code that caused the failure was unethically copied from their source repositories. This has indeed caused a flutter in the space communities
Sounds like NASA forgot to empty the rover's recycle bin. =)
Steal This Sig
Granted mainstream media have to keep their coverage dumbed down if Joe Public are going to read it. But what really bugs me is the lack of follow-up. We hear about poorly understood events as they are unfolding, then never heard about them later when they are completely understood.
A recent example is the gangway between ship and shore at the QM2's drydock. It collapsed killing lots of people, an investigation was launched. Why did it collapse? At the time it wasn't known. I'm sure it's known now, but there's been absolutely no followup.
This article about the rover is great not so much because of the level of detail but because it reports on an event with the benefit of hindsight.
Slashdot monitor for your Mozilla sidebar or Active Desktop.
I routinely reboot and reprogram machines in our data-center that is 2000 miles away from me.
As long as all hardware components are working and there is connectivity to the machine, it doesn't matter whether the machine is a few miles away or a million miles away.
You are too humble, friend. What you do routinely and without thinking, is nothing less than a miracle of modern science. A miracle that you take part in every day. And because of men like you, we don't have to rely on the abacus anymore. We sent a pentium to the Moon, and soon, Mars will be colonized by G5s. America salutes you, for all the things that you do.....
Like a rock! I was strong as I could be be!
Ooooooohh! Like a rock!
(-1, Raw and Uncut is the only way to read)
'How do you diagnose an embedded system that has rendered itself unobservable?'
The way you do this is by having an exact duplicate of the remote system so you can set up a test with conditions as close to those under which the remote system is currently operating. You can then do a series of carefully controlled test solutions to determine the optimum prior to trying it on the "live" system.
This is the way I set up all my production systems and, barring catastrophic hardware failure (self-immolating disks and a router which just folded when its power supply burped) I've had perfect uptime.
(well, ok.. there was that one time, late at night, when I typed "reboot" in the wrong window.. but that happens...)
I have something in common with Stephen Hawking...
Yeah, but they thought they could save a few bucks and got the Gateway consumer version.
."
"Oh, you've got the on-site warranty, huh? Ok, first thing you have to do is ship it to South Dakota. .
Oh, hey, looks just like Mars.
KFG
"The outcome strikes me as an extremely Lucky Hack..."
The outcome does not strike me as a "Lucky Hack." They made the system flexible, that flexibility got them into some trouble, and it's also what got them out of it. Anyone else agree?
My pet peeve when I'm doing remote troubleshooting is 'ifconfig eth0 down'...oops. At least NASA is smarter than that.
Peter.
You know what I hate? Wait, what do you like? I hate that!
Your post is the only thing that strikes me as a "Lucky Hack" here. They included the ability in the design to remotely disable booting from flash and upload new boot images, in what way is that a "hack"? All this is just foresight in design to include as many possible recovery modes as they could.
Basically, they rebooted from a recovery image (sent via radio) and then proceeded to do low-level fixes on Flash memory and they a chkdisk. If I do something similar via recovery disk or CD, I don't get a lot of people telling me that it was a "Lucky Hack" that I could boot off of CD!!!
"There is more worth loving than we have strength to love." - Brian Jay Stanley
Great article! This is just the sort of thing that has always impressed me about NASA and the JPL. Just when mere mortals might give it up and walk away, they figure out the problem. I can only imagine how wild the party must have been after they fixed Spirit, the scientists and engineers I've worked with in the pass could really put away the booze.
Seriously though, the key lessons to take away from this are.
1) Gather all of the clues you can.
2) Take those clues and build a model.
With luck and care, the model should get you closer to what may have gone wrong. And in this case it apparently did just that. Now that's geek cool!
BTW, I know that generally you want to prevent this sort of thing from happening. But in reality most software ships with bugs and launch windows to Mars are non-negotiable.
To the making of books there is no end, so let's get started
Actually I remember NASA doing a hardware repair from most of the way across the solar system. One of the deep space probes was starting to have a problem sending signals, some bright mind at NASA looked at the circuit diagram and figured out that a single component (resistor, cap, can't remember) was starting to fail, they figured out that there was a way to recondition the part. So they came up with a program that basically intentionally overstressed that component path and the extra energy heated up the part an reconditioned it so that the unit was back to working condition.
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Operating System not found. Press any key to continue.
Damn! Left the floppy in!
To me, if this were a Unix-like system, it sounds like they ran out of inodes. Running out of inodes is very different than running out of disk space.
If you think runing out of disk space can be hard to trouble shoot, try running out of inodes.
Score: -1, Didn't Read Article
The rovers were extensively tested before launch. For example, NASA took about 100000 pictures with the test panoramic cameras under varying conditions to see how they would react. NASA put a test rover on a tilting platform to see how far over the rover tilt before it capsized, to find out at what angle the electric motors could no longer drive the rover up a hill, etc.
This limitation of the filesystem was known about ahead of time. If you had read the article, you'd have known that. They had a utility to clean out the rover's filesystem, but a storm at the Deep Space Network site that was supposed to transmit it prevented the second half of the utility from being uploaded to the rover. And before you say anything else, the article also mentioned that the people involved had thought of this possibility ahead of time.
That must have been some feat to get the arm on the rover to press Ctrl, Alt and Delete at the same time!
You realize that the onboard computer is basically the same one as used on the Mars Pathfinder lander, right? Same CPU, same amount of RAM, even the same OS. I wouldn't be surprised if they used the same (or similar) circuit diagrams for certain things.
The point is to use well known and well tested hardware. The whole point of Mars Pathfinder was to develop a system whose design could be re-used for other Mars landers and rovers.
Lastly, what exactly are you going to do with greater flash capacity? The point of having any flash memory on the rovers at all is not for long term storage, but rather just to hold onto data until it can be transmitted to Earth, after which it gets deleted.
Despite what some idiot posted a few posts up, they did NOT run out of room on the flash drive. Rather, the problem is more akin to running out of i-nodes. Mounting the flash filesystem, reading all its metadata and whatnot, took up more RAM than was allocated for it, due to the high number of files it had to deal with (most of which were accumulated on the way to Mars, and were going to be deleted).
Software verification is essentially mathematically proving the software....
I've been hearing how great formal verification is since I started this gig. Three decades later, it's still not what Yourdon and his buddies thought it would be. When the first computer scientists were budded from mathematics departments, their mathematical discipline allowed them to do wonderful things, some of which we're still catching up with. But it also gave them some disturbing habits, the worst of which is the insistence that formal verification is the best way to write code, and anyone not doing so must be a fool.
Formal verification is a powerful tool, but as you say, it is expensive and applies to only a limited set of problems. If it were so cheap and so widely applicable, we'd be using it everywhere.
We've poured decades of funding into formal verification, but the useful tools keep coming from other avenues of research. I think it's time to stop beating the formal verification drum.
If you're really worried about your remote server being unreachable, here's what I would suggest doing:
Have a hardware watchdog. If the machine is lost or confused, it reboots itself.
Have it come up in a known state, fire off a few broadcast packets to the sysadmins, and run sshd but basically nothing else. Stay there for a minute or so.
If nobody's tried to log in and halt the boot process, carry on booting. With luck the problem was transient. Worst case the problem still exists, you reboot, and the admins get another chance to log in.
From the description of how they got Spirit back, it looks like this is exactly how it was set up.
Who'da thunk it!!
455fe10422ca29c4933f95052b792ab2