Debugging The Spirit Rover

Posted by timothy on Saturday February 21, 2004 @05:43PM from the at-a-distance dept.

icebike writes "eeTimes has a story on how the Mars Rover was essentially reprogrammed from millions of miles away. 'How do you diagnose an embedded system that has rendered itself unobservable? That was the riddle a Jet Propulsion Laboratory team had to solve when the Mars rover Spirit capped a successful landing on the Martian surface with a sequence of stunning images and then, darkness.' The outcome strikes me as an extremely Lucky Hack, and the rover could have just as likely been lost forever. Are there lessons here that we can use here on the third rock for recovery of our messed up machines which we manage from afar via ssh?"

28 of 390 comments

Oh, sure... by inertia187 · 2004-02-21 17:43 · Score: 5, Funny

Are there lessons here that we can use here on the third rock for recovery of our messed up machines which we manage from afar via ssh?

As a former co-worker (hi, jwalker!) used to say when people tried to draw ridiculous analogies, "It's exactly like that...only different."

--
A programmer is a machine for converting coffee into code.
Space Technology by superpulpsicle · 2004-02-21 17:49 · Score: 5, Insightful

That's the thing that amazes me. Any technology having to do with space seems that much more advanced.

Here on earth we can't even build cars that require no maintenance and last more than 10 years.
1. Re:Space Technology by beeplet · 2004-02-21 17:57 · Score: 5, Insightful
  
  Actually any technology making it into space is more likely to be 10 years out of date... Getting anything certified for space is a long process. The technology in space isn't more advanced, just much better documented and well-understood.
2. Re:Space Technology by kfg · 2004-02-21 18:12 · Score: 5, Insightful
  
  Ten years out of date, but ten years more reliable for the effort.
  
  Sort of like Debian.
  
  Cutting edge ain't always what it's cracked up to be.
  
  KFG
do they use SSH ? by Anonymous Coward · 2004-02-21 17:50 · Score: 5, Funny

I hope they use SSH or something... who's to say that on a future mission some hax0r doesn't grab control of a space probe and have it send goatse.cx pics back??

All it takes is a transmitter out in the middle of nowhere in Africa or on some island... after all, the probe communicates using known frequencies. There may be problems picking up the return signal without an expensive antenna, I suppose. But then again, maybe some hax0r can build one cheaply and/or do what Captain Midnight did ( www.signaltonoise.net/library/captmidn.htm ).

I wouldn't worry about signal jamming though, as that will probably be discovered easily.
1. Re:do they use SSH ? by mcbridematt · 2004-02-21 17:59 · Score: 5, Insightful
  
  I don't think they would bother using anything to do with TCP. Anything you do send you will have to wait 9 minutes for. Just imagine the ping times:
  
  Pinging mars-rover with 32 bytes of data:
  request timed out
  request timed out
  request timed out
  64 bytes from mars-rover: icmp_seq=0 ttl=64 time=32400ms :(
  
  If it has anything to do with current internet protocols, it would be UDP.
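For a rough sense of the numbers, here is a minimal Python sketch of the light-time arithmetic. The function names are made up for illustration, and the two distances are round approximations of the closest and farthest Earth-Mars separations (roughly 55 million km and 400 million km), so treat the output as ballpark only.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # km per second, exact by definition


def one_way_delay_s(distance_km: float) -> float:
    """Seconds for a radio signal to cover distance_km in a vacuum."""
    return distance_km / SPEED_OF_LIGHT_KM_S


def round_trip_ms(distance_km: float) -> float:
    """Best-case 'ping' time in milliseconds, ignoring all processing."""
    return 2 * one_way_delay_s(distance_km) * 1000


if __name__ == "__main__":
    # Assumed round figures for the closest and farthest Earth-Mars distances.
    for label, km in [("closest", 55e6), ("farthest", 400e6)]:
        print(f"{label}: one way {one_way_delay_s(km) / 60:.1f} min, "
              f"round trip {round_trip_ms(km):.0f} ms")
```

With these figures the one-way delay lands between about 3 and about 22 minutes. A nine-minute one-way wait would make the round trip about 1,080,000 ms, which is why any protocol that waits for acknowledgements gets painful.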
Pissed Martians by Tablizer · 2004-02-21 17:53 · Score: 5, Funny

The Martians are pissed that the repair labor was outsourced to Earth.

--
Table-ized A.I.
Re:Local Debugging by srichand · 2004-02-21 17:56 · Score: 5, Funny

In other news stories, the Microsoft Corporation decided to sue NASA, apparently since the right to crash systems was only theirs. Not to be left behind, SCO insisted that the code that caused the failure was unethically copied from their source repositories. This has indeed caused a flutter in the space communities
Uh-oh by z0ink · 2004-02-21 17:57 · Score: 5, Funny

"We recognized early in the planning process that the flash file system had a limited capacity for files."

Sounds like NASA forgot to empty the rover's recycle bin. =)

--
Steal This Sig
Hindsight by FTL · 2004-02-21 17:57 · Score: 5, Insightful

The article (I know, I know, this is Slashdot) is really good. It contains everything that is missing from traditional media. The story, the background, technical details, and follow through.
Granted, mainstream media have to keep their coverage dumbed down if Joe Public is going to read it. But what really bugs me is the lack of follow-up. We hear about poorly understood events as they are unfolding, then never hear about them later when they are completely understood.
A recent example is the gangway between ship and shore at the QM2's drydock. It collapsed, killing lots of people, and an investigation was launched. Why did it collapse? At the time it wasn't known. I'm sure it's known now, but there's been absolutely no followup.
This article about the rover is great not so much because of the level of detail but because it reports on an event with the benefit of hindsight.

--
Slashdot monitor for your Mozilla sidebar or Active Desktop.
1. Re:Hindsight by Anonymous Coward · 2004-02-21 20:16 · Score: 5, Interesting
  
  I'm a journalism undergrad at a large university. One of the points I brought up with some of our administrators is that the innumeracy and scientific illiteracy of the graduates of our program is appalling. I think this is one reason why many important stories don't get reported accurately or in depth: the writers simply don't understand the story, and don't want to understand the story. They actually feel that math and science are somehow beneath them, and that the average reader doesn't need to be bothered with the facts. So we get vagueness instead of specifics in the articles we read.
  
  I suggested we allow j-students to substitute math or hard science minors in place of the foreign language requirement. Most graduates of college foreign language programs don't translate at a level any higher than Babelfish. It seems wasteful to force people to spend so much time learning a language that most will never use, when that time could be more productively spent introducing them to the languages of math and science, which they will undoubtedly use in the future. We'd get better reporting that way, and isn't that what going to j-school is all about? Science and technology are too important to our day-to-day lives and governance to be left to illiterates.
Re:What's the big deal?? by Gizzmonic · 2004-02-21 18:06 · Score: 5, Funny

I routinely reboot and reprogram machines in our data-center that is 2000 miles away from me.

As long as all hardware components are working and there is connectivity to the machine, it doesn't matter whether the machine is a few miles away or a million miles away.

You are too humble, friend. What you do routinely and without thinking is nothing less than a miracle of modern science. A miracle that you take part in every day. And because of men like you, we don't have to rely on the abacus anymore. We sent a Pentium to the Moon, and soon, Mars will be colonized by G5s. America salutes you, for all the things that you do.....

Like a rock! I was strong as I could be!

Ooooooohh! Like a rock!

--
(-1, Raw and Uncut is the only way to read)
Mod this "redundant" by Penguinshit · 2004-02-21 18:06 · Score: 5, Informative

'How do you diagnose an embedded system that has rendered itself unobservable?'

The way you do this is by having an exact duplicate of the remote system, so you can set up a test with conditions as close as possible to those under which the remote system is currently operating. You can then run a series of carefully controlled trial fixes to find the best one before trying it on the "live" system.

This is the way I set up all my production systems and, barring catastrophic hardware failure (self-immolating disks and a router which just folded when its power supply burped) I've had perfect uptime.

(well, ok.. there was that one time, late at night, when I typed "reboot" in the wrong window.. but that happens...)

--
I have something in common with Stephen Hawking...
Re:Remote debugging? by kfg · 2004-02-21 18:08 · Score: 5, Funny

Yeah, but they thought they could save a few bucks and got the Gateway consumer version.

"Oh, you've got the on-site warranty, huh? Ok, first thing you have to do is ship it to South Dakota. . ."

Oh, hey, looks just like Mars.

KFG
Lucky Hack? by electromaggot · 2004-02-21 18:11 · Score: 5, Insightful

"The outcome strikes me as an extremely Lucky Hack..."

The outcome does not strike me as a "Lucky Hack." They made the system flexible, that flexibility got them into some trouble, and it's also what got them out of it. Anyone else agree?
Remote debugging pet peeve by Peter+McC · 2004-02-21 18:14 · Score: 5, Funny

My pet peeve when I'm doing remote troubleshooting is 'ifconfig eth0 down'...oops. At least NASA is smarter than that.

Peter.
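One common guard against locking yourself out is to pair every risky change with an automatic rollback that fires unless someone confirms the change in time. The Python sketch below only illustrates that idea: `apply_with_rollback` and its arguments are hypothetical names, and the "interface" is a plain dictionary, not a real network call.

```python
import threading
from typing import Callable


def apply_with_rollback(apply: Callable[[], None],
                        revert: Callable[[], None],
                        confirmed: threading.Event,
                        timeout_s: float) -> bool:
    """Run apply(), then wait up to timeout_s for confirmation.

    Returns True if the change was confirmed and kept, False if it
    was reverted because nobody confirmed in time.
    """
    apply()
    if confirmed.wait(timeout_s):
        return True
    revert()
    return False


if __name__ == "__main__":
    state = {"eth0": "up"}           # stand-in for a real interface
    nobody_home = threading.Event()  # never set: the admin got locked out
    kept = apply_with_rollback(
        apply=lambda: state.update(eth0="down"),
        revert=lambda: state.update(eth0="up"),
        confirmed=nobody_home,
        timeout_s=0.1,
    )
    print(kept, state)
```

The confirmation has to arrive over the very link being changed, so a change that cuts the link can never be confirmed and is always undone.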

--
You know what I hate? Wait, what do you like? I hate that!
Lucky Hack? by SuperKendall · 2004-02-21 18:21 · Score: 5, Insightful

Your post is the only thing that strikes me as a "Lucky Hack" here. They included the ability in the design to remotely disable booting from flash and upload new boot images. In what way is that a "hack"? All this is just foresight in design to include as many possible recovery modes as they could.

Basically, they rebooted from a recovery image (sent via radio), then proceeded to do low-level fixes on Flash memory, and then ran a chkdisk. If I do something similar via recovery disk or CD, I don't get a lot of people telling me that it was a "Lucky Hack" that I could boot off of CD!!!

--
"There is more worth loving than we have strength to love." - Brian Jay Stanley
NASA Rocks! by blueZhift · 2004-02-21 18:23 · Score: 5, Interesting

Great article! This is just the sort of thing that has always impressed me about NASA and the JPL. Just when mere mortals might give it up and walk away, they figure out the problem. I can only imagine how wild the party must have been after they fixed Spirit; the scientists and engineers I've worked with in the past could really put away the booze.

Seriously though, the key lessons to take away from this are:

1) Gather all of the clues you can.

2) Take those clues and build a model.

With luck and care, the model should get you closer to what may have gone wrong. And in this case it apparently did just that. Now that's geek cool!

BTW, I know that generally you want to prevent this sort of thing from happening. But in reality most software ships with bugs and launch windows to Mars are non-negotiable.

--
To the making of books there is no end, so let's get started
Re:What's the big deal?? by afidel · 2004-02-21 18:27 · Score: 5, Interesting

Actually, I remember NASA doing a hardware repair from most of the way across the solar system. One of the deep space probes was starting to have a problem sending signals. Some bright mind at NASA looked at the circuit diagram and figured out that a single component (resistor, cap, can't remember) was starting to fail, and that there was a way to recondition the part. So they came up with a program that intentionally overstressed that component path, and the extra energy heated up the part and reconditioned it so that the unit was back in working condition.

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
whoops by usillyman · 2004-02-21 18:38 · Score: 5, Funny

Operating System not found. Press any key to continue.
Damn! Left the floppy in!
Ran out of INODES. No really. by dorko · 2004-02-21 18:51 · Score: 5, Informative

If you RTFA you will realize that I'm not lying in the least when I say that, effectively, they ran out of flash-based "disk" space!
Well, I did read the article and I wouldn't say it quite like that. The article says: "Spirit attempted to allocate more files than the RAM-based directory structure could accommodate." Furthermore, the article says that the low-level file manipulation commands "worked directly on the flash memory without mounting the volume or building the directory table in RAM."
To me, if this were a Unix-like system, it sounds like they ran out of inodes. Running out of inodes is very different than running out of disk space.
If you think running out of disk space can be hard to troubleshoot, try running out of inodes.
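On a Unix-like box the two conditions can be told apart by looking at free blocks and free inodes separately, which is what `df -i` shows next to plain `df`. Here is a small Python sketch of the same check using `os.statvfs`; the helper names `fs_usage` and `out_of_inodes` are made up for this example, and none of it says anything about how the rover's own file system reports usage.

```python
import os


def fs_usage(path: str = "/") -> dict:
    """Report block and inode usage for the filesystem holding path (Unix only)."""
    st = os.statvfs(path)
    return {
        "bytes_total": st.f_blocks * st.f_frsize,
        "bytes_free": st.f_bavail * st.f_frsize,  # space available to unprivileged users
        "inodes_total": st.f_files,
        "inodes_free": st.f_favail,               # inodes available to unprivileged users
    }


def out_of_inodes(usage: dict) -> bool:
    """True when there is still free space but no free inodes left."""
    return (usage["inodes_total"] > 0
            and usage["inodes_free"] == 0
            and usage["bytes_free"] > 0)


if __name__ == "__main__":
    usage = fs_usage("/")
    print(usage)
    print("out of inodes:", out_of_inodes(usage))
```

The `inodes_total > 0` guard is there because some filesystems report zero total inodes, in which case the check simply does not apply.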
Re:The proper fix... by KewlPC · 2004-02-21 19:15 · Score: 5, Informative

Score: -1, Didn't Read Article

The rovers were extensively tested before launch. For example, NASA took about 100,000 pictures with the test panoramic cameras under varying conditions to see how they would react. NASA put a test rover on a tilting platform to see how far over the rover could tilt before it capsized, to find out at what angle the electric motors could no longer drive the rover up a hill, etc.

This limitation of the filesystem was known about ahead of time. If you had read the article, you'd have known that. They had a utility to clean out the rover's filesystem, but a storm at the Deep Space Network site that was supposed to transmit it prevented the second half of the utility from being uploaded to the rover. And before you say anything else, the article also mentioned that the people involved had thought of this possibility ahead of time.
How'd they do it? by alwaystheretrading · 2004-02-21 19:28 · Score: 5, Funny

That must have been some feat to get the arm on the rover to press Ctrl, Alt and Delete at the same time!
1. Re:How'd they do it? by oohgodyeah · 2004-02-21 23:58 · Score: 5, Funny
  
  Maybe it's all lies and the Martians hit Ctrl+Alt+Del...
  
  --
  
  - OohGodYeah!
2. Re:How'd they do it? by rjamestaylor · 2004-02-22 05:45 · Score: 5, Funny
  
  Actually, a friend of mine is a system admin with JPL and he had to drive out to the San Bernardino soundstage where the rovers are being filmed and reboot the computer at 4AM. The funny thing is he left a tool chest and sleeping bag (he was using it to minimize footprints and body impression, not sleep on the job!) where the Opportunity rover was scheduled to peek over the horizon, and the ensuing photo of the tool chest / sleeping bag on the horizon had to be quickly -- and deftly, I must say -- explained away as being Opportunity's back shell and parachute.
  Just another day in the life of a sys admin!
  
  --
  -- @rjamestaylor on Ello
Re:only 120 megs ram? by KewlPC · 2004-02-21 20:05 · Score: 5, Informative

You realize that the onboard computer is basically the same one as used on the Mars Pathfinder lander, right? Same CPU, same amount of RAM, even the same OS. I wouldn't be surprised if they used the same (or similar) circuit diagrams for certain things.

The point is to use well known and well tested hardware. The whole point of Mars Pathfinder was to develop a system whose design could be re-used for other Mars landers and rovers.

Lastly, what exactly are you going to do with greater flash capacity? The point of having any flash memory on the rovers at all is not for long term storage, but rather just to hold onto data until it can be transmitted to Earth, after which it gets deleted.

Despite what some idiot posted a few posts up, they did NOT run out of room on the flash drive. Rather, the problem is more akin to running out of i-nodes. Mounting the flash filesystem, reading all its metadata and whatnot, took up more RAM than was allocated for it, due to the high number of files it had to deal with (most of which were accumulated on the way to Mars, and were going to be deleted).
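To make the distinction concrete, here is a toy Python model of that failure mode. Every number and name in it is invented for illustration (`mount`, `MountError`, 64 bytes per entry, a 64,000-byte budget); it is not how the rover's flight software or its file system is actually written. The only point is that the mount fails on file count, whatever the amount of free flash.

```python
class MountError(Exception):
    """Raised when the in-RAM directory table will not fit."""


def mount(file_count: int, bytes_per_entry: int, ram_budget: int) -> int:
    """Pretend to mount a volume by building its directory table in RAM.

    Returns the table size in bytes, or raises MountError if the table
    needs more RAM than was set aside for it.
    """
    table_bytes = file_count * bytes_per_entry
    if table_bytes > ram_budget:
        raise MountError(
            f"directory table needs {table_bytes} bytes, budget is {ram_budget}")
    return table_bytes


if __name__ == "__main__":
    BYTES_PER_ENTRY = 64     # made-up figure
    RAM_BUDGET = 64 * 1000   # made-up figure: room for exactly 1000 entries
    for files in (500, 1000, 1001):
        try:
            size = mount(files, BYTES_PER_ENTRY, RAM_BUDGET)
            print(files, "files ->", size, "bytes of table")
        except MountError as err:
            print(files, "files -> mount failed:", err)
```

In this model, deleting files is the fix, and it has to be done by something that does not need the mount to succeed first.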
Re:Verifying the software !!! by WayneConrad · 2004-02-21 20:50 · Score: 5, Interesting

Software verification is essentially mathematically proving the software....

I've been hearing how great formal verification is since I started this gig. Three decades later, it's still not what Yourdon and his buddies thought it would be. When the first computer scientists were budded from mathematics departments, their mathematical discipline allowed them to do wonderful things, some of which we're still catching up with. But it also gave them some disturbing habits, the worst of which is the insistence that formal verification is the best way to write code, and anyone not doing so must be a fool.

Formal verification is a powerful tool, but as you say, it is expensive and applies to only a limited set of problems. If it were so cheap and so widely applicable, we'd be using it everywhere.

We've poured decades of funding into formal verification, but the useful tools keep coming from other avenues of research. I think it's time to stop beating the formal verification drum.
Re:One reasonable analogy by zcat_NZ · 2004-02-21 21:15 · Score: 5, Informative

If you're really worried about your remote server being unreachable, here's what I would suggest doing:

Have a hardware watchdog. If the machine is lost or confused, it reboots itself.

Have it come up in a known state, fire off a few broadcast packets to the sysadmins, and run sshd but basically nothing else. Stay there for a minute or so.

If nobody's tried to log in and halt the boot process, carry on booting. With luck the problem was transient. Worst case the problem still exists, you reboot, and the admins get another chance to log in.

From the description of how they got Spirit back, it looks like this is exactly how it was set up.

Who'da thunk it!!
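The boot sequence in those steps can be sketched in a few lines of Python. This is only an outline under stated assumptions: the function name `boot`, its callbacks, and the 60-second default are hypothetical, and the hardware watchdog itself is not modeled, since its job is just to force a reboot that ends up calling something like this.

```python
import threading
from typing import Callable


def boot(start_minimal: Callable[[], None],
         notify_admins: Callable[[], None],
         admin_halt: threading.Event,
         continue_boot: Callable[[], None],
         window_s: float = 60.0) -> str:
    """Come up minimal, give admins a window to step in, else carry on.

    Returns "halted" if an admin intervened during the window,
    "booted" if the normal boot went ahead.
    """
    start_minimal()      # known-good state, e.g. sshd and nothing else
    notify_admins()      # e.g. broadcast a "rebooted, waiting" message
    if admin_halt.wait(window_s):
        return "halted"  # stay in the minimal state for manual repair
    continue_boot()      # nobody showed up; assume the fault was transient
    return "booted"


if __name__ == "__main__":
    log = []
    outcome = boot(
        start_minimal=lambda: log.append("minimal"),
        notify_admins=lambda: log.append("notified"),
        admin_halt=threading.Event(),  # never set: no admin logs in
        continue_boot=lambda: log.append("full boot"),
        window_s=0.1,
    )
    print(outcome, log)
```

If the full boot hangs again, the watchdog fires, the machine comes back through the same window, and the admins get another chance to set `admin_halt`.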

--
455fe10422ca29c4933f95052b792ab2