Slashdot Mirror


Debugging The Spirit Rover

icebike writes "EE Times has a story on how the Mars Rover was essentially reprogrammed from millions of miles away. 'How do you diagnose an embedded system that has rendered itself unobservable? That was the riddle a Jet Propulsion Laboratory team had to solve when the Mars rover Spirit capped a successful landing on the Martian surface with a sequence of stunning images and then, darkness.' The outcome strikes me as an extremely Lucky Hack, and the rover could just as likely have been lost forever. Are there lessons here that we can use on the third rock for recovery of our messed-up machines which we manage from afar via ssh?"

5 of 390 comments

  1. Mod this "redundant" by Penguinshit · · Score: 5, Informative


    'How do you diagnose an embedded system that has rendered itself unobservable?'

    The way you do this is by having an exact duplicate of the remote system, so you can set up a test under conditions as close as possible to those the remote system is currently operating in. You can then try a series of carefully controlled candidate fixes on the duplicate and pick the best one before trying it on the "live" system.

    This is the way I set up all my production systems and, barring catastrophic hardware failure (self-immolating disks and a router which just folded when its power supply burped) I've had perfect uptime.

    (well, ok.. there was that one time, late at night, when I typed "reboot" in the wrong window.. but that happens...)
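
    The "try it on the duplicate first" routine above can be sketched in a few lines. This is only an illustration: `choose_fix`, `make_replica` and `is_healthy` are hypothetical names, and the replica here is whatever object stands in for your duplicate system.

```python
def choose_fix(candidate_fixes, make_replica, is_healthy):
    """Try each candidate fix on a fresh replica of the failing system.

    Returns the first fix that leaves the replica healthy, or None if
    none of them work. The live system is never touched in here.
    """
    for fix in candidate_fixes:
        replica = make_replica()  # fresh copy of the broken state every time
        fix(replica)
        if is_healthy(replica):
            return fix
    return None


# Hypothetical stand-in for a duplicate system: too many files for its limit.
def make_replica():
    return {"files": 10, "limit": 5}

def do_nothing(system):
    pass

def purge_old_files(system):
    system["files"] = 2

best = choose_fix([do_nothing, purge_old_files], make_replica,
                  lambda s: s["files"] <= s["limit"])
```

    Only the fix that comes back from `choose_fix` would then be applied to the live machine.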

  2. Ran out of INODES. No really. by dorko · · Score: 5, Informative
    'If you RTFA you will realize that I'm not lying in the least when I say that, effectively, they ran out of flash-based "disk" space!'
    Well, I did read the article and I wouldn't say it quite like that. The article says: "Spirit attempted to allocate more files than the RAM-based directory structure could accommodate." Furthermore, the article says that the low-level file manipulation commands "worked directly on the flash memory without mounting the volume or building the directory table in RAM."

    To me, if this were a Unix-like system, it sounds like they ran out of inodes. Running out of inodes is very different from running out of disk space.

    If you think running out of disk space can be hard to troubleshoot, try running out of inodes.
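
    On a Unix-like box you can look at both numbers separately, which is roughly what `df` versus `df -i` shows. A minimal sketch using Python's `os.statvfs`; the helper name `fs_usage` is made up for illustration, and this says nothing about how the rover's own filesystem reports things.

```python
import os

def fs_usage(path="/"):
    """Return (fraction of blocks used, fraction of inodes used) for the
    filesystem that holds `path`.

    The inode figure is None when the filesystem reports no fixed inode
    count (some allocate inodes dynamically and report 0).
    """
    st = os.statvfs(path)
    blocks_used = 1 - st.f_bfree / st.f_blocks if st.f_blocks else 0.0
    inodes_used = 1 - st.f_ffree / st.f_files if st.f_files else None
    return blocks_used, inodes_used


if __name__ == "__main__":
    blocks, inodes = fs_usage("/")
    print(f"blocks used: {blocks:.1%}")
    print("inodes used: n/a" if inodes is None else f"inodes used: {inodes:.1%}")
```

    A volume can show plenty of free blocks while the inode figure sits at 100%, and at that point file creation fails even though "disk space" looks fine.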

  3. Re:The proper fix... by KewlPC · · Score: 5, Informative

    Score: -1, Didn't Read Article

    The rovers were extensively tested before launch. For example, NASA took about 100,000 pictures with the test panoramic cameras under varying conditions to see how they would react. NASA put a test rover on a tilting platform to see how far the rover could tilt before it capsized, to find out at what angle the electric motors could no longer drive the rover up a hill, etc.

    This limitation of the filesystem was known ahead of time. If you had read the article, you'd have known that. They had a utility to clean out the rover's filesystem, but a storm at the Deep Space Network site that was supposed to transmit it prevented the second half of the utility from being uploaded to the rover. And before you say anything else, the article also mentioned that the people involved had thought of this possibility ahead of time.

  4. Re:only 120 megs ram? by KewlPC · · Score: 5, Informative

    You realize that the onboard computer is basically the same one as used on the Mars Pathfinder lander, right? Same CPU, same amount of RAM, even the same OS. I wouldn't be surprised if they used the same (or similar) circuit diagrams for certain things.

    The point is to use well known and well tested hardware. The whole point of Mars Pathfinder was to develop a system whose design could be re-used for other Mars landers and rovers.

    Lastly, what exactly are you going to do with greater flash capacity? The point of having any flash memory on the rovers at all is not for long term storage, but rather just to hold onto data until it can be transmitted to Earth, after which it gets deleted.

    Despite what some idiot posted a few posts up, they did NOT run out of room on the flash drive. Rather, the problem is more akin to running out of i-nodes. Mounting the flash filesystem, reading all its metadata and whatnot, took up more RAM than was allocated for it, due to the high number of files it had to deal with (most of which were accumulated on the way to Mars, and were going to be deleted).
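
    To make the distinction concrete, here is a toy model (not the rover's actual filesystem code, and every number in it is invented): file data lives in flash, but each file also needs a slot in a fixed-size directory table kept in RAM.

```python
class ToyFlashVolume:
    """Toy model only: data goes to flash, but every file also costs one
    entry in a fixed-size directory table held in RAM."""

    def __init__(self, flash_bytes, max_dir_entries):
        self.flash_free = flash_bytes
        self.max_dir_entries = max_dir_entries  # hypothetical RAM budget
        self.files = {}                         # name -> size in bytes

    def create(self, name, size):
        # The directory table is checked first: it can fill up long
        # before the flash itself does.
        if len(self.files) >= self.max_dir_entries:
            raise MemoryError("directory table full")
        if size > self.flash_free:
            raise OSError("flash full")
        self.files[name] = size
        self.flash_free -= size

    def delete(self, name):
        # Deleting a file frees both its flash space and its table slot.
        self.flash_free += self.files.pop(name)


vol = ToyFlashVolume(flash_bytes=1_000_000, max_dir_entries=3)
for i in range(3):
    vol.create(f"cruise_{i}.dat", 100)

try:
    vol.create("new_image.dat", 100)
    failure = None
except MemoryError as exc:
    failure = str(exc)
```

    The fourth `create` fails with the table full while nearly all of the flash is still free, and deleting old files is what makes room again.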

  5. Re:One reasonable analogy by zcat_NZ · · Score: 5, Informative

    If you're really worried about your remote server being unreachable, here's what I would suggest doing:

    Have a hardware watchdog. If the machine is lost or confused, it reboots itself.

    Have it come up in a known state, fire off a few broadcast packets to the sysadmins, and run sshd but basically nothing else. Stay there for a minute or so.

    If nobody's tried to log in and halt the boot process, carry on booting. With luck the problem was transient. Worst case the problem still exists, you reboot, and the admins get another chance to log in.

    From the description of how they got Spirit back, it looks like this is exactly how it was set up.

    Who'da thunk it!!
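
    The boot sequence described above could look something like this. It is a sketch under assumptions: `staged_boot` and all four hooks are hypothetical names you would wire to your own init scripts, and the hardware watchdog itself sits outside this code (it is what resets the box if the full boot hangs again).

```python
import time

def staged_boot(notify_admins, start_sshd, admin_intervened, continue_boot,
                grace_seconds=60.0, poll_seconds=1.0,
                sleep=time.sleep, clock=time.monotonic):
    """Come up in a minimal state, give the admins a window, then carry on.

    Returns "held" if an admin stepped in during the window,
    "booted" if nobody did and the normal boot was continued.
    """
    notify_admins()   # e.g. fire off a few "I just rebooted" packets
    start_sshd()      # remote login only, basically nothing else yet
    deadline = clock() + grace_seconds
    while clock() < deadline:
        if admin_intervened():
            return "held"      # stay minimal so the admin can poke around
        sleep(poll_seconds)
    continue_boot()   # nobody logged in: hope the problem was transient
    return "booted"
```

    The `sleep` and `clock` parameters are only there so the waiting logic can be exercised with a fake clock instead of actually sitting around for a minute.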

    --
    455fe10422ca29c4933f95052b792ab2