Slashdot Mirror


Debugging The Spirit Rover

icebike writes "EE Times has a story on how the Mars Rover was essentially reprogrammed from millions of miles away. 'How do you diagnose an embedded system that has rendered itself unobservable? That was the riddle a Jet Propulsion Laboratory team had to solve when the Mars rover Spirit capped a successful landing on the Martian surface with a sequence of stunning images and then, darkness.' The outcome strikes me as an extremely Lucky Hack, and the rover could just as likely have been lost forever. Are there lessons here that we can use on the third rock for recovering the messed-up machines we manage from afar via ssh?"

29 of 390 comments

  1. Space Technology by superpulpsicle · · Score: 5, Insightful

    That's the thing that amazes me. Any technology having to do with space seems that much more advanced.

    Here on earth we can't even build cars that require no maintenance and last more than 10 years.

    1. Re:Space Technology by beeplet · · Score: 5, Insightful

      Actually any technology making it into space is more likely to be 10 years out of date... Getting anything certified for space is a long process. The technology in space isn't more advanced, just much better documented and well-understood.

    2. Re:Space Technology by Billly+Gates · · Score: 4, Insightful

      The Japanese started that.

      They make a lot of money from loyal customers. But I admit my 13-year-old '91 Honda Civic with 140k miles is getting on my nerves with repair costs. Would a '91 Ford Escort still be running today? I think not.

      I will buy only Toyotas and Hondas for that reason.

      It amazes me that consumers are too stupid to read Consumer Reports and instead buy cars on looks. Repair costs for things like Cadillacs and BMWs are not cheap for TCO! Yes, consumer products have TCO too, and we, not just businesses, should look at it as well.

    3. Re:Space Technology by kfg · · Score: 5, Insightful

      Ten years out of date, but ten years more reliable for the effort.

      Sort of like Debian.

      Cutting edge ain't always what it's cracked up to be.

      KFG

  2. What's the big deal?? by prakslash · · Score: 4, Insightful
    Unless you are a lay person, I don't understand what the big deal is.

    If it was the hardware that got fried and they miraculously fixed that, I would understand but this was just a software glitch.

    I routinely reboot and reprogram machines in our data-center that is 2000 miles away from me.

    As long as all hardware components are working and there is connectivity to the machine, it doesn't matter whether the machine is a few miles away or a million miles away.

    1. Re:What's the big deal?? by dellis78741 · · Score: 4, Insightful

      The tricky part here was that the 'hardware connectivity' depended on 'software functionality'. Try maintaining a machine a block away if the communication link requires both ends to point a satellite dish at an orbiting satellite and that pointing relies on software functioning correctly.

      --
      ======= ~\_/~\_O Burmese
    2. Re:What's the big deal?? by FTL · · Score: 4, Insightful
      > I routinely reboot and reprogram machines in our data-center that is 2000 miles away from me.

      > As long as all hardware components are working and there is connectivity to the machine, it doesn't matter whether the machine is a few miles away or a million miles away.

      There are some fundamental differences, my friend:

      • If you screw up leaving the computer unbootable, you get local tech support to check the console and fix it. NASA on the other hand doesn't have tech support on Mars.
      • If you hose the server, it means a day's worth of reinstallation. If NASA hoses their rover, they just lost $300,000,000.
      • You can poke around the system and see what's wrong. NASA has a harder time since their lag time is 20 minutes.
      • You can download core dumps; NASA were operating on the low-bandwidth antenna, which meant looking at file sizes, time stamps, selected lines, but not file contents.
      • You have your boss breathing down your neck (hoping for success); NASA have the international media breathing down their necks (hoping for a disaster).
      --
      Slashdot monitor for your Mozilla sidebar or Active Desktop.
    3. Re:What's the big deal?? by updog · · Score: 4, Insightful
      There is a big difference between this, and your example of forcing a controlled reboot of your remote machines.

      Spirit was in a constant reboot cycle, and the fact that they could even communicate with it long enough to bypass the problem was an accomplishment (and lucky).

      It would be more similar to your remote data-center machine suddenly going offline and you have no idea why, and you are unable to ssh to it, and you fix it by running through potential scenarios and finding that the problem could have been due to mounting a certain partition, then discovering that there's an exploit in ICMP that allows you to hack the kernel so it doesn't mount that partition.

    4. Re:What's the big deal?? by amRadioHed · · Score: 4, Insightful

      Are you forgetting that the latency when communicating with Mars averages around 1200000 ms? I'd say that when you have to wait 20 minutes to see the result of anything you do, you're going to have to substantially change your debugging strategy.
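
      The arithmetic behind that figure can be sketched as below. This is only an illustration: the distances are example inputs rather than numbers from the article, and the helper names are made up here. The actual Earth-Mars separation changes continuously, so the delay does too.

```python
# Illustrative light-time arithmetic. The distances passed in are example
# inputs (assumptions), not figures taken from the article.
SPEED_OF_LIGHT_KM_S = 299_792.458   # exact, by definition of the metre
KM_PER_AU = 149_597_870.7           # IAU definition of the astronomical unit


def one_way_delay_s(distance_au: float) -> float:
    """Seconds for a radio signal to cover distance_au astronomical units."""
    return distance_au * KM_PER_AU / SPEED_OF_LIGHT_KM_S


def round_trip_minutes(distance_au: float) -> float:
    """Minutes between sending a command and seeing its effect come back."""
    return 2 * one_way_delay_s(distance_au) / 60


# The 1200000 ms quoted in the comment, converted to minutes.
quoted_latency_min = 1_200_000 / 1000 / 60

if __name__ == "__main__":
    for au in (0.5, 1.0, 1.2, 2.5):
        print(f"{au:.1f} AU: {round_trip_minutes(au):.1f} min round trip")
```

      Under these definitions a separation of about 1.2 AU gives a round trip of roughly 20 minutes, which matches the 1200000 ms above if that figure is read as a round-trip time.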

      --
      We hope your rules and wisdom choke you / Now we are one in everlasting peace
    5. Re:What's the big deal?? by NymblZ · · Score: 3, Insightful

      > As long as all hardware components are working and there is connectivity to the machine, it doesn't matter whether the machine is a few miles away or a million miles away.

      That's just it - consider the stress those rovers are enduring or might encounter: subzero temperatures down to -200°F, out-of-the-blue (red?) sandstorms, gamma radiation, and who knows what else out there that could suddenly fsck with the systems or scramble internal data? Your average Dell rack will never have to deal with any of those things.

      --
      -- NymblZ
      Ignorance is a sty in the mind's eye
    6. Re:What's the big deal?? by cookiepus · · Score: 4, Insightful

      > I'd say that when you have to wait 20 minutes to see the result of anything you do you're going to have to substantially change your debugging strategy.

      Please! Back in the day people would write programs on paper, mail them in an envelope to a computing center somewhere, and get results weeks later.

      THAT was pressure not to fuck up.

    7. Re:What's the big deal?? by jelle · · Score: 3, Insightful

      But at NASA, you have a local replica of the whole system sitting in the lab next door, you're in a team of professionals that if necessary can calculate the most probable results of particular radiation hitting your system under a given angle, or can tell you the power usage and temperature effect of the system components given a particular subroutine, or can dream low-level correct assembly for the platform under study, plus the vendor has a couple of on-line support guys sitting in chairs in the corner of your office waiting for your activation command (which is the word "huh?")...

      --
      --- Hindsight is 20/20, but walking backwards is not the answer.
    8. Re:What's the big deal?? by Matrix9180 · · Score: 4, Insightful

      Did you RTFA? The rover was rebooting over and over because it was using up all of its memory... then eventually the batteries were low so it went into a sort of 'safe mode' where only the absolute minimum was loaded, and that's when NASA was able to communicate with it again...

      It was nothing like what you described, just a VERY well designed system (though it would have been somewhat better had the system been able to go straight to "safe mode" after the initial critical error (running out of memory))
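
      One common way to get the "go straight to safe mode" behaviour is a boot-failure counter. The sketch below is hypothetical and is not a description of the rover's actual fault protection; the names `choose_boot_mode`, `simulate`, and the limit of three failures are all assumptions made for illustration.

```python
# Hypothetical boot-failure counter: after too many consecutive failed
# boots, stop retrying the normal image and fall back to a minimal one.
SAFE_MODE = "safe"
NORMAL_MODE = "normal"


def choose_boot_mode(consecutive_failed_boots: int, max_failures: int = 3) -> str:
    """Pick the boot mode from the number of failed boots in a row."""
    if consecutive_failed_boots >= max_failures:
        return SAFE_MODE
    return NORMAL_MODE


def simulate(boot_succeeds, max_failures: int = 3, max_boots: int = 10) -> list:
    """Run boots until one succeeds; boot_succeeds(mode) returns a bool.

    Returns the list of modes that were attempted, in order.
    """
    failures = 0
    history = []
    for _ in range(max_boots):
        mode = choose_boot_mode(failures, max_failures)
        history.append(mode)
        if boot_succeeds(mode):
            break
        failures += 1
    return history


if __name__ == "__main__":
    # A system whose normal image always crashes but whose safe image works.
    print(simulate(lambda mode: mode == SAFE_MODE))
```

      On real hardware the counter would have to live somewhere that survives a reset, and it would be cleared only after a boot is known to be healthy.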

      Did the people with mod points RTFA? Score 5 Insightful?

      And no, I'm not new to /. ;)

      --
      120chars for a sig is teh suck
  3. The proper fix... by Dan+East · · Score: 3, Insightful

    ...would have been to have "fixed" the problem before the hardware left earth. This "bug" (or more accurately, known limitation of the filesystem) should have been discovered here on earth if the rover had been properly tested.

    The only real bug was the inability of the system to properly handle running out of file entries (or more specifically, consuming too much RAM as the number of file entries increased). However, the software should never have stressed the filesystem to that degree in the first place.

    Dan East

    --
    Better known as 318230.
  4. Hindsight by FTL · · Score: 5, Insightful
    The article (I know, I know, this is Slashdot) is really good. It contains everything that is missing from traditional media. The story, the background, technical details, and follow through.

    Granted, mainstream media have to keep their coverage dumbed down if Joe Public is going to read it. But what really bugs me is the lack of follow-up. We hear about poorly understood events as they are unfolding, then never hear about them later when they are completely understood.

    A recent example is the gangway between ship and shore at the QM2's drydock. It collapsed killing lots of people, an investigation was launched. Why did it collapse? At the time it wasn't known. I'm sure it's known now, but there's been absolutely no followup.

    This article about the rover is great not so much because of the level of detail but because it reports on an event with the benefit of hindsight.

    --
    Slashdot monitor for your Mozilla sidebar or Active Desktop.
  5. Re:do they use SSH ? by mcbridematt · · Score: 5, Insightful

    I don't think they would bother using anything to do with TCP. Anything you do send you will have to wait 9 minutes for. Just imagine the ping times:

    Pinging mars-rover with 32 bytes of data:
    request timed out
    request timed out
    request timed out
    64 bytes from mars-rover: icmp_seq=0 ttl=64 time=32400ms :(

    If it has anything to do with current internet protocols, it would be UDP.

  6. What the article doesn't say by Mr2cents · · Score: 4, Insightful

    What filesystem is used? Is wear leveling being used? The directory structure is apparently stored in RAM during the day (why else would it use so much RAM?), that is a good thing for reducing wear on the flash system. But what's the number of writes on the flash chips? When will that number be reached?

    --
    "It's too bad that stupidity isn't painful." - Anton LaVey
  7. Lucky Hack? by electromaggot · · Score: 5, Insightful

    "The outcome strikes me as an extremely Lucky Hack..."

    The outcome does not strike me as a "Lucky Hack." They made the system flexible, that flexibility got them into some trouble, and it's also what got them out of it. Anyone else agree?

  8. Lucky Hack? by SuperKendall · · Score: 5, Insightful

    Your post is the only thing that strikes me as a "Lucky Hack" here. They included the ability in the design to remotely disable booting from flash and upload new boot images, in what way is that a "hack"? All this is just foresight in design to include as many possible recovery modes as they could.

    Basically, they rebooted from a recovery image (sent via radio) and then proceeded to do low-level fixes on Flash memory and then a chkdisk. If I do something similar via recovery disk or CD, I don't get a lot of people telling me that it was a "Lucky Hack" that I could boot off of CD!!!

    --
    "There is more worth loving than we have strength to love." - Brian Jay Stanley
  9. There is a significant lesson to learn, here .. by Anonymous Coward · · Score: 3, Insightful

    .. namely, "Do Not Use VxWorks". Use something stable instead. eCos comes to mind. So does everyone's favorite OS these days, which has RTOS support. Having been a frustrated VxWorks user in the past, I'd no more entrust my mission-critical services to it than I would to Microsoft. -- TTK

  10. Great trick for ssh administration by nsayer · · Score: 4, Insightful

    Before doing something risky, type this:

    sleep 600 && reboot &

    Now if your risky maneuver makes the ssh session unusable, just wait 10 minutes (600 seconds) for the machine to reboot.

    This is great for fiddling with firewalls by remote control... through the firewall. :-)

    Oh... You say you're not using a POSIX-like system? That's not supported. Sorry. :-)
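
    The same dead-man's-switch idea can be sketched in a script. This is a minimal illustration under assumptions: the class name `DeadMansSwitch` and the `demo` helper are invented here, and the expiry action is a harmless stand-in for a reboot or rollback. Note that with the shell one-liner, as with this sketch, a successful maneuver still has to cancel the pending action before the timer runs out.

```python
import threading


class DeadMansSwitch:
    """Run on_expire after timeout_s seconds unless confirm() is called first."""

    def __init__(self, timeout_s, on_expire):
        self._timer = threading.Timer(timeout_s, on_expire)
        self._timer.daemon = True

    def arm(self):
        """Start the countdown and return self so calls can be chained."""
        self._timer.start()
        return self

    def confirm(self):
        """The risky change worked and we still have access: cancel the action."""
        self._timer.cancel()

    def wait(self):
        """Block until the timer has either fired or been cancelled."""
        self._timer.join()


def demo(confirm_in_time: bool) -> list:
    """Return the actions that fired; a stand-in for 'reboot' or 'rollback'."""
    fired = []
    switch = DeadMansSwitch(0.05, lambda: fired.append("reboot")).arm()
    if confirm_in_time:
        switch.confirm()
    switch.wait()
    return fired


if __name__ == "__main__":
    print(demo(confirm_in_time=True))    # operator confirmed, nothing fires
    print(demo(confirm_in_time=False))   # session lost, fallback action fires
```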

  11. Re:Oh, sure... by JWSmythe · · Score: 4, Insightful

    It sounded like the same type questions non-technical bosses always ask about technical matters.

    "We're ordering this brand new hardware that you've never tested before. Can you guarantee it will never crash?"

    "Will this database server handle the load of our brand new project?" (without an accurate growth estimate)

    "A server 2000 miles away just went down. What happened?" (no ping, no nothing) Hmmm.. Power/NIC/CPU/CPU fan/hard disks?

    It really sounds like they did some decent advance planning on those probes, but from other stories I read, they were shooting for 90 days of reliability, which in itself was a hard one to do. What if it turns the antenna the wrong way and loses connectivity? What if it gets hit by lightning? What if it falls in a hole? (go Beagle!)

    Sure, relate this to your web server colocated somewhere you're not. Cross your fingers, hold your breath, and hope there aren't a few fatal systems failures, or a bit of human error. I've been responsible for a bit of that in the past, but at least my equipment wasn't a few million miles away.

    --
    Serious? Seriousness is well above my pay grade.
  12. They didn't just randomly delete stuff by enosys · · Score: 4, Insightful
    From the article:

    > Using the low-level commands, about a thousand files and their directories -- the leftovers from the initial launch load -- were removed.

    I think that means they deleted the useless stuff they wanted to delete anyways but didn't get to delete before the crash. I also remember news about science data from before the crash that was received after they got the rover working again.

    As for how critical it is, well yeah, it seems the rover didn't need the contents of the flash file system. The operating system and other software was in the same flash memory but I assume that any sane designer would put in some hardware write protect interlock that's not easy to defeat accidentally.

  13. What we can learn: by sakusha · · Score: 4, Insightful
    It appears that we still haven't learned the biggest lesson of all. I still remember back around 1970, there was a big sign on the wall next to the IBM 370s at my university, written on a primitive pen plotter, it said:

    Computers never make mistakes, they do exactly what humans tell them to do. All "computer errors" are human errors.

  14. Re:do they use SSH ? by Anonymous Coward · · Score: 4, Insightful

    UDP would be even worse. Interplanetary transmission is difficult, so some packet loss is likely. Under UDP the packets would just disappear; it's an unreliable protocol. TCP would of course be too inefficient. I'd expect them to use a custom protocol designed for the specific application, since their situation is totally unlike anything you'll face on Earth.
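
    A rough back-of-the-envelope for why a windowed protocol like TCP struggles at these delays: a sender can have at most one window of data in flight per round trip. The sketch below assumes the classic 65,535-byte window (no window-scale option), a 100 ms terrestrial round trip, and the roughly 20-minute Mars round trip quoted elsewhere in this thread; the function and constant names are made up for illustration.

```python
def max_window_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound for a sliding-window sender: one full window per round trip."""
    return window_bytes * 8 / rtt_s


CLASSIC_TCP_WINDOW = 65_535   # largest window without the window-scale option
EARTH_RTT_S = 0.1             # assumed 100 ms terrestrial round trip
MARS_RTT_S = 1200.0           # the ~20 minute round trip quoted in this thread

earth_bps = max_window_throughput_bps(CLASSIC_TCP_WINDOW, EARTH_RTT_S)
mars_bps = max_window_throughput_bps(CLASSIC_TCP_WINDOW, MARS_RTT_S)

if __name__ == "__main__":
    print(f"Earth link ceiling: {earth_bps:,.0f} bit/s")
    print(f"Mars link ceiling:  {mars_bps:,.1f} bit/s")
```

    This ignores loss and retransmission entirely, and the connection handshake alone would cost a full round trip before any data moved.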

  15. Re:WindRiver's fault by KewlPC · · Score: 3, Insightful

    WindRiver may give JPL large discounts, but I doubt that's the only reason VxWorks is running on the MERs.

    Years ago, when JPL was designing the Mars Pathfinder mission, they asked Wind River to do an "affordable" port of VxWorks to the RAD6000 (a radiation-hardened RS6000), and they agreed. Since the computers on the two MERs are very similar to the computer on the Mars Pathfinder lander, it makes sense that they'd use the same OS that they used on the MPF lander.

    I would think the fact that JPL knows VxWorks very well by now would be a major factor in deciding to use VxWorks for the MERs.

  16. Hmmmm by ziggy_zero · · Score: 3, Insightful

    "The irony of it was that the operating system was doing exactly what we'd told it to do"

    Funny, that's how it was explained to me by my computer science teacher my freshman year in high school. He said, "The problem with computers is that they do exactly what we tell them to."

    --
    I belong to the ______ generation.
  17. Logging should not be limited ? by thrill12 · · Score: 3, Insightful

    Seriously, from a developer viewpoint, that is all wrong.
    I have worked on projects in which there was simply so much logging going on that you couldn't tell head from toe anymore. When a problem arose, scanning the logfiles proved very cumbersome indeed. Every developer had his own stuff logged, which sometimes proved interesting, sometimes proved utter crap (no one wants to know variable XYZ is increased by 1 for 24943 times).

    You should develop a well-thought-out logging strategy that increases the logging verbosity on a problem basis, not simply log everything that happens and hope you get some useful information.
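
    One way to get verbosity "on a problem basis" is to buffer the detailed records in memory and write them out only when something actually goes wrong. Here is a minimal sketch using Python's standard `logging.handlers.MemoryHandler`; the `ListHandler` sink and the logger name `rover.demo` are stand-ins invented for this example.

```python
import logging
import logging.handlers


class ListHandler(logging.Handler):
    """Stand-in sink that keeps formatted messages in a list."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record.getMessage())


def make_problem_logger(name, target, capacity=1000):
    """Buffer every record; pass them to target only once an ERROR arrives."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    buffered = logging.handlers.MemoryHandler(
        capacity, flushLevel=logging.ERROR, target=target
    )
    logger.addHandler(buffered)
    return logger


sink = ListHandler()
log = make_problem_logger("rover.demo", sink)

for step in range(3):
    log.debug("step %d ok", step)
quiet_count = len(sink.records)    # nothing written while all is well

log.error("flash mount failed")
loud_count = len(sink.records)     # the error plus the buffered context
```

    One caveat: `MemoryHandler` also flushes when the buffer reaches `capacity`, so it is not a ring buffer. A setup that must drop old detail rather than write it would need a custom handler.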

    --
    Slashdot: stuff for news, nerds that matter, matter for news, stuff that nerd
  18. Except just one thing: by Chemisor · · Score: 3, Insightful

    > What on earth (or on Mars) could we possibly take away from this experience?

    Rule 3: Never ignore the return value from open.
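
    In C the rule is about checking for -1. In Python, `open` raises `OSError` instead of returning an error code, so the equivalent discipline is to handle that exception deliberately rather than swallow it. The sketch below is only an illustration; the `open_log` helper and the `fallback.log` name are invented here.

```python
import os
import tempfile


def open_log(preferred_path, fallback_dir=None):
    """Open a log for appending; on failure, fall back and record why.

    Returns (file_object, path_used). A failure of the fallback itself
    is allowed to propagate, since there is nothing sensible left to do.
    """
    try:
        return open(preferred_path, "a"), preferred_path
    except OSError as err:
        fallback_dir = fallback_dir or tempfile.gettempdir()
        fallback_path = os.path.join(fallback_dir, "fallback.log")
        handle = open(fallback_path, "a")
        handle.write(f"could not open {preferred_path}: {err.strerror}\n")
        handle.flush()
        return handle, fallback_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        bad_path = os.path.join(scratch, "no_such_dir", "app.log")
        handle, used = open_log(bad_path, fallback_dir=scratch)
        handle.close()
        print("logging to", used)
```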