Slashdot Mirror


Debugging The Spirit Rover

icebike writes "EE Times has a story on how the Mars Rover was essentially reprogrammed from millions of miles away. 'How do you diagnose an embedded system that has rendered itself unobservable? That was the riddle a Jet Propulsion Laboratory team had to solve when the Mars rover Spirit capped a successful landing on the Martian surface with a sequence of stunning images and then, darkness.' The outcome strikes me as an extremely Lucky Hack, and the rover could just as easily have been lost forever. Are there lessons here that we can use on the third rock to recover the messed-up machines we manage from afar via ssh?"


  1. rebooting on mars... by segment · · Score: 4, Interesting
    Interesting reading:

    Rebooting on Mars

    By Matthew Fordahl, The Associated Press

    It's a PC user's nightmare: You're almost done with a lengthy e-mail, or about to finish a report at the office, and the computer crashes for no apparent reason. It tries to restart but never quite finishes booting. Then it crashes again. And again.

    Getting caught in such a loop is frustrating enough on Earth. But imagine what it's like when the computer is 200 million miles away on Mars. That's what mission controllers faced when the Mars rover Spirit stopped communicating last month.

    ...

    Tech support for an $820 million mission is a cautious affair. Tools to recover from and fix any problem must be built into the system before launch. The systems' behaviors need to be completely understood and predictable.

    "Luckily, during the design period, we anticipated that we might get into a situation like this," said Glenn Reeves, who oversees the software aboard the Mars rovers Spirit and Opportunity at NASA's Jet Propulsion Laboratory.

    For stability, reliability and predictability, mission designers did not bust the budget and design the hardware or software from scratch. Instead, they turned to hardware and software that's been used in space before and has a proven track record on Earth as well.

    "The advantage of using commercial software is it's well-known, and it's well deployed," said Mike Deliman, an engineer at Alameda-based Wind River Systems Inc., which made the rovers' operating system. "It has been used throughout the world in hundreds of thousands of applications."

    The operating system, VxWorks, has its roots in software developed to help Francis Ford Coppola gain more control over a film editing system. But the developers, David Wilner and Jerry Fiddler, saw a greater potential and eventually formed Wind River, named for the mountains in Wyoming. VxWorks became a formal product in 1987.

    rest of article

  2. Re:Space Technology by kfg · · Score: 4, Interesting

    No. You can't make a mechanical device like a car that requires no maintenance. Bearings wear out. Hoses and belts have a limited lifespan even if you never drive the car, etc. This is the real world. We will obey the laws of thermodynamics. Entropy always wins.

    What you can do is make it require less maintenance, make that maintenance cheaper to perform, and make the car last until you hit something really hard, so long as you maintain it. You should be able to hand your car down to your kids.

    Other than that you're bang on though.

    I wonder what we can learn from that about maintaining our computers?

    KFG

  3. NASA Rocks! by blueZhift · · Score: 5, Interesting

    Great article! This is just the sort of thing that has always impressed me about NASA and the JPL. Just when mere mortals might give it up and walk away, they figure out the problem. I can only imagine how wild the party must have been after they fixed Spirit; the scientists and engineers I've worked with in the past could really put away the booze.

    Seriously though, the key lessons to take away from this are:

    1) Gather all of the clues you can.

    2) Take those clues and build a model.

    With luck and care, the model should get you closer to what may have gone wrong. And in this case it apparently did just that. Now that's geek cool!

    BTW, I know that generally you want to prevent this sort of thing from happening. But in reality most software ships with bugs and launch windows to Mars are non-negotiable.

    1. Re:NASA Rocks! by roskakori · · Score: 2, Interesting

      Seriously though, the key lessons to take away from this are:

      1) Gather all of the clues you can.

      2) Take those clues and build a model.

      you forgot this one:

      0) predict failure scenarios in the design phase, think them through, and design accordingly.

      when you read the article, you will notice that a lot of plans and tools already existed that allowed them to trace the problem. this is one of the major differences between armchair coding and reliability engineering.

  4. Remote safe mode by Megane · · Score: 3, Interesting
    The first thing needed to achieve remote maintainability on the order of space probes is some way to access a machine remotely when it's not running the full OS. A KVM switch isn't going to work over long distances. The BIOS needs a way to run over the network. Same for the kernel boot messages. Whether it's through a serial console and SSH server, or through the BIOS running TCP/IP, what we have now isn't enough. A separate console server could also control a power cycle/reset switch circuit.

    There also needs to be a way to load bootstrap code remotely. For instance, having a TCP/IP enabled BIOS be able to run TFTP or some other protocol to load a netboot floppy image. Then you could give it a LILO command instructing it where to find a boot image, preferably one on a server in the same hosting center.

  5. Re:What's the big deal?? by afidel · · Score: 5, Interesting

    Actually I remember NASA doing a hardware repair from most of the way across the solar system. One of the deep space probes was starting to have a problem sending signals. Some bright mind at NASA looked at the circuit diagram and figured out that a single component (resistor, cap, can't remember) was starting to fail, and that there was a way to recondition the part. So they came up with a program that intentionally overstressed that component path, and the extra energy heated up the part and reconditioned it so that the unit was back in working condition.

    --
    There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
  6. Re:What the article doesn't say by afidel · · Score: 4, Interesting

    Never. The rovers are only going to operate for ~100 days, and the write endurance of modern flash RAM is 100K cycles minimum, over a million typical. So unless they are really screwing something up, that shouldn't be a limitation. Also, distributing file placement shouldn't be a software function; good CF cards do it in the controller logic.
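
    A rough back-of-the-envelope version of that arithmetic, as a Python sketch. The figures are the ones quoted above; treating a mission day as an 86,400-second Earth day and hammering a single block with no wear leveling at all are simplifying assumptions, and the function names are made up for illustration.

```python
# Figures quoted in the comment above.
MISSION_DAYS = 100            # "~100 days"
MIN_CYCLES = 100_000          # "100K cycles minimum"
TYPICAL_CYCLES = 1_000_000    # "over a million typical"
SECONDS_PER_DAY = 86_400      # assumption: Earth days, not Martian sols


def rewrites_per_day(cycles, days=MISSION_DAYS):
    """Erase/write cycles one block could absorb per day before wearing out."""
    return cycles / days


def seconds_between_rewrites(cycles, days=MISSION_DAYS):
    """How often the same block could be rewritten, around the clock."""
    return SECONDS_PER_DAY / rewrites_per_day(cycles, days)


for label, cycles in (("minimum", MIN_CYCLES), ("typical", TYPICAL_CYCLES)):
    print(label, rewrites_per_day(cycles), seconds_between_rewrites(cycles))
```

    At the 100K minimum that works out to 1,000 rewrites of the same block per day, or one every 86.4 seconds for the whole mission, which is why endurance alone looks like an unlikely culprit.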

    --
    There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
  7. Could an earthbound 'twin' computer help? by AaronStJ · · Score: 4, Interesting

    What surprises me is that they don't have a 'twin' of the rover's computer system set up on earth. When commands are run on the rover, the same commands could be run on the computer system on earth. Then, if the rover's software fails (as it did), the software on earth would (theoretically) fail in a similar way, and be MUCH easier to debug. Of course, the systems wouldn't be identical (without building an entire duplicate and expensive rover), and the data gathered wouldn't be identical, but the twin could be carefully planned and fed dummy data that approximately mirrored the data the rover was gathering. For example, the twin could be fed dummy pictures about as often as the rover took a real picture.

    From the article "[The] transmission that uploaded the utility was a partial failure: Only one of the utility program's two parts was received successfully. The second part was not received, and so in accordance with the communications protocol it was scheduled for retransmission on sol 19." NASA could have simulated a half-failed transfer on the twin computer on earth, and then watched carefully using traditional debugging tools to make sure the failed transmission didn't cause a software failure (which it did).

    Again, from the article "The data management team's calculations had not made any provision for leftover directories from a previous load still sitting in the flash file system." However, if they had a twin computer system to watch, they would have seen the failure occur on earth as it did in space. Debugging a system you can hook a serial debugger to is bound to be much easier than debugging a system a million miles away.

    --
    Stupid like a fox!
  8. Does Microsoft know about this? by superyooser · · Score: 2, Interesting
    The operating system is Wind River Systems' Vx-Works version 5.3.1, used with its flash file system extension.

    First wxWindows, now Vx-works?

  9. Re:Space Technology by Kymermosst · · Score: 1, Interesting

    I have a 90 mustang 5.0 with 168k on the clock.

    I've got a '70 Mustang with 190K on the clock. Ran fine before I took the engine out. (It needed new head gaskets and the intake manifold was cracked, but it ran well).

    --
    "Alcohol, Tobacco, Firearms, and Explosives" should be a convenience store, not a government agency.
  10. Re:Hindsight by Anonymous Coward · · Score: 5, Interesting

    I'm a journalism undergrad at a large university. One of the points I brought up with some of our administrators is that the innumeracy and scientific illiteracy of the graduates of our program is appalling. I think this is one reason why many important stories don't get reported accurately or in depth: the writers simply don't understand the story, and don't want to understand the story. They actually feel that math and science are somehow beneath them, and that the average reader doesn't need to be bothered with the facts. So we get vagueness instead of specifics in the articles we read.

    I suggested we allow j-students to substitute math or hard science minors in place of the foreign language requirement. Most graduates of college foreign language programs don't translate at a level any higher than Babelfish. It seems wasteful to force people to spend so much time learning a language that most will never use, when that time could be more productively spent introducing them to the languages of math and science, which they will undoubtedly use in the future. We'd get better reporting that way, and isn't that what going to j-school is all about? Science and technology are too important to our day-to-day lives and governance to be left to illiterates.

  11. Re:OT:lots of mem of an embedded system by millette · · Score: 2, Interesting

    I'm serious. See http://physics.nist.gov/cuu/Units/binary.html for all the groovy details. If anything, it's a move away from the HD manufacturers' lingo.

  12. Re:Verifying the software !!! by jonastullus · · Score: 2, Interesting

    you are quite aware that software verification is far from being usable for any languages found in the wild?!
    first you need a model to verify your software against and as a matter of fact the model will again be written in some pseudo-language so that you not only double the workload but also introduce slight incompatibilities between the implementation language and the model language!
    and then somebody still has to prove that there are no errors in your model for which you will need a meta-model, etc, etc, etc.

    i'm not saying that verification is not practical or that it wouldn't be nice to have it, but there are obstacles that won't be solved for quite some years to come!

  13. Re:Verifying the software !!! by WayneConrad · · Score: 5, Interesting

    Software verification is essentially mathematically proving the software....

    I've been hearing how great formal verification is since I started this gig. Three decades later, it's still not what Yourdon and his buddies thought it would be. When the first computer scientists were budded from mathematics departments, their mathematical discipline allowed them to do wonderful things, some of which we're still catching up with. But it also gave them some disturbing habits, the worst of which is the insistence that formal verification is the best way to write code, and anyone not doing so must be a fool.

    Formal verification is a powerful tool, but as you say, it is expensive and applies to only a limited set of problems. If it were so cheap and so widely applicable, we'd be using it everywhere.

    We've poured decades of funding into formal verification, but the useful tools keep coming from other avenues of research. I think it's time to stop beating the formal verification drum.

  14. Re:What's the big deal?? by You're+All+Wrong · · Score: 3, Interesting

    There was also pressure not to drop your stack of punched cards in those days!

    (hint - draw a diagonal line across their top edges so you can get them in order again quickly.)

    Some people seem not to know why "batch" files were so-called.

    YAW.

    --
    Your head of state is a corrupt weasel, I hope you're happy.
  15. Re:What we can learn: by ColaMan · · Score: 2, Interesting

    Unless you own an early pentium.

    --

    You are in a twisty maze of processor lines, all alike.
    There is a lot of hype here.
  16. Re:What the article doesn't say by hyc · · Score: 2, Interesting

    Except that fully mirrored RAM would use way too much power, something that no space probe can afford.

    --
    -- *My* journal is more interesting than *yours*...
  17. Re:NASA should have simulated... by hyc · · Score: 2, Interesting

    One word: outsourcing.

    When I worked at JPL, every 6 months to a year there'd be talks of layoffs because the headcount was too high; people would leave and return to the same projects as contractors, then get a higher hourly wage for doing the same work with less accountability.

    The whole reason for that lost probe (feet vs meters, anyone?) was a political squabble between two teams (one JPL-internal, one outside contractors as I recall) who simply failed to cooperate productively. The whole management structure inside that world is screwed. People's project leads are not the same as their section/department leads, so the reporting chain is a mes{h,s}. Time and energy are wasted in contract(or) management, all in the name of "reduced costs" even though having all the work done in-house would eliminate a full layer or two of mid-level management waste.

    NASA/JPL are totally hamstrung by beancounters who think they're saving the public's money, but truly can't see the big picture, missing the forest for the trees. (Either that, or they *do* see the big picture, and are busily lining their own pockets with the excess that gets tossed around thru all the churn.)

    --
    -- *My* journal is more interesting than *yours*...
  18. Re:One reasonable analogy by Fallen_Knight · · Score: 3, Interesting

    considering the distance i'd say a while. a couple of hours doesn't make much difference when you've got a billion $$ probe on another planet; it surviving is more important than a fast boot time, heh. and you can always log in and tell it to continue booting

  19. Discovered a system log ? by thrill12 · · Score: 3, Interesting

    "We discovered a system log in which the problem was documented,"
    Those guys are running a very expensive experiment, are logging it and they have no idea what and where they are logging??

    --
    Slashdot: stuff for news, nerds that matter, matter for news, stuff that nerd
  20. Good posts! by electromaggot · · Score: 2, Interesting

    ...and I'm not saying that just because we agree. Yours are good additional insights (hence your "insightful" mods up! :-)

    I agree with the reply-post below too, saying that if they'd made their system a bit more fault-tolerant, then the problem might have been more easily recovered from. Sixty reboots in a row in a day seems a little excessive! Don't they have counters to detect that very thing? Don't they have a failsafe/debug OS burned into ROM (not flash) to load automatically in just such an event? Such are the risks when you're reloading a whole new OS remotely!
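
    The kind of reboot counter asked about above could look roughly like the Python sketch below. It is purely illustrative and not a description of the rover's actual flight software: the threshold, the image names, and the `state` dictionary (standing in for a counter kept in storage that survives resets) are all assumptions.

```python
MAX_FAILED_BOOTS = 3  # hypothetical threshold before giving up on the normal image


def choose_boot_image(failed_boots):
    """Pick the minimal failsafe image once too many boots in a row have failed."""
    return "failsafe" if failed_boots >= MAX_FAILED_BOOTS else "normal"


def boot_cycle(state, boot_ok):
    """Run one boot attempt; `state` stands in for storage that survives resets."""
    image = choose_boot_image(state["failed_boots"])
    if image == "failsafe":
        # Stay in the minimal image until an operator clears the counter.
        return image
    if boot_ok:
        state["failed_boots"] = 0
    else:
        state["failed_boots"] += 1
    return image


state = {"failed_boots": 0}
history = [boot_cycle(state, boot_ok=False) for _ in range(5)]
print(history)
```

    With every boot failing, the first three attempts use the normal image and every attempt after that drops to the failsafe one, instead of looping sixty times.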

    However, maybe they do have such things, or equivalent. I don't think their method of recovery was "accidental" (or a hack) either, although I'm making assumptions and I haven't seen their spec. The key is that they recovered from the error... and I now assume that they have recovered completely.

    What I found interesting was NASA's initial assessment that the flash ROM was failing -- a hardware failure. The media jumped all over that and reported it, so the rest of us were thinking, "Great, the rover is crippled and will never be the same. :-( "

    Now, turns out it was just a software error. Where's the mainstream media now? ("EE Times" is hardly mainstream!) Can the rover's recovery now be considered a "complete recovery"?

    If this story goes mainstream, will it make NASA look bad for screwing up... or look good for making a full recovery? I'm not sure. (Of course, smart people make mistakes too, lots of them, but the key to being smart is covering your ass beforehand! :-)

  21. Re:One reasonable analogy by sjames · · Score: 3, Interesting

    well, this presupposes that what caused the problem in the first place also didn't mess up the hardware watchdog as well.

    Nothing's perfect. It also presupposes that the sun didn't explode and vaporize the Earth and that God didn't get ticked off and squish it with his thumb. So what?

    A watchdog is a VERY simple device. A simple countdown timer, a control register with associated address decode, etc. It's quite unlikely to fail. When the timer hits zero, it strobes reset. Any access to the port address resets the countdown timer.
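
    A minimal software model of the watchdog described above, as a Python sketch. A real watchdog is a hardware counter; the class, the tick-based timing, and the `simulate` helper are assumptions made only to show the behavior.

```python
class Watchdog:
    """Countdown timer that strobes reset when it reaches zero."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.resets = 0

    def kick(self):
        """Model of 'any access to the port address': reload the countdown."""
        self.remaining = self.timeout_ticks

    def tick(self):
        """Advance one clock tick; return True if reset was strobed."""
        self.remaining -= 1
        if self.remaining <= 0:
            self.resets += 1
            self.remaining = self.timeout_ticks  # countdown restarts after reset
            return True
        return False


def simulate(timeout_ticks, total_ticks, kick_every=None):
    """Count resets; kick_every=None models software that has hung."""
    dog = Watchdog(timeout_ticks)
    for t in range(1, total_ticks + 1):
        if kick_every and t % kick_every == 0:
            dog.kick()
        dog.tick()
    return dog.resets


print(simulate(5, 20, kick_every=3))  # healthy software keeps kicking the timer
print(simulate(5, 20))                # hung software never touches the port
```

    Software that touches the port every 3 ticks never lets a 5-tick timer expire, while hung software gets reset every 5 ticks.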

    Some dual processor boards are even set up to alternate which is the boot processor, so they can come up with a single failed CPU.

    There is always some sort of problem that precludes recovery. No amount of software or clever design can help you if the device is destroyed. However, that doesn't mean don't even try.

  22. Launching with incomplete code is common by rarose · · Score: 4, Interesting

    The en route time for Cassini to get to Saturn was 7 years; rather than push back an already long mission, they launched with feature-incomplete code. They knew they had 7 years to get the software fully functional and debugged; they've updated it remotely from millions of miles away a number of times now.

    I'm sure the rovers did the same thing... Develop the launch/cruise software before you launch (and of course try to get as much of the entry/landing code done as you can!), and then uplink the final code before it's needed. Therefore it doesn't surprise me one bit that the JPL engineer knew there were shortcomings in the launch software.

    Hell, I develop BIOS for servers and we do it all the time. The BIOS image we give the hardware engineers for initial bringup is usually *way* short of features that will be there when it actually gets used by the customers!

    --
    --Rob