Slashdot Mirror


Reformatting a Machine 125 Million Miles Away

An anonymous reader writes: NASA's Opportunity rover has been rolling around the surface of Mars for over 10 years. It's still performing scientific observations, but the mission team has been dealing with a problem: the rover keeps rebooting. It's happened a dozen times this month, and the process is a bit more involved than rebooting a typical computer. It takes a day or two to get back into operation every time. To try and fix this, the Opportunity team is planning a tricky operation: reformatting the flash memory from 125 million miles away. "Preparations include downloading to Earth all useful data remaining in the flash memory and switching the rover to an operating mode that does not use flash memory. Also, the team is restructuring the rover's communication sessions to use a slower data rate, which may add resilience in case of a reset during these preparations." The team suspects some of the flash memory cells are simply wearing out. The reformat operation is scheduled for some time in September.

10 of 155 comments

  1. Hey, Bob, this is Jim by Anonymous Coward · · Score: 5, Funny

    We're gonna need you to go out to the rover and reboot it. Yeah, it got stuck. You should probably leave ASAP.

  2. Re:And I thought I was cool... by marcello_dl · · Score: 4, Funny

    And I thought I was cool when I reboot servers around the world thinking I am rebooting mine.

    --
    ---- MISSING MISCELLANEOUS DATA SEGMENT --- [sigdash] trolololol
  3. ECC? by TechyImmigrant · · Score: 5, Funny

    They didn't do any ECC on the flash memory? I thought these people were rocket scientists.

    --
    I should use this sig to advertise my book ISBN-13 : 978-1501515132.
    1. Re:ECC? by Nimey · · Score: 4, Insightful

      You're a poster child for Dunning-Kruger: some random on the Internet who thinks he's smarter than the folks who designed a Mars rover that lasted over 10 years past its 90-day expected life.

      --
      Hail Eris, full of mischief...

      E pluribus sanguinem
  4. Alternative Title by wisnoskij · · Score: 4, Insightful

    How to brick a 2.5 billion dollar device.

    --
    Troll is not a replacement for I disagree.
    1. Re:Alternative Title by Zarhan · · Score: 5, Interesting

      Not modem reset. The filesystem on Spirit had a bunch of temp files and other stuff from the Earth-Mars flight, and apparently it just ran out of inodes. So basically they had to remote into whatever constitutes a bootloader with 20 mins of latency and remove some of the no-longer-needed files.

      See http://science.slashdot.org/st...

  5. Re:Is it running Windows? by Psychotria · · Score: 4, Informative

    You're probably joking, but the OS is VxWorks.

  6. Re:Remote management by ledow · · Score: 5, Informative

    Not really...

    The chances are that "reformat" isn't what we think and includes one or more of:

    1) Rewriting cells and allowing wear-levelling and sector-replacement to take place, and marking bad sectors as bad.
    2) Write-testing and manually avoiding those sectors that don't perform as expected.
    3) Rewriting all the critical storage functions to avoid the already-known bad sectors.
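
A rough sketch of what options 2 and 3 above might look like, on a simulated device. Everything here is hypothetical and for illustration only: `SimulatedFlash`, `scan_for_bad_blocks`, and `allocate` are made-up names, the "worn" blocks are simply told to corrupt their first byte, and none of this reflects the rover's actual flight software.

```python
class SimulatedFlash:
    """Toy block device: worn blocks fail to store data faithfully (assumed model)."""

    def __init__(self, num_blocks, block_size, worn_blocks):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        self.worn = set(worn_blocks)  # hypothetical worn-out blocks

    def write(self, index, data):
        if index in self.worn:
            # A worn block flips one bit in the first byte instead of keeping the data.
            data = bytes([data[0] ^ 0x01]) + data[1:]
        self.blocks[index] = data

    def read(self, index):
        return self.blocks[index]


def scan_for_bad_blocks(device):
    """Write test patterns to every block, read them back, return the failing indexes."""
    bad = set()
    patterns = (b"\xAA", b"\x55")  # alternating bit patterns
    for index in range(len(device.blocks)):
        for pattern in patterns:
            expected = pattern * device.block_size
            device.write(index, expected)
            if device.read(index) != expected:
                bad.add(index)
                break
    return bad


def allocate(device, bad_blocks, count):
    """Hand out the first `count` block indexes that are not marked bad."""
    good = [i for i in range(len(device.blocks)) if i not in bad_blocks]
    if len(good) < count:
        raise RuntimeError("not enough good blocks left")
    return good[:count]


flash = SimulatedFlash(num_blocks=8, block_size=16, worn_blocks=[2, 5])
bad = scan_for_bad_blocks(flash)
print(sorted(bad))               # [2, 5]
print(allocate(flash, bad, 4))   # [0, 1, 3, 4]
```

Note that a write test like this destroys whatever was in the blocks, which fits with the summary's point about downloading all useful data first.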

    It's the kind of thing that anyone can play with. Not saying it's not risky on a remote device, but BadRAM etc. patches have been in place for years and that's a way to run Linux on machines with faulty ***RAM***, not just long-term storage.

    Many years ago, a bad sector on your hard drive was something you found out with scandisk (or previous tools) and then it was marked as bad and that was the end of that. Your PC wouldn't use it and so long as it wasn't the boot sector, that was the end of that. It was only the "creeping" bad sectors, where you got more bad sectors over time, that would really worry anyone.

    I imagine that it's not at all difficult to make sure that multiple boot sectors were in place if you really wanted to but why bother? The chances are billions to one. Chances are this hardware has MUCH better fault tolerance and multiple hardware watchdogs, firmware, and boot attempts to make sure it eventually gets back up SOMEHOW.

    There's a reason that even FAT stores two copies of the allocation table, why Linux ext filesystems store multiple copies of the superblock, etc. They come from a legacy where the occasional bad sector wasn't a problem and where 20MB of hard drive cost more than the computer did so it was better to cope with the fault than just tell people to buy a new one. And their predecessors were (and still are) mainframes with hardware that's just that fault-tolerant in the first place anyway.
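
The redundant-copy idea can be sketched in a few lines. This is a toy, not how FAT or ext actually lay things out: the CRC32 prefix is an assumption added here so a damaged copy can be detected, and `store_with_checksum` and `load_first_valid` are hypothetical names.

```python
import zlib


def store_with_checksum(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 so corruption can be detected later."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def load_first_valid(copies):
    """Return (position, payload) for the first copy whose checksum still matches."""
    for position, record in enumerate(copies):
        stored_crc, payload = record[:4], record[4:]
        if zlib.crc32(payload).to_bytes(4, "big") == stored_crc:
            return position, payload
    raise ValueError("all copies are corrupt")


table = b"allocation table v1"
copies = [store_with_checksum(table) for _ in range(3)]

# Simulate a bad sector under the primary copy by flipping one payload byte.
damaged = bytearray(copies[0])
damaged[6] ^= 0xFF
copies[0] = bytes(damaged)

position, recovered = load_first_valid(copies)
print(position, recovered == table)   # 1 True
```

The reader falls through to the first backup that still verifies, so losing the primary copy costs nothing but a retry.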

    It's not at all hard to write a filesystem that can cope with not only damage, but even recurring damage. You've seen PAR files presumably? The same could easily be done on a filesystem-level basis (and I imagine, somewhere, already is for some specialist niche).
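
For a flavour of how recovery data works, here is a minimal sketch using plain XOR parity. PAR files use Reed-Solomon codes, which can survive more than one lost block; XOR parity is swapped in here because it is much simpler, and it can rebuild only a single missing block. The function names are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(blocks, parity):
    """Rebuild the single block marked None from the survivors plus the parity block."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can rebuild exactly one missing block")
    survivors = [block for block in blocks if block is not None]
    restored = list(blocks)
    restored[missing[0]] = xor_blocks(survivors + [parity])
    return restored


data = [b"rove", b"r on", b"Mars"]   # equal-length data blocks
parity = xor_blocks(data)            # stored alongside the data

damaged = [data[0], None, data[2]]   # pretend block 1 sat on a bad sector
print(rebuild_missing(damaged, parity) == data)   # True
```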

    It's not that big a deal once they KNOW that's the problem. The biggest problem is that they only "suspect" that's the problem.

  7. Err, if you're a system admin.. by Viol8 · · Score: 4, Funny

    ... you're not cool. Period. Sorry.

  8. It worked on Spirit by lemur3 · · Score: 4, Interesting

    They had to do this type of thing on Spirit shortly after it arrived on Mars.

    read more here: http://trs-new.jpl.nasa.gov/ds...

    or the PDF linked therein here http://trs-new.jpl.nasa.gov/ds...

    its got all sorts of awesome details.

    We commanded a shutdown, which terminated the current communication window, and the loss of signal occurred at the predicted time. Fifty minutes later, we commanded a beep at 7.8125 bps to alert us if the shutdown command did not work, and much to our disappointment, the beep was received!

    Really a fun read. I'm guessing they'll be doing a lot of similar stuff.