Slashdot Mirror


Reformatting a Machine 125 Million Miles Away

An anonymous reader writes: NASA's Opportunity rover has been rolling around the surface of Mars for over 10 years. It's still performing scientific observations, but the mission team has been dealing with a problem: the rover keeps rebooting. It's happened a dozen times this month, and the process is a bit more involved than rebooting a typical computer. It takes a day or two to get back into operation every time. To try and fix this, the Opportunity team is planning a tricky operation: reformatting the flash memory from 125 million miles away. "Preparations include downloading to Earth all useful data remaining in the flash memory and switching the rover to an operating mode that does not use flash memory. Also, the team is restructuring the rover's communication sessions to use a slower data rate, which may add resilience in case of a reset during these preparations." The team suspects some of the flash memory cells are simply wearing out. The reformat operation is scheduled for some time in September.

27 of 155 comments

  1. Hey, Bob, this is Jim by Anonymous Coward · · Score: 5, Funny

    We're gonna need you to go out to the rover and reboot it. Yeah, it got stuck. You should probably leave ASAP.

  2. Simple fix by SternisheFan · · Score: 2

    Easy, just gotta' replace the button battery.

    1. Re:Simple fix by sillybilly · · Score: 2

      It's running on solar power, that's how it lasts 10 years. Though the rechargeable battery must be tough to take so many rechargings.

      Ideally, you have redundant systems for such a situation, where you can take one of them down and use the other to do the booting, formatting, and programming, as if there were a user sitting right next to it. They say it has a flashless mode of operation. The way I think of it, on a regular PC with a BIOS, you can reformat the hard drive without booting off of it, such as by booting from a floppy, or even from a ROM chip like they used to have back in the '80s (ROM-DOS 3.3 or ROM BASIC). When flashing a BIOS or a ROM chip, though, there is no lower level to boot from. But if you have Tandem-style dual redundant systems for everything, you can boot from the lowest of the low levels and have the partner system execute all the commands. So with Tandem, total failure is less frequent: you're down to 50% capacity but still functioning OK, and can work on regaining 100% capacity while regular operations are suspended for two days or so. The problem with Tandem is the double or higher cost and, in space missions, the extra power consumption and extra weight. In space missions weight is almost everything, as each lb has to be paid for dearly, on the order of $10,000/lb to low Earth orbit, and who knows how many gazillion dollars per lb for a Mars mission.
      There used to be a company named Tandem, designing dual-CPU, redundant, resilient, failure-tolerant systems, but they fell behind on chip design because of small size, plus high expense, and did not compete well in the computing field. For instance, back in 1999 when Google started, they started with regular PCs of whatever the vogue of the day was, I don't know, 700 MHz PIII, maybe? And just jury-rigged a bunch of them into a daisy chain and voila, you have a Tandem-like, more than dual, more like thousandfold or millionfold duplicated, resilient supercomputer. But the principle of tandemness and fault tolerance was there. Maybe for space missions that need fault tolerance like that, it may be worth the extra rocket fuel weight in the first place to double the weight and duplicate the most critical systems. The human body duplicates kidneys and lungs, but not the liver or heart, so there is a balance on what you want to go redundant on and what not. Life is easy with 2 kidneys, some people can live with only one kidney, but it's really difficult to live with zero kidneys.

    2. Re: Simple fix by Anonymous Coward · · Score: 2, Funny

      Ass-burgers.

    3. Re:Simple fix by davester666 · · Score: 2

      what's this step 4??

      Press the reset button.

      Who the hell designed this stuff?

      --
      Sleep your way to a whiter smile...date a dentist!
  3. Re:And I thought I was cool... by marcello_dl · · Score: 4, Funny

    And I thought I was cool when I reboot servers around the world thinking I am rebooting mine.

    --
    ---- MISSING MISCELLANEOUS DATA SEGMENT --- [sigdash] trolololol
  4. ECC? by TechyImmigrant · · Score: 5, Funny

    They didn't do any ECC on the flash memory? I thought these people were rocket scientists.

    --
    I should use this sig to advertise my book ISBN-13 : 978-1501515132.
    1. Re:ECC? by Anonymous Coward · · Score: 3, Insightful

      As it happens, for flash, read errors are often transient. A better model than DRAM-style ECC is to treat it more like a disk drive with checksums on each block. If you get an error, reread the block. And if you have a problem writing a block (e.g. the readback is wrong), just use a new block. Surely you've noticed that your USB thumbdrive gradually gets smaller with time as blocks wear out. (In space hardware, back in 2000, wear leveling was done manually, and still is as far as I know: there are no nice rad-hard flash controller chips to make a big pile of MLC flash look like a disk drive, etc.)
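
      A minimal sketch of that checksum-per-block idea, in Python. Everything here is an assumption made for illustration: FlakyFlash is a toy in-memory stand-in for a flash part, CRC32 is just a convenient checksum, and none of the names come from the rover's actual software.

```python
import zlib


class FlakyFlash:
    """Toy in-memory stand-in for a flash device (hypothetical)."""

    def __init__(self, num_blocks, bad_blocks=()):
        self.blocks = [b""] * num_blocks
        self.bad_blocks = set(bad_blocks)  # blocks where written data does not stick
        self.transient_errors = {}         # block index -> number of reads that glitch

    def write(self, index, raw):
        if index in self.bad_blocks:
            raw = b"\x00" * len(raw)       # worn-out block stores zeros instead
        self.blocks[index] = raw

    def read(self, index):
        raw = self.blocks[index]
        if self.transient_errors.get(index, 0) > 0:
            self.transient_errors[index] -= 1
            return bytes([raw[0] ^ 0xFF]) + raw[1:]  # corrupt the first byte once
        return raw


def pack(payload):
    """Prefix the payload with its CRC32 so a later read can be verified."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def unpack(raw):
    """Return the payload if its checksum matches, else None."""
    if len(raw) < 4:
        return None
    stored = int.from_bytes(raw[:4], "big")
    payload = raw[4:]
    return payload if zlib.crc32(payload) == stored else None


def read_block(flash, index, retries=3):
    """Reread on checksum failure, since read errors are often transient."""
    for _ in range(retries):
        payload = unpack(flash.read(index))
        if payload is not None:
            return payload
    raise IOError(f"block {index} failed checksum {retries} times")


def write_block(flash, payload, free_blocks):
    """Try free blocks in order until the readback verifies; return the block used."""
    while free_blocks:
        index = free_blocks.pop(0)
        flash.write(index, pack(payload))
        if unpack(flash.read(index)) == payload:
            return index
    raise IOError("no usable blocks left")
```

      Blocks that fail the readback are simply dropped from the free list, which is the "thumbdrive gets smaller" effect described above.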

      The long duration radiation performance of flash memory (particularly back in 2000, when these things were being designed) was/is not particularly well understood. There are a lot of what is called Enhanced Low Dose Radiation Effects (ELDREs) that are poorly understood for all semiconductor devices: you can't just blast the part in an accelerator at 1kRad/hr for a few days to get to a few hundred kRad and expect that this is the same as taking a few tens of Rad/hr over days and days and days, with 12 hours off after the sun goes down to anneal and heal.

      And, because resources on spacecraft are very precious, one doesn't blindly head off and say "let's just TMR everything". You make a rational choice based on the expected design life and the data you do have and pray for the best.

      And, of course, the design life was 3-6 months, and here we are 10 years later, still cranking along. I think it's done pretty well, all things considered.

    2. Re:ECC? by Anonymous Coward · · Score: 2

      The rocket scientists did their job ten years ago. They're working at McDonalds now.

    3. Re:ECC? by schlachter · · Score: 2

      Well, in their defense, ECC on the flash memory isn't exactly rocket science.

      --
      My God can beat up your God. Just kidding...don't take offense. I know there's no God.
    4. Re:ECC? by Nimey · · Score: 4, Insightful

      You're a poster child for Dunning-Kruger: some random on the Internet who thinks he's smarter than the folks who designed a Mars rover that lasted over 10 years past its 90-day expected life.

      --
      Hail Eris, full of mischief...

      E pluribus sanguinem
    5. Re:ECC? by M1FCJ · · Score: 2

      Most of the hardware cost is the launch vehicle, not the rover.
      Most of the people (salary) cost is the people working on the data generated (all across the universities around the world who analyze the data and write papers), not the designers.

      Underspeccing it wouldn't have saved much.

      There's one that breaks this rule, the JWST. Just the endless redesigns have gobbled up so much money, I don't believe there will be enough Science generated by it to cover the build & launch costs.

    6. Re:ECC? by lightbounce · · Score: 2

      ECC use is standard with all flash storage. Flash is so unreliable that it can't be used without it, and it has nothing to do with the hard radiation environment on Mars. As for wear leveling, it's been standard since at least 1990 with the first attempts at flash storage. Why the rovers don't do it, I don't know. Maybe because it requires too many cycles of an already limited processor, plus dedicated storage space to keep "use counts" of all the flash blocks.
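
      For a sense of what keeping "use counts" costs, here is a rough Python sketch of the simplest form of wear leveling: always hand out the free block that has been erased the fewest times. The WearLeveler class and its method names are hypothetical, and real flash translation layers are considerably more involved.

```python
class WearLeveler:
    """Pick the least-erased free block on every allocation (illustrative only)."""

    def __init__(self, num_blocks):
        self.erase_counts = [0] * num_blocks  # the per-block "use counts"
        self.in_use = set()

    def allocate(self):
        candidates = [i for i in range(len(self.erase_counts)) if i not in self.in_use]
        if not candidates:
            raise RuntimeError("no free blocks")
        # Choosing the minimum count spreads erases evenly across the device.
        index = min(candidates, key=lambda i: self.erase_counts[i])
        self.erase_counts[index] += 1         # allocating implies an erase cycle
        self.in_use.add(index)
        return index

    def release(self, index):
        self.in_use.discard(index)


leveler = WearLeveler(3)
for _ in range(30):
    block = leveler.allocate()
    leveler.release(block)
```

      The bookkeeping is one counter per block plus a scan on each allocation, which is the storage and cycle overhead mentioned above.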

  5. Alternative Title by wisnoskij · · Score: 4, Insightful

    How to brick a 2.5 billion dollar device.

    --
    Troll is not a replacement for I disagree.
    1. Re:Alternative Title by rasmusbr · · Score: 3, Insightful

      I would imagine that the system probably boots itself off of a ROM chip that has a routine for receiving data from Earth and storing it in RAM and then flashing that data onto the flash chip.

      If the rover does not boot from ROM then it is a miracle that it hasn't bricked itself yet.

    2. Re:Alternative Title by Zarhan · · Score: 5, Interesting

      Not modem reset. The filesystem on Spirit had a bunch of temp files and other stuff from the Earth-Mars flight, and apparently it just ran out of inodes. So basically they had to remote into whatever constitutes a bootloader with 20 mins of latency and remove some of the no-longer-needed files.

      See http://science.slashdot.org/st...

  6. Is it running Windows? by mark_reh · · Score: 2

    Is it?

    1. Re:Is it running Windows? by Psychotria · · Score: 4, Informative

      You're probably joking, but the OS is VxWorks.

  7. Re:Remote management by ledow · · Score: 5, Informative

    Not really...

    The chances are that "reformat" isn't what we think and includes one or more of:

    1) Rewriting cells and allowing wear-levelling and sector-replacement to take place, and mark bad sectors as bad.
    2) Write-testing and manually avoiding those sectors that don't perform as expected.
    3) Rewriting all the critical storage functions to avoid the already-known bad sectors.
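
    Option 2 can be sketched in a few lines of Python. This is a guess at the general shape of a write-test, not the rover team's procedure: find_bad_blocks and the write/read callables are made-up names, and the scan destroys whatever is stored in the blocks it tests.

```python
def find_bad_blocks(write, read, num_blocks, block_size=16):
    """Write test patterns to each block and report the ones that read back wrong.

    `write(index, data)` and `read(index)` are caller-supplied (hypothetical)
    device accessors. The scan overwrites every block it touches.
    """
    # Two complementary bit patterns catch cells stuck at either 0 or 1.
    patterns = [b"\xAA" * block_size, b"\x55" * block_size]
    bad = []
    for index in range(num_blocks):
        for pattern in patterns:
            write(index, pattern)
            if read(index) != pattern:
                bad.append(index)
                break
    return bad
```

    The resulting list is what the filesystem would then steer around, much as scandisk-era tools marked sectors bad.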

    It's the kind of thing that anyone can play with. Not saying it's not risky on a remote device, but BadRAM etc. patches have been in place for years and that's a way to run Linux on machines with faulty ***RAM***, not just long-term storage.

    Many years ago, a bad sector on your hard drive was something you found out with scandisk (or previous tools) and then it was marked as bad and that was the end of that. Your PC wouldn't use it and so long as it wasn't the boot sector, that was the end of that. It was only the "creeping" bad sectors, where you got more bad sectors over time, that would really worry anyone.

    I imagine that it's not at all difficult to make sure that multiple boot sectors were in place if you really wanted to but why bother? The chances are billions to one. Chances are this hardware has MUCH better fault tolerance and multiple hardware watchdogs, firmware, and boot attempts to make sure it eventually gets back up SOMEHOW.

    There's a reason that even FAT stores two copies of the allocation table, why Linux ext filesystems store multiple copies of the superblock, etc. They come from a legacy where the occasional bad sector wasn't a problem and where 20 MB of hard drive cost more than the computer did, so it was better to cope with the fault than just tell people to buy a new one. And their predecessors were (and still are) mainframes with hardware that's just that fault-tolerant in the first place anyway.
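
    The duplicate-metadata trick is easy to sketch in Python. The CRC32 prefix is this sketch's own way of telling a good copy from a damaged one, not a claim about how FAT or ext decide which copy to trust, and the function names are invented.

```python
import zlib


def store_copies(metadata, copies=2):
    """Return several identical checksummed records, one per on-disk location."""
    record = zlib.crc32(metadata).to_bytes(4, "big") + metadata
    return [bytearray(record) for _ in range(copies)]


def load_first_valid(records):
    """Return the payload of the first copy whose checksum still matches."""
    for record in records:
        stored = int.from_bytes(record[:4], "big")
        payload = bytes(record[4:])
        if zlib.crc32(payload) == stored:
            return payload
    raise ValueError("all copies are corrupt")


records = store_copies(b"allocation table")
records[0][5] ^= 0xFF                 # simulate a bad sector under the first copy
table = load_first_valid(records)     # falls back to the second copy
```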

    It's not at all hard to write a filesystem that can cope with not only damage, but even recurring damage. You've seen PAR files presumably? The same could easily be done on a filesystem-level basis (and I imagine, somewhere, already is for some specialist niche).
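
    As a toy version of that idea in Python: one XOR parity block is enough to rebuild any single lost block. This is a simplification chosen for brevity. PAR2 itself uses Reed-Solomon codes, which can repair several lost blocks, and the function names below are invented.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute a single parity block over all data blocks."""
    return xor_blocks(data_blocks)


def recover(blocks_with_gap, parity):
    """Rebuild the one block marked None from the survivors plus parity."""
    missing = [i for i, block in enumerate(blocks_with_gap) if block is None]
    if len(missing) != 1:
        raise ValueError("single parity can repair exactly one missing block")
    survivors = [block for block in blocks_with_gap if block is not None]
    repaired = list(blocks_with_gap)
    repaired[missing[0]] = xor_blocks(survivors + [parity])
    return repaired
```

    A filesystem built this way could keep working through recurring damage as long as no more than one block per parity group dies between repairs.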

    It's not that big a deal once they KNOW that's the problem. The biggest problem is that they only "suspect" that's the problem.

  8. Err, if you're a system admin.. by Viol8 · · Score: 4, Funny

    ... you're not cool. Period. Sorry.

    1. Re:Err, if you're a system admin.. by TeknoHog · · Score: 3, Funny

      Chill out. They're just having that time of the month.

      --
      Escher was the first MC and Giger invented the HR department.
  9. Alternative Title by Whiternoise · · Score: 2

    They will almost certainly do a dummy run on an identical piece of flight hardware on Earth. The only difference is how the data are sent.

  10. Re:Assumptions by beelsebob · · Score: 2

    I believe you're assuming that the flash used on a rover that went to Mars, and encounters all kinds of crazy radiation, is in some way similar to the crappy OCZ thing you stuck in your PC 10 years ago.

  11. Re:Why is it not trivial? by SeaFox · · Score: 2

    Why didn't they plan ahead for this sort of operation in the beginning, making it painless and 'reliable' ( as possible ).

    That's a joke, right? We are talking about one of the two rovers that was sent to Mars on a mission planned to only last 90 days. They didn't see "flash memory wearing out from use" as a contingency they needed to plan for.

  12. Re:If there is a problem and need to call "support by sillybilly · · Score: 2

    I'll be glad to help you with that Sir.

  13. It worked on Spirit by lemur3 · · Score: 4, Interesting

    they had to do this type of thing on Spirit shortly after it arrived on Mars..

    read more here: http://trs-new.jpl.nasa.gov/ds...

    or the PDF linked therein here http://trs-new.jpl.nasa.gov/ds...

    it's got all sorts of awesome details.

    We commanded a shutdown, which terminated the current communication window, and the loss of signal occurred at the predicted time. Fifty minutes later, we commanded a beep at 7.8125 bps to alert us if the shutdown command did not work, and much to our disappointment, the beep was received!

    really a fun read. I'm guessing they'll be doing a lot of similar stuff

  14. anything starting with "why didn't they just..." by electrosoccertux · · Score: 2

    shoot the asker?