Reformatting a Machine 125 Million Miles Away
An anonymous reader writes: NASA's Opportunity rover has been rolling around the surface of Mars for over 10 years. It's still performing scientific observations, but the mission team has been dealing with a problem: the rover keeps rebooting. It's happened a dozen times this month, and the process is a bit more involved than rebooting a typical computer. It takes a day or two to get back into operation every time. To try and fix this, the Opportunity team is planning a tricky operation: reformatting the flash memory from 125 million miles away. "Preparations include downloading to Earth all useful data remaining in the flash memory and switching the rover to an operating mode that does not use flash memory. Also, the team is restructuring the rover's communication sessions to use a slower data rate, which may add resilience in case of a reset during these preparations." The team suspects some of the flash memory cells are simply wearing out. The reformat operation is scheduled for some time in September.
We're gonna need you to go out to the rover and reboot it. Yeah, it got stuck. You should probably leave ASAP.
Easy, just gotta' replace the button battery.
When I reboot machines in Asia or UK/EU using IPMI from the US.
Nobodies Prefect
Tidbits for Techs Technology Blog
do they get somebody in or from India?
With a replacement SLC SSD and a screwdriver
Religion: The greatest weapon of mass destruction of all time
They didn't do any ECC on the flash memory? I thought these people were rocket scientists.
I should use this sig to advertise my book ISBN-13 : 978-1501515132.
How to brick a 2.5 billion dollar device.
Troll is not a replacement for I disagree.
Is it?
Not really...
The chances are that "reformat" isn't what we think and includes one or more of:
1) Rewriting cells and allowing wear-levelling and sector-replacement to take place, and marking bad sectors as bad.
2) Write-testing and manually avoiding those sectors that don't perform as expected.
3) Rewriting all the critical storage functions to avoid the already-known bad sectors.
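To make option 2 concrete, here is a minimal sketch of a write-test pass over a toy flash device. Everything in it is hypothetical (the `SimulatedFlash` class, the block size, the way a worn block corrupts data); it only illustrates the general idea of writing patterns, reading them back, and keeping a bad-block list, and says nothing about how the rover's flight software actually does it.

```python
class SimulatedFlash:
    """Toy flash device: a list of fixed-size blocks, some of them worn out."""

    def __init__(self, num_blocks, block_size=16, worn_blocks=()):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        self.worn = set(worn_blocks)  # hypothetical failure model

    def write(self, index, data):
        if index in self.worn:
            # A worn block stores a corrupted first byte.
            data = bytes([data[0] ^ 0xFF]) + data[1:]
        self.blocks[index] = data

    def read(self, index):
        return self.blocks[index]


def write_test(flash, patterns=(0x55, 0xAA)):
    """Write each pattern to every block, read it back, return bad block indices."""
    bad = set()
    for index in range(len(flash.blocks)):
        for pattern in patterns:
            expected = bytes([pattern]) * flash.block_size
            flash.write(index, expected)
            if flash.read(index) != expected:
                bad.add(index)
                break  # one failed pattern is enough to retire the block
    return bad


def usable_blocks(flash, bad):
    """Blocks the storage layer may still allocate from."""
    return [i for i in range(len(flash.blocks)) if i not in bad]


flash = SimulatedFlash(num_blocks=8, worn_blocks={2, 5})
bad_blocks = write_test(flash)
good_blocks = usable_blocks(flash, bad_blocks)
print(sorted(bad_blocks), good_blocks)
```

A real pass would have to save the contents first (the summary mentions downloading all useful data before the reformat), since the test overwrites every block.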
It's the kind of thing that anyone can play with. Not saying it's not risky on a remote device, but BadRAM etc. patches have been in place for years and that's a way to run Linux on machines with faulty *RAM*, not just long-term storage.
Many years ago, a bad sector on your hard drive was something you found out with scandisk (or previous tools) and then it was marked as bad and that was the end of that. Your PC wouldn't use it and so long as it wasn't the boot sector, that was the end of that. It was only the "creeping" bad sectors, where you got more bad sectors over time, that would really worry anyone.
I imagine that it's not at all difficult to make sure that multiple boot sectors were in place if you really wanted to but why bother? The chances are billions to one. Chances are this hardware has MUCH better fault tolerance and multiple hardware watchdogs, firmware, and boot attempts to make sure it eventually gets back up SOMEHOW.
There's a reason that even FAT stores two copies of the allocation table, why Linux ext filesystems store multiple copies of the superblock, etc. They come from a legacy where the occasional bad sector wasn't a problem and where 20MB of hard drive cost more than the computer did so it was better to cope with the fault than just tell people to buy a new one. And their predecessors were (and still are) mainframes with hardware that's just that fault-tolerant in the first place anyway.
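The redundant-copy idea is easy to sketch. The example below keeps several copies of a metadata record and reads the first one that passes a checksum. The CRC32 prefix and the `pack`/`unpack`/`read_metadata` names are assumptions for illustration only; this is not the on-disk layout of FAT or ext.

```python
import zlib


def pack(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 so damage can be detected later."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def unpack(record: bytes):
    """Return the payload if the checksum matches, otherwise None."""
    if len(record) < 4:
        return None
    stored, payload = record[:4], record[4:]
    if int.from_bytes(stored, "big") == zlib.crc32(payload):
        return payload
    return None


def read_metadata(copies):
    """Try each stored copy in order and return the first intact payload."""
    for record in copies:
        payload = unpack(record)
        if payload is not None:
            return payload
    raise IOError("all metadata copies are damaged")


primary = bytearray(pack(b"root=block 7"))
backup = pack(b"root=block 7")
primary[6] ^= 0x01  # simulate one flipped bit in the primary copy

recovered = read_metadata([bytes(primary), backup])
print(recovered)
```

The damaged primary copy is skipped and the backup is used, which is the whole point of paying for the extra storage.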
It's not at all hard to write a filesystem that can cope with not only damage, but even recurring damage. You've seen PAR files presumably? The same could easily be done on a filesystem-level basis (and I imagine, somewhere, already is for some specialist niche).
It's not that big a deal once they KNOW that's the problem. The biggest problem is that they only "suspect" that's the problem.
... you're not cool. Period. Sorry.
I dunno so much these days. It's 10 years old and got a few miles on the clock plus collection for the new owner would be an issue. On the plus side vandalism won't be a worry. For a few centuries anyway.
They will almost certainly do a dummy run on an identical piece of flight hardware on Earth. The only difference is how the data are sent.
Why didn't they plan ahead for this sort of operation in the beginning, making it as painless and 'reliable' as possible?
---- Booth was a patriot ----
Ultrix used to mark bad sectors on the fly, as far as I can remember, if the disk was not a SCSI...
I believe NASA is operating under the assumption that the rover's on board flash memory is still serviceable. 10 years ago flash memory was still in its relative infancy. A reformat and reload risks bricking the rover completely.
I'll be glad to help you with that Sir.
I didn't realize they used OCZ for the storage tech. ;)
Why OpalCalc is the best Windows calc
I'd hate to be a) the guy pitching this operation at the change control meeting, and b) the guy signing off on this change.
The "Civilized World" jumped the shark ca. 1973.
You've seen PAR files presumably? The same could easily be done on a filesystem-level basis (and I imagine, somewhere, already is for some specialist niche).
While all hard drives now do their own Hamming error correction (or something better), RAID2 is the same idea for "raw" storage that doesn't: you write explicit ECCs to redundant volumes to allow recovery from both drive loss and bad sectors.
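For anyone who hasn't seen Hamming correction in action, here is a minimal sketch of the classic Hamming(7,4) code: 4 data bits, 3 parity bits, and any single flipped bit in the 7-bit codeword can be located and fixed. The function names are made up for the example, and real drives and flash controllers use stronger codes than this textbook one.

```python
from itertools import product


def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword (positions 1..7, parity at 1, 2, 4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the bad bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


def survives_all_single_bit_errors():
    """Check every data word against every possible single-bit flip."""
    for data in product([0, 1], repeat=4):
        for position in range(7):
            damaged = hamming74_encode(data)
            damaged[position] ^= 1
            if hamming74_decode(damaged) != list(data):
                return False
    return True


print(survives_all_single_bit_errors())
```

Two flipped bits in the same codeword are beyond what this code can repair, which is why worn-out cells that fail in clusters are a harder problem than the occasional bit flip.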
RAID5 with modern drives gives all the same resiliency, as the drives do the block-level ECC themselves, so you never see RAID2. But for a pile of flash memory, that's the filesystem-level equivalent of PAR files.
Socialism: a lie told by totalitarians and believed by fools.
They had to do this type of thing on Spirit shortly after it arrived on Mars.
read more here: http://trs-new.jpl.nasa.gov/ds...
or the PDF linked therein here http://trs-new.jpl.nasa.gov/ds...
It's got all sorts of awesome details.
We commanded a shutdown, which terminated the current communication window, and the loss of signal occurred at the predicted time. Fifty minutes later, we commanded a beep at 7.8125 bps to alert us if the shutdown command did not work, and much to our disappointment, the beep was received!
Really a fun read. I'm guessing they'll be doing a lot of similar stuff.
I'm picturing something akin to those Shuttle missions to repair flaws in the Hubble telescope's optics, except involving a NASA-engineered paperclip.
Man, hope they don't select the wrong partition.....
I'm sure they didn't get any of their capacitors from that bad batch a few years past.
They should have used what genius?
Hah, I remember running the DOS debugger, poking into a certain address in the memory to access the MFM BIOS, then you could do a low level format where you could enter the sectors to mark as bad. Those were the days...
"g=c800:5." :)
Hah, I almost remembered that one, good old Seagate controllers. I had the 800 but not the rest.
Now if only we could get a Martian to IM during the process: "Yes. The little red LED is blinking ....."
Have gnu, will travel.
can parent be modded funny not insightful? Insightful is too depressing...
Do unto others...
???
shoot the asker?
You mean like RAID-5? Because RAID-5 was part of the inspiration for the PAR2 format.
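The RAID-5 idea boils down to XOR parity, and a small sketch shows why it recovers a lost block. This is an illustration with made-up helper names and equal-sized blocks; PAR2 itself uses Reed-Solomon coding so it can survive more than one missing block, which plain XOR parity cannot.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """The parity block is simply the XOR of all data blocks."""
    return xor_blocks(data_blocks)


def rebuild(blocks, parity):
    """Rebuild the single missing block (marked None) from survivors and parity."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can rebuild exactly one missing block")
    survivors = [block for block in blocks if block is not None]
    return xor_blocks(survivors + [parity])


data = [b"spirit..", b"opportun", b"curiosit"]
parity = make_parity(data)

damaged = [data[0], None, data[2]]  # pretend the second block was lost
restored = rebuild(damaged, parity)
print(restored)
```

Because XOR is its own inverse, XORing the survivors with the parity cancels everything except the missing block.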
Sometimes when I sound mocking, ironic and sarcastic, I'm actually serious, as in ironic-ironic, or sarcastic-sarcastic. A lot of Americans simply smack the phone down on Indian tech support, saying gimme somebody who speaks English. I patiently listen to them struggle through it.
that makes no sense.
I always thought that the disk controller should do idle scrubbing. Are there any modern SATA disks that do this?
No, the drives themselves don't do this because it pulls the head away from where the host wants/expects it to be. This would result in a lot of unexpected thrashing. If scrubbing is to be done, it is best done by the OS as a background task.
Flash memory isn't the Rover's problem. It's still running XP and there are no more hot fixes. At this point the Rover's system has massive "bit rot," not to mention that it's been hacked countless times by the Chinese. Undeterred by this seemingly insurmountable problem, Microsoft has donated a Windows Phone for communications back to earth and a Surface Pro to power the Rover "because it's just like a computer." They didn't say just who's going to operate their touch-only interfaces. It all makes perfect sense because nobody in their right mind buys those things down on earth. Thus, new markets like Mars are vital to both products' successes. You might wonder how they will get into space. Microsoft has also kept mum on that, but the word is that there is still so much gas leftover from the Ballmer era that achieving liftoff is a trivial undertaking. -- Cary R., Microsoft Senior Technical Writer (ret.)