Curiosity Rover On Standby As NASA Addresses Computer Glitch
alancronin writes "NASA's Mars rover Curiosity has been temporarily put into 'safe mode,' as scientists monitoring from Earth try to fix a computer glitch, the US space agency said. Scientists switched to a backup computer Thursday so that they could troubleshoot the problem, said to be linked to a glitch in the original computer's flash memory. 'We switched computers to get to a standard state from which to begin restoring routine operations,' said Richard Cook of NASA's Jet Propulsion Laboratory, the project manager for the Mars Science Laboratory Project, which built and operates Curiosity."
TFA pretty much covers this, saying they believe it is a problem in the flash memory.
The computer problem is related to a glitch in flash memory on the A-side computer caused by corrupted memory files, Cook said. Scientists are still looking into the root cause of the corrupted memory, but it's possible the memory files were damaged by high-energy space particles called cosmic rays, which are always a danger beyond the protective atmosphere of Earth.
They also say
"We also want to look to see if we can make changes to software to immunize against this kind of problem in the future," Cook said.
It seems that, since the same thing happened on one of the earlier rovers, this is something they would have done some time ago.
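For what it's worth, one common way software guards against this kind of damage is to store a checksum with each record and verify it on read. The sketch below only illustrates that general idea and is not a description of NASA's flight software: the function names (seal, is_intact, flip_bit) are made up for the example, and a CRC32 only detects corruption, it does not repair it.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Append a 4-byte CRC32 so later corruption can be detected."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def is_intact(record: bytes) -> bool:
    """Recompute the CRC32 of the payload and compare it with the stored one."""
    if len(record) < 4:
        return False
    payload, stored = record[:-4], record[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored


def flip_bit(record: bytes, bit_index: int) -> bytes:
    """Simulate a radiation-induced upset by flipping a single bit."""
    damaged = bytearray(record)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


record = seal(b"hypothetical memory file contents")
print(is_intact(record))               # the untouched record verifies
print(is_intact(flip_bit(record, 5)))  # a one-bit flip is caught
```

Detection like this tells you which file to distrust; recovering the contents would need a second copy or an error-correcting code on top.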
They are now updating the B side computer so it can manage the mission while they work on the primary. I wonder why this is not something that is kept up to date anyway. I can see keeping B an update or two behind A to prevent a single programming error taking both of them down. But after you are satisfied with A's software load, why keep B so far back-level that transition takes so much time? And since the computers are said to be identical, why the desire to move back to A?
Sig Battery depleted. Reverting to safe mode.
Who else has a feeling that someone fitted in a module backwards?
Either that, or a dead cell or two.
Nobody who has read TFA has that feeling. Curiosity has been running since Aug. 6, 2012 on your putative "backwards module".
"NOOOO! I should've selected Safe mode WITH Networking!"
They are now updating the B side computer so it can manage the mission while they work on the primary. I wonder why this is not something that is kept up to date anyway. I can see keeping B an update or two behind A to prevent a single programming error taking both of them down. But after you are satisfied with A's software load, why keep B so far back-level that transition takes so much time? And since the computers are said to be identical, why the desire to move back to A?
When are you "satisfied" with software like this? Imagine something that shows up as slow corruption, or only occurs when a certain counter overflows or whatever. You don't want to be caught in a race against time to save the system before the B computer dies too. That is probably why they want to move it back to the A computer: if it can't run there, then they don't have a backup anymore. It's better for them to have a slightly reduced system with a backup than a B computer running with no backup. It does them no good to sit on the ground and say "well, we've figured out what happened and how we could have fixed it" after they've lost contact; then it's game over. You don't run it in "if it breaks, we're done here" mode unless you really, absolutely must.
Live today, because you never know what tomorrow brings
I can easily imagine this happening. I work on a very similar, perhaps nearly identical spacecraft (that's just a tad more critical AND expensive than this thing...) and we haven't necessarily maintained this. You underestimate the overhead associated with generating the necessary uploads.
The reason they probably want to go back to the Prime is that their failure isolation system database is keyed to using the prime units only, and altering it to start on the "B" side and have it switch back to "A" is prohibitive, or at least harder than simply switching back to A. This last is also something we do in the rare case of a temporary failure. There's less justification for doing it than for leaving the backup program image alone, but having to completely retest the entire redundancy management system for a new configuration is generally avoided. If it fails hard, it doesn't really matter, since there's no Prime to switch back to.
Brett
I once had to fix a server some 6000 km away that had a corrupted disk. That meant running pdisk and modifying fstab over ssh, then rebooting. You just check and recheck to make sure you did it right and just hope you get a ping a few minutes later.
Can't imagine how these guys feel. 45 min ping and it isn't like they could ask someone to go turn it off and on again.
Good luck to the guys working on this.
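For anyone curious where a figure like "45 min ping" comes from, here is a back-of-the-envelope sketch of the round-trip light time. The two distances are approximate figures for the closest and farthest Earth-Mars separation and are assumptions for illustration; the calculation ignores relay hops and any processing time on either end.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # exact, by definition of the metre

# Assumed approximate Earth-Mars distances, in km.
CLOSEST_KM = 54.6e6
FARTHEST_KM = 401e6


def round_trip_minutes(distance_km: float) -> float:
    """Time for a signal to travel out and back, in minutes."""
    return 2 * distance_km / SPEED_OF_LIGHT_KM_S / 60


print(f"closest:  {round_trip_minutes(CLOSEST_KM):.1f} min")
print(f"farthest: {round_trip_minutes(FARTHEST_KM):.1f} min")
```

So roughly 45 minutes is about the worst case under these assumptions; the actual delay on any given day depends on where the two planets are in their orbits.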
Check out the official rover press kit for a summary of the computer design (http://mars.jpl.nasa.gov/msl/news/pdfs/MSLLanding.pdf) Page 42 in particular:
"Curiosity has redundant main computers, or rover compute elements. Of this “A” and “B” pair, it uses one at a time, with the spare held in cold backup. Thus, at a
given time, the rover is operating from either its “A” side or its “B” side. Most rover devices can be controlled by either side; a few components, such as the navigation camera, have side-specific redundancy themselves. The computer inside the rover — whichever side is active — also serves as the main computer for the rest of the Mars Science Laboratory spacecraft during the flight from Earth and arrival at Mars. In case the active computer resets for any reason during the critical minutes of entry, descent and landing, a software feature called “second chance” has been designed to enable the other side to promptly take control, and in most cases, finish the landing with a bare-bones version of entry, descent and landing instructions.
Each rover compute element contains a radiation-hardened central processor with PowerPC 750 architecture: a BAE RAD 750. This processor operates at up to 200 megahertz speed, compared with 20 megahertz speed of the single RAD6000 central processor in each of the Mars rovers Spirit and Opportunity. Each of Curiosity’s redundant computers has 2 gigabytes of flash memory (about eight times as much as Spirit or Opportunity), 256 megabytes of dynamic random access memory and 256 kilobytes of electrically erasable programmable read-only memory.
The Mars Science Laboratory flight software monitors the status and health of the spacecraft during all phases of the mission, checks for the presence of commands to execute, performs communication functions and controls spacecraft activities. The spacecraft was launched with software adequate to serve for the landing and for operations on the surface of Mars, as well as during the flight from Earth to Mars. The months after launch were used, as planned, to develop and test improved flight software versions. One upgraded version was sent to the spacecraft in May 2012 and installed onto its computers in May and June. This version includes improvements for entry, descent and landing. Another was sent to the spacecraft in June and will be installed on the rover’s computers a few days after landing, with improvements for driving the rover and using its robotic arm."
And according to a release they issued after landing, both computers receive the same updates and are running the same software (not a version or 2 behind like others have suggested): http://mars.jpl.nasa.gov/news/whatsnew/index.cfm?FuseAction=ShowNews&NewsID=1305
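Here is a toy model of the cold-backup arrangement described in the press kit excerpt above, purely to illustrate the idea. The class and method names are hypothetical, and the assumption that a side swap leaves the rover in safe mode comes from the story summary, not from any flight software documentation.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeElement:
    name: str
    powered: bool = False  # a cold backup is an unpowered spare


@dataclass
class Rover:
    sides: dict = field(default_factory=lambda: {
        "A": ComputeElement("A"),
        "B": ComputeElement("B"),
    })
    active: str = "A"
    safe_mode: bool = False

    def __post_init__(self):
        # Only the active side is powered; the other stays cold.
        self.sides[self.active].powered = True

    def swap_sides(self) -> str:
        """Power down the active side, bring up the spare, enter safe mode."""
        spare = "B" if self.active == "A" else "A"
        self.sides[self.active].powered = False
        self.sides[spare].powered = True
        self.active = spare
        self.safe_mode = True  # assumption: routine operations resume later
        return self.active


rover = Rover()
rover.swap_sides()
print(rover.active, rover.safe_mode)
```

Calling swap_sides() a second time would put the rover back on A, which is the move several comments here are asking about.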
They sent the update once, didn't they?
Wait till you are satisfied it worked, and shunt it over to computer B.
I'm fairly sure that they purposely keep the computers out of sync to avoid a single bug taking out both systems. If I recall, it actually has 3 computers: 2 of them have identical hardware and run different versions of the same software, and a 3rd computer is based on completely different hardware running yet another software package. Each system is able to assume command of the mission and issue updates to the other systems.