Curiosity Rover On Standby As NASA Addresses Computer Glitch

Posted by samzenpus on Sunday March 3, 2013 @06:55AM from the fixing-the-glitch dept.

alancronin writes "NASA's Mars rover Curiosity has been temporarily put into 'safe mode,' as scientists monitoring from Earth try to fix a computer glitch, the US space agency said. Scientists switched to a backup computer Thursday so that they could troubleshoot the problem, said to be linked to a glitch in the original computer's flash memory. 'We switched computers to get to a standard state from which to begin restoring routine operations,' said Richard Cook of NASA's Jet Propulsion Laboratory, the project manager for the Mars Science Laboratory Project, which built and operates Curiosity."

28 of 98 comments

Glitch or flash memory failure? by AmiMoJo · 2013-03-03 07:03 · Score: 4, Interesting

Are we talking a temporary issue that can be resolved by re-flashing the memory in question or is one of the cells damaged in some un-recoverable way? Either way there are solutions but the latter is far more serious.

--
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
1. Re:Glitch or flash memory failure? by Anonymous Coward · 2013-03-03 07:05 · Score: 3, Funny
  
  Would hate to be the field service engineer for this one...
2. Re:Glitch or flash memory failure? by icebike · 2013-03-03 07:18 · Score: 5, Informative
  
  TFA pretty much covers this, saying they believe it is a problem in the flash memory.
  
  The computer problem is related to a glitch in flash memory on the A-side computer caused by corrupted memory files, Cook said. Scientists are still looking into the root cause of the corrupted memory, but it's possible the memory files were damaged by high-energy space particles called cosmic rays, which are always a danger beyond the protective atmosphere of Earth.
  They also say
  
  "We also want to look to see if we can make changes to software to immunize against this kind of problem in the future," Cook said.
  It seems that, since the same thing happened on one of the earlier rovers, this is something they would have done some time ago.
  They are now updating the B side computer so it can manage the mission while they work on the primary. I wonder why this is not something that is kept up to date anyway. I can see keeping B an update or two behind A to prevent a single programming error taking both of them down. But after you are satisfied with A's software load, why keep B so far back-level that transition takes so much time. And since the computers are said to be identical, why the desire to move back to A?
  
  --
  Sig Battery depleted. Reverting to safe mode.
3. Re:Glitch or flash memory failure? by Anonymous Coward · 2013-03-03 07:44 · Score: 2, Funny
  
  Because apt-get update takes a million years with that kind of latency?
4. Re:Glitch or flash memory failure? by Kjella · 2013-03-03 07:54 · Score: 5, Informative
  
  They are now updating the B side computer so it can manage the mission while they work on the primary. I wonder why this is not something that is kept up to date anyway. I can see keeping B an update or two behind A to prevent a single programming error taking both of them down. But after you are satisfied with A's software load, why keep B so far back-level that transition takes so much time. And since the computers are said to be identical, why the desire to move back to A?
  When are you "satisfied" with software like this? Imagine a fault that shows up as slow corruption, or only when a certain counter overflows. You don't want to be caught in a race against time to save the system before the B computer dies too. That is probably also why they want to move back to the A computer: if it can't run there, they no longer have a backup. A slightly reduced system with a backup is better for them than a B computer running with none. It does them no good to sit on the ground and say "well, we've figured out what happened and how we could have fixed it" after they've lost contact; by then it's game over. You don't run in "if it breaks, we're done here" mode unless you really, absolutely must.
  
  --
  Live today, because you never know what tomorrow brings
5. Re:Glitch or flash memory failure? by icebike · 2013-03-03 08:13 · Score: 2
  
  They sent the update once, didn't they?
  Wait till you are satisfied it worked, and shunt it over to computer B.
  
  --
  Sig Battery depleted. Reverting to safe mode.
6. Re:Glitch or flash memory failure? by icebike · 2013-03-03 08:15 · Score: 2
  
  If your scenario were real, then why are they updating B at this point?
  Clearly they are satisfied with what A was running before the memory fritz.
  
  --
  Sig Battery depleted. Reverting to safe mode.
7. Re:Glitch or flash memory failure? by Brett+Buck · 2013-03-03 08:15 · Score: 5, Interesting
  
  I wonder why this is not something that is kept up to date anyway. I can see keeping B an update or two behind A to prevent a single programming error taking both of them down. But after you are satisfied with A's software load, why keep B so far back-level that transition takes so much time. And since the computers are said to be identical, why the desire to move back to A?
  
  I can easily imagine this happening. I work on a very similar, perhaps nearly identical, spacecraft (that's just a tad more critical AND expensive than this thing...) and we haven't necessarily maintained this. You underestimate the overhead associated with generating the necessary uploads.
  The reason they probably want to go back to the Prime is that their failure isolation system database is keyed to using the prime units only, and altering it to start on the "B" side and switch back to "A" is prohibitive, or at least harder than simply switching back to A. This last is also something we do in the rare case of a temporary failure. There's less justification for doing that than for leaving the backup program image alone, but having to completely retest the entire redundancy management system for a new configuration is generally avoided. If it fails hard, it doesn't really matter, since there's no Prime to switch back to.
  Brett
8. Re:Glitch or flash memory failure? by instagib · 2013-03-03 09:23 · Score: 3, Insightful
  
  One can only hope that they have a C computer which will never be updated, and which can reset the rover to the initial state. Even if updates on A run fine for some time, experience in computing of the last decades shows that Murphy's Law is always lurking.
9. Re:Glitch or flash memory failure? by icebike · 2013-03-03 09:37 · Score: 3, Insightful
  
  Does every incident in the real world need a reference to a TV show?
  Are you sure you can't find an XKCD comic that would be more appropriate?
  
  --
  Sig Battery depleted. Reverting to safe mode.
10. Re:Glitch or flash memory failure? by Solandri · 2013-03-03 09:45 · Score: 2
  
  Glitched memory usually isn't a problem. Other spacecraft have had similar memory problems. Usually it's temporary. If it's permanent, the computers are programmed to map around the glitched memory or (back in the tape drive days) not use that segment of tape.
  
  The real danger is that such a glitch will first manifest itself by altering control or orientation instructions, breaking the spacecraft's contact with Earth. Most spacecraft are designed with a "safe mode" when this happens. If there's been no communication with Earth for x days, the main computer switches to a rudimentary instruction set or a second computer takes over, and tries to re-establish communications.
11. Re:Glitch or flash memory failure? by BradleyUffner · 2013-03-03 13:29 · Score: 5, Interesting
  
  They sent the update once, didn't they?
  Wait till you are satisfied it worked, and shunt it over to computer B.
  I'm fairly sure that they purposely keep the computers out of sync to avoid a single bug taking out both systems. If I recall, it actually has 3 computers: 2 with identical hardware that run different versions of the same software, and a 3rd based on completely different hardware running yet another software package. Each system is able to assume command of the mission and issue updates to the other systems.
12. Re:Glitch or flash memory failure? by tbird81 · 2013-03-03 16:58 · Score: 2
  
  Obligatory:
  Even XKCD haters didn't mind 695
  Quite frankly, while some of his work is meh, Randall's done good things for the geek/nerd community.
13. Re:Glitch or flash memory failure? by ScottMaxwell · 2013-03-03 17:53 · Score: 2
  
  They are now updating the B side computer so it can manage the mission while they work on the primary. I wonder why this is not something that is kept up to date anyway. I can see keeping B an update or two behind A to prevent a single programming error taking both of them down. But after you are satisfied with A's software load, why keep B so far back-level that transition takes so much time. And since the computers are said to be identical, why the desire to move back to A?
  They're running the same flight software, but the parameters are different. (A parameter might say, for example, how far to drive between autonomous visual odometry updates, or how big the bounding boxes around the arm should be when computing ChemCam laser safety.) There are thousands of these parameters, and they're not routinely kept up to date on the non-prime side (which has historically been the B side).
  And while the computers are identical, the non-cross-strapped equipment isn't. For example, the B-side rear HAZCAMs are exposed to more radiation, because of the DAN instrument, than the A-side rear HAZCAMs, and are therefore expected to degrade faster. Switching back to the A side is, generally, switching back to slightly better equipment.
  
  --
  
  ``Life results from the non-random survival of randomly varying replicators.'' -- Richard Dawkins
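
The parameter drift ScottMaxwell describes above can be pictured with a small sketch. The Python below is purely illustrative: the parameter names (`visodom_step_m`, `chemcam_bbox_margin_m`, `max_drive_m`) and their values are invented for this example and are not real MSL flight parameters. It shows only the general idea of comparing the prime side's parameter table against the backup's to find what would have to be uplinked before the backup could take over.

```python
def stale_parameters(prime, backup):
    """Return {name: (backup_value, prime_value)} for every parameter that is
    missing on the backup side or differs from the prime side's value."""
    return {
        name: (backup.get(name), value)
        for name, value in prime.items()
        if backup.get(name) != value
    }

# Hypothetical parameter tables (names and numbers are made up).
a_side = {"visodom_step_m": 1.0, "chemcam_bbox_margin_m": 0.15, "max_drive_m": 50.0}
b_side = {"visodom_step_m": 0.5, "chemcam_bbox_margin_m": 0.15}

updates = stale_parameters(a_side, b_side)
for name, (old, new) in sorted(updates.items()):
    print(f"{name}: {old} -> {new}")
```

With thousands of parameters, as the comment notes, the size of that difference is what makes bringing the non-prime side up to date slow.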
Re:Robust hardware by icebike · 2013-03-03 07:24 · Score: 5, Informative

Who else has a feeling that someone fitted in a module backwards?
Either that, or a dead cell or two.
Nobody who has read TFA has that feeling. Curiosity has been running since Aug. 6, 2012 on your putative "backwards module".

--
Sig Battery depleted. Reverting to safe mode.
And i still wonder by lesincompetent · 2013-03-03 07:26 · Score: 2

How does it feel to SSH with a ~45-minute delay? Wait: why bother encrypting? Just telnet!
Re:Robust hardware by ArchieBunker · 2013-03-03 07:35 · Score: 4, Informative

This may be of some interest http://www.cpushack.com/space-craft-cpu.html

--
Only the State obtains its revenue by coercion. - Murray Rothbard
Safe Mode with Networking by Miletos · 2013-03-03 07:45 · Score: 5, Funny

"NOOOO! I should've selected Safe mode WITH Networking!"
Gotta love Armchair Quarterbacks and their simple by shoottothrill · 2013-03-03 08:00 · Score: 3

solutions: "someone fitted a module in backwards", "Why didn't they simply do X instead of Y". Do you seriously think the people who work for NASA, the same ones who sent a group of men to the moon and back, rovers to Mars, Voyager to deep space, shuttles, space stations, et al., can be outthought by YOU? And from the comfort of your Lazy Boy chair, no less?! Wow. Surely you can respect that these men and women have given countless hours of thought, simulation, and planning to these missions. In fact, many have dedicated their lives to this engineering. Surely you can respect that those involved in this mission did not simply put the red wire where the blue one should have gone. That's sorta insulting. Peace out
Re:Robust hardware by 93+Escort+Wagon · 2013-03-03 08:04 · Score: 4, Funny

Nobody who has read TFA has that feeling.
You could actually explain it to him rather than choosing to go all holier than thou. Here, I'll do it for you.

Who else has a feeling that someone fitted in a module backwards? Either that, or a dead cell or two.
The A-side flux capacitor was somehow depolarized, perhaps by a cosmic ray impact event. They're hoping to fix it by reinitializing the quantum warp matrix.

--
#DeleteChrome
Remote fixes always a hair raiser by Celarent+Darii · 2013-03-03 08:23 · Score: 5, Interesting

I once had to fix a server some 6000 km away due to a corrupted disk. Doing pdisk and modifying fstab over ssh and then a reboot. You just check and recheck to make sure you did it right and just hope you get a ping a few minutes later.
Can't imagine how these guys feel. 45 min ping and it isn't like they could ask someone to go turn it off and on again.
Good luck to the guys working on this.
1. Re:Remote fixes always a hair raiser by evilviper · 2013-03-03 11:52 · Score: 3, Interesting
  
  I once had to fix a server some 6000 km away due to a corrupted disk. Doing pdisk and modifying fstab over ssh and then a reboot. You just check and recheck to make sure you did it right and just hope you get a ping a few minutes later.
  It's called out-of-band management. You can bring up a server from bare metal with no working OS installed. Damn near every server out there comes with at least IPMI, and often DRACs/iLOs/RSAs with some additional features. All you need to do is give the OoBM interface an IP address (perhaps a DHCP reservation) and you're good to go.
  Even if you're running on desktop-class hardware, you can still fake OoBM pretty well with a serial port. Linux/BSD/etc. will bring up the serial port as the console as soon as the bootloader starts, if configured to do so. And if the disk has failed, or your bootloader otherwise doesn't work, hopefully your BIOS is set to PXE boot, and your pxelinux configuration will give you a serial console as soon as that kicks in. Throw in magic SysRq to allow you to reboot a system that's not responding, and you've got something reasonably close to OoBM just about free. You could also supplement this with a watchdog timer and make things even more reliable.
  But as cheap as server-class hardware is, and given the ubiquity of IPMI, it's probably not worthwhile going the cheap route.
  
  --
  Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
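
The watchdog timer evilviper mentions at the end of the comment above can be modelled in a few lines. This is a minimal software sketch, not how any particular server or spacecraft implements it: real watchdogs are usually hardware timers, and the class name, callback, and 60-second timeout here are assumptions made for illustration. The idea is simply that healthy software must "pet" the timer regularly, and a missed deadline triggers a reset.

```python
import time

class Watchdog:
    """Software model of a watchdog timer: if pet() is not called within
    timeout_s seconds, the next check() invokes the on_expire callback."""

    def __init__(self, timeout_s, on_expire, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.on_expire = on_expire
        self.clock = clock          # injectable so the demo can fake time
        self.last_pet = clock()

    def pet(self):
        """Called by healthy software to show it is still alive."""
        self.last_pet = self.clock()

    def check(self):
        """Fire on_expire and return True if the deadline has passed."""
        if self.clock() - self.last_pet > self.timeout_s:
            self.on_expire()
            self.last_pet = self.clock()  # restart timing after the reset
            return True
        return False

# Demo with a fake clock so the example runs instantly.
now = [0.0]
resets = []
wd = Watchdog(timeout_s=60,
              on_expire=lambda: resets.append(now[0]),
              clock=lambda: now[0])

now[0] = 30.0
wd.pet()                 # software is healthy at t=30
now[0] = 80.0
first = wd.check()       # only 50 s since the last pet: no reset
now[0] = 100.0
second = wd.check()      # 70 s since the last pet: reset fires
```

Passing the clock in as a parameter is a design choice for testability; a real deployment would rely on the hardware timer rather than polling.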
Re:Gotta love Armchair Quarterbacks and their simp by MLCT · 2013-03-03 08:24 · Score: 3, Informative

On the whole I am sure everyone does respect NASA, but they do have "previous" on things far simpler than the random slashdotter obtuse suggestions:

Newtons or pound-force?

I don't think anyone is suggesting that simple mistakes were the cause in this case - but the above link may help explain why a little leg-pulling by slashdotters is not crossing any lines.

"Peace out"
Re:Gotta love Armchair Quarterbacks and their simp by Tablizer · 2013-03-03 08:57 · Score: 4, Interesting

The Galileo Jupiter atmosphere probe actually had a parachute-related part put on backward. It almost ruined the mission. They got lucky and the shaking from atmospheric drag eventually shook the high-altitude parachute off the bad lock barely in time before it could have damaged the probe.
Doesn't hurt to ask, although knowing more about the hardware may allow you to give more specific advice, such as "part X could be put in backward and still mostly work without early detection according to simulation Y."

--
Table-ized A.I.
The design is very robust by chalker · 2013-03-03 09:01 · Score: 5, Interesting

Check out the official rover press kit for a summary of the computer design (http://mars.jpl.nasa.gov/msl/news/pdfs/MSLLanding.pdf), page 42 in particular:
"Curiosity has redundant main computers, or rover compute elements. Of this “A” and “B” pair, it uses one at a time, with the spare held in cold backup. Thus, at a given time, the rover is operating from either its “A” side or its “B” side. Most rover devices can be controlled by either side; a few components, such as the navigation camera, have side-specific redundancy themselves. The computer inside the rover — whichever side is active — also serves as the main computer for the rest of the Mars Science Laboratory spacecraft during the flight from Earth and arrival at Mars. In case the active computer resets for any reason during the critical minutes of entry, descent and landing, a software feature called “second chance” has been designed to enable the other side to promptly take control, and in most cases, finish the landing with a bare-bones version of entry, descent and landing instructions.
Each rover compute element contains a radiation-hardened central processor with PowerPC 750 architecture: a BAE RAD 750. This processor operates at up to 200 megahertz speed, compared with 20 megahertz speed of the single RAD6000 central processor in each of the Mars rovers Spirit and Opportunity. Each of Curiosity’s redundant computers has 2 gigabytes of flash memory (about eight times as much as Spirit or Opportunity), 256 megabytes of dynamic random access memory and 256 kilobytes of electrically erasable programmable read-only memory.
The Mars Science Laboratory flight software monitors the status and health of the spacecraft during all phases of the mission, checks for the presence of commands to execute, performs communication functions and controls spacecraft activities. The spacecraft was launched with software adequate to serve for the landing and for operations on the surface of Mars, as well as during the flight from Earth to Mars. The months after launch were used, as planned, to develop and test improved flight software versions. One upgraded version was sent to the spacecraft in May 2012 and installed onto its computers in May and June. This version includes improvements for entry, descent and landing. Another was sent to the spacecraft in June and will be installed on the rover’s computers a few days after landing, with improvements for driving the rover and using its robotic arm."
And according to a release they issued after landing, both computers receive the same updates and are running the same software (not a version or 2 behind like others have suggested): http://mars.jpl.nasa.gov/news/whatsnew/index.cfm?FuseAction=ShowNews&NewsID=1305
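
The cold-backup arrangement quoted from the press kit above, where one side runs and the other takes over if the active side resets, can be sketched as a toy model. The class and method names below are hypothetical and the logic is deliberately simplistic; it is an assumption-laden illustration of the A/B swap, not a description of the actual flight software or of the "second chance" feature's internals.

```python
class RedundantComputers:
    """Toy model of an A/B pair where only one side is active at a time."""

    def __init__(self, active="A"):
        if active not in ("A", "B"):
            raise ValueError("side must be 'A' or 'B'")
        self.active = active
        self.log = []

    @property
    def standby(self):
        """The side currently held as the spare."""
        return "B" if self.active == "A" else "A"

    def handle_reset(self):
        """Model the active side resetting: the spare takes control."""
        failed = self.active
        self.active = self.standby
        self.log.append(f"{failed} reset; {self.active} now active")
        return self.active

rover = RedundantComputers(active="A")
rover.handle_reset()
print(rover.active, rover.standby)
```

Note that after the swap the formerly active side becomes the only spare, which is why several commenters above care about getting the A side healthy again.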
Re:In space cosmic ray excuse never gets old by Binestar · 2013-03-03 13:18 · Score: 3, Informative

Yeah... did you miss the part where it went to the redundant unit and sent an error to mission control? Sheesh.

--
Do you Gentoo!?
Re:Robust hardware by stackOVFL · 2013-03-03 13:18 · Score: 2

If only the good computer could perform the Jedi Mind Meld on the other!
Re:In space cosmic ray excuse never gets old by Chris+Burke · 2013-03-03 19:44 · Score: 4, Insightful

OK, let's assume a cosmic ray corrupted some random block of flash memory... so what? Why should that lead to failure to upload anything or enter sleep mode?
Pretty much any fault, error, or out-of-bounds reading with any part of the rover causes it to stop whatever it is doing and wait for ground control to check it out and decide what to do. If the fault is with the computer itself, it makes sense to gracefully enter safe mode. It probably was a cosmic ray flipping a random bit, but you can't assume that when designing your fault handler.

If it were any old PC app this would be perfectly acceptable behavior. However, for ultra-expensive spacefaring things I would expect it to be designed to still try and be useful even if the southbridge caught fire.
See, I think you have that backwards. If it were a PC app it would be appropriate to just assume the error was insignificant or more likely not bother checking in the first place. If it's a more serious problem then eventually the app or OS might crash, the user will reboot, and if that doesn't work reinstall, and if not that then they'll just go get some new hardware.
For a multi-billion-dollar rover on another planet, you don't want to just wait and see what happens. Any anomaly at all should be cause for cautious, deliberate action. Heck, the whole project is run that way.
The rover was designed with a lot of redundancy and flexibility so that it can be useful even in the face of more serious problems, and if that turns out to be the case they'll find a way to make the rover as useful as possible. Missing a couple of nights' worth of downloads and delaying some activities in order to take the time to make sure they're maximizing the rover's future potential is an easy tradeoff.
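
As a rough illustration of the "any anomaly means stop and wait for the ground" policy described above, here is a minimal Python sketch. The telemetry channel names and limits are invented for the example and are not real rover values, and leaving safe mode by ground command is deliberately not modelled.

```python
from enum import Enum

class Mode(Enum):
    NOMINAL = "nominal"
    SAFE = "safe"

# Hypothetical telemetry limits (low, high); values are made up.
LIMITS = {"battery_v": (24.0, 33.0), "cpu_temp_c": (-40.0, 50.0)}

def check_telemetry(readings, limits=LIMITS):
    """Return the channels that are missing or outside their limits."""
    faults = []
    for channel, (low, high) in limits.items():
        value = readings.get(channel)
        if value is None or not (low <= value <= high):
            faults.append(channel)
    return faults

def next_mode(current, readings):
    """Any fault forces safe mode; once safe, stay safe until the ground
    intervenes (the ground command path is not modelled here)."""
    if current is Mode.SAFE:
        return Mode.SAFE
    return Mode.SAFE if check_telemetry(readings) else Mode.NOMINAL

good = {"battery_v": 28.0, "cpu_temp_c": 20.0}
hot = {"battery_v": 28.0, "cpu_temp_c": 75.0}

mode = next_mode(Mode.NOMINAL, good)      # all readings in bounds
mode_after_fault = next_mode(mode, hot)   # one reading out of bounds
```

The handler never tries to guess whether a fault is harmless; treating a missing reading the same as a bad one reflects the point that the cause cannot be assumed when the handler is designed.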

--

The enemies of Democracy are