Curiosity Rover On Standby As NASA Addresses Computer Glitch
alancronin writes "NASA's Mars rover Curiosity has been temporarily put into 'safe mode,' as scientists monitoring from Earth try to fix a computer glitch, the US space agency said. Scientists switched to a backup computer Thursday so that they could troubleshoot the problem, said to be linked to a glitch in the original computer's flash memory. 'We switched computers to get to a standard state from which to begin restoring routine operations,' said Richard Cook of NASA's Jet Propulsion Laboratory, the project manager for the Mars Science Laboratory Project, which built and operates Curiosity."
Are we talking about a temporary issue that can be resolved by re-flashing the memory in question, or is one of the cells damaged in some unrecoverable way? Either way there are solutions, but the latter is far more serious.
Who actually fabs the chips and circuit boards used by NASA? I'm guessing these flash modules are not your typical SanDisk variety. Also, do they plan on wiping the flash memory from the secondary computer, mapping bad cells, and reloading the software from scratch? Hopefully these flash modules don't suffer from any systemic issues.
Nothing is happening.
How does it feel to SSH with a ~45-minute delay? Wait: why bother encrypting? Just telnet!
"NOOOO! I should've selected Safe mode WITH Networking!"
solutions: "someone fitted a module in backwards" "Why didn't they simply do X instead of Y" Do you seriously think the people who work for NASA, the same ones who sent a group of men to the moon and back, rovers to Mars, Voyager to deep space, shuttles, space stations, et al., can be out-thought by YOU? And from the comfort of your Lazy Boy chair, no less?! Wow. Surely you can come to respect that these men and women have given countless hours of thought, simulation, and planning to these missions. In fact, many have dedicated their lives to this engineering. Surely you can respect that those involved in this mission did not simply put the red wire where the blue one should have gone. That's sorta insulting. Peace out
I once had to fix a server some 6000 km away due to a corrupted disk. Doing pdisk and modifying fstab over ssh and then a reboot. You just check and recheck to make sure you did it right and just hope you get a ping a few minutes later.
Can't imagine how these guys feel. 45 min ping and it isn't like they could ask someone to go turn it off and on again.
Good luck to the guys working on this.
On the whole I am sure everyone does respect NASA, but they do have "previous" on things far simpler than the random Slashdotter's obtuse suggestions:
Newtons or pound-force?
I don't think anyone is suggesting that simple mistakes were the cause in this case - but the above link may help explain why a little leg-pulling by slashdotters is not crossing any lines.
"Peace out"
Didn't Apple just disable Flash again? Coincidence? Knew they should have turned off remote updating on the Rover's Mac Mini.
The Galileo Jupiter atmosphere probe actually had a parachute-related part put on backward. It almost ruined the mission. They got lucky and the shaking from atmospheric drag eventually shook the high-altitude parachute off the bad lock barely in time before it could have damaged the probe.
Doesn't hurt to ask, although knowing more about the hardware may allow you to give more specific advice, such as "part X could be put in backward and still mostly work without early detection according to simulation Y."
Check out the official rover press kit for a summary of the computer design (http://mars.jpl.nasa.gov/msl/news/pdfs/MSLLanding.pdf) Page 42 in particular:
"Curiosity has redundant main computers, or rover compute elements. Of this “A” and “B” pair, it uses one at a time, with the spare held in cold backup. Thus, at a given time, the rover is operating from either its “A” side or its “B” side. Most rover devices can be controlled by either side; a few components, such as the navigation camera, have side-specific redundancy themselves. The computer inside the rover — whichever side is active — also serves as the main computer for the rest of the Mars Science Laboratory spacecraft during the flight from Earth and arrival at Mars. In case the active computer resets for any reason during the critical minutes of entry, descent and landing, a software feature called “second chance” has been designed to enable the other side to promptly take control, and in most cases, finish the landing with a bare-bones version of entry, descent and landing instructions.
Each rover compute element contains a radiation-hardened central processor with PowerPC 750 architecture: a BAE RAD 750. This processor operates at up to 200 megahertz speed, compared with 20 megahertz speed of the single RAD6000 central processor in each of the Mars rovers Spirit and Opportunity. Each of Curiosity’s redundant computers has 2 gigabytes of flash memory (about eight times as much as Spirit or Opportunity), 256 megabytes of dynamic random access memory and 256 kilobytes of electrically erasable programmable read-only memory.
The Mars Science Laboratory flight software monitors the status and health of the spacecraft during all phases of the mission, checks for the presence of commands to execute, performs communication functions and controls spacecraft activities. The spacecraft was launched with software adequate to serve for the landing and for operations on the surface of Mars, as well as during the flight from Earth to Mars. The months after launch were used, as planned, to develop and test improved flight software versions. One upgraded version was sent to the spacecraft in May 2012 and installed onto its computers in May and June. This version includes improvements for entry, descent and landing. Another was sent to the spacecraft in June and will be installed on the rover’s computers a few days after landing, with improvements for driving the rover and using its robotic arm."
And according to a release they issued after landing, both computers receive the same updates and are running the same software (not a version or 2 behind like others have suggested): http://mars.jpl.nasa.gov/news/whatsnew/index.cfm?FuseAction=ShowNews&NewsID=1305
Do you seriously think the people who work for NASA, the same ones who sent a group of men to the moon and back
Not the same people.
The current crop of "scientists" and "engineers" (terms used very loosely here) at NASA is nothing like the group of scientists and engineers that put men on the moon 50 years ago.
OK, let's assume a cosmic ray corrupted some random block of flash memory... so what? Why should that lead to a failure to upload anything or enter sleep mode?
I can only assume there is an integrity check on block-level I/O from flash, so it did not just load garbage without knowing it. If it were any old PC app this would be perfectly acceptable behavior.
However, for ultra-expensive spacefaring things I would expect it to be designed to still try and be useful even if the southbridge caught fire.
Noooooo!
probably runs Java. I say 'let it crash'.
E.E. / firmware guy here... Corrupted files, and so far not even a mention of a potential software problem?
Even considering radiation out in space, it seems that it's still easier to get the hardware right than the software / firmware. I'm not jumping to any conclusions, but my first guess would be some kind of rare and unanticipated race condition, or some rarely-executed leg of the filesystem / logging software, etc.
I'm probably cynical from having worked on lots of safety-critical systems for a while, but it often seems convenient to throw alpha particles under the bus and not even question, however gently, whether there is some latent, obscure bug which just crapped all over the flash.
Except it *isn't* the same team that sent people to the moon and back... not even close. 1/3 of the Apollo astronauts are dead, and probably roughly the same proportion of the team which worked on the ground as well.
The space shuttle was a monumental failure compared to its stated objectives, unlike Apollo, so now you have a culture of abject failure predominating.
Don't believe me? Ask yourself, how many people actually died going to the moon vs the space shuttle fiasco?
NASA spoke of a historic announcement or something a few months ago.
Is that it? Do not use TLC flash memory in an expensive piece of hardware?
Well, Duh!
What about that earth-shaking discovery Curiosity made, the one that was supposed to go down in history? I still haven't heard about it. Here's the link http://science.slashdot.org/story/12/11/20/1511232/what-earth-shaking-discovery-has-curiosity-made-on-mars
Sometimes 'nothing' is news. As in this case. Apparently the locals want more bloviation threads, and not news threads. I'm sure that they will get their wish.
Would hate to be the field service engineer for this one...
Are you kidding? Trip to Mars and back, on the company dime?
The SkyMiles alone would be worth it!
There's a whole sequence of watchdog-like things. There's a watchdog in the flight computer, and there are things like hardware command decoders (see if a particular sequence of bits comes over the radio, and push the reset button). Also, most spacecraft (MSL included) have hardware that deals with command uplink loss timeout (we haven't heard a command in N days, time to go into safe mode).
Bear in mind that "safe mode" reverts to VERY slow data rates over the radio link (10 bps kind of range) on the assumption that no matter what, you can get some bits through.
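To make the "haven't heard a command in N days" idea concrete, here is a minimal sketch of a command-loss timer. It is only an illustration: the class name, the method names, and the 3-day timeout are made up for this example and are not taken from MSL's flight software or hardware.

```python
class CommandLossTimer:
    """Hypothetical command-loss timer (illustrative, not MSL's design)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s   # assumed silent period before safing
        self.last_command_s = 0.0    # time of the most recent valid command
        self.safe_mode = False       # latched once the timer expires

    def command_received(self, now_s):
        # Any valid uplinked command restarts the countdown.
        self.last_command_s = now_s

    def tick(self, now_s):
        # Called periodically; latches safe mode once the timeout elapses.
        if now_s - self.last_command_s >= self.timeout_s:
            self.safe_mode = True
        return self.safe_mode


DAY_S = 86_400
timer = CommandLossTimer(timeout_s=3 * DAY_S)  # "N days" with N = 3, assumed
timer.command_received(now_s=1_000)
print(timer.tick(now_s=1_000 + 1 * DAY_S))  # still within the window
print(timer.tick(now_s=1_000 + 3 * DAY_S))  # timeout reached, safe mode latched
```

In this sketch safe mode is latched: a later command resets the countdown but does not clear the flag, on the assumption that leaving safe mode should be an explicit ground decision.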
Yes, there is error detection and correction (EDAC) logic on flash and RAM, but that doesn't mean a) you didn't have a bad day and get a multiple-bit error, or b) that some other software bug isn't making it seem like there's bad memory, or c) that you haven't found some sort of obscure hardware bug.
Back in the 80s, I worked on a system that would apparently throw double-bit errors on DRAM cards every couple of days. MUCH too frequently given the observed single-bit error rate. Turns out it was the bus interface logic and some timing problems. Stuff like this happens.
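For anyone wondering why a double-bit error is a different animal from a single-bit one, here is a toy single-error-correct, double-error-detect (SECDED) code: an extended Hamming(8,4) code protecting one 4-bit nibble. This is a textbook sketch, not the EDAC scheme actually used on the rover (which the thread does not describe); the function names are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (a list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d      # data bits at non-power-of-two positions
    code[1] = code[3] ^ code[5] ^ code[7]       # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]       # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]       # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2                 # overall parity over the whole word
    return code


def decode(code):
    """Return (nibble, status); nibble is None when the error is uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                     # XOR of the positions of set bits
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat as a single-bit error and fix it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code = code[:]
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
one_flip = word[:]
one_flip[5] ^= 1
two_flips = word[:]
two_flips[2] ^= 1
two_flips[6] ^= 1
print(decode(word), decode(one_flip), decode(two_flips))
```

The point of the toy: one flipped bit is silently repaired, two flipped bits can only be flagged, and three or more can be miscorrected into wrong data that looks fine.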
Yes, we're coming into solar conjunction, and for the month of April, give or take, communicating with anything at Mars is difficult, because you have to point your antenna at the sun, which is noisy and blocks/distorts the radio signals. Mars's orbit is inclined about 1-2 degrees relative to Earth's, so at best, it's 2 degrees away from the sun. The Sun's corona extends at least that far, and is a fine radio signal distorter.
Not only that, but we're also about as far apart as you can get (2.5 AU), so that reduces the maximum data rate substantially (vs closest at 0.5 AU).
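A rough back-of-the-envelope for those numbers, under two loudly stated assumptions: received signal power falls off as the inverse square of distance, and the achievable data rate scales with received power with everything else (antennas, transmit power, noise) held equal. Solar noise near the sun is ignored here, so the real picture is worse. The function names are made up for this example.

```python
AU_KM = 149_597_870.7    # kilometres per astronomical unit
C_KM_S = 299_792.458     # speed of light in km/s


def one_way_light_time_min(distance_au):
    """One-way signal travel time in minutes for a given distance in AU."""
    return distance_au * AU_KM / C_KM_S / 60.0


def relative_data_rate(distance_au, reference_au):
    """Data rate relative to the rate at reference_au, assuming inverse-square scaling."""
    return (reference_au / distance_au) ** 2


far_au, near_au = 2.5, 0.5   # the distances quoted above
print(f"rate at {far_au} AU vs {near_au} AU: {relative_data_rate(far_au, near_au):.2f}x")
print(f"one-way light time at {far_au} AU: {one_way_light_time_min(far_au):.1f} min")
print(f"round trip at {far_au} AU: {2 * one_way_light_time_min(far_au):.1f} min")
```

Under those assumptions the far case gets about 1/25 of the close-approach rate, and the round trip at 2.5 AU works out to a bit over 40 minutes.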
I would imagine there's a fair amount of late night work going on to get this all resolved in the next couple weeks.
Ask yourself, how many people actually died going to the moon vs the space shuttle fiasco?
3 vs. 14? Is that your point? Well, you might have a point. 3 during a simulation and 14 due to PR and marketing bullshit.
Capcha: Counters