Mars Global Surveyor Died from Single Bad Command

Posted by Zonk on Friday April 13, 2007 @09:33PM from the damn-space-bugs dept.

wattsup writes "The LA Times reports that a single wrong command sent to the wrong computer address caused a cascade of events that led to the loss of the Mars Global Surveyor spacecraft last November. The command was an orientation instruction for the spacecraft's main communications antenna. The mistake caused a problem with the positioning of the solar power panels, which in turn caused one of the batteries to overheat, shutting down the solar power system and draining the batteries some 12 hours later. 'The review panel found the management team followed existing procedures in dealing with the problem, but those procedures were inadequate to catch the errors that occurred. The review also said the spacecraft's onboard fault-protection system failed to respond correctly to the errors. Instead of protecting the spacecraft, the programmed response made it worse.'"

15 of 141 comments

It wasn't a single wrong command by 91degrees · 2007-04-13 21:40 · Score: 4, Informative

It was a whole series of errors. Either that or every accident ever is caused by a single minor fault. Here's what the article says:
The review panel found that the management team followed procedures in dealing with the problem but that the procedures "were inadequate to catch the errors that occurred."

The review also said the spacecraft's onboard fault protection system failed to respond to the errors. Instead of protecting the spacecraft, the programmed response made it worse.
So, if the procedures were better, this wouldn't have happened. If the fault protection system was better, this wouldn't have happened. If the designers had predicted this exact problem might occur this wouldn't have happened.

Of course, these things do happen. All we can do is find out why, and stop it from happening again.
1. Re:It wasn't a single wrong command by roaddemon · 2007-04-13 23:57 · Score: 5, Funny
  
  "Either that or every accident ever is caused by a single minor fault."
  
  I agree. Otherwise WWII was caused by Hitler's mom having one too many drinks the night she met his dad.
2. Re:It wasn't a single wrong command by MichaelSmith · 2007-04-14 00:47 · Score: 4, Interesting
  
  "So, if the procedures were better, this wouldn't have happened. If the fault protection system was better, this wouldn't have happened. If the designers had predicted this exact problem might occur this wouldn't have happened."
  
  TFA:
  
  over the years budgets and staff had been cut "in an effort to operate the mission as economically as possible."
  
  MGS was well into bonus time in the sense that the original goals had been reached. The project was running on a reduced budget and this made a mistake inevitable. I can't help thinking that at a higher level this was considered to be a good thing. When you have new missions to run and a fixed budget to run them on you want your old missions to stop so that you can draw a line under it and go on to the next thing.
  
  The last thing management want is to have to decide to shut the spacecraft down because they don't have the budget for operations on the ground. Reducing the budget is a way of inducing the shutdown.
  
  --
  http://michaelsmith.id.au
Emmentaler vs. Gruyere by DingerX · 2007-04-13 21:42 · Score: 4, Insightful

One bad command started the chain, but it needed a series of system failures to kill it. In other words, a slight misalignment of the solar panels (or whatever it was) may have been a necessary cause, but not sufficient. The thing needed a safe-mode that wasn't safe, and battery logic that failed to consider environmental variables. All the conditions lined up.

It's like saying that a mid-air collision occurred because two jetliners were assigned the same altitude and airway in opposite directions at the same time. Yeah, but A) How they got that assignment is kinda complicated and B) any number of traffic control and collision avoidance systems have to fail too.
That'll Teach 'Em by Anonymous Coward · 2007-04-13 21:42 · Score: 5, Funny

That'll teach those NASA folks to stop just using "sudo" when a command doesn't work under regular user permissions...
1. Re:That'll Teach 'Em by Tibor the Hun · 2007-04-14 02:00 · Score: 4, Funny
  
  You've just lost thousands of Windows folks...
  su..do...?
  
  --
  If you don't know what AltaVista is (was), get off my lawn.
2. Re:That'll Teach 'Em by Anonymous Coward · 2007-04-14 02:42 · Score: 4, Funny
  
  ku!
3. Re:That'll Teach 'Em by Spudtrooper · 2007-04-14 04:07 · Score: 4, Funny
  
  Mars Global Surveyor wants to commit seppuku: Cancel or Allow?
Re:Bad command or filename by robably · 2007-04-13 21:56 · Score: 5, Funny

C:\>
You know, that looks like the emoticon for an egghead with a beard, frowning. Very appropriate.
The actual report by Mike1024 · 2007-04-13 22:10 · Score: 5, Informative

The preliminary official report is available from here. The summary conclusions are:

* A modification to a spacecraft parameter, intended to update the High Gain Antenna's (HGA) pointing direction used for contingency operations, was mistakenly written to the incorrect spacecraft memory address in June 2006. The incorrect memory load resulted in the following unintended actions:
** Disabled the solar array positioning limits.
** Corrupted the HGA's pointing direction used during contingency operations.
* A command sent to MGS on November 2, 2006 caused the solar array to attempt to exceed its hardware constraint, which led the onboard fault protection system to place the spacecraft in a somewhat unusual contingency orientation.
* The spacecraft contingency orientation with respect to the sun caused one of the batteries to overheat.
* The spacecraft's power management software misinterpreted the battery over temperature as a battery overcharge and terminated its charge current.
* The spacecraft could not sufficiently recharge the remaining battery to support the electrical loads on a continuing basis.
* Spacecraft signals and all functions were determined to be lost within five to six orbits (ten-twelve hours) preventing further attempts to correct the situation.
* Due to loss of power, the spacecraft is assumed to be lost and all recovery operations ceased on January 28, 2007.
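
The first item in that list (a parameter update landing at the wrong memory address) is the sort of thing a ground-side check can catch before uplink. Below is a minimal Python sketch of that idea, not the actual MGS ground or flight software. The parameter names, addresses, sizes, and the `checked_write` helper are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Parameter:
    name: str
    address: int  # start address in the (simulated) spacecraft memory
    size: int     # expected payload length in bytes


# Hypothetical parameter map. Names and addresses are made up for this sketch.
PARAMETER_MAP = {
    "hga_contingency_pointing": Parameter("hga_contingency_pointing", 0x1000, 8),
    "solar_array_soft_limits": Parameter("solar_array_soft_limits", 0x1008, 8),
}


class UploadRejected(Exception):
    """Raised when a memory load does not match the parameter map."""


def checked_write(memory: bytearray, name: str, address: int, payload: bytes) -> None:
    """Write payload only if the address and size match the named parameter."""
    param = PARAMETER_MAP.get(name)
    if param is None:
        raise UploadRejected(f"unknown parameter {name!r}")
    if address != param.address:
        raise UploadRejected(
            f"{name} belongs at {param.address:#06x}, not {address:#06x}"
        )
    if len(payload) != param.size:
        raise UploadRejected(
            f"{name} expects {param.size} bytes, got {len(payload)}"
        )
    memory[address:address + param.size] = payload


if __name__ == "__main__":
    memory = bytearray(0x2000)
    checked_write(memory, "hga_contingency_pointing", 0x1000, b"\x01" * 8)
    try:
        # Same payload aimed at the neighbouring parameter's address.
        checked_write(memory, "hga_contingency_pointing", 0x1008, b"\x01" * 8)
    except UploadRejected as err:
        print("rejected:", err)
```

The point of the design is that the operator states the intent (which parameter) and the address separately, so a mismatch between the two is rejected instead of silently overwriting a neighbouring value such as the solar array limits.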

--
"Goodness me, how unlike the FBI to abuse the trust of the American public." -- The Onion
Impressive by hcdejong · 2007-04-13 22:23 · Score: 4, Interesting

Not the error itself, but the fact NASA was able to figure out what happened in such detail, when the spacecraft it happened to is not giving any diagnostic information and cannot be examined directly.
Re:*Design* flaw by Mike1024 · 2007-04-13 22:24 · Score: 4, Informative

"Some temperature monitors on critical, exposed devices would also help. All you need is the CPU temperature diode present on just about every motherboard sold today."

I looked at the actual report on the NASA website; it said "the spacecraft's power management software misinterpreted the battery over temperature as a battery overcharge and terminated its charge current."

There was a temperature monitor on the critical, exposed component. Furthermore, the information from the sensor was used in a sensible manner: Li-poly/Li-ion batteries can catch fire under some circumstances (see also: Sony laptop batteries), so if your Li-poly battery overheats while being charged you stop charging it (because you'd rather have a flat battery than an exploded battery).

After the craft stopped charging the battery it never started charging the battery again. The battery ran down and the craft stopped working.

The obvious question is: why didn't charging resume after the battery had cooled down? It might not have cooled down (as it was hot in the first place due to being exposed to the sun) or the system might have been waiting for a 'resume charging' command from ground control, which was never received as the high-gain antenna was in the wrong position.
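
To make the "why didn't charging resume" question concrete, here is a small Python sketch of a charge controller that treats over-temperature and full charge as separate states and resumes on its own once the battery cools. It is a guess at what such logic could look like, not a description of the MGS power management software. The thresholds, state names, and the `next_state` function are invented for illustration.

```python
from enum import Enum


class ChargeState(Enum):
    CHARGING = "charging"
    PAUSED_OVERTEMP = "paused_overtemp"  # temporary, clears on cooldown
    STOPPED_FULL = "stopped_full"        # battery reports full charge


# Hypothetical thresholds, chosen only to make the example readable.
MAX_TEMP_C = 45.0      # pause charging at or above this temperature
RESUME_TEMP_C = 35.0   # resume only once the battery is at or below this
FULL_CHARGE = 1.0      # state of charge as a fraction of capacity


def next_state(state: ChargeState, temp_c: float, charge: float) -> ChargeState:
    """Return the next charge state from the current state and sensor readings."""
    if temp_c >= MAX_TEMP_C:
        return ChargeState.PAUSED_OVERTEMP
    # Hysteresis: stay paused until the battery has cooled well below the limit.
    if state is ChargeState.PAUSED_OVERTEMP and temp_c > RESUME_TEMP_C:
        return ChargeState.PAUSED_OVERTEMP
    if charge >= FULL_CHARGE:
        return ChargeState.STOPPED_FULL
    return ChargeState.CHARGING


if __name__ == "__main__":
    state = ChargeState.CHARGING
    for temp_c, charge in [(50.0, 0.6), (40.0, 0.6), (30.0, 0.6)]:
        state = next_state(state, temp_c, charge)
        print(temp_c, charge, state.value)
```

Because the over-temperature pause is its own state, a hot battery is never mistaken for a full one, and no ground command is needed to restart charging once the temperature drops.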

Personally, if I was designing a spacecraft I'd duplicate the (presumably quite small) onboard computer and radio hardware, because it seems quite common for software/electronics failures to result in loss of communications. Having two processors running different software, each capable of reprogramming the other one if it became broken, would seem like a sensible route to take.

Just my $0.02.

--
"Goodness me, how unlike the FBI to abuse the trust of the American public." -- The Onion
wrong parameter? by advocate_one · 2007-04-13 22:26 · Score: 4, Funny

/sudo shutdown -h now sent instead of /sudo shutdown -r now

--
Donald 'Duck' Dunn: We had a band powerful enough to turn goat piss into gasoline.
Re:Bad command or filename by Dasher42 · 2007-04-14 00:08 · Score: 4, Funny

*sniff*

You just made a beautifully appropriate commentary on a common fixture of my childhood. Dude.
Typical multiple-factor catastrophe by mattr · 2007-04-14 03:45 · Score: 4, Interesting

There are an awful lot of posts here that disparage the people who have built and operated this system. To me it looked very much like the explanation for an aircraft accident. The easy failure modes are all known, so the really hard ones are left. In aircraft accidents, and it seems space accidents now too, a fatal result is generally the result of a number of seemingly disparate factors including system states, environmental state, and human impressions of what is going on.

In one major aircraft accident I know a lot about, the (Airbus) jet crashed in part because it ended up being a tug of war between a human pilot and a robot autopilot that should have been disengaged, causing an up-and-down roller coaster ride. There were lots of other distracting things that were maybe wrong or maybe not, but a key part was the difficulty in knowing what state the machine was in.

It was a similar situation with this accident, it seems, and though the misuse of metric units caused another recent accident, it appears that these incidents have elements in common. They are also made more probable, it strikes me, by funding pressures and by the way that operating these systems involves radical commands while the systems lack enough power to be self-aware enough to preserve themselves.

I am not going to do any more guessing because the people involved can probably figure it out themselves, and it seems that these combined factor accidents at least are not costing human lives, while they are adding to knowledge about how not to make the accident the next time.

In that regard my hope is that some of the money being spent on Mars can be used to improve autonomous robotic systems to reduce accidents both on Mars and on Earth.