Mars Global Surveyor Died from Single Bad Command
wattsup writes "The LA Times reports that a single wrong command sent to the wrong computer address caused a cascade of events that led to the loss of the Mars Global Surveyor spacecraft last November. The command was an orientation instruction for the spacecraft's main communications antenna. The mistake caused a problem with the positioning of the solar power panels, which in turn caused one of the batteries to overheat, shutting down the solar power system and draining the batteries some 12 hours later. 'The review panel found the management team followed existing procedures in dealing with the problem, but those procedures were inadequate to catch the errors that occurred. The review also said the spacecraft's onboard fault-protection system failed to respond correctly to the errors. Instead of protecting the spacecraft, the programmed response made it worse.'"
Of course, these things do happen. All we can do is find out why, and stop it from happening again.
The preliminary official report is available from here. The summary conclusions are:
* A modification to a spacecraft parameter, intended to update the High Gain Antenna's (HGA) pointing direction used for contingency operations, was mistakenly written to the incorrect spacecraft memory address in June 2006. The incorrect memory load resulted in the following unintended actions:
** Disabled the solar array positioning limits.
** Corrupted the HGA's pointing direction used during contingency operations.
* A command sent to MGS on November 2, 2006 caused the solar array to attempt to exceed its hardware constraint, which led the onboard fault protection system to place the spacecraft in a somewhat unusual contingency orientation.
* The spacecraft contingency orientation with respect to the sun caused one of the batteries to overheat.
* The spacecraft's power management software misinterpreted the battery over temperature as a battery overcharge and terminated its charge current.
* The spacecraft could not sufficiently recharge the remaining battery to support the electrical loads on a continuing basis.
* Spacecraft signals and all functions were determined to be lost within five to six orbits (ten-twelve hours) preventing further attempts to correct the situation.
* Due to loss of power, the spacecraft is assumed to be lost and all recovery operations ceased on January 28, 2007.
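The first bullet (a parameter written to the wrong memory address) is the kind of error a ground-side check can catch before uplink. Here is a minimal sketch of that idea in Python. The parameter names, addresses, and ranges are all invented for illustration; nothing here reflects the real MGS memory map or ground tooling.

```python
# Hypothetical parameter table. Names, addresses, and limits are made up
# for illustration and do not come from the MGS report.
PARAMETER_TABLE = {
    "hga_contingency_pointing": {"address": 0x1000, "min": -180.0, "max": 180.0},
    "solar_array_soft_limit": {"address": 0x1008, "min": 0.0, "max": 270.0},
}


class UploadRejected(Exception):
    """Raised when a memory load fails a pre-uplink check."""


def validate_upload(name, address, value):
    """Check that a named parameter is written to its own address, in range."""
    entry = PARAMETER_TABLE.get(name)
    if entry is None:
        raise UploadRejected(f"unknown parameter {name!r}")
    if address != entry["address"]:
        raise UploadRejected(
            f"{name} belongs at {entry['address']:#x}, not {address:#x}"
        )
    if not entry["min"] <= value <= entry["max"]:
        raise UploadRejected(f"{name} value {value} is out of range")
    return True
```

With this table, `validate_upload("hga_contingency_pointing", 0x1008, 45.0)` raises `UploadRejected` because the address belongs to a different parameter, while the same call with `0x1000` passes.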
"Goodness me, how unlike the FBI to abuse the trust of the American public." -- The Onion
Some temperature monitors on critical, exposed devices would also help. All you need is the CPU temperature diode present on just about every motherboard sold today.
I looked at the actual report on the NASA website; it said "the spacecraft's power management software misinterpreted the battery over temperature as a battery overcharge and terminated its charge current."
There was a temperature monitor on the critical, exposed component. Furthermore, the information from the sensor was used in a sensible manner: Li-poly/Li-ion batteries can catch fire under some circumstances (see also: Sony laptop batteries), so if your Li-poly battery overheats while being charged, you stop charging it (because you'd rather have a flat battery than an exploded battery).
After the craft stopped charging the battery it never started charging the battery again. The battery ran down and the craft stopped working.
The obvious question is: why didn't charging resume after the battery had cooled down? It might not have cooled down (as it was hot in the first place due to being exposed to the sun) or the system might have been waiting for a 'resume charging' command from ground control, which was never received as the high-gain antenna was in the wrong position.
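To make the "resume after cooling" idea concrete, here is a rough sketch of a charge controller that keeps over-temperature and full-charge as separate conditions and restarts charging once the battery drops below a lower threshold. This is purely my own illustration: the class name, thresholds, and logic are assumptions, not what the MGS power management software actually did.

```python
from dataclasses import dataclass


@dataclass
class ChargeController:
    # All thresholds are assumed values chosen for illustration.
    overtemp_c: float = 35.0   # stop charging at or above this temperature
    resume_c: float = 25.0     # resume only after cooling to this or below
    full_charge: float = 1.0   # state of charge treated as "full"
    charging: bool = True
    reason: str = "ok"

    def update(self, temp_c, charge_fraction):
        """Return True if charging should be on for this reading."""
        if temp_c >= self.overtemp_c:
            # Over-temperature is recorded as its own condition.
            self.charging, self.reason = False, "over_temperature"
        elif charge_fraction >= self.full_charge:
            self.charging, self.reason = False, "full"
        elif self.reason == "over_temperature" and temp_c > self.resume_c:
            # Still cooling down: stay off until below the resume threshold.
            self.charging = False
        else:
            self.charging, self.reason = True, "ok"
        return self.charging
```

The gap between `overtemp_c` and `resume_c` is there so the controller does not flap on and off around a single threshold. Of course, if the battery never cools because it is sitting in direct sun, this would not have helped either.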
Personally, if I were designing a spacecraft, I'd duplicate the (presumably quite small) onboard computer and radio hardware, because it seems quite common for software/electronics failures to result in loss of communications. Having two processors running different software, each capable of reprogramming the other one if it became broken, would seem like a sensible route to take.
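Something like the following toy sketch is what I have in mind, simplified to a mutual watchdog where each computer resets its peer if the peer's heartbeat goes stale. The `Computer` class, the timeout, and the tick-based clock are all hypothetical, and a plain reset stands in for the much harder job of reprogramming the other unit.

```python
class Computer:
    """Toy model of one of two redundant flight computers (hypothetical)."""

    def __init__(self, name, timeout=3):
        self.name = name
        self.timeout = timeout  # ticks of silence tolerated from the peer
        self.last_beat = 0      # time of the most recent heartbeat
        self.healthy = True
        self.resets = 0

    def tick(self, now):
        # A healthy computer emits a heartbeat every tick; a hung one does not.
        if self.healthy:
            self.last_beat = now

    def check_peer(self, peer, now):
        """Reset the peer if its heartbeat is older than the timeout."""
        if now - peer.last_beat > self.timeout:
            peer.reset(now)
            return True
        return False

    def reset(self, now):
        # Stand-in for reloading known-good software on this unit.
        self.healthy = True
        self.last_beat = now
        self.resets += 1
```

In use, both computers call `tick` and then `check_peer` on each other every cycle. If one hangs, the other notices once the silence exceeds `timeout` ticks and resets it.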
Just my $0.02.
"Goodness me, how unlike the FBI to abuse the trust of the American public." -- The Onion