Mars Global Surveyor Died from Single Bad Command
wattsup writes "The LA Times reports that a single wrong command sent to the wrong computer address caused a cascade of events that led to the loss of the Mars Global Surveyor spacecraft last November. The command was an orientation instruction for the spacecraft's main communications antenna. The mistake caused a problem with the positioning of the solar power panels, which in turn caused one of the batteries to overheat, shutting down the solar power system and draining the batteries some 12 hours later. 'The review panel found the management team followed existing procedures in dealing with the problem, but those procedures were inadequate to catch the errors that occurred. The review also said the spacecraft's onboard fault-protection system failed to respond correctly to the errors. Instead of protecting the spacecraft, the programmed response made it worse.'"
Of course, these things do happen. All we can do is find out why, and stop it from happening again.
One bad command started the chain, but it needed a series of system failures to kill it. In other words, a slight misalignment of the solar panels (or whatever it was) may have been a necessary cause, but not sufficient. The thing needed a safe-mode that wasn't safe, and battery logic that failed to consider environmental variables. All the conditions lined up.
It's like saying that a mid-air collision occurred because two jetliners were assigned the same altitude and airway in opposite directions at the same time. Yeah, but A) how they got that assignment is kinda complicated and B) any number of traffic control and collision avoidance systems have to fail too.
That'll teach those NASA folks to stop just using "sudo" when a command doesn't work under regular user permissions...
It was the Tamil Tigers that hacked it, and inserted this insidious command! The threat of terrorists is everywhere! This would have been prevented if we had kept up the war on terror.
Only three things are certain; death, taxes, and apocryphal quotations - Ben Franklin.
From TFA: ".... That exposed one of the batteries to direct sunlight, causing it to overheat." So even a small navigation error or a small mechanical failure could already cause this thing to overheat. It should have been built more robustly.
It worked for a decade at a cost of a piddling $220 million, plus $20 million a year in upkeep. At a hair over $40 million a year, that's much, much less wasteful than most NASA missions. (Yeah, I suppose you could consider whether the return was worth it. Heh, who are we kidding -- did YOU get $40 million a year out of those desktop photos? I didn't.)
I propose that next time NASA spend $150 million on the construction phase, which is just a slush fund for defense contractors anyhow, and then issue the lethal command before launch. Then we'd save a decade's worth of upkeep costs and the $65 million launch budget. NASA could even have a $10 million prize going to the person who most creatively identified a possible fatal error, since that's the only fun part of these missions for people who aren't rocket scientists and we wouldn't want to skimp on it.
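For anyone checking the "hair over $40 million" figure, the arithmetic can be sketched as below. It assumes "a decade" means exactly 10 years and that the $220 million is the whole up-front cost; the function name is just for illustration.

```python
def average_annual_cost(upfront_musd, upkeep_per_year_musd, years):
    """Average yearly cost in millions of dollars over the mission lifetime."""
    total = upfront_musd + upkeep_per_year_musd * years
    return total / years

# Figures from the post: $220M up front, $20M a year, roughly 10 years.
print(average_annual_cost(220, 20, 10))
```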
Help poke pirates in the eyepatch, arr.
The preliminary official report is available from here. The summary conclusions are:
* A modification to a spacecraft parameter, intended to update the High Gain Antenna's (HGA) pointing direction used for contingency operations, was mistakenly written to the incorrect spacecraft memory address in June 2006. The incorrect memory load resulted in the following unintended actions:
** Disabled the solar array positioning limits.
** Corrupted the HGA's pointing direction used during contingency operations.
* A command sent to MGS on November 2, 2006 caused the solar array to attempt to exceed its hardware constraint, which led the onboard fault protection system to place the spacecraft in a somewhat unusual contingency orientation.
* The spacecraft contingency orientation with respect to the sun caused one of the batteries to overheat.
* The spacecraft's power management software misinterpreted the battery over temperature as a battery overcharge and terminated its charge current.
* The spacecraft could not sufficiently recharge the remaining battery to support the electrical loads on a continuing basis.
* Spacecraft signals and all functions were determined to be lost within five to six orbits (ten-twelve hours) preventing further attempts to correct the situation.
* Due to loss of power, the spacecraft is assumed to be lost and all recovery operations ceased on January 28, 2007.
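To make the first bullet concrete: a ground-side check that compares where an upload would land against where the named parameter is supposed to live could, in principle, flag this kind of mistake before it is sent. The sketch below is purely illustrative. The parameter names, addresses, sizes, and the validate_load helper are all invented here and say nothing about how the real MGS ground software or memory map worked.

```python
from dataclasses import dataclass

# Hypothetical parameter table: (start address, size in bytes).
# These values are made up for illustration, not taken from MGS.
PARAMETER_MAP = {
    "hga_contingency_pointing": (0x1000, 8),
    "solar_array_soft_limits": (0x1008, 8),
}


@dataclass(frozen=True)
class MemoryLoad:
    parameter: str  # what the operator intends to update
    address: int    # where the upload would actually be written
    data: bytes     # the bytes to be written


def validate_load(load):
    """Return a list of problems; an empty list means the load looks consistent."""
    if load.parameter not in PARAMETER_MAP:
        return [f"unknown parameter {load.parameter!r}"]

    problems = []
    expected_address, size = PARAMETER_MAP[load.parameter]
    if load.address != expected_address:
        problems.append(
            f"address {load.address:#x} does not match expected {expected_address:#x}"
        )
    if len(load.data) != size:
        problems.append(f"payload is {len(load.data)} bytes, expected {size}")

    # Flag any write that would land on top of a different parameter.
    start, end = load.address, load.address + len(load.data)
    for name, (addr, sz) in PARAMETER_MAP.items():
        if name != load.parameter and start < addr + sz and addr < end:
            problems.append(f"write would overwrite {name}")
    return problems


# An update aimed at the antenna pointing but written to the wrong address.
bad = MemoryLoad("hga_contingency_pointing", 0x1008, bytes(8))
for problem in validate_load(bad):
    print(problem)
```

A check like this only catches the mistake if the table itself is right, which is presumably part of why the review talks about procedures rather than a single missing test.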
"Goodness me, how unlike the FBI to abuse the trust of the American public." -- The Onion
Not the error itself, but the fact NASA was able to figure out what happened in such detail, when the spacecraft it happened to is not giving any diagnostic information and cannot be examined directly.
/sudo shutdown -h now sent instead of /sudo shutdown -r now
Donald 'Duck' Dunn: We had a band powerful enough to turn goat piss into gasoline.
Admittedly offtopic, but...
Somehow I find it reassuring that NASA employs someone called "Dolly Perkins". It has that warm cosy 1950's feeling of Golden Age Space Exploration. Now, if only we could get the astronauts named "Buck", "Rock", or "Trent".
*sniff*
You just made a beautifully appropriate commentary on a common fixture of my childhood. Dude.
In Soviet Russia perfect probe sends lens cap code back to you!
A wiki link to help with the lens part.
http://en.wikipedia.org/wiki/Venera_program
Domestic spying is now "Benign Information Gathering"
So, which scientific experiment would you remove in order to put additional heat shielding? No, the thermal shielding and other protection systems are just right for a spacecraft that had to travel a hundred million kilometers.
What really failed was the ground-based software, which didn't have a good enough thermal model, and the technical support team. Equipment may fail, operators may commit errors, but there should be enough experienced engineers around to do a correct analysis to catch those errors. Downgrading of the engineering team is the true problem here. Look at what happened to Columbia. It blew up on reentry because of a failure that had happened on take-off, was caught on video, but not analyzed correctly.
NASA isn't alone in these failures; perhaps one could say they set the pace for the rest of the industry. The lack of a good thermal model is typical of a whole generation of engineers used to doing everything in Excel. With the CPUs one now has on every desktop, it wouldn't be so hard to do a correct thermal model of the spacecraft, but it would mean solving a system of partial differential equations in C++, something very few engineers are able to do, even when given an extensive library.
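As a rough idea of what "solving a PDE" means at its very simplest, here is one-dimensional heat conduction along a rod using an explicit finite-difference step. It is a toy written in Python rather than C++, with made-up numbers and fixed-temperature ends. A real spacecraft thermal model would also need radiation, geometry, and changing sun angles, none of which is attempted here.

```python
def step_heat_1d(temps, alpha, dx, dt):
    """Advance a 1-D temperature profile by one explicit time step.

    temps: temperatures at evenly spaced points; the two endpoints are held fixed.
    alpha: thermal diffusivity, dx: grid spacing, dt: time step.
    """
    r = alpha * dt / dx ** 2
    if r > 0.5:
        # The explicit scheme is only stable for r <= 0.5.
        raise ValueError("unstable: alpha * dt / dx**2 must be <= 0.5")
    new = list(temps)
    for i in range(1, len(temps) - 1):
        new[i] = temps[i] + r * (temps[i - 1] - 2 * temps[i] + temps[i + 1])
    return new


def simulate(temps, alpha, dx, dt, steps):
    """Apply step_heat_1d repeatedly and return the final profile."""
    for _ in range(steps):
        temps = step_heat_1d(temps, alpha, dx, dt)
    return temps


# One end held hot, the rest starting cold; values are arbitrary.
profile = simulate([100.0, 0.0, 0.0, 0.0, 0.0], alpha=1.0, dx=1.0, dt=0.25, steps=50)
print(profile)
```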
Guess they weren't aware of the recall on those Dell batteries.
http://nssdc.gsfc.nasa.gov/database/MasterCatalog?sc=1996-062A
I realize this was dirt cheap by space mission standards. A laptop encrusted with diamonds which costs $80,000 is dirt cheap by laptop-encrusted-with-diamonds standards. That *doesn't make it worth the money*. I know we waste far more than $40 million a year on many things -- and, logically, every one of them except one can be justified by "We waste more money on another program, don't cut *my* hobby horse!"
It's interesting that you draw the distinction between subsidies/entitlements and science, since NASA is a fairly naked subsidy directly to defense contractors, who make all of the really expensive bits. I'm all for giving Lockheed Martin money when it's required, but let's be honest and get ourselves something which blows up in a suitably impressive manner when we do, OK? Similarly, I might even be persuaded that the US federal government should fund science projects -- great, then *fund science*! Don't blow $160 million just to accelerate a tin can out of the atmosphere to get a few close up pictures of rocks. $160 million could fund an awful lot of real science down here, much of which would produce actual results (or, alternatively, you could fund research gazing into the Clear Blue Sky, which is *still* cheap when you do it somewhere in atmosphere).
NASA has been on this kick of doing quick, reduced-cost projects for some time now. They really have no choice, since Congress will only give them funding for unmanned and low-cost missions.
So occasionally you get the stunning successes, e.g. the Mars rovers Spirit and Opportunity. Considering they were only supposed to last 90 sols and they're somewhere out to 1075 or more sols, it means that Steve Squyres is currently the star of NASA.
But more likely you get the devastating failures.
It's really sad that we blow a few billion a month on our little Iraq and Afghanistan ventures yet sciences take a back seat.
There are an awful lot of posts here that disparage the people who have built and operated this system. To me it looked very much like the explanation for an aircraft accident. The easy failure modes are all known, so the really hard ones are left. In aircraft accidents, and it seems space accidents now too, a fatal result is generally the result of a number of seemingly disparate factors including system states, environmental state, and human impressions of what is going on.
In one major aircraft accident I know a lot about, the (Airbus) jet crashed in part because it ended up being a tug of war between a human pilot and a robot autopilot that should have been disengaged, causing an up-and-down roller coaster ride. There were lots of other distracting things that were maybe wrong or maybe not, but a key part was the difficulty in knowing what state the machine was in.
It was a similar situation with this accident, it seems, and though the misuse of metric units caused another recent accident, it appears that these incidents have elements in common. They are also made more probable, it strikes me, by funding pressures and by the way that operating these systems involves radical commands while the systems lack enough power to be self-aware enough to preserve themselves.
I am not going to do any more guessing because the people involved can probably figure it out themselves, and it seems that these combined factor accidents at least are not costing human lives, while they are adding to knowledge about how not to make the accident the next time.
In that regard my hope is that some of the money being spent on Mars can be used to improve autonomous robotic systems to reduce accidents both on Mars and on Earth.