Mars Global Surveyor Died from Single Bad Command
wattsup writes "The LA Times reports that a single wrong command sent to the wrong computer address caused a cascade of events that led to the loss of the Mars Global Surveyor spacecraft last November. The command was an orientation instruction for the spacecraft's main communications antenna. The mistake caused a problem with the positioning of the solar power panels, which in turn caused one of the batteries to overheat, shutting down the solar power system and draining the batteries some 12 hours later. 'The review panel found the management team followed existing procedures in dealing with the problem, but those procedures were inadequate to catch the errors that occurred. The review also said the spacecraft's onboard fault-protection system failed to respond correctly to the errors. Instead of protecting the spacecraft, the programmed response made it worse.'"
Of course, these things do happen. All we can do is find out why, and stop it from happening again.
One bad command started the chain, but it needed a series of system failures to kill it. In other words, a slight misalignment of the solar panels (or whatever it was) may have been a necessary cause, but not sufficient. The thing needed a safe-mode that wasn't safe, and battery logic that failed to consider environmental variables. All the conditions lined up.
It's like saying that a mid-air collision occurred because two jetliners were assigned the same altitude and airway in opposite directions at the same time. Yeah, but A) how they got that assignment is kinda complicated and B) any number of traffic control and collision avoidance systems have to fail too.
That'll teach those NASA folks to stop just using "sudo" when a command doesn't work under regular user permissions...
It was the Tamil Tigers that hacked it, and inserted this insidious command! The threat of terrorists is everywhere! This would have been prevented if we had kept up the war on terror.
C:\>
Only three things are certain; death, taxes, and apocryphal quotations - Ben Franklin.
...martian spacecraft. The ultimate geek's meal.
Sorry, my mind's wandering.
From TFA: ".... That exposed one of the batteries to direct sunlight, causing it to overheat." So a small navigation error or a small mechanical failure could also have caused this thing to overheat. It should have been constructed more robustly.
... It actually makes me want to go over my implementation of an FSM (finite state machine, not flying spaghetti monster) for a program I'm working on. It's amazing how these errors mirror a software error: some small bug/hidden "feature" propagates into one or multiple big problems. Your system fails because of something as simple as an if statement.
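For anyone else reviewing their own FSM, here is a minimal sketch of a table-driven state machine that refuses any transition it was not explicitly given, so a stray event fails loudly instead of falling through an if statement. The state and event names are invented for illustration and are not taken from the MGS flight software.

```python
from enum import Enum, auto


class State(Enum):
    NOMINAL = auto()
    CONTINGENCY = auto()
    SAFE = auto()


# Hypothetical transition table: (current state, event) -> next state.
# Anything not listed here is treated as an error.
TRANSITIONS = {
    (State.NOMINAL, "limit_exceeded"): State.CONTINGENCY,
    (State.CONTINGENCY, "battery_hot"): State.SAFE,
    (State.CONTINGENCY, "recovered"): State.NOMINAL,
    (State.SAFE, "recovered"): State.NOMINAL,
}


def step(state, event):
    """Return the next state, rejecting any transition missing from the table."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event!r}") from None
```

Keeping the transitions in one table makes the unhandled cases visible at a glance, which is harder when the logic is spread over nested conditionals.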
It worked for a decade at a cost of a piddling $220 million, plus $20 million a year in upkeep. At a hair over $40 million a year, that's much, much less wasteful than most NASA missions. (Yeah, I suppose you could consider whether the return was worth it. Heh, who are we kidding -- did YOU get $40 million a year out of those desktop photos? I didn't.)
I propose that next time NASA spend $150 million on the construction phase, which is just a slush fund for defense contractors anyhow, and then issue the lethal command before launch. Then we'd save a decade's worth of upkeep costs and the $65 million launch budget. NASA could even have a $10 million prize going to the person who most creatively identified a possible fatal error, since that's the only fun part of these missions for people who aren't rocket scientists and we wouldn't want to skimp on it.
Help poke pirates in the eyepatch, arr.
It was a bad idea to begin with by adding the crash.exe file.
The preliminary official report is available here. The summary conclusions are:
* A modification to a spacecraft parameter, intended to update the High Gain Antenna's (HGA) pointing direction used for contingency operations, was mistakenly written to the incorrect spacecraft memory address in June 2006. The incorrect memory load resulted in the following unintended actions:
** Disabled the solar array positioning limits.
** Corrupted the HGA's pointing direction used during contingency operations.
* A command sent to MGS on November 2, 2006 caused the solar array to attempt to exceed its hardware constraint, which led the onboard fault protection system to place the spacecraft in a somewhat unusual contingency orientation.
* The spacecraft contingency orientation with respect to the sun caused one of the batteries to overheat.
* The spacecraft's power management software misinterpreted the battery over temperature as a battery overcharge and terminated its charge current.
* The spacecraft could not sufficiently recharge the remaining battery to support the electrical loads on a continuing basis.
* Spacecraft signals and all functions were determined to be lost within five to six orbits (ten-twelve hours) preventing further attempts to correct the situation.
* Due to loss of power, the spacecraft is assumed to be lost and all recovery operations ceased on January 28, 2007.
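The first conclusion above (a parameter written to the wrong memory address) is the kind of mistake a ground-side check might catch before uplink. The following is only a sketch under stated assumptions: the parameter names, addresses, and valid ranges are invented, and nothing here reflects how the actual MGS ground software worked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Param:
    address: int  # hypothetical memory address for this parameter
    low: float    # hypothetical lowest allowed value
    high: float   # hypothetical highest allowed value


# Invented parameter table, for illustration only.
PARAMS = {
    "hga_contingency_pointing": Param(address=0x1000, low=-180.0, high=180.0),
    "solar_array_limit": Param(address=0x1004, low=0.0, high=90.0),
}


def checked_write(memory, name, address, value):
    """Write value only if address belongs to the named parameter and value is in range."""
    param = PARAMS.get(name)
    if param is None:
        raise KeyError(f"unknown parameter {name!r}")
    if address != param.address:
        raise ValueError(
            f"{name} lives at {param.address:#x}, refusing write to {address:#x}"
        )
    if not (param.low <= value <= param.high):
        raise ValueError(f"{value} is outside [{param.low}, {param.high}] for {name}")
    memory[address] = value
```

With this in place, an antenna pointing update aimed at the solar array limit's address would be rejected and the memory image left untouched.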
"Goodness me, how unlike the FBI to abuse the trust of the American public." -- The Onion
In a tragic comedy of errors, NASA accidentally sends the Mars Global Surveyor a confirmation to execute "con/con". Microsoft explains that this will be patched in TerraWindows (TM), and for the moment their only suggestion is to "...do the Microsoft '1,2 shuffle'; sigh heavily and do a hard reboot..."
John Dvorak has been contacted as a possible candidate to go manually reboot the Surveyor, but has yet to accept the proposition.
*ducks*
If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.
The article mentions that a new round of global-warming may be taking place on Mars - does this lend any credence to the theory that global warming is an unavoidable solar event? Maybe Mars and Earth switch off and on in turns - making one hospitable to life while the other becomes a desolate barren wasteland. Maybe we all just need to move 35 Million Miles away.
Sometimes I feel like I need to.
Also, the Slashdot write-up says a 'wrong command' was sent 'to the wrong computer address'. It was the right command, to the wrong computer address. If you're going to just play 'telephone' entering stories, pay attention. You made it more complicated and wrong. Maybe you should go work for NASA; got some diapers and surgical tubing?
---
Diapers and surgical tubing!
Ace
... that NASA doesn't have an undo command? I guess they really have cut their budget.
Not the error itself, but the fact NASA was able to figure out what happened in such detail, when the spacecraft it happened to is not giving any diagnostic information and cannot be examined directly.
/sudo shutdown -h now sent instead of /sudo shutdown -r now
Donald 'Duck' Dunn: We had a band powerful enough to turn goat piss into gasoline.
Heh, seems I am not the only one to have a problem with the Linux/ACPI combo.
(I am kidding, honestly)
> Fuk Li, manager
Hehehe
(-1: Post disagrees with my already-settled worldview) is not a valid mod option.
Admittedly offtopic, but...
Somehow I find it reassuring that NASA employs someone called "Dolly Perkins". It has that warm cosy 1950s feeling of Golden Age Space Exploration. Now, if only we could get the astronauts named "Buck", "Rock", or "Trent".
In Soviet Russia perfect probe sends lens cap code back to you!
A wiki link to help with the lens part.
http://en.wikipedia.org/wiki/Venera_program
Domestic spying is now "Benign Information Gathering"
So, which scientific experiment would you remove in order to put additional heat shielding? No, the thermal shielding and other protection systems are just right for a spacecraft that had to travel a hundred million kilometers.
What really failed was the ground-based software, which didn't have a good enough thermal model, and the technical support team. Equipment may fail, operators may commit errors, but there should be enough experienced engineers around to do a correct analysis to catch those errors. Downgrading of the engineering team is the true problem here. Look at what happened to Columbia. It broke up on reentry because of damage that had happened at launch, which was caught on video but not analyzed correctly.
NASA isn't alone in these failures; perhaps one could say they set the pace for the rest of the industry. The lack of a good thermal model is typical of a whole generation of engineers used to doing everything in Excel. With the CPUs one has at each desktop today, it wouldn't be so hard to do a correct thermal model of the spacecraft, but it would mean solving a system of partial differential equations in C++, something very few engineers are able to do, even when given an extensive library.
Are they sure the computer didn't reposition the antenna deliberately
when that damned butterfly began to flap its wings.
[root@surveyor]# dd if=/dev/urandom of=/dev/solar_panels
boycott slashdot February 10th - 17th check out: altSlashdot.org
There sure are fewer MS jokes than I expected.
Guess they weren't aware of the recall on those Dell batteries.
http://nssdc.gsfc.nasa.gov/database/MasterCatalog?sc=1996-062A
I realize this was dirt cheap by space mission standards. A laptop encrusted with diamonds which costs $80,000 is dirt cheap by laptop-encrusted-with-diamonds standards. That *doesn't make it worth the money*. I know we waste far more than $40 million a year on many things -- and, logically, every one of them except one can be justified by "We waste more money on another program, don't cut *my* hobby horse!"
It's interesting that you draw the distinction between subsidies/entitlements and science, since NASA is a fairly naked subsidy directly to defense contractors, who make all of the really expensive bits. I'm all for giving Lockheed Martin money when it's required, but let's be honest and get ourselves something which blows up in a suitably impressive manner when we do, OK? Similarly, I might even be persuaded that the US federal government should fund science projects -- great, then *fund science*! Don't blow $160 million just to accelerate a tin can out of the atmosphere to get a few close up pictures of rocks. $160 million could fund an awful lot of real science down here, much of which would produce actual results (or, alternatively, you could fund research gazing into the Clear Blue Sky, which is *still* cheap when you do it somewhere in atmosphere).
Help poke pirates in the eyepatch, arr.
That command was:
win
D'oh!
NASA has been on this kick of doing quick, reduced-cost projects for some time now. They really have no choice since Congress will only give them funding for unmanned and low-cost missions.
So occasionally you get the stunning successes, e.g. the Mars rovers Spirit and Opportunity. Considering they were only supposed to last 90 sols and they're somewhere out to 1075 or more sols, it means that Steve Squyres is currently the star of NASA.
But more likely you get the devastating failures.
It's really sad that we blow a few billion a month on our little Iraq and Afghanistan ventures yet sciences take a back seat.
POKE 59458,PEEK(59458)OR 32
Mr. T pitied this fool on 27 July 1992.
Fuk Li, manager of the Mars exploration program at JPL, was reported to say, "Fuk Mi..."
...was it controlled by a Commodore PET?
There are an awful lot of posts here that disparage the people who have built and operated this system. To me it looked very much like the explanation for an aircraft accident. The easy failure modes are all known, so the really hard ones are left. In aircraft accidents, and it seems space accidents now too, a fatal result is generally the result of a number of seemingly disparate factors including system states, environmental state, and human impressions of what is going on.
In one major aircraft accident I know a lot about, the (Airbus) jet crashed in part because it ended up being a tug of war between a human pilot and a robot autopilot that should have been disengaged, causing an up-and-down roller coaster ride. There were lots of other distracting things that were maybe wrong or maybe not, but a key part was the difficulty in knowing what state the machine was in.
It was a similar situation with this accident, it seems, and though the misuse of metric units caused another recent accident, it appears that these incidents have elements in common. They are also made more probable, it strikes me, by funding pressures and by the way that operating these systems involves radical commands while the systems lack enough power to be self-aware enough to preserve themselves.
I am not going to do any more guessing because the people involved can probably figure it out themselves, and it seems that these combined factor accidents at least are not costing human lives, while they are adding to knowledge about how not to make the accident the next time.
In that regard my hope is that some of the money being spent on Mars can be used to improve autonomous robotic systems to reduce accidents both on Mars and on Earth.
Hal. Well, I don't think there is any question about it. It can only be attributable to human error. This sort of thing has cropped up before, and it has always been due to human error.
A modification to a spacecraft parameter, intended to update the High Gain Antenna's (HGA) pointing direction used for contingency operations, was mistakenly written to the incorrect spacecraft memory address in June 2006.
I am well aware that you can do some nifty things in VxWorks, but at some point, shouldn't you be using an OS [like QNX, Integrity, or, gasp!, WinCE] that offers a little more memory protection?
Especially if you're writing code in a language with pointers?
HLT
It sucks when you're hacking the firmware in your gadgets and brick them.
*It's not what you can do for the Dark Side but what the Dark Side can do for you!*
It is great they found out what caused the problem, but that isn't going to bring the craft back to life. And for all you people that comment here, they should've done this and that... get out of school, find a job in aeronautics and see if you can do it better, if you can they'll happily accept you and still, in your design, you'll make errors.
I have worked for different large and small companies, everybody makes mistakes. I've seen all connections for a large datacenter going down because somebody made a mistake updating a single firewall, I've seen professional cooling solution designers install a triple-redundant system for said datacenter which went down completely because the datacenter didn't produce enough heat yet and one of the regulators had a wrong offset which caused the cooling to freeze.
I have seen people insert commands in a mainframe which hung the whole thing and it took a few IBM engineers to start it up again.
Yes, people make mistakes and hopefully they'll learn from it. We shouldn't be outraged about it or fire them, because those mistakes are basically paid-for education. If you can do it better, they'll hire you, if you can't, STFU. Space is large, and those devices are just like servers, a single mistake can bring them down. The problem is the ping time is 60 minutes, so before you even get a response from a system, it's an hour later. The sun is a powerful source of energy and in an hour you can get sun-burnt in summer on earth, try sunbathing on the planet Mercury or just in outer space, and see the difference after an hour, that's what we're talking about.
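As a rough check on the signal delay mentioned above, here is a small calculation of the round-trip light time between Earth and Mars. The two distances are approximate textbook figures for closest and farthest separation, not numbers taken from the article, so treat the result as an order-of-magnitude sketch.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # defined value of c, in km/s

# Approximate Earth-Mars separations (assumed round figures, in km).
CLOSEST_KM = 54.6e6
FARTHEST_KM = 401e6


def round_trip_minutes(distance_km):
    """Time for a signal to reach the spacecraft and for the reply to come back."""
    one_way_seconds = distance_km / SPEED_OF_LIGHT_KM_S
    return 2 * one_way_seconds / 60


if __name__ == "__main__":
    print(f"closest:  {round_trip_minutes(CLOSEST_KM):.1f} min round trip")
    print(f"farthest: {round_trip_minutes(FARTHEST_KM):.1f} min round trip")
```

Either way the delay is many minutes, so nobody on the ground can react to a fault in anything like real time.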
Custom electronics and digital signage for your business: www.evcircuits.com
I am pointing this out simply for completeness. I normally ignore trolls, but this matter is sufficiently important to warrant a proper response. Before you start dismissing the world's scientists as incompetent, you could at least read the Wikipedia articles on the matter before assuming the vast majority of the scientific community and every meteorological institution on Earth are incompetent enough that they all fail to MEASURE the solar irradiance. Global warming has been vigorously discussed in the scientific community since the '80s, and weather forecasts have been around since god knows when. We have good records of how much energy the sun has been putting out (in many ways better than the temperature record). Here, have a look: http://en.wikipedia.org/wiki/Image:Solar-cycle-data.png
That is the solar irradiance over the last 35 years or so.
Now don't come and tell me about a time lag, because that graph stays fairly constant yet the rate of global warming has been accelerating (even when the solar radiation has been on the declining part of a cycle). Now try to explain how the planet's temperature can not only increase, but do so at a steadily increasing rate, while the solar irradiance remains constant or even decreases.
Oh, and just in case you are going to claim it reduces the CO2 content in the oceans by heating them...
http://en.wikipedia.org/wiki/Ocean_acidification
Furthermore, explain why the following shows a steady increase in perfect correlation to the rate of fossil fuel consumption and deforestation, yet doesn't show a single sudden peak at the dates of major volcanic eruptions:
http://en.wikipedia.org/wiki/Image:Mauna_Loa_Carbon_Dioxide.png
Furthermore, do explain why the following graph shows a fairly good correlation between temperature and solar activity while CO2 remained fairly constant, only for that correlation to break down completely once CO2 starts to really shoot off:
http://en.wikipedia.org/wiki/Image:Temp-sunspot-co2.svg
If you could also explain why solar variation, which should allegedly affect both Earth and Mars, could cause one of them to start heating up at a different time, and why some of the solar system's planets and moons have even experienced cooling during the same period, that would be nice too.
It would also be interesting to know why we would miss so heavily on CO2's potential as a greenhouse gas, given that the absorption spectrum of CO2 is known to several significant figures of accuracy and the spectrum of radiation emitted by the Earth has been carefully measured by satellites in orbit.
/sudo shutdown -h now
bash: /sudo: No such file or directory
That would not cause any problems whatsoever.
Wow, I take a commonly discussed 'question' about global warming and reference it, as if it, you know, is discussed by people, and I'm called a Troll.
All of your links to other web sites appear, to me, to be Trolling. Am I one of the scientists that are debating the causes of global warming? No. So I'm not going to look at your chosen data sets and do the math - I'm not qualified to. Is global warming a linear process? No, and all of the scientists agree on that. Until someone wants to prove it's caused by CO2 - then the data is linear, running along with the linear correlation between food production and global warming. Maybe producing food is causing global warming.
Why don't the MODS here think for themselves? I know I do when I'm a moderator. You should lose your account just for using 'poster is a troll' in the subject of your message. I'm not encouraging you to talk to me at all, believe me. I'm saying it's possible that the atmospheres and weather of Earth and Mars have some connection to each other. Maybe man-made global warming triggers the onset of a hospitable environment on Mars.
There I said it again, ban me.
Ace