Slashdot Mirror


Mars Global Surveyor Died from Single Bad Command

wattsup writes "The LA Times reports that a single wrong command sent to the wrong computer address caused a cascade of events that led to the loss of the Mars Global Surveyor spacecraft last November. The command was an orientation instruction for the spacecraft's main communications antenna. The mistake caused a problem with the positioning of the solar power panels, which in turn caused one of the batteries to overheat, shutting down the solar power system and draining the batteries some 12 hours later. 'The review panel found the management team followed existing procedures in dealing with the problem, but those procedures were inadequate to catch the errors that occurred. The review also said the spacecraft's onboard fault-protection system failed to respond correctly to the errors. Instead of protecting the spacecraft, the programmed response made it worse.'"

9 of 141 comments

  1. It wasn't a single wrong command by 91degrees · · Score: 4, Informative
    It was a whole series of errors. Either that or every accident ever is caused by a single minor fault. Here's what the article says:

    The review panel found that the management team followed procedures in dealing with the problem but that the procedures "were inadequate to catch the errors that occurred."

    The review also said the spacecraft's onboard fault protection system failed to respond to the errors. Instead of protecting the spacecraft, the programmed response made it worse.
    So, if the procedures were better, this wouldn't have happened. If the fault protection system was better, this wouldn't have happened. If the designers had predicted this exact problem might occur this wouldn't have happened.

    Of course, these things do happen. All we can do is find out why, and stop it from happening again.
    1. Re:It wasn't a single wrong command by DerekLyons · · Score: 2, Informative

      The last thing management want is to have to decide to shut the spacecraft down because they don't have the budget for operations on the ground.

      A nice theory, but one that fails to coincide with the facts. NASA routinely shuts down missions for lack of budget.
  2. The actual report by Mike1024 · · Score: 5, Informative

    The preliminary official report is available from here. The summary conclusions are:

    * A modification to a spacecraft parameter, intended to update the High Gain Antenna's (HGA) pointing direction used for contingency operations, was mistakenly written to the incorrect spacecraft memory address in June 2006. The incorrect memory load resulted in the following unintended actions:
    ** Disabled the solar array positioning limits.
    ** Corrupted the HGA's pointing direction used during contingency operations.
    * A command sent to MGS on November 2, 2006 caused the solar array to attempt to exceed its hardware constraint, which led the onboard fault protection system to place the spacecraft in a somewhat unusual contingency orientation.
    * The spacecraft contingency orientation with respect to the sun caused one of the batteries to overheat.
    * The spacecraft's power management software misinterpreted the battery over temperature as a battery overcharge and terminated its charge current.
    * The spacecraft could not sufficiently recharge the remaining battery to support the electrical loads on a continuing basis.
    * Spacecraft signals and all functions were determined to be lost within five to six orbits (ten-twelve hours) preventing further attempts to correct the situation.
    * Due to loss of power, the spacecraft is assumed to be lost and all recovery operations ceased on January 28, 2007.
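
    The chain above starts with a raw write landing at the wrong address. A minimal sketch of one possible ground-side guard follows, assuming a table that maps named parameters to addresses. Every name, address and size here is invented for illustration; this is not how the MGS flight or ground software worked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Param:
    name: str
    address: int      # hypothetical address, invented for this sketch
    size: int         # size in bytes
    protected: bool = False


# Hypothetical memory map; nothing here comes from the MGS report.
MEMORY_MAP = {
    "hga_contingency_pointing": Param("hga_contingency_pointing", 0x1000, 8),
    "solar_array_soft_limits": Param("solar_array_soft_limits", 0x1008, 8, protected=True),
}


class UploadRejected(Exception):
    """Raised when a direct memory write fails a pre-upload check."""


def checked_write(memory: bytearray, name: str, address: int, data: bytes) -> None:
    """Write data only if the address and size match the named parameter."""
    param = MEMORY_MAP.get(name)
    if param is None:
        raise UploadRejected(f"unknown parameter {name!r}")
    if param.protected:
        raise UploadRejected(f"{name} is protected against direct writes")
    if address != param.address:
        raise UploadRejected(
            f"{name} lives at {param.address:#x}, not {address:#x}"
        )
    if len(data) != param.size:
        raise UploadRejected(f"{name} expects {param.size} bytes, got {len(data)}")
    memory[address:address + len(data)] = data
    # Read back what was written and compare before declaring success.
    if bytes(memory[address:address + len(data)]) != data:
        raise UploadRejected(f"readback mismatch for {name}")
```

    With a check like this, a write aimed at 0x1008 but labelled as the HGA parameter is rejected before it can touch the solar array limits.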

    --
    "Goodness me, how unlike the FBI to abuse the trust of the American public." -- The Onion
    1. Re:The actual report by gfilion · · Score: 2, Informative

      The preliminary official report is available from here.

      Thanks for the link. The report is only three pages long and very interesting to read. The cause (quoted below) is really stunning; I wonder what the probability of this sequence of events happening is.

      The LM team performed a fault analysis to determine the cause of the spacecraft anomaly. An LM spacecraft engineer ultimately determined that the likely cause of the anomaly was an incorrect parameter upload that had occurred 5 months earlier (June 2006). A direct memory command to update the HGA's positioning for contingency operations was mistakenly written to the wrong memory address in the spacecraft's onboard computer. This resulted in the corruption of two independent parameters and had dire consequences for the spacecraft. The first parameter error caused one solar array to be driven against its hard stop, leading the MGS fault-protection system to incorrectly believe it had a stuck gimbal, causing MGS to enter contingency mode. Upon entry into contingency mode, the spacecraft's orientation was such that one of the batteries was directly exposed to the sun. This caused the battery to overheat which in turn gave a false indication of an overcharged battery and led to the premature termination of battery charging on each subsequent orbit. Even though the remaining battery continued to be charged, it was not being charged sufficiently to support the full electrical load, which was normally supported by both batteries. The end result was that both batteries were depleted, probably within 12 hours. The second parameter error caused the HGA to point away from the Earth when the spacecraft was, in fact, properly oriented to communicate to Earth. Communication from the spacecraft to the ground was therefore impossible, and the unsafe thermal and power situation could not be identified by the MGS's ground controllers.
    2. Re:The actual report by gfilion · · Score: 2, Informative

      So hang on... they *overwrote* the memory which contained the contingency operations plan and the hardware limitations data for the solar array? Surely that's bad design, you shouldn't be able to overwrite something like that (Unless the hardware limits plan on changing mid-mission). NASA fault protection modules evidently don't do their job too well :-/

      Actually, they had to correct a previous error by writing directly to memory. I believe that writing directly to memory is not a standard operating procedure. The PDF report linked by the GP states that:

      [...] The HGA parameter was actually updated on the two redundant control systems at two different times. The updates were commanded with slightly different (operator input) precision. This difference in precision, while numerically inconsequential, resulted in an inconsistency between the computer memories. A full memory readout taken at a later date revealed the difference between the two positioning angles, which warranted a correction by the operations team. During the effort to correct the inconsistency, the operations team specified incorrect memory addresses. The incorrect memory addresses caused the command upload to enter data into erroneous memory locations, resulting in the consequences described above.
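
      The precision mismatch described in that excerpt is the kind of thing a tolerance-based comparison can filter out. A minimal sketch, assuming each computer's parameters can be read out as a name-to-float mapping; the parameter name and tolerances below are made up for illustration and are not taken from the report.

```python
import math


def find_mismatches(primary, backup, rel_tol=1e-6, abs_tol=1e-9):
    """Return parameter names that differ beyond tolerance or exist on one side only."""
    mismatches = []
    for name in sorted(primary.keys() | backup.keys()):
        if name not in primary or name not in backup:
            mismatches.append(name)
            continue
        if not math.isclose(primary[name], backup[name],
                            rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append(name)
    return mismatches


# Hypothetical readouts: the same angle entered with different operator precision.
primary = {"hga_angle_deg": 12.3456789}
backup = {"hga_angle_deg": 12.34568}
print(find_mismatches(primary, backup))
```

      A difference that is "numerically inconsequential" would then not be flagged, so nobody would need to issue a corrective direct memory write in the first place. Whether that trade-off is acceptable for flight parameters is a judgement call this sketch does not settle.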
  3. Re:*Design* flaw by Mike1024 · · Score: 4, Informative

    Some temperature monitors on critical, exposed devices would also help. All you need is the CPU temperature diode present on just about every motherboard sold today.

    I looked at the actual report on the NASA website; it said "the spacecraft's power management software misinterpreted the battery over temperature as a battery overcharge and terminated its charge current."

    There was a temperature monitor on the critical, exposed component. Furthermore, the information from the sensor was used in a sensible manner: Li-poly/Li-ion batteries can catch fire under some circumstances (see also: Sony laptop batteries), so if your Li-poly battery overheats while being charged you stop charging it (because you'd rather have a flat battery than an exploded battery).

    After the craft stopped charging the battery it never started charging the battery again. The battery ran down and the craft stopped working.

    The obvious question is: why didn't charging resume after the battery had cooled down? It might not have cooled down (as it was hot in the first place due to being exposed to the sun) or the system might have been waiting for a 'resume charging' command from ground control, which was never received as the high-gain antenna was in the wrong position.
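
    To make the "why didn't charging resume" question concrete, here is a rough sketch of a charge controller that records why it stopped and resumes on its own once the battery cools. The thresholds and the voltage-based "full" test are assumptions for illustration only; the report does not describe the real logic at this level of detail.

```python
from dataclasses import dataclass


@dataclass
class ChargeController:
    # All thresholds are invented for this sketch.
    temp_cutoff_c: float = 45.0   # stop charging at or above this temperature
    temp_resume_c: float = 35.0   # resume only once cooled to this or below
    full_voltage: float = 32.0    # treat the battery as full at this voltage
    charging: bool = True
    reason: str = ""              # why charging is currently off, if it is

    def update(self, temp_c: float, voltage: float) -> bool:
        """Update the charging state from one sensor sample and return it."""
        if voltage >= self.full_voltage:
            self.charging, self.reason = False, "full"
        elif temp_c >= self.temp_cutoff_c:
            # Over-temperature is recorded separately from "full".
            self.charging, self.reason = False, "over_temperature"
        elif self.reason == "over_temperature" and temp_c > self.temp_resume_c:
            # Inside the hysteresis band: stay off until properly cooled.
            pass
        else:
            self.charging, self.reason = True, ""
        return self.charging
```

    Keeping the stop reason separate means an over-temperature event is not treated as an overcharge, and the hysteresis band avoids rapid on/off cycling near the cutoff. None of this helps if the battery stays in direct sunlight and never cools, which is the first possibility above.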

    Personally, if I were designing a spacecraft I'd duplicate the (presumably quite small) onboard computer and radio hardware, because it seems quite common for software/electronics failures to result in loss of communications. Having two processors running different software, each capable of reprogramming the other one if it became broken, would seem like a sensible route to take.

    Just my $0.02.

    --
    "Goodness me, how unlike the FBI to abuse the trust of the American public." -- The Onion
  4. Good code Is For Old People by AHuxley · · Score: 2, Informative
    In Capitalist West you send sloppy code to perfect probe.
    In Soviet Russia perfect probe sends lens cap code back to you!


    A wiki link to help with the lens part.
    http://en.wikipedia.org/wiki/Venera_program

    --
    Domestic spying is now "Benign Information Gathering"
  5. Nope, those are the real numbers by patio11 · · Score: 2, Informative

    http://nssdc.gsfc.nasa.gov/database/MasterCatalog?sc=1996-062A

    I realize this was dirt cheap by space mission standards. A laptop encrusted with diamonds which costs $80,000 is dirt cheap by laptop-encrusted-with-diamonds standards. That *doesn't make it worth the money*. I know we waste far more than $40 million a year on many things -- and, logically, every one of them except one can be justified by "We waste more money on another program, don't cut *my* hobby horse!"

    It's interesting that you draw the distinction between subsidies/entitlements and science, since NASA is a fairly naked subsidy directly to defense contractors, who make all of the really expensive bits. I'm all for giving Lockheed Martin money when it's required, but let's be honest and get ourselves something which blows up in a suitably impressive manner when we do, OK? Similarly, I might even be persuaded that the US federal government should fund science projects -- great, then *fund science*! Don't blow $160 million just to accelerate a tin can out of the atmosphere to get a few close up pictures of rocks. $160 million could fund an awful lot of real science down here, much of which would produce actual results (or, alternatively, you could fund research gazing into the Clear Blue Sky, which is *still* cheap when you do it somewhere in atmosphere).

  6. Re:More robust == heavier by Anonymous Coward · · Score: 1, Informative

    Not a lack of thermal model (such things DO exist for most spacecraft), and they DO spend quite a lot of time in both modeling and test (thermal balance) where they shine artificial sunlight on the spacecraft in a vacuum chamber while it's operating to verify that the model works. http://mpfwww.jpl.nasa.gov/martianchronicle/martianchron7/mgs.html

    In fact, because MGS used aerobraking, which heats the spacecraft during the dips into the atmosphere, I'll bet the thermal model for MGS is better than most.

    But the previous poster is right: ultimately it's a budget issue. If you designed the spacecraft to handle every eventuality, it would be too heavy to launch. If you did ground analysis for every conceivable situation, there wouldn't be enough engineers in the world to finish the job before the "every two years" launch opportunity. At some point, you rely on judgement: hordes of people in reviews shooting at your design, and you figure you've covered 99.9% of the stuff... time to ship and shoot.

    Let's also remember that this puppy has been going for >10 years, which means it was designed 15 years ago... Call it 1990. Somehow I don't think the thermal design engineers at Martin Marietta and JPL were rookies using Excel for the first time, and I suspect that they are fully capable of understanding how to numerically solve partial differential equations, and the limitations of those numerical methods. They can also solve them analytically... my gosh, with a slide rule, even.

    http://mpfwww.jpl.nasa.gov/martianchronicle/martianchron2/marschro29.html

    FWIW, C++ is hardly necessary. This kind of thing is really the domain of good old FORTRAN. Good optimizing compilers, well validated numerical codes, etc.