Slashdot Mirror


Software Error Likely Killed MGS Spacecraft

Aglassis writes "NASA investigators have determined that a software update performed in June of 2006 may have doomed the 10-year-old spacecraft. Apparently the software error caused the solar arrays to drive against a mechanical stop which then forced the spacecraft into safe mode. Unfortunately, after that the spacecraft's radiator was pointed at the sun which overheated the battery and destroyed it. Contact was lost with the Mars Global Surveyor spacecraft in November 2006. NASA will form an internal review board to determine formally the cause of the loss of the spacecraft and what remedial actions are needed for future missions."

38 of 199 comments

  1. Don't believe it by LiquidCoooled · · Score: 5, Funny

    I don't believe it.
    It's most likely the Martian automated defense system set up just before we sent a probe and destroyed their civilisation.

    --
    liqbase :: faster than paper
  2. Battery by Anonymous Coward · · Score: 5, Funny

    "overheated the battery and destroyed it"

    Have NASA been using Dell batteries?
  3. a Technical solution I see: by pilgrim23 · · Score: 2, Insightful

    Typical response to a problem: form a committee!

    --
    - Minutus cantorum, minutus balorum, minutus carborata descendum pantorum.
  4. What if Microsoft wrote it? by quadelirus · · Score: 5, Interesting

    One crash in ten years? Why don't the NASA guys write consumer operating systems?

    1. Re:What if Microsoft wrote it? by the_humeister · · Score: 2, Informative

      Because it'd be even less user friendly than Linux. Plus they'd also require people to run 80386 processors with 4 MB memory, if that.

    2. Re:What if Microsoft wrote it? by Calinous · · Score: 3, Insightful

      Why don't computers use NASA-quality hardware, ready for space?
      Why don't all computers use just a single configuration (peripherals, cards, interfaces)?

      The purpose of an operating system is so much wider than what the Mars Global Surveyor had to do.

    3. Re:What if Microsoft wrote it? by edremy · · Score: 5, Insightful
      Actually, they buy their OSes off the shelf (VxWorks for the rovers, for example).

      That said, you could get software written to this level of perfection if you wanted. It's easy: follow the space shuttle team's example. You have a stable team of mature developers who work reasonable hours. You test the hell out of the software, to the point where a single bug in a test is reason to redo the software. You run the software on four identical computers and make sure they all agree.

      Then you hire another entire team to write code that does the same thing, but otherwise has no contact with the first team. That software runs on a fifth computer that takes over if something happens to the other four.

      Willing to pay for that?
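
A minimal sketch of the arrangement described above, in Python. It assumes four primary channels running the same software plus one independently written backup. The names `primary_command`, `backup_command`, `faulty_command`, and `select_command` are hypothetical and do not reflect the shuttle's actual flight software.

```python
from typing import Callable, Sequence, Tuple


def primary_command(sensor: float) -> float:
    """Hypothetical control law run identically on the four primary computers."""
    return round(2.0 * sensor, 6)


def backup_command(sensor: float) -> float:
    """Hypothetical second implementation of the same law, written by a separate team."""
    return round(sensor + sensor, 6)


def faulty_command(sensor: float) -> float:
    """Stand-in for a primary channel that has gone wrong."""
    return -1.0


def select_command(
    sensor: float,
    primaries: Sequence[Callable[[float], float]],
    backup: Callable[[float], float],
) -> Tuple[float, str]:
    """Use the primaries only if all of them agree; otherwise hand over to the backup."""
    outputs = [channel(sensor) for channel in primaries]
    if all(out == outputs[0] for out in outputs):
        return outputs[0], "primary"
    return backup(sensor), "backup"


healthy = [primary_command] * 4
one_bad = [primary_command] * 3 + [faulty_command]

print(select_command(1.5, healthy, backup_command))  # (3.0, 'primary')
print(select_command(1.5, one_bad, backup_command))  # (3.0, 'backup')
```

Many redundant systems use majority voting rather than unanimous agreement; unanimity is used here only because it matches the description above.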

      --
      "Seven Deadly Sins? I thought it was a to-do list!"
    4. Re:What if Microsoft wrote it? by the_humeister · · Score: 4, Funny

      I don't know. And people with their "keyboard" and "mouse." Idiots I say. The only true way to interact with a computer is by plugging wires into the serial port and generating the necessary electrical pulses myself.

  5. *phew* by Daetrin · · Score: 4, Funny
    NASA investigators have determined that a software update performed in June of 2006 may have doomed the 10-year-old spacecraft. Apparently the software error caused the solar arrays to drive against a mechanical stop which then forced the spacecraft into safe mode.

    Glad I'm not the programmer who came up with that bit of code! Their next performance review is going to be _lots_ of fun!

    --
    This Space Intentionally Left Blank
  6. "Safe" mode? by Bazman · · Score: 5, Funny

    Funny definition of 'safe mode'. I'd get the main antenna pointing at the earth, the battery radiator pointing away from the sun, and the computer going 'what do I do now, smarty earthlings?' and waiting for a command.

    Maybe NASA's 'safe mode' just put 'safe mode' in the corners of all the returned images and did them in 8-bit colour...
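
Taken at face value, the safe mode described above comes down to two pointing constraints. A rough Python sketch follows, assuming all directions are given as vectors in one common frame. The function `is_safe_attitude` and its thresholds are hypothetical and are not MGS's actual safe-mode logic.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]


def dot(a: Vec, b: Vec) -> float:
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def angle_deg(a: Vec, b: Vec) -> float:
    """Angle between two non-zero vectors, in degrees."""
    cos_theta = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def is_safe_attitude(
    antenna: Vec,
    radiator: Vec,
    to_earth: Vec,
    to_sun: Vec,
    max_antenna_error_deg: float = 5.0,        # assumed tolerance
    min_radiator_sun_angle_deg: float = 90.0,  # assumed: radiator must face away from the sun
) -> bool:
    """True if the antenna points near Earth and the radiator faces away from the sun."""
    antenna_ok = angle_deg(antenna, to_earth) <= max_antenna_error_deg
    radiator_ok = angle_deg(radiator, to_sun) >= min_radiator_sun_angle_deg
    return antenna_ok and radiator_ok


to_earth = (1.0, 0.0, 0.0)
to_sun = (0.0, 1.0, 0.0)
print(is_safe_attitude((1.0, 0.0, 0.0), (0.0, -1.0, 0.0), to_earth, to_sun))  # True
print(is_safe_attitude((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), to_earth, to_sun))   # False
```

The second call is the failure case from the summary: the antenna is fine, but the radiator is facing the sun.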

  7. YACCS -Yet Another Computer Corkup in Space by Ancient_Hacker · · Score: 4, Informative
    Just one more example of how Computer Science isn't quite up to the reliability requirements of Space:
    • A missing comma in a Do-loop statement caused the first mission to Mars rocket to go off course and blow up.
    • The space-shuttle programs had a race condition that caused the first launch to be scrubbed.
    • The space-shuttle re-entry program had one important variable off by a factor of -4, causing the first re-entry to be a bit wobbly.
    • An Ariane guidance program had multiple basic design glitches that caused the first launch to blow up.
    • The F-16 autopilot worked very well, until the plane was deployed to Australia, where on its way there it bounced off the equator.
    • The LEM landing program didn't protect itself from spurious radar data, causing the computer to get behind.

    Aero and space are very unforgiving of human coding errors.

    1. Re:YACCS -Yet Another Computer Corkup in Space by zyl0x · · Score: 2, Interesting

      Be careful not to place too much of the blame on us programmers. Most of these crazy "business logic" equations were created by some math genius in another department. Since most of these equations mean nothing to programmers, we make sure we're typing them in correctly, since there's no way we would ever recognize any type of mistake. Most of the time the problem lies with the math guy, who was too lazy to carry a remainder, or who thought the equation was good enough being precise to four decimal places.

      --
      Blerg.
    2. Re:YACCS -Yet Another Computer Corkup in Space by spun · · Score: 4, Insightful

      In other disciplines, the engineers ARE math guys. Face it, compared to other engineering types, software engineers and programmers are SLOPPY. This is because engineering has thousands of years worth of spectacular cork-ups with enormous death tolls to look back on, and engineering students are (I'm guessing, IANAE) shown horrific, traffic-safetyesque movies like Blood on the Protractor, Slide Rule Massacre, and London Bridge is Falling Down, Killing Little Johnny's Entire Family.

      Maybe we CS types need our own safety movies, perhaps When Buffers Attack!, Threads: Your Parallel Friends or Quagmires of Debugging DOOM?, or maybe Metric or Imperial: You Mean there's a Difference? Or maybe we need to recognize that many of us have the same awesome responsibility that other engineers do of protecting human lives from the consequences of our mistakes. I'm told that this point is hammered home in engineering schools, why not in CS departments?

      --
      - None can love freedom heartily, but good men; the rest love not freedom, but license. -- John Milton
    3. Re:YACCS -Yet Another Computer Corkup in Space by unix_core · · Score: 4, Funny

      I think I've seen some of those, starring Troy McClure, right?

    4. Re:YACCS -Yet Another Computer Corkup in Space by januth · · Score: 3, Insightful

      I wouldn't call it a failure of Computer Science; it's a QA failure without a doubt.

      Mistakes happen when you code. Sure, you try to minimize them but even the most carefully designed code can't be guaranteed to be 100% error free. That's why you employ, presumably, a top-notch QA team to check and recheck, testing your "perfect" code in ways that perhaps you never even considered.

      This is what you would expect in a terrestrial application. When the platform that your code is going to run on isn't bound to the same gravitational source that you are, you would think...you would *hope*...that the QA team might do an even more thorough job.

      If this event is at all indicative of the QA efforts that NASA will be making for our return to the moon, perhaps we'd be better off staying at home.

    5. Re:YACCS -Yet Another Computer Corkup in Space by Mayhem178 · · Score: 4, Insightful

      For the uninformed, QA = Quality Assurance. A must-have for any self-respecting software model.

      NASA has got it rough, has since the mid 70s. Their wildest successes are regarded as routine and hardly noticed by the public eye. Their failures, on the other hand, are spun to be the worst disasters in human history. Granted, when shuttles explode and people die, it's reasonable that the public be concerned. But it seems to me that for every 20 great things that NASA accomplishes, the media picks 1 failure (and sometimes blows that failure out of proportion) to rile the masses into a furious frenzy calling for the dissolution of NASA.

      --
      "You will pay for your lack of vision..." - Emperor Palpatine to Ray Charles

    6. Re:YACCS -Yet Another Computer Corkup in Space by Fishbulb · · Score: 5, Informative

      The F-16 didn't "bounce off the equator". Before it ever flew, in simulation the computer flipped the plane over when it crossed the equator due to a bug that incorrectly handled southern latitudes. Additionally, since the computer "flip" happened instantaneously, and the f-16 can roll at much higher G forces than the pilot can take, the flip would have killed the pilot (and the F-16 would have happily continued on its way).

      http://portal.acm.org/ft_gateway.cfm?id=163293&type=pdf&coll=GUIDE&dl=GUIDE&CFID=11154656&CFTOKEN=19136062

    7. Re:YACCS -Yet Another Computer Corkup in Space by caerwyn · · Score: 2, Insightful

      CS people are math guys too, at least many of us are. That doesn't mean we necessarily have the expertise to validate aerospace control algorithms on the fly; that's why there's an entire discipline of aerospace engineers, because you can't expect all the *other* engineers to have sufficient knowledge.

      Things like this are built as teams- and team members have to make certain assumptions about the accuracy of the other team members' work. Those algorithms should have been validated before even being handed off to the programmers, and then validated *again* as part of integrated testing.

      --
      The ringing of the division bell has begun... -PF
    8. Re:YACCS -Yet Another Computer Corkup in Space by Minwee · · Score: 2, Insightful

      Okay the operators helped by plugging in the wrong units but neither did the software catch the discrepancy in the values.

      "On two occasions I have been asked [by members of Parliament], 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question."

      Plus ça change, plus c'est la même chose.

    9. Re:YACCS -Yet Another Computer Corkup in Space by Flavio · · Score: 3, Insightful

      In other disciplines, the engineers ARE math guys. Face it, compared to other engineering types, software engineers and programmers are SLOPPY. This is because engineering has thousands of years worth of spectacular cork-ups with enormous death tolls to look back on, and engineering students are (I'm guessing, IANAE) shown horrific, traffic-safetyesque movies like Blood on the Protractor, Slide Rule Massacre, and London Bridge is Falling Down, Killing Little Johnny's Entire Family.

      Engineering and applied mathematics are much more demanding than computer programming. Sure, one could argue that "computer science is math too", but my experience is that CS majors don't graduate with a strong math background. And even if they did once know some calculus and linear algebra, they were never required to apply it like an EE or Applied Math person would.

      So while you could find a rigorous programmer or software engineer (and I use the term "software engineer" very loosely, because few individuals actually fit that description), it's often a lot easier to look for an engineer or applied mathematician with good programming skills. Their math and physics is usually significantly stronger, and they actually understand what they're programming.

    10. Re:YACCS -Yet Another Computer Corkup in Space by Rei · · Score: 2, Insightful

      To put the shoe on the other foot, have you ever seen software written by people who aren't programmers? Uck. The code is usually a nightmare. Things like:

      "Well, here we're using the global "qzv" as a loop variable, but over here we'll use it to mean how many widgets we're looking at, and over here, it's our exit condition. Oh, and we'll set it to '5' over here for no discernible reason. Now, here's where we've cut and pasted the code 15 times so that we could change one variable's type (instead of using templates), but naturally, the bugfixes we've applied since then haven't all migrated into all of the versions. Ah, here's the core of the code, where we cast structs and function pointers to void pointers, and then pass those around, with a jury-rigged method to figure out what they actually contain duplicated in every piece of code that uses them. If you scroll up in this 23,000 line file, you'll see eighteen pages of commented out code. Scroll down, and you can see the famous Sea of TODO Notes -- the only place in the file in which comments are actually associated with descriptive text. Unfortunately, most of them contain only the word 'Fixme'. Now, on to the diverse species of macros you'll find scattered about, defined and redefined throughout the code..."

      --
      "Now," she thought, watching the dolphins adjust their bowties, "might be a good time to up my medication."
    11. Re:YACCS -Yet Another Computer Corkup in Space by Ancient_Hacker · · Score: 2, Informative
      >Additionally, since the computer "flip" happened instantaneously, and the f-16 can roll at much higher G forces than the pilot can take, the flip would have killed the pilot.

      Well your whole post is called into question due to quite a few questionable items:

      • It seems unlikely that the latitude would enter at all into any calculation of roll attitude. If so, it's more than a "bug", it's a basic design mistake.
      • The F-16 does have a high roll rate, about 320 degrees per second, but since the pilot is very close to the roll axis, there's very little acceleration at the pilot's position during your basic aileron-roll. Pilots routinely apply maximum roll without dying, or even passing out.
      • Nobody dies instantly from excess G's... Fighter pilots overdo it all the time. Usually they let off the stick as they feel the early effects, such as a narrowing or darkening field of vision. If they keep on commanding too many G's, they'll pass out and that will let pressure off the controls, which quickly reduces the G forces. Good fail-safe system.
      • Flipping upside down will quickly send blood to the head, which is exactly what's needed to recover from too many positive G's.
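
The roll-rate argument in the second bullet can be checked with the centripetal acceleration formula a = ω²r. A quick Python sketch, using the 320 degrees per second quoted above. The 0.5 m offset of the pilot's head from the roll axis is an assumed figure for illustration, not a measured F-16 value, and `roll_acceleration_g` is a hypothetical helper.

```python
import math

G0 = 9.80665  # standard gravity, m/s^2


def roll_acceleration_g(roll_rate_deg_s: float, offset_m: float) -> float:
    """Centripetal acceleration, in g, at a given distance from the roll axis."""
    omega = math.radians(roll_rate_deg_s)  # convert deg/s to rad/s
    return omega ** 2 * offset_m / G0


# 320 deg/s at an assumed 0.5 m from the roll axis
print(f"{roll_acceleration_g(320.0, 0.5):.2f} g")  # 1.59 g
```

Under that assumption the steady roll adds roughly 1.6 g at the pilot's head, which fits the point that a roll by itself loads the pilot far less than a hard turn does.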

  8. Pilot said.... by isieo · · Score: 2, Funny

    Houston, I B.S.O.Ded

  9. Is this a sign? by Billosaur · · Score: 4, Insightful

    Some expert is always trumpeting the fact that "Johnny can't program," to which many of us roll our eyes and go back to coding. But could this be a sign that the quality of the help NASA is hiring is such that these kinds of mistakes are now rampant? I mean, this could have been avoided if the code had been tested out on a full-scale mock-up of the machine, to verify that it did what it was supposed to do, before ever sending the commands to the actual machine. If anything, it's a QA failure.

    --
    GetOuttaMySpace - The Anti-Social Network
    1. Re:Is this a sign? by benevixit · · Score: 5, Insightful

      In all fairness, writing code for a spacecraft is a lot harder than most of our Earthbound coding projects. These are custom-built machines running one-of-a-kind hardware; one can simulate components independently but it's very difficult to figure out how the hardware is going to behave up there in the vacuum.

      For example, consider the one function of maintaining orientation. Most spacecraft use telescopes that look for star reference points. They look for particular star configurations and use microthrusters or gyroscopes to adjust their orientation. Imagine what it would take to simulate this: a zero-gravity vacuum with a realistic star-field at focus=infinity. Any laboratory mock up is going to cost a lot more than launching a new spacecraft. And that's just one subsystem.

      Software upgrades at NASA go through a really rigorous quality control regimen, often requiring programmers to justify _individual_lines_ of their code to a review committee. Even then they usually won't patch noncritical bugs until the primary mission is completed.

      I think your point is a good one. And the key lesson is not that NASA QA sucks, it's that programming for spacecraft is _tough_. I know they are constantly investigating new ways (like more standardization, code re-use, and formal verification procedures) of improving software reliability.

  10. Better than a metric-English conversion error by ccmay · · Score: 3, Insightful
    I guess those things happen. But at least it wasn't an error converting units, like the other Mars spacecraft that was lost. That is just incredibly stupid. Glad I'm not the "engineer" who wasted thousands of man-years and hundreds of millions of taxpayers' dollars because I was too stupid or lazy to convert between meters and feet.

    On a positive note, it has provided me an instructive example for when I help my teenagers with their math homework. If they say it's "almost" correct, I tell them that the guy who screwed up the Mars mission probably said the same thing.

    -ccm

    --
    Too much Law; not enough Order.
    1. Re:Better than a metric-English conversion error by kfg · · Score: 2, Insightful

      If you wish them to grow up to be good little engineers; ask them to define how "almost" correct it is.

      KFG

    2. Re:Better than a metric-English conversion error by iamlucky13 · · Score: 4, Informative

      It wasn't one engineer. It was a team effort. And it wasn't a very simple matter of "forgetting". Several factors combined, including re-use of code from the MGS mission (a conversion factor was in the old code, but not recognized when the code was adapted for the doomed MCO) and budget constraints that limited pre-flight testing (so the bug was missed...and in fact might have still been missed even with more testing). The effects of the bug were also subtle enough that 3 minor main engine firings were conducted without enough error showing up to reveal the problem. It wasn't until the long orbital insertion firing that the error in the trajectory became noticeable, and by then it was too late. The team's first clue something was wrong was when the spacecraft didn't radio home after the engine burn.

      The details are really convoluted, but the Wikipedia page on the mission has a decent write up explaining how the mistake was made, with additional resources cited. The PDF paper giving a perspective from the MCO team is particularly revealing, if you've got some time on your hands.
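
One common defence against this class of error is to carry the unit along with the value, so that a missing or unknown conversion fails loudly instead of silently. A minimal Python sketch follows. The `Impulse` class and `total_impulse_n_s` are hypothetical names and are not taken from the MCO or MGS software; the pound-force to newton factor is the standard one.

```python
from dataclasses import dataclass
from typing import Iterable

LBF_S_TO_N_S = 4.4482216152605  # newton-seconds per pound-force second


@dataclass(frozen=True)
class Impulse:
    """An impulse value that always travels with its unit label."""
    value: float
    unit: str  # "N*s" or "lbf*s"

    def to_newton_seconds(self) -> float:
        """Convert to newton-seconds, rejecting any unit we do not recognise."""
        if self.unit == "N*s":
            return self.value
        if self.unit == "lbf*s":
            return self.value * LBF_S_TO_N_S
        raise ValueError(f"unknown impulse unit: {self.unit!r}")


def total_impulse_n_s(burns: Iterable[Impulse]) -> float:
    """Sum a series of burns in newton-seconds, whatever unit each was recorded in."""
    return sum(burn.to_newton_seconds() for burn in burns)


burns = [Impulse(1.0, "lbf*s"), Impulse(1.0, "N*s")]
print(total_impulse_n_s(burns))
```

The design choice is that bare floats never cross a module boundary: any code adapted from another mission has to state its unit before its numbers can be used.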

  11. Re:Should have used Gentoo!! by zootm · · Score: 4, Insightful

    No sandbox can avoid the fact that one test was missing.

  12. zing! by steak · · Score: 2, Funny

    that was the sound of me hitting the bullseye.

    [quote]at least if something went wrong some guy at nasa could tell his grand kids that he bricked something from ~140 million miles away.[/quote]

    http://slashdot.org/comments.pl?sid=214508&cid=17427542

  13. Where's K'Breel? by Amazing Quantum Man · · Score: 2, Insightful

    We need his report! Tripmaster Monkey, where are you?

    --
    Fascism starts when the efficiency of the government becomes more important than the rights of the people.
  14. Re:So what if the battery is dead? by smoker2 · · Score: 2, Insightful

    I expect the electronics run off the battery, and the solar just charges the battery. If the battery's dead, nothing will run.

  15. Luxury! by avronius · · Score: 4, Funny

    We used to live in a vacuum tube. When the computer was running, and your bit was accessed, you almost had enough light to read by. Mother would disconnect the tube when she went to bed, causing floating point errors for almost eight clock-cycles...

    Or at least, that's how I remember it...

  16. Re:"almost" correct by kfg · · Score: 2, Insightful

    So if I botch the balance in my checkbook, the bank will pat me on the head. . .

    Why should the bank even care? I don't even remember the last time I balanced my checkbook.

    "Almost correct" is someone being spineless.

    I just measured the height of a tree with a meter-long chunk of 2x4 and a bubble protractor. I get a figure of 10 meters. How many feet is that? 32.808399 is not the right answer. Using it is likely to result in your shell missing the top of the tree. 30 is the right answer. Why?

    Neither you nor your wife is correct, or incorrect either. Define what "correct" means and define the degree of incorrectness and precisely why it is incorrect.

    Arithmetic is exact; the things you use it to model often are not. Modeling states and calculation of figures are two separate acts and skills. They both need to be taught and understood.
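
The 10 meter example can be made concrete by rounding the converted value to the precision of the measurement. A small Python sketch, assuming the 10 m figure is good to one significant digit; `round_sig` is a hypothetical helper.

```python
import math

FEET_PER_METER = 3.280839895  # 1 ft is exactly 0.3048 m


def round_sig(x: float, digits: int) -> float:
    """Round x to the given number of significant digits."""
    if x == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(x)))  # position of the leading digit
    return round(x, digits - 1 - exponent)


height_ft = 10 * FEET_PER_METER
print(height_ft)                # the "exact" arithmetic result, roughly 32.8
print(round_sig(height_ft, 1))  # 30.0
```

The arithmetic is identical in both lines; only the second admits how rough the original measurement was.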

    Telling me that I'm stooopid is a personal attack; telling me my calculation is incorrect is a statement of fact. Folks need to learn that the latter statement isn't necessarily a bad thing.

    Here I am with you 100%.

    KFG

  17. MGS was currently a low priority for NASA by jespley · · Score: 2, Interesting

    I'm a scientist that works with the MGS data so I don't know the engineering side well. However, I do know that last year NASA was strongly considering dropping all support for MGS in order to spend the limited Mars program money on newer missions (the idea being that we had gotten 90% of the useful science from MGS). Instead they decided to keep MGS funded with a bare minimum of money and hence a bare minimum number of personnel. I imagine that the poor overworked engineers running the operational show at JPL just didn't have the time to doublecheck everything as they would in an ideal world. As their end user, I'm just grateful for all the work they did over the years to keep the thing running.

  18. Nope. by Anonymous Coward · · Score: 2, Informative

    Additionally, since the computer "flip" happened instantaneously, and the f-16 can roll at much higher G forces than the pilot can take, the flip would have killed the pilot

    A single half-roll to inverted in the Falcon wouldn't have exerted enough Gs on the pilot to make him do anything worse than exclaim WTF! and disengage the a/p. A roll in and of itself in an aircraft doesn't really induce many Gs... a "bank-and-yank" turn does, and that's what the F-16 can do at higher Gs than the pilot can take... not the roll.

  19. Re:We need Computer Engineering, not Scientists. by alienmole · · Score: 2, Insightful

    We're never going to improve as long as people insist on comparing software development to building bridges, i.e. a more sophisticated understanding of the problem is needed. In software, once you have a program for a bridge you can make a billion bridges, all alike or customized by certain parameters, just by running the program. So being "able to build the same damn bridge 100 times" doesn't get you anywhere. Making it better and safer each time? That's another story, and once again, the comparison to bridge building doesn't hold up, because you're talking about improving the design, not the building practices or materials.

    If there was any merit in this canard, don't you think that before now, you'd have had some engineers who also knew software come along and revolutionize the software industry?

    Standardization is how you get rid of most errors. You'll notice that nobody is making new bolts or nails anymore, they're all standardized.

    You haven't written a line of code in your life, have you? If you have, tell me what level of standardization you're even talking about, in the software context.

  20. O_O by Vacardo · · Score: 2, Funny

    Well, that tops my list of "Worst Times to Get the Blue Screen of Death".