Slashdot Mirror


Debug your Code, or Else!

Trevor Lovett writes "I ran across a collection of famous software bugs that have caused large scale disasters including the explosion of the Ariane 5 rocket due to integer overflow and the misfiring of a US Patriot missile that caused 28 deaths because of accumulated floating point error."

23 of 485 comments

  1. Pentium bug in perspective by Alomex · · Score: 5, Informative
    Just to be clear, all processors out there have bugs. The Pentium bug is in no way exceptional. The only reason it deserves to be there is because the list is called "a collection of famous software bugs that caused large scale disasters".


    The Pentium bug is certainly famous because every idiot and his brother thinks it is rare for a CPU to be buggy. The second condition in the list is "caused a large scale disaster". This condition is, sadly, also met. It caused a large scale public relations disaster for Intel because once again said idiots thought that a CPU bug is rare.

    1. Re:Pentium bug in perspective by jrstewart · · Score: 3, Informative

      Just to be clear, all processors out there have bugs. The Pentium bug is in no way exceptional. The only reason it deserves to be there is because the list is called "a collection of famous software bugs that caused large scale disasters".

      What is exceptional is that instead of just announcing a new erratum (which is what Intel and most CPU makers normally do in such a case), Intel tried to bury the problem, initially denying that it existed and then denying that anyone would ever run into the problem. This really pissed off the numerical computing community and destroyed confidence in the accuracy of Intel's floating point unit. That's why it was a public relations fiasco.

      see:

  2. My prof at Georgia Tech stressed this a lot by delphin42 · · Score: 5, Informative

    He was considering making Fatal Defect required reading for the C programming course I took. From Amazon.com:

    In Fatal Defect: Chasing Killer Computer Bugs, Ivars Peterson describes dozens and dozens of hoary computer bugs and gives biographical sketches of the bug detectives who located and fixed them. This book, which reads like a novel, is both entertaining and informative. Many of the bugs that Peterson discusses are not in computer programs per se but in the human systems that run and operate the computers. Very often the operator fails to understand what the computer program requires as input and types in an incorrect command. The computer then executes the command, with potentially disastrous results. Fatal Defect has important lessons for both those who design computers and those who use them.

    He also insisted that we not call them bugs. "They are ERRORS. Calling them bugs makes it sound like they are cute little accidental things that pop up, when actually they are programming mistakes."

    --
    -- Adam
  3. Read comp.risks by kzinti · · Score: 5, Informative

    Make reading the ACM's RISKS digest a part of your regular routine, and you'll hear about these kinds of software-related problems and many others - usually shortly after they happen. The RISKS digest is available on Usenet as comp.risks, as a mailing list, and on the WWW at http://catless.ncl.ac.uk/Risks. A new issue is published on a semiregular basis, every one to two weeks. It's not only informative but interesting too.

    --Jim

  4. It's Worse: The Patriot Never Worked by GuyMannDude · · Score: 5, Informative
    The Patriot missile defense system never worked -- the bug mentioned in the article is a red herring. The main problem was that the Iraqis had modified the Scud with additional fuel tanks. The resulting missile was unstable and would start to break apart in flight. The Patriot couldn't lock on to the missile because of all the shrapnel. In addition, Scuds are poor missiles to begin with. When they fly, they do so with a wobble -- like a poorly thrown football. The Patriots had been tested prior to the war on good-quality American missiles, which flew in a smooth trajectory. The Patriots simply couldn't deal with a missile that "danced around" in midflight. Bottom line: the Patriots simply do not protect against Scuds because of poor design -- not some floating point error. The floating point explanation is analogous to that Coriolis-effect-causes-water-to-swirl-in-the-toilet myth that you find in so many physics textbooks (the Coriolis effect only works on planetary scales). It looks good on paper, but if the "experts" had bothered to perform a test they would see that the explanation is dead wrong. The failure of the Patriots to intercept Scuds (and the fact that the media never mentions this) has grave implications for our anti-ballistic missile shield.

    Don't take my word for it. Do a web search and see for yourself. Here are some references to get you started:

    http://www.fas.org/spp/starwars/docops/rp911024.htm

    http://www.csmonitor.com/durable/1997/09/08/opin/letters.1.html

    GMD

  5. Re:Much is very iffy to beaf up list by jgerman · · Score: 3, Informative

    Go check out the Risks Forum; links are available from the ACM webpage. There is plenty of proof and explanation for hundreds of software-related mishaps. You're obviously looking in the wrong places.

    --
    I'm the big fish in the big pond bitch.
  6. Re:Patriot Scud Time Error by Kintanon · · Score: 5, Informative

    That system just wasn't designed for that purpose. It was VERY well designed for its actual purpose, which was tracking AIRCRAFT going WAY slower than that missile. And it was only rated for 14 hours of continuous usage, not 100. So it wasn't a fault in the program per se, but a misapplication of a system designed for a different use.
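
    For a sense of scale, the drift can be worked out directly. This is a rough sketch, not the Patriot's actual code: the tenth-of-a-second tick and the 23 fractional bits are assumptions taken from the commonly cited account of the bug, not from anything in this thread, and the names are made up for illustration.

```python
from fractions import Fraction

FRACTION_BITS = 23       # assumed: bits kept after the binary point
TICK = Fraction(1, 10)   # assumed: the clock counts tenths of a second

# 0.1 has no finite binary expansion, so the stored value is chopped short.
stored_tick = Fraction(int(TICK * 2**FRACTION_BITS), 2**FRACTION_BITS)
error_per_tick = TICK - stored_tick


def clock_drift_seconds(hours_up: int) -> float:
    """Accumulated clock error after `hours_up` hours without a reboot."""
    ticks = hours_up * 3600 * 10
    return float(ticks * error_per_tick)
```

    Under those assumptions, clock_drift_seconds(14) is under 0.05 s, while clock_drift_seconds(100) is roughly a third of a second.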

    Kintanon

    --
    Check out JoshJitsu.info for Brazilian Ji
  7. Re:Much is very iffy to beaf up list by blamanj · · Score: 3, Informative

    Blatant karma whoring...

    The risks forum is available as a moderated newsgroup, or you can subscribe to the e-mail version. See the Risks info page.

  8. We have a difficult battle ahead ... by jc42 · · Score: 5, Informative

    Some years back, as a grad student, I saw a bunch of colleagues do a rather unnerving experiment. Much of the number crunching was, as usual, done in Fortran. So they instrumented the compiler to silently test for integer overflow, report when it happened, and also report whether the program tested for it.

    Their result was that roughly 50% of the Fortran programs on the mainframe computer produced at least one number in the output that was wrong due to undetected integer overflow.

    This itself would be bad enough. But a bunch of us followed this up by asking Fortran programmers about it. What we did specifically was to point out that, unlike floating point, where there's an interrupt, integer arithmetic required a separate instruction to test the overflow flag. So testing for integer overflow took extra cpu cycles. Then we asked them whether they thought that software should be modified to always test for integer overflow, as is done with floating point.

    The answer was overwhelmingly that if it took extra cpu cycles, the software should not check for overflow.

    When we pointed out that this introduced the risk of programs producing incorrect results, the Fortran programmers invariably said that didn't matter. Faster is better, even if some of the results are wrong.

    I think of this whenever I read about computers used in medical, transportation, or other areas where malfunctioning software could put lives at risk. I don't believe that the "software culture" has changed significantly in this respect since then.
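
    To make the trade-off concrete, here is a minimal sketch in Python rather than Fortran. Python integers never overflow, so a 32-bit signed word is simulated; the width and the function names are assumptions for illustration only.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def wrapping_add(a: int, b: int) -> int:
    """Add the way an unchecked 32-bit signed add behaves: wrap silently."""
    return (a + b - INT32_MIN) % 2**32 + INT32_MIN


def checked_add(a: int, b: int) -> int:
    """Add, but raise instead of returning a wrapped (wrong) result."""
    total = a + b
    if not INT32_MIN <= total <= INT32_MAX:
        raise OverflowError(f"{a} + {b} does not fit in 32 bits")
    return total
```

    wrapping_add(INT32_MAX, 1) quietly returns INT32_MIN, which is the kind of wrong number that ended up in those outputs; checked_add raises instead, at the cost of one extra comparison per operation.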

    --
    Those who do study history are doomed to stand helplessly by while everyone else repeats it.
    1. Re:We have a difficult battle ahead ... by T.E.D. · · Score: 3, Informative
      I think of this whenever I read about computers used in medical, transportation, or other areas where malfunctioning software could put lives at risk. I don't believe that the "software culture" has changed significantly in this respect since then.


      That's precisely why people developing safety-critical apps should be (and quite often are) using Ada, rather than Fortran or C. Not only does the language put in all the checks you mention (and more), but the "software culture" among Ada programmers is significantly better where bugs and safety are concerned.

      Take a look at Praxis' SPARK for a look at how responsible people develop safety-critical software. The approach takes more effort than the typical "hack something together then bash it into shape with the debugger" approach. But in many cases, it is well worth the cost.
  9. Re:The Ariane blowup was especially amusing by T.E.D. · · Score: 4, Informative
    Then to make it funnier, turns out the system engineers had decided that since software is infallible, any exception condition would indicate a hardware failure(!), so instead of a reset they shut the affected computer down altogether.


    Not quite. The software was built for the Ariane 4. On the Ariane 4, it was physically impossible for that value to ever get high enough to overflow. So on the Ariane 4 the assumption that an overflow could only be due to a hardware failure was entirely correct.
    If they had known that years later an Ariane 5 would come along, and those engineers would stupidly reuse the Ariane 4 code without testing it once, then perhaps they would have made a different decision. But I think the blame falls entirely on the Ariane 5 guys, who were *not* the ones who wrote that code.
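
    The general lesson, making a range assumption explicit at the point of conversion, can be sketched as below. This is Python, not the Ada the flight software was written in, and it assumes the widely reported detail (not stated in this thread) that the failing operation was a floating-point value being narrowed to a 16-bit signed integer. Function names are hypothetical.

```python
INT16_MIN, INT16_MAX = -2**15, 2**15 - 1


def to_int16(value: float) -> int:
    """Convert to a 16-bit signed integer, refusing out-of-range input."""
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in 16 bits")
    return result


def to_int16_saturating(value: float) -> int:
    """Alternative policy: clamp to the nearest representable value."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))
```

    Neither policy would have been right by default; the point is that the choice is visible in the code and has to be revisited when the input range changes.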
  10. Coupla Notes by StormyMonday · · Score: 4, Informative
    1. The Patriot time-drift was caused by the system being operated outside of its design parameters. It was designed to operate during a Soviet invasion of Western Europe, and expected to have to relocate every 8 hours or so. The spec, therefore, assumed that the software would reboot every 8-12 hours. From my experience with the military, if a programmer had put in a clock algorithm that would track indefinitely, he or she would have been ordered to take it out. (Been there. Done that. Broke the coffee mug.)
    2. The Yorktown crash was the result of mixing mission-critical and non-mission-critical programs on the same box. Big no-no.

    So we have a specification problem and a system design problem. Neither is a pure "programming problem".

    Software crashes are like airplane crashes -- blame the lowest guy on the totem pole. In air crashes, it's the pilot. In software, it's a coder.

    --
    Welcome to the Turing Tarpit, where everything is possible but nothing interesting is easy.
  11. Re:speaks more to TESTING by dgb2n · · Score: 3, Informative

    Testing is critical.

    Others would argue that testing alone may not suffice. Particularly for these kinds of mission-critical applications, nothing short of formal methods of software engineering will suffice. Formal, as opposed to natural-language, specifications can reduce ambiguity. Safety conditions can then be derived and verified through rigorous mathematical proofs.

    Of course none of this obviates the need for testing but it can lead to a more predictable system.

  12. USS Vincennes Incident was NOT software related by kylef · · Score: 3, Informative

    There were many things that went wrong during the incident, but one of the FEW things that worked correctly was the AEGIS weapons system on board the guided missile cruiser. The error lay in the crew's mistaking the range information reported on the radar screen for altitude information. As a result, the CO thought that the incoming contact was flying straight towards his ship and decreasing in altitude (preparing to attack).

    A "cryptic display" is hardly a software bug to anyone familiar with radar screens. That's why we train people to read them!

  13. more here by 3-State Bit · · Score: 3, Informative
  14. OT: The Christian Science Monitor by Squirrel Killer · · Score: 2, Informative
    Note: I am NOT a regular reader of the Christian Science Monitor.
    That's too bad, you should be. The CSM is a highly regarded, non-partisan, non-denominational, very independent paper. It is one of the few sources of quality international news in the US (aside from the internet). While I won't go so far as to say that it is completely unbiased, it certainly is one of the least biased news sources I know of, and their coverage is usually well-balanced. For more info about the paper, check their About the Monitor page. If nothing else, the page is indicative of how independent the paper is of the church.

    -sk

  15. Re:Millennium Bridge - Kansas City skywalk by lingsb · · Score: 2, Informative

    Your assumption about the nature of pedestrian motion that caused the bridge wobble is incorrect:

    They did take into account pedestrian movement on the bridge; they didn't take into account pedestrian motion on the bridge locking in to the motion of the bridge:

    1) Pedestrians walk on bridge
    2) Bridge wobbles slightly
    3) Pedestrians adjust their walking to be in phase with bridge
    4) Bridge wobbles more

    This was a new phenomenon, due to the lightness of the construction of the bridge. It is now fixed, by the addition of dampers.

    --

    -BB

  16. Re:Millennium Bridge - Kansas City skywalk by Captain Nitpick · · Score: 2, Informative
    Human effects on bridges are hardly a surprise. Recall in 1981 when the Kansas City Hyatt's skywalk collapsed, killing 114, because the pedestrians were dancing (and the design was altered to ease construction). You'd think that would have been enough of a wake-up call to the Millennium designers to consider human motion.

    The Hyatt's skywalk collapsed solely because of the change in design. The design change caused the walkway to fail to meet building code. Some civil engineers who studied the disaster were surprised it could support its own weight, much less the weight of the pedestrians.

    Quoting from a Kansas City Star article.

    The National Bureau of Standards concluded failure was just a matter of time. "The walkways," its probe found, "had only minimal capacity to resist their own weight."

    The dancing people were by and large on the floor below the skywalk, participating in a dance contest.

    The mistake that caused the Hyatt disaster was not one of failing to consider human motion in the design, but failing to consider the effects of seemingly minor changes in design.

    --
    But then again, I could be wrong.
  17. Re:32. Therac-25, X-ray by irix · · Score: 4, Informative
    The Therac-25 was an automated x-ray machine that overdosed patients. Fatally.

    Well, not exactly. It was used for cancer treatments, not x-ray imaging. And not all of the radiation overdoses were fatal.

    It was a UI bug rather than a software bug.

    Again, not exactly. The problems with the Therac-25 included hardware issues and some UI problems that led operators to do some interesting things. They also included some race conditions that were definitely software bugs.

    You can check out a reprint of an IEEE article discussing it in depth here.

    Just for some history: AECL, the Canadian government crown corporation that made the Therac-25, spun off its medical operations into private companies in the 1980s. The first was Nordion, which produces radioisotopes for medical use and where I worked for a summer as a co-op student. Nordion was bought by MDS. The other company was Theratronics, which was responsible for devices like the Therac-25. It went without a purchaser for many years because of the stigma of Therac-25, but it was eventually (IIRC) bought by MDS as well.

    Both companies are in my hometown, and the fallout from the Therac-25 (like the IEEE article) was front-page news when I worked at Nordion in the early 1990s. I just worked on software to measure how much of a given isotope to dispense to fill an order, but the whole Therac-25 incident was definitely on everyone's mind.

    --

    Do you even know anything about perl? -- AC Replying to Tom Christiansen post.
  18. Re:F15 equator bug by Nehemiah S. · · Score: 2, Informative

    The prototype F-22 was also lost due to a sign error in the code which controlled the thrust-vectoring nozzles during landing. Technically it was chalked up to pilot error, since he was supposed to lock the nozzles down before beginning the landing procedure, but it is something that should have been considered in the code.

    From the unclassified accident report:

    "At the time of the crash, Morgenfeld had been carrying out a planned go-around, and he had just switched on his afterburners and had retracted his undercarriage at less than 50 feet off the runway with thrust vectoring active. At a speed of 175 knots, the aircraft began an uncommanded pitchup followed by a severe stick-forward command from the pilot. The aircraft then entered a series of pitch oscillations, with rapid tail and thrust nozzle fluctuations, exacerbated by control surface actuators hitting rate limiters causing commands to get out of synchronization with their execution.

    An investigation later showed that Morgenfeld had ignored a test-card that required the vectoring nozzles to be locked into position in just the configuration he found himself in at the time of the crash. However, most engineers had also ignored this instruction since they thought it to be unnecessary. At the time of the accident, the aircraft had made some 760 flights and had logged 100.4 hours in the air."

    neh
    aero geek :)

    --
    ... and there is no doubt, that one day he will be
    where the eye of his telescope has already been
  19. Re:speaks more to TESTING by slamb · · Score: 5, Informative

    What's shocking to me is that almost no open source authors or advocates give a hoot about automated testing of any kind. The only free software I've found with a test suite is gcc. As much as I hate to say it, there's a good chance that the relative inexperience of most open source authors is a factor here.

    Perl is really good about this. The Test::Harness and Test::More modules make it very easy to write test suites, so CPAN modules have lots of automated tests. It might even be a requirement to get a module into CPAN; I'm not sure.

    PostgreSQL has regression tests.

    There's a really nice test environment for Java code called JUnit. Lots of stuff is using it. Lots of articles about how to write effective tests. There's a project to develop mock versions of common objects (servlet requests, SQL queries) that fail in interesting, predefined ways. I'm using a C++ workalike called CppUnit in one of my projects.

    The Boost code has automated testing.

    There's a project called qmtest.

    The Wine people have recently started using regression tests.
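
    For anyone who hasn't seen one, an automated test in this style is short. A minimal sketch using Python's standard unittest module, which follows the same xUnit pattern as JUnit; parse_percent is a made-up function that exists only to give the tests something to check.

```python
import unittest


def parse_percent(text: str) -> int:
    """Hypothetical function under test: parse '42%' into 42."""
    value = int(text.rstrip("%"))
    if not 0 <= value <= 100:
        raise ValueError(f"percentage out of range: {text!r}")
    return value


class ParsePercentTest(unittest.TestCase):
    def test_plain_value(self):
        self.assertEqual(parse_percent("42%"), 42)

    def test_out_of_range_is_rejected(self):
        with self.assertRaises(ValueError):
            parse_percent("150%")


def run_suite() -> unittest.TestResult:
    """Run the suite programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParsePercentTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

    Once tests like these exist, rerunning them after every change is what turns them into regression tests.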

  20. Re:One that we did - killing long distance nighty by mesocyclone · · Score: 3, Informative

    No, it was a we. Someone else knew about the number of lines. They didn't give me the number.

    --

    The only good weather is bad weather.

  21. Re:MOD THIS UP by halflinger_n · · Score: 2, Informative
    There is also

    http://sunnyday.mit.edu/therac-25.html

    Which includes links to the author's other papers and publications.