Slashdot Mirror


Debug your Code, or Else!

Trevor Lovett writes "I ran across a collection of famous software bugs that have caused large-scale disasters, including the explosion of the Ariane 5 rocket due to integer overflow and the misfiring of a US Patriot missile that caused 28 deaths because of accumulated floating-point error."
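
For readers who want to see what the "integer overflow" in the Ariane 5 story looks like, here is a minimal sketch of a narrowing conversion from a floating-point value to a 16-bit signed integer, with and without a range check. The function names and the sample value are made up for illustration; they are not taken from the flight software, which, as widely reported, raised an unhandled exception on the out-of-range conversion rather than wrapping silently.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_unchecked(value: float) -> int:
    """Truncate to an integer and keep only the low 16 bits (two's complement wrap)."""
    raw = int(value) & 0xFFFF
    return raw - 0x10000 if raw >= 0x8000 else raw


def to_int16_checked(value: float) -> int:
    """Truncate to an integer, refusing values that do not fit in 16 signed bits."""
    truncated = int(value)
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a 16-bit signed integer")
    return truncated


if __name__ == "__main__":
    sample = 40000.0  # hypothetical reading, larger than INT16_MAX
    print(to_int16_unchecked(sample))  # silently wrong: -25536
    try:
        to_int16_checked(sample)
    except OverflowError as exc:
        print("rejected:", exc)
```

The checked version costs one comparison; the interesting design question is what the caller does when the check fails.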

25 of 485 comments

  1. but don't forget by rosewood · · Score: 5, Funny

    Remember that time when that kid dialed into NORAD and used that security exploit to get into the Thermo-nuclear war simulator and everyone thought it was real until he and the inventor were able to trick the computer into playing Tic-Tac-Toe? I see a LOT of bugs in the software there but no one ever seems to care about that...

  2. Missing From The List by BiggestPOS · · Score: 5, Funny
    1) 1999 - A buffer overflow causes Half-Life to crash while I'm in an important clan match (Counter-Strike); we lose the match, and I lose many friends.

    2) 2000 - Poorly coded garbage collection causes Word 97 to crash, lose last 2 hours of research paper. Class was in 30 minutes, paper was late. I lost my scholarship.

    3) 2002 - IE crashes while I'm writing an AWESOME first post for /. My karma never recovered.

    --
    What, me worry?
    1. Re:Missing From The List by Novus · · Score: 5, Funny

      shock moment: Word has garbage collection!?

      Yes. It collects megabytes of garbage in files with the extension ".DOC".

    2. Re:Missing From The List by GafTheHorseInTears · · Score: 5, Funny

      4) 2002 - Windows Media Player freezes up while I'm whacking it to porn. Unfortunately, it freezes on one of those annoying shots where they cut away to the dude's face, and I'm too close to the finish line to be able to stop. Afterwards, I feel embarassed and uncomfortable, yet strangely aroused.

      --
      "You're just scared like a little white pussy. I'll fuck you till you love me, you faggot!"
  3. speaks more to TESTING by teambpsi · · Score: 5, Insightful

    It's really amazing how many software project managers don't fully understand what regression testing is all about.

    Software engineers simply cannot be trusted to do more than small unit level testing! We get into a pattern of behavior, we know what to expect, and simply do not stress test the system.

    That's why I like hiring sales people and 2-year-olds to test my code at the unit/integration level.

    --

    Old age and treachery almost always overcome youth and skill.
    1. Re:speaks more to TESTING by billnapier · · Score: 5, Funny

      That's why I like hiring sales people and 2-year-olds to test my code at the unit/integration level

      You didn't need to repeat yourself

    2. Re:speaks more to TESTING by slamb · · Score: 5, Informative

      What's shocking to me is that almost no open source authors or advocates give a hoot about automated testing of any kind. The only free software I've found with a test suite is gcc. As much as I hate to say it, there's a good chance that the relative inexperience of most open source authors is a factor here.

      Perl is really good about this. The Test::Harness and Test::More modules make it very easy to write test suites, so CPAN modules have lots of automated tests. It might even be a requirement to get a module into CPAN; I'm not sure.

      PostgreSQL has regression tests.

      There's a really nice test environment for Java code called JUnit. Lots of stuff is using it. Lots of articles about how to write effective tests. There's a project to develop mock versions of common objects (servlet requests, SQL queries) that fail in interesting, predefined ways. I'm using a C++ workalike called CppUnit in one of my projects.

      The Boost code has automated testing.

      There's a project called qmtest.

      The Wine people have recently started using regression tests.
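
      To make "automated tests" concrete, here is a minimal sketch in Python's unittest module, which follows the same xUnit pattern as JUnit and CppUnit. The mean() function and the bug it guards against are hypothetical, invented only for this example.

```python
import unittest


def mean(values):
    """Hypothetical code under test: arithmetic mean of a non-empty sequence."""
    if not values:
        raise ValueError("mean() of empty sequence")
    return sum(values) / len(values)


class MeanRegressionTest(unittest.TestCase):
    def test_typical_input(self):
        self.assertEqual(mean([1, 2, 3]), 2)

    def test_empty_input_is_rejected(self):
        # Regression test: an imaginary earlier version divided by zero here.
        with self.assertRaises(ValueError):
            mean([])


def run_suite() -> unittest.TestResult:
    """Load and run the test case, returning the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MeanRegressionTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_suite()
```

      The point of keeping the second test around is that the old failure cannot quietly come back after a later change.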

    3. Re:speaks more to TESTING by qslack · · Score: 5, Funny

      What do you have against 2-year-olds!? That was simply uncalled for.

  4. another bug page by blooher · · Score: 5, Insightful

    Software Horror Stories linked from the post's link

  5. Hi-tech toilet swallows woman by DeadSea · · Score: 5, Funny
    Of all of them this is my favorite. It doesn't say if it was a software bug or not though.
    [Source: Article by Lester Haines, 17 Apr 2001, via Brian Randell]
    A 51-year-old woman was subjected to a harrowing two-hour ordeal [on 16 Apr 2001] when she was imprisoned in a hi-tech public convenience. Maureen Shotton, from Whitley Bay, was captured by the maverick cyberloo during a shopping trip to Newcastle-upon-Tyne. The toilet, which boasts state-of-the-art electronic auto-flush and door sensors, steadfastly refused to release Maureen, and further resisted attempts by passers-by to force the door. Maureen was finally liberated when the fire brigade ripped the roof off the cantankerous crapper. Maureen's terrifying experience confirms that it is a short step from belligerent bogs to Terminator-style cyborgs hunting down and exterminating mankind.
  6. Pentium bug in perspective by Alomex · · Score: 5, Informative
    Just to be clear, all processors out there have bugs. The Pentium bug is in no way exceptional. The only reason it deserves to be there is because the list is called "a collection of famous software bugs that caused large scale disasters".


    The Pentium bug is certainly famous because every idiot and his brother thinks it is rare for a CPU to be buggy. The second condition in the list is "caused a large scale disaster". This condition is, sadly, also met. It caused a large-scale public relations disaster for Intel because once again said idiots thought that a CPU bug is rare.

  7. One that we did - killing long distance nightly by mesocyclone · · Score: 5, Interesting
    Back in 1973 we built a system for hotel reservations that had over 1000 mini-computers distributed in hotels all over the US. These computers periodically dialed in to an 800 number to get outstanding messages (it was cheaper for them to dial in than for us to dial out to them).


    I wrote the algorithm that scheduled the dialins. It used a pseudo-random approach during the day, weighted by outstanding traffic.


    But at night, there was a period during which we had to unload all messages before the next day's processing. During this time, the pseudo-random algorithm was replaced by a deterministic one that assigned computers time slots.


    The computers also had auto-retry in the case of failure, so each call could result in several if it were blocked.


    Unfortunately, during coding I had put in the number of modems answering phones at 20 (as an arbitrary number for testing). During the hectic rollout, this never was changed to the actual number which was much smaller.


    Once the system came on line, every night at 1AM portions of Omaha (which included lots of call centers) would lose all long distance service for a couple of hours, as all these computers called in and retried several times.


    Eventually the phone company figured it out and contacted us, and we discovered and corrected the discrepancy.
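
    The failure mode above can be sketched roughly as follows. This is a toy model under stated assumptions, not the original algorithm: the slot-filling rule, the actual modem count of 5, and the retry limit of 3 are all hypothetical, and blocked callers are assumed to fail every retry within their slot.

```python
from typing import List


def assign_slots(num_computers: int, modems_assumed: int) -> List[int]:
    """Deterministic night schedule: fill each time slot up to the assumed modem count.

    Returns the number of computers told to dial in during each slot.
    """
    full_slots, remainder = divmod(num_computers, modems_assumed)
    return [modems_assumed] * full_slots + ([remainder] if remainder else [])


def count_call_attempts(slots: List[int], modems_actual: int, max_retries: int) -> int:
    """Total calls placed when only `modems_actual` callers per slot get through.

    Assumption: every blocked caller makes its first call plus `max_retries` redials.
    """
    attempts = 0
    for callers in slots:
        answered = min(callers, modems_actual)
        blocked = callers - answered
        attempts += answered + blocked * (1 + max_retries)
    return attempts


if __name__ == "__main__":
    # Schedule built for 20 modems, but only 5 (hypothetical) actually answer.
    print(count_call_attempts(assign_slots(1000, 20), modems_actual=5, max_retries=3))  # 3250
    # Schedule built for the real modem count: one call per computer.
    print(count_call_attempts(assign_slots(1000, 5), modems_actual=5, max_retries=3))  # 1000
```

    Even in this crude model the wrong constant more than triples the number of calls, and packs them into a quarter as many slots.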


    Another issue was that we had a number of hotels that were using pulse dialing (this was a long time ago in a galaxy far far away). Sometimes these would be off by one due to the inherent unreliability of pulse dialing, and the result was a lot of calls to certain numbers related to the 800 number, all in the middle of the night.


    BTW... as far as I know, this was the first large widely distributed commercial computing system to use switched telephone circuits for communications (but no doubt some other grey-haired slashdotter knows of another).

    --

    The only good weather is bad weather.

  8. My prof at Georgia Tech stressed this a lot by delphin42 · · Score: 5, Informative

    He was considering making Fatal Defect required reading for the C programming course I took. From Amazon.com:

    In Fatal Defect: Chasing Killer Computer Bugs, Ivars Peterson describes dozens and dozens of hoary computer bugs and gives biographical sketches of the bug detectives who located and fixed them. This book, which reads like a novel, is both entertaining and informative. Many of the bugs that Peterson discusses are not in computer programs per se but in the human systems that run and operate the computers. Very often the operator fails to understand what the computer program requires as input and types in an incorrect command. The computer then executes the command, with potentially disastrous results. Fatal Defect has important lessons for both those who design computers and those who use them.

    He also insisted that we not call them bugs. "They are ERRORS, calling them bugs makes it sound like they are cute little accidental things that pop up when actually they are programming mistakes."

    --
    -- Adam
  9. Read comp.risks by kzinti · · Score: 5, Informative

    Make reading the ACM's RISKS digest a part of your regular routine, and you'll hear about these kinds of software-related problems and many others - usually shortly after they happen. The RISKS digest is available on Usenet as comp.risks, as a mailing list, and on the WWW at http://catless.ncl.ac.uk/Risks. A new issue is published on a semiregular basis, every one to two weeks. It's not only informative but interesting too.

    --Jim

  10. It's Worse: The Patriot Never Worked by GuyMannDude · · Score: 5, Informative
    The Patriot missile defense system never worked -- the bug mentioned in the article is a red herring. The main problem was that the Iraqis had modified the Scud with additional fuel tanks. The resulting missile was unstable and would start to break apart in flight. The Patriot couldn't lock on to the missile because of all the shrapnel. In addition, Scuds are poor missiles to begin with. When they fly, they do so with a wobble -- like a poorly thrown football. The Patriots had been tested prior to the war on good-quality American missiles which flew in a smooth trajectory. The Patriots simply couldn't deal with a missile that "danced around" in midflight. Bottom line: the Patriots simply do not protect against Scuds because of poor design -- not some floating point error. The floating point explanation is analogous to that Coriolis-effect-causes-water-to-swirl-in-the-toilet myth that you find in so many physics textbooks (the Coriolis effect only works on planetary scales). It looks good on paper but if the "experts" had bothered to perform a test they would see that the explanation is dead wrong. The failure of the Patriots to intercept Scuds (and the fact that the media never mentions this) has grave implications for our anti-ballistic missile shield.

    Don't take my word for it. Do a web search and see for yourself. Here are some references to get you started:

    http://www.fas.org/spp/starwars/docops/rp911024.htm

    http://www.csmonitor.com/durable/1997/09/08/opin/letters.1.html

    GMD

  11. The bugs that didn't happen by MountainLogic · · Score: 5, Funny

    I'm sure we all have those bugs that we catch in bench testing. Mine was forgetting to add a cancel button to the following dialog box:

    "OK to delete database"

    When I caught that one I had visions of a user who had his/her million dollar database deleted charging into our office with a shotgun and ... well, you read the papers. Glad I caught that one before I released it to test.

  12. Software bugs...NOT! by T.E.D. · · Score: 5, Insightful
    I'd call it a bad sign when the first two entries on a page that purports to show famous software bugs are not, in fact, software bugs.

    The bug that caused the Ariane explosion was a requirements analysis bug. The Pentium FP bug was a hardware bug.

    A quick skim of the rest nets me at least 6 more non-software software bugs
    • 4. Mars Climate Orbiters, Loss (Mixture of pounds and kilograms, 1999) - Specification bug
    • 27. Distributed denial-of-service attacks - Malicious people
    • 31. Florida Voting Chaos - not a damn thing to do with computers
    • 34. Wall Street Crash, October 1987 (Acceleration of the crash) - computers did precisely what their users wanted them to do
    • 42. Great Concert Disasters - WTF?!
    • 43. Tacoma Bridge (not a computer bug)(collapse, 1940) - he said so himself

    After seeing that, I can't really trust the list on things I don't have a good knowledge about.

    Here's a challenge for someone: Go through the list and find out how many (if any) of the listed software bugs are actually software bugs.
  13. Re:Patriot Scud Time Error by Kintanon · · Score: 5, Informative

    That system just wasn't designed for that purpose. It was VERY well designed for its actual purpose, which was tracking AIRCRAFT going WAY slower than that missile. And it was only rated for 14 hours of continuous usage, not 100. So it wasn't a fault in the program per se, but a misapplication of a system designed for a different use.
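
    The 14-versus-100-hour point is easy to check with a little arithmetic. A minimal sketch, assuming the commonly published account of the bug (not stated in this thread): the system clock counted tenths of a second, and the factor 1/10 was stored chopped to 23 binary places in a 24-bit fixed-point register. The function names are mine.

```python
from fractions import Fraction

FRACTION_BITS = 23  # assumed number of bits after the binary point


def chopped_tenth(bits: int = FRACTION_BITS) -> Fraction:
    """1/10 truncated (not rounded) to `bits` binary places."""
    scale = 2 ** bits
    return Fraction(scale // 10, scale)


def clock_drift_seconds(hours_up: int, bits: int = FRACTION_BITS) -> float:
    """Accumulated clock error after `hours_up` hours of continuous operation."""
    ticks = hours_up * 3600 * 10  # the clock counts tenths of a second
    error_per_tick = Fraction(1, 10) - chopped_tenth(bits)
    return float(ticks * error_per_tick)


if __name__ == "__main__":
    print(f"{clock_drift_seconds(14):.4f}")   # 0.0481
    print(f"{clock_drift_seconds(100):.4f}")  # 0.3433
```

    The per-tick error is tiny (about one ten-millionth of a second); it only matters because it is never reset while the system stays up.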

    Kintanon

    --
    Check out JoshJitsu.info for Brazilian Ji
  14. We have a difficult battle ahead ... by jc42 · · Score: 5, Informative

    Some years back, as a grad student, I saw a bunch of colleagues do a rather unnerving experiment. Much of the number crunching was, as usual, done in Fortran. So they instrumented the compiler to silently test for integer overflow, report when it happened, and also report whether the program tested for it.

    Their result was that roughly 50% of the Fortran programs on the mainframe computer produced at least one number in the output that was wrong due to undetected integer overflow.

    This itself would be bad enough. But a bunch of us followed this up by asking Fortran programmers about it. What we did specifically was to point out that, unlike floating point, where there's an interrupt, integer arithmetic required a separate instruction to test the overflow flag. So testing for integer overflow took extra cpu cycles. Then we asked them whether they thought that software should be modified to always test for integer overflow, as is done with floating point.

    The answer was overwhelmingly that if it took extra cpu cycles, the software should not check for overflow.

    When we pointed out that this introduced the risk of programs producing incorrect results, the Fortran programmers invariably said that didn't matter. Faster is better, even if some of the results are wrong.

    I think of this whenever I read about computers used in medical, transportation, or other areas where malfunctioning software could put lives at risk. I don't believe that the "software culture" has changed significantly in this respect since then.
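
    The trade-off described above can be sketched in a few lines. Python integers never overflow, so this example simulates a 32-bit two's complement machine word; the word size and function names are assumptions for illustration, not details of the original Fortran experiment.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def wrapping_add(a: int, b: int) -> int:
    """What unchecked hardware does: keep the low 32 bits and carry on."""
    total = (a + b) & 0xFFFFFFFF
    return total - 2**32 if total > INT32_MAX else total


def checked_add(a: int, b: int) -> int:
    """The 'extra instruction': test the result before trusting it."""
    total = a + b
    if not INT32_MIN <= total <= INT32_MAX:
        raise OverflowError(f"{a} + {b} overflows a 32-bit integer")
    return total


if __name__ == "__main__":
    print(wrapping_add(INT32_MAX, 1))  # -2147483648, silently wrong
    try:
        checked_add(INT32_MAX, 1)
    except OverflowError as exc:
        print("caught:", exc)
```

    The unchecked version is the one that ends up as a plausible-looking wrong number in the printout.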

    --
    Those who do study history are doomed to stand helplessly by while everyone else repeats it.
  15. Always works right on my system by nomadicGeek · · Score: 5, Funny

    My software always works perfectly on my system. Zero bugs.

    I have no idea what the hell the users do to it to screw it up.

  16. Re:Happy to hear it... by jc42 · · Score: 5, Interesting

    There is one highly relevant difference between the way that we deal with hardware and software. With hardware, inner details, schematics, and the like are usually easily available. Often this is required by law in any critical applications.

    With software, most programmers are writing code to run on systems (kernels, runtime libraries, and the like) that are usually proprietary. The inner details are not just neglected; the companies intentionally keep them secret and prosecute people who leak them.

    As a result, software can't be made reliable, not even in principle.

    We do have a few exceptions, e.g. Linux and all the GNU stuff. If *everything* underneath your code is Open Source, then in principle you can examine it and find problems. (It ain't easy, but at least it's doable if your employer will permit the time that it takes).

    But we're facing a major battle just getting Open Source software accepted by a tiny part of the market. In most jobs, you are required to write code for systems whose inner workings you are not permitted to know.

    The US government is even using proprietary, binary-only computer systems in secure and mission-critical situations. Anyone who expects the code in such situations to be reliable is either utterly ignorant or actively malicious.

    Myself, I'd welcome rules that make me and other software developers responsible for bugs in our code. If there were such a legal requirement, I could point to it when someone denies me access to the information that I need, and say "I can't possibly write correct code when you are keeping vital information from me. Show me the inner details of these parts of the system, and I'll agree to write reliable code for it."

    Of course, in a couple of cases, when I've gotten my hands on such details, I've proceeded to write a proof that certain things could not be done reliably on that system. "Fix that bug in that library, and I'll vouch for my code. Until then, here's my bug report describing exactly how it will fail."

    Unfortunately, when I've done this, the usual result was that I was looking for another job soon thereafter.

    (One such lost job was when I proved that certain sensors in a nuclear power plant could not be made to work reliably due to their software. But that was 20 years ago; maybe they've fixed it by now. ;-)

    --
    Those who do study history are doomed to stand helplessly by while everyone else repeats it.
  17. Re:Millennium Bridge - Kansas City skywalk by victim · · Score: 5, Interesting

    Human effects on bridges are hardly a surprise. Recall in 1981 when the Kansas City Hyatt's skywalk collapsed, killing 114, because the pedestrians were dancing (and the design was altered to ease construction). You'd think that would have been enough of a wake-up call to the Millennium designers to consider human motion. more info

    Armies break cadence when marching across bridges, at least as far back as Napoleon's time. Presumably they learned that the hard way.

    On a more personal note, I have participated in the unintentional destruction of a gymnasium. 80 or so people crowded together in the middle, bouncing up and down, and then "down and down". We fractured the engineered wooden joists. Fortunately it failed gracefully. Just sagged down about 4 feet in the middle.

    What I'm trying to say, not particularly directly, is "don't give the designers of the bridge a pass because this new phenomenon struck their bridge". Chastise them for risking people's lives and wasting resources by neglecting the loads placed on bridges.

  18. CUI by Ozan · · Score: 5, Insightful

    I think most of the bugs in software are the result of "Coding Under Influence". Whether it is a strict time limit, ambiguous specifications, no sleep or other disturbances, it leads to blatantly dumb assumptions or similar faults. Everyone knows that driving under influence is dangerous and can lead to accidents. Why do "software architects" think this is different when someone writes important programs?
    I think part of the problem is that writing software is a rather new craft in comparison to e.g. metalworking. Programmers don't have a union, and they often work under poorer conditions than workers at conveyor belts if you consider the higher responsibility they have.

  19. Re:Millennium Bridge - Kansas City skywalk by igrek · · Score: 5, Funny

    In the old USSR (Stalin times), there was a standard bridge acceptance test:
    1) put project managers, lead architects and engineers under the bridge;
    2) put heavily loaded trucks on the bridge.

    That was real extreme testing.

  20. Re:It's Worse: The Patriot Never Worked by 5KVGhost · · Score: 5, Insightful
    The failure of the Patriots to intercept Scuds (and the fact that the media never mentions this) has grave implications for our anti-ballistic missile shield.


    I'm pretty sure the media has mentioned this, beyond those two media links you already posted, I mean. The issue has been debated since the first Patriot experiences during the Gulf War.

    But I don't really see how this has "grave implications" for an anti-ballistic missile shield. The effectiveness of the Patriot missile used during the Gulf War era is in doubt, but that does nothing to invalidate the general concept of destroying a ballistic missile with another interceptor missile. It certainly isn't easy to do, and there may be better ways to accomplish the same goal or things more worthy of our limited resources, but to claim that it's somehow physically impossible is both disingenuous and incorrect.