Slashdot Mirror


Examples of Programming Gone Wrong?

LightForce3 asks: "I'm a beginning CS student, and in my studies I've come across examples of programmer error causing very large problems, such as the Ariane 5 failure and the Therac-25 accidents, often as tales of caution to beginner programmers such as myself. My (morbid?) curiosity has been piqued, and I'm looking for other examples of programmer error leading to serious problems. After all, it is better to learn from the mistakes of others than from your own, right? ;) What programming-related accidents, incidents, and failures, both well-known and obscure, do Slashdot readers know about, and are there any good resources for researching these?"

42 of 626 comments

  1. already.. by Suppafly · · Score: 1, Informative

    This was already an Ask Slashdot a while back. Check the archives.

    1. Re:already.. by gmajor · · Score: 5, Informative

      http://slashdot.org/articles/02/05/02/1525210.shtml?tid=128 - Debug your code, or else!

      Using Google's search engine provides better results for slashdot.org than Slashdot's own search engine :-)

  2. The book "Fatal Defect" by spanky555 · · Score: 5, Informative

    This book is devoted to just that. It's what you're looking for...go get it and read it.

  3. Mars Orbiter Lost Over Metric Conversion by Kircle · · Score: 4, Informative

    http://slashdot.org/articles/99/09/30/1437217.shtml

    --

    -- Kircle

  4. Re:Challenger by agentZ · · Score: 5, Informative

    What happened to Challenger wasn't a programming mistake, but rather a case of not following policy. The solid rocket boosters were never designed to operate in cold temperatures. The result of working outside of design specs was catastrophic failure, yes, but that wasn't the result of a programming error.

  5. How about the AT&T Switch failure in NY? by Bolen · · Score: 5, Informative

    A Central Office (CO) switch is basically a mainframe-class computer programmed in assembler. A few years back, a newly installed switch failed due to a bug in the code, causing a cascading failure of the phone system for a few hours.

    1. Re:How about the AT&T Switch failure in NY? by doc_side · · Score: 2, Informative
  6. Re:Challenger by Pyromage · · Score: 5, Informative

    Incorrect: This was not a programming issue. Nor was it a software issue at all. The problem was the O-ring seals in the SRBs (Solid Rocket Boosters). The manufacturer stated that they should not be operated under 53 degrees, and NASA overrode the recommendation and launched anyway. The expected happened.

    NASA hasn't ever had a hardware problem. Or a software problem. Ever. Every problem can be directly tied to one specific person being a fscking moron. The closest you could come is that Mars probe that crashed because of mismatched units. And that was just poor communication among the software guys.

  7. RISKS Digest by BinBoy · · Score: 4, Informative

    The RISKS Digest is a mailing list and usenet newsgroup that describes all kinds of situations where technology has gone wrong. Many of the stories involve programming errors.

    Google's RISKs Archive

  8. Mars Orbiter Lost Over Metric Conversion (link) by Kircle · · Score: 2, Informative
    --

    -- Kircle

  9. Failures by Jordan+Graf · · Score: 4, Informative

    MIT runs a class called 6.033: Computer Systems Engineering. These lecture notes contain a list of projects that had great sums of money spent on them only to be abandoned. Also the reading list has a bunch of papers that discuss the "big splash" failures like Therac 25.

  10. Pretty sure this was posted earlier on slashdot by Utopia · · Score: 2, Informative

    but couldn't find it.

    Anyway, here are a couple of links.
    Software horror stories
    More horrors

  11. A common but fatal problem (most of the time) by vga_init · · Score: 2, Informative
    I can remember that when I first started learning how to program, a common mistake would be that my programs would get stuck in infinite loops. @_@ This is not so big of a deal with a sophisticated, multitasking operating system, but back when I was using DOS and had no way to break the code, I had to reboot!

    I think it would be interesting to know when the first infinite loop occurred in the early days of programming, and how the programmers dealt with it. Obviously, back then they only had single-tasking machines.

    Let's say you turned in some bad FORTRAN code to the university computer on a time share. What if nobody noticed for hours that your program was taking up all the processing time? That would make some people pretty pissed. :p

  12. Re:One Word by Helter · · Score: 5, Informative

    Come on now, that's the lazy way!

    How about citing an actual example of Windows code bugs causing big problems? I'll go first. The USS Yorktown had to be towed back to harbor when the NT system that was automating most of the ship crashed.

  13. Re:That was an easy setup by Transcendent · · Score: 3, Informative

    That was not an error in the programming... some dumbass gave all the calculations in English units for acceleration to a programmer whose program used SI units (or metric... same thing...).
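
    A minimal sketch of one way to make this kind of mix-up harder, assuming C: give each unit its own struct type so the compiler rejects an English-unit value where an SI value is expected. (The Mars Climate Orbiter mishap report describes the mismatched quantity as thruster impulse, pound-force seconds versus newton-seconds.) The type and function names below are invented for illustration; only the conversion factor is a real constant.

```c
/* Illustrative only: these type and function names are hypothetical. */

/* One struct per unit, so the two cannot be mixed by accident. */
typedef struct { double value; } NewtonSeconds;
typedef struct { double value; } PoundForceSeconds;

/* 1 lbf = 4.4482216152605 N (exact by definition). */
#define NEWTONS_PER_LBF 4.4482216152605

/* The only sanctioned way to turn English-unit impulse into SI. */
NewtonSeconds to_newton_seconds(PoundForceSeconds p)
{
    NewtonSeconds n = { p.value * NEWTONS_PER_LBF };
    return n;
}

/* Accepts SI only; passing a PoundForceSeconds here is a compile error. */
NewtonSeconds add_impulse(NewtonSeconds a, NewtonSeconds b)
{
    NewtonSeconds sum = { a.value + b.value };
    return sum;
}
```

    With these types, add_impulse(total, to_newton_seconds(reading)) compiles, while handing add_impulse a raw PoundForceSeconds does not. It does nothing about numbers that cross the boundary as bare doubles in a file, which is where the interface spec has to do the work.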

  14. Re:the harrr-rrrrror by s20451 · · Score: 5, Informative

    US shooting down Airbus 320

    You're referring to the destruction of Iran Air flight 655 by the USS Vincennes near the Strait of Hormuz, on July 3, 1988. For one thing, it was an Airbus A300 (bigger and older than an A320). The failure there was mostly in human decision making, not in the AEGIS radar system, which faithfully reported that the airliner was travelling at 450 knots on a steady bearing towards Vincennes, roughly four miles outside the commercial air corridor, and not broadcasting IFF information (which of course they wouldn't, as a foreign civilian airliner). It was the officers of Vincennes who interpreted this information as a threat, misidentified the target as an Iranian F-14, and destroyed it.

    --
    Toronto-area transit rider? Rate your ride.
  15. Re:That's kind of silly by pongo000 · · Score: 4, Informative

    Wouldn't setting it to something like 0 be better?

    In most areas of the world (unless you're flying over the Dead Sea, or Death Valley, or New Orleans), if your altimeter reads 0, you're probably already dead. Altimeters used for navigation read MSL (height above mean sea level), not AGL (height above ground). There are radar altimeters that read in AGL, but these are used for close-to-ground maneuvers like landing.

  16. not true by PissedOffGuy · · Score: 4, Informative

    the database they were using faulted on a divide by zero. nothing to do with NT.

    1. Re:not true by PissedOffGuy · · Score: 5, Informative

      what are you talking about? the navy inquiry found no fault in NT. here, you try this: write a program that divides by zero and run it on NT. as with any other good OS, the program shuts down and the OS keeps going. user mode code cannot cause a blue screen, makes sense.

      in the navy's case the crashed program was enough to call the computers "down", and that makes sense too. the only thing that doesn't make sense is the attribution of blame to the OS for an app problem.

    2. Re:not true by xcham · · Score: 2, Informative

      Just an update/correction: as recorded by Zappadoodle, they finally DID fix it. Still, you can see from this (and the inordinate amount of time it took for them to address this bug) that Microsoft's main focus is obviously somewhere other than stability and security. Penguin POWER. :D

      --
      When life gives you lemons, you CLONE those lemons, and make SUPER-LEMONS. -- Dr. Cinnamon Scudworth, Ph.D
    3. Re:not true by len_harms · · Score: 2, Informative

      If MS is like any other company they have a list, and that list gets prioritized. Currently MS has stated its top priority is security fixes: buffer overruns and the like. Yet a bug like that would have to be coded for. So don't do that. For example, I submitted a bug years ago about being able to resize download windows. Basically a minor cosmetic issue. Yet it took 3 versions before it was even fixed. The real question is whether MS knows the bug is there, or whether only the trade rags know about it. Being in a magazine or on the web does not mean that MS automatically knows about it. They do not have people whose job it is to go around finding bugs out on the web that people have produced but never bothered telling them about. Most of the time bugs are only addressed by MS if you have TOP level support, i.e. you paid them big bucks per year per incident to get it fixed. Otherwise you get the 'we might get around to fixing it in about 20 versions'. No money to fix, no fix...

      Rework costs money. MS has taken the position that if you pay for it, it will get fixed now. Otherwise wait in line with everyone else. They probably have THOUSANDS of bugs, each one with a customer yelling about it. They fix the ones that most people are yelling about first, then the next one, and so on.

      Remember NT is a system of exes. That you put one exe with another one and it acts like this is not surprising. I bet there is no one person in this world who could tell you how the whole thing acts in every condition.

      Probably the best quote I ever heard was out of an MS book on programming: it's like a bowl of jello that is shaking. When that bowl is quivering least, we ship.

      My other favorite quote is from a programmer I work with at my company: if you're walking off into memory with stray pointers, all bets are off.

  17. I can't recommend comp.risks too highly by Camel+Racer · · Score: 3, Informative

    I can't recommend the risks site too highly. (redundant, I know)
    Risks To The Public In Computers And Related Systems
    http://catless.ncl.ac.uk/Risks

    On how to be 0wned by other people: Counterpane: Crypto-Gram. Shares with comp.risks the refrain of "I can't believe people don't learn from this"
    Counterpane: Crypto-Gram
    http://www.counterpane.com/crypto-gram.html

    Don Norman's _The Design of Everyday Things_ and website also offer insight into how to avoid UI-related failures.
    http://www.jnd.org/index.html

    Also, get a copy of _Code Complete_ and/or _Code Write_ by Steve McConnell (pub: Microsoft Press, which is rich irony). Lots of mistakes and how to avoid them.
    The cautionary note might be that most of these failures are human related at some level. Whether it be at the project level, or the UI level -- there are lots of ways to cause a failure.

    Finally, avoid any kind of career in Software QA. There is no better way to just get kicked around at the expense of the people putting the bugs in the software in the first place.

    --
    Anybody can work under ideal circumstances. -- Jeff K. (January 4, 2001)
  18. NASA software bugs by dstone · · Score: 3, Informative

    Someone here was claiming that NASA has never had a software bug. That sounded pretty unbelievable to me. And sure enough, it's not true. In the recent Mars missions alone, they had a bunch of software bugs resulting in things varying from non-fatal vehicle failures to outright loss of spacecraft.

    Regarding the loss of the Mars Climate Orbiter spacecraft, from nasa.gov: "The 'root cause' of the loss of the spacecraft was the failed translation of English units into metric units in a segment of ground-based, navigation-related mission software"

    Also, several "software bugs" (their words) relating to the Mars Surveyor Lander Vehicle are described here. These bugs were detected and fixed in the field (i.e., on Mars). At least one of the bugs caused a heater failure in the vehicle on Mars. This failure was recovered from.

    Anyways, those are just two quickies, but NASA has their share of bugs. (And generally some pretty ingenious ways to reprogram and update vehicle software post-launch.)

    On a related note, here's a paper from NASA entitled "The Infeasibility of Quantifying the Reliability of Life-Critical Real-Time Software".

  19. How is an app the fault of NT? by deadsquid · · Score: 5, Informative

    Much as I dislike NT, especially in critical environments, this problem had nothing to do with NT. It had everything to do with bad coding.

    As we all know, information systems are only as smart as people make them. In the case of the USS Yorktown, an admin/operator entered data which caused a divide by zero condition in the application. Because the application did not have any exception handling built into it for a divide by zero condition, it died.

    You can't blame the OS for this. The application should have had exception handling built into it in a couple of places. It probably should have checked any new entries before committing them to ensure the new data would not introduce such a condition, and the app itself should have had appropriate error handling to prevent a panic/dump when a divide by zero condition was encountered.

    If the app was coded by the same people on another platform, the end result would have been the same.
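
    The two layers of defence described above (check the entry before committing it, and guard the division itself) might look roughly like this in C. This is only a sketch under my own assumptions: the function names are hypothetical and nothing here reflects the Yorktown application's actual code.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: divides only when the result is well defined.
   On failure it returns false and leaves *quotient untouched. */
bool safe_divide(long numerator, long denominator, long *quotient)
{
    if (denominator == 0)
        return false;                 /* would otherwise trap or crash */
    if (numerator == LONG_MIN && denominator == -1)
        return false;                 /* quotient would overflow a long */
    *quotient = numerator / denominator;
    return true;
}

/* Hypothetical entry check: reject a bad field before it is committed,
   so the zero never reaches the calculation in the first place. */
bool entry_is_valid(long divisor_field)
{
    return divisor_field != 0;
}
```

    The caller then has to decide what a failed division means (reject the record, use a default, raise an alarm), which is exactly the decision the crashed app never made.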

    --
    Idiot, n. A member of a large and powerful tribe whose influence in human affairs has always been dominant
    1. Re:How is an app the fault of NT? by rakslice · · Score: 3, Informative

      Okay, but the only specific failure cited in the article has nothing to do with NT.

  20. It was a bad break in C code by hayne · · Score: 5, Informative

    Actually, the switching code was in C and the crash was due to a programmer's apparent misunderstanding of the 'break' statement. See full details at: http://www.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collapse.html
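
    For anyone who hasn't been bitten by this: in C, a 'break' inside an 'if' does not leave the 'if'; it leaves the nearest enclosing loop or 'switch'. The following is a simplified, made-up illustration of that pattern, not the actual switch code, and all names in it are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical message handler; returns a bitmask of the steps that ran. */
enum { STEP_EMPTY_CASE = 1, STEP_NORMAL_CASE = 2, STEP_FINISH = 4 };

int handle_message(int type, bool buffer_empty)
{
    int steps = 0;

    switch (type) {
    case 1:
        if (buffer_empty) {
            steps |= STEP_EMPTY_CASE;
            break;              /* meant: skip the else; actually exits the switch */
        } else {
            steps |= STEP_NORMAL_CASE;
        }
        steps |= STEP_FINISH;   /* silently skipped whenever the break above fires */
        break;
    default:
        break;
    }
    return steps;
}
```

    In this sketch, handle_message(1, false) runs the normal and finishing steps, while handle_message(1, true) never reaches the finishing step. Nothing about it fails to compile or looks wrong at a glance, which is why it survives review.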

  21. Good site by dumboy · · Score: 4, Informative
    Check out this site

    http://wwwzenger.informatik.tu-muenchen.de/persons/huckle/bugse.html

  22. Insidious bug from the wayback machine by rufusdufus · · Score: 3, Informative

    Back when C++ was new, there was an insidious problem with the syntax that never showed up during compilation.

    if(c=='\\') //check for \
    slashfound=1; //found one, handle path
    ++index;

    Code similar to this delayed shipment of a commercial product because it caused serious instability.

    1. Re:Insidious bug from the wayback machine by rufusdufus · · Score: 4, Informative

      Perhaps I should point out the bug: the comment "//check for \" ends with a pre-processor line-continuation character (\), which effectively appends the next line onto the current line. Thus the code "slashfound=1" is effectively commented out, and the next statement (++index) only executes if c=='\\'.
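
      Here is a self-contained reconstruction of the effect, as a sketch only: the struct and function names are made up, and the loop around the 'if' is my assumption about how such code would be used. The buggy and fixed versions differ only in the comment.

```c
/* Hypothetical reconstruction; field names follow the parent's snippet. */
typedef struct { int slashfound; int index; } ScanResult;

ScanResult scan_buggy(const char *s)
{
    ScanResult r = { 0, 0 };
    for (; *s != '\0'; ++s) {
        if (*s == '\\') // check for \
            r.slashfound = 1;
        ++r.index;
    }
    return r;
}

ScanResult scan_fixed(const char *s)
{
    ScanResult r = { 0, 0 };
    for (; *s != '\0'; ++s) {
        if (*s == '\\') /* check for a backslash */
            r.slashfound = 1;
        ++r.index;
    }
    return r;
}
```

      In scan_buggy the flag is never set and the index only advances on backslashes; scan_fixed sets the flag and counts every character. As far as I know, current GCC and Clang flag the trailing backslash as a "multi-line comment" when -Wcomment is on (it is part of -Wall in GCC), which is one more reason to build with warnings enabled.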

  23. One of the best resources I've found by PghFox · · Score: 5, Informative
    The Pragmatic Programmer: From Journeyman to Master, is one of the best resources I've found to avoid common programming mistakes. This book details many of the common errors we make as software developers and describes strategies for overcoming them. Having been in the field for close to two decades, I've found this book to be of immense value, and give it a high recommendation.

    Some of the tips, which may appear obvious to some of us, include:
    • Always Aim for Simplicity, Clarity and Generality
    • Treat all of your code as if you're going to release it
    • Keep subroutines small; break-up code as you go
    • Document as you go, not after the fact
    • Write tests as you go, not after the fact
    • Fix bugs immediately; do not delay fixing them
    • Do not duplicate any code, anywhere
    • Separate form and functionality
    • Subroutines should do one thing and do it well
    • Make your work easy to reuse
    --
    --- Fox
  24. Re:Airbus by s20451 · · Score: 3, Informative

    Unfortunately, it has been conclusively proven by experience that the risk of an incapacitated pilot causing an accident is much, much less than the risk of a pilot and computer being at odds over the correct course of action in an emergency, or the risk of computer settings confusing the pilot. I prefer the Boeing design philosophy, which is that the pilot is the final authority on the operation of the airplane, not the computer. The pilot, not the software engineer, is on board the airplane, and therefore has a much higher interest in ensuring that the vehicle gets on the ground in one piece.

    --
    Toronto-area transit rider? Rate your ride.
  25. Re:Y2K? by T-Ranger · · Score: 3, Informative

    Perhaps true, but back in the days of punchcards and COBOL you weren't storing an integer for a date, you were storing a string.
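
    To make the string point concrete, here is a rough C sketch (obviously not COBOL, and every name in it is hypothetical) of what goes wrong when dates are kept as "YYMMDD" text and compared character by character, plus the "windowing" repair many shops used. The pivot year is an assumption each application had to choose for itself.

```c
#include <string.h>

/* Compares two "YYMMDD" strings the way legacy record-sorting code might. */
int sorts_before(const char *a, const char *b)
{
    return strcmp(a, b) < 0;
}

/* "Windowing" repair: two-digit years below the pivot are read as 20xx.
   The pivot is a per-application assumption, not a standard value. */
int expand_year(int two_digit_year, int pivot)
{
    return (two_digit_year < pivot ? 2000 : 1900) + two_digit_year;
}
```

    With this, sorts_before("000101", "991231") is true, so 1 January 2000 sorts ahead of 31 December 1999. Windowing only postpones the problem to whatever year the pivot implies.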

  26. Re:Why this cant be right... by bongholio · · Score: 4, Informative

    You're all sorta right.. here is one of my favorite aviation pages. It'll tell you more than you ever wanted to know about airplane physics (from a pilot's point of view). Chapter 1 covers these altitude/speed/power concepts...

  27. Re:That is NOTHING -- 10,000 died in Bhopal, India by jgaynor · · Score: 4, Informative

    A "large quantity of water" entered the storage tank because an employee who had just been fired dropped a hose into it out of spite (he didn't know what would happen, he just wanted to ruin something). Yes, the safety precautions were subpar, but when someone with legitimate access wants to destroy something it's pretty hard to prevent.

    And yes, this has nothing to do with programming error :).

  28. Blind code reuse by management edict by Anonymous Coward · · Score: 1, Informative
    "...I think reuse of code was part of the problem."

    In particular, there was a management decision that the software for the previous model would be used, even though the design criteria for the new model were different. Specifically, the Ariane 5 was capable of acceleration that overflowed variables in the program written for Ariane 4.
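
    As I understand the inquiry board's report, the failing operation was an unprotected conversion of a 64-bit floating-point value into a 16-bit signed integer. The flight software was written in Ada; the C below is only a sketch of what a protected conversion looks like, with a hypothetical function name, and says nothing about how the real code was structured.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: converts a double to int16_t only if it fits.
   Returns false (and leaves *out untouched) instead of overflowing. */
bool to_int16_checked(double value, int16_t *out)
{
    /* Written as a positive range test so that NaN is rejected too. */
    if (!(value >= INT16_MIN && value <= INT16_MAX))
        return false;
    *out = (int16_t)value;    /* in range, so the cast is well defined */
    return true;
}
```

    The check itself is trivial. The harder part is deciding what the system should do when it fails, and re-asking that question when the code moves to a vehicle with a different flight profile.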

  29. Re:A Great Story by florescent_beige · · Score: 4, Informative

    Speaking of aviation: This SAAB Gripen crash was attributed to the coding of the control laws in the flight control computer. So was this one. And this F-22. And let's all remember the Apollo 11 incident.

    --
    Equine Mammals Are Considerably Smaller
  30. Computer-Related Risks by Peter G. Neumann by Malic · · Score: 4, Informative

    I think I've recommended this book several times on Slashdot. Simply put, THE collection of computing-related horror stories.

    http://www.amazon.com/exec/obidos/tg/detail/-/020155805X/qid=1035769692/sr=8-13/ref=sr_8_13/104-4078673-1863905?v=glance&n=507846

    --
    I swear by MacOS X. Although I use to swear *at* MacOS 9...
  31. Re:RTM Worm by Anonymous Coward · · Score: 2, Informative

    Just for the record: he never went to jail.

  32. Programming Gone Wrong Sources by DorAgaznog · · Score: 2, Informative

    From my Software Engineering textbook (author: Vliet if you're interested), a few references you might like:
    - http://www.csl.sri.com/users/neumann/neumann-book.html
    - http://www.rothstein.com/slbooks/sl296.htm

    Also, you might like: "Design Paradigms: Case Histories of Error and Judgment in Engineering" by H. Petroski (not restricted to Software Eng)

    Enjoy, Rod

    --
    "I respect faith but doubt is what gets you an education." --who knows
  33. Sleipner A by RallyDriver · · Score: 3, Informative

    On a slightly different tack - the Sleipner A oil platform sank because of a bad design, caused by inaccurate computer-based modelling (using an FEA tool inappropriately). In this case it was the data, not the software.

  34. Re:OT: Scuds and Patriot missile defenses by trveler · · Score: 2, Informative
    Dead on! There's been loads of evidence that NONE of the patriots EVER hit a single scud

    Of course they didn't. The patriot was specifically designed to detonate itself CLOSE TO the offending missile and, hopefully, in the process destroy the latter. This is, in fact, what happened: Tel Aviv and surrounding areas were rained on by falling scud parts. These were pieces of the scuds intercepted by the Patriots.

    The problem of intercepting a moving target is difficult, but it becomes much easier when the goal is to simply get "near enough" to disable it with an explosion.

    --
    ... is whot bwings os tugevza tsuzay.