Slashdot Mirror


Examples of Programming Gone Wrong?

LightForce3 asks: "I'm a beginning CS student, and in my studies I've come across examples of programmer error causing very large problems, such as the Ariane 5 failure and the Therac-25 accidents, often as tales of caution to beginner programmers such as myself. My (morbid?) curiosity has been piqued, and I'm looking for other examples of programmer error leading to serious problems. After all, it is better to learn from the mistakes of others than from your own, right? ;) What programming-related accidents, incidents, and failures, both well-known and obscure, do Slashdot readers know about, and are there any good resources for researching these?"

28 of 626 comments

  1. Y2K? by Monthenor · · Score: 5, Insightful

    ...wherein a technique to save memory on older computers resulted in a massive media panic twenty years later. Oh, and it caused a couple of glitches.
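
    A minimal sketch of the failure mode, in Python for brevity: storing only the last two digits of a year saves space, but subtraction breaks once the century rolls over. The function names and the pivot value of 50 are illustrative assumptions, not taken from any particular system.

```python
def years_between_2digit(start_yy: int, end_yy: int) -> int:
    """Naive interval using two-digit years, as many legacy records stored them."""
    return end_yy - start_yy


def expand_year(yy: int, pivot: int = 50) -> int:
    """Windowing fix: map a two-digit year to four digits.

    Values below the (assumed) pivot are read as 20xx, the rest as 19xx.
    """
    return 2000 + yy if yy < pivot else 1900 + yy


def years_between_windowed(start_yy: int, end_yy: int) -> int:
    """Interval computed after expanding both years to four digits."""
    return expand_year(end_yy) - expand_year(start_yy)


# An account opened in 1985 ("85"), checked in 2000 ("00").
print(years_between_2digit(85, 0))    # -85
print(years_between_windowed(85, 0))  # 15
```

    Windowing only postpones the problem to the pivot year; storing all four digits removes it.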

    --
    Co-founder of GerbilMechs
  2. Re:already.. by Anonymous Coward · · Score: 5, Insightful

    Why not provide a link instead of saying "Oh yeah, I saw it way back when."

    You people who say "use google to find it" or "this was already asked" are worse than the people who actually ask the question.

    Their only problem (if it could be said to be a problem) is ignorance; your kind, however, are a much better example of the problem of self-righteous laziness.

  3. That's kind of silly by Ghoser777 · · Score: 4, Insightful

    Wouldn't setting it to something like 0 be better? I mean, I could miss it sticking at 12,000 for a while, but if I notice that my altitude is suddenly 0, I think my first instinct will be to pull up as fast as possible.

    F-bacher

    --
    James Tiberius Kirk: "Spock, the women on your planet are logical. No other planet in the galaxy can make that claim."
    1. Re:That's kind of silly by Anonymous Coward · · Score: 0, Insightful

      GOD, you're stupid. Did you even READ the comment you replied to? Stupid fucking Americans.

  4. Don't be so narrow by coyote-san · · Score: 5, Insightful

    Don't be so narrow in your approach. Is it a programming error if a stadium roof collapses because the engineers couldn't understand what the output of their computer model was saying?

    What about when the construction crew quietly substituted what they thought was an equivalent design to what the computer program came up with for a skywalk over a hotel lobby?

    After almost 20 years in this field, I think that at least 80% of the serious "errors" I see are because the user didn't understand the results of the program, and only 20% of them are due to classic development errors.

    The lesson to learn from this: the user interface matters. Give some thought to presenting the information in a meaningful manner (e.g., the infamous pre-Challenger graphs showing O-ring erosion vs. the post-Challenger graph that mapped damage by temperature at the time of launch), and allow users to see the information in the way that makes the most sense to them.

    --
    For every complex problem there is an answer that is clear, simple, and wrong. -- H L Mencken
    1. Re:Don't be so narrow by Anonymous Coward · · Score: 1, Insightful

      Is it a programming error if a stadium roof collapses because the engineers couldn't understand what the output of their computer model was saying?

      Nope, that's either a "stupid engineer" bug or a "poor interface design" issue, neither of which is a programming error.

      What about when the construction crew quietly substituted what they thought was an equivalent design to what the computer program came up with for a skywalk over a hotel lobby?

      Definitely nope. I don't even understand how that could be considered a programming error.

      After almost 20 years in this field, I think that at least 80% of the serious "errors" I see are because the user didn't understand the results of the program, and only 20% of them are due to classic development errors.

      I've also been in the field for over 20 years and most of the serious errors I've seen are due to programming issues. I've also noticed that embedded systems tend towards programming errors and desktop applications tend towards design issues.

      (e.g., the infamous pre-Challenger graphs showing O-ring erosion vs. the post-Challenger graph that mapped damage by temperature at the time of launch)

      Note also that every engineer knew that the o-rings could and would fail at low temperature; the launch decision was political, not engineering.

      Dave

    2. Re:Don't be so narrow by FattMattP · · Score: 4, Insightful
      The lesson to learn from this: the user interface matters. Give some thought to presenting the information in a meaningful manner (e.g., the infamous pre-Challenger graphs showing O-ring erosion vs. the post-Challenger graph that mapped damage by temperature at the time of launch), and allow users to see the information in the way that makes the most sense to them.
      On a related note, a guy named Edward Tufte wrote some books on just this type of subject. I believe one was called The Visual Display of Quantitative Information, or something like that. Basically, he shows how thinking more about how you present the data can help you communicate your ideas more effectively. He also talks about the O-ring problem that you mention. He shows the charts from the NASA engineers and then shows the charts he had drawn. You could definitely see the problem much more clearly in his drawings.
      --
      Prevent email address forgery. Publish SPF records for y
  5. Use Google by SoSueMe · · Score: 2, Insightful

    No, really.
    Search and you will find.

    Learning to search effectively will serve you best.

  6. Re:already.. by Java+Pimp · · Score: 5, Insightful

    Yeah it is probably in the archives. I've read it before.

    Problem is, the slashdot search engine sucks. I haven't yet been able to query the archives and actually find what I'm looking for without needing to dig through hundreds of irrelevant discussions. Sometimes I think it might be faster to just scroll back through the "Older Stuff" section.

    Or we could just have another discussion about it. :-)

    --
    Ascalante: Your bride is over 3,000 years old.
    Kull: She told me she was 19!
  7. Another nominee... by Anonymous Coward · · Score: 1, Insightful

    Netscape.
    Stopped evolving in a mass-market way at 4.79 ... and that version is buggy as hell.

  8. Re:One Word by seattle2napa · · Score: 3, Insightful

    I love how complete bullshit gets moderated to +5 Informative on slashdot. Why not do a tiny bit of fact checking and slam people for misinformation rather than praising them for anything negative having to do with Microsoft???

  9. Re:not true by Anonymous Coward · · Score: 5, Insightful

    The database failure caused NT to crash. Good software design includes failure planning.

  10. Re:That is NOTHING -- 10,000 died in Bhopal, India by xA40D · · Score: 2, Insightful

    I can't believe no one has mentioned it yet -- it's probably because they don't care about the third world

    Well, I do care about the third world. But I was not aware the Bhopal disaster was down to dodgy software. I always believed it was reckless cost cutting by a faceless multi-national which took advantage of the fact that India, as a developing country, didn't have very good health & safety legislation.

    But I think the main reason disasters like this are ignored is that poor people don't make very good consumers, so the consumerist society pays little attention to their wants and desires. And their deaths are little more than statistics.

    10,000 dead? Bung them some cash.

    Now, what do you think would happen if a large number of decent hard working consumers were wiped out in a single event?

    --
    Do you mind, your karma has just run over my dogma.
  11. Re:That is NOTHING -- 10,000 died in Bhopal, India by entrylevel · · Score: 3, Insightful

    Actually I think that no one mentioned this since it has nothing to do with programmer error. (At least not according to what you linked to.)

    --
    Karma: Incomprehensible (Mostly affected by posting at +5, reading at -1, and metamoderating everything unfair.)
  12. Re:That is NOTHING -- 10,000 died in Bhopal, India by Cyno01 · · Score: 3, Insightful

    Now, what do you think would happen if a large number of decent hard working consumers were wiped out in a single event?

    It did: 09/11/01.

    --
    "Sic Semper Tyrannosaurus Rex."
  13. Re:Challenger by Anonymous Coward · · Score: 1, Insightful

    NASA hasn't ever had a hardware problem. Or a software problem. Ever.

    you're full of crap. here are nasa bugs. there were people involved in the process of developing the software (surprise, surprise), but read nasa's admitted "root cause" for the climate orbiter ($$$$) being lost: it's software (not the people) which failed to translate units.

    next you're going to tell us software doesn't have bugs, programmers do. bs. if you're going to tell us nasa never has bugs ever, you better give us evidence!

  14. Always Mount a Scratch Monkey by calyxa · · Score: 3, Insightful
    --
    Decay! Decay! Decay! -Helium
  15. Re:Incorrect function usage. by antis0c · · Score: 5, Insightful

    No, I meant read it until you understand it. I don't want anyone working for me that doesn't think understanding documentation is a good thing or doing something the correct way rather than "it works so I might be doing it right."

    And there's a difference between not being able to code and not yet understanding a particular function. I may read a function's man page 2 or 3 times to make sure I understand correctly what is going on. Not necessarily because I'm incompetent, but because the wording may be confusing (wow, confusing wording in a manpage? Who would have thought..). That doesn't mean every single function for a particular language requires you to read the documentation for it multiple times. I assume nothing. Assuming something leads to bugs and insecurity. I've been programming in C for many, many, many years. When I do a little PHP programming to create some web interfaces I don't assume that just because both C and PHP have a function called strlen, and the general documentation says it returns the length of a string, that they work identically. So I read the entire strlen documentation for PHP to understand exactly what's happening. It took less than a minute, but now I'm not assuming. I know. This goes for lots of things. The more complex the functions you use, the more important it is to fully understand them.
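
    As a hedged illustration of why same-named functions deserve a second look (sketched in Python rather than C or PHP): C's strlen counts bytes up to the first NUL terminator, whereas PHP's strlen returns the byte count of the whole string, embedded NULs included. Both helpers below are hypothetical stand-ins that mimic those two behaviours.

```python
def c_style_strlen(buf: bytes) -> int:
    """Mimic C's strlen: count bytes up to, but not including, the first NUL."""
    nul = buf.find(b"\x00")
    return len(buf) if nul == -1 else nul


def binary_safe_strlen(buf: bytes) -> int:
    """Mimic a binary-safe length: every byte counts, NULs included."""
    return len(buf)


data = b"ab\x00cd"
print(c_style_strlen(data))      # 2
print(binary_safe_strlen(data))  # 5
```

    Same name, same one-line summary, different answer for the same bytes.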

    The point is coding correctly is the most important skill to learn. I have friends that hack together scripts and programs from examples and snippets of other code and a little bit of their own code to glue it together, with little to no understanding of what they are actually doing. Then, months later, something breaks that they can't fix, and they act as if it were the fault of whoever wrote the example code.

    No, it's their fault. Not because they hacked together examples, but because they didn't take the time to make sure they knew what the examples were doing, that the examples were implemented correctly, and that they understood exactly how the code in the examples worked.

    Take a look at OpenBSD's philosophy. You can learn a lot from it.

    --

    ..There's a-dooin's a-transpirin'
  16. Re:That is NOTHING -- 10,000 died in Bhopal, India by Forager · · Score: 5, Insightful

    From the site:

    "In 1969, as part of its global empire, Union Carbide Corporation set up its pesticide formulation unit in the northern end of the city of Bhopal in central India. Initially it mixed and packaged pesticides imported from the US but was gradually expanded. In December 1979 its Methyl Iso Cyanate (MIC) plant with an installed capacity of 5000 tonnes went into production.

    On the night of December 2, 1984, during routine maintenance operations in the Methyl Iso Cyanate (MIC) plant, at about 9.30 p.m., a large quantity of water entered storage tank no. 610 containing over 60 tonnes of MIC.

    This triggered off a runaway reaction resulting in a tremendous increase of temperature and pressure in the tank and 40 tonnes of MIC along with Hydrogen Cyanide and other reaction products burst past the ruptured disc and into the night air of Bhopal at around 12.30 a.m. Safety systems were grossly under-designed and inoperative. Senior factory officials knew of the lethal build-up in the tank at least one hour before the leakage, yet the siren to warn neighbourhood communities was sounded more than one hour after the leak started.

    By then, the poisons had enveloped an area of 40 sq.kms. killing thousands of people in its immediate wake. Over 500 thousand suffered from acute breathlessness, pain in the eyes and vomiting as they ran in panic to get away from the poison clouds that hung close to the ground for more than four hours."

    Nothing to do with programming errors here that I can see. Sounds more like gross negligence and incompetence to me.

    -A.

    --
    student of animation and the fine arts
  17. Re:Challenger by GileadGreene · · Score: 4, Insightful
    NASA hasn't ever had a hardware problem. Or a software problem. Ever.

    Well, except for Mars Polar Lander, where the failure review board determined that the lander crashed because a flag indicating contact with the ground was not initialized to zero prior to the start of the retro-thruster loop. So the flag got set by the shock of deploying the landing legs, never got reset, and caused the thrusters to switch off as soon as they were on.
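
    A toy model of that failure, in Python. It is a sketch of the mechanism as described above, not the flight software; the class, method names, and altitudes are invented for illustration.

```python
class Lander:
    """Toy descent controller with a latched touchdown flag."""

    def __init__(self) -> None:
        self.touchdown_flag = False
        self.engine_on = False

    def deploy_legs(self) -> None:
        # The deployment shock trips the leg sensor; the flag latches.
        self.touchdown_flag = True

    def descend(self, start_altitude_m: int, clear_flag_first: bool) -> int:
        """Run the powered descent; return the altitude at engine cutoff."""
        if clear_flag_first:
            self.touchdown_flag = False  # the missing initialization
        self.engine_on = True
        altitude = start_altitude_m
        while self.engine_on:
            if altitude == 0:
                self.touchdown_flag = True  # genuine ground contact
            if self.touchdown_flag:
                self.engine_on = False
            else:
                altitude -= 1
        return altitude


def cutoff_altitude(clear_flag_first: bool) -> int:
    lander = Lander()
    lander.deploy_legs()
    return lander.descend(40, clear_flag_first)


print(cutoff_altitude(clear_flag_first=False))  # 40
print(cutoff_altitude(clear_flag_first=True))   # 0
```

    With the stale flag left in place the engine cuts off at the starting altitude; clearing it first lets the descent run to the ground.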

    I guess maybe you forgot about Apollo 13 as well (hardware)? Or the Galileo High Gain Antenna that failed to deploy (hardware)? Or the serious telemetry system problems they had with one of the Voyagers (hardware)? Or the faulty landing bag on one of the Mercury flights (hardware)? (was it Glenn's? I don't remember) Or that funky glitch in the landing computer during Apollo 11 (software)? You know, there's a reason that most space missions tend to be heavy on redundant hardware, and invest a lot of time and effort in fault protection software.

    Every problem can be directly tied to one specific person being a fscking moron.

    Well yeah, but that's the case with a lot of bugs, isn't it? Mistakes tend to be people issues.

    The closest you could come is that Mars probe that crashed because of mismatched units. And that was just poor communication among the software guys.

    You are at least correct about that - the problem was not a software issue. Lockheed Martin Astronautics was on contract to supply everything to NASA in SI units (which is what NASA uses for everything). LMA - or at least the part that caused this problem - uses English (Imperial) units internally, and neglected to perform the appropriate conversion before they sent the data on to NASA.
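
    One way to make this class of mistake harder is to carry the unit along with the number. A minimal Python sketch, using an impulse value as an arbitrary example of a quantity handed across an interface; the Impulse class and its unit labels are hypothetical, while the conversion factor follows from 1 lbf = 4.4482216152605 N.

```python
from dataclasses import dataclass

LBF_S_TO_N_S = 4.4482216152605  # newton-seconds per pound-force second


@dataclass(frozen=True)
class Impulse:
    value: float
    unit: str  # "N*s" or "lbf*s"

    def to_si(self) -> "Impulse":
        """Return the same impulse expressed in newton-seconds."""
        if self.unit == "N*s":
            return self
        if self.unit == "lbf*s":
            return Impulse(self.value * LBF_S_TO_N_S, "N*s")
        raise ValueError(f"unknown unit: {self.unit}")


def consume_si(impulse: Impulse) -> float:
    """Receiving side: refuse anything not explicitly in SI."""
    if impulse.unit != "N*s":
        raise ValueError("expected N*s, got " + impulse.unit)
    return impulse.value


reading = Impulse(10.0, "lbf*s")
print(round(consume_si(reading.to_si()), 3))  # 44.482
```

    Passing the unconverted reading to consume_si raises an error instead of silently being off by a factor of about 4.45.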

  18. Re:Why, the world's favorite mail client, by billbaggins · · Score: 5, Insightful
    Not quite. I don't think even Outlook was ever set to just run code automatically. What went wrong was that for a long time (and, in unpatched versions, even today), Outlook would implicitly trust the "Content-type" header for an attachment or message, and, if it was a "safe" type (like text/html or image/jpeg) then the attachment would be handed off to the document-opener to be rendered & displayed inline. Problem was, the document-opener didn't go by the MIME type but by the extension. So if you had something like
    Content-type: image/gif
    Content-disposition: attachment; filename="fux0r.scr"
    then the document-opener would say "ah, this is a screensaver, I should execute it" and before the poor user knew what was going on, all hell was breaking loose...
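
    The mismatch can be sketched in a few lines of Python. This only illustrates the kind of consistency check that was missing, not Outlook's actual logic; the extension table and the function name are assumptions made up for the example.

```python
import os

# Hypothetical table: which extensions are plausible for a declared MIME type.
EXTENSIONS_FOR_TYPE = {
    "image/gif": {".gif"},
    "image/jpeg": {".jpg", ".jpeg"},
    "text/html": {".htm", ".html"},
}


def safe_to_render_inline(content_type: str, filename: str) -> bool:
    """Render inline only if the declared type and the extension agree."""
    ext = os.path.splitext(filename)[1].lower()
    allowed = EXTENSIONS_FOR_TYPE.get(content_type.lower(), set())
    return ext in allowed


print(safe_to_render_inline("image/gif", "kitten.gif"))  # True
print(safe_to_render_inline("image/gif", "fux0r.scr"))   # False
```

    The point is that the component deciding "safe" and the component deciding "how to open" must use the same evidence.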
    --
    "The best argument against democracy is a five minute chat with the average voter."
    --Winston Churchill
  19. Re:Why this cant be right... by cybercuzco · · Score: 4, Insightful

    That's not entirely true. Adding power will increase your altitude, but pulling up will too. When you pull up, you trade speed for altitude. In other words, you'll go higher but your plane will be going slower. Eventually you aren't going fast enough to maintain level flight, so you have to add power or stop trying to go higher. In some cases you're right though: if you are already only going fast enough to maintain level flight, pulling back on the stick will slow you down and decrease your altitude, but this isn't always true. As for the person who didn't understand how adding power increases altitude: when you go faster, you increase the lift coming from your wings (since lift is a function of speed and angle of attack), so there is a net upward force on the aircraft, causing it to go upwards.
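
    For the curious, the standard lift equation is L = 0.5 * rho * v^2 * S * C_L, where the lift coefficient C_L depends on angle of attack. A small Python sketch; the numbers are made-up round values, not data for any real aircraft.

```python
def lift_newtons(air_density: float, speed: float,
                 wing_area: float, lift_coefficient: float) -> float:
    """Lift L = 0.5 * rho * v**2 * S * C_L (SI units throughout)."""
    return 0.5 * air_density * speed ** 2 * wing_area * lift_coefficient


# Assumed values: sea-level air density, a 16 m^2 wing, C_L held at 0.5.
slow = lift_newtons(1.225, 50.0, 16.0, 0.5)
fast = lift_newtons(1.225, 60.0, 16.0, 0.5)
print(slow)
print(fast / slow)
```

    Because speed enters squared, going 20% faster at the same angle of attack gives roughly 44% more lift.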

  20. Wind was the *cause*. . . by kfg · · Score: 5, Insightful

    of the Tacoma Narrows bridge falling. The *fault* was with the design, and hence, the designers.

    An extended bolt puncturing the gas tank during a rear end collision was the *cause* of Ford Pintos exploding. The *fault* was with the design, and hence, the designers.

    Both of these items could have been claimed to be perfectly free of design flaws while being used as "intended."

    This argument did not save the designers from being found liable for their design flaws.

    The divide by zero error was the *cause* of the operating system's failure. The *fault* was with the operating system. The *operating system* crashed. An operating system failure is *always* the fault of the operating system, and hence, its designers.

    Read any textbook on the design of operating systems and in the first page or two you will find some sort of statement along the lines of, "A faulty app should never cause the operating system to fail." This is correct design.

    Let me repeat. If an app fails, it is the fault of the app. If the operating system fails, no matter what an app has done, it is the fault of the operating system. An operating system must *assume* apps badly written by complete incompetents.
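
    The same principle can be shown in miniature with a supervisor process that survives a crashing child. A Python sketch under obvious simplifications: a real operating system relies on hardware memory protection and traps rather than the subprocess module, and run_app is a hypothetical helper.

```python
import subprocess
import sys


def run_app(source: str) -> int:
    """Run a snippet of Python in a separate process; return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,  # keep the child's traceback off our console
    )
    return result.returncode


bad = run_app("print(1 / 0)")   # child dies with ZeroDivisionError
good = run_app("print(6 * 7)")  # child exits normally
print(bad != 0, good == 0)      # True True
```

    The divide by zero kills the child, and the supervisor simply records a non-zero exit code and carries on.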

    It doesn't matter what operating system. Windows, Linux, Mac or just the beads on your abacus.

    *It is the responsibility of the operating system not to fail.*

    The fact that such failures can be explained away as the fault of the app by people who should know better makes me grieve for the state of engineering these days. It can only result in products being produced with greater and greater "craposity" factors eventually resulting in a culture of complete "crapitude."

    KFG

  21. Re:Train collision by Reziac · · Score: 3, Insightful

    IANAP, but I'd think your bulletproof software should also have some way to gracefully account for "impossible" conditions, which users are so clever at creating!!

    --
    ~REZ~ #43301. Who'd fake being me anyway?
  22. New Slashdot Category? by po8 · · Score: 3, Insightful

    Could we have a new Slashdot category entitled Ask Slashdot To Do My Research/Homework For Me? Then I could mark this category unread and avoid some annoyance.

    There is so much information readily available on the subject of software failures online and in scientific and popular publications. (See other responses to this question for examples.) IMHO, the questioner should go look for the answer to this kind of question directly before bugging the entire Slashdot audience; the editors should enforce this policy.

    1. Re:New Slashdot Category? by Mac+Degger · · Score: 3, Insightful

      I'm kinda sick of seeing this kind of comment. If it were to hold any merit, WTF would 'ask slashdot' be for? To me, the whole purpose of the 'ask slashdot' category is to plumb the experience of the people who frequent this place. If you're not allowed to do that, what would you use it for?

      --
      -- Waht? Tehr's a preveiw buottn?
  23. London Ambulance disaster by os2fan · · Score: 3, Insightful
    There was a disaster in the dispatcher software that was written for London Ambulance. This was documented in a book on computer disasters.

    The system did not collapse per se but progressively became bogged down by a series of poor design issues and implementation issues.

    What happened was there was a memory leak, in that not all the memory used when a call was processed was released. This meant that each call chewed up a small part of core.
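
    The pattern being described looks roughly like the following Python sketch. It is not the London Ambulance code; the Dispatcher class and its fields are invented to show how per-call state that is never released accumulates.

```python
class Dispatcher:
    """Toy call tracker; 'release_on_close' toggles the leak."""

    def __init__(self, release_on_close: bool) -> None:
        self.release_on_close = release_on_close
        self.records = {}  # call id -> per-call working data

    def open_call(self, call_id: int) -> None:
        self.records[call_id] = bytearray(1024)  # memory held per call

    def close_call(self, call_id: int) -> None:
        if self.release_on_close:
            del self.records[call_id]
        # Leaky variant: the record stays referenced forever.


def records_left_after(calls: int, release_on_close: bool) -> int:
    dispatcher = Dispatcher(release_on_close)
    for call_id in range(calls):
        dispatcher.open_call(call_id)
        dispatcher.close_call(call_id)
    return len(dispatcher.records)


print(records_left_after(1000, release_on_close=False))  # 1000
print(records_left_after(1000, release_on_close=True))   # 0
```

    Each leaked record is small, which is exactly why this kind of bug tends to show up only after hours of load.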

    As the day wore on, this loss of memory started to make the system run slower, and created more calls as users started to worry about the non-show of the ambulance.

    Meanwhile, back at the control centre, the operators started getting blasted by messages about over-due ambulances, and other system warnings. They were spending time simply dismissing Error dialogues.

    By the end of the day, they were still dealing with the emergency issues notified at 12.00.

    Of course, in the inquiry, there were many different management and design issues to be addressed, including the reliability and scalability of the software. [It was a Visual Basic program.]

    I have seen a number of instances personally; most of these tend to be ignored by management keen to see the system up and running. The most common excuses for dismissing problems are "teething problems" and "Luddism".

    In practice, the real issue here is the UI. Not so much "flash chrome", but that the buttons and so forth will actually do what the user expects them to do. The user must be able to understand how to process and correct errors in relation to the application data itself. That is, if I enter 1200, and I mean 1130, I should be able to correct that.

    The other disaster happening out there is that the program must be useful to the operator. So apart from entering data, the operator must be able to extract useful information from it. What the back end does does not really matter.

    For example, a clerk who has to enter data on the screen for each sale, in addition to operating the till, would be reluctant to use it. On the other hand, if the program is part of the till operation, and it provides information on how much stock is left, the clerk is more accepting of the change.

    Implementing a system is not about plonking a pc with a program on a user's desk. It's about a user process. Users are looking for outcomes, not process. So if you want to go to a shop, you want to buy something, and the clerk wants to sell it to you. All the rest is administrivia.

    Software design is important. So is user training.

    --
    OS/2 - because choice is a terrible thing to waste.
  24. You just can't win, can you... by achurch · · Score: 4, Insightful

    If the Y2k bugs hadn't been fixed, things would have broken left and right, and we would have been blamed for not fixing them ahead of time.

    Since the Y2k bugs were fixed, very few things broke, and we got blamed for wasting tons of money to no effect.

    C'est la vie, I guess.