Slashdot Mirror


When Computers Go Wrong

Barence writes "PC Pro's Stewart Mitchell has charted the world's ten most calamitous computer cock-ups. They include the Russians' stealing software that resulted in their gas pipeline exploding, the Mars Orbiter that went missing because the programmers got their imperial and metric measurements mixed up, the Soviet early-warning system that confused the sun for a missile and almost triggered World War III, plus the Windows anti-piracy measure that resulted in millions of legitimate customers being branded software thieves."

16 of 250 comments

  1. Computers do what they are told to by adosch · · Score: 5, Insightful

    TFA should have been named the 'World's ten most calamitous logic cock-ups' instead, because in the end, malformed, ill-tested, or unforeseen logic caused those issues, not the computers themselves.

    1. Re:Computers do what they are told to by the_humeister · · Score: 5, Informative

      I'm surprised they didn't mention incidents where people actually died, such as the Therac-25 incident.

    2. Re:Computers do what they are told to by hairyfeet · · Score: 5, Informative

      Yeah there really wasn't much computer related there. If you wanted computer related I would have added WinME, aka "what idiot thought mixing WDM and VXD drivers was a good idea?" along with Vista Capable, aka "We've got to let the OEMs dump their crappers on Best Buy, so pretend it runs, okay?" and finally the early Athlon without thermal monitoring aka "Heat problem? What heat problem?".

      And of course if you wanted some real old time badness there was Bonzi Buddy, also known as "Kill that GODDAMNED MONKEY DEAD!!" and Geocities with the ever popular "WTF? Why is there a pocketwatch hanging off my mouse like a ball of snot and who thought pink OMG Ponies! text on a lime green background with sparkles and GIFs was tasteful?" and of course MSFT Bob, an OS made for the clueless that needed a fricking gamer rig just to run and spawned the electronic son of Satan known as Clippy.

      Finally, on the hardware side I'd add the Pentium 4, also known as "Mr Piggy Super Space Heater", and the GeForce 5xxx Hoover Edition, which was famous for not only filling your PC with the sounds of sucking but, thanks to cheating by Nvidia on rendering, actually gave you REAL sucking as well! Quite an accomplishment, that. Then there's the Seagate "I hope you didn't actually NEED your data for anything" bug in the early 1.5TB drives, the early Phenom "watch this patch suck away your performance" TLB bug, the iPhone 4, which gave us such lovely phrases as "WTF do you mean I'm holding it wrong?", and finally, to show they can still make incredible mistakes, the Nvidia bumpgate, also known as "We do NOT have a problem with our GPUs, it's a power saving feature! See, it makes your computer shut down and everything!!". These I think would have been a little more computer-centric than stolen code and a screwed battery on a Volvo.

      --
      ACs don't waste your time replying, your posts are never seen by me.
    3. Re:Computers do what they are told to by crunchygranola · · Score: 5, Informative

      And of course there is the Patriot missile software clock issue - that led to a failure to engage a SCUD on February 25, 1991 at Dhahran, Saudi Arabia, killing 28 soldiers.

      This failure is rather similar to the Soviet defense and NORAD errors mentioned in the article, in that it was a weakness designed into the system, which did not account for the full range of operational conditions and issues. In the Petrov Incident case it was a natural condition; in the NORAD case, an easy-to-make operator error; in the Dhahran barracks Patriot incident, a failure to consider that a unit might be operated for weeks without a restart.
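
      For reference, the arithmetic behind the Patriot clock issue can be sketched in a few lines of Python. The figures used here are not from the post above; they are the commonly reported ones (time counted in tenths of a second, with 0.1 chopped to a fixed-point value keeping 23 bits after the binary point, about 100 hours of uptime, and a target speed of roughly 1,676 m/s). The function names are made up for illustration, and this is a rough sketch, not the actual system code.

```python
from fractions import Fraction

# Assumption: 23 bits are kept after the binary point (commonly reported figure).
FRACTION_BITS = 23


def truncated_tenth(bits: int = FRACTION_BITS) -> Fraction:
    """Return 0.1 chopped (not rounded) to `bits` binary fraction digits."""
    scale = 2 ** bits
    return Fraction(scale // 10, scale)


def clock_drift_seconds(hours_up: float, bits: int = FRACTION_BITS) -> float:
    """Accumulated clock error after `hours_up` hours of counting 0.1 s ticks."""
    ticks = int(hours_up * 3600 * 10)  # ten ticks per second
    per_tick_error = Fraction(1, 10) - truncated_tenth(bits)
    return float(ticks * per_tick_error)


def range_error_m(hours_up: float, target_speed_m_s: float = 1676.0) -> float:
    """Distance a target covers during the accumulated clock error.

    The default speed is an assumed, commonly quoted figure.
    """
    return clock_drift_seconds(hours_up) * target_speed_m_s


if __name__ == "__main__":
    for hours in (8, 24, 100):
        print(hours, "h:", round(clock_drift_seconds(hours), 4), "s,",
              round(range_error_m(hours)), "m")
```

      Under those assumptions the drift after 100 hours is about a third of a second, which is more than half a kilometre at the assumed speed. The error per tick is tiny; it only matters because nobody restarted the unit.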

      --
      Second class citizen of the New Gilded Age
    4. Re:Computers do what they are told to by jc42 · · Score: 5, Insightful

      Another aspect to this is a common property of most "digital" computations. I've seen it expressed as "Digital errors have no order of magnitude". Another phrasing is "Getting one bit wrong is generally indistinguishable from randomizing all of memory". So when a digital calculation goes wrong, a tiny, inconsequential error is just about as likely as a total meltdown of the entire system.

      Programmers tend to get familiar with this phenomenon very early in their career. They write a small chunk of code that does a simple calculation, and the result is orders of magnitude wrong. When they investigate, they discover it was caused by a one-character typo, perhaps an "off by one" error such as using '<' instead of '<=', or vice-versa. This quickly leads to what many "normal" people consider the major character failure of software geeks, the insistence that everything be exactly right, no matter what, and the willingness to spend long hours discussing insignificant minutiae as if they mattered. In their work, it's usually such insignificant minutiae that brings the whole house of cards tumbling down.
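
      A toy Python illustration of the point, with made-up function names and data: the two functions differ by a single character in the loop condition, and the size of the resulting error depends entirely on what happens to sit in the next slot.

```python
def sum_first_n(values, n):
    """Sum values[0] .. values[n-1] (the intended behaviour)."""
    total = 0
    i = 0
    while i < n:
        total += values[i]
        i += 1
    return total


def sum_first_n_off_by_one(values, n):
    """Same loop with '<=' instead of '<': reads one element too many."""
    total = 0
    i = 0
    while i <= n:  # one-character typo
        total += values[i]
        i += 1
    return total


if __name__ == "__main__":
    # Hypothetical data: three small readings followed by an unrelated large value.
    readings = [1, 2, 3, 1_000_000]
    print(sum_first_n(readings, 3))             # intended sum of the first three
    print(sum_first_n_off_by_one(readings, 3))  # also swallows the fourth value
```

      Had the fourth element been 0, the bug would have been invisible; here it throws the answer off by five orders of magnitude. That is the "no order of magnitude" property in miniature.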

      If you're unwilling to take the difference between a comma and a semicolon seriously, you have no future as a software developer. This is often why something goes badly wrong and we have events like those described in this story.

      OTOH, it is interesting that, despite all the software disasters like the metric/imperial-units story, the software world has never insisted that programming languages include units as part of variables' values. It's not like this is anything difficult, and it has been done in a number of languages. But none of the common languages have such a feature. It is a bit bizarre that we can get into long discussions of complex, obscure concepts such as type checking or class inheritance, when our calculations are all susceptible to unchecked unit mismatches (without even a warning from the compiler or interpreter). There's a lot of poor logic when the topic is the relative importance of various sources of bogus calculations.
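
      As a rough idea of what "units as part of the value" could look like, here is a minimal Python sketch. The `Quantity` class and the `UNITS` table are hypothetical, written for this illustration only (they are not any particular library's API), and they handle just a couple of units. The pound-force to newton factor is the standard one.

```python
from dataclasses import dataclass

# Hypothetical unit table: unit name -> (dimension, factor to the SI unit).
UNITS = {
    "N*s": ("impulse", 1.0),
    "lbf*s": ("impulse", 4.4482216152605),  # 1 lbf = 4.4482216152605 N
    "m": ("length", 1.0),
    "ft": ("length", 0.3048),
}


@dataclass(frozen=True)
class Quantity:
    value: float
    unit: str

    def to(self, unit: str) -> "Quantity":
        """Convert to another unit of the same dimension, or refuse."""
        dim_from, factor_from = UNITS[self.unit]
        dim_to, factor_to = UNITS[unit]
        if dim_from != dim_to:
            raise TypeError(f"cannot convert {self.unit} to {unit}")
        return Quantity(self.value * factor_from / factor_to, unit)

    def __add__(self, other):
        # Bare numbers carry no unit, so adding one is rejected.
        if not isinstance(other, Quantity):
            return NotImplemented
        return Quantity(self.value + other.to(self.unit).value, self.unit)


if __name__ == "__main__":
    thruster = Quantity(10.0, "lbf*s")
    print(thruster.to("N*s"))                 # explicit, checked conversion
    print(Quantity(5.0, "N*s") + thruster)    # mixed units are reconciled
    try:
        Quantity(1.0, "m") + thruster         # length + impulse
    except TypeError as err:
        print("rejected:", err)
```

      The design choice here is to fail loudly on a dimension mismatch and convert silently only within one dimension. A real system would also need multiplication, division, and derived units, which is where the complexity starts.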

      --
      Those who do study history are doomed to stand helplessly by while everyone else repeats it.
    5. Re:Computers do what they are told to by kennykb · · Score: 5, Interesting

      "Units are parts of variables" usually comes along with systems in which there is no escape. Dimensional analysis is fine up to a point, but when you get into weird quantities like dBm/sqrt(Hz) (seriously: ten times the log-base-10 of a quantity measured in milliwatts, over the square root of another quantity measured in hertz), the systems that enforce units tend to fall apart, and often it turns out that they simply lack the notation you need. (By the way, "dBm per root hertz" was a unit that I used in daily work at an earlier time in my life. And I still use weirdness like neper-coulomb per square micron.)

  2. therac 25 by Anonymous Coward · · Score: 5, Informative

    List fails without the therac 25

  3. Imperial - Metric by Anonymous Coward · · Score: 5, Interesting

    Due to the imperial-metric mash-up, the sums were so far askew that when Ground Control initiated boosters to secure the pod in orbit, all they succeeded in doing was firing it closer to the planet, where it burnt up in the atmosphere.

    When I see the Imperial-Metric confusion shit, I just want to slap the shit out of someone. All that waste because some engineers are incapable of using Metric or some vendor just doesn't want to spend the money to modernize their machinery. I know of an aerospace contractor that is using machinery from the 50s - yep, the machines are constantly being recalibrated and sometimes they don't notice - ooopsie!

    And when I see that we, the US, are one of two countries still on Imperial - the other is some Third World non-industrial country - I want to barf.

    And then, when I have to buy two sets of tools to work on a car, I wish for the entire US auto industry to go bankrupt and be replaced with some modern companies.

    I love Metric. It makes measurements and calculations much easier - quick! What is the mass of 329 mL of water? You'd need a calculator to do something similar in Imperial.

  4. Ariane 5 missing on the list by Anonymous Coward · · Score: 5, Informative

    It isn't smart to assign a 64-bit floating-point value to a 16-bit integer - unless you want to crash your first flight of the heavy Ariane 5 rocket... (http://en.wikipedia.org/wiki/Ariane_5#Notable_launches)
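
    To make the narrowing problem concrete, here is a small Python sketch of three ways a float can be squeezed into a signed 16-bit integer: silently wrapping, raising an error, or saturating. The function names are invented for this example and it is not a reconstruction of the flight software; it only shows why the range has to be considered at the point of conversion.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def unchecked_to_int16(x: float) -> int:
    """Truncate, keep the low 16 bits, reinterpret as signed (wraps silently)."""
    raw = int(x) & 0xFFFF
    return raw - 0x10000 if raw >= 0x8000 else raw


def checked_to_int16(x: float) -> int:
    """Truncate and raise OverflowError if the result does not fit."""
    n = int(x)
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"{x!r} does not fit in a signed 16-bit integer")
    return n


def saturating_to_int16(x: float) -> int:
    """Truncate and clamp to the nearest representable 16-bit value."""
    return max(INT16_MIN, min(INT16_MAX, int(x)))


if __name__ == "__main__":
    value = 40000.0  # hypothetical out-of-range reading
    print(unchecked_to_int16(value))   # wrong sign, no warning
    print(saturating_to_int16(value))  # pinned at the maximum
    try:
        checked_to_int16(value)
    except OverflowError as err:
        print("rejected:", err)
```

    None of the three is automatically right. A raised error is only an improvement if something sensible handles it; an unhandled one just moves the crash.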

  5. It's a simple rule by Anonymous Coward · · Score: 5, Funny

    As a fellow programmer I worked with years ago was fond of saying, "Computers don't make mistakes. They do, however, execute yours VERY carefully."

    1. Re:It's a simple rule by jc42 · · Score: 5, Interesting

      "Computers don't make mistakes. They do, however, execute yours VERY carefully."

      That's a good way of phrasing it. But it does miss the fact that not all "computer errors" are due to software mistakes.

      One example, of course, is the Pentium FDIV failure. That was a hardware failure, "programmed" into the CPU by Intel's experts in solid-state hardware design. There wasn't a whole lot that any software developer could do to defend against that failure.

      Another, more subtle one, came up when I was a grad student back in the 1970s. At that time, most of the campus research computing was done on the big mainframe in the campus Computer Center. After discovering a number of (published ;-) results that turned out to be wrong, some researchers investigated, and found that they were due to undetected overflows in the calculations. Yes, the hardware could and did test for overflows, and set a status bit when they occurred. Almost all this calculating was done in Fortran, and the Fortran compiler had a run-time flag that could turn the status-bit checking on or off. It defaulted to OFF. They did a bit of analysis, and concluded that about half the runs of Fortran programs on that machine produced output that included numbers that were incorrect due to undetected overflow.

      So why didn't they make the overflow-detection flag default to ON? Well, they did a little survey of the users. They found that the overwhelming response was that, if enabling overflow checking made the program run slower, then overflow checking shouldn't be done. Somewhere around 90% of the people asked said this. They weren't mathematically ignorant people; they were the people using the Fortran compiler for the data in their professional publications.

      This told us a lot about the way such things are done. Since I left academia and worked in what passes for the Real World, I've found that this is a nearly universal attitude. Faster and cheaper is always preferable to correct. This is still true even when we have computers in commercial aircraft and hospital operating rooms. And you can't call this sort of thing a "human error". People don't decide to disable overflow checking by accident; they do it knowing full well what the effect will be. When the computer fails in such cases, it wasn't executing a human's mistake; it was doing what the human wanted it to do.
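
      The trade-off described above can be mimicked in Python with a flag that defaults to off. The post doesn't say which machine it was or whether the overflows were integer or floating point, so this sketch uses simulated 32-bit signed integers as a stand-in; `add_int32` and its `check_overflow` parameter are hypothetical names.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def wrap_int32(n: int) -> int:
    """Reduce an exact integer to what a 32-bit two's-complement register holds."""
    return (n + 2**31) % 2**32 - 2**31


def add_int32(a: int, b: int, check_overflow: bool = False) -> int:
    """Add two 32-bit values; the check is opt-in, mirroring the default-OFF flag."""
    exact = a + b
    if check_overflow and not INT32_MIN <= exact <= INT32_MAX:
        raise OverflowError(f"{a} + {b} overflows 32 bits")
    return wrap_int32(exact)


if __name__ == "__main__":
    a = b = 2_000_000_000
    print(add_int32(a, b))  # plausible-looking negative number, no complaint
    try:
        add_int32(a, b, check_overflow=True)
    except OverflowError as err:
        print("caught:", err)
```

      With the flag off, the wrong answer flows straight into the output without any complaint, which is the situation the researchers found themselves in.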

      --
      Those who do study history are doomed to stand helplessly by while everyone else repeats it.
  6. The creation of the EFF by Anonymous Coward · · Score: 5, Interesting

    The "Switchboard meltdown" problem sounds like the incident which led to the creation of the EFF.

    Basically, someone forgot to include a ";" in a C program, which led to the problems at ATT. Originally, they thought it was due to "hackers", and called in the Secret Service.

    The Secret Service in turn busted a gaming outfit called "Steve Jackson Games". Who was completely innocent, of course, but that has never mattered to the Secret Service when they need to look like they are actually useful. The SS confiscated the computers, all illegally.

    The ACLU refused to get involved, so John Gilmore (formerly of Sun, and who worked with Richard Stallman to get out an open Operating System around that time) created the EFF to fight the unconstitutional raid on Steve Jackson Games. The EFF trounced the Secret Service in Court, and was thus born. I believe if you google for "Steve Jackson Games", you can still find the original story around.

    So, in a way, you can say that the EFF was created due to the single misplacement of a semicolon in a C program. Would that all of our bugs have such results. :)

  7. And sometimes it isn't the computers... by BrokenHalo · · Score: 5, Interesting

    (See title.)

    Any of us who have been in a sysprog or sysadmin role for a significant amount of time (by which I mean double-digit years) will often have at least one anecdote of some monumental cockup we've perpetrated.

    My worst case in point is where I managed (IIRC after a long liquid lunch) to delete the :per directory (more or less equivalent to /dev on a *nix box) on a Data General mainframe machine running AOS/VS. While hundreds of users' processes disappeared off the system (which took about 90 minutes), I found it expedient to simply make my confession to the boss.

    Fortunately, in this case, the escapade was more or less written up as "Shit Happens", which I thought was generous...

  8. 1982 explosion did not happen by mike449 · · Score: 5, Informative

    The Soviet pipeline explosion seems to be an urban legend, traced to a single source: At the Abyss: An Insider's History of the Cold War, by Thomas C. Reed.
    There is no mention of this explosion anywhere else, either in Russian or Western sources. If you can read Russian, some debunking is here:

    link
    One of the facts mentioned there is that there was no SCADA on Soviet pipelines until the late 1980s. All control was still pneumatic in 1982, with no software involved.

  9. "Black day for power programmers" Windows virus by Locutus · · Score: 5, Insightful

    Two comments. First, I thought the deal with the big blackout was that the network (TCP/IP) was flooded by a Windows virus infection, and if you know TCP/IP, it's not very good with lots of traffic. There was so much traffic that the computer (a UNIX box) sending status messages to the control room display system could not get messages out of its buffers. TCP/IP does this thing where the message isn't put on the network if there's going to be a collision and it waits some before trying again. With the network flooded with Windows-based computers trying to infect each other, the warning messages were stuck in the UNIX box and eventually the buffers filled up as more and more warning messages queued up. They seem to be blaming the UNIX box software because the software ended up crashing because they didn't catch the situation where the buffers overflowed. IMO, that was caused by Windows and its ability to be a great petri dish for viruses, and by the idiots who keep putting Windows systems on critical networks.

    The second comment I have on this is about missing the LAX Communications system software crash, which caused multiple near misses on the tarmac and in the air when air traffic controllers could not communicate with pilots because of the crash. The cause of the software crash was that a UNIX system had been replaced with a Windows-based system which had a known flaw. The flaw was that the OS could not run for more than 39 days no matter what was running on it. The system and software were still approved and put in place with a maintenance instruction of rebooting the computer every 30 days. In comes a new employee who sees things are working fine, so he/she doesn't reboot the computer, and 9 days later the system crashes. The backup does the same, both are unable to recover, and it takes hours to get the system back running again. That should have been in the list IMO.

    There was also the CSX Railway situation, when lots of its signals went offline because they were run by Windows and the Windows computers got a virus.

    It would be nice to see a more complete and more accurate list of these kinds of computer software failures.

    LoB

    --
    "Anyone who stands out in the middle of a road looks like roadkill to me." --Linus