Debug your Code, or Else!
Trevor Lovett writes "I ran across a collection of famous software bugs that have caused large-scale disasters, including the explosion of the Ariane 5 rocket due to integer overflow and the misfiring of a US Patriot missile that caused 28 deaths because of accumulated floating-point error."
The Pentium bug is certainly famous because every idiot and his brother thinks it is rare for a CPU to be buggy. The second condition in the list is "caused a large-scale disaster". This condition is, sadly, also met. It caused a large-scale public relations disaster for Intel because, once again, said idiots thought that a CPU bug is rare.
He was considering making Fatal Defect required reading for the C programming course I took. From Amazon.com:
In Fatal Defect: Chasing Killer Computer Bugs, Ivars Peterson describes dozens and dozens of hoary computer bugs and gives biographical sketches of the bug detectives who located and fixed them. This book, which reads like a novel, is both entertaining and informative. Many of the bugs that Peterson discusses are not in computer programs per se but in the human systems that run and operate the computers. Very often the operator fails to understand what the computer program requires as input and types in an incorrect command. The computer then executes the command, with potentially disastrous results. Fatal Defect has important lessons for both those who design computers and those who use them.
He also insisted that we not call them bugs. "They are ERRORS; calling them bugs makes it sound like they are cute little accidental things that pop up, when actually they are programming mistakes."
-- Adam
Make reading the ACM's RISKS digest a part of your regular routine, and you'll hear about these kinds of software-related problems and many others, usually shortly after they happen. The RISKS digest is available on Usenet as comp.risks, as a mailing list, and on the WWW at http://catless.ncl.ac.uk/Risks. A new issue is published on a semiregular basis, every one to two weeks. It's not only informative but interesting too.
--Jim
Don't take my word for it. Do a web search and see for yourself. Here are some references to get you started:
http://www.fas.org/spp/starwars/docops/rp911024.htm
http://www.csmonitor.com/durable/1997/09/08/opin/letters.1.html
GMD
watch this
That system just wasn't designed for that purpose. It was VERY well designed for its actual purpose, which was tracking AIRCRAFT going WAY slower than that missile. And it was only rated for 14 hours of continuous usage, not 100. So it wasn't a fault in the program per se, but a misapplication of a system designed for a different use.
Kintanon
Check out JoshJitsu.info for Brazilian Ji
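To put rough numbers on the uptime point above: the commonly cited account is that the Patriot counted time in tenths of a second and scaled that count by a fixed-point approximation of 1/10 chopped to 23 fractional bits. The sketch below assumes exactly that representation. The bit width and the function names are illustrative assumptions, not taken from the actual system.

```python
from fractions import Fraction

# Assumption: 1/10 is stored with 23 bits after the binary point and the
# remaining bits are chopped (truncated), as in the commonly cited analysis.
FRACTION_BITS = 23


def chopped_tenth(bits=FRACTION_BITS):
    """Return 1/10 truncated to `bits` fractional binary digits."""
    scale = 2 ** bits
    return Fraction(int(Fraction(1, 10) * scale), scale)


def clock_drift_seconds(hours, bits=FRACTION_BITS):
    """Accumulated timing error after `hours` of continuous operation."""
    ticks = hours * 3600 * 10  # the clock ticks once per tenth of a second
    per_tick_error = Fraction(1, 10) - chopped_tenth(bits)
    return float(ticks * per_tick_error)


if __name__ == "__main__":
    for hours in (14, 100):
        print(hours, "hours:", round(clock_drift_seconds(hours), 4), "seconds")
```

Under these assumptions the drift stays below 0.05 seconds after 14 hours but reaches roughly 0.34 seconds after 100 hours, which is why running far beyond the rated uptime mattered so much.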
Some years back, as a grad student, I saw a bunch of colleagues do a rather unnerving experiment. Much of the number crunching was, as usual, done in Fortran. So they instrumented the compiler to silently test for integer overflow, report when it happened, and also report whether the program tested for it.
Their result was that roughly 50% of the Fortran programs on the mainframe computer produced at least one number in the output that was wrong due to undetected integer overflow.
This itself would be bad enough. But a bunch of us followed this up by asking Fortran programmers about it. What we did specifically was to point out that, unlike floating point, where there's an interrupt, integer arithmetic required a separate instruction to test the overflow flag. So testing for integer overflow took extra cpu cycles. Then we asked them whether they thought that software should be modified to always test for integer overflow, as is done with floating point.
The answer was overwhelmingly that if it took extra cpu cycles, the software should not check for overflow.
When we pointed out that this introduced the risk of programs producing incorrect results, the Fortran programmers invariably said that didn't matter. Faster is better, even if some of the results are wrong.
I think of this whenever I read about computers used in medical, transportation, or other areas where malfunctioning software could put lives at risk. I don't believe that the "software culture" has changed significantly in this respect since then.
Those who do study history are doomed to stand helplessly by while everyone else repeats it.
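For anyone who has not seen the trade-off described above, here is a minimal sketch in Python. Python integers never overflow, so 32-bit two's-complement behaviour is simulated by hand. The names `wrapping_add` and `checked_add` are hypothetical and have nothing to do with the instrumented Fortran compiler in the story.

```python
INT32_MIN = -2 ** 31
INT32_MAX = 2 ** 31 - 1


def wrapping_add(a, b):
    """Add the way unchecked 32-bit hardware does: wrap silently."""
    total = (a + b) & 0xFFFFFFFF  # keep only the low 32 bits
    return total - 2 ** 32 if total > INT32_MAX else total


def checked_add(a, b):
    """Add with the extra range test; costs a comparison, never lies."""
    total = a + b
    if not INT32_MIN <= total <= INT32_MAX:
        raise OverflowError(f"{a} + {b} does not fit in 32 bits")
    return total


if __name__ == "__main__":
    print(wrapping_add(INT32_MAX, 1))  # wrong answer, no warning
    try:
        checked_add(INT32_MAX, 1)
    except OverflowError as err:
        print("caught:", err)
```

The unchecked version returns a large negative number without complaint, which is exactly the kind of wrong output that goes unnoticed in a page of results.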
Not quite. The software was built for the Ariane 4. On the Ariane 4, it was physically impossible for that value to ever get high enough to overflow. So on the Ariane 4 the assumption that an overflow could only be due to a hardware failure was entirely correct.
If they had known that years later an Ariane 5 would come along, and those engineers would stupidly reuse the Ariane 4 code without testing it once, then perhaps they would have made a different decision. But I think the blame falls entirely on the Ariane 5 guys, who were *not* the ones who wrote that code.
So we have a specification problem and a system design problem. Neither is a pure "programming problem".
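The design decision being argued about can be sketched roughly as follows. The inquiry report describes a 64-bit floating-point value being converted to a 16-bit signed integer. Everything else here is an assumption for illustration: the function names are hypothetical and the sample values are made up, not flight data.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_unprotected(value):
    """Convert assuming the value always fits; fail hard if it does not."""
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a 16-bit signed integer")
    return result


def to_int16_saturating(value):
    """Alternative policy: clamp to the representable range instead of failing."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


def first_out_of_range(samples):
    """Replay a trajectory and return the index of the first bad sample, or None."""
    for index, value in enumerate(samples):
        if not INT16_MIN <= int(value) <= INT16_MAX:
            return index
    return None


# Made-up envelopes: the old vehicle never leaves the safe range, the new one does.
old_envelope = [1200.0, 8000.5, 30000.0]
new_envelope = [1200.0, 20000.0, 64000.0]

if __name__ == "__main__":
    print(first_out_of_range(old_envelope))
    print(first_out_of_range(new_envelope))
```

The point of `first_out_of_range` is that the "can never happen" assumption is a property of the input envelope, not of the code, so replaying the new vehicle's expected values through the old conversion is the kind of check that exposes the mismatch on the ground.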
Software crashes are like airplane crashes -- blame the lowest guy on the totem pole. In air crashes, it's the pilot. In software, it's a coder.
Welcome to the Turing Tarpit, where everything is possible but nothing interesting is easy.
Well, not exactly. It was used for cancer treatments, not x-ray imaging. And not all of the radiation overdoses were fatal.
It was a UI bug rather than a software bug.
Again, not exactly. The problems with the Therac-25 included hardware issues and some UI problems that led operators to do some interesting things. They also included some race conditions that were definitely software bugs.
You can check out a reprint of an IEEE article discussing it in depth here.
Just for some history: AECL, the Canadian government crown corporation that made the Therac-25, spun off its medical operations into private companies in the 1980s. The first was Nordion, where I worked for a summer as a co-op student, which produces radioisotopes for medical use. Nordion was bought by MDS. The other company was Theratronics, which was responsible for devices like the Therac-25. It went without a purchaser for many years because of the stigma of Therac-25, but it was eventually (IIRC) bought by MDS as well.
Both companies are in my hometown, and the fallout from the Therac-25 (like the IEEE article) was front-page news when I worked at Nordion in the early 1990s. I just worked on software to measure how much of a given isotope to dispense to fill an order, but the whole Therac-25 incident was definitely on everyone's mind.
Do you even know anything about perl? -- AC Replying to Tom Christiansen post.
What's shocking to me is that almost no open source authors or advocates give a hoot about automated testing of any kind. The only free software I've found with a test suite is gcc. As much as I hate to say it, there's a good chance that the relative inexperience of most open source authors is a factor here.
Perl is really good about this. The Test::Harness and Test::More modules make it very easy to write test suites, so CPAN modules have lots of automated tests. It might even be a requirement to get a module into CPAN; I'm not sure.
PostgreSQL has regression tests.
There's a really nice test environment for Java code called JUnit. Lots of stuff is using it. Lots of articles about how to write effective tests. There's a project to develop mock versions of common objects (servlet requests, SQL queries) that fail in interesting, predefined ways. I'm using a C++ workalike called CppUnit in one of my projects.
The Boost code has automated testing.
There's a project called qmtest.
The Wine people have recently started using regression tests.
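For anyone wondering what the automated tests mentioned above look like in practice, here is a minimal sketch using Python's standard `unittest` module as a stand-in for Test::More, JUnit, or CppUnit. The function under test, `safe_percentage`, is a hypothetical example and not part of any of those projects.

```python
import unittest


def safe_percentage(part, whole):
    """Hypothetical function under test: a percentage that rejects a zero denominator."""
    if whole == 0:
        raise ValueError("whole must be non-zero")
    return 100.0 * part / whole


class SafePercentageTest(unittest.TestCase):
    def test_typical_value(self):
        self.assertAlmostEqual(safe_percentage(1, 4), 25.0)

    def test_zero_denominator_is_rejected(self):
        # Regression test: pins down the error path so it cannot silently change.
        with self.assertRaises(ValueError):
            safe_percentage(1, 0)


def run_suite():
    """Load and run the test case, returning the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SafePercentageTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_suite()
```

Once tests like these exist, they can be rerun on every change, which is the whole appeal of a regression suite.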