Debug your Code, or Else!
Trevor Lovett writes "I ran across a collection of famous software bugs that have caused large-scale disasters, including the explosion of the Ariane 5 rocket due to integer overflow and the misfiring of a US Patriot missile that caused 28 deaths because of accumulated floating point error."
The Pentium bug is certainly famous, because every idiot and his brother thinks it is rare for a CPU to be buggy. The second condition in the list is "caused a large scale disaster". This condition is, sadly, also met. It caused a large-scale public relations disaster for Intel because, once again, said idiots thought that a CPU bug is rare.
He was considering making Fatal Defects required reading for the C programming course I took. From Amazon.com:
In Fatal Defects: Chasing Killer Computer Bugs, Ivars Peterson describes dozens and dozens of hoary computer bugs and gives biographical sketches of the bug detectives who located and fixed them. This book, which reads like a novel, is both entertaining and informative. Many of the bugs that Peterson discusses are not in computer programs per se but in the human systems that run and operate the computers. Very often the operator fails to understand what the computer program requires as input and types in an incorrect command. The computer then executes the command, with potentially disastrous results. Fatal Defects has important lessons for both those who design computers and those who use them.
He also insisted that we not call them bugs. "They are ERRORS, calling them bugs makes it sound like they are cute little accidental things that pop up when actually they are programming mistakes."
-- Adam
Make reading the ACM's RISKS digest a part of your regular routine, and you'll hear about these kinds of software-related problems and many others, usually shortly after they happen. The RISKS digest is available on Usenet as comp.risks, as a mailing list, and on the WWW at http://catless.ncl.ac.uk/Risks. A new issue is published on a semiregular basis, every one to two weeks. It's not only informative but interesting too.
--Jim
Don't take my word for it. Do a web search and see for yourself. Here are some references to get you started:
http://www.fas.org/spp/starwars/docops/rp911024.htm
http://www.csmonitor.com/durable/1997/09/08/opin/letters.1.html
GMD
watch this
That system just wasn't designed for that purpose. It was VERY well designed for its actual purpose, which was tracking AIRCRAFT going WAY slower than that missile. And it was only rated for 14 hours of continuous usage, not 100. So it wasn't a fault in the program per se, but a misapplication of a system designed for a different use.
Kintanon
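For a sense of why uptime matters so much here, this is a rough sketch of how the clock error grows with hours of continuous operation. The numbers are assumptions taken from widely cited accounts of the failure, not from this thread: the clock counts tenths of a second, and each tick is multiplied by 1/10 stored in a fixed-point register with 23 fractional bits, with the remaining bits chopped off. The function names are made up for illustration.

```python
from fractions import Fraction

# Assumption: 1/10 is stored with 23 bits after the binary point, truncated.
FRACTION_BITS = 23


def truncated_tenth(bits=FRACTION_BITS):
    """Return 1/10 chopped to the given number of binary fraction bits."""
    scale = 2 ** bits
    return Fraction(scale // 10, scale)


def clock_drift_seconds(hours, bits=FRACTION_BITS):
    """Accumulated timing error after `hours` of uptime, in seconds."""
    ticks = hours * 3600 * 10  # assumption: one tick per tenth of a second
    per_tick_error = Fraction(1, 10) - truncated_tenth(bits)
    return float(ticks * per_tick_error)


if __name__ == "__main__":
    for hours in (8, 14, 100):
        print(f"{hours:>3} h uptime -> {clock_drift_seconds(hours):.4f} s drift")
```

Under these assumptions the drift at 14 hours is a few hundredths of a second, while at 100 hours it is about a third of a second. Each tick's error is tiny and constant; it is the length of the run that turns it into a problem.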
Some years back, as a grad student, I saw a bunch of colleagues do a rather unnerving experiment. Much of the number crunching was, as usual, done in Fortran. So they instrumented the compiler to silently test for integer overflow, report when it happened, and also report whether the program tested for it.
Their result was that roughly 50% of the Fortran programs on the mainframe computer produced at least one number in the output that was wrong due to undetected integer overflow.
This itself would be bad enough. But a bunch of us followed this up by asking Fortran programmers about it. What we did specifically was to point out that, unlike floating point, where there's an interrupt, integer arithmetic required a separate instruction to test the overflow flag. So testing for integer overflow took extra cpu cycles. Then we asked them whether they thought that software should be modified to always test for integer overflow, as is done with floating point.
The answer was overwhelmingly that if it took extra cpu cycles, the software should not check for overflow.
When we pointed out that this introduced the risk of programs producing incorrect results, the Fortran programmers invariably said that didn't matter. Faster is better, even if some of the results are wrong.
I think of this whenever I read about computers used in medical, transportation, or other areas where malfunctioning software could put lives at risk. I don't believe that the "software culture" has changed significantly in this respect since then.
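To make the tradeoff concrete, here is a small sketch of the two behaviours being argued about: an add that silently wraps, and an add that spends one extra comparison to catch the overflow. Python integers don't overflow on their own, so the 32-bit signed range is simulated by hand. The 32-bit width and the function names are assumptions for illustration, not details of the mainframe or compiler in the story.

```python
# Assumption: a 32-bit two's-complement integer, simulated by hand.
INT32_MIN = -2 ** 31
INT32_MAX = 2 ** 31 - 1


def wrapping_add(a, b):
    """Add the way unchecked hardware does: wrap around silently."""
    return (a + b - INT32_MIN) % 2 ** 32 + INT32_MIN


def checked_add(a, b):
    """Add with the extra range test the programmers did not want to pay for."""
    result = a + b
    if not INT32_MIN <= result <= INT32_MAX:
        raise OverflowError(f"{a} + {b} does not fit in 32 bits")
    return result


if __name__ == "__main__":
    print(wrapping_add(INT32_MAX, 1))  # wrong answer, no warning
    try:
        checked_add(INT32_MAX, 1)
    except OverflowError as err:
        print("caught:", err)
```

The unchecked version hands back a large negative number that looks like any other result in the output, which is exactly how a wrong figure ends up in a report unnoticed.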
Those who do study history are doomed to stand helplessly by while everyone else repeats it.
What's shocking to me is that almost no open source authors or advocates give a hoot about automated testing of any kind. The only free software I've found with a test suite is gcc. As much as I hate to say it, there's a good chance that the relative inexperience of most open source authors is a factor here.
Perl is really good about this. The Test::Harness and Test::More modules make it very easy to write test suites, so CPAN modules have lots of automated tests. It might even be a requirement to get a module into CPAN; I'm not sure.
PostgreSQL has regression tests.
There's a really nice test environment for Java code called JUnit. Lots of stuff is using it. Lots of articles about how to write effective tests. There's a project to develop mock versions of common objects (servlet requests, SQL queries) that fail in interesting, predefined ways. I'm using a C++ workalike called CppUnit in one of my projects.
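For anyone who hasn't seen the style, here is a minimal sketch of an xUnit-type test with a mock object that fails in a predefined way. It uses Python's standard unittest and unittest.mock modules rather than JUnit or CppUnit. The function under test, fetch_user_count, and the connection's execute method are hypothetical and exist only for this example.

```python
import unittest
from unittest import mock


def fetch_user_count(connection):
    """Hypothetical code under test: return -1 if the query fails."""
    try:
        rows = connection.execute("SELECT COUNT(*) FROM users")
    except ConnectionError:
        return -1
    return rows[0][0]


class FetchUserCountTest(unittest.TestCase):
    def test_returns_count(self):
        # The mock stands in for a real database connection.
        conn = mock.Mock()
        conn.execute.return_value = [(42,)]
        self.assertEqual(fetch_user_count(conn), 42)

    def test_survives_database_failure(self):
        # side_effect makes the mock raise when called: a scripted failure.
        conn = mock.Mock()
        conn.execute.side_effect = ConnectionError("database went away")
        self.assertEqual(fetch_user_count(conn), -1)
```

Saved as a file, this can be run with `python -m unittest`. The second test is the interesting one: the failure path gets exercised on every run without anyone having to unplug a real database.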
The Boost code has automated testing.
There's a project called qmtest.
The Wine people have recently started using regression tests.