Debug your Code, or Else!

← Back to Stories (view on slashdot.org)

Posted by CmdrTaco on Thursday May 2, 2002 @05:14AM from the trip-down-ramification-lane dept.

Trevor Lovett writes "I ran across a collection of famous software bugs that have caused large scale disasters including the explosion of the Ariane 5 rocket due to integer overflow and the misfiring of a US Patriot missile that caused 28 deaths because of accumulated floating point error. "

7 of 485 comments (clear)

Min score:

Reason:

Sort:

Pentium bug in perspective by Alomex · 2002-05-02 05:32 · Score: 5, Informative

Just to be clear, all processors out there have bugs. The pentium bug is in no way exceptional. The only reason it deserves to be there is beacuse the list is called "a collection of famous software bugs that caused large scale disasters".

The pentium bug is certainly famous because every idiot and its brother think it is rare for a CPU to be buggy. The second condition in the list is "caused a large scale disaster". This condition is, sadly, also met. It caused a large scale public relations disaster for Intel because once again said idiots thought that a CPU bug is rare.
My prof at Georgia Tech stressed this a lot by delphin42 · 2002-05-02 05:39 · Score: 5, Informative

He was considering making Fatal Defect required reading for the C programming course I took. From Amazon.com:

In Fatal Defects: Chasing Killer Computer Bugs, Ivars Peterson describes dozens and dozens of hoary computer bugs and gives biographical sketches of the bug detectives who located and fixed them. This book, which reads like a novel, is both entertaining and informative. Many of the bugs that Peterson discusses are not in computer programs per se but in the human systems that run and operate the computers. Very often the operator fails to understand what the computer program requires as input and types in an incorrect command. The computer then executes the command, with potentially disastrous results. Fatal Defects has important lessons for both those who design computers and those who use them.

He also insisted that we not call them bugs. "They are ERRORS, calling them bugs makes it sound like they are cute little accidental things that pop up when actually they are programming mistakes."

--
-- Adam
Read comp.risks by kzinti · 2002-05-02 05:43 · Score: 5, Informative

Make reading the ACM's RISKS digest a part of your regular routine, and you'll hear about these kind of software-related problems and many others - usually shortly after they happen. The RISKS digest is available on Usenet as comp.risks, as a mailing list, and on the WWW at http://catless.ncl.ac.uk/Risks. A new issue is published on a semiregular basis, every one to two weeks. It's not only informative but interesting too.

--Jim
It's Worse: The Patriot Never Worked by GuyMannDude · 2002-05-02 05:51 · Score: 5, Informative

The Patriot missle defense system never worked -- the bug mentioned in the article is a red herring. The main problem was that the Iraqis had modified the scud with additional fuel tanks. The resulting missle was unstable and would start to break apart in flight. The Patriot couldn't lock on to the missle because it of all the schrapnel. In addition, the scuds are poor missles to begin with. When they fly, they do so with a wobble -- like a poorly thrown football. The Patriots had been tested prior to the war on good-quality American missles which flew in a smooth trajectory. The Patriots simply couldn't deal with a missle that "danced around" in midflight. Bottom line: the Patriots simply do not protect against scuds because of poor design -- not some floating point error. The floating point explanation is analogus to that Coriolis-effect-causes-water-to-swirl-in-the-toile t myth that you find in so many physics textbooks (the Coriolis effect only works on planetary scales). It looks good on paper but if the "experts" had bothered to perform a test they would see that the explanation is dead wrong. The failure of the Patriots to intercept scuds (and the fact that the media never mentions this) has grave implications for our anti ballistic missle shield.
Don't take my word for it. Do a web search and see for yourself. Here are some references to get you started:

http://www.fas.org/spp/starwars/docops/rp911024.ht m
http://www.csmonitor.com/durable/1997/09/08/opin/l etters.1.html

GMD

--
watch this
Re:Patriot Scud Time Error by Kintanon · 2002-05-02 06:01 · Score: 5, Informative

That system just wasn't designed for that purpose. It was VERY well designed for its actual purpose, which was tracking AIRCRAFT going WAY slower than that missile. And it was only rated for 14 hours of continuous usage, not 100. So it wasn't a fault in the program per se, but a misapplication of a system designed for a different use.

Kintanon

--
Check out JoshJitsu.info for Brazilian Ji
We have a difficult battle ahead ... by jc42 · 2002-05-02 06:09 · Score: 5, Informative

Some years back, as a grad student, I saw a bunch of colleagues do a rather unnerving experiment. Much of the number crunching was, as usual, done in Fortran. So they instrumented the compiler to silently test for integer overflow, report when it happened, and also report whether the program tested for it.

Their result was that roughly 50% of the Fortran programs on the mainframe computer produced at least one number in the output that was wrong due to undetected integer overflow.

This itself would be bad enough. But a bunch of us followed this up by asking Fortran programmers about it. What we did specifically was to point out that, unlike floating point, where there's an interrupt, integer arithmetic required a separate instruction to test the overflow flag. So testing for integer overflow took extra cpu cycles. Then we asked them whether they thought that software should be modified to always test for integer overflow, as is done with floating point.

The answer was overwhelmingly that if it took extra cpu cycles, the software should not check for overflow.

When we pointed out that this introduced the risk of programs producing incorrect results, the Fortran programmers invariably said that didn't matter. Faster is better, even if some of the results are wrong.

I think of this whenever I read about computers used in medical, transportation, or other areas where malfunctioning software could put lives at risk.I don't believe that the "software culture" has changed significantly in this respect since then.

--
Those who do study history are doomed to stand helplessly by while everyone else repeats it.
Re:speaks more to TESTING by slamb · 2002-05-02 09:16 · Score: 5, Informative

What's shocking to me is that almost no open source authors or advocates give a hoot about automated testing of any kind. The only free software I've found with a test suite is gcc. As much as I hate to say it, there's a good chance that the relative inexperience of most open source authors is a factor here.
Perl is really good about this. The Test::Harness and Test::More modules make it very easy to write test suites, so CPAN modules have lots of automated tests. It might even be a requirement to get a module into CPAN; I'm not sure.
PostgreSQL has regression tests.
There's a really nice test environment for Java code called JUnit. Lots of stuff is using it. Lots of articles about how to write effective tests. There's a project to develop mock versions of common objects (servlet requests, SQL queries) that fail in interesting, predefined ways. I'm using a C++ workalike called CppUnit in one of my projects.
The Boost code has automated testing.
There's a project called qmtest.
The Wine people have recently started using regression tests.