Debug your Code, or Else!

Pentium bug in perspective by Alomex · 2002-05-02 05:32 · Score: 5, Informative

Just to be clear, all processors out there have bugs. The Pentium bug is in no way exceptional. The only reason it deserves to be there is because the list is called "a collection of famous software bugs that caused large scale disasters".

The Pentium bug is certainly famous because every idiot and his brother thinks it is rare for a CPU to be buggy. The second condition in the list is "caused a large scale disaster". This condition is, sadly, also met. It caused a large scale public relations disaster for Intel because once again said idiots thought that a CPU bug is rare.

My prof at Georgia Tech stressed this a lot by delphin42 · 2002-05-02 05:39 · Score: 5, Informative

He was considering making Fatal Defect required reading for the C programming course I took. From Amazon.com:

In Fatal Defects: Chasing Killer Computer Bugs, Ivars Peterson describes dozens and dozens of hoary computer bugs and gives biographical sketches of the bug detectives who located and fixed them. This book, which reads like a novel, is both entertaining and informative. Many of the bugs that Peterson discusses are not in computer programs per se but in the human systems that run and operate the computers. Very often the operator fails to understand what the computer program requires as input and types in an incorrect command. The computer then executes the command, with potentially disastrous results. Fatal Defects has important lessons for both those who design computers and those who use them.

He also insisted that we not call them bugs. "They are ERRORS, calling them bugs makes it sound like they are cute little accidental things that pop up when actually they are programming mistakes."

--
-- Adam

Read comp.risks by kzinti · 2002-05-02 05:43 · Score: 5, Informative

Make reading the ACM's RISKS digest a part of your regular routine, and you'll hear about these kinds of software-related problems and many others - usually shortly after they happen. The RISKS digest is available on Usenet as comp.risks, as a mailing list, and on the WWW at http://catless.ncl.ac.uk/Risks. A new issue is published on a semiregular basis, every one to two weeks. It's not only informative but interesting too.

--Jim

It's Worse: The Patriot Never Worked by GuyMannDude · 2002-05-02 05:51 · Score: 5, Informative

The Patriot missile defense system never worked -- the bug mentioned in the article is a red herring. The main problem was that the Iraqis had modified the Scud with additional fuel tanks. The resulting missile was unstable and would start to break apart in flight. The Patriot couldn't lock on to the missile because of all the shrapnel. In addition, Scuds are poor missiles to begin with. When they fly, they do so with a wobble -- like a poorly thrown football. The Patriots had been tested prior to the war on good-quality American missiles which flew in a smooth trajectory. The Patriots simply couldn't deal with a missile that "danced around" in midflight. Bottom line: the Patriots simply do not protect against Scuds because of poor design -- not some floating point error. The floating point explanation is analogous to that Coriolis-effect-causes-water-to-swirl-in-the-toilet myth that you find in so many physics textbooks (the Coriolis effect only works on planetary scales). It looks good on paper but if the "experts" had bothered to perform a test they would see that the explanation is dead wrong. The failure of the Patriots to intercept Scuds (and the fact that the media never mentions this) has grave implications for our anti-ballistic missile shield.

Don't take my word for it. Do a web search and see for yourself. Here are some references to get you started:

http://www.fas.org/spp/starwars/docops/rp911024.htm

http://www.csmonitor.com/durable/1997/09/08/opin/letters.1.html

GMD

--
watch this

Re:Patriot Scud Time Error by Kintanon · 2002-05-02 06:01 · Score: 5, Informative

That system just wasn't designed for that purpose. It was VERY well designed for its actual purpose, which was tracking AIRCRAFT going WAY slower than that missile. And it was only rated for 14 hours of continuous usage, not 100. So it wasn't a fault in the program per se, but a misapplication of a system designed for a different use.
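For a rough sense of why 14 hours versus 100 matters, here is a minimal sketch of the drift arithmetic in Python. It assumes the commonly cited account of the bug: time counted in tenths of a second, with 1/10 stored truncated to 23 fractional bits. Those details and all the names below are assumptions, not something stated in this thread, so treat the figures as illustrative.

```python
from fractions import Fraction

# Assumption: 1/10 s is stored with 23 bits after the binary point,
# truncated (not rounded), per the commonly cited account.
FRACTION_BITS = 23

TICK = Fraction(1, 10)
# int() on a Fraction truncates toward zero, which models the chopping.
STORED_TICK = Fraction(int(TICK * 2**FRACTION_BITS), 2**FRACTION_BITS)
ERROR_PER_TICK = TICK - STORED_TICK  # time lost on every tenth-of-a-second tick


def clock_drift_seconds(hours_up):
    """Accumulated clock error after running continuously for hours_up hours."""
    ticks = hours_up * 3600 * 10
    return float(ticks * ERROR_PER_TICK)


if __name__ == "__main__":
    for hours in (8, 14, 100):
        print(f"{hours:3d} h uptime: {clock_drift_seconds(hours):.4f} s drift")
```

Under these assumptions the drift is roughly 0.05 s after 14 hours and roughly 0.34 s after 100 hours. The error grows linearly with uptime, which is why a periodic reboot would have hidden it.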

Kintanon

--
Check out JoshJitsu.info for Brazilian Ji

We have a difficult battle ahead ... by jc42 · 2002-05-02 06:09 · Score: 5, Informative

Some years back, as a grad student, I saw a bunch of colleagues do a rather unnerving experiment. Much of the number crunching was, as usual, done in Fortran. So they instrumented the compiler to silently test for integer overflow, report when it happened, and also report whether the program tested for it.

Their result was that roughly 50% of the Fortran programs on the mainframe computer produced at least one number in the output that was wrong due to undetected integer overflow.

This itself would be bad enough. But a bunch of us followed this up by asking Fortran programmers about it. What we did specifically was to point out that, unlike floating point, where there's an interrupt, integer arithmetic required a separate instruction to test the overflow flag. So testing for integer overflow took extra cpu cycles. Then we asked them whether they thought that software should be modified to always test for integer overflow, as is done with floating point.

The answer was overwhelmingly that if it took extra cpu cycles, the software should not check for overflow.

When we pointed out that this introduced the risk of programs producing incorrect results, the Fortran programmers invariably said that didn't matter. Faster is better, even if some of the results are wrong.

I think of this whenever I read about computers used in medical, transportation, or other areas where malfunctioning software could put lives at risk. I don't believe that the "software culture" has changed significantly in this respect since then.
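To make the trade-off concrete, here is a minimal sketch in Python of the two behaviors being argued about. Python integers never overflow, so the 32-bit two's-complement wraparound is simulated; the word size and both function names are assumptions for illustration, not the instrumentation described above.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def wrapping_add_i32(a, b):
    """Add the way unchecked 32-bit hardware does: wrap silently on overflow."""
    return (a + b + 2**31) % 2**32 - 2**31


def checked_add_i32(a, b):
    """Add, spending one extra comparison to report overflow instead of wrapping."""
    result = a + b
    if not INT32_MIN <= result <= INT32_MAX:
        raise OverflowError(f"{a} + {b} does not fit in 32 bits")
    return result


if __name__ == "__main__":
    # The unchecked version returns a wrong answer with no warning.
    print(wrapping_add_i32(INT32_MAX, 1))
    # The checked version costs a comparison and refuses to return garbage.
    try:
        checked_add_i32(INT32_MAX, 1)
    except OverflowError as err:
        print("overflow detected:", err)
```

The unchecked sum of the largest 32-bit value and 1 comes back as the most negative value, which is exactly the kind of wrong number that ends up in the output unnoticed.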

--
Those who do study history are doomed to stand helplessly by while everyone else repeats it.

Re:The Ariane blowup was especially amusing by T.E.D. · 2002-05-02 06:16 · Score: 4, Informative

Then to make it funnier, turns out the system engineers had decided that since software is infallible, any exception condition would indicate a hardware failure(!), so instead of a reset they shut the affected computer down altogether.

Not quite. The software was built for the Ariane 4. On the Ariane 4, it was physically impossible for that value to ever get high enough to overflow. So on the Ariane 4 the assumption that an overflow could only be due to a hardware failure was entirely correct.
If they had known that years later an Ariane 5 would come along, and those engineers would stupidly reuse the Ariane 4 code without testing it once, then perhaps they would have made a different decision. But I think the blame goes entirely on the Ariane 5 guys, who were *not* the ones who wrote that code.
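A minimal sketch, in Python, of the kind of conversion at issue. The inquiry report describes a 64-bit floating-point value being converted to a 16-bit signed integer; the function names here are hypothetical and this is not the flight code's logic. It only shows that "what happens when the value is out of range" is a policy decision baked in at the conversion, and one that depends on the vehicle's envelope.

```python
INT16_MIN = -32768
INT16_MAX = 32767


def to_int16_checked(value):
    """Convert to a 16-bit signed integer, raising if the value cannot fit."""
    truncated = int(value)  # truncates toward zero
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value!r} is outside the 16-bit signed range")
    return truncated


def to_int16_saturating(value):
    """Alternative policy: clamp to the nearest representable value."""
    if value != value:  # NaN never equals itself
        raise ValueError("cannot convert NaN")
    clamped = max(float(INT16_MIN), min(float(INT16_MAX), value))
    return int(clamped)


if __name__ == "__main__":
    print(to_int16_checked(1234.9))
    print(to_int16_saturating(40000.0))
    try:
        to_int16_checked(40000.0)
    except OverflowError as err:
        print("conversion refused:", err)
```

Whether the caller should treat that exception as a hardware fault, clamp, or carry on is exactly the assumption that held for one rocket and not for the next.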

Coupla Notes by StormyMonday · 2002-05-02 06:20 · Score: 4, Informative

The Patriot time-drift was caused by the system being operated outside of its design parameters. It was designed to operate during a Soviet invasion of Western Europe, and expected to have to relocate every 8 hours or so. The spec, therefore, assumed that the software would reboot every 8-12 hours. From my experience with the military, if a programmer had put in a clock algorithm that would track indefinitely, he or she would have been ordered to take it out. (Been there. Done that. Broke the coffee mug.)
The Yorktown crash was the result of mixing mission-critical and non-mission-critical programs on the same box. Big no-no.

So we have a specification problem and a system design problem. Neither is a pure "programming problem".

Software crashes are like airplane crashes -- blame the lowest guy on the totem pole. In air crashes, it's the pilot. In software, it's a coder.

--
Welcome to the Turing Tarpit, where everything is possible but nothing interesting is easy.

Re:32. Therac-25, X-ray by irix · 2002-05-02 08:05 · Score: 4, Informative

The Therac-25 was an automated x-ray machine that overdosed patients. Fatally.

Well, not exactly. It was used for cancer treatments, not x-ray imaging. And not all of the radiation overdoses were fatal.

It was a UI bug rather than a software bug.

Again, not exactly. The problems with the Therac-25 included hardware issues and some UI problems that led operators to do some interesting things. They also included some race conditions that were definitely software bugs.

You can check out a reprint of an IEEE article discussing it in depth here.

Just for some history: AECL, the Canadian government crown corporation that made the Therac-25, spun off its medical operations into private companies in the 1980s. The first was Nordion, where I worked for a summer as a co-op student; it produces radioisotopes for medical use. Nordion was bought by MDS. The other company was Theratronics, which was responsible for devices like the Therac-25. It went without a purchaser for many years because of the stigma of the Therac-25, but it was eventually (IIRC) bought by MDS as well.

Both companies are in my hometown, and the fallout from the Therac-25 (like the IEEE article) was front-page news when I worked at Nordion in the early 1990s. I just worked on software to measure how much of a given isotope to dispense to fill an order, but the whole Therac-25 incident was definitely on everyone's mind.

--

Do you even know anything about perl? -- AC Replying to Tom Christiansen post.

Re:speaks more to TESTING by slamb · 2002-05-02 09:16 · Score: 5, Informative

What's shocking to me is that almost no open source authors or advocates give a hoot about automated testing of any kind. The only free software I've found with a test suite is gcc. As much as I hate to say it, there's a good chance that the relative inexperience of most open source authors is a factor here.

Perl is really good about this. The Test::Harness and Test::More modules make it very easy to write test suites, so CPAN modules have lots of automated tests. It might even be a requirement to get a module into CPAN; I'm not sure.

PostgreSQL has regression tests.

There's a really nice test environment for Java code called JUnit. Lots of stuff is using it. Lots of articles about how to write effective tests. There's a project to develop mock versions of common objects (servlet requests, SQL queries) that fail in interesting, predefined ways. I'm using a C++ workalike called CppUnit in one of my projects.

The Boost code has automated testing.

There's a project called qmtest.

The Wine people have recently started using regression tests.
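For anyone who hasn't seen the xUnit style that JUnit and CppUnit follow, here is a minimal sketch using Python's standard unittest module, which belongs to the same family. The function under test and the test names are made up for illustration; none of them come from the projects listed above.

```python
import unittest

INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def checked_add_i32(a, b):
    """Hypothetical function under test: add, but refuse results that overflow 32 bits."""
    result = a + b
    if not INT32_MIN <= result <= INT32_MAX:
        raise OverflowError(f"{a} + {b} does not fit in 32 bits")
    return result


class CheckedAddTest(unittest.TestCase):
    def test_in_range_sum(self):
        self.assertEqual(checked_add_i32(2, 3), 5)

    def test_boundary_value_is_accepted(self):
        self.assertEqual(checked_add_i32(INT32_MAX, 0), INT32_MAX)

    def test_overflow_is_reported(self):
        with self.assertRaises(OverflowError):
            checked_add_i32(INT32_MAX, 1)


def run_suite():
    """Load and run the test case, returning the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CheckedAddTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_suite()
```

The point of the style is that the boundary and failure cases get written down once and then rerun automatically on every change, instead of being checked by hand when someone remembers.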
