Debug your Code, or Else!
Trevor Lovett writes "I ran across a collection of famous software bugs that have caused large scale disasters including the explosion of the Ariane 5 rocket due to integer overflow and the misfiring of a US Patriot missile that caused 28 deaths because of accumulated floating point error. "
Remember that time when that kid dialed into NORAD and used that security exploit to get into the Thermo-nuclear war simulator and everyone thought it was real until he and the inventor were able to trick the computer into playing Tic-Tac-Toe? I see a LOT of bugs in the software there but no one ever seems to care about that...
The ultimate network admin tool needs HELP!
2) 2000 - Poorly coded garbage collection causes Word 97 to crash, lose last 2 hours of research paper. Class was in 30 minutes, paper was late. I lost my scholarship.
3) 2002 - IE Crashes while writing AWESOME first post for /., My karma never recovered.
What, me worry?
The surprise isn't how many situations have cropped up because of software bugs, but rather how few. If you think of all the things that code is written for, and yet there hasn't been any major 'disaster'. Yes, the deaths and accidents are tragic, but on the grander scale of things, it's amazing that nothing truly catastrophic has happened.
sig--we don't need no goddamn sig
I'd take issue with the inclusion of the London Millennium Bridge; that wasn't so much a failure of software, but a flawed model, that failed to take into account the effects of swaying pedestrians. After it was rectified, there were new data - never used in any bridge model - incorporated into such models so that it won't happen again. That's science; not a bug.
It really amazing how many software project managers that don't fully understand what regression testing is all about.
Software engineers simply cannot be trusted to do more than small unit level testing! We get into a pattern of behavior, we know what to expect, and simply do not stress test the system.
Thats why I like hiring sales people and 2-year olds to test my code at the unit/integration level.
Old age and treachery almost always overcome youth and skill.
Software Horror Stories linked from the post's link
I believe more patients' lives are lost because of mistakes by doctors/hospitals/nurses, or sheer negligence. In some parts of India, for example, private hospitals are afraid to admit victims of accidents or crimes because the hospital itself might get into some trouble. Personally, I have seen doctors giving stupid advice, and people losing lives.
To put things in perspective, fatalities caused by human errors (non programming related) outnumber those caused by software errors by orders of magnitude, in many fields (except, say in launching unmanned space vehicles).
S
The pentium bug is certainly famous because every idiot and its brother think it is rare for a CPU to be buggy. The second condition in the list is "caused a large scale disaster". This condition is, sadly, also met. It caused a large scale public relations disaster for Intel because once again said idiots thought that a CPU bug is rare.
I wrote the algorithm that scheduled the dialins. It used a pseudo-random approach during the day, weighted by outstanding traffic.
But at night, there was period during which we had to unload all messages before the next day's processing. During this time, the pseudo-random algorithm was replaced by a deterministic one that assigned computers time slots.
The computers also had auto-rety in the case of failure, so each call could result in several if it were blocked.
Unfortunately, during coding I had put in the number of modems answering phones at 20 (as an arbitrary number for testing). During the hectic rollout, this never was changed to the actual number which was much smaller.
Once the system came on line, every night at 1AM portions of Omaha (which included lots of call centers) would lose all long distance service for a couple of hours, as all these computers called in and retried several times.
Eventually the phone company figured it out and contacted us, and we discovered and corrected the discrepancy.
Another issue was that we had a number of hotels that were using pulse dialing (this was a long time ago in a galaxy far far away). Sometimes these would be off by one due to the inherent unreliability of pulse dialing, and the result was a lot of calls to certain numbers related to the 800 number, all in the middle of the night.
BTW... as far as I know, this was the first large widely distributed commercial computing system to use switched telephone circuits for communications (but no doubt some other grey-haired slashdotter knows of another).
The only good weather is bad weather.
He was considering making Fatal Defect required reading for the C programming course I took. From Amazon.com:
In Fatal Defects: Chasing Killer Computer Bugs, Ivars Peterson describes dozens and dozens of hoary computer bugs and gives biographical sketches of the bug detectives who located and fixed them. This book, which reads like a novel, is both entertaining and informative. Many of the bugs that Peterson discusses are not in computer programs per se but in the human systems that run and operate the computers. Very often the operator fails to understand what the computer program requires as input and types in an incorrect command. The computer then executes the command, with potentially disastrous results. Fatal Defects has important lessons for both those who design computers and those who use them.
He also insisted that we not call them bugs. "They are ERRORS, calling them bugs makes it sound like they are cute little accidental things that pop up when actually they are programming mistakes."
-- Adam
Make reading the ACM's RISKS digest a part of your regular routine, and you'll hear about these kind of software-related problems and many others - usually shortly after they happen. The RISKS digest is available on Usenet as comp.risks, as a mailing list, and on the WWW at http://catless.ncl.ac.uk/Risks. A new issue is published on a semiregular basis, every one to two weeks. It's not only informative but interesting too.
--Jim
Sure, some people here gripe about this not being newsworthy. But as a hardware guy, I am happy to see that software guys are finally going to be held to some sort of standard.
In electronics, if your hardware has ONE little problem, it's almost bankruptcy time. Remember the Pentium FP bug? And how it would have affected very little? Remember the hoopla, people wanted new processors, etc..
But software bugs? Who cares! It's NORMAL, it's EXPECTED. Well, geeks and nerds, time to get your asses in gear and live up to the same standards mechanical and electrical engineers have been living up to for decades.
I'm tired of being held to a standard of perfection that the software people (who make more money than me!) don't even KNOW about.
The Therac-25 was an automated x-ray machine that overdosed patients. Fatally. It was a UI bug rather than a software bug. It's dissected in "Killer Defects" (IIRC) by Negroponte (again, IIRC).
Best Slashdot Co
Don't take my word for it. Do a web search and see for yourself. Here are some references to get you started:
http://www.fas.org/spp/starwars/docops/rp911024.ht m
http://www.csmonitor.com/durable/1997/09/08/opin/l etters.1.html
GMD
watch this
I'm sure we all have those bugs that we catch in bench testing. Mine was forgeting to add a cancel button to the following dialog box:
... well, you read the papers. Glad I caught that one before I released it to test.
"OK to delete database"
When I caught that one I had visions of a user who had his/her million dollar database deleted charging into our office with a shotgun and
Go check out the Risks Forum, links are available from the ACM webpage. There is plenty of proof and explanation for hundreds of software related mishaps. You're obviously looking in the wrong places.
I'm the big fish in the big pond bitch.
Incoming missile sir! What do we do?
<officer> Don't worry, it's one of ours.
<private> But sir, it's still going to HIT us!
This not only sounds like something that belongs in a Dilbert strip, but also the basis for the logic that allows the spreading of all these e-mail viruses.
Half of it boils down to, We are not responsible for anything bad, even if we were warned about it and have a fix for it that we are holding on to sell as part of an upgrade.
Fight Spammers!
The bug that caused Airane explosion was a requirements analysis bug. The Pentium FP bug was a hardware bug.
A quick skim of the rest nets me at least 6 more non-software software bugs
After seeing that, I can't really trust the list on things I don't have a good knowledge about.
Here's a challenge for someone: Go through the list and find out how many (if any) of the listed software bugs are actually software bugs.
The claim for this one is that a Patriot during the Gulf War failed to intercept a Scud missle and the Scud missile killed 26. Ergo, a software bug killed 26 people.
Considering that even the military now admits that no Patriot *ever* intercepted an Iraqi scud, this inference is unfounded.
Ah well you see, that's the problem of becoming overly dependent on paper-based systems for mission-critical applications.
Cheers,
Ian
That system just wasn't designed for that purpose. It was VERY well designed for its actual purpose, which was tracking AIRCRAFT going WAY slower than that missile. And it was only rated for 14 hours of continuous usage, not 100. So it wasn't a fault in the program per se, but a misapplication of a system designed for a different use.
Kintanon
Check out JoshJitsu.info for Brazilian Ji
Blatant karma whoring...
The risks forum is available as a moderated newsgroup, or you can subscribe to the e-mail version. See the Risks info page.
and goes on to say:
seven times the loss of human life, but less of a tragedy? I guess they are soldiers so fuck 'em, eh?
This story is over two years old, so they have had ample opportunity to correct it. The "comment" button on that page just takes me to the front page. Nice.
Also on that page, "The DoubleSpace automati hard disk comparision software included in Microsoft MS-DOS 6.0 [. .
Ironic that there are such glaring errors in an article about buggy software.
Well, I wasn't particularly a fan of Byte before, but now I'm convinced that they suck.
-Peter
Some years back, as a grad student, I saw a bunch of colleagues do a rather unnerving experiment. Much of the number crunching was, as usual, done in Fortran. So they instrumented the compiler to silently test for integer overflow, report when it happened, and also report whether the program tested for it.
Their result was that roughly 50% of the Fortran programs on the mainframe computer produced at least one number in the output that was wrong due to undetected integer overflow.
This itself would be bad enough. But a bunch of us followed this up by asking Fortran programmers about it. What we did specifically was to point out that, unlike floating point, where there's an interrupt, integer arithmetic required a separate instruction to test the overflow flag. So testing for integer overflow took extra cpu cycles. Then we asked them whether they thought that software should be modified to always test for integer overflow, as is done with floating point.
The answer was overwhelmingly that if it took extra cpu cycles, the software should not check for overflow.
When we pointed out that this introduced the risk of programs producing incorrect results, the Fortran programmers invariably said that didn't matter. Faster is better, even if some of the results are wrong.
I think of this whenever I read about computers used in medical, transportation, or other areas where malfunctioning software could put lives at risk.I don't believe that the "software culture" has changed significantly in this respect since then.
Those who do study history are doomed to stand helplessly by while everyone else repeats it.
Not quite. The software was built for the Arianne 4. On the Arianne 4, it was physically impossible for that value to ever get high enough to overflow. So on the Arianne 4 the assumption that an overflow could only be due to a hardware failure was entirely correct.
If they had known that years later an Arianne 5 would come along, and those engineers would stupidly reuse the Arianne 4 code without testing it once, then perhaps they would have made a different decision. But I think the blame goes on entirely on the Arianne 5 guys, who were *not* the ones who wrote that code.
So we have a specification problem and a system design problem. Neither is a pure "programming problem".
Software crashes are like airplane crashes -- blame the lowest guy on the totem pole. In air crashes, it's the pilot. In software, it's a coder.
Welcome to the Turing Tarpit, where everything is possible but nothing interesting is easy.
My software always works perfectly on my system. Zero bugs.
I have no idea what the hell the users do to it to screw it up.
When after sitting down for 36 hours straight when I first learned to program in C, I wrote a small, but usefull, payroll program. By the end, during the function that would print out the check, I added "Press any key to continue, any other key to abort". Lucky for me I never released that program.
---
All comments are not factual unless stated otherwise.
There were many things that went wrong during the incident, but one of the FEW things that worked correctly was the AEGIS weapons system on board the guided missile cruiser. The error lay in the crew's mistaking the range information reported on the radar screen with altitude information. As a result, the CO thought that the incoming contact was flying straight towards his ship and decreasing in altitude (preparing to attack).
Blaming a "cryptic display" is hardly a software bug if anyone is familiar with radar screens. That's why we train people to read them!
Actually, those caused by VB left no survivors to tell the story...
Your first link is a translation of a patriotic Israeli article cheerleading the competence of their military. It doesn't necessarily make what they're saying false, but does make it suspect.
The second link is way low on content, I'm not sure how to judge it. All it says is "we looked at a bunch of videotapes and arrived at this conclusion". And then goes on to mention the bitter dispute between the U.S. and Israeli military over why the system didn't work so well in Israel.
I'm not sure I'm going to buy either argument. I know enough about flight characteristics to question the assertion that the scuds were so good at jinking and chaff the patriots (which were originally designed to hit jinking, chaff releasing aircraft) couldn't hit them.
If the scuds were dropping debris because extra fuel tanks made them unstable:
1) Why wasn't the wobble a pronounced problem at launch when the extra weight would have completely thrown off the trim characteristics of the missile?
2) Dropping "debris" is a bad thing, and it's only a matter of time before doing so results in an uncorrectable failure of the missiles flight aerodynamics. Why weren't most of them failing earlier?
3) Missiles don't fly in smooth trajectories nearly as often as you think. They jink to try and make anti missile systems (like say the Phalanx close-in weapons system) miss them or think they are dead and not worth any more attention.
Even if the patriots did fail, why would that have grave implications for our anti ballistic missile shield? SCUDs are cruise missiles, not ballistic missiles. Why do you think those big computers at Norad can accurately predict where the warheads will hit just after boost?
Education is a better safeguard of liberty than a standing army.
Edward Everett (1794 - 1865)
Human effects on bridges is hardly a surprise. Recall in 1981 when the Kansas City Hyatt's skywalk collapsed, killing 114, because the pedestrians were dancing (and the design was altered to ease construction). You'd think that would have been enough of a wake up call to the millenium designers to consider human motion. more info
Armys break cadence when marching across bridges, at least as far back as Napoleon's time. Presumably they learned that the hard way.
On a more personal note, I have participated in the unintentional destruction of a gymnasium. 80 or so people crowded together in the middle, bouncing up and down, and then "down and down". We fractured the engineered wooden joists. Fortunately it failed gracefully. Just sagged down about 4 feet in the middle.
What I'm trying to say, not particularly directly, is "don't give the designers of the bridge a pass because this new phenomenon struct their bridge". Chastise them for risking people's lives and wasting resources by neglecting the loads placed on bridges.
This IS the text on this very sort of thing. I love techno-"oops, that's not right, is it?"-horror stories and this book is filled with them. I REALLY recommend this book! Here's an example of the page after page of entries it contains:
Making Rupee!
Due to a bank error in the current exchange rate, an Australian man was able to purchase Sri Lankan rupees for (Australian) $104,500, and then to sell them to another bank the next day for $440,258. (The first banks' computer had displayed the Central Pacific franc rate in the rupee position.) Because of the circumstances surrounding the bank's error, a judge ruled that the man had acted without intended fraud, and could keep his windfall of $335,758.
Computer Related Risks - Peter G. Neumann - ACM Press - 1995
The bottom line here is "computing is, in a technical sense, a risk". Actually, technology - of any kind - is a risk. Which I suppose leads us to remember that life is a risk.
At which point, I'll just stop rambling and point you all to Amazon to buy the book.
I swear by MacOS X. Although I use to swear *at* MacOS 9...
I think most of the bugs in software are the result of "Coding Under Influence". Wether it is a strict time-limit, ambiguous specifications, no sleep or other disturbances, it leads to blatant dumb assumptions or similar faults. Everyone knows that driving under influence is dangerous and can lead to accidents. Why do "software architects" think this is different when someone writes important programs?
I think part of the problem is that writing software is a rather new handwork in comparison to e.g. metalworking. Programmers don't have a union, often they work under poorer confitions than workers at conveyor belts if you consider the higher responsibility they have.
(don't mind the yellow)
In the old USSR (Stalin times), there was a standard bridge acceptance test:
1) put project managers, lead architects and engineers under the bridge;
2) put heavy loaded trucks on the bridge.
That was real extreme testing.
1. You'd have lost this anyway - that's the point of the test.
2. They designed a failed bridge, and you want them to design the new one?
3. These are cheap when you have hundreds of millions of slaves.
I'm pretty sure the media has mentioned this, beyond those two media links you already posted, I mean. The issue has been debated since the first Patriot experiences during the Gulf War.
But I don't really see how this has "grave implications" for an anti-ballistic missile shield. The effectiveness of the Patriot missile used during the Gulf War era is in doubt, but a that does nothing to invalidate the general concept of destroying a ballistic missile with another interceptor missile. It certainly isn't easy to do, and there may be better ways to accomplish the same goal or things more worthy of our limited resources, but to claim that it's somehow physically impossible is both disingenuous and incorrect.
Wrong Starting Estimate of Uranus mass
But I thought Uranus is a hole...
Any hole sufficiently big enough is bound to have some mass in there, somewhere.
"And like that