Debug your Code, or Else!
Trevor Lovett writes "I ran across a collection of famous software bugs that have caused large scale disasters including the explosion of the Ariane 5 rocket due to integer overflow and the misfiring of a US Patriot missile that caused 28 deaths because of accumulated floating point error. "
The surprise isn't how many situations have cropped up because of software bugs, but rather how few. If you think of all the things that code is written for, and yet there hasn't been any major 'disaster'. Yes, the deaths and accidents are tragic, but on the grander scale of things, it's amazing that nothing truly catastrophic has happened.
sig--we don't need no goddamn sig
I'd take issue with the inclusion of the London Millennium Bridge; that wasn't so much a failure of software, but a flawed model, that failed to take into account the effects of swaying pedestrians. After it was rectified, there were new data - never used in any bridge model - incorporated into such models so that it won't happen again. That's science; not a bug.
I wrote the algorithm that scheduled the dialins. It used a pseudo-random approach during the day, weighted by outstanding traffic.
But at night, there was period during which we had to unload all messages before the next day's processing. During this time, the pseudo-random algorithm was replaced by a deterministic one that assigned computers time slots.
The computers also had auto-rety in the case of failure, so each call could result in several if it were blocked.
Unfortunately, during coding I had put in the number of modems answering phones at 20 (as an arbitrary number for testing). During the hectic rollout, this never was changed to the actual number which was much smaller.
Once the system came on line, every night at 1AM portions of Omaha (which included lots of call centers) would lose all long distance service for a couple of hours, as all these computers called in and retried several times.
Eventually the phone company figured it out and contacted us, and we discovered and corrected the discrepancy.
Another issue was that we had a number of hotels that were using pulse dialing (this was a long time ago in a galaxy far far away). Sometimes these would be off by one due to the inherent unreliability of pulse dialing, and the result was a lot of calls to certain numbers related to the 800 number, all in the middle of the night.
BTW... as far as I know, this was the first large widely distributed commercial computing system to use switched telephone circuits for communications (but no doubt some other grey-haired slashdotter knows of another).
The only good weather is bad weather.
It delayed the re-opening of the airport for about seven months or so. After it did finally open the system wasn't working yet so the baggage system had to be operated manually for a couple of months.
The Therac-25 was an automated x-ray machine that overdosed patients. Fatally. It was a UI bug rather than a software bug. It's dissected in "Killer Defects" (IIRC) by Negroponte (again, IIRC).
Best Slashdot Co
If I misestimate the mass of a planet, is that a software bug?
If my software sells stock when a certain threshold is hit and yours does the same, and that least do a financial industry meltdown, did my software not work as planned, or is the issue more the dynamics of the market being somewhat unpredictable?
The tacoma narrows and london milennium bridges are both listed here, yet neither one is a software issue - hell the tacoma bridge collapsed in 1940!
That said, it is a pretty interesting list, but calling it a list of software bugs and using it to underscore the importance of regression testing software is a bit of a stretch. If anything, it underscores the importance of editing and proofreading your content.
Half of it boils down to, We are not responsible for anything bad, even if we were warned about it and have a fix for it that we are holding on to sell as part of an upgrade.
Fight Spammers!
The claim for this one is that a Patriot during the Gulf War failed to intercept a Scud missle and the Scud missile killed 26. Ergo, a software bug killed 26 people.
Considering that even the military now admits that no Patriot *ever* intercepted an Iraqi scud, this inference is unfounded.
(A former Theratronics employee is standing right behind me, but he denies having worked on the Therac-25.)
There is one highly relevant difference between the way that we deal with hardware and software. With hardware, inner details, schematics, and the like are usually easily available. Often this is required by law in any critical applications.
;-)
With software, most programmers are writing code to run on systems (kernels, runtime libraries, and the like) that are usually proprietary. The inner details are not just neglected; the companies intentionally keep them secret and prosecute people who leak them.
As a result, software can't be made reliable, not even in principle.
We do have a few exceptions, e.g. linux and all the GNU stuff. If *everything* underneath your code is Open Source, then in principle you can examine it and find problems. (It ain't easy, but at least it's doable if your employer will permit the time that it takes).
But we're facing a major battle just getting Open Source software accepted by a tiny part of the market. In most jobs, you are required to write code for systems whose inner working you are not permitted to know.
The US government is even using proprietary, binary-only computer systems in secure and mission-critical situations. Anyone who expects the code in such situations to be reliable is either utterly ignorant or actively malicious.
Myself; I'd welcome rules that make me and other software developers responsible for bugs in our code. If there were such a legal requirement, I could point to it when someone denies me access to the information that I need, and say "I can't possibly write correct code when you are keeping vital information from me. Show me the inner details of these parts of the system, and I'll agree to write reliable code for it."
Of course, in a couple of cases, when I've gotten my hands on such details, I've proceeded to write a proof that certain things could not be done reliably on that system. "Fix that bug in that library, and I'll vouch for my code. Until then, here's my bug report describing exactly how it will fail."
Unfortunately, when I've done this, the usual result was that I was looking for another job soon thereafter.
(One such lost job was when I proved that certain sensors in a nuclear power plant could not be made to work reliably due to their software. But that was 20 years ago; maybe they've fixed it by now.
Those who do study history are doomed to stand helplessly by while everyone else repeats it.
Human effects on bridges is hardly a surprise. Recall in 1981 when the Kansas City Hyatt's skywalk collapsed, killing 114, because the pedestrians were dancing (and the design was altered to ease construction). You'd think that would have been enough of a wake up call to the millenium designers to consider human motion. more info
Armys break cadence when marching across bridges, at least as far back as Napoleon's time. Presumably they learned that the hard way.
On a more personal note, I have participated in the unintentional destruction of a gymnasium. 80 or so people crowded together in the middle, bouncing up and down, and then "down and down". We fractured the engineered wooden joists. Fortunately it failed gracefully. Just sagged down about 4 feet in the middle.
What I'm trying to say, not particularly directly, is "don't give the designers of the bridge a pass because this new phenomenon struct their bridge". Chastise them for risking people's lives and wasting resources by neglecting the loads placed on bridges.
This IS the text on this very sort of thing. I love techno-"oops, that's not right, is it?"-horror stories and this book is filled with them. I REALLY recommend this book! Here's an example of the page after page of entries it contains:
Making Rupee!
Due to a bank error in the current exchange rate, an Australian man was able to purchase Sri Lankan rupees for (Australian) $104,500, and then to sell them to another bank the next day for $440,258. (The first banks' computer had displayed the Central Pacific franc rate in the rupee position.) Because of the circumstances surrounding the bank's error, a judge ruled that the man had acted without intended fraud, and could keep his windfall of $335,758.
Computer Related Risks - Peter G. Neumann - ACM Press - 1995
The bottom line here is "computing is, in a technical sense, a risk". Actually, technology - of any kind - is a risk. Which I suppose leads us to remember that life is a risk.
At which point, I'll just stop rambling and point you all to Amazon to buy the book.
I swear by MacOS X. Although I use to swear *at* MacOS 9...