Examples of Programming Gone Wrong?
LightForce3 asks: "I'm a beginning CS student, and in my studies I've come across examples of programmer error causing very large problems, such as the Ariane 5 failure and the Therac-25 accidents, often as tales of caution to beginner programmers such as myself. My (morbid?) curiosity has been piqued, and I'm looking for other examples of programmer error leading to serious problems. After all, it is better to learn from the mistakes of others than from your own, right? ;) What programming-related accidents, incidents, and failures, both well-known and obscure, do Slashdot readers know about, and are there any good resources for researching these?"
Erik, don't troll. The Challenger accident was a mechanical failure, had nothing to do with software, but if you want a software project gone wrong example, I'll give you one: Gah NU/Hurd
12 years of development and it still sucks!
In the 80's, Robert T Morris accidentally released a worm that exploited problems in sendmail and other common internet daemons and took down most of what was the internet at that time. This was especially bad since about half of it was military.
The only reason it took things down was because a timing loop was messed up, and it was spreading something like 1000 times too fast. It was supposed to spread everywhere, yes, but by crawling slowly. It was not intended to eat up all connections on all machines.
Had it spread slowly as intended, it would have been much more widespread and caused much less damage.
I'll agree that some programming errors *could* be fatal, but the one that comes to mind is the "2 line change" from AT&T that essentially knocked out phone service throughout the east and mid-west in 1990. It was the topic of many quality assurance seminars for the better part of the early 90's. I only remember it because it affected my company -- we lost phone service for 2 days. It was also one of those traditional "last minute changes" that someone clearly f*cked up on...
http://www.soft.com/AppNotes/attcrash.html
Outlook!
Built with the idea that code in attachments should be executable, often automatically. Also full of exploitable bugs, to get even more stuff running automatically, regardless of who sent it. Responsible for a huge amount of damage by all sorts of worms, trojans, etc.
Someone, somewhere got the idea that email would look better with html; and if it got html, it should get scripting too, that's consistent with web pages! And it's cool if attachments (like pictures) can be opened in their appropriate program automatically - let's run any executables then, that's consistent!
This is oversimplified, but I really feel that this is a case of stupid consistency that caused multi-billion dollar damage. Email should never be executed by the mail client.
I believe posters are recognized by their sig. So I made one.
Oh wait... -1 Redundant
Here's a good site though with tons of examples.
My favorite would be the infamous time when NASA did half its calculation in metric and the rest in SI. ;)
F-bacher
James Tiberius Kirk: "Spock, the women on your planet are logical. No other planet in the galaxy can make that claim."
that was told to my class about the altitude of fighter jets.
A company was hired to rewrite the code that was used on one of the models of fighter jets, and they offered to fix an unusual bug.
The details are: apparently they had two altimeters - one was barometric, and the other I don't remember.
Anyway, the programmer was coding along, and was writing code to determine what would happen if the altimeters stopped functioning.
He came to the case where they both weren't working, and couldn't figure out what to do, so called one of the pilots that was acting as an information source for the developers, and asked him what altitude they normally flew at, and he answered, "12,000 feet" or something similar.
So the programmer wrote,
if (!altimeter1_working)
{
    if (!altimeter2_working)
    {
        height = 12000;  /* both altimeters failed: report the "normal" altitude */
    }
}
Stupid, but this code could not be changed. The pilots had the following rule deeply ingrained: if the altitude stays at 12,000 for more than a few seconds, pull up, as your altimeters aren't working.
A company I once worked for (as an intern) was in the business of what's called "train control" software. Briefly, it's the software that dispatchers use to monitor the status of the switches, the position of all the trains being tracked by the system, etc. One of the features of the system is to provide early-warning of potential collisions. Well, the system is quite reliable (having been in service, in one form or another, since the 70's). However, there have been some accidents.
One such accident, in Mexico, was caused by an unexpected combination of several simultaneous failures. One day, for some reason, one of the servers needed to be reset. At the same time, two freight trains were stopped at a switch, in the process of what's called a "pass," where one train turns off onto a side track to let the other train pass by on the main track. Long story short, the status bits of the switch got lost during the server reset (there is a provision for restoring track states when the backup servers take over, but it didn't work for some reason). After asking if the track was clear, the driver for train1 received a green light from the dispatch office. The dispatcher, not knowing that train2 hadn't cleared the switch yet, figured everything was ok. The trains collided at very low speed, and not head-on, but nonetheless the collision cost the rail line several million in equipment and downtime. No one was hurt.
The lesson: When writing bullet-proof software, check every possible condition! More extensive field testing would have caught the failover bug.
I'm an AC for a reason...
Let's just say that two years ago a very large international shipping company suffered two days of worldwide failure in the package routings printed on labels. The bug was caused by an incorrectly placed paren in an index offset calculation, leading to truncation of an intermediate result (to a 16 bit unsigned int, when it should have been 32). The bug sat dormant for five years because the result matrix it was indexing into was smaller than 64kbytes. As soon as it grew over that size - boom! What a way to wake up at 2am when the Asian-Pacific region starts calling...
I didn't make it, but I was definitely involved with the fix. After that we did some very thorough auditing on all of the routing code - and fortunately didn't find any other surprises lurking.
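For readers who have not met this class of bug, here is a minimal sketch in C of how one misplaced parenthesis can truncate an intermediate result to 16 bits. The function names, the row/column layout, and the cast are all assumptions made for illustration; the actual routing code is not shown in the post.

```c
#include <stdint.h>

/* Hypothetical names throughout; the original routing code is not shown. */

/* Buggy: the parenthesis after the cast wraps the whole product, so the
   intermediate result is truncated to 16 bits before col is added. */
uint32_t offset_buggy(uint32_t row, uint32_t row_width, uint32_t col)
{
    return (uint16_t)(row * row_width) + col;
}

/* Fixed: the intermediate product stays in 32 bits. */
uint32_t offset_fixed(uint32_t row, uint32_t row_width, uint32_t col)
{
    return row * row_width + col;
}
```

With a small table the two functions agree, which is how a bug like this can sit dormant for years. Once `row * row_width` passes 65,535, the buggy version wraps around and points at the wrong entry.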
This isn't really a programming error, but a user training error.
In the Airbus, if the pilot tries to correct (use the flight controls) while the computer is engaged, the computer will correct the pilot's correction. This is unlike a car with cruise control, where hitting the brakes just cuts the cruise control. Many China Airlines planes have crashed due to poor pilot training in this regard. They weren't trained well enough to shut off the computer control before taking control of the plane.
I'm also sure someone can be a little more detailed than this, but it is, IMO, at least a design error that has caused hundreds of deaths.
As a side note, my Software Engineering professor refused to ever fly on a fly-by-wire plane, and was opposed to SDI simply because he didn't believe that either had been or ever would be debugged properly. (If there is one error in every 10,000 lines of code, and it has 3 or 4 million lines of code, how many errors is that? His answer: too many to trust.)
I worked for a programmer back in the 80's who made a mistake that caused all credit card purchases to disappear from the electronic journal. This meant that their purchases were not recorded on their credit card statements. Fortunately for the company the bug did not affect the recording of the transactions on the paper journal. This bug wasn't discovered for a few days and it took quite some time to rekey all the credit transactions.
Unfortunately this was not her first or last mistake of this magnitude. Retailers often see IT as an expense rather than an asset and are as cheap as possible. This has a tendency to cause shoddy programming since they hire as few programmers as possible and overwork them and often software is put into production without being thoroughly tested. At least this was the case when I worked in retail some ten years ago--I don't think I'll do that again.
But I am finding that insurance companies have the same philosophy.
Not ONE hardware problem...ever?
Clearly you are forgetting the Apollo I fire which resulted from a spark in a pure O2 atmosphere. The spark was caused by a frayed wire. That's a hardware problem for sure.
Despite what many programmers think, they do make many mistakes. Having been in QA for more than 7 years, blimey, the stories I could tell.
For example: once there was a requirement for a Windows program to do nothing. If it started up, it would just shut down. Simple? I would have thought so - even if it wasn't, it was simple for the developer to unit test. It took 7 attempts, ranging from opening a window and sitting there, through several GPFs, to at least one reboot.
Then there was one time (of many) where despite assurances from development that the product had been properly unit tested, it would core dump on start up.
My point is that any CS student should understand the whole development process. It is more than just programming. Whilst neither of the above was life threatening, they illustrate a point. No matter how many examples of catastrophe and failure you find, there would be a lot more without testing and QA.
Of course, you could take the point that all those public failures are a result of lax QA.
F16 autopilot flipped plane upside down whenever it crossed the equator.
They should have known from the water going the wrong way when flushing !
What about Apollo 13?
Well the reason why Y2K wasn't the huge disaster the media were predicting was because in the years leading up to it the world's programmers were running around like blue-arsed flies fixing everything :P
I've read in-depth technical analyses of the Apollo fire, and I have an MSc in Physics.
Before that, *no-one* knew that a spark in one place could cause a fire TWO FEET AWAY.
(You get little hot bits of burnt dust floating around in a pure oxygen atmosphere, and they keep themselves hot enough to set something else afire quite a ways away. Of course things are *easier* to set fire to in that atmosphere as well.)
Usually the story goes something like, well, take your pick ...
I am a ./ reader so I am a geek and so I do know.
...
... and who are you by the way ?
It compiles, it works, so it must be correct.
But
Whether you will be willing to accept what you are going to see is a different question altogether, and of course having a good laugh at others is more fun, yet there is a difference between being just another coder out there and being a developer.
IMHO one ought to aim for the latter and once you have become your harshest critic you are on the right path.
It was a new financial system, and it was a real mess - something like £9m initial cost and £20m due to its flaws. According to Anthony Finkelstein, who's written a very detailed report on the fiasco:
You can read his full report here (pdf) or here (google html version). There are also news reports on the system here and here.
Basically, it was bad management throughout... a classic case of a big software project gone wrong.
About 25 years ago, Washington State Ferries had a new fleet of boats with computer controlled engines. The code included "safety" features to protect the engines and transmissions from abuse.
So, when a ferry was about to crash into a dock, and the captain called for full reverse power, the software would shut the engine down to protect it......and the ferry would crash into the dock.
Horror stories (lost rockets, etc) are certainly attention-getters, but a more useful question might be what kinds of errors got made, regardless of how severe the outcome.
For example, I once helped a newbie employee with a program that was working fine in a simple test case, but was blowing up when it tried to crunch through a production file.
After digging a little, I noticed that she was using recursion in her "GetNextInterestingRecord" routine! The logic was:
1) Get a record
2) See if it's the kind we want
3) If not, Call self
4) return record to main
I'm not sure why she chose to use recursion (too many classroom lectures on "cool" stuff and too little experience with getting useful stuff done?), but the program needed "interesting" records every so often to keep from overflowing the stack.
Clearly recursion should be confined to those problems where it's really needed, and not used just because you can find a way to state the problem using recursion. And even then, you need to think about how big the stack will get, and what sorts of scenarios could cause it to get too big.
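The four steps above can be sketched in C, side by side with a loop that does the same job. This is only an illustration under assumptions: the `Record` type, the `is_interesting` filter, and both function names are hypothetical, since the original program is not shown.

```c
#include <stddef.h>

/* Hypothetical record type and filter; the original program is not shown. */
typedef struct {
    int kind;
} Record;

static int is_interesting(const Record *r)
{
    return r->kind == 1;
}

/* Recursive version, as described: one more call per skipped record. */
const Record *next_interesting_recursive(const Record *recs, size_t n, size_t *pos)
{
    if (*pos >= n)
        return NULL;                      /* end of input */
    const Record *r = &recs[(*pos)++];    /* 1) get a record */
    if (!is_interesting(r))               /* 2) is it the kind we want? */
        return next_interesting_recursive(recs, n, pos);  /* 3) call self */
    return r;                             /* 4) return record to caller */
}

/* Loop version: stack use stays constant however long the boring run is. */
const Record *next_interesting_loop(const Record *recs, size_t n, size_t *pos)
{
    while (*pos < n) {
        const Record *r = &recs[(*pos)++];
        if (is_interesting(r))
            return r;
    }
    return NULL;
}
```

Both return the same records in the same order. An optimizing compiler may turn the recursive call into a jump, but nothing guarantees that, so a long run of uninteresting records can still exhaust the stack in the recursive version.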
- [...] Mars probe that crashed because of mismatched units. And that was just poor communication among the software guys.
So if it's not a bug, it must be a feature
Have an article on the guys who write the stuff. They're damn good, but they say themselves their programs contain errors: "the last three versions of the program [...] had just one error each. The last 11 versions of this software had a total of 17 errors." Apparently never caused a problem, but not bug-free.
Then there was the Canadarm2 issue. Or wasn't that a bug either?
yes, we have no bananas
Oh, yes. Personally, I am very glad our military has placed its faith (and the lives of our mariners) in such reliable technology.
Bottom line: that stuff about the floating point error in the PAC-2 system looks neat on paper but it's not at all clear that the faulty calculation was responsible for the loss of life.
GMD
watch this
my cs teacher told me this one back in college...he said on one of the first runs of the f-16 (or maybe another one of the computer controlled fighters in the air force) they were flying and everything worked just fine. however they took it across the equator and the plane flipped upside down. so the pilot corrected it and everything went back to normal. then he flies across the equator again and it flips.
so they took a close look at the software, and there was a bug in their sin function so that when they went across the equator the angle changed from positive to negative and the sin function didn't have the negative incorporated. so basically when the plane went over the equator it thought it was upside down and corrected itself by flipping itself upside down.
i think it's a funny example of a stupid mistake possibly making a catastrophe. i've never seen this mentioned elsewhere, so i'm not too sure about this. but i do trust the cs prof who told me, before coming to my school he did a bunch of government contract work.
This isn't variable initialization, but the principle applies. Data that you know are junk should look like junk! Trying to "fake it" or make it "look good" is exactly the wrong thing to do.
-Peter
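Applied to the altimeter story earlier in the thread, "junk should look like junk" might look like the C sketch below. All of the types and names are hypothetical; this is one way to carry an explicit validity flag instead of a plausible-looking 12,000 feet, not the code from the aircraft.

```c
#include <stdbool.h>

/* Hypothetical types; the original avionics code is not shown. */
typedef struct {
    bool working;
    double feet;
} Altimeter;

typedef struct {
    bool valid;     /* false when no sensor can be trusted */
    double feet;    /* meaningful only when valid is true */
} Altitude;

Altitude read_altitude(const Altimeter *a1, const Altimeter *a2)
{
    Altitude out = { false, 0.0 };
    if (a1->working) {
        out.valid = true;
        out.feet = a1->feet;
    } else if (a2->working) {
        out.valid = true;
        out.feet = a2->feet;
    }
    /* Both failed: report "no data" instead of faking a normal altitude. */
    return out;
}
```

The caller is then forced to check `valid` before using `feet`, so a display can show a failure flag rather than a number the pilot has to learn to distrust.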
Was working for a small ISP. Sitting at work developing a script to blank the accounts off our old mail server (outsourced) for when our new mail server is completely online and ready to go. It's done, I remove my debugging code and the limits I had placed (I had limited it to work with only 2 test accounts). Congratulating myself on a job well done, I head to the hall to grab myself a coke. I come back and my boss is at my comp. Now, the program was written in VC++, so the 'play' button is pretty obvious and he's seen me use it before. The idiot wanted to see what I was working on and ran it, blanking all accounts off of the mail server. Took us 3 days to get the outsourcing company to restore from a backup (one of the reasons we were co-locating our own), and even then all mail received after the backup (the night before) was gone, of course. I just about strangled my boss. On the upside, he never touched my workstation again.
Jesus saves, everyone else takes full damage from the fireball.
There's also the tiny truth that issues that are fixable don't sell nearly as well as "Airplanes will FALL FROM THE SKIES! Don't step in an elevator, THEY'LL FALL! Withdraw all of your money or YOU'LL LOSE IT!"
you know?
You can read about it on James Gosling's home page (also has info on Ariane 5).
Luckily the engineers were able to upload a patch to Mars. That's remote debugging/patching for you :-)
You can't blame the OS for this.
HUH? Any time one program can bring down other programs, the OS is not blameless, even if those other programs have the best error checking/handling in the world. One application should not cause me to lose my other data/open applications just because some dips*** forgot to check for a divide by zero error.
In a multi-tasking environment, I have no business knowing about or interfering with data/address spaces that do not belong to me. It is the responsibility of the OS to take on the mundane task of making sure that programs "play well with others."
I have to admit that the USS Yorktown should have had a redundant failsafe system (even if it meant that all people on board grabbed an oar and started paddling). According to the laws of the sea if I see a disabled ship and they accept my offer of a tow then I now "own" that ship. So be careful when your windows systems crash and you need to use someone else's restore disk they now "own" your computer.
Back issues of "Communications of the ACM" are a gold mine for such blunders of the art. Most issues have a back-page column, "Inside Risks," usually written by Peter Neumann, though various others have contributed. Each column usually covers one theme, since the subject material is so broad and seemingly unending.
"Flight instruments don't lie"
... it has an electronic AOA.
... no matter what the real AOA was.
... Fortunately, it was expensive and not lethal.
First, BEFORE YOU LEAVE THE GROUND, pilots are taught that instruments don't lie. Specifically, when the human inner ear is placed in flight, things go wrong (the inner ear canals are static, not dynamic, devices; the fluid has no dampening or rate sensors). When there is no external reference, the inner ear canals adjust to the eye's visual presentation. It's called the 'leans.' Bad joo-joo. Many a perfectly good aircraft has been flown into the ground because the pilot believed his ears and eyes and not his instruments.
Second, IN FLIGHT, angle-of-attack (AOA) is a spectacular indicator of where your airfoil exists within (or outside) the flight envelope for your aircraft. Inside the flight envelope, you can seek best range (mpg) or best endurance (loiter) or best climb.
In most aircraft, the angle-of-attack indicator is a manual instrument (on the skin is a sensor which looks like a big euro-style handle and it runs to an indicator in the cockpit).
Many pilots are correctly taught to 'fly' the angle-of-attack.
Third, ON THE GROUND, when you land, you use the aircraft shape as an airbrake. You hold the aircraft nose off the ground as long as possible to create drag.
Fourth, ON THE GROUND, when you land, you do not want to hold the aircraft nose too far off the ground or the tail will scrape the runway and your fitness report will reflect and you'll be the butt of bad jokes at Snopes for eternity.
The AOA is used to assist in the performance of aerodynamic braking. The aircraft performance manual publishes the tried and true range of AOAs for aerodynamic braking. [It also indicates when too much AOA will ding the aircraft.]
Aerodynamic braking is part art and part science and requires accurate instruments.
Enter the F-16
F-16 pilots were taught to fly the flight direction indicators to land.
However, many old and new pilots fell back on the old AOA once the wheels touched the ground to do aerodynamic braking.
Suddenly, F-16 tails were scraping along the runway at an alarming (and expensive) rate.
[As an aside, the problem was probably ignored until a senior officer ground off a few inches of aluminum THEN there was a problem.]
The programmers who wrote the AOA routines were rightly told that the AOA is used in flight. So, when the AOA detected that the aircraft had placed weight on the wheels (weight-on-wheels - WOW), it was programmed to quit working. Unfortunately, it kept displaying the last AOA reading.
Pilot flies, pilot lands, pilot believes instruments, pilot scrapes multi-million dollar aircraft's tail along runway.
The programming solution was simple: when there was WOW, fade the AOA.
This was another case where contracts pit spec wording against spec intent against functional application and understanding of how it's supposed to work.
"Why did they call you 'sparky' and why are you driving school buses in North Topeka?"
A bug in a factory PLC program allowed a machine to start when a metallic object (such as a wedding ring) went in front of a sensor.
Later, a program modification allowed an air cylinder to extend while the machine was turned off for maintenance. The guy jumped out of the way in time, but let us know about it. (This was before lockout/tagout.)
Bottom line - a bug in a PC program typically results in data damage. A PLC bug can literally smash someone's head!
That was the opinion of NewScientist magazine, but shortly before the actual date, something happened in Australia that changed their mind ( it was in an editorial, IIRC, not in the online version -- the mag really is worth it ).
What changed their mind, is that some smelting operation ( again, IIRC ) destroyed itself automatically, when the computers that poured fuel ( coal? ) into the furnaces kept doing so, while the computers that poured ore into the system stopped doing so, because Feb 29th didn't, according to them, exist.
Autodestruct, though not quite HAL-style ( as an aside, didn't HAL stand for Holographic Algorithmic Logic? -- remember the clear blocks they used as HAL's units in the computer-room, too )
Sudden, Colossally Expensive equipment damage, but no lives lost.
Had that happened in one model of autopilot. . .
And yes, I remember some city administration stating that they'd done a 'dry run' of rollover, and discovered that the basic infrastructure didn't work ( water was one item specifically mentioned, though I don't remember if it was treatment or what ).
Of course, 'no disaster that had been about-to-be-caused by this code that we discovered to be non-correct didn't happen' . . isn't front-page news.
I know for a fact that some federal gov't contractors were writing NON-Y2K compliant code in 1998 ( either being committedly braindead, or hoping that the re-write contracts would pay extravagantly when the social-insurance system broke on-the-day ).
Messages to/for me ( in me journal )
From the Pacific Northwest, home of "innovative" approaches to software reliability, comes:
3 4563661_ship27m.html
...
http://seattletimes.nwsource.com/html/localnews/1
"Officials could not say for certain what caused the ship to heel, but they think the ballast system was probably at fault. A malfunction became evident about 3:30 a.m., when the 653-foot ship started to tilt. The crew was evacuated and no one was hurt.
The ship, in operation since June, has an automated ballast system that adjusts water levels in 28 compartments to keep it righted on the high seas."
Kind of frightening - wonder if the crew even knows how to do a manual override. (Also weird that evacuating the upper port ballast chamber would cause it to list to port...)
In Cook County, northern Minnesota, a large percentage of households are heated by "off-peak electric stored heating". At midnight, December 21st 1999 (precisely 10 days before Y2K) the software controlling the radio signal which keeps all the heaters from going online simultaneously, crashed. The resulting overload shut down the power in the county for hours. This utility was not believed to be Y2K sensitive. Surprise!
On February 28 2000, (one day before the infamous 2/29/2000) credit card traffic into VisaNet (through Vital Processing) was failing out with the error code corresponding to "Invalid Date". Since the date 2/29/1900 is invalid, good Y2K test procedures usually call for testing that condition. AFAIK, the company never admitted to having a Y2K problem.
The National Reconnaissance Office had some of its most valued spy satellite systems go offline due to Y2K troubles. I think they were down for at least a day or two. (ouch!)
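The February 29 check mentioned above comes down to the Gregorian leap-year rule: 1900 was not a leap year, but 2000 was, because it is divisible by 400. A minimal C sketch of that rule follows. The helper `is_valid_date` is a hypothetical name for illustration, not anything from VisaNet or any other system named here.

```c
#include <stdbool.h>

/* Gregorian rule: every 4th year, except centuries, except every 400th. */
bool is_leap_year(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}

/* Hypothetical helper: checks a month/day pair against a four-digit year. */
bool is_valid_date(int year, int month, int day)
{
    static const int days_in_month[] = {
        31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31
    };
    if (month < 1 || month > 12 || day < 1)
        return false;
    int limit = days_in_month[month - 1];
    if (month == 2 && is_leap_year(year))
        limit = 29;
    return day <= limit;
}
```

A system that stores only two digits and reads "00" as 1900 would reject February 29, while the same check against 2000 accepts it. That is why a good Y2K test plan exercises that date explicitly.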
I apologize for the confusion; I'll attempt to make it more clear. These two goals are actually not contradictory. One of the methods by which a chunk of code can be made easy to re-use is by abstracting it out into a separate module or subroutine. In this manner, anything that needs the functionality that that chunk of code provides, at any time in the future, can simply call it. In other words, you don't want to "cut and paste" any given chunk of code into several places, since if you need to make a change to it you'd have to change the same code in several places instead of just one. The idea here is that we want to save time, and increase maintainability.
Think about it like this. Let's say you want to read a book (use some chunk of code). You have two choices. You can get one copy of the book and keep it in a central location (abstract the code out to one subroutine or module), or you can get a dozen copies of the book and place them at seemingly convenient locations around your house (cut and paste, i.e. duplicate, the code in many different places). You start at the beginning and read a chapter or two. If you have one book you can simply place a bookmark (modify the code) where you left off. If you have a dozen books you're forced to place twelve bookmarks. Now, what would happen if the author puts out a revised edition of the book? Would you rather replace twelve books or one?
Admittedly, the above example is somewhat contrived, but hopefully it answers the question.
--- Fox
Almost a decade ago, when I worked for a different credit card company that shall remain nameless, a member of my team (I was the lead) introduced a defect that was responsible for about $40K in misapplied credits. I am not sure whether we ever got the money back.
The program was written in C. He had changed a do-while loop to a for loop, and in editing he had kept the line that contained the original condition (including the trailing semicolon). As many of you C-ers out there are aware, a semicolon directly after a for() statement becomes the whole loop body, so the subsequent code block runs just once, after the loop has finished, instead of on every iteration!
A very memorable lesson in the value of lint and thorough regression testing!
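For anyone who has not been bitten by this yet, here is a small C sketch of the stray-semicolon bug. It is a hypothetical reconstruction with made-up function names, not the credit-posting code from the story.

```c
/* Hypothetical reconstruction; the original credit-posting code is not shown. */

/* Buggy: the stray ';' is the entire loop body, so the braced block
   below is an ordinary block that runs once, after the loop ends. */
int apply_credits_buggy(int n)
{
    int applied = 0;
    int i;
    for (i = 0; i < n; i++);
    {
        applied++;    /* executes exactly once, whatever n is */
    }
    return applied;
}

/* Fixed: the block is the loop body and runs n times. */
int apply_credits_fixed(int n)
{
    int applied = 0;
    for (int i = 0; i < n; i++) {
        applied++;
    }
    return applied;
}
```

Calling both with n = 5 gives 1 from the buggy version and 5 from the fixed one. The code compiles cleanly under default settings, which is why lint, or stricter compiler warning settings, is worth the noise.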
This may not qualify as a disaster, but I distinctly remember having to give an account of the defect to the corporate controller with an audience of grand and exalted poobahs. She was a very intolerant and technically ignorant person who actually intimated that this had been done maliciously.
KK4SFV