Examples of Programming Gone Wrong?
LightForce3 asks: "I'm a beginning CS student, and in my studies I've come across examples of programmer error causing very large problems, such as the Ariane 5 failure and the Therac-25 accidents, often as tales of caution to beginner programmers such as myself. My (morbid?) curiosity has been piqued, and I'm looking for other examples of programmer error leading to serious problems. After all, it is better to learn from the mistakes of others than from your own, right? ;) What programming-related accidents, incidents, and failures, both well-known and obscure, do Slashdot readers know about, and are there any good resources for researching these?"
This was already an Ask Slashdot a while back. Check the archives.
Erik, don't troll. The Challenger accident was a mechanical failure, had nothing to do with software, but if you want a software project gone wrong example, I'll give you one: Gah NU/Hurd
When Programming goes Wrong 2! Thrill to our latest reality TV series where we show REAL LIFE footage of poorly thought out database schemas, unchecked buffers and even explicit shots of forbidden goto statements.
12 years of development and it still sucks!
This book is devoted to just that. It's what you're looking for...go get it and read it.
http://slashdot.org/articles/99/09/30/1437217.shtml
-- Kircle
1.) Patriot missile failure
2.) Intel f*cking up floating-point calculations in one of their chips
3.) High-tech toilet glitch (no, really!)
4.) Windows ME
If you celebrate Xmas, befriend me (538
What happened to Challenger wasn't a programming mistake, but rather a case of not following policy. The solid rocket boosters were never designed to operate in cold temperatures. The result of working outside of design specs was catastrophic failure, yes, but that wasn't the result of a programming error.
In the 80's, Robert T Morris accidentally released a worm that exploited problems in sendmail and other common internet daemons and took down most of what was the internet at that time. This was especially bad since about half of it was military.
...wherein a technique to save memory on older computers resulted in a massive media panic twenty years later. Oh, and it caused a couple glitches
Co-founder of GerbilMechs
A Central Office (CO) switch is basically a mainframe-class computer programmed in assembler. A few years back, a newly-installed switch failed due to a bug in the code, causing a cascading failure of the phone system for a few hours.
Incorrect: This was not a programming issue. Nor was it a software issue at all. The problem was the O-ring seals in the SRBs (Solid Rocket Boosters). The manufacturer stated that they should not be operated under 53 degrees, and NASA overrode the recommendation and launched anyway. The expected happened.
NASA hasn't ever had a hardware problem. Or a software problem. Ever. Every problem can be directly tied to one specific person being a fscking moron. The closest you could come is that Mars probe that crashed because of mismatched units. And that was just poor communication among the software guys.
Professor Falkin was always saying, "Leave a backdoor in any program you write, just in case your code becomes self-aware."
Checking in code before it's completely debugged, not knowing that that night's build would be going out as a client update.
Very embarrassing when you leave a ShowMessage('In Here 239') line of code... Oops.
Moral of the story... Beware of checking incomplete work into a Code Vault.
Tournament Management Online &
The only reason it took things down was because a timing loop was messed up, and it was spreading something like 1000 times too fast. It was supposed to spread everywhere, yes, but by crawling slowly. It was not intended to eat up all connections on all machines.
Had that been the case, it would have been much more widespread and caused much less damage.
The RISKS Digest is a mailing list and usenet newsgroup that describes all kinds of situations where technology has gone wrong. Many of the stories involve programming errors.
Google's RISKs Archive
I'll agree that some programming errors *could* be fatal, but the one that comes to mind is the "2 line change" from AT&T that essentially knocked out phone service throughout the east and mid-west in 1990. It was the topic of many quality assurance seminars for the better part of the early 90's. I only remember it because it affected my company -- we lost phone service for 2 days. It was also one of those traditional "last minute changes" that someone clearly f*cked on...
http://www.soft.com/AppNotes/attcrash.html
Outlook!
Built with the idea that code in attachments should be executable, often automatically. Also full of exploitable bugs, to get even more stuff running automatically, regardless of who sent it. Responsible for a huge amount of damage by all sorts of worms, trojans, etc.
Someone, somewhere got the idea that email would look better with html; and if it got html, it should get scripting too, that's consistent with web pages! And it's cool if attachments (like pictures) can be opened in their appropriate program automatically - let's run any executables then, that's consistent!
This is oversimplified, but I really feel that this is a case of stupid consistency that caused multi-billion dollar damage. Email should never be executed by the mail client.
I believe posters are recognized by their sig. So I made one.
Here's the Link
-- Kircle
The most obvious example is the Mars orbiter. Gotta get those metric/imperial conversions down. Less common, but quite important, are things like radian/degree (if doing 3d stuff), various units of time, currency (Office Space, anybody?). Math stuff in general is easy to mess up without proper testing. In addition, be sure you'll never get something that's out of the domain of what you're working with. If your program starts turning up areas of -14 square kilo-ounces or something, you may have problems ;)
:)
Of course, if you only use one unit for something, you never have to worry about conversions
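One way to guard against the unit and domain problems described above is to carry the unit along with the value. This is only a minimal sketch: the `Length` type and its methods are made up for illustration and are not from any real library or from the orbiter software.

```python
from dataclasses import dataclass

FEET_PER_METER = 3.28084  # approximate conversion factor


@dataclass(frozen=True)
class Length:
    """Hypothetical length type that always stores meters internally."""
    meters: float

    def __post_init__(self):
        # Reject values outside the domain, e.g. a negative length.
        if self.meters < 0:
            raise ValueError(f"length cannot be negative: {self.meters}")

    @classmethod
    def from_feet(cls, feet: float) -> "Length":
        # Convert once, at the boundary, so the rest of the code sees meters.
        return cls(feet / FEET_PER_METER)

    @property
    def feet(self) -> float:
        return self.meters * FEET_PER_METER

    def __add__(self, other: "Length") -> "Length":
        if not isinstance(other, Length):
            # Refuse to add a bare number, whose unit is unknown.
            return NotImplemented
        return Length(self.meters + other.meters)
```

With this, `Length(100.0) + Length.from_feet(328.084)` works, while `Length(1.0) + 5` raises a TypeError instead of silently mixing units.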
--Justin Mitchell
"2nd Place is a fancy word for losing" --Bender (Futurama)
Oh wait... -1 Redundant
Here's a good site though with tons of examples.
My favorite would be the infamous time when NASA did half its calculation in metric and the rest in SI. ;)
F-bacher
James Tiberius Kirk: "Spock, the women on your planet are logical. No other planet in the galaxy can make that claim."
MIT runs a class called 6.033: Computer Systems Engineering. These lecture notes contain a list of projects that had great sums of money spent on them only to be abandoned. Also the reading list has a bunch of papers that discuss the "big splash" failures like Therac 25.
I saw this one coming as soon as I read the headline.
How about something a little more unpredictable next time?
but couldn't find it.
Anyway, here are a couple of links.
Software horror stories
More horrors
that was told to my class about the altitude of fighter jets.
A company was hired to rewrite the code that was used on one of the models of fighter jets, and they offered to fix an unusual bug.
The details are: apparently they had two altimeters - one was barometric, and the other I don't remember.
Anyway, the programmer was coding along, and was writing code to determine what would happen if the altimeters stopped functioning.
He came to the case where they both weren't working, and couldn't figure out what to do, so called one of the pilots that was acting as an information source for the developers, and asked him what altitude they normally flew at, and he answered, "12,000 feet" or something similar.
So the programmer wrote,
if altimeter1 not working
{
    if altimeter2 not working
    {
        set height = 12000;
    }
}
Stupid, but this code could not be changed. The pilots had the following rule deeply ingrained: if the altitude stays at 12,000 for more than a few seconds, pull up, as your altimeters aren't working.
If you've ever seen Edward Tufte's canned speech, he partially attributes the event to an inability of the engineers involved to organize and present their information in a clear way to communicate the nature of the problem.
Wouldn't setting it to something like 0 be better? I mean, I could miss it sticking at 12,000 for a while, but if I notice that my altitude is suddenly 0, I think my first instinct will be to pull up as fast as possible.
F-bacher
James Tiberius Kirk: "Spock, the women on your planet are logical. No other planet in the galaxy can make that claim."
I guess we could call "slashdotting some innocent site" an example of programming gone wrong
-- SouNerd.com
I used to work for a 1999/2000 'golden child' dot-bomb which dealt in file trading... a proposed legal form of napster. It was a fucked company from the start, but it still had a lot of traffic in the early days.
:)
:)
We always had problems with downloading files from the site.... the files kept getting corrupted, and occasionally, a member would complain that they tried to download a powerpoint presentation and ended up getting 4 way anal porn.
This perplexed the developers, and it was not until 9 months after going online with the site that they realised that the Java class that dealt with the downloads was a single process shared by all users!
So, your download would go ok IF nobody else tried to download at the same time. If two people clicked download at about the same time, you would download the file that the second person wished to download.
No wonder they went bankrupt
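For anyone wondering how a shared download handler hands out the wrong file, here is a rough sketch. It is a guess at the failure mode, written in Python rather than the Java of the story, and the class names and file names are invented for illustration.

```python
class SharedDownloader:
    """Buggy: one instance serves everyone, and the requested
    filename lives on the instance, so requests overwrite each other."""

    def __init__(self, files):
        self.files = files
        self.current = None  # shared by all users

    def request(self, filename):
        self.current = filename

    def send(self):
        return self.files[self.current]


class PerRequestDownloader:
    """Safer: the filename travels with the request, nothing is shared."""

    def __init__(self, files):
        self.files = files

    def send(self, filename):
        return self.files[filename]


files = {"slides.ppt": b"presentation", "other.jpg": b"something else"}

shared = SharedDownloader(files)
shared.request("slides.ppt")   # user A clicks download
shared.request("other.jpg")    # user B clicks a moment later
got_a = shared.send()          # user A receives B's file

fixed = PerRequestDownloader(files)
ok_a = fixed.send("slides.ppt")  # user A gets what was asked for
```

The point of the second class is that nothing about a single request is stored where another request can reach it.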
-- 7 string electric violin + live loop samplers
When they played Heidi over the end of the greatest come-back in football history. Oh wait, you didn't mean that kind of programming, did you?
Do what you can, with what you have, where you are.
Don't be so narrow in your approach. Is it a programming error if a stadium roof collapses because the engineers couldn't understand what the output of their computer model was saying?
What about when the construction crew quietly substituted what they thought was an equivalent design to what the computer program came up with for a skywalk over a hotel lobby?
After almost 20 years in this field, I think that at least 80% of the serious "errors" I see are because the user didn't understand the results of the program, and only 20% of them are due to classic development errors.
The lesson to learn from this: the user interface matters. Give some thought to presenting the information in a meaningful manner (e.g., the infamous pre-Challenger graphs showing O-ring erosion vs. the post-Challenger graph that mapped damage by temperature at the time of launch), and allow users to see the information in the way that makes the most sense to them.
For every complex problem there is an answer that is clear, simple, and wrong. -- H L Mencken
... should be enough to write a dissertation on the thoughtless computing leading to serious problems.
Anyone got a reliable link for the toilet one?
I've never really considered the register to be a source of ... well ... accurate news.
A company I once worked for (as an intern) was in the business of what's called "train control" software. Briefly, it's the software that dispatchers use to monitor the status of the switches, the position of all the trains being tracked by the system, etc. One of the features of the system is to provide early-warning of potential collisions. Well, the system is quite reliable (having been in service, in one form or another, since the 70's). However, there have been some accidents.
One such accident, in Mexico, was caused by an unexpected combination of several simultaneous failures. One day, for some reason, one of the servers needed to be reset. At the same time, two freight trains were stopped at a switch, in the process of what's called a "pass," where one train turns off onto a side track to let the other train pass by on the main track. Long story short, the status bits of the switch got lost during the server reset (there is a provision for restoring track states when the backup servers take over, but it didn't work for some reason). After asking if the track was clear, the driver for train1 received a green light from the dispatch office. The dispatcher, not knowing that train2 hadn't cleared the switch yet, figured everything was ok. The trains collided at very low speed, and not head-on, but nonetheless the collision cost the rail line several million in equipment and downtime. No one was hurt.
The lesson: When writing bullet-proof software, check every possible condition! More extensive field testing would have caught the failover bug.
A poor communication among the software guys that results in improper units being used in a program isn't a software problem?
If that wasn't a software problem, what was it? All software problems are defects in design or implementation, they all go back to a miscommunication or mistake among some software people somewhere.
I'm an AC for a reason...
Let's just say that two years ago a very large international shipping company suffered two days of worldwide failure in the package routings printed on labels. The bug was caused by an incorrectly placed paren in an index offset calculation, leading to truncation of an intermediate result (to a 16 bit unsigned int, when it should have been 32). The bug sat dormant for five years because the result matrix it was indexing into was smaller than 64kbytes. As soon as it grew over that size - boom! What a way to wake up at 2am when the Asian-Pacific region starts calling...
I didn't make it, but I was definitely involved with the fix. After that we did some very thorough auditing on all of the routing code - and fortunately didn't find any other surprises lurking.
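To show how a misplaced paren can sit dormant like that, here is a small sketch. It is not the real routing code: the function names, the row/column layout, and the use of a bit mask to stand in for a 16-bit unsigned int are all assumptions for illustration.

```python
def offset_buggy(row, row_size, col):
    # Misplaced parenthesis: the 16-bit truncation (emulated here
    # with a mask) is applied to the intermediate product.
    return ((row * row_size) & 0xFFFF) + col


def offset_fixed(row, row_size, col):
    # The whole offset is computed first and kept in 32 bits.
    return (row * row_size + col) & 0xFFFFFFFF
```

While every offset stays below 65536, both functions agree, for example `offset_buggy(10, 100, 5)` and `offset_fixed(10, 100, 5)` both give 1005. Once the table grows, `offset_buggy(700, 100, 5)` wraps around to 4469 while the correct offset is 70005.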
Not a specific example, but a big mistake is to assume that just because a function works when you use it a certain way, it's supposed to work that way. I've seen so many people fail to read any documentation on the functions they are using. Whenever I program, I make sure I understand exactly what every function, operand, everything I use does. I don't just assume, I make absolutely certain. Read, Read, Read, and then Read some more.
..There's a-dooin's a-transpirin'
No, really.
Search and you will find.
Learning to search effectively will serve you best.
Yes, but the lander was friggen funny. You can just see the guy sitting there, taking a hit off a bong, then stating, "Dude, dude, come here, got a problem." "What's that?" "*giggle* Well, you know the lander? *guffaw* The navigation was in meters... but I programmed the lander in feet, so instead of landing *grin* fsck'er buried..." I wish there woulda been another lander there to see that one hit, woulda been the shot of a lifetime. Besides, NASA crashes things into things on purpose :)
Jesus saves, everyone else takes full damage from the fireball.
I think it would be interesting to know when the first infinite loop occurred in the early days of programming, and how the programmers dealt with it. Obviously, back then they only had single-tasking machines.
Let's say you turned in some bad FORTRAN code to the university computer on a time share. What if nobody noticed for hours that your program was taking up all the processing time? That would make some people pretty pissed. :p
This isn't really a programming error, but a user training error.
In the Airbus, if the pilot tries to correct (use the flight controls) while the computer is engaged, the computer will correct the pilot's correction. Unlike in a car with cruise control, where if you hit the brakes it just cuts the cruise control. Many China Airlines planes have crashed due to poor pilot training in this regard. They weren't trained well enough to shut off the computer control before taking control of the plane.
I'm also sure someone can be a little more detailed than this, but it is, IMO, at least a design error that has caused hundreds of deaths.
As a side note, my Software Engineering professor refused to ever fly on a fly by wire plane, and was opposed to SDI simply because he didn't believe that either had been or ever would be debugged properly. (If there is one error in every 10,000 lines of code, and it has 3 or 4 million lines of code, how many errors is that? His answer: too many to trust.)
Come on now, that's the lazy way!
How about citing an actual example of windows code bugs causing big problems? I'll go first. The USS Yorktown had to be towed back to harbor when the NT system that was automating most of the ship crashed.
What? The French make excellent cars..(Peugeot, Citroen)...
The American cars on the other hand are *completely* hopeless!
Try a Peugeot 206 / 307 and feel how it handles the road. Then you don't even want to go back!
Moderation: +4. Modded 70% Funny and 30% Overrated. 100% Saturated.
If you want to eke out every bit of hardware performance, you still can't beat coding in assembler. This of course assumes it is worth the time and effort. I haven't the foggiest idea if switches are still coded in assembler, but they still were at the time of the failure.
Way back when I still worked on Honeywell mainframes like the DPS-88, I would write the occasional application in assembler (called GMAP) because there were some very nice multi-word machine instructions that would do a lot of stuff for you. For example, the MVT command would move data from one area to another, while converting it from ASCII to EBCDIC. One machine instruction. That's much more efficient than coding the same thing in C.
This is perhaps not a syntax problem, but it remains a software problem (semantic). Clearly, the software didn't do what someone (falsely) assumed it would do.
j. scott olsson
YHBT. 'Nuff said.
Sadly, this is often the case. Sometimes management is more interested in looks or how the organization/operation will be perceived than love. Let's not fall into that same trap.
I worked for a programmer back in the 80's who made a mistake that caused all credit card purchases to disappear from the electronic journal. This meant that their purchases were not recorded on their credit card statements. Fortunately for the company the bug did not affect the recording of the transactions on the paper journal. This bug wasn't discovered for a few days and it took quite some time to rekey all the credit transactions.
Unfortunately this was not her first or last mistake of this magnitude. Retailers often see IT as an expense rather than an asset and are as cheap as possible. This has a tendency to cause shoddy programming since they hire as few programmers as possible and overwork them and often software is put into production without being thoroughly tested. At least this was the case when I worked in retail some ten years ago--I don't think I'll do that again.
But I am finding that insurance companies have the same philosophy.
Netscape. ... and that version is buggy as hell.
Stopped evolving in a mass-market way at 4.79
US shooting down Airbus 320
You're referring to the destruction of Iran Air flight 655 by the USS Vincennes near the Strait of Hormuz, on July 3, 1988. For one thing, it was an Airbus A300 (bigger and older than an A320). The failure there was mostly in human decision making, not in the AEGIS radar system, which faithfully reported that the airliner was travelling at 450 knots on a steady bearing towards Vincennes, roughly four miles outside the commercial air corridor, and not broadcasting IFF information (which of course they wouldn't, as a foreign civilian airliner). It was the officers of Vincennes who interpreted this information as a threat, misidentified the target as an Iranian F-14, and destroyed it.
Toronto-area transit rider? Rate your ride.
Did anything terrible happen because of Y2k?
The sheer number of people asking for help with research on Slashdot is just amazing. Have they heard of the "internet" or a "search engine"?
Let's see:
google - software failure
Results 1 - 10 of about 2,040,000. Search took 0.25 seconds
Hmm, let's see. So: .25 second search time, give maybe 5 minutes to connect to the internet and perform the search.
So in 5 minutes and .25 seconds, over 2 million pages of info on software failure.
Get a clue, Get a Life, Do your own research.
Personal Website
Many may think this topic has been done to death, but the examples grow exponentially. Someone recommended the RISKS Digest above, which I agree is terrific.
Rarely is the answer that "the programmer was an idiot." Software bugs are projections and magnifications of human frailties. There is the class of errors where the computer does what it should but interacts with the user poorly, and it's glib to dismiss the user as the idiot.
I have followed military snafus with interest. It is still not clear how much of the Vincennes catastrophe was human versus computer error. The Yorktown was an example of an old-fashioned divide-by-zero error crashing Windows and paralyzing a frigate. The Navy plans to automate aircraft carriers and also hand over fire control to Windows systems, which makes me uncomfortable to say the least. Bill has the bomb.
Voting systems are another area where our understanding of the errors must be completely up-to-date. As it is, most (all?) manufacturers of voting and tallying software consider their code proprietary and won't allow outside audits. If you think chads were bad, just wait for an electronic voting disaster that lacks an old-fashioned paper trail.
Risks and comp.risks may however be the better forums for this topic; but it's not a bad thing for the aficionados to bring it to the general interest /.ers' attention from time to time.
P.S. I recall a satellite that was lost in the early 80's for lack of a comma in the code. Which satellite?
A poor communication among the software guys that results in improper units being used in a program isn't a software problem?
That's correct. It's a communication problem
If that wasn't a software problem, what was it?
A miscommunication between software guys...
All software problems are defects in design or implementation, they all go back to a miscommunication or mistake among some software people somewhere.
Your definition of a software problem is too vague to be proper...
But anyway, that's only half true. The problem is that not all software bugs go back to that source... hardware plays a big role as well.
SELECT * FROM BugTraq_SecurityFlaws WHERE FlawCause LIKE '%buffer%overflow%' ORDER BY OSType ASC
Never underestimate the relief of true separation of Religion and State.
Would it be so hard to display ERROR??? If they're stuck with numbers, how about something impossible like 99,999 feet?
That programmer should be taken for a little jet ride. So the pilots could show their thanks.
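The "display ERROR" idea could look something like the sketch below. It is purely illustrative: the function names, the use of None for a dead sensor, and the display format are assumptions, not anything from the actual jet software.

```python
from typing import Optional


class AltimeterFailure(Exception):
    """Raised when no altimeter can supply a reading."""


def read_altitude(baro: Optional[float], other: Optional[float]) -> float:
    """Return the altitude in feet from whichever sensor still works.

    None stands for a failed sensor.
    """
    if baro is not None:
        return baro
    if other is not None:
        return other
    # No plausible-looking magic number: fail loudly instead.
    raise AltimeterFailure("both altimeters failed")


def display(baro: Optional[float], other: Optional[float]) -> str:
    """Format the reading for the pilot, or show an explicit error."""
    try:
        return f"{read_altitude(baro, other):,.0f} ft"
    except AltimeterFailure:
        return "ERROR"
```

Here `display(None, None)` gives "ERROR", so a failed reading can never be mistaken for a real altitude the way a hard-coded 12000 can.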
um, this was a reference to WarGames... ;^)
Do you think he had a hand in SkyNet as well?
Lucky break those metal heads chose to model Terminator after old Arnie. Probably the specifications read "Big, dense, abrasive". If they'd picked someone brighter we'd be screwed for sure.
"I have opinions of my own, strong opinions, but I don't always agree with them." -- George H. W. Bush
I had always thought that SI was the English system; I have no idea why.
Thanks,
F-bacher
James Tiberius Kirk: "Spock, the women on your planet are logical. No other planet in the galaxy can make that claim."
Not ONE hardware problem...ever?
Clearly you are forgetting the Apollo I fire which resulted from a spark in a pure O2 atmosphere. The spark was caused by a frayed wire. That's a hardware problem for sure.
the database they were using faulted on a divide by zero. nothing to do with NT.
Despite what many programmers think, they do make many mistakes. Having been in QA for more than 7 years, blimey, the stories I could tell.
For example. Once there was a requirement for a Windows program to do nothing. If it started up, it would just shut down. Simple? I would have thought so - even if it wasn't, it was simple for the developer to unit test. It took 7 attempts. Ranging from opening a window and sitting there - through several GPFs - and at least one reboot.
Then there was one time (of many) where despite assurances from development that the product had been properly unit tested, it would core dump on start up.
My point is that any CS student should understand the whole development process. It is more than just programming. Whilst neither of the above were life threatening, it illustrates a point. No matter how many examples of catastrophe and failure you find, there would be a lot more without testing and QA.
Of course, you could take the point that all those public failures are a result of lax QA.
Is EverQuest in all its glory!
If that's not programming gone wrong, I don't know what is...
I can't recommend the risks site too highly. (redundant I know)
Risks To The Public In Computers And Related Systems
http://catless.ncl.ac.uk/Risks
On how to be 0wned by other people: Counterpane: Crypto-Gram. Shares with comp.risks the refrain of "I can't believe people don't learn from this"
Counterpane: Crypto-Gram
http://www.counterpane.com/crypto-gram.html
Don Norman's _The Design of Everyday Things_ and website also offer insight on how to avoid UI-related failures.
http://www.jnd.org/index.html
Also, get a copy of _Code Complete_ and/or _Code Write_ by Steve McConnell (pub: Microsoft Press, which is rich irony). Lots of mistakes and how to avoid them.
The cautionary note might be that most of these failures are human related at some level. Whether it be at the project level, or the UI level -- there are lots of ways to cause a failure.
Finally, avoid any kind of career in Software QA. There is no better way to just get kicked around at the expense of the people putting the bugs in the software in the first place.
Anybody can work under ideal circumstances. -- Jeff K. (January 4, 2001)
I love how complete bullshit gets moderated to +5 Informative on slashdot. Why not do a tiny bit of fact checking and slam people for misinformation rather than praising them for anything negative having to do with Microsoft???
What Win9x kernel?
Tsunami -- You can't bring a good wave down!
Read the RISKS digest, as comp.risks or at http://catless.ncl.ac.uk/Risks. Everyone who works with computers should read this regularly, it is much less painful to learn from other people's mistakes.
PGN put a bunch of the classic items together in a book a few years ago, called Computer-Related Risks.
Tom Swiss | the infamous tms | my blog
You cannot wash away blood with blood
Does the WOPR in War Games count?
Sure, horrible accident. But nothing on the site that you link to indicates that it had anything to do with a programming mistake.
Got brain?
Someone here was claiming that NASA has never had a software bug. That sounded pretty unbelievable to me. And sure enough, it's not true. In the recent Mars missions alone, they had a bunch of software bugs resulting in things varying from non-fatal vehicle failures to outright loss of spacecraft.
Regarding the loss of the Mars Climate Orbiter spacecraft, from nasa.gov: "The 'root cause' of the loss of the spacecraft was the failed translation of English units into metric units in a segment of ground-based, navigation-related mission software"
Also, several "software bugs" (their words) relating to the Mars Surveyor Lander Vehicle are described here. These bugs were detected and fixed in the field (ie, Mars). At least one of the bugs caused a heater failure in the vehicle on Mars. This failure was recovered from.
Anyways, those are just two quickies, but NASA has their share of bugs. (And generally some pretty ingenious ways to reprogram and update vehicle software post-launch.)
On a related note, here's a paper from NASA entitled "The Infeasibility of Quantifying the Reliability of Life-Critical Real-Time Software".
F16 autopilot flipped plane upside down whenever
it crossed the equator.
They should have known from the water going the wrong way when flushing !
What about Apollo 13?
After all, it is better to learn from the mistakes of others than from your own, right? ;)
:p
It is better. It is also very rare. Most people (especially programmers) that I know have to learn things the hard way, myself included. Sometimes, you can be told "no, don't do it that way" over and over and over, and still want to do it that way until you realize, the hard way, why you maybe should have listened.
Or maybe that's just me, and the weirdos I hang with...
NGWave - Fast Sound Editor for Windows
I can't believe no one has mentioned it yet -- it's probably because they don't care about the third world
Well, I do care about the third world. But I was not aware the Bhopal disaster was down to dodgy software. I always believed it was reckless cost cutting by a faceless multi-national which took advantage of the fact that India, as a developing country, didn't have very good health & safety legislation.
But I think the main reason disasters like this are ignored is that poor people don't make very good consumers, so the consumerist society pays little attention to their wants and desires. And their deaths are little more than statistics.
10,000 dead? Bung them some cash.
Now, what do you think would happen if a large number of decent hard working consumers were wiped out in a single event?
Do you mind, your karma has just run over my dogma.
Technically a mechanical failure. If they hadn't had the ingenious idea of using a pure O2 atmosphere the spark probably wouldn't have done much. Pure O2 makes anything flammable positively explosive (actually, that's the secret of most explosives -- a built-in oxidizer).
Was there a significant reason for using pure O2? I remember it argued that this was something they could have done without, a bad decision.
Actually I think that no one mentioned this since it has nothing to do with programmer error. (At least not according to what you linked to.)
Karma: Incomprehensible (Mostly affected by posting at +5, reading at -1, and metamoderating everything unfair.)
Friend, you are wrong. There were some programming errors during the Apollo series Moon launches. The astronauts had to re-enter new programs to correct the original ones. Look it up at NASA.
Don't Tread on OpenSource
Now, what do you think would happen if a large number of decent hard working consumers were wiped out in a single event?
It did, 09/11/01
"Sic Semper Tyrannosaurus Rex."
To increase altitude in an aircraft, you add power, not pull up. Pulling up slows the aircraft down and decreases altitude.
Much as I dislike NT, especially in critical environments, this problem had nothing to do with NT. It had everything to do with bad coding.
As we all know, information systems are only as smart as people make them. In the case of the USS Yorktown, an admin/operator entered data which caused a divide by zero condition in the application. Because the application did not have any exception handling built into it for a divide by zero condition, it died.
You can't blame the OS for this. The application should have had exception handling built into it in a couple of places. It probably should have checked any new entries before committing them to ensure the new data would not introduce such a condition, and the app itself should have had appropriate error handling to prevent a panic/dump when a divide by zero condition was encountered.
If the app was coded by the same people on another platform, the end result would have been the same.
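The two layers of defense described above (check the entry before committing it, and handle the error if it happens anyway) can be sketched roughly as follows. The function names are hypothetical and this has nothing to do with the actual Yorktown application.

```python
def validate_entry(value):
    """First layer: check operator input before it is committed."""
    if value == 0:
        raise ValueError("value must be non-zero")
    return value


def safe_ratio(numerator, denominator, default=None):
    """Second layer: handle the error instead of letting the app die."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        # Report a fallback so the caller can carry on and flag the problem.
        return default
```

Either layer alone would have kept a stray zero from taking the whole application down; having both means a missed check in one place is not fatal.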
Idiot, n. A member of a large and powerful tribe whose influence in human affairs has always been dominant
NASA hasn't ever had a hardware problem. Or a software problem. Ever.
you're full of crap. here are nasa bugs. there were people involved in the process of developing the software (surprise, surprise), but read nasa's admitted "root cause" for the climate orbiter ($$$$) being lost: it's software (not the people) which failed to translate units.
next you're going to tell us software doesn't have bugs, programmers do. bs. if you're going to tell us nasa never has bugs ever, you better give us evidence!
Mariner 1 (which was intended to fly by Venus) went off course and was destroyed over the Atlantic because of an omitted hyphen in the coded guidance program.
"We shall party like the Greeks of old! You know the ones I mean." - HedonismBot
Using a pure O2 environment simplifies things tremendously. The biggest thing is that when the craft is in space, it leaks. Period. You can't stop it. So they need to replace the air as the mission progresses. Running in a pure O2 environment allowed them to operate at a lower pressure, which means that they needed to take up less air with them, which meant a lighter capsule, which is very good.
After the Apollo 1 fire, they switched to a mixed atmosphere environment on the ground, but only replace the leaking O2 in space. So after a while in orbit, they are actually back to a pure O2 environment. This saves the craft from having to carry replacement O2 and N2, while not really risking safety. (Fire doesn't spread well in space due to lack of convection.)
The manufacturer stated that they should not be operated under 53 degrees, and NASA overrode the recommendation and launched anyway. The expected happened.
Actually, one of the engineers who worked for Morton-Thiokol (the group who worked on that particular mission) had discovered the problem and brought it in front of his superiors. He was adamant about not letting them go ahead with the launch, but the administrative board at the firm disregarded his findings and gave the go-ahead to NASA. This wasn't a software problem, but a definite people problem.
http://www.acme.com/jef/netgems/scratch_monkey.html
-calyxa
Decay! Decay! Decay! -Helium
Windows is more of a design disaster than a programming one. I hate windows as much as the next guy, but it's full of things doing the job they're meant to pretty well.
It's just that they're meant to enable scripts to run in arbitrary text files, or default to sharing your documents over the internet, or place documents in weird places. All these were programmed correctly, just designed wrong.
Actually, the switching code was in C and the crash was due to a programmer's apparent misunderstanding of the 'break' statement. See full details at: http://www.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collapse.html
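The C rule involved is that `break` leaves the nearest enclosing loop or `switch`, never an `if`. The sketch below is not the actual switching code; `handle_message`, its arguments, and the step names are hypothetical. It shows how a `break` meant to leave an `if` block instead skips the rest of the `case`:

```c
enum { STEP_A = 1, STEP_B = 2, STEP_C = 4 };

/* Returns a bitmask of the steps that actually ran. */
int handle_message(int type, int ring_empty)
{
    int steps = 0;

    switch (type) {
    case 1:
        if (ring_empty) {
            steps |= STEP_A;
            break;          /* leaves the whole switch, not just the if */
        } else {
            steps |= STEP_B;
        }
        steps |= STEP_C;    /* meant to run on both paths */
        break;
    default:
        break;
    }
    return steps;
}
```

Calling `handle_message(1, 1)` runs only `STEP_A`; the step intended for both paths is silently skipped whenever the `if` branch is taken.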
An Air France A320 (a new design) was doing a low-altitude fly-by at an airshow when the aircraft descended into terrain.
The pilot, since convicted of manslaughter, claims the aircraft reported AGL (above ground level) altitude to be 100', while video shows the aircraft was closer to 30'. There is significant evidence to support this story, such as the apparent swapping of DFDRs and the issuing of Operational Engineering Bulletins to correct problems as explained by the captain of the aircraft.
The captain claims that the throttle by wire system would not respond to increased command, so he retracted them to idle, then advanced them to Takeoff/Go Around (full). The aircraft had crashed by that point, killing 3 aboard.
See this for more.
-twb
http://wwwzenger.informatik.tu-muenchen.de/persons/huckle/bugse.html
It's not as simple as you make it out to be. NASA's Marshall Center applied a significant amount of pressure to Morton Thiokol (the contractors who designed and manufactured the solid rocket boosters) to give them the go-ahead. See, the contractors have absolute veto authority; if they want NASA to not launch, NASA can't do anything about it. However, NASA was under a TON of schedule pressure to launch. (There's even a conspiracy theory that Reagan ordered NASA to launch so he could call McAuliffe in orbit during the State of the Union Address scheduled for that evening, but there's absolutely NO evidence that this is true.) NASA passed off that pressure onto the management of Morton Thiokol, who unfortunately buckled under it.
The poster said they'd never had hardware problems, which I took as classifying problems as EITHER a software problem or a hardware problem.
If a miscommunication among software guys that results in a bug isn't a software problem, then what exactly is a software problem?
The most common kind of internet security failure derives, it would seem, from buffer overflows on the internet.
These buffer overflows invariably arise from unsatisfactory bounds checking -- one of the simplest kinds of bugs -- and are easily exploited these days by script kiddies.
Examples are too numerous to mention.
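For a beginner, the missing check usually looks like an unbounded `strcpy` into a fixed-size array. A minimal sketch of the bounds-checked alternative follows; `copy_checked` is a hypothetical helper, not a standard library function.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copies src into dst only if it fits, including the terminating NUL.
   On failure dst is left as an empty string and false is returned. */
bool copy_checked(char *dst, size_t dst_size, const char *src)
{
    size_t len;

    if (dst == NULL || src == NULL || dst_size == 0)
        return false;

    len = strlen(src);
    if (len >= dst_size) {      /* no room for the data plus the NUL */
        dst[0] = '\0';
        return false;
    }
    memcpy(dst, src, len + 1);  /* len + 1 copies the NUL as well */
    return true;
}
```

The caller has to pass the real size of the destination (for example `sizeof buf`) and act on the return value; the check is only as good as those two habits.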
>>I always believed it was reckless cost cutting by a faceless multi-national
Yes, but more than that, it was a lack of communication between sections. I forget exactly what caused this incident, but it was four or five different problems that all went wrong at once. Had any of them not had problems, they probably would have been closer to a Three-Mile Island scare rather than any actual damage occurring. But two of their safety systems were off-line for maintenance, and a third and fourth were poorly designed. Something like that.
Back when C++ was new, there was an insidious problem with the syntax that never showed up during compilation.
//check for \ //found one, handle path
if(c=='\\')
slashfound=1;
++index;
Code similar to this delayed shipment of a commercial product because it caused serious instability.
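One likely reading of the snippet above: a `//` comment that ends in a backslash splices the next physical line into the comment, so the `if` disappears and the assignment runs unconditionally. The sketch below reproduces that effect in C99 or later; the function names are hypothetical, and the poster's original layout may have differed. GCC can flag this with -Wcomment ("multi-line comment"), but the code still compiles.

```c
#include <stddef.h>

/* Buggy: the // comment below ends in a backslash, so the next physical
   line (the if) is spliced into the comment and never compiled. */
int count_backslashes_buggy(const char *s)
{
    int found = 0;
    for (size_t i = 0; s[i] != '\0'; ++i) {
        // check for \
        if (s[i] == '\\')
            ++found;        /* now runs for every character */
    }
    return found;
}

/* Fixed: the comment no longer ends in a backslash. */
int count_backslashes_fixed(const char *s)
{
    int found = 0;
    for (size_t i = 0; s[i] != '\0'; ++i) {
        /* check for a backslash character */
        if (s[i] == '\\')
            ++found;
    }
    return found;
}
```

For the input `"a/b"` the buggy version counts all three characters, while the fixed version counts none.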
.
I've read in-depth technical analyses of the Apollo fire, and I have an MSc in Physics.
Before that, *no-one* knew that a spark in one place could cause a fire TWO FEET AWAY.
(You get little hot bits of burnt dust floating around in a pure oxygen atmosphere, and they keep themselves hot enough to set something else afire quite a ways away. Of course things are *easier* to set fire to in that atmosphere as well.)
Some of the tips, which may appear obvious to some of us, include:
--- Fox
Usually the story goes something like, well, take your pick ...
I am a ./ reader so I am a geek and so I do know.
...
... and who are you by the way ?
It compiles, it works, so it must be correct.
But
Whether you will be willing to accept what you are going to see is a different question altogether, and of course having a good laugh at others is more fun, yet there is a difference between being just another coder out there and being a developer.
IMHO one ought to aim for the latter and once you have become your harshest critic you are on the right path.
It was a new financial system, and it was a real mess - something like £9m initial cost and £20m due to its flaws. According to Anthony Finkelstein, who's written a very detailed report on the fiasco:
You can read his full report here (pdf) or here (google html version). There are also news reports on the system here and here.
Basically, it was bad management throughout... a classic case of a big software project gone wrong.
About 25 years ago, Washington State Ferries had a new fleet of boats with computer controlled engines. The code included "safety" features to protect the engines and transmissions from abuse.
So, when a ferry was about to crash into a dock, and the captain called for full reverse power, the software would shut the engine down to protect it......and the ferry would crash into the dock.
Horror stories (lost rockets, etc) are certainly attention-getters, but a more useful question might be what kinds of errors got made, regardless of how severe the outcome.
For example, I once helped a newbie employee with a program that was working fine in a simple test case, but was blowing up when it tried to crunch through a production file.
After digging a little, I noticed that she was using recursion in her "GetNextInterestingRecord" routine! The logic was:
1) Get a record
2) See if it's the kind we want
3) If not, Call self
4) return record to main
I'm not sure why she chose to use recursion (too many classroom lectures on "cool" stuff and too little experience with getting useful stuff done?), but the program needed "interesting" records every so often to keep from overflowing the stack.
Clearly recursion should be confined to those problems where it's really needed, and not used just because you can find a way to state the problem using recursion. And even then, you need to think about how big the stack will get, and what sorts of scenarios could cause it to get too big.
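The four steps above map directly onto a loop, which keeps stack use constant however long the run of uninteresting records is. A minimal sketch, assuming an in-memory array as the record source and "negative value" as the test for interesting; both are hypothetical stand-ins for the real file and filter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record source: an array with a read cursor. */
struct record_source {
    const int *records;
    size_t count;
    size_t next;
};

/* Hypothetical test for "interesting": here, any negative value. */
static bool is_interesting(int record)
{
    return record < 0;
}

/* Iterative version: skips uninteresting records in a loop instead of
   calling itself, so no stack frames pile up between matches. */
bool get_next_interesting_record(struct record_source *src, int *out)
{
    while (src->next < src->count) {
        int record = src->records[src->next++];
        if (is_interesting(record)) {
            *out = record;
            return true;
        }
    }
    return false;   /* end of input, nothing interesting left */
}
```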
More recent versions have features like filename completion (I still prefer the way bash handles it), and of course there is the (old) feature that lets you access a command history. And now, finally, there are scroll bars!
About the only good thing I can say about COMMAND.COM was that it didn't crash.
I'm a student at Worcester Polytechnic Institute and I've seen some pretty bad examples of coding here.
The game development club made a Myst-style game involving the campus. When the game first launched it took up about 25 megs of RAM, but within about 5 or so minutes it was using over 200 megs and rising.
Now, it wouldn't have been much of a black eye if the game development club wasn't giving out copies of it during orientation to try to attract new members.
Thanks -- looking back I realized I was too critical. After all, they were doing 1000's of things for the first time, especially life support.
... was this a hardware fault or design failure? Hmm. Well, we're off-topic anyway.
I later looked this up at this archive, which suggests what you said: The key error (in retrospect) was the use of O2 on the ground. Mercury was ground pressurized with plain air. Another criticism I read somewhere is that the astronauts themselves introduced netting and velcro to store things in the cockpit, and these burned very quickly indeed.
ANYWAY
No, because the software worked as the programmer designed it. The communication process of the development was flawed, not the code.
If you were me, you'd be good lookin'. - six string samurai
From the site:
"In 1969, as part of its global empire, Union Carbide Corporation set up its pesticide formulation unit in the northern end of the city of Bhopal in central India. Initially it mixed and packaged pesticides imported from the US but was gradually expanded. In December 1979 its Methyl Iso Cyanate (MIC) plant with an installed capacity of 5000 tonnes went into production.
On the night of December 2, 1984, during routine maintenance operations in the Methyl Iso Cyanate (MIC) plant, at about 9.30 p.m., a large quantity of water entered storage tank no. 610 containing over 60 tonnes of MIC.
This triggered off a runaway reaction resulting in a tremendous increase of temperature and pressure in the tank and 40 tonnes of MIC along with Hydrogen Cyanide and other reaction products burst past the ruptured disc and into the night air of Bhopal at around 12.30 a.m. Safety systems were grossly under-designed and inoperative. Senior factory officials knew of the lethal build-up in the tank at least one hour before the leakage, yet the siren to warn neighbourhood communities was sounded more than one hour after the leak started.
By then, the poisons had enveloped an area of 40 sq.kms. killing thousands of people in its immediate wake. Over 500 thousand suffered from acute breathlessness, pain in the eyes and vomiting as they ran in panic to get away from the poison clouds that hung close to the ground for more than four hours."
Nothing to do with programming errors here that I can see. Sounds more like gross negligence and incompetence to me.
-A.
student of animation and the fine arts
Add power to gain altitude? When you're driving your car, do you press the accelerator to turn, too? Please explain to me, a non-physicist, how increasing power will do anything but make you fly faster in the same direction and at the same altitude you were already flying at. I used to work on airplanes, and if I were a pilot I'd pull back on the yoke to aim the nose of the plane a bit higher to gain altitude. Sure, I'd slow down a bit, but I'd gain altitude, too. If I really wanted to do it right, I'd pull up and increase power ONCE I WAS CLIMBING.
Prevent email address forgery. Publish SPF records for y
There was a video conference where the engineers reeled off reams of evidence that stated that launch below 53F would result in catastrophic failure - they thought it would blow up the moment the SRBs were ignited.
NASA rubbished them, claiming it was a bad presentation (they only had a few hours to prepare), but they could not launch if the engineers said no. The launch had already been delayed by 3 or so days so there was huge pressure from a PR point of view.
The management team said "Take off your engineering hat and put on your management hat", and they then told NASA it was ok to launch.
The senior engineer on that program resigned and now lectures on safety procedures at a university.
Well, except for Mars Polar Lander, where the failure review board determined that the lander crashed because a flag indicating contact with the ground was not initialized to zero prior to the start of the retro-thruster loop. So the flag got set by the shock of deploying the landing legs, never got reset, and caused the thrusters to switch off as soon as they were on.
I guess maybe you forgot about Apollo 13 as well (hardware)? Or the Galileo High Gain Antenna that failed to deploy (hardware)? Or the serious telemetry system problems they had with one of the Voyagers (hardware)? Or the faulty landing bag on one of the Mercury flights (hardware)? (was it Glenn's? I don't remember) Or that funky glitch in the landing computer during Apollo 11 (software)? You know, there's a reason that most space missions tend to be heavy on redundant hardware, and invest a lot of time and effort in fault protection software.
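The Mars Polar Lander failure mode described above, a latched flag that is never cleared before the loop that acts on it, can be sketched in a few lines of C. This is a toy simulation under stated assumptions, not flight code: `engine_cutoff_step`, the step numbering, and the `clear_flag_on_enable` switch are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simulates a descent. leg_sensor[i] is the touchdown sensor reading at
   step i; engine cutoff is only permitted from enable_step onward.
   Returns the step at which the engines are cut, or n if they never are. */
size_t engine_cutoff_step(const bool *leg_sensor, size_t n,
                          size_t enable_step, bool clear_flag_on_enable)
{
    bool touchdown = false;   /* latched: once set it stays set */

    for (size_t i = 0; i < n; ++i) {
        if (clear_flag_on_enable && i == enable_step)
            touchdown = false;          /* discard earlier transients */
        if (leg_sensor[i])
            touchdown = true;
        if (i >= enable_step && touchdown)
            return i;
    }
    return n;
}
```

With a spurious reading at step 1 (leg deployment) and a real touchdown at step 5, the version that does not clear the flag cuts the engines the moment cutoff is enabled; clearing the flag first delays cutoff until the real contact.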
Every problem can be directly tied to one specific person being a fscking moron.
Well yeah, but that's the case with a lot of bugs, isn't it? Mistakes tend to be people issues.
The closest you could come is that Mars probe that crashed because of mismatched units. And that was just poor communication among the software guys.
You are at least correct about that - the problem was not a software issue. Lockheed Martin Astronautics was on contract to supply everything to NASA in SI units (which is what NASA uses for everything). LMA - or at least the part that caused this problem - uses English (Imperial) units internally, and neglected to perform the appropriate conversion before they sent the data on to NASA.
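Whichever side of the software-or-people argument you take, code can make this kind of mismatch harder to commit. One sketch, in C, wraps each unit in its own struct type so that passing the wrong one is a compile error. The type and function names are hypothetical, and the choice of impulse (force times time) as the quantity is an assumption based on common accounts of the loss, not something stated in this thread.

```c
/* Distinct struct types so the compiler rejects mixing the two units. */
struct newton_seconds      { double value; };
struct pound_force_seconds { double value; };

/* 1 lbf is defined as exactly 4.4482216152605 N. */
#define NEWTONS_PER_POUND_FORCE 4.4482216152605

/* The only way to obtain an SI value from an Imperial one. */
struct newton_seconds to_newton_seconds(struct pound_force_seconds impulse)
{
    struct newton_seconds si = { impulse.value * NEWTONS_PER_POUND_FORCE };
    return si;
}

/* Hypothetical consumer that accepts SI impulse values only. */
double accumulate_impulse(double total_ns, struct newton_seconds delta)
{
    return total_ns + delta.value;
}
```

Passing a `struct pound_force_seconds` straight to `accumulate_impulse` will not compile, so the conversion cannot be forgotten silently.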
- [...] Mars probe that crashed because of mismatched units. And that was just poor communication among the software guys.
So if it's not a bug, it must be a feature? Have an article on the guys who write the stuff. They're damn good, but they say themselves their programs contain errors: "the last three versions of the program [...] had just one error each. The last 11 versions of this software had a total of 17 errors." Apparently never caused a problem, but not bug-free.
Then there was the Canadarm2 issue. Or wasn't that a bug either?
yes, we have no bananas
Actually, it was more a case of bad presentation skills. Check out Edward Tufte's "Visual Explanations" for a complete rundown, but basically, IIRC, one guy (or group) knew of the danger and spoke up about it. His supervisors said "Okay, write me a report detailing exactly why we should delay the launch." He did, and it was a very confusing report. So they went on with it, and, well, you know the rest.
c-hack.com |
no, i was just saying that we do have an example of what happens when you lose several thousand consumers in one fell swoop, and the aftermath of such an event. not a whole lot happened after bhopal. i dont feel like researching much, but i'm willing to bet the only changes were stricter regulations on plants like that. after the 9/11 attacks our entire country changed (not for the better)
"Sic Semper Tyrannosaurus Rex."
maybe you're thinking of "pulling up" as in "moving your mouse up".... i think you're being a bit picky here. the point was that they'd do whatever it takes to gain altitude. i think what puppetman said is probably still true regardless.
or do you just object to the 'misconception' that apparently everyone has that "pulling up" means "gaining altitude"? but think about it...if you were flying, someone yelled, "Pull up!", would you really head into a nose-dive?...
I was contemplating the immortal words of Socrates, who said, "I drank what?"
http://wwwzenger.informatik.tu-muenchen.de/persons/huckle/bugse.html
I hate this forum. It makes me sig.
>>looking back I realized I was too critical
It's easy to be critical; hindsight is 20/20. I think NASA goofed badly with the Challenger, both in not requiring a redesign of the joint and by launching in the cold weather. I did a report on this for which I read through the Presidential report on the disaster, and NASA goofed on both these counts and should have seen the incredible danger in both these decisions. In both, people within NASA and Thiokol were continually raising red flags over the decisions, but nothing was done.
On the other hand, the Apollo 13 incident was almost unforeseeable. In retrospect, it should have been prevented, since it was again caused by a design change that wasn't thoroughly implemented. (It was caused by a change in voltage, causing a thermostatic switch made for the older, lower voltage to fuse shut. This then didn't turn off the heating elements, which led to the insulation on the wires inside the O2 tank melting off. It then sparked during a cryostir.) However, the Apollo system was so complex that it's easy to see how the change was overlooked for the tanks, and it's possible to forgive NASA for it. The rest of the chain in the problem was pretty much unpreventable and unpredictable.
Apollo 1 was somewhere in between. While the problems related to the fire (the inward opening hatch, the pure O2 environment) were forgivable, the whole capsule had problems which people had been complaining about. Gus Grissom had once hung a lemon from the top of the craft to show what he felt about it. So in general, NASA should have been more thorough with their safety concerns.
Yeah, funny that in the worst error case possible, the programmer decided to choose the value that was most frequently legitimate. When you want to indicate an error condition, it's awfully convenient to have exceptions to throw, as we do in Java. I can't tell you how many times I've reviewed Java code that returns null or 0 or -1 when it should be throwing an exception. At least, however, it was returning SOMEthing that looked like an error value....not the most typically returned value!!!
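In a language without exceptions the same idea still applies: keep the error signal out of the value's own range. A small C sketch, with hypothetical names (`read_altitude`, `read_status`), returns a status code and delivers the reading through a pointer, so that 0, -1, or any other legitimate reading can never double as "error".

```c
#include <stdbool.h>

enum read_status { READ_OK, READ_FAILED };

/* Hypothetical sensor read: the status travels separately from the
   value, so no legitimate reading can be mistaken for an error. */
enum read_status read_altitude(bool sensor_ok, int raw_feet,
                               int *altitude_feet)
{
    if (!sensor_ok || !altitude_feet)
        return READ_FAILED;     /* output is left untouched on failure */
    *altitude_feet = raw_feet;
    return READ_OK;
}
```

The caller is expected to check the status before touching the value; a reading of 0 with `READ_OK` is a real zero, not a disguised failure.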
I was contemplating the immortal words of Socrates, who said, "I drank what?"
A period in the wrong place and they steal too much money?? Who can leave that off their list?
Carpe meam simiam!
When something like this happens, it's little more than an embarrassing public relations problem. If the news can't be completely suppressed through advertising, perhaps it can be kept off the evening news and relegated to the back pages. It requires a well-coordinated PR firm, but hey that's what they're around for.
Sure, a few independent news agencies might pick it up and make a big deal about it - until someone goes whaling or starts cutting down redwoods. Few people pay much attention to the independent media anyway. Joe Sixpack doesn't subscribe to The Progressive.
On the local front, shut down the plant, and evacuate your American/European workers. Split them up and transfer them around. If someone makes noise, force them to sign an NDA for their severance packages. Spread liberal bribes on the local front, write the whole venture off, and wait for the hubbub to die down. If you want to stay in the region and resume operations, do so under the umbrella of a subsidiary. If it's too risky, simply relocate to another third-world region. It's not like there's a limited supply.
Unless you stay in the region, you really don't have to worry much about the local population. They're too poor to pursue legal action or be a security threat.
Besides, it's not as if they're white Christians, is it?
</sarcasm>
The quintessential book about stories like this is called "Set Phasers on Stun" by Steven M. Casey (ISBN 0963617885). If you're interested in tales of this sort, this should be your starting point.
Actually, the real problem has nothing to do with whether they use pure oxygen, or mixed oxygen and nitrogen. The problem has to do with the CONCENTRATION of oxygen. If you need to be in a full pressure environment, then yes, you'd need to dilute the oxygen with about 4 times as much nitrogen. But there's a simpler solution. Just use a pure oxygen atmosphere at ONE-FIFTH OF THE PRESSURE. The concentration of oxygen would remain the same, and therefore the flammability of the atmosphere would remain the same.
The problem is concentration, not purity.
wouldn't the whole microsoft experience be a programming error. basically they have forgone security for convenience.
we should learn from them. by allowing things like scriptable macros in office, embedded executables in IE, and no user permissions they open the door to trouble. the prevalence of all the viruses, trojans, etc., that are so pervasive should be example of how not to do things.
let m$ be the how-not-to in CS and the how-to in MBA school.
My problem? I was perfectly gruntled, until some numbnuts came by and dissed me.
Bottom line: that stuff about the floating point error in the PAC-2 system looks neat on paper but it's not at all clear that the faulty calculation was responsible for the loss of life.
GMD
watch this
Here are some of the best examples of windows crashing on high visibility systems that are relied upon:
in the street
At the airport
at the atm
on CNN
At disneyland
On your phone
In an airplane
At the bus stop
I'm not sure when this happened, but I was told this one in my Software Ethics class.
Montgomery Ward (a catalog company) lost a warehouse for 2 years because of a software bug. The workers at the warehouse came to work each day, and no boxes came in or left the warehouse so they just made the existing stock as orderly as possible.
The bug was discovered and the warehouse went back into operation, but it had merchandise that was 2 years out of date!
-Bob
I am the penguin that codes in the night.
That's not entirely true. Adding power will increase your altitude, but pulling up will too. When you pull up, you trade speed for altitude. In other words, you'll go higher but your plane will be going slower. Eventually you aren't going fast enough to maintain level flight, so you have to add power or stop trying to go higher. In some cases you're right though: if you are already only going fast enough to maintain level flight, pulling back on the stick will slow you down and decrease your altitude, but this isn't always true. As for the person who didn't understand how adding power increases altitude: when you go faster, you increase the lift coming from your wings (since lift is a function of speed and angle of attack), so there is a net upward force on the aircraft, causing it to go upwards.
Well, if you apply more forward thrust, you increase the speed differential of the air flowing over the airfoil... Bernoulli should say that this increases your lift and so you go up.
The Mongrel Dogs Who Teach
How about this:
:-)
if altimeter1 not working
{
if altimeter2 not working
{
while (1)
{
height = 1;
while (height < 99999)
{
height++;
}}}} (sorry for the bad code, but the lameness filter got me)
There, a nice spinny altimeter.
Microsoft Outlook's scripting support.
Never, ever lose a file again. Ever.
How about Y2K. Have I just not read down far enough to see this one mentioned?
my cs teacher told me this one back in college...he said on one of the first runs of the f-16 (or maybe another one of the computer controlled fighters in the air force) they were flying and everything worked just fine. however they took it across the equator and the plane flipped upside down. so the pilot corrected it and everything went back to normal. then he flies across the equator again and it flips.
so they took a close look at the software, and there was a bug in their sin function so that when they went across the equator the angle changed from positive to negative and the sin function didn't have the negative incorporated. so basically when the plane went over the equator it thought it was upside down and corrected itself by flipping itself upside down.
i think it's a funny example of a stupid mistake possibly making a catastrophe. i've never seen this mentioned elsewhere, so i'm not too sure about this. but i do trust the cs prof who told me, before coming to my school he did a bunch of government contract work.
I have a friend who works at a medical lab.
They got a new machine, the machine worked fine, but their computer system ate all the results the first time they used it.
Not a big issue though, there's always some sample left over and the tests could be done again on the old machines.
Once and for all: Ariane V was not a programming error. Yes, it was an error in a program that caused the accident, but it was NOT the fault of a programmer. Why? The system in use (I think it was the inertia compensation?) was designed and programmed for the Ariane IV, and the variable that overflowed was explicitly tested to stay in bounds on Ariane IV. The funny fact is that they did have overflow checking in the system they used; they explicitly turned it off for the variable in question to gain speed, since they had tested and calculated that it would always stay in bounds. The real error was a management one: it occurred when the system was transferred as is to the Ariane V without recalculation or reprogramming for the new conditions.
And the second important thing: it's not okay to blame any one programmer/person. Normally it's a whole chain of things going wrong that leads to an accident, not a single event. Even if a programmer makes an error in an important system like that, it should be reviewed, and quality assurance also failed completely there. For example, with the Ariane V, why was the system not tested for Ariane V conditions?
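The kind of check that was switched off can be sketched in C. Widely published accounts describe the failing operation as a conversion from a 64-bit floating-point value to a 16-bit signed integer; that detail is not stated in the comment above, and `to_int16_checked` is a hypothetical helper, not the original (Ada) code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Converts a double to int16_t only when the value fits; converting an
   out-of-range value directly would be undefined behaviour in C. */
bool to_int16_checked(double value, int16_t *out)
{
    /* NaN fails both comparisons, so it is rejected here as well. */
    if (!out || !(value >= INT16_MIN && value <= INT16_MAX))
        return false;
    *out = (int16_t)value;
    return true;
}
```

The check costs two comparisons per conversion. Removing it is defensible only while the proof that the value stays in range still holds, which is exactly what changed when the code moved to a new vehicle.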
--
Karma 50, and all I got was this lousy T-Shirt.
Was working for a small ISP. Sitting at work developing a script to blank the accounts off our old mail server (outsourced) for when our new mail server is completely online and ready to go. It's done; I remove my debugging code and the limits I had placed (I had limited it to work with only 2 test accounts). Congratulating myself on a job well done, I head to the hall to grab myself a Coke. I come back and my boss is at my comp. Now, the program was written in VC++, so the 'play' button is pretty obvious and he's seen me use it before. The idiot wanted to see what I was working on and ran it, blanking all accounts off of the mail server. Took us 3 days to get the outsourcing company to restore from a backup (one of the reasons we were co-locating our own), and even then all mail received after the backup (the night before) was gone, of course. I just about strangled my boss. On the upside, he never touched my workstation again.
Jesus saves, everyone else takes full damage from the fireball.
Slashdot Math!
cause we all know 50 + 1 - 1 = 49!
Ok, that was lame, go ahead and mod me down...
| - | - |
That was a hardware problem, but not a computer hardware problem.
of the Tacoma Narrows bridge falling. The *fault* was with the design, and hence, the designers.
An extended bolt puncturing the gas tank during a rear end collision was the *cause* of Ford Pintos exploding. The *fault* was with the design, and hence, the designers.
Both of these items could have been claimed to be perfectly free of design flaws while being used as "intended."
This argument did not help the designers in not being found liable for their design flaws.
The divide by zero error was the *cause* of the operating system's failure. The *fault* was with the operating system. The *operating system* crashed. An operating system failure is *always* the fault of the operating system, and hence, its designers.
Read any textbook on the design of operating systems and in the first page or two you find some sort of statement along the line of, " A faulty app should never cause the operating system to fail." This is correct design.
Let me repeat. If an app fails, it is the fault of the app. If the operating system fails, no matter what an app has done, it is the fault of the operating system. An operating system must *assume* apps badly written by complete incompetents.
It doesn't matter what operating system. Windows, Linux, Mac or just the beads on your abacus.
* It is the responsibility of the operating system not to fail.*
The fact that such failures can be explained away as the fault of the app by people who should know better makes me grieve for the state of engineering these days. It can only result in products being produced with greater and greater "craposity" factors eventually resulting in a culture of complete "crapitude."
KFG
NASA had a problem with the software during the descent of Apollo 11, when the computer was overloaded. Also, they had hardware problems in Galileo. Not in the computer, but in the peripheral comm unit or antenna.
Here's one I heard about a few years ago:
The Australian park service (or equivalent org) needed a simulation program to train helicopter pilots. They had heard that the military had a sim that would probably fit their needs, and talked to the company that produced it. When they started testing the sim, everything seemed to go fine, until they flew by a gathering (herd? flock? posse?) of kangaroos. Imagine their surprise when these lovely beasts made tracks for the nearest ridgeline and started launching shoulder mounted SAMs at the chopper.
Oops. The company changed the graphic, but not the behavior, for the infantry to kangaroos.
My memory is fuzzy on the exact details, but you should get the picture -- Kangaroos with Guns (sounds like a Fox special, eh?)
At the very least, add a URL with the search terms you used to produce the related information. Googling is a bit of an art, one I'm not an expert at, and it's always good to see a more experienced try.
Blar.
After all, it is better to learn from the mistakes of others than from your own, right?
Not better, but more comfortable. You will generally remember better what you learn from making your own mistakes.
Not that I would discourage your approach (or curiosity).
You're all sorta right.. here is one of my favorite aviation pages. It'll tell you more than you ever wanted to know about airplane physics (from a pilot's point of view). Chapter 1 covers these altitude/speed/power concepts...
Let me guess... you're at UMIST and have just been given the "Why Systems Fail" coursework...
:)
If my hunch is right (based on Ariane 5 and Therac-25 being the example situations given in the material) using slashdot is a nice easy way to get your research done... wish I'd thought of it
You can read about it from James Gosling's home page (also has info on Ariane 5).
Luckily the engineers were able to upload a patch to Mars. That's remote debugging/patching for you :-)
i'm glider pilot in my free time, and i can tell you:
;))
to increase altitude, you pull the control column, i.e. you turn the nose of the plane UP. but you can only gain altitude of this if your plane has got enough kinetic energy which can be converted into height.
if you don't have enough kinetic energy, your plane stalls, and it won't work (to say it simple). so you need to add energy by giving more gas (if you've got an engine - in my glider, i don't have one
The post that said there has been no hardware or software malfunctions did not state that it had to be a computer malfunction.
Back issues of "Communications of the ACM" are a gold mine for such blunders of the art. Most issues have a back-page column, "Inside Risks", usually written by Peter Neumann, though various others have contributed. Each column usually covers one theme, since the subject material is so broad and seemingly unending.
"Flight instruments don't lie"
... it has an electronic AOA.
... no matter what the real AOA was.
... Fortunately, it was expensive and not lethal.
First, BEFORE YOU LEAVE THE GROUND, pilots are taught that instruments don't lie. Specifically, when the human inner ear is placed in flight, things go wrong (the inner ear canals are static, not dynamic, devices; the fluid has no dampening or rate sensors). When there is no external reference, the inner ear canals adjust to the eye's visual presentation. It's called the 'leans.' Bad joo-joo. Many a perfectly good aircraft has been flown into the ground because the pilot believed his ears and eyes and not his instruments.
Second, IN FLIGHT, angle-of-attack (AOA) is a spectacular indicator of where your airfoil exists within (or outside) the flight envelope for your aircraft. Inside the flight envelope, you can seek best range (mpg) or best endurance (loiter) or best climb.
In most aircraft, the angle-of-attack indicator is a manual instrument (on the skin is a sensor which looks like a big euro-style handle and it runs to an indicator in the cockpit).
Many pilots are correctly taught to 'fly' the angle-of-attack.
Third, ON THE GROUND, when you land, you use the aircraft shape as an airbrake. You hold the aircraft nose off the ground as long as possible to create drag.
Fourth, ON THE GROUND, when you land, you do not want to hold the aircraft nose too far off the ground or the tail will scrape the runway, your fitness report will reflect it, and you'll be the butt of bad jokes at Snopes for eternity.
The AOA is used to assist in the performance of aerodynamic braking. The aircraft performance manual publishes the tried and true range of AOAs for aerodynamic braking. [It also indicates when too much AOA will ding the aircraft.]
Aerodynamic braking is part art and part science and requires accurate instruments.
Enter the F-16
F-16 pilots were taught to fly the flight direction indicators to land.
However, many old and new pilots fell back on the old AOA once the wheels touched the ground to do aerodynamic braking.
Suddenly, F-16 tails were scraping along the runway at an alarming (and expensive) rate.
[As an aside, the problem was probably ignored until a senior officer ground off a few inches of aluminum THEN there was a problem.]
The programmers who wrote the AOA routines were rightly told that the AOA is used in flight. So, when the AOA detected that the aircraft had placed weight on the wheels (weight-on-wheels - WOW), it was programmed to quit working. Unfortunately, it kept the last AOA reading.
Pilot flies, pilot lands, pilot believes instruments, pilot scrapes multi-million dollar aircraft's tail along runway.
The programming solution was simple: when there was WOW, fade the AOA.
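To make the difference concrete, here's a rough sketch in C. It is not the actual F-16 code; the struct, the function names, and the reading of "fade" as "mark the reading invalid" are all my own assumptions for illustration.

```c
#include <stdbool.h>

/* Hypothetical display state for an angle-of-attack (AOA) indicator. */
typedef struct {
    double shown_deg;  /* value currently drawn on the indicator */
    bool   valid;      /* false means the reading is blanked or flagged */
} aoa_display;

/* Behaviour as described above: on weight-on-wheels (WOW) the routine
   stops updating, so the last airborne reading stays on screen. */
void aoa_update_frozen(aoa_display *d, double sensed_deg, bool wow)
{
    if (wow)
        return;                 /* stale value remains, still marked valid */
    d->shown_deg = sensed_deg;
    d->valid = true;
}

/* One possible reading of the fix: on WOW, mark the reading invalid so
   the pilot cannot mistake it for live data. */
void aoa_update_faded(aoa_display *d, double sensed_deg, bool wow)
{
    if (wow) {
        d->valid = false;
        return;
    }
    d->shown_deg = sensed_deg;
    d->valid = true;
}
```

With the first version, a pilot who touches down at 11 degrees and then raises the nose still sees 11 degrees, marked as good data. With the second, the display at least admits it has stopped measuring.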
This was another case when contracts pit spec wording against spec intent against functional application and understanding of how it's supposed to work.
"Why did they call you 'sparky' and why are you driving school buses in North Topeka?"
A "large quantity of water" entered the storage tank because an employee who had just been fired dropped a hose into it out of spite (he didn't know what would happen, he just wanted to ruin something). Yes, the safety precautions were subpar, but when someone with legitimate access wants to destroy something it's pretty hard to prevent.
And yes, this has nothing to do with programming error :).
The Pathfinder mission was widely regarded as a huge success. Several days into the mission the system began restarting intermittently. The bug was located and corrected, and is a great lesson in real-time operating systems, priority-based preemptive scheduling, priority inversion (the bug), and priority inheritance (the fix).
There's lots of information on the web, and it makes a good read, even if you're not into operating systems.
For example:
What really happened on Mars
Introduction to Priority Inversion
Clearly one of Scientology's biggest programming blunders.
C Traps and Pitfalls
Programming Pearls
More Programming Pearls.
In particular, there was a management decision that the software for the previous model would be used, even though the design criteria for the new model were different. The Ariane 5's flight profile produced a horizontal velocity value that overflowed a variable in the program written for Ariane 4.
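The general failure mode is squeezing a wide floating-point value into a narrow integer without checking that it fits. Here is a minimal sketch of a checked conversion in C. The real flight software was not C, and the function name and the 16-bit target are assumptions for the sake of the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Convert a double to a 16-bit signed integer only if it fits.
   Returns false (and leaves *out untouched) when the value is out of
   range or NaN, so the caller has to decide what to do about it. */
bool to_int16_checked(double value, int16_t *out)
{
    if (!(value >= INT16_MIN && value <= INT16_MAX))
        return false;           /* also rejects NaN: both comparisons fail */
    *out = (int16_t)value;      /* in range, so this truncates toward zero */
    return true;
}
```

A value that was always small on the old vehicle passes silently; the same code on a faster vehicle hits the `false` branch, and what happens next is a design decision rather than an accident.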
My college program involved several co-op work terms, in which students would be placed in paying jobs as part of their education. One of our teachers regaled us with stories of co-op placements gone bad. In one case, a student at the local lottery office (where the prof used to work) sat on a keyboard and apparently managed to butt-type a sequence of keys that knocked out the lottery system across the province for a while (apparently while trying to flirt with a female co-worker).
From what I remember, he was blacklisted for quite a while with local employers, probably didn't get a date from the female co-worker either...
The most likely way for the world to be destroyed, most experts agree, is by accident. That's where we come in; we're computer professionals. We cause accidents.
-- Nathaniel Borenstein
++ Say to Elrond "Hello.".
Elrond says "No.". Elrond gives you some lunch.
Wow, this is so pertinent. At my former employer, I was able to watch a group of talented but misguided programmers build an entire CORBA framework that provided absolutely no functionality and met none of the company's product needs, in effect wasting 2 years writing code that couldn't be sold. Today the company is spiralling into oblivion, but it's really sad that management and even the programmers themselves never saw the light until it was too late.
The truth is, it wasn't the code that was bad, it was the mindset. People just couldn't wrap their minds around the company's products (cuz they weren't sexy, after all) enough to focus, and decided they would invent/build their own thing. I left 6 months before the end: executive staff fired, and half the company laid off after it had already been cut down about 4 times over the previous year. I feel bad for the guys who wrote the code though. They really thought they were building something, but at the end of the day their code is going on the shelf.
Code gone wrong... Management gone wrong... It's the new era, now, and hopefully we've all learned something.
All the world's a stage, and I'm the guy running the lights...
In my Computing Ethics class, mention was made of a problem (can't find a source, sorry), where a pipeline had computer controlled valves. There was something like a T-valve, where to switch flow, one valve was closed, and another opened. Since the valves worked slowly, it didn't really matter if you opened one before you closed the other or vice versa. Until the process (which was running as low priority) was interrupted after closing one, and blew out a huge section of pipe.
Also, you might be interested in a book called Normal Accidents that documents similar problems with all sorts of technology. Preventing software problems is good, but preventing entire systems of accidents is better.
The book that immediately comes to mind is "The Day the Phones Stopped" by Leonard Lee, ISBN 1-55611-264-5, 1991. I believe it is out of print.
In 12 chapters it covers the most spectacular computer & technology failures in history, including but not limited to the aforementioned Therac-25 during 1986, the Air Canada Flight 143 incident, fly-by-wire systems, the legendary AT&T switching disaster of Jan 15, 1990, several air controller disasters, and others.
The book doesn't solely focus on software per se, but technology in general, and complexity in particular. If you can get past the sensationalist style, it can be a very humbling reminder that failure is a weed grown best on a bed of complexity.
No computation without defenestration!
When I took 6.001 here at MIT, they told a story which I haven't seen mentioned in this thread. Maybe it's real, maybe it isn't, but it's entertaining. Apparently the Navy was testing a new type of torpedo, but some of the code wasn't finished yet. Eventually, they wanted the torpedo to be smart enough to turn around and try again if it missed its target. That part wasn't done yet, but they wanted to do a few live tests to determine the weapon's effectiveness. As a hack, they put in a simple if-then -- if the internal compass ever turned more than 180 degrees, the torpedo would self-destruct. Several torpedoes with this code were loaded onto a ship and taken out to sea. During the trials, a few of them failed to leave the tubes. What do you do after a hard day of torpedo testing? Well, you turn around and go home... Boom!
--Dan
That's kind of the point. Look at the different reactions across the world to Bhopal and 11 September. Why is one so much worse than the other? Maybe because they were consumers? or Americans? but the original poster's comments are right on the spot.
Pessimism of the intellect, optimism of the will! - Antonio Gramsci.
A bug in a factory PLC program allowed a machine to start when a metallic object (such as a wedding ring) went in front of a sensor.
Later, a program modification allowed an air cylinder to extend while the machine was turned off for maintenance. The guy jumped out of the way in time, but let us know about it. (This was before lockout/tagout.)
Bottom line - a bug in a PC program typically results in data damage. A PLC bug can literally smash someone's head!
unless you're pointing at the ground. and if you've ever been in an acrobatic plane (such as a jet fighter) then you'll know it's often extremely hard to know intuitively which way is up - especially when you're in the clouds.
If you want to see programming gone wrong I can show you the assignments turned in for the class I TA for. There is some really funny stuff in there.
word.
Could we have a new Slashdot category entitled Ask Slashdot To Do My Research/Homework For Me? Then I could mark this category unread and avoid some annoyance.
There is so much information readily available on the subject of software failures online and in scientific and popular publications. (See other responses to this question for examples.) IMHO, the questioner should go look for the answer to this kind of question directly before bugging the entire Slashdot audience; the editors should enforce this policy.
A good place to find lots of accidents, and the ethics behind them, is Online Ethics.
This is one of the sites that we use for the Engineering Ethics class at CWRU.
Some cases can be found here
Also:
Three Mile Island
Challenger Disaster
Here are a few more Computing Cases
Also, an excellent write-up of the disaster that didn't happen. The Citicorp tower in New York might have fallen if it wasn't for fast work by the engineers.
Citicorp tower
-Foose
The system did not collapse per se but progressively became bogged down by a series of poor design issues and implementation issues.
What happened was there was a memory leak, in that not all the memory used when a call was processed was released. This meant that each call chewed up a small part of core.
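For anyone who hasn't met this kind of leak, here's a toy sketch in C. The real dispatch system's data structures aren't described here, so the `call_record` type and the allocation counter are purely hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-call record that owns a heap-allocated string. */
typedef struct {
    char *address;
} call_record;

static int live_allocations = 0;   /* crude counter, for illustration only */

call_record *call_open(const char *address)
{
    call_record *c = malloc(sizeof *c);
    if (c == NULL)
        return NULL;
    c->address = malloc(strlen(address) + 1);
    if (c->address == NULL) {
        free(c);
        return NULL;
    }
    strcpy(c->address, address);   /* buffer was sized for this string */
    live_allocations += 2;
    return c;
}

/* Leaky: frees the record but forgets the string it owns. Every call
   handled this way strands a few bytes until the process exits. */
void call_close_leaky(call_record *c)
{
    free(c);
    live_allocations -= 1;
}

/* Correct: release everything the record owns, then the record. */
void call_close(call_record *c)
{
    if (c == NULL)
        return;
    free(c->address);
    free(c);
    live_allocations -= 2;
}
```

A few bytes per call is invisible in testing and fatal over a long, busy day, which is roughly the pattern described above.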
As the day wore on, this loss of memory started to make the system run slower, and created more calls as users started to worry about the non-show of the ambulance.
Meanwhile, back at the control centre, the operators started getting blasted by messages about over-due ambulances, and other system warnings. They were spending time simply dismissing Error dialogues.
By the end of the day, they were still dealing with the emergency issues notified at 12.00.
Of course, in the inquiry, there were many different management and design issues to be addressed, including the reliability and scalability of the software. [It was a Visual Basic program.]
I have seen a number of instances personally; most of these tend to be ignored by management keen to see the system up and running. The most common excuses for dismissing problems are "teething problems" and "Luddism".
In practice, the real issue here is the UI. Not so much "flash chrome", but that the buttons and so forth will actually do what the user expects them to do. The user must be able to understand how to process and correct errors in relation to the application data itself. That is, if I enter 1200, and I mean 1130, I should be able to correct that.
The other disaster happening out there is that the program must be useful to the operator. So apart from entering data, the operator must be able to extract useful information from it. What the back end does does not really matter.
For example, a clerk who has to enter data on the screen each sale, in addition to operating the till, would be reluctant to use it. On the other hand, if the program is part of the till operation, and it provides information on how much stock is left, the clerk is more accepting of the change.
Implementing a system is not about plonking a pc with a program on a user's desk. It's about a user process. Users are looking for outcomes, not process. So if you want to go to a shop, you want to buy something, and the clerk wants to sell it to you. All the rest is administrivia.
Software design is important. So is user training.
OS/2 - because choice is a terrible thing to waste.
I think I've recommended this book several times on Slashdot. Simply put, THE collection of computing related horror stories.
http://www.amazon.com/exec/obidos/tg/detail/-/020155805X/qid=1035769692/sr=8-13/ref=sr_8_13/104-4078673-1863905?v=glance&n=507846
I swear by MacOS X. Although I used to swear *at* MacOS 9...
The NT crash brought down every other NT box on the ship, not good news on a ship powered entirely by NT, is it? (-:
I have a friend (who still works for IBM) who was learning to program on mainframes, ran his app, killed it after 2.5 seconds, and went down to collect his print run. One and a half boxes of paper. He thought that was expensive... but consider running an app on your smartship while it's manoeuvring in close quarters, and bringing down NT. One and a half boxes of sailors. A bit expensive...
Got time? Spend some of it coding or testing
One of our techs making modifications to the logon scripts accidentally fatfingered the word REM.
Instead of REM *******.*** John Doe Modified script on xx/xx/xx
it read REN *******.*** John Doe Modified script on xx/xx/xx
As the clients logged on to the domain, random files would be renamed to the word John.
We thought at first that we were experiencing a new, unknown virus. Until the tech who made the error stepped up to the plate and admitted that he had done it....
SuperGlueBooger
I would love to know if it was the OS that failed, or the applications running on the OS. The data which caused the panic was stored in the database. It stands to reason that every time a restart of the application was attempted, the same error condition would occur because the bad data was the cause, the crash was the effect. The OS could have been fine, it was the application which provided the control systems for the ship, not the OS itself.
I've had a number of experiences with software where I have changed the config or used data that was supposed to work, but the app would segfault and dump when reading it (gotta love it when things fail at 3am and you don't know why). The OS never failed, the app I was looking to provide the services did. After determining there was a problem, it was then a matter of tracking what caused the problem, and correcting it.
I can hypothesise all I want, but I can't find anywhere that says definitively whether the OS actually crashed, or whether the app did (repeatedly). Even if the program did take down the OS (and I think that should NEVER happen, but it does), I would assign a lot of blame to the app, because to me it's common sense to check for that WHENEVER you have a divide operation.
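By "check for that" I mean something as dull as the following sketch in C. The function name is made up, and of course I have no idea what the ship's software actually looked like.

```c
#include <limits.h>
#include <stdbool.h>

/* Divide only when the result is well defined; report failure otherwise.
   *result is left untouched on failure. */
bool safe_divide(long numerator, long denominator, long *result)
{
    if (denominator == 0)
        return false;                   /* the classic case */
    if (numerator == LONG_MIN && denominator == -1)
        return false;                   /* quotient would overflow */
    *result = numerator / denominator;
    return true;
}
```

The caller then has to handle the `false` case explicitly (reject the record, use a default, raise an alarm) instead of letting bad data in a database field take the whole application down.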
That said, because it was NT, the system was probably written in VB which would explain everything ;)
Idiot, n. A member of a large and powerful tribe whose influence in human affairs has always been dominant
Like the sig -- is it a quote from something, or just something you made up?
Karma: Bored. (Thinking about resurrecting the "Anyone else is an imposter" joke.)
It must also be expecting programs written by incomplete incompetents, i.e., people who are clever enough to get most of the coding right, then put in a weird twist on the last call that guts the OS. Programs written by complete incompetents are often disqualified (GPF) before they get so far.
Got time? Spend some of it coding or testing
That, and forgetting a semi-colon somewhere :)
Idiot, n. A member of a large and powerful tribe whose influence in human affairs has always been dominant
A few years back I was using a VAX/VMS account and had a few programs that allowed users to log various activities. The problem with the system was that the logic of the program relied on the users writing to the file then closing the file handle with the "close" command. One of the users had a friend with the last name "Close". In your mail you could "assign" names to be short handles for e-mail addresses. Now the system was implemented for a clock-in/clock-out system. And no one could write to the file while it was still "open". So... I eventually figured it out. It took a bit longer than finding the kid who had assigned a friend to be "IF" after the interactive fiction board he ran.
--Bucktug
I had a flame... but she had a fire.
The book "Computer Snafus - Crashes, Errors, Failures, Foul-Ups, Goofs, Glitches, and Other Malfunctions that Cause Computers to go Awry" by Herman McDaniel should be just what you want.
I don't care what anybody says, Emacs is the best OS out there, hands down.
FreeBSD for the impatient.
All of the following is "IIRC":
It's hard to fault the designers of the Tacoma Narrows bridge. By all available data at the time, the design was brilliant. The narrows were famous for their insanely gusty winds, going from 20 to 80 knots at the drop of a hat, and the bridge was designed with this in mind. It was designed to handle severely high winds, as well as dramatic sudden changes in wind speed. And it performed brilliantly. Someone might have wondered, what will happen if there is a strong, unusually steady wind, in particular ~40 knots for several hours? But it would be quite a stretch to expect them to figure out that air vortices off the cables would then cause them to oscillate at a frequency that happened to be resonant with a particular harmonic of the roadbed, and that this would cause self-reinforcing torsion waves, leading to the bridge's collapse.
I learned harmonic motion from a rather distinguished physics prof who told a story about a grad student who came to him wanting to switch programs from engineering to theoretical physics. When he asked him why, the student explained that for the previous week he had had the same nightmare every night: that he designed the Tacoma Narrows Bridge.
To come somewhat back on topic, here's what I have learned from famous examples of programming gone awry: I am not interested in a job where anyone's life depends on my code working correctly. I respect those who do write such code, and while I hope they are as smart as me, I pray they are willing to pay more attention to detail!
got a hot tip for ya..... windows.... that thing was never right=)
Surely it's severely damaged somebody....
The UK automated ambulance dispatch system has to be my "favourite" in this regard. See here.
Not only was the software incomplete, inadequately tested and un-tuned to the required performance, but the users were never actually shown how to scroll back up the list of incidents (fifo queue, so oldest were scrolling off the top of the screen). This is rarely mentioned.
Add a good dollop of stupidity in not deciding to phase in a new complex system alongside your existing (pen and paper) system...
They all got off VERY lightly in this case - due to the medical emergency nature of the system it was very difficult to prove the culpability of any one party.
The reported 30+ deaths are now considered mostly media hysteria, however at least one asthmatic died before reaching hospital. Many people had to wait hours for assistance, while others had 5-6 ambulances arrive.
Software criminal negligence is not restricted to the project managers, testers and developers...
Q.
Insert Signature Here
Go find Project Gutenberg and find a copy of "The Hacker Crackdown". In it is a description of a poorly written patch to telephone switching software in January 1990 that brought down much of AT&T's long-distance network. The failure was initially suspected to be the work of hackers, and the author considers it to be one of the major contributors to law enforcement's crackdown on hackers in 1990 (thus Hacker Crackdown).
The Forum On Risks To The Public In Computers And Related Systems has an excellent background on all kinds of risks. I've been following it for years (since '91). It's the equivalent of slashdot for risks but the moderators are much more sophisticated. It's required reading for anyone serious about quality software in life critical situations.
I did run across that title, but for some reason I did not read it. It's very possible that no local libraries had a copy (in fact, a run of Penn State's library search engine yields no results, so this was almost certainly the case). As it was not a huge paper in general (just for a high school class, albeit a *major* paper for high school), I didn't feel compelled to go looking for tons of secondary sources when Penn State had the Commission's Report. As this report is the basis for most of the information out there on the Challenger, it was by far and away my biggest source.
One book that may be of particular interest to /.ers is Richard Feynman's "What Do You Care What Other People Think?" Feynman was one of the members of the commission that investigated the accident, and he gives his story in the second half of the book. Feynman is a fun guy and the book is a very good and easy read.
In general, there are certain sacrifices that you have to make in terms of safety. Having to cart a one-piece booster just wouldn't have worked, period. This is especially true at that time because NASA was running another launch pad in California at an Air Force base (I forget which, and also forget if it was ever used) in addition to the one at Cape Canaveral. I suspect a barge in that case would have necessitated a trip all the way through the Panama canal to get between the launch sites. My point is that you can't always take the safest option. However, that doesn't mean you ignore blatant safety issues. NASA was negligent in its inaction concerning a joint redesign. (And this is true legally as well as IMHO; the families of the astronauts I believe got a fairly substantial amount of money in a wrongful death suit, though I forget if it was settled out of court.) The decision to launch in cold weather was, in my mind, far secondary to the lack of any progress regarding the design of the joint.
If anyone is curious, my report is online as a PDF and HTML. The PDF version has a couple more pages that I didn't splice into the HTML file, including a VERY interesting and revealing memo starting on page 40. Try to ignore the numerous technical errors (I have a couple dozen typos and horrible tense consistency). I wish that I had had the time to proofread it, but it was too long for the time I had available. But I don't think my teacher bothered to read it anyway, because he made no comments on it.
It requires a number of examples of buggy execution to convince the programmers that their beautiful code is flawed, so a significant part of the time people spend "resolving errors" is really spent tricking the computer into allowing the taxpayer's figures when the computer thinks the taxpayer is wrong. Hard to believe, but true. I've done it for a season.
-SheWhoWalksWithToesLikeCobras Please enter any 11-digit prime number to continue...
-----rhad
Slashdot needs to interview Natalie Portman.
I know a financial institution which ran a job to increase mortgage rates by 0.5 percent. They knew the rates in use, so the job selected all of the mortgages for a given rate and increased them by the aforesaid 0.5 percent. The job started with the lowest used rate and then repeated the process for each of the next higher rates in use.
See the problem?
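For anyone who doesn't see it, here is a small sketch in C of what I take the job to have done. The basis-point representation and the function names are assumptions of mine, not details of the actual system.

```c
#include <stddef.h>

/* Rates are held in basis points here (hypothetical): 650 means 6.50%. */

/* Buggy: one pass per rate in use, lowest first. A loan moved from 600
   to 650 is picked up again by the 650 pass, and so on up the ladder. */
void raise_rates_by_pass(int *loan_bp, size_t n_loans,
                         const int *rates_in_use, size_t n_rates,
                         int bump_bp)
{
    for (size_t r = 0; r < n_rates; r++)
        for (size_t i = 0; i < n_loans; i++)
            if (loan_bp[i] == rates_in_use[r])
                loan_bp[i] += bump_bp;
}

/* Safer: touch each loan exactly once. */
void raise_rates_once(int *loan_bp, size_t n_loans, int bump_bp)
{
    for (size_t i = 0; i < n_loans; i++)
        loan_bp[i] += bump_bp;
}
```

With loans at 6.00%, 6.50% and 7.00% and a 0.5 point bump, the pass-per-rate version leaves all three at 7.50%. Running the passes from the highest rate down would also have avoided the pile-up.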
this has a good list, and even a funny dilbert cartoon to boot! talk about value. Software Horror Stories
From the Pacific Northwest, home of "innovative" approaches to software reliability, comes:
http://seattletimes.nwsource.com/html/localnews/134563661_ship27m.html
...
"Officials could not say for certain what caused the ship to heel, but they think the ballast system was probably at fault. A malfunction became evident about 3:30 a.m., when the 653-foot ship started to tilt. The crew was evacuated and no one was hurt.
The ship, in operation since June, has an automated ballast system that adjusts water levels in 28 compartments to keep it righted on the high seas."
Kind of frightening - wonder if the crew even knows how to do a manual override. (Also weird that evacuating the upper port ballast chamber would cause it to list to port...)
my own quote, i googled my handle and actually found this
"Sic Semper Tyrannosaurus Rex."
If the Y2k bugs hadn't been fixed, things would have broken left and right, and we would have been blamed for not fixing them ahead of time.
Since the Y2k bugs were fixed, very few things broke, and we got blamed for wasting tons of money to no effect.
C'est la vie, I guess.
At a place I used to work, they made burglar alarms. One burglar alarm, the "8112", had a feature where you could lock out sensors so they would not report. This had two possible uses: the alarm company could lock out a defective sensor, or the alarm company could lock out all sensors if the customer wasn't paying the alarm bills.
Anyway, there were 8 possible sensor "zones", and there was a byte that could lock out any or all of them. You set a bit, one zone was locked out.
An upgraded version of the 8112 came out, where you could replace two of the 8 zones with serial data loops, and you could have over 100 individually reporting sensors instead of 8. This optional feature was called "Zonex" (for "Zone expander" or something like that). For no good reason, when Zonex was running, the sense of the lockout bits was reversed: instead of being lockout bits, they became enable bits.
I want to emphasize this. If Zonex was running, you had to set all the bits; otherwise you had to clear all the bits.
So the setup software knew to check the Zonex bit, and set up the lockout byte accordingly.
This worked great until a new version of the 8112 came out where Zonex was always running, whether or not the Zonex bit was set. If the user didn't happen to set the Zonex bit when setting up that version of the 8112, the burglar alarm would silently ignore all the sensors. Eeeek!
Once we figured that out, we added a check for the version of the 8112 and all was well again with the setup software.
The moral of this story is: don't randomly invert the sense of settings!
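Here is roughly what that looked like, as a sketch in C from memory. None of these names come from the real 8112 firmware or setup software; they are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Panel side: does zone 0-7 report? Without Zonex a set bit locks the
   zone out; with Zonex running a set bit enables it. */
bool zone_reports(uint8_t mask, unsigned zone, bool zonex_running)
{
    bool bit = (mask >> zone) & 1u;
    return zonex_running ? bit : !bit;
}

/* Setup software, first version: trusts the Zonex configuration bit. */
uint8_t all_zones_on_mask_v1(bool zonex_bit)
{
    return zonex_bit ? 0xFF : 0x00;
}

/* Setup software, fixed: also asks whether this panel version runs
   Zonex regardless of the configuration bit. */
uint8_t all_zones_on_mask_v2(bool zonex_bit, bool version_always_zonex)
{
    return (zonex_bit || version_always_zonex) ? 0xFF : 0x00;
}
```

On the newer panel with the Zonex bit left clear, the first version writes 0x00, the panel reads that as "nothing enabled", and every sensor is ignored without a single error message.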
steveha
lf(1): it's like ls(1) but sorts filenames by extension, tersely
From my Software Engineering textbook (author: Vliet if you're interested), a few references you might like:
- http://www.csl.sri.com/users/neumann/neumann-book.html
- http://www.rothstein.com/slbooks/sl296.htm
Also, you might like:
"Design Paradigms: Case Histories of Error and Judgment in Engineering" by H. Petroski (not restricted to Software Eng)
Enjoy,
Rod
"I respect faith but doubt is what gets you an education." --who knows
The RISKS mailing list (aka comp.risks on usenet) deals with this topic quite thoroughly. Go forth and read this fine forum on risks to the public through computers and related systems. Learn about the problems faced by planes, trains, automobiles, banks, websites, electronic voting machines and more.
For the last 30-40 years, the most common programmer error is to put 2769 bytes in where only 1024 fit.
That looks like...
#include <string.h>

void copyit(char *b)
{
    char a[1024];
    strcpy(a, b);  // *BOOM* when b holds 1024 bytes or more
}
Simple, but look at bugtraq... programmers still can't do it right..
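For contrast, a bounded version is not much longer. This is just one way to do it, and `copy_checked` is a name I made up rather than a standard library function.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copy src into dst, whose capacity is dst_size bytes including the
   terminating NUL. Returns false without writing if src does not fit. */
bool copy_checked(char *dst, size_t dst_size, const char *src)
{
    size_t len = strlen(src);
    if (dst_size == 0 || len >= dst_size)
        return false;
    memcpy(dst, src, len + 1);   /* len + 1 copies the terminator too */
    return true;
}
```

The important part is not the function itself but that the caller passes the real buffer size (`sizeof a` for a local array) and actually looks at the return value.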
Dr. Charles Forbin really should have known better.
http://us.imdb.com/Title?0064177
On a slightly different tack - the Sleipner A oil platform sank because of a bad design, caused by inaccurate computer based modelling (using an FEA tool inappropriately). In this case it was the data not the software.
Hang on... My win98 box already does that! And to think, I criticise Microsoft for shoddy coding!
Come to think of it... I wondered what that big rectangular prismatic monolith was doing next to my box; next thing I know it will become a star child.
When Argumentum ad Hominem falls short, try Argumentum ad Matrem
In Cook County, northern Minnesota, a large percentage of households are heated by "off-peak electric stored heating". At midnight, December 21st 1999 (precisely 10 days before Y2K) the software controlling the radio signal which keeps all the heaters from going online simultaneously, crashed. The resulting overload shut down the power in the county for hours. This utility was not believed to be Y2K sensitive. Surprise!
On February 28 2000, (one day before the infamous 2/29/2000) credit card traffic into VisaNet (through Vital Processing) was failing out with the error code corresponding to "Invalid Date". Since the date 2/29/1900 is invalid, good Y2K test procedures usually call for testing that condition. AFAIK, the company never admitted to having a Y2K problem.
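The rule that trips people up is that century years are leap years only when divisible by 400, so 2000 has a February 29 and 1900 does not. A minimal date check in C, with function names of my own choosing:

```c
#include <stdbool.h>

/* Gregorian leap-year rule: every 4th year, except century years,
   except century years divisible by 400. */
bool is_leap_year(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}

/* True if year/month/day names a real Gregorian calendar date. */
bool is_valid_date(int year, int month, int day)
{
    static const int days_in_month[] =
        {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
    if (month < 1 || month > 12 || day < 1)
        return false;
    int limit = days_in_month[month - 1];
    if (month == 2 && is_leap_year(year))
        limit = 29;
    return day <= limit;
}
```

Code that stores two-digit years and silently assumes 19xx will treat 2/29/00 as 2/29/1900 and reject it, which is one plausible way to end up with an "Invalid Date" error around that day. That is a guess about the mechanism, not something the company ever confirmed.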
The National Reconnaissance Office had some of its most valued spy satellite systems go offline due to Y2K troubles. I think they were down for at least a day or two. (ouch!)
Just remember that if (a = b) is not the same as if (a == b)... Every time I write that wrong it takes me hours to find the damn bug, cause it might not appear immediately.
I hope it's just a tale - one of my teachers recently told me that the engineers assigned to build the new Danish InterCity trains IC3 and their smaller version RE2 made the mistake of using DOUBLE and FLOAT everywhere, so that at one point in the middle of a test run one of the trains halted on a bridge and would not move any further because all of its 16 MB were filled up with DOUBLEs storing the speed in KM/H ... the train only runs approx. 240 KM/H so an INT or perhaps a LONG INT would have sufficed.
As to the RE2 the engineers miscalculated the weight - the Ariane-problem - so some of the trains would "lie down" using their hydraulics but not to the correct side - the one facing the station, instead passengers had to crawl / jump the last 2 feet to the ground.
the biggest error is using assignment when you want to compare, ie if(foo = bar) instead of if(foo==bar)
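A tiny sketch in C of why this one can hide for hours; the function names are made up for the example.

```c
#include <stdbool.h>

/* Intended: report whether a equals b.
   Actual: assigns b to a, then tests whether the result is non-zero.
   The extra parentheses are there only to silence the compiler; without
   them gcc and clang warn about this line when warnings are enabled. */
bool same_buggy(int a, int b)
{
    if ((a = b))
        return true;
    return false;
}

/* What was meant. */
bool same(int a, int b)
{
    if (a == b)
        return true;
    return false;
}
```

The buggy version says 1 and 2 are the same and 0 and 0 are different. If your test data happens to be equal non-zero values, it looks correct until real data shows up, so turning on compiler warnings and reading them is the cheap fix.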
The SAAB JAS-39 Gripen crash you mention was in fact caused by the pilot not being warned about "pilot induced oscillations" (though there was a crash in testing that was attributed to the control systems). Basically, the plane started to veer left, the pilot quickly compensated as did the control system, leading to overcompensation. So, the pilot compensated back the other way, as did the plane...
Rinse, lather, repeat. End result: plane rears up, suffers complete loss of air speed, falls like a brick into a crowd of about 200 000 spectators (this was an airshow over central Stockholm!), and through divine intervention happens to hit the one empty spot in a sea of people.
Interesting fact 1: footage shows the plane stabilizing back down into a correct flight attitude once the pilot ejected. Not much use without airspeed, though.
Interesting fact 2: SAAB's division for Flight Control (presumably the same guys who did the Gripen control systems) also programmed the flight control systems for the Ariane...
Interesting (or at least humorous) fact 3: This is the monument marking the spot where JAS crashed.
I choose to remain celibate, like my father and his father before him.
I don't want to read the 2000 posts above me, so come on, flame me :P.
How you prevent errors has a lot to do with the programming language you're using.
(Never heard about CS, is it C?)
When programming in Visual Basic, here are some guidelines.
1) Use Option Explicit.
2) If you use "On Error Resume Next", then you have to check the Err object after every line (inline error handling).
3) And yes, check for division by zero (error code 11).
4) Comment; other people may have to work with your code after you.
5) Generally, always assume the user is an idiot.
Always validate ALL input.
6) Test your product before releasing it. We are only human; we do make mistakes.
You can never protect yourself 100% against errors.
New bugs will most likely be found months, even years, after the software has been released.
(Of course, this depends on the size of the product.)
Hope you can use this for something.
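The list above is about Visual Basic, but points 3 and 5 carry over to any language. A rough sketch of the same two habits in C follows; the helper names parse_long and safe_divide are hypothetical, not from any library:

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Validate ALL input: accept the text only if it is entirely a base-10
 * number that fits in a long. */
bool parse_long(const char *text, long *out)
{
    char *end;

    if (text == NULL || *text == '\0')
        return false;
    errno = 0;
    long value = strtol(text, &end, 10);
    if (errno == ERANGE || *end != '\0')   /* out of range, or junk left over */
        return false;
    *out = value;
    return true;
}

/* Check for division by zero (and the one overflowing case) before dividing. */
bool safe_divide(long num, long den, long *quotient)
{
    if (den == 0)
        return false;
    if (num == LONG_MIN && den == -1)      /* result would not fit in a long */
        return false;
    *quotient = num / den;
    return true;
}
```

Both functions report failure through the return value instead of guessing a result, so the caller has to decide what to do, much like checking the Err object after each line.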
I'm sorry to have moderated this post as "Funny" instead of "Insightful". The mousewheel is to blame, and so am I. However, I guess this mistake is preferable to moderations that negatively affect the score.
Bottom line: that stuff about the floating point error in the PAC-2 system looks neat on paper but it's not at all clear that the faulty calculation was responsible for the loss of life.
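For anyone who hasn't seen the arithmetic behind the floating point story, here is a rough sketch in C. The figures are assumptions taken from the commonly cited public accounts of the incident, not from this thread: uptime counted in 0.1 s ticks, with 0.1 stored chopped to 23 bits after the binary point. It only shows how large the clock drift gets; it says nothing about whether that drift was responsible for the loss of life.

```c
/* Assumption: 0.1 is kept with this many bits after the binary point. */
#define FRACTION_BITS 23

/* Error in the stored value of 0.1, per tick (chopped, not rounded). */
double tick_error(void)
{
    double scale = (double)(1L << FRACTION_BITS);    /* 2^23 */
    double stored = (double)(long)(0.1 * scale) / scale;
    return 0.1 - stored;
}

/* Accumulated clock drift in seconds after `hours` of continuous uptime. */
double clock_drift(double hours)
{
    double ticks = hours * 3600.0 * 10.0;            /* ten ticks per second */
    return ticks * tick_error();
}
```

Under these assumptions the error per tick is a bit under one ten-millionth of a second, and clock_drift(100.0) comes out at roughly a third of a second.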
Dead on! There's been loads of evidence that NONE of the Patriots EVER hit a single Scud. Luckily the Scuds had problems of their own with some wing design when going neg-G or something. I can't remember what it was, but anyway, quite often they simply blew up when descending towards a target.
Even I remember Bush the elder boasting (numbers may be inaccurate, but the gap was 1) "37 missiles engaged, 36 intercepted". Naturally that raised a few questions about what this "intercepted" meant, as none of the Patriots ever hit a Scud. And...
I saw this rather hilarious documentary about all this Patriot fuss, where some US general was telling in court what they meant by "intercepted". He said that "intercepted" meant only that at SOME point the Patriot's flight path / trajectory crossed that of the Scud.
1 Earth is warming, 2 It's us, 3 it's royally bad, 4 we need to take action NOW
Subject states it all: comp.risks has archives discussing nearly every computer failure for the last 20 years. Check it out.
In 1997 a major bank in South Africa ran its nightly payment job 17 times. It made a few people instant millionaires overnight, and left others with the worst debt they could ever imagine. The bank had to close for business the next day to do a complete rollback (which was successful, perhaps giving them a lot of confidence in their disaster recovery plan).
What happened was that the job failed because the job data spanned two tapes for the first time ever. This specific occurrence had never been tested. The operator simply ran the job again and again, 17 times, until he realized he should call for assistance.
By then the damage was done, but luckily not permanent.
The question is: who is to blame? The operator, the programmer, the tester, the owner, or Col. Mustard (who accepted the system)?
That's fucking cool. Any more information?
--Giving to trolls for the benefit of us all
My university used to offer distance learning for computers. The students would post their programs in, secretaries would type them in, and the listings of compiler errors would be posted back to the students. At 2-3 weeks turnaround, the students were inclined to think before coding a bit more, and hoped the typists were good... Soon after I left, the university made having access to a PC a requirement for the course, taking all the fun out of it :-)
I only skimmed the report, but I'm not sure it was a software problem per se.
I had lunch with a fairly senior engineer from Aerospatiale shortly after the Ariane 5 explosion, and his version of events, which is consistent with but not explicit in what I skimmed, is that because the software and hardware had worked flawlessly for any number of Ariane 4 flights, they did the sensible thing and didn't change a thing for Ariane 5.
The disaster occurred because the Ariane 5 is faster and/or had more sensors, therefore threw more data at the processor, and, eventually, a sensor queue overflowed and the system reset itself to launch altitude, the result of which was to make the rocket attempt the equivalent of a handbrake turn.
If this is anything near right, you could reasonably argue that it was a hardware problem, ie if the processor had kept up with the rocket the software would have performed perfectly. OK, not trapping buffer overflows is a naughty no-no, but, offhand, I can't think of an obvious way of making this system fail gracefully (throw away every second piece of data until the queue goes down? Apply the brakes? ...)
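Taking the second-hand account above at face value, one way to make a full queue a decided-upon event instead of a buffer overrun is a fixed-size ring buffer with an explicit drop policy. This is a sketch only; the type and function names, the capacity, and the drop-oldest policy are all assumptions for illustration, not anything from the actual flight software:

```c
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 4   /* hypothetical size, chosen for illustration */

typedef struct {
    int    items[QUEUE_CAPACITY];
    size_t head;      /* index of the oldest sample */
    size_t count;     /* samples currently stored */
    size_t dropped;   /* samples discarded because the queue was full */
} SampleQueue;

void queue_init(SampleQueue *q)
{
    q->head = 0;
    q->count = 0;
    q->dropped = 0;
}

/* Never writes past the array: when full, the oldest sample is discarded. */
void queue_push(SampleQueue *q, int sample)
{
    if (q->count == QUEUE_CAPACITY) {
        q->head = (q->head + 1) % QUEUE_CAPACITY;
        q->count--;
        q->dropped++;
    }
    q->items[(q->head + q->count) % QUEUE_CAPACITY] = sample;
    q->count++;
}

/* Returns false when there is nothing to read. */
bool queue_pop(SampleQueue *q, int *sample)
{
    if (q->count == 0)
        return false;
    *sample = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    return true;
}
```

Whether stale or fresh data should be thrown away is a design decision for the people who know the sensors; the point is only that the dropped counter makes the overload visible instead of fatal.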
Virtually serving coffee
Of course they didn't. The Patriot was specifically designed to detonate itself CLOSE TO the offending missile and, hopefully, in the process destroy the latter. This is, in fact, what happened: Tel Aviv and surrounding areas were rained on by falling Scud parts. These were pieces of the Scuds intercepted by the Patriots.
The problem of intercepting a moving target is difficult, but it becomes much easier when the goal is to simply get "near enough" to disable it with an explosion.
... is whot bwings os tugevza tsuzay.
Cheers.
~ ~ ~ ~ ~
Great Spirits have always encountered violent opposition from mediocre minds. -Albert Einstein
Almost a decade ago, when I worked for a different credit card company that shall remain nameless, a member of my team (I was the lead) introduced a defect that was responsible for about $40K in misapplied credits. I am not sure whether we ever got the money back.
The program was written in C, and he had changed a do-while loop to a for loop. In editing, he had kept the line that contained the original condition (including the trailing semicolon). As many of you C-ers out there are aware, a semicolon directly after a for() statement becomes the entire loop body, so the subsequent code block is not executed inside the loop; it runs just once, after the loop has finished!
A very memorable lesson in the value of lint and thorough regression testing!
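For those who have not been bitten by this yet, a small C sketch of the effect (the function names are invented, and a counter stands in for "apply one credit"):

```c
/* Buggy: the stray semicolon is the whole loop body, so the block below
 * is an ordinary statement that runs exactly once after the loop ends. */
int credits_applied_buggy(int n)
{
    int applied = 0;
    int i;

    for (i = 0; i < n; i++);   /* <-- empty statement is the loop body */
    {
        applied++;             /* runs once, no matter what n is */
    }
    return applied;
}

/* Intended: the block is the loop body and runs n times. */
int credits_applied_fixed(int n)
{
    int applied = 0;
    int i;

    for (i = 0; i < n; i++) {
        applied++;
    }
    return applied;
}
```

The buggy version is legal C, so it compiles and runs, just wrongly: it reports one credit whether it was asked for five or for none.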
This may not qualify as a disaster, but I distinctly remember having to give an account of the defect to the corporate controller with an audience of grand and exalted poobahs. She was a very intolerant and technically ignorant person who actually intimated that this had been done maliciously.
KK4SFV
See anything with Peter G Neumann's name on it. For example, a soft-covered book "Computer-Related Risks" is old [1995] but still available at Amazon. Neumann has been publishing this stuff for years in ACM SIGSOFT Software Engineering Notes as "Risks to the Public" [I'm not sure it's still there]. In any case, look at the Risks Forum at http://catless.ncl.ac.uk/Risks
Actually, I believe Arthur C Clarke maintains that HAL stands for Heuristic ALgorithm (basically an oxymoron because Heuristics are flexible and dynamic and Algorithms are static and precise)
You should differentiate between programmer error and human error. Programmer error is professional error (such as not following best practices). Human error is making stupid mistakes (with an understandable reason, for instance tiredness).
This parallels professional vs human error in other industries, such as the Railways (eg SPADs), Aviation, Civil, etc.
...some unbelievably stupid. Here
'nuff said. :-)
"No matter where you go, there you are." -- Buckaroo Banzai
Correct me if I'm wrong, but I don't think priority inversions are an issue for Win(NT|XP). Consumer Windows (as well as Linux and most other general purpose OSes) runs a modified round-robin type scheduler. Priority inversions aren't an issue.
Windows CE has them, though...
Inversions happen on real-time OSes. In RT, the OS always selects the thread to schedule next out of the pool of eligible threads of priority X, where X is the highest priority that has any eligible threads. In other words, if any threads have a higher priority than you, as long as they aren't blocked, you'll never run.
Inversion is when a low-priority thread holds a resource that a high-priority thread is waiting for. The usual remedy is to bump the low-priority thread up in priority so that it can release the resource.
So a priority inversion problem is usually an application issue, not an OS issue -- the programmers didn't think very hard about the consequences of setting various priorities and acquiring various resources. And it is one of the reasons that programming real-time OS apps is tricky. Make a priority mistake on a general-purpose OS and your program might run a bit more slowly. Make a priority mistake on a real-time OS, and several threads will never run at all.
Of course, you were really referring to the fact that it crashed every 20 minutes, so this reply is completely pointless. Oh well...
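The inversion scenario described above can be shown with a toy, single-CPU simulation in C. Everything here is hypothetical (the three priorities, the tick counts, the function name); it is not how any real scheduler is written, only the strict "highest eligible priority always runs" rule applied to three threads, with and without the priority bump:

```c
#include <stdbool.h>

enum { LOW = 1, MEDIUM = 2, HIGH = 3 };

/* LOW holds a lock that HIGH is blocked on; MEDIUM has unrelated work.
 * Returns how many ticks pass before the lock is free for HIGH. */
int ticks_until_high_runs(bool inherit)
{
    int low_work = 2;       /* ticks LOW needs before it releases the lock */
    int medium_work = 5;    /* ticks of unrelated work MEDIUM wants to do */
    int low_priority = inherit ? HIGH : LOW;   /* the bump, if enabled */
    int ticks = 0;

    while (low_work > 0) {  /* HIGH is not eligible while the lock is held */
        if (medium_work > 0 && MEDIUM > low_priority)
            medium_work--;  /* MEDIUM outranks the lock holder and runs */
        else
            low_work--;     /* the lock holder finally gets the CPU */
        ticks++;
    }
    return ticks;
}
```

Without the bump, HIGH ends up waiting behind MEDIUM even though it outranks it (7 ticks with these numbers); with the bump, it waits only for the 2 ticks LOW actually needs.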
Time flies like an arrow. Fruit flies like a banana.
CDE
:)
Cheers
Sigh... If someone is going to rip all the pictures, I wish they'd at least give proper attribution. Visit http://www.windowscrash.com for the original collection. I always give attribution.
From the sounds of what I have read, the OS can normally handle a divide by zero. However, in this case, it sounds like the OS didn't handle the divide by zero. The only case where I know of that happening is the case of a double-fault.
And a double-fault is caused by a divide-by-zero, followed immediately by some kind of a service request interrupt (such as could be generated by an I/O card). But if that is the case, since the double-fault results in a processor halt, then you have to fault only four groups of engineers:
(1) those who developed the Intel chip, for not providing a way for the software to handle a double fault (triple-fault halt might have been safer)
(2) those who designed the motherboard, since they should have included some kind of a watchdog timer to handle the double-fault case and reboot the system.
(3) those who designed the entire system networks, for not designing in a backup so that a single computer can go down, but the system stay up and running in a redundant mode.
(4) the people who drove the tugboat that pulled the Yorktown back in, since everything is their fault, anyhow.
Correct Horse Battery Staple: 72 bits of entropy. Enter "Correct H" into google. When it generates the phrase, that's
Indeed, there was a *slew* (pun intended) of timing problems with the initial software to run the shuttles (does anyone else remember the frequency of early launches scrubbed due to software?). From my understanding, the bugs were traced to race conditions amongst the (then large) array of onboard and ground computers.
Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
During my Freshman year of college, I wrote a BASIC program that would use the modem to call numbers on the dorm phone system at specified times - kind of a primitive version of those 976-WAKE type services. On its first day of operation, a minor programming/math error dealing with the system clock caused it to start ringing each number every hour or so, starting at midnight... Whoops! A couple angry people knocking on my door the next morning.
Caveat Emptor is not a business model.
The dangers of object oriented misuse:
http://www.softcom.net/users/ispy/FunStuff/Stories/ProgrammingLessons.html
Taken from the June 15, 1999 Defense Science and Technology Organization Lecture series, Melbourne, Australia.
come on fhqwhgads
Hey we sold you this! Top of the range! But it's broken, even before we sold it to you. If you pay us £500000 we'll fix them all, but if you don't your blood will boil and your head will explode, all your kids will die of pestilence, your wife will sleep around, your plane will try to reach the moon and all your elevators are belong to us.
Careful. Quoting from a Microsoft EULA like that without proper attribution could get you tossed into jail for a DMCA violation, sport.
I actually googled for the quote first and found that same page. But I've found my sigs attributed to me on quote pages like that when I wasn't actually the original author of the quote, so I thought I'd ask first.
Karma: Bored. (Thinking about resurrecting the "Anyone else is an imposter" joke.)
http://www.byte.com/art/9512/sec6/art1.htm
The Law of Falling Bodies
I remember an Airbus being flown for the first time at an airshow (video available somewhere). Apparently, the pilot made a slow fly-by over the runway at a couple hundred feet. In this near-stall configuration, the computer guidance was in "error" mode and thus took over the controls of the fly-by-wire system. Of course, applying full power is about the worst thing you can do, because you are operating on what engineers call the "back side" of the power curve, which means additional horsepower actually means less thrust. And the computer, sensing all the data available to it, does just that: full power. The plane slowly enters the canopy of a forest grove at the end of the runway and then a violent explosion erupts.
"This isn't a study in computer science, its a study in human behavior"
I wouldn't worry too much. Since you posted on the same thread as you moderated, your moderation was probably undone (unless you did so from different accounts or IPs, or unless it has changed since the only time it happened to me).
"The obvious mathematical breakthrough would be development of an easy way to factor large prime numbers." Bill Gates,
In talking with programmers and fighter pilots at Hill AFB, I learned that the pilots usually have to restart some computer systems several times every flight.
frob.
//TODO: Think of witty sig statement
Check out http://ask.slashdot.org/article.pl?sid=01/10/31/1927246&mode=thread&tid=156 for this same discussion with an emphasis on embedded computing.
"Prepare for the worst - hope for the best."
I'm not familiar with the examples you cite, but as a seasoned Tester I urge you to look beyond the obvious to the root cause of those and any other examples you are presented with. I can't think of a project I've been on where serious bugs or system failures were not caused by some other serious problem upstream of dev. When you're working with competent devs, you'll find most problems are introduced due to poor requirements, miscommunications and unrealistic expectations filtering down from management, and lack of resources applied to test infrastructure and testing. Management should be held accountable for all these things, but instead they scapegoat testers and developers to CYA.
Female Prison Rape in NY
Data is just as important as the code itself. For example, if the French had told the British intelligence agency that they had sold Exocets to the Argentinians during the Falklands War, then the radar systems being programmed by Marconi would have gone: "Ew, an Exocet. Hmmm, we have them. Do they have them? Yes. OK, possible target." Instead, the data told the code that only the British had Exocets, and so it flagged the newly acquired target that was 20 miles away as friendly. It wasn't until the ship collision code flagged it up that the computer even considered it a threat. So in this case you end up with the program failing, not due to the program, but due to the data fed into the program. End of the day, all users are viruses and viruses kill hosts :D.