Richard Feynman, the Challenger, and Engineering
An anonymous reader writes "When Richard Feynman investigated the Challenger disaster as a member of the Rogers Commission, he issued a scathing report containing brilliant, insightful commentary on the nature of engineering. This short essay relates Feynman's commentary to modern software development."
I scanned TFA and I'm not sure he has a clue about Linux, IMHO.
Beer is proof that God loves us and wants us to be happy.
The problem with the shuttle disasters (both of them, really) is external pressures that are not scientific in any way. The pressure from your manager at Morton Thiokol to perform better, faster and cheaper. The pressure from the government to beat those damned ruskies into space at all costs.
So this is really a case of engineering ethics: when do you push back? As a software developer, I never push back. Me: "There's a bug that happens once every 1,000 uses of this web survey but it would take me a week to pin it down and fix it." My Boss: "Screw it--the user will blame that on the intarweb, just keep moving forward." But could I say the same thing in good conscience about a shuttle with people's lives at stake? No, I could not.
So when an engineer at Morton Thiokol said that they hadn't tested the O-ring at the temperature expected that fateful day, and that information was either not relayed or lost on the way up to the people at NASA who were about to launch, it wasn't a failure of engineering, it was a failure of ethics. External forces had mutated engineering into a liability, not an asset.
And there's a whole slew of them I studied in college:
* Space Shuttle Columbia disaster (2003)
* Space Shuttle Challenger disaster (1986)
* Chernobyl disaster (1986)
* Bhopal disaster (1984)
* Kansas City Hyatt Regency walkway collapse (1981)
* Love Canal (1980), Lois Gibbs
* Three Mile Island accident (1979)
* Citigroup Center (1978), William LeMessurier
* Ford Pinto safety problems (1970s)
* Minamata disease (1908-1973)
* Chevrolet Corvair safety problems (1960s), Ralph Nader, and Unsafe at Any Speed
* Boston molasses disaster (1919)
* Quebec Bridge collapse (1907), Theodore Cooper
* Johnstown Flood (1889), South Fork Fishing and Hunting Club
* Tay Bridge Disaster (1879), Thomas Bouch, William Henry Barlow, and William Yolland
* Ashtabula River Railroad Disaster (1876), Amasa Stone
So I agree with Feynman's comments in relation to engineering and the further comments on software development. But I don't find them to be a fault in the nature of engineering, just a fault in our ethics. What do capitalism and competitiveness drive us to do? Cut corners, often.
My work here is dung.
For a second there I thought I read "Rogers Communications" and "brilliant" and "engineering" in the same sentence. I thought I had been kicked to an alternate universe where I wouldn't be able to escape. I am glad to be back.
[alk]
Did anyone get through before the story hit the front page? I'd be interested in reading, but Google doesn't have a cached version of the story.
"A future essay relates Feynman's commentary to modern web hosting, load balancing and the so-called Slashdot effect"
http://duartes.org.nyud.net/gustavo/blog/post/2008/02/20/Richard-Feynman-Challenger-Disaster-Software-Engineering.aspx
As a side note, could someone make a Greasemonkey script to make all links from /. run through Coral? It just makes sense.
Nothing great was ever achieved without enthusiasm
already?
Absolute power corrupts absolutely. indymedia
To be fair, the Challenger disaster actually preceded NASA's slogan and procurement policy of "faster, better, cheaper" by a bit. More to the point, Feynman's article should be a cautionary tale to ANYONE in an engineering field. It isn't a matter of one field being subject to unscientific pressures and another field being immune. No technology or industry is immune from the pressures and problems that caused the Challenger disaster. Anyone who claims to have safety concerns handled well enough to stop spending lots of time and effort on them is foolish. The nuclear industry still has to practice strong QC on parts, procedures and maintenance and CONTINUE that practice. Same with commercial aviation, acute medical care, etc. Constant vigilance is rewarded only with another uneventful day. That is the fundamental problem. Vigilance is expensive and time consuming. These are not pressures from the profit motive. They apply to government as well as civilian ventures.
(I will refrain from a four-step Profit post). Standard technique: latch on to an essay by a brilliant and insightful person. Extend the insights of that person slightly into a different field with usual compare-and-contrast, brand-extension writing techniques. Claim that resulting essay (and self) are as insightful as the original essayist.
It doesn't work 99.994% of the time, generally because very few people are as insightful as the original brilliant person.
sPh
What would Richard Feynman do?
The blog post makes a nice contribution by linking to Feynman's original thoughts (for example, here: http://www.ranum.com/security/computer_security/editorials/dumb/feynman.html ), ones I haven't read for a long time (and was happy to be reminded of). However, the author makes the mistake of thinking that the original thoughts need to be interpreted and summarized for the reader. Feynman's words by themselves are simple to understand, are concise, and contain just the tone for which geeks go gaga. Anyone interested in the subject will be able to make his or her own judgements about the engineering and politics involved in the Shuttle development, engineering in general, and the extensions to software development.
And here I was on the verge of releasing my twin papers on how the 9/11 Commission Report can be applied to software development, and how the Warren Commission Report on the Kennedy assassination applies to P2P.
Offtopic, but I highly recommend Surely You're Joking, Mr. Feynman, the autobiography he narrated on his deathbed. It's got some great stories in it, like when he surreptitiously went around picking locks at Los Alamos or his personal recollections of the Trinity nuclear tests.
I'm not sure if he is stating that a bottom up testing method is readily available in all situations, but it sure is a hell of a lot easier with data rather than with physical designs. Scanning and testing code is much easier than building a CPU and testing it from the bottom up (not that I ever have). He does make the distinction that it is less costly in the long run, and I'd probably agree with him, not from experience with this particular application, but experience in general with preventative maintenance. I would rather design something that is tested to withstand its rigors rather than cut corners because it is cheaper now, but potentially more costly in the long run in terms of upkeep and repairs. But what do I know, I'm no computer scientist/software engineer.
Absolute power corrupts absolutely. indymedia
Otherwise you end up with people who don't develop anything, in general. Yes you have your exceptions, but exceptions won't get the entire job done. Think of it in terms of a water pipe. Make the pipe wide enough, all you get is a trickle. But slowly start reducing the diameter, all of a sudden this pressure is enough to launch the water many feet away.
In order to have a real sense of the "nature" of engineering, you have to look at more than the failures. You have to look at the successes that occurred in the midst of these same pressures. I'd start by looking into the Manhattan Project, in which Feynman played a part. The exercise of finding other examples is left for the reader.
Well.. maybe. Or Maybe not. But Definitely not sort of.
While most commentaries on brilliant analysis are not brilliant, a few are.
Edward Tufte's analysis of Dr. Feynman's brilliant analysis is brilliant, warranting a full chapter in Visual Explanations. What makes it special is that it is not "hey, yeah, that's a good idea, I'm smart too" but instead a study of why Dr. Feynman's analysis is brilliant.
Can we get a "-1 Wrong" moderation option?
http://www.networkmirror.com/LBKPk3ml3LEozZTj/duartes.org/gustavo/blog/post/2008/02/20/Richard-Feynman-Challenger-Disaster-Software-Engineering.aspx.html
"I'd rather be a lightning rod than a seismometer." -Ken Kesey
The biggest problem is that most software developers are NOT chartered professional software engineers, so they have no personal, professional and legal responsibility for their work. That is why IT is full of cowboys and trust is nearly nonexistent. Software engineering must become a chartered-only profession, so that people who are not chartered are not allowed to practice.
To qualify as Professional Engineers we should place good practice above short term gains. Professional Engineers should be truthful and objective and have no tolerance for deception or corruption. Professional Engineers only work in areas where they are competent. Professional Engineers build their reputation on merit, their skills through continual learning, and the skills of their charges through ongoing mentoring.
We wouldn't have to put up with the shoddy work of cowboys, because they wouldn't be allowed to practice. We wouldn't have to put up with orders that counteract professional ethics or good practice, because legal responsibility trumps commercial pressures. The professional wouldn't be undermined by fast-to-market but poor quality work. We could place trust in third party tools, software & services, and we would not have to put up with EULAs that disavow responsibility for damage.
"Fatal Defect" by Ivars Peterson. A good read.
Time once again for the rejoinder, "Doctors bury their mistakes. Engineers read about theirs in the headlines."
They said that the management at NASA didn't want to cancel the flight of the Challenger because it was such a high profile launch, even though they were warned about the O-rings.
God spoke to me.
Maybe there will be some sunny day when I will listen to what Linus Pauling says about vitamin C, what Fomenko says about history and what Richard Feynman says about programming.
But that day is not today.
I do not believe in karma. "Funny"=-6. Do good and forbid evil. Yours, Oft-Offtopic Flamebaiting Troll.
Not that his comments aren't still relevant.
Your heart's in the right place, but it would not and cannot work.
Why? Simply - an excess of demand and a shortage of resources. There is simply too much demand for software development and there aren't enough Computer Science curricula in existence to meet that demand.
And this is coming from a degreed engineer. Not a licensed professional, however. Yeah, I took and passed the EIT, but never went for the PE. Why? In my original field, telecommunications, there never was any requirement at any of my employers to be a registered PE.
Granted, there are tons of people out there who confuse an MIS degree with a Computer Science or Computer Engineering degree. And if you hire an MIS grad to help develop the next whiz bang OS, well, chances are it won't work out. It might, but the odds are against you...
"A little misunderstanding? Galileo and the Pope had a little misunderstanding."
An example of flawed control software leading to fatalities: http://en.wikipedia.org/wiki/Therac-25
Thanks a ton for setting these up. I got the site back up. It should be working now. I never imagined this would end up on Slashdot.
"The computer system is very elaborate, having over 250,000 lines of code." Wow, I've helped write PHP applications that have more code than that. Of course, that was all buggy as hell too.
Software Engineers [in Canada anyway] can't legally be called engineers unless they are licensed by a provincial body such as the PEO [Professional Engineers Ontario].
It requires an accredited Engineering program, with the same math, science and engineering courses that civil/electrical/chemical engineers take.
Anything else is just a simple developer/programmer/analyst/code monkey.
Engineers are responsible to a central body, which can revoke licences, impose judgements, etc.
Who are programmers responsible to? Their managers? What oversight exists?
The slogan of the engineer is "If I don't build it right, people could die", while the slogan of the programmer is:
"If I don't build it right, I'll release a service pack or bug fix."
Physics is not engineering. If you get things wrong in physics, usually, nothing happens except maybe an angry letter to the editor. Physicists regularly produce incomplete or even contradictory theories, and nobody dies. Physics doesn't have to interface with people; when coming up with a theory of quantum gravity, you don't have to worry about people pushing the wrong button. And the complexity (in terms of variables, equations, etc.) of all of theoretical physics taken together is probably still less than that of a single big software project.
I've been in software quality and testing for 14 years. I've worked at very large corporations as well as startups. There is a WIDE gap in software development process in our industry. Many people like to call themselves software engineers when they are developers. There is a huge difference. Engineering is a discipline that follows well-defined rules, and it usually takes time. But I think the very important thing to point out is that some software requires engineering - other software does not. If I go into a startup company that is trying to develop a blog/wiki site and try to implement a NASA-like software development methodology, they will fail. Likewise, software to control a heart monitor should be engineered and closely controlled. Sometimes quality and perfection is the goal, other times it might be time-to-market that is critical. You have to fit the process to your business. A bridge is a bridge, and they should all be engineered pretty much in the same way. You can't say the same thing about software.
I think that this is a very key point to software development. I have seen companies who spent entirely too much time and money trying to eliminate all defects from their software when it wasn't the critical part of their business. Yes, we should always strive to eliminate defects, but you can't get them all. You have to know when to pick your battles, and when to accept the risks. If we're talking about life-or-death software, or security, or other very critical things - you need to focus on those.
There's a grid I have seen used that is a great tool when doing projects.
Schedule, Cost, Quality, Scope.
1 can be optimized, 1 is a constraint, and the other 2 you have to accept. Period. It is a more useful version of the "fast, good, cheap - pick two"
My beliefs do not require that you agree with them.
This story is about Feynman, so it needs to be tagged "richardfeynmanisgod."
Legalize it.
There was a point that Feynman missed, which is that the SRB support mechanism, a support at the top and bottom, created a single point of failure of the entire system. If there had been a third support in the middle, the burn through and failure of the bottom support would not have caused that SRB to rotate into the main tank. Feynman found the proximate cause in the failure of the 'O' rings, but not the design flaw that was the ultimate cause.
The blog entry is dated today.
The link to Feynman's appendix to the Rogers Commission is a link dated 1996.
Feynman died Feb 15, 1988.
So we're talking about something over 10 years old that a blogger has added a few personal observations to, and it's linked in as slashdot news.
These posts express my own personal views, not those of my employer
"There is not enough room in the memory of the main line computers for all the programs of ascent, descent, and payload programs in flight, so the memory is loaded about four time from tapes, by the astronauts."
Since I've had such stellar success with tapes and drives made this century, I can't imagine trusting the shuttle's landing to some made 20+ years ago...
Shift happens. Fire it up.
01. Don't build solid fuel boosters in sections.
02. Don't build them out of state so they have to be sectioned to transport by rail.
03. Don't compromise the design to get some politician's vote for funding, forcing you to site the solid rocket booster work in his state.
04. Don't ignore safety concerns from your own engineers
05. It don't take a nuclear physicist to figure this out
davecb5620@gmail.com
Maybe it's the election, but I had thought I was watching plenty of news lately. This post made me look up Columbia and I see that the 5th anniversary of its crash was Feb 1st, 2008. Funny thing is, I didn't hear a thing about it then. Did anyone else? Or was this ignored by the media in the runup to Feb 3rd (Super Bowl) and Feb 5th (Super Tuesday)? Seems that NASA was reminded by this disaster to pay attention to the Feynman suggestion that shuttle failures will happen on the order of 1% of the time, as suggested by its engineers. Glad the Mars landing proposed by Bush still has time to be well-designed. ;)
BTW, the Iraq war also started about 5 years ago, on March 19. Maybe that event helped to squelch public mourning for Columbia at the time. Sure seemed like it wasn't in the headlines for long. Or maybe, like me, everyone was just too sad to be reminded of Challenger and didn't want to think about Columbia.
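To put a number on that 1% figure: a rough sketch of how a per-flight failure rate compounds over a program. It assumes every flight is independent with the same failure probability, which is a simplification, and the 100-flight horizon is just an illustrative choice. The two rates are the ends of the range Feynman reports in his appendix (roughly 1 in 100 from working engineers, 1 in 100,000 from management).

```python
def p_at_least_one_failure(p_per_flight, flights):
    """Chance of one or more failures, assuming independent, identical flights."""
    if not 0.0 <= p_per_flight <= 1.0:
        raise ValueError("p_per_flight must be a probability")
    if flights < 0:
        raise ValueError("flights must be non-negative")
    # Complement of "every single flight succeeds".
    return 1.0 - (1.0 - p_per_flight) ** flights


if __name__ == "__main__":
    FLIGHTS = 100  # illustrative horizon, not a figure from the report
    estimates = (
        ("engineers, ~1 in 100", 1 / 100),
        ("management, ~1 in 100,000", 1 / 100_000),
    )
    for label, p in estimates:
        risk = p_at_least_one_failure(p, FLIGHTS)
        print(f"{label}: {risk:.4f} chance of at least one loss in {FLIGHTS} flights")
```

The gap between the two results is the whole argument: at the engineers' rate, a loss somewhere in a long program is more likely than not.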
If you like Feynman, here are some of his lectures: http://vega.org.uk/video/subseries/8
If it ain't broke, don't fix it. The software, at least, ain't broke.
!#@%*)anks for hanging up the phone, dear.
If I recall correctly from watching video interviews with Richard Feynman, he said that the general who got him involved in the Challenger accident investigation already knew about the O-ring problem. He asked (or ordered?) Richard to present it to the panel. Richard didn't like being played by the political powers that be. The general thought that people would listen when they saw it demonstrated by Feynman and heard it from Feynman.
I suppose the powers that be thought it would be good to hoodwink Feynman for a change.
The interviews with Richard Feynman are fascinating to watch. If you find the one where he talks about the Challenger accident, please add the link in a reply to this comment. Thanks.
http://youtube.com/results?search_query=feynman&search_type=
Marcus Ranum has an interesting talk (MP3) in which he discusses Feynman's Challenger commentary at some length in the context of designing reliable/secure software systems.
The talk gets off to a bit of a rough start (see Ranum's comment below), but contains much insight and makes a lot of sense before long. Highly recommended for those in the software development field, where the approach is often 'throw it together, then poke at it and patch it until it stops obviously breaking'; the rigour Feynman & Ranum describe may be overkill for some systems, but exposure to this other approach can help make most of us better developers. I found it helpful, anyway—your mileage may vary.
I don't know about the rest of them. Yes, I'm an engineer. Two of those were covered in my engineering class.
The cesspool just got a check and balance.
I work in the aerospace industry, specifically an airline, as a manager of an Engineering subgroup. (if "manage" is what you call what I do)
One of the first things I have a new hire do is read Feynman's appendix to the Challenger Report. Primarily to instill a respect for dealing with data, not desires or pressures, and to (re)enforce the concept that "it worked last time", does NOT make it right or safe to do the same thing again.
The pressure / desire from above or parallel organizations within the company is constant, and usually precipitated by the latest operational interruption. All too frequently the refrain is along the lines of "but last time you authored a deviation, this is only a little bit more". When I feel the pressure is starting to cause situational ethics creep, I pull out Feynman's appendix, and read it myself, or have the affected person on my staff read it.
It is amazing how effective it is in restoring sanity, and a healthy respect for the ability of the hardware to kill you (and / or your customers).
Richard Feynman gave many things to this world, and especially certain segments of it. It's my opinion however that one of his best and most unsung gifts was the Challenger Report Appendix. It should be required reading for ANYONE who will ever touch or direct action on hardware that could even remotely present a potential for injury or death.
The message was not rocket science, but as the Columbia accident proved the rocket scientists still can't get it right.
Never ascribe to malice or conspiracy that which can be adequately explained by ignorance or stupidity.
Initial temperature could easily play a role by causing shrinkage and larger-than-usual packing gaps. But it wasn't some sort of brittle fracture, since I very much doubt the brittle transition temperature of the packing material was anywhere near as high as 30°F. More usual is 0°F and below.
From Feynman's book "What Do You Care What Other People Think?": "the computers on the shuttle are so obsolete that the manufacturers don't make them anymore. The memories in them are the old kind, made with little ferrite cores that have wires going through them. In the meantime we've developed much better hardware: the memory chips of today are much, much smaller; they have much greater capacity; and they're much more reliable." (page 192 of my copy, it's in the chapter called An Inflamed Appendix).
a war on terrorism? How can we end a war on a method?
Time to turn in your geek cred. There's a lot of references to many things, some of which are pretty high-minded cultural references. There's also very artful and clever references to comics, games, sci-fi, movies, anime, and concepts like the singularity.
Hint: the thing being tripped on isn't acid.
Richard Feynman is one of the few intelligent people to walk the earth. Few know who he was, but there are not many people through history more important.
Unfortunately, the new beta thread navigation feature on slashdot breaks some of the AJAX features like expanding hidden comments in-line without a page reload
Signatures are a waste of bandwi (buffering...)
But didn't everyone agree many years ago it was simply the o-ring that failed... as it had many times in testing... as it kept failing... because of crappy building materials and poor design?
I'm really sad I missed the big debate here, but this is old news, isn't it? What new light has been shed? We have footage of NASA testing the booster rockets blowing up on their sleds because of the O-rings. I don't recall having heard that the weather played a part, but the ports freezing up and failing O-rings were pretty much a constant throughout all ideas put on the table.
Is TFA worth my time or have I picked up enough with comments alone?
one poster said it was about pressure from many sources -- they're right. Because of time lines, budget constraints and bad managerial choices the challenger happened. I hope they have learned enough to not repeat that... which is probably why they're returning to Apollo style rockets (which is retardeddedededded)
My abilities are only limited by my imagination
I agree with this completely (and I have seen minor versions of this in my job at a defense contractor), but I have to say that there's a certain amount of personal accountability that has to be taken by engineers/software developers(engineers)/anyone.
At times I have been pressured to get something done or let something pass, and sadly I have to admit that I've often given in. But as bad as this is, it's just as unethical for me to try and blame the problem solely on management/capitalism/any "invisible hand".
I did contract work as a Software Developer for an airplane engine manufacturer (among other things this DOD contractor does). After I pushed a new CAD module to Dev, I was shocked to find out how the software was configured to run.
You see, this very important company wanted their software to be "always available" - even when a specific server or database wasn't available. So the script that ran the program would look for its (50+) supplementary modules in Prod. If a specific module wasn't available in Prod it would try connecting to QA, if QA wouldn't work it would load from Dev. It did all this without prompting or even notifying the user that they weren't running "Prod level" software for all the components.
When I raised concerns that engine components were being developed using this fail-over strategy, I was stunned to discover that most of their software used similar startup scripts. I was also told repeatedly by the engineers that it didn't matter because "all modules have been tested and are production-ready, or the vendor wouldn't have released them to the public.".
They're still making engines, but there'll come a day when a component developed or tested using Prod software loads a Dev module and makes a deadly, untested calculation. Possibly very similar to computer errors that caused the Hartford Civic Center roof collapse.
I quit soon after, and I get very nervous when I fly on planes with those engines.
When money or reputation is involved, convenience trumps reliability far too often.
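For anyone who wants to see why that startup script is dangerous, here is a minimal sketch of the pattern next to a stricter alternative. Everything in it is hypothetical: the environment names, the module names, and the `available` mapping stand in for whatever the real script queried. It only illustrates the fallback order described above, not the contractor's actual code.

```python
from dataclasses import dataclass

# Hypothetical environment names, in the fallback order the comment describes.
ENVIRONMENTS = ("prod", "qa", "dev")


class ModuleUnavailable(Exception):
    """Raised when a module cannot be loaded from an acceptable environment."""


@dataclass(frozen=True)
class LoadedModule:
    name: str
    environment: str  # recorded so the caller can see where the code came from


def load_silently(name, available):
    """The risky pattern: accept the first environment that has the module."""
    for env in ENVIRONMENTS:
        if name in available.get(env, set()):
            return LoadedModule(name, env)
    raise ModuleUnavailable(f"{name} not found in any environment")


def load_strict(name, available, allowed=("prod",)):
    """Stricter pattern: refuse to fall back outside the allowed environments."""
    for env in ENVIRONMENTS:
        if name in available.get(env, set()):
            if env not in allowed:
                raise ModuleUnavailable(
                    f"{name} only found in {env!r}; refusing to fall back"
                )
            return LoadedModule(name, env)
    raise ModuleUnavailable(f"{name} not found in any environment")


if __name__ == "__main__":
    # Hypothetical inventory: "stress" has not been promoted past dev.
    available = {"prod": {"mesh"}, "qa": set(), "dev": {"mesh", "stress"}}
    print(load_silently("stress", available))  # quietly served from dev
    try:
        load_strict("stress", available)
    except ModuleUnavailable as err:
        print("strict loader stopped:", err)
```

The design point is that "always available" and "always production-tested" pull in opposite directions. The strict loader gives up availability on purpose, and even the permissive one at least records which environment each module came from so the user could be told.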
The pressure from the government to beat those damned ruskies into space at all costs.
Angle of Attack is a magnificent read, both for the human side of the story, and the achievement of manned space flight from an inside engineering viewpoint. The at all costs dictum forced something else: the conscientious and conspicuous mandate to do things differently, fail and learn from it, and not stay in any comfortable ruts. There were no sanitary, emasculating powerpoint charts; instead stories of important people being called in at midnight to get to questions immediately. This is not to diminish the importance of ethics, but part of what came out of Apollo was thinking outside the box at all moments and not being tethered from doing so by management. Would that type of engineering today be seen as rogue and wasteful or brilliant? Indeed, how many places now truly allow development to occur in this fashion across all departments, despite the bullshit marketing speak spewed to investors and prospective employees? The global requirement of quick return of investment (measured in misleading ways, say lines of code per day for example) can stoke ethics problems, lead to a lot of CYA behavior, and squash innovative, careful, time consuming with little to show for it, thinking.
For more info, refer to "What Do You Care What Other People Think?".
The blog entry fails to understand that Feynman was criticising the lack of testing during the construction of the engine to fit the "top down" design.
The point Feynman makes is that you *must test* at a very low level to gain surety about what you build out of the low level building blocks. For example, you need to verify each function in libc (strcpy, etc.) through manual inspection and then confirm through testing that it works. Any change to that function requires a repetition of that process. Then as you build your program up around various libc functions (which you know are correct), you have to repeat that testing and verification step at each level.
This process and testing does not mean that "bottom-up" design is what's called for, but rather when you build something, you need to test each component as you go along.
There is nothing wrong with stating the space shuttle engines must work for X hours in a design spec, but if you do not do testing on the components to verify that, then you have a problem with the assembled unit.
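A minimal sketch of that test-as-you-build idea, in Python rather than C for brevity. The function names (`copy_chars`, `make_label`) are made up for illustration and are not from libc or from Feynman's report. The point is only the order: the building block is verified on its own before anything assembled from it is tested.

```python
def copy_chars(src, limit):
    """Low-level building block: copy at most `limit` characters of `src`."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return src[:limit]


def make_label(part_name, serial, width=12):
    """Higher-level unit assembled from the already-tested building block."""
    return f"{copy_chars(part_name, width)}-{serial:04d}"


def test_copy_chars():
    # Verify the component in isolation, including its edge cases.
    assert copy_chars("turbine", 3) == "tur"
    assert copy_chars("turbine", 0) == ""
    assert copy_chars("abc", 10) == "abc"


def test_make_label():
    # Only meaningful once copy_chars is known to be correct.
    assert make_label("turbopump", 7) == "turbopump-0007"
    assert make_label("high-pressure-pump", 42, width=4) == "high-0042"


# Run the low-level test first; a failure there makes the higher test moot.
test_copy_chars()
test_make_label()
```

If `copy_chars` ever changes, both tests run again, which is the "repetition of that process" at each level. None of this dictates whether the design itself was drawn up top-down or bottom-up.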
Ah, yeah, I know what they say failed on the shuttle.
The aspect that I was pointing out was that Feynman said that he was told by a general what failed and to present it at the public presentation of the investigating panel, which he then did. What Feynman said when interviewed about it was that he didn't like the "answers" being fed to him (I'm paraphrasing as I can't find the video clip at the moment) by political people. He said that he felt "used" by them. In essence he felt "hoodwinked".
> Software has much in common with other engineering disciplines
:-)
As a programmer, I go through life battling this view. I would agree to:
Software (development) has much in common with engineering of
first-time things.
Engineering is about how to create and how to reproduce. The two are
often confused, but very distinct. Building a house is mostly reproductive
engineering. Because software is easily reproduced, "building" software
usually is much more an N-time endeavor for a relatively small N.
Clearly the space shuttle design was mostly pushing the envelope,
first-time engineering. In my opinion, in constructing the big engine,
too much reproductive engineering was utilized, where they should have
used first-time engineering techniques.
I think this point is maybe too subtle for non-programmers, and therefore I
feel the need to oppose it.
The top-down approach is compatible with reproductive engineering, the
bottom(s)-up approach is better suited for most first-time development.
If only my managers would get that.
-- (:> jms cs.vu.nl (_) --"---
If NASA were to replace the shuttle computers they would need to have their own CPUs manufactured with rather coarse structures, they simply cannot use any of the modern embedded processors. I'm not sure if designing and manufacturing your own CPU chip nowadays is really cheaper than custom-threading magnetic core memory...
Okay, so applying the principles of engineering to software development is a good thing -- but what does it mean and where should I start?
I found the article a little hard to follow -- it didn't really lead me anywhere. The author pointed out this esoteric concept of the "bottom-up" approach which Feynman believed was important; but the author failed to show me how he thought it would apply to software development. I only understood that we should apply it. I never really read in the article what this approach is and how to apply it to my practice of software development.
The article did well to point out the short-comings of software developers who did not take this approach, but it lacked the evidence to explain and support the author's theory.
The links he pointed to were more or less the same. I'm rather baffled by this. Am I to believe that the author's intent was to tell us that we should all think of software development as he does? Is that what will turn software development into the utopia he thinks it should be?
I do not disagree with the premise of the article; I just wish proponents of it were more clear in the practical application of their theory.
After all, that's the goal of engineering, is it not?
Even if the design made it past review, those O-rings should have been red-flagged and fixed after the first test-firing of the SRBs showed leakage. They were not proper O-rings. The fact that they were not fixed (perhaps for reasons of schedule or whatever) is somewhat culpable, if not evidence of a military (vs civilian) level of risk tolerance.
The quotes by Feynman are great, but the actual article linked (the blog post) is one of the most spectacularly uninsightful things I've ever read. I don't think that dude has ever worked on any sort of engineering project, ever. Can we just link straight to the good stuff next time, and not to some retard's blog?
The computer has a reality and a nature and software must live within those rules or risk the consequences.