Richard Feynman, the Challenger, and Engineering
An anonymous reader writes "When Richard Feynman investigated the Challenger disaster as a member of the Rogers Commission, he issued a scathing report containing brilliant, insightful commentary on the nature of engineering. This short essay relates Feynman's commentary to modern software development."
The problem with the shuttle disasters (both of them, really) is external pressures that are not in any way scientific. The pressure from your manager at Morton Thiokol to perform better, faster and cheaper. The pressure from the government to beat those damned ruskies into space at all costs.
So this is really a case of engineering ethics: when do you push back? As a software developer, I never push back. Me: "There's a bug that happens once every 1,000 uses of this web survey but it would take me a week to pin it down and fix it." My Boss: "Screw it--the user will blame that on the intarweb, just keep moving forward." But could I in good conscience say the same thing about a shuttle with people's lives at stake? No, I could not.
So when an engineer at Morton Thiokol said that they hadn't tested the O-ring at the temperature expected that fateful day, and that information was either not relayed or lost on its way up to the people at NASA who were about to launch--it wasn't a failure of engineering, it was a failure of ethics. External forces had mutated engineering into a liability, not an asset.
And there's a whole slew of them I studied in college:
* Space Shuttle Columbia disaster (2003)
* Space Shuttle Challenger disaster (1986)
* Chernobyl disaster (1986)
* Bhopal disaster (1984)
* Kansas City Hyatt Regency walkway collapse (1981)
* Love Canal (1980), Lois Gibbs
* Three Mile Island accident (1979)
* Citigroup Center (1978), William LeMessurier
* Ford Pinto safety problems (1970s)
* Minamata disease (1908-1973)
* Chevrolet Corvair safety problems (1960s), Ralph Nader, and Unsafe at Any Speed
* Boston molasses disaster (1919)
* Quebec Bridge collapse (1907), Theodore Cooper
* Johnstown Flood (1889), South Fork Fishing and Hunting Club
* Tay Bridge Disaster (1879), Thomas Bouch, William Henry Barlow, and William Yolland
* Ashtabula River Railroad Disaster (1876), Amasa Stone
So I agree with Feynman's comments in relation to engineering and the further comments on software development. But I don't find them to be a fault in the nature of engineering, just a fault in our ethics. What do capitalism and competitiveness drive us to do? Cut corners, often.
My work here is dung.
For a second there I thought I read "Rogers Communications" and "brilliant" and "engineering" in the same sentence. I thought I had been kicked to an alternate universe where I wouldn't be able to escape. I am glad to be back.
[alk]
"A future essay relates Feynman's commentary to modern web hosting, load balancing and the so-called Slashdot effect"
http://duartes.org.nyud.net/gustavo/blog/post/2008/02/20/Richard-Feynman-Challenger-Disaster-Software-Engineering.aspx
As a side note, could someone make a Greasemonkey script to make all links from /. run through Coral? It just makes sense.
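For anyone tempted to write that script, the rewrite itself is small. Here is a rough sketch in Python rather than as an actual userscript, assuming the only change needed is appending `.nyud.net` to the hostname, as in the mirrored link above. The function name `coralize` is made up for illustration, and limiting the rewrite to plain `http` links is an assumption, since that is the only form shown in this thread.

```python
from urllib.parse import urlsplit

# Suffix taken from the mirrored link in this thread.
CORAL_SUFFIX = ".nyud.net"


def coralize(url: str) -> str:
    """Return url with the Coral suffix appended to its hostname.

    Hypothetical helper: links that are not plain http, have no host,
    or already carry the suffix are returned unchanged.
    """
    parts = urlsplit(url)
    host = parts.hostname
    if parts.scheme != "http" or not host or host.endswith(CORAL_SUFFIX):
        return url
    netloc = host + CORAL_SUFFIX
    # Keep an explicit port if the original link had one (an assumption
    # about how the mirror should be addressed).
    if parts.port is not None:
        netloc += f":{parts.port}"
    return parts._replace(netloc=netloc).geturl()


original = (
    "http://duartes.org/gustavo/blog/post/2008/02/20/"
    "Richard-Feynman-Challenger-Disaster-Software-Engineering.aspx"
)
print(coralize(original))
```

A real userscript would apply the same rule to every link element on the page; the hostname handling is the only part that needs care, since rewriting an already-rewritten link should be a no-op.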
Nothing great was ever achieved without enthusiasm
To be fair, the Challenger disaster actually preceded NASA's slogan and procurement policy of "faster, better, cheaper" by a bit. More to the point, Feynman's article should be a cautionary tale to ANYONE in an engineering field. It isn't a matter of one field being subject to unscientific pressures and another field being immune. No technology or industry is immune from the pressures and problems that caused the Challenger disaster. Anyone who claims to be so well adapted to safety concerns that they no longer need to spend lots of time and effort on them is foolish. The nuclear industry still has to practice strong QC on parts, procedures and maintenance and CONTINUE that practice. Same with commercial aviation, acute medical care, etc. Constant vigilance is rewarded only with another uneventful day. That is the fundamental problem. Vigilance is expensive and time consuming. These are not pressures from the profit motive. They apply to government as well as civilian ventures.
(I will refrain from a four-step Profit post). Standard technique: latch on to an essay by a brilliant and insightful person. Extend the insights of that person slightly into a different field with usual compare-and-contrast, brand-extension writing techniques. Claim that resulting essay (and self) are as insightful as the original essayist.
It doesn't work 99.994% of the time, generally because very few people are as insightful as the original brilliant person.
sPh
The blog post makes a nice contribution by linking to Feynman's original thoughts (for example, here: http://www.ranum.com/security/computer_security/editorials/dumb/feynman.html ), ones I haven't read for a long time (and was happy to be reminded of). However, the author makes the mistake of thinking that the original thoughts need to be interpreted and summarized for the reader. Feynman's words by themselves are simple to understand, are concise, and contain just the tone for which geeks go gaga. Anyone interested in the subject will be able to make his or her own judgements about the engineering and politics involved in the Shuttle development, engineering in general, and the extensions to software development.
And here I was on the verge of releasing my twin papers on how the 9/11 Commission Report can be applied to software development, and how the Warren Commission Report on the Kennedy assassination applies to P2P.
Offtopic, but I highly recommend Surely You're Joking, Mr. Feynman!, the collection of reminiscences he told to Ralph Leighton. It's got some great stories in it, like when he surreptitiously went around picking locks at Los Alamos or his personal recollections of the Trinity nuclear test.
http://www.networkmirror.com/LBKPk3ml3LEozZTj/duartes.org/gustavo/blog/post/2008/02/20/Richard-Feynman-Challenger-Disaster-Software-Engineering.aspx.html
"I'd rather be a lightning rod than a seismometer." -Ken Kesey
The biggest problem is that most software developers are NOT chartered professional software engineers, so they have no personal, professional and legal responsibility for their work. That is why IT is full of cowboys and trust is nearly nonexistent. Software engineering must become a chartered-only profession, so that people who are not chartered are not allowed to practice.
To qualify as a Professional Engineer we should place good practice above short term gains. Professional Engineers should be truthful and objective and have no tolerance for deception or corruption. Professional Engineers only work in areas where they are competent. Professional Engineers build their reputation on merit, their skills through continual learning, and the skills of their charges through ongoing mentoring.
We wouldn't have to put up with the shoddy work of cowboys, because they wouldn't be allowed to practice. We wouldn't have to put up with orders that counteract professional ethics or good practice, because legal responsibility trumps commercial pressures. The professional wouldn't be undermined by fast to market but poor quality work. We could place trust in third party tools, software & services and we would not have to put up with EULAs that disavow responsibility for damage.
Your heart's in the right place, but it would not and cannot work.
Why? Simply - an excess of demand and a shortage of resources. There is simply too much demand for software development and there aren't enough Computer Science curricula in existence to meet that demand.
And this is coming from a degreed engineer. Not a licensed professional, however. Yeah, I took and passed the EIT, but never went for the PE. Why? In my original field, telecommunications, there never was any requirement at any of my employers to be a registered PE.
Granted, there are tons of people out there who confuse an MIS degree with a Computer Science or Computer Engineering degree. And if you hire an MIS grad to help develop the next whiz bang OS, well, chances are it won't work out. It might, but the odds are against you...
"A little misunderstanding? Galileo and the Pope had a little misunderstanding."
I've been in software quality and testing for 14 years. I've worked at very large corporations as well as startups. There is a WIDE gap in software development process in our industry. Many people like to call themselves software engineers when they are developers. There is a huge difference. Engineering is a discipline that follows well-defined rules, and it usually takes time. But I think the very important thing to point out is that some software requires engineering - other software does not. If I go into a startup company that is trying to develop a blog/wiki site and try to implement a NASA-like software development methodology, they will fail. Likewise, software to control a heart monitor should be engineered and closely controlled. Sometimes quality and perfection are the goal, other times it might be time-to-market that is critical. You have to fit the process to your business. A bridge is a bridge, and they should all be engineered pretty much in the same way. You can't say the same thing about software.
I think that this is a very key point to software development. I have seen companies who spent entirely too much time and money trying to eliminate all defects from their software when it wasn't the critical part of their business. Yes, we should always strive to eliminate defects, but you can't get them all. You have to know when to pick your battles, and when to accept the risks. If we're talking about life-or-death software, or security, or other very critical things - you need to focus on those.
There's a grid I have seen used that is a great tool when doing projects.
Schedule, Cost, Quality, Scope.
1 can be optimized, 1 is a constraint, and the other 2 you have to accept. Period. It is a more useful version of the "fast, good, cheap - pick two" rule.
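The grid can be treated as a rule you check a project plan against. Below is a minimal sketch of that check, assuming each of the four dimensions gets exactly one role. The role labels (`optimize`, `constrain`, `accept`) and the function `check_priorities` are hypothetical names chosen for this example, not part of any established tool.

```python
from collections import Counter

DIMENSIONS = ("schedule", "cost", "quality", "scope")
# One dimension is optimized, one is a hard constraint, two are accepted.
EXPECTED = {"optimize": 1, "constrain": 1, "accept": 2}


def check_priorities(plan):
    """Return a list of problems with plan; an empty list means it fits."""
    problems = []
    for dim in DIMENSIONS:
        if dim not in plan:
            problems.append(f"missing dimension: {dim}")
    for dim, role in plan.items():
        if dim not in DIMENSIONS:
            problems.append(f"unknown dimension: {dim}")
        if role not in EXPECTED:
            problems.append(f"unknown role for {dim}: {role}")
    counts = Counter(plan.values())
    for role, want in EXPECTED.items():
        if counts[role] != want:
            problems.append(f"expected {want} x {role}, got {counts[role]}")
    return problems


# A time-to-market plan: schedule is optimized, cost is fixed.
startup = {"schedule": "optimize", "cost": "constrain",
           "quality": "accept", "scope": "accept"}
# A plan that tries to optimize three things at once.
wishful = {"schedule": "optimize", "cost": "optimize",
           "quality": "optimize", "scope": "accept"}

print(check_priorities(startup))
print(check_priorities(wishful))
```

The point of writing it down this way is that the "wishful" plan fails loudly, which is the conversation the grid is meant to force before the project starts.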
My beliefs do not require that you agree with them.
I had never heard of Dresden Codak before this post but am now getting hooked while going through the archive. I think it's hilarious, but then I grew up in Los Alamos...
The linked comic is funny in a postmodern way (wondertwins vs. historical quantum theory) and the art is fantastic. A lot better than I could ever do.
Sam! If you will let me be,
I will try them.
You will see.
I don't have my copy of Visual Explanations handy, but I've read it and I was at a talk Tufte gave on this subject, and my recollection of it is rather different. Without directly criticizing Feynman, Tufte actually comes up with a significantly superior analysis of the root cause of the disaster. Feynman spread the blame around many places, finding bad science, bad engineering, inaccurate statistics, poor procedures and documentation, politics influencing design, and most importantly and famously, a disconnect between management and engineering leading to overconfidence. Everything he found is right. But Tufte took the analysis one step further and came up with a completely convincing "one point where it all went wrong." That point was the inability of the booster rocket contractor's team to effectively present information.
The day before the Challenger's final launch, the team that designed and manufactured the booster rockets called Mission Control and said that they thought the launch should be aborted because an O-ring on a booster would be likely to give out due to cold and cause the Challenger to explode. This team was not previously known for being overly cautious; in the previous history of the shuttle program, they had never before recommended aborting a mission. The next day, the Challenger launched and the booster rocket blew up exactly the way the team that made it said it would.
This seems like an inconceivable oversight on the part of Mission Control. When the team that designed the rocket told them it was going to blow up, how could they possibly go ahead and launch? The hubris, the pride, the thick-headed showmanship.
Well, Tufte dug into this and found out exactly what happened. Mission Control told the rocket team to prepare a presentation about why they thought it would go wrong. The team did so and presented that to Mission Control. Tufte interviewed many people about the specifics of that meeting and actually managed to reassemble the original slides shown during the talk. And anyone viewing the information presented by the booster rocket team to Mission Control will have trouble faulting Mission Control, because the presentation was absolutely incomprehensible.
The booster rocket team's argument was supposed to be that for each previous launch, the amount of subsequent damage found in the O-rings was inversely related to the temperature at launch. They had all the data. They were all scientists and engineers. Tufte used their data to construct a graph of O-ring damage vs. launch temperature. Showing that graph and the weather forecast for the launch day to anyone in charge would have gotten the mission cancelled in a second. But the team that was there to argue that low temperatures correlated with O-ring damage never presented a single intelligible piece of data demonstrating that, even though they had all that data with them. Instead, they showed a chart of O-ring damage vs. launch date, and another chart several pages later with temperature vs. launch date.
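The step that was missing from the presentation is mechanical: take the two per-launch series and join them on launch, so damage is shown against temperature instead of against date. A rough sketch follows. The numbers are invented placeholders purely to make the example run, NOT the real flight record, and the launch labels, the temperature units and the function name `damage_vs_temperature` are all assumptions.

```python
def damage_vs_temperature(damage_by_launch, temp_by_launch):
    """Join two per-launch series and order the pairs by temperature.

    Returns a list of (temperature, damage) tuples for launches that
    appear in both series, coldest first.
    """
    shared = damage_by_launch.keys() & temp_by_launch.keys()
    return sorted((temp_by_launch[k], damage_by_launch[k]) for k in shared)


# Invented placeholder values, NOT actual shuttle data.
damage_by_launch = {"launch-1": 0, "launch-2": 3, "launch-3": 1, "launch-4": 0}
temp_by_launch = {"launch-1": 75, "launch-2": 55, "launch-3": 65, "launch-4": 80}

# Crude text chart: one row per launch, coldest first, one '#' per unit of damage.
for temp, damage in damage_vs_temperature(damage_by_launch, temp_by_launch):
    print(f"{temp:>3} deg | {'#' * damage}")
```

Even as a text chart, ordering by temperature puts the relationship on one screen, whereas two date-ordered charts pages apart leave the reader to do the join in their head.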
I've read Adventures of a Curious Character and have the utmost respect for Feynman. Every problem Feynman outlined in his analysis was a real problem that NASA should fix. But none of it really pinpointed the exact cause of the disaster. Feynman mostly chalks the failure to postpone the launch up to management's disconnect from engineering: their mistakes and lack of understanding led them to overestimate the safety of the shuttle. This puts the blame in the wrong place. The managers were nowhere near so overconfident that, when the engineers who designed the part that failed knew it would probably fail in exactly that way and tried to halt the launch, they'd just brush them aside and go ahead with it. They listened carefully; the engineers had data that would have made a great case, but it was presented so incompetently that no one at that meeting would have thought they had a case at all. They simply appeared to be overly cautious, because they did not present any data demonstrating their point.
Can anyone tell me how to set my sig on Slashdot?
If you like Feynman, here are some of his lectures: http://vega.org.uk/video/subseries/8
Marcus Ranum has an interesting talk (MP3) in which he discusses Feynman's Challenger commentary at some length in the context of designing reliable/secure software systems.
The talk gets off to a bit of a rough start (see Ranum's comment below), but contains much insight and makes a lot of sense before long. Highly recommended for those in the software development field, where the approach is often 'throw it together, then poke at it and patch it until it stops obviously breaking'; the rigour Feynman & Ranum describe may be overkill for some systems, but exposure to this other approach can help make most of us better developers. I found it helpful, anyway—your mileage may vary.
I work in the aerospace industry, specifically an airline, as a manager of an Engineering subgroup. (if "manage" is what you call what I do)
One of the first things I have a new hire do is read Feynman's appendix to the Challenger Report. Primarily to instill a respect for dealing with data, not desires or pressures, and to (re)enforce the concept that "it worked last time", does NOT make it right or safe to do the same thing again.
The pressure / desire from above or parallel organizations within the company is constant, and usually precipitated by the latest operational interruption. All too frequently the refrain is along the lines of "but last time you authored a deviation, this is only a little bit more". When I feel the pressure is starting to cause situational ethics creep, I pull out Feynman's appendix, and read it myself, or have the affected person on my staff read it.
It is amazing how effective it is in restoring sanity, and a healthy respect for the ability of the hardware to kill you (and / or your customers).
Richard Feynman gave many things to this world, and especially certain segments of it. It's my opinion however that one of his best and most unsung gifts was the Challenger Report Appendix. It should be required reading for ANYONE who will ever touch or direct action on hardware that could even remotely present a potential for injury or death.
The message was not rocket science, but as the Columbia accident proved, the rocket scientists still can't get it right.
Never ascribe to malice or conspiracy that which can be adequately explained by ignorance or stupidity.