Software Defects - Do Late Bugs Really Cost More?
"If you're a software engineer, one of the concepts you've probably had driven into your head by the corporate trainers is that software defects cost logarithmically more to fix the later they are found in the software development life cycle (SDLC).
For example, if a defect is found in the requirements phase, it may cost $1 to fix. It is proffered that the same defect will cost $10 if found in design, $100 during coding, $1000 during testing.
All of this, to my knowledge, was started by Barry Boehm in papers[1]. In these papers, Mr. Boehm indicates that defects found 'in the field' cost 50-200 times as much to correct as those corrected earlier.
That was 15 years ago, and as recently as 2001 Barry Boehm indicates that, at least for small non-critical systems, the ratio is more like 5:1 than 100:1[2].
[1] - Boehm, Barry W. and Philip N. Papaccio. 'Understanding and Controlling Software Costs,' IEEE Transactions on Software Engineering, v. 14, no. 10, October 1988, pp. 1462-1477
[2] - Boehm, Barry and Victor R. Basili. 'Software Defect Reduction Top 10 List,' Computer, v. 34, no. 1, January 2001, pp. 135-137."
At any stage, you can only find bugs that are introduced at or before that stage. So while fixing a requirements bug in the coding phase might be more expensive than fixing it during the requirements phase, fixing a coding bug during the requirements phase is a tricky operation that I'll leave as an exercise for the reader :-)
Of course, if you omit some of these phases completely, you won't introduce any bugs during them. That's why the JFDI(*) methodology is so popular.
(*)Just F*cking Do It
There's plenty of proof out there. Even "ancient" but worthy texts like "The Mythical Man Month" discuss this one.
The size of the project and the nature of the bug really combine to drastically affect the outcome.
Speaking personally, we have just spent about a year tracking down a particular set of bugs (probably not all nailed yet) which showed up post-live. When we were pre-live these would undoubtedly have been easier to fix, but something else we could have done at that point would have been to improve our design, which would have nuked most of the bugs completely. Once we are in production, however, we have this forward/backward compatibility heuristic tying one hand behind our backs, and redesigning the thing becomes a much, much bigger job.
But that's just anecdotal, of course.
Looks more exponential to me.
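A quick sketch of why: the $1 / $10 / $100 / $1000 figures in the quoted post multiply by a constant factor at each phase, which is exponential (geometric) growth, not logarithmic. The function name and the `factor` parameter below are my own framing for illustration; only the dollar figures come from the quoted post.

```python
PHASES = ["requirements", "design", "coding", "testing"]

def phase_costs(base_cost=1, factor=10):
    """Cost to fix a defect in each phase, assuming a constant multiplier per phase."""
    return {phase: base_cost * factor ** i for i, phase in enumerate(PHASES)}

costs = phase_costs()
for phase, cost in costs.items():
    print(f"{phase:<12} ${cost}")

# Constant ratio between consecutive phases is the signature of exponential growth.
values = list(costs.values())
ratios = [later // earlier for earlier, later in zip(values, values[1:])]
print("ratio between consecutive phases:", ratios)
```

Calling `phase_costs(factor=5)` instead gives a gentler curve with the same shape, if you want to compare multipliers.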
Never forget that complexity accumulates. Fixing the bug itself probably costs about the same at every stage, but other costs are introduced as the project moves along, and peak after the software has been deployed.
A bug found after deployment has costs associated with it that a bug found during coding does not:
The cost of finding and fixing the bug may be negligible compared to other costs.
Another aspect of the issue is the nature of the bugs you find late. In my experience, bugs that survive testing and deployment tend to be either bugs in requirements or pretty subtle bugs that slipped through testing, and both are more expensive than the type of bugs commonly detected early on during development.
Most recently I've been tracking down an error in our system. After nearly a month of trying various things, I found the cause. Two years ago, the hardware engineer building the FPGA and DSP programs didn't bother to fix a [relatively simple] design problem: rather than giving all communications the same format, a few commands differ substantially from all the others (different responses in certain circumstances, for example).
The problem made it into the PC software that interfaces with the board. The problem is documented in several [maybe 20?] bug reports against the software that works between the PC and the external device. The problem is documented in at least 50 bug reports against a port of that PC software. It has been in production for several years, and has been implemented by external companies (which I feel sorry for, given the complexity of the communications bug).
Now we're working on a completely new FPGA/DSP board to replace the earlier board. Design changes prevent us from directly implementing the bug in the new design, although otherwise the communication protocols are the same. Implementing the same malformed communications will mean breaking the simple straightforward design and carefully implementing a set of 'design exceptions' (read: 'bugs').
It would have taken one engineer an hour or so to fix this thing when they first saw it. It would have taken both teams a few days to fix it when writing the PC to DSP interface (~1 FTE month). It would have taken a few weeks to fix it when writing the port, requiring changes to the PC software and the DSP (~1 FTE year). If we choose to fix the error now, it will probably result in 2+ FTE years of work just to fix everything, plus more time for regression testing every old piece of software for this one bug. If we choose to leave it in, we will devote at least that much time to evaluating, implementing, and testing the old errors. Not to mention the continued maintenance work when the eventual bugs are found in the new board.
Now we're faced with a tough financial decision: do we spend a month or more carefully re-creating and testing the 'design exceptions' (probably 3-5 FTE years in total), or do we do it 'the right way' and break both our own and our customers' software? (Again, several FTE years, but potentially losing faith with the customers.)
This particular bug could have been prevented by about $50 of work. It has now cost the company tens of thousands of dollars, and will probably cost a few hundred thousand before all is said and done.
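To put rough numbers on that escalation, here is a small sketch that converts the effort estimates above into hours and shows each stage as a multiple of the original one-hour fix. The hour conversions (160 hours per FTE month, 12 FTE months per FTE year) are my own assumptions for illustration, not figures from our books, and "2+ FTE years" is treated as a lower bound.

```python
# ASSUMPTION: 1 FTE month = 160 working hours, 1 FTE year = 12 FTE months.
HOURS_PER_FTE_MONTH = 160
HOURS_PER_FTE_YEAR = 12 * HOURS_PER_FTE_MONTH

# Effort estimates taken from the post, expressed in hours.
stages = [
    ("when first noticed", 1),
    ("writing the PC-to-DSP interface", 1 * HOURS_PER_FTE_MONTH),
    ("writing the port", 1 * HOURS_PER_FTE_YEAR),
    ("fixing it now (lower bound)", 2 * HOURS_PER_FTE_YEAR),
]

def growth_table(stages):
    """Return (stage, hours, multiple of the first stage's effort) for each stage."""
    baseline = stages[0][1]
    return [(name, hours, hours / baseline) for name, hours in stages]

for name, hours, multiple in growth_table(stages):
    print(f"{name:<34} {hours:>5} h  {multiple:>6.0f}x")
```

Under those assumptions the fix grows from 1 hour to at least 3,840 hours, which is the same order-of-magnitude climb as the dollar figures.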
Now, let's throw some financial ethics into the $50 --> $5,000 --> $50,000 --> $500,000+ problem: The engineer was in a hurry to fix the problem before a company-imposed deadline. Is that engineer responsible for the enormous financial cost? If so, how much? If not, why not? It can be argued that his negligence caused half a million dollars in damages. It can be argued that the engineer was responsible for $50 but the team was responsible for allowing it to grow. It can be argued that this is a regular business cost due to the fallibility of engineers' designs.
This raises the question:
How responsible are any of us for the errors we introduce?
frob
From what I have read, Oracle's founders had the best solution to the problem of customers holding off buying until version 2.0: "This first Oracle was named version 2 rather than version 1 because the fledgling company thought potential customers were more likely to purchase a second version rather than an initial release."
If the designers of POP3 could have looked forward and seen the spam and forged-header abuse, security could have been part of the standard. Now that POP3 and IMAP mail is everywhere and forged headers are also everywhere, changing the de facto standards is a big undertaking. Making the switch to something more robust will be a long and painful transition. Everything will be incompatible for a while.
It will be as easy as getting the US to switch to the metric system or transition with the rest of the world to driving on the left side of the road. Both would have been much cheaper if they had been implemented at the beginning instead of attempting a transition later.
Typos: Simple misspellings of words. Infrequent, easy to detect, easy to fix.
Writos: Incoherent sentences. More frequent, hard to detect, harder to fix.
Thinkos: Conceptually bonkers. Very frequent, subtle and hard to detect; almost impossible to fix.
Most 'late' bugs that I've seen in software projects belong in the last category - a lack of design or the failure to make a working mock-up leads to 'thinkos' which are only obvious when the application is nearly completed. These are expensive to fix.