Too Darned Big to Test?
gManZboy writes "In part 2 of its special report on Quality Assurance (part 1) Queue magazine is running an article from Keith Stobie, a test architect in Microsoft's XML Web Services group, about the challenges one faces in trying to test against large codebases."
Maybe they can take a clue from other large code projects and open source it?
to Keep It Simple, Stupid.
For those who didn't RTFA, it is basically saying that exhaustive testing can't be done on a large codebase, and random testing is all you can use, to which most coders will say bull.
If a piece of code is too big to test exhaustively, it's time to refactor it into bits that can be.
After you've tested each part to make sure it works, you test a superset of parts, thus testing the interactions between the smaller parts. Lather, rinse, repeat until you've tested the whole application.
Correct use of unit testing will always outstrip random testing.
This is just an excuse for badly designed code bases.
Ask 8 slackers a question, get 10 answers (a citation, but I can't remember from whom)
Except people pay Microsoft to beta test their software.
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
- Unit test. This can easily be automated with almost any language, especially modern languages like Java (JUnit).
- Component test. This is usually a low-volume testing phase because you're only testing boundary conditions and each component only needs to be retested when it changes.
- Integration test. Again, usually quite low volume if the unit testing and component testing have done their job. You pretty much put it all together and check that it runs, along with the most basic of sanity checks (Can it access the database? Can a user log in? Is it producing log files? etc.)
- System test. The biggy, but it doesn't have to be daunting. If you have proper requirement and design documentation, writing the test plans should be a breeze for any competent tester, no matter how large the codebase. If the unit testing, component testing and integration testing have done their job, system testing should really only be about validating the software against the requirements, not finding bugs. If you're finding significant bugs at system test stage, either your unit or component testing wasn't done correctly or your requirement and design process is poor.
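To make the first two levels concrete, here is a minimal sketch using Python's unittest (same idea as JUnit). The function clamp_quantity and its limits are invented purely for illustration; the point is only that the "typical value" test is the unit level and the boundary and bad-input tests are the sort of thing a component test would pin down.

```python
import unittest


def clamp_quantity(qty, lo=1, hi=99):
    """Hypothetical component: force an order quantity into [lo, hi]."""
    if not isinstance(qty, int):
        raise TypeError("quantity must be an int")
    return max(lo, min(hi, qty))


class ClampQuantityTest(unittest.TestCase):
    def test_typical_value(self):
        # Unit level: the ordinary case behaves as documented.
        self.assertEqual(clamp_quantity(5), 5)

    def test_boundaries(self):
        # Component level: values on and just outside each limit.
        for given, expected in [(0, 1), (1, 1), (99, 99), (100, 99)]:
            self.assertEqual(clamp_quantity(given), expected)

    def test_rejects_non_int(self):
        # Bad input should fail loudly, not be silently coerced.
        with self.assertRaises(TypeError):
            clamp_quantity("5")
```

Saved as a module, this should run with `python -m unittest`, and it only needs re-running when clamp_quantity changes.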
This sort of thing is basic, bread 'n butter stuff to a tester. Usually it doesn't work as it should because either management don't allow proper timescales, don't use a proper iterative process ("I've penciled in one re-build to fix any bugs we find the week before it's due to ship. O.K? Oh well, it'll have to do.") or the requirement and design phase is done so poorly there is no way to write proper test plans. It is almost never the case that the software is "Too complex". If NASA managed to debug the entire shuttle flight control software, I'd expect a company the size of Microsoft to be able to debug a server application.
It is possible to build immense and complex code bases that are incredibly well tested and robust. Look at any Linux distribution and this is what you have.
The key is that the code base is structured so that it can evolve over time as many independent layers and threads, each using an appropriate technology and competing in terms of quality and functionality.
The problem is not the overall size of the code base, it's the attempt to exert centralised control over it.
To take a parallel from another domain: we can see very large economies working pretty well. The economies that fail are invariably the ones which attempt to exert centralised planning and control.
The solution is to break the code base into independent, competing projects that have some common goal, guidelines, and possibly economic rationale, but for the rest are free to develop as they need to.
Not only does this make for better code, it is also cheaper.
But it's less profitable... and thus we come to the dilemma of the 2000s: attempt to make large systems on the classical model (which tends towards failure) or accept that distributed cooperative development is the only scalable option (and then lose control over the profits).
Sig for sale or rent. One previous user. Inquire within.
What is it about software construction that makes this so difficult a concept to grasp?
Where did you learn the trade? If I guessed you're pretty fresh from a Computer Science course, say two or three years in the business, would I be far from the mark?
I ask because what you describe is exactly what is supposed to happen. You know you're done when, surprise, QA stop sending you bugs (Or at least, stop finding bugs which are classified above a certain severity level). Then, and only then, should the software be considered complete and ready.
The problem is that attitudes like yours, that QA is a pain that should be wished away, are wrong but very, very prevalent within the IT industry. It is such a wrong and totally backward attitude I can't fathom where it came from. It's a brain rot that's killing the industry in a sea of broken code and half-assed implementations.
That's actually more the philosophy of Unix -- to do one thing, and to do it well -- and has been around for 30+ years. I'd say that philosophy is common in the OSS world for a few reasons: One, open-source encourages code & component reuse. Two, most OSS developers don't have time to write large projects on their own, and three, the free software movement started in the Unix domain, the source of this philosophy.
Be relentless!
And that, my friend, is why software "engineering" is not engineering at all. I'm all for raising coding standards to engineering levels. The amount of time and headaches saved by such an effort would easily exceed thousands of lifetimes. It's silly that we still accept such shoddy workmanship.
Be relentless!
In other areas, e.g. ASICs / integrated circuits, the costs of wrong decisions and errors explode during the design cycle. This is why the whole IC industry commits itself to a "first-time-right" ideology. Each step, from specification to the final layout, involves testing. As an ASIC designer, you're happy if you can spend more than 25% of your time and effort on designing the actual architecture. 75-90% of the overall effort is "wasted" on testing.
To continue on: I think that is part of the problem. A lot of Microsoft beta testers are just those Windows nerds who think it is hip and cool to run the unpolished edge of technology and to be able to put on their resumes that they have 11 years' experience in Windows 95. Most of them do encounter bugs but do not report them. Why should they? They have to pay to get the product, so why bother reporting bugs? Microsoft should release the betas for free to a wide group of people, and promise them a free copy of the released product if they report so many bugs. The trick for beta testing is to get as many eyes on it as possible who know that this isn't a completed or stable product and are able to try funky things to break it.
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
The problem is that in the NASA case, if they don't get that shuttle flight control system ready on time for launch, they can easily push the launch back indefinitely. It isn't as if they're going to go out of business if they don't have launches due to unsafe conditions.
Besides which, once the flight control system version x.y is finished, the development team doesn't then immediately start working on flight control system version x.y+1 (or worse, version x+1.0). It isn't as if NASA finishes a shuttle, and then immediately starts building a new, improved shuttle.
But this is exactly what happens in big software houses. The pressure to release ahead of your competition and stay ahead (or catch up with) the perceived feature curve is huge. Delays are bad -- delays equal lost sales. And once the product is done, unlike a bridge or a plane or a shuttle which will last 20 - 30 years or more as is, that software immediately starts getting new features and major modifications for "the next version".
And perhaps worse, once a version ships, most software development companies stop any sort of further testing -- instead, they rely upon customers to report problems, and typically only then do they investigate (and, hopefully, fix the problem).
The process is different due to "market forces". Personally, I don't like it either, and have stayed away from corporate software development for some time because of it. It's simply not a good way to develop software, as eventually the poor design decisions and rushed jobs (and burnt out developers) cost the company and the users dearly.
Yaz.
I still contend that rewrites are harmful only when all of these three conditions are met:
a) Your code is your commercial product/livelihood
b) You need to support legacy systems
c) You are coding for practical results not for the art of programming.
Joel is an insightful guy, but he approaches software exclusively as a deliverable intended to Get The Job Done Now. For a lot of software this is appropriate, but in the case of open source software it is seldom that all of the above conditions are met. There are also a couple of points he doesn't mention that are relevant to open source software:
d) Users of the old code are not left out in the cold - the complete old codebase is available for them to pick up and maintain (or hire someone to maintain - maybe even the original author) if there is sufficient motivation. Open source authors often aren't motivated to maintain steaming piles of turd just for the joy of it, so they are more inclined to do rewrites. If you want them to maintain old stuff, do like everyone else who really wants some service and hire them!
e) The software stack is almost completely free for open source software - there is no "but I can't afford to upgrade to Windows 98 and break everything!" problem. Granted you might run into those problems, but in theory if you care enough they can be solved. (Often NOT true for legacy commercial software.) So open source developers as a whole are a lot less concerned with backwards compatibility. Take KDE for example - the incentive to support KDE2 when coding a KDE app today is virtually nil - there are many very good reasons KDE3 exists, both from a user AND a developer standpoint. If a user really wants the crap handled to deal with old, broken environments they shouldn't expect to get something for free. The point, again, is that they CAN hire someone to do what they want, because the code is available to be updated.
Now, that said, I would agree that OpenOffice is too critical to the free software world to rush off and be headstrong about. It might be a case where a Netscape type move would be a bad idea. But I like the enlightenment project, even if they have treated violating Joel's rules like a pro sport. They are creating something artistic, advanced, and with the intent of "doing it right". If you look at enlightenment as not a continuation of the old e16, but instead as a totally new product, then it takes on a different light - they are actually doing prototypes, designing and testing, etc. BEFORE they release it in the wild and invite support headaches. Now, as usual first to market wins, but in open source losers don't always die and can sometimes come back from the grave. Rosegarden is an example of an application that is good because they explored their options and found a good one, even with and partially because of their experience on previous iterations of the code. They didn't do it "the Joel way" but they did it in the end and they did well.
I think there is another "zen" of programming that we are getting closer to reaching - the "OK, we have discovered the features we want and use, now let's code it all up so we never have to do it again" level. There is little that is surprising in spreadsheets, databases, word processors, etc. - they are mature applications from a "user expected featureset" point of view. So now I propose we do, not just a rewrite, but a reimplementation using the most advanced tools we have to create Perfect software. Proof logic, careful design, theorem provers, etc. etc. etc. We know, in many cases, what program/feature/OS behavior/etc. we want. Let's formalize things as much as humanly possible, and make a bulletproof system where talking about rewrites makes no sense, because everything has provably been done the Right Way. (Yes, I'm watching the coyotos project - they've got the right attitude, and they might determine if it is possible.)
"I object to doing things that computers can do." -- Olin Shivers, lispers.org
...such is the outcome of not doing test-driven development. Test the functions as you write them, and just leave the tests there until you release; that makes sure everything works. When will these people learn!
I've seen this debate before, and the part I always wonder is "why not both?" At least, when you are starting from scratch. You can verify your components do what they are supposed to and then check for bizarre situations no one thought of with random testing (sometimes you will expose obscure bugs in the software stack itself, not just your code - but remember no code stands alone, and all crashes look the same to the end user no matter what the root cause.)
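A rough sketch of what "both" can look like, in Python. The run-length encoder here is a made-up stand-in for "your component", not anything from the article: a couple of hand-written checks verify it does what it is supposed to, and a seeded random loop then hunts for inputs nobody thought of by checking that decode(encode(x)) gives x back.

```python
import random


def rle_encode(text):
    """Run-length encode a string into a list of (char, count) pairs."""
    runs = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def rle_decode(runs):
    """Inverse of rle_encode."""
    return "".join(ch * count for ch, count in runs)


def handwritten_checks():
    # The cases a developer thinks of up front.
    assert rle_encode("") == []
    assert rle_encode("aab") == [("a", 2), ("b", 1)]


def random_round_trip(trials=1000, seed=0):
    # The cases nobody thinks of: random strings, one invariant.
    rng = random.Random(seed)  # seeded so a failure can be reproduced
    for _ in range(trials):
        length = rng.randint(0, 30)
        text = "".join(rng.choice("ab\n ") for _ in range(length))
        assert rle_decode(rle_encode(text)) == text, repr(text)
    return trials
```

The small alphabet is deliberate: it makes long runs and awkward characters like newlines far more likely than uniformly random text would.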
Particularly on large, old projects one has inherited, random testing can really help because you have absolutely no clue what you are looking for. There are so many discrete components to the system that could be tested it would be the work of ten years to set it up, so you are forced to (as much as possible) assume that things work and find the cases where they don't. Then, you gradually begin to fix things over the long haul while fighting fires.
GCL and the other free Lisp implementations are a good example of testing - we have a very dedicated individual who has been creating tests of ANSI behavior from the spec and testing a wide variety of implementations - indeed many non-standard behaviors have been corrected because of these tests. He has also created a "random tester", which I like to call "the Two Year Old Test." It is a code generator which generates random but legally valid Lisp code and throws it at the implementation. It has exposed some very obscure bugs in GCL which probably would have otherwise hidden for years. Anybody who has been around small kids knows they will introduce you to all sorts of new failure modes in just about everything you own, so I always think the Two Year Old Test should be administered as a final check whenever possible. (Granted this works particularly well for compilers.) Newbies are very useful for this kind of stuff as well, because they will use the software in ways you never thought to.
"I object to doing things that computers can do." -- Olin Shivers, lispers.org
Well, if a QA tester is clever enough, they can come up with a bug that 1) no end user would ever encounter since it's so arcane and 2) is really, really hard to fix. While theoretically a programmer should enjoy the challenge of fixing a challenging bug, I can understand there comes a point where the programmer says in effect, "Screw this! The customer is never going to start an order, cancel, put in an invalid amount, then start an order and overflow the buffer! Fixing this will take a month, but it's not going to help anyone!"
So, I can sort of sympathize with the programmer, sort of. However, OTOH if your job is to fix the bugs, you've gotta fix the bugs, so quit yer bitchin' and fix the bugs or get a new job.
Then, six months to a year later, a customer starts an order, cancels, puts in an invalid amount and then starts an order. Guess what happens?
Yes, that's right! You get a critical bug report from a customer because the entire system just crashed and corrupted the database. Now someone in your support team has to spend the entire day helping the very annoyed customer recover their data and restart the system, all the while of course the customer suffers downtime.
Whoops.
Of course this may never happen to you. You probably got lucky. This sort of scenario is why bugs should be classified, and why they should be given a priority. If a bug is found and is deemed to be critical, you'd better fix it. If you don't think it's really a critical bug then fine, argue with the QA team and see if you can change their mind, but at the end of the day they may have use case scenarios or data that you do not, and may have a better idea of just how critical the bug will be.
Use your judgement, but rely on theirs as well.
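For what it's worth, "arcane" sequences like the one above are cheap for a machine to stumble on. Below is a toy sketch in Python: OrderForm is entirely hypothetical, with a bug planted on purpose (it only validates amounts while the form is open, and start() forgets to reset the amount). A random walk over the three actions checks one invariant after every step and hands back the first sequence that breaks it.

```python
import random


class OrderForm:
    """Hypothetical order screen with a deliberately planted bug."""

    def __init__(self):
        self.is_open = False
        self.amount = 0

    def start(self):
        self.is_open = True  # planted bug: self.amount is not reset here

    def cancel(self):
        self.is_open = False

    def enter_amount(self, value):
        if self.is_open and value < 0:
            raise ValueError("invalid amount")
        self.amount = value  # planted bug: a closed form accepts anything


def find_failing_sequence(max_steps=6, trials=2000, seed=0):
    """Try random action sequences; return the first one that leaves an
    open order holding a negative amount, or None if none is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        form, history = OrderForm(), []
        for _ in range(rng.randint(1, max_steps)):
            action = rng.choice(["start", "cancel", "amount"])
            if action == "amount":
                value = rng.choice([-1, 0, 10])
                history.append(("amount", value))
                try:
                    form.enter_amount(value)
                except ValueError:
                    pass  # rejecting bad input is the expected behaviour
            else:
                history.append((action,))
                getattr(form, action)()
            # Invariant: an open order never holds a negative amount.
            if form.is_open and form.amount < 0:
                return history
    return None
```

The returned history is exactly the kind of reproduction recipe QA would attach to the bug report, which makes the argument about how critical it is a lot easier to have.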
Is anyone doing independent QA audits of Linux, outside of the development sources/bug report/Linus ruling on high loop?
If you resist reading what you disagree with, how will you ever acquire deeper insights into your own beliefs?
[TFA] Another great way to target testing is based on actual customer usage.
This is a really good idea.
The crash feedback system in Mozilla exhibits this model of testing.
I think more of the casual user applications I run on the desktop should be compiled with debugging and a simple transparent mechanism for returning information to the developers about problems.
Nothing mandatory, no hidden information sent back to the mother ship, just a text file showing back traces, etc. that the user can see contains no sensitive information.
Thus all users become beta users who can feed back to the developer which bugs really matter.
Taken to the next step of optimization and UI design, developers can find out which code paths really matter in terms of real life usage if the application is instrumented with profiling turned on and the option for the user to feed back information this way. IIRC, some compilers have options to take advantage of run-time statistics to better compile the second time around.
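A minimal sketch of the "plain text file the user can read" idea, in Python. This is not how Mozilla's system works; the function names and the report file name are made up for illustration. The hook writes the backtrace to a text file and sends nothing anywhere, so the user can open it, check what it contains, and decide whether to mail it in.

```python
import os
import sys
import tempfile
import traceback


def default_report_path():
    # Hypothetical location; a real application would pick its own.
    return os.path.join(tempfile.gettempdir(), "myapp_crash_report.txt")


def write_crash_report(exc_type, exc, tb, path):
    """Write a human-readable backtrace to a plain text file."""
    lines = traceback.format_exception(exc_type, exc, tb)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("Crash report (plain text, nothing is sent automatically)\n")
        fh.writelines(lines)
    return path


def install_crash_hook(path=None):
    """Route uncaught exceptions through write_crash_report."""
    report_path = path or default_report_path()

    def hook(exc_type, exc, tb):
        write_crash_report(exc_type, exc, tb, report_path)
        # Still show the normal traceback on stderr.
        sys.__excepthook__(exc_type, exc, tb)

    sys.excepthook = hook
```

Note that exception messages can themselves contain user data (file names, input values), which is one more reason to let the user look at the file before anything goes back to the developer.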
"Provided by the management for your protection."
I'm a great fan of randomized testing, and have used it to good effect in testing Common Lisp compilers as part of the GNU CL ANSI test suite. The oracle problem is tractable, since one can do differential testing -- test that two different computations that should produce the same answer actually do. For example, construct a random lisp form, then eval it, and also wrap it in a lambda form, then compile and funcall. Differences in output, errors during compilation, or errors during execution all indicate bugs (assuming one has generated legal lisp code.)
This and other more focused random testing schemes have found oodles of bugs in many Common Lisp implementations.
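The same differential trick can be sketched outside Lisp. The Python below is only an analogy, not the GNU CL test suite: it builds a random arithmetic form, evaluates it once with a small tree-walking interpreter, and once by rendering it to source, wrapping it in a lambda, compiling and calling it. Any disagreement between the two routes points at a bug in one of them, so no separate oracle is needed.

```python
import operator
import random

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def random_form(rng, depth=3):
    """Build a random but always legal form: an int or (op, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.randint(-9, 9)
    op = rng.choice(sorted(OPS))
    return (op, random_form(rng, depth - 1), random_form(rng, depth - 1))


def interpret(form):
    """Route 1: walk the tree directly."""
    if isinstance(form, int):
        return form
    op, left, right = form
    return OPS[op](interpret(left), interpret(right))


def to_source(form):
    """Render a form as fully parenthesised Python source."""
    if isinstance(form, int):
        return f"({form})"
    op, left, right = form
    return f"({to_source(left)} {op} {to_source(right)})"


def compile_and_call(form):
    """Route 2: wrap in a lambda, compile, then call."""
    code = compile(f"lambda: {to_source(form)}", "<random-form>", "eval")
    return eval(code)()


def differential_test(trials=500, seed=0):
    """Return every form on which the two routes disagree."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(trials):
        form = random_form(rng)
        if interpret(form) != compile_and_call(form):
            mismatches.append(form)
    return mismatches
```

Division is left out on purpose so that every generated form is legal; as noted above, the approach only works if the generator never produces invalid code.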
1) no end user would ever encounter since it's so arcane and
Assuming an end user will never encounter an error because it's "arcane" is a really good way to get your ass handed to you. The parent was probably GOOD for testing because he at least knows how to describe the problems and is familiar with the systems. You take the average drone user who doesn't know jack about the system when it blows up and you're talking about a huge argument where no one knows what the hell is going on.
I agree about the "hard to fix" problem though. I mean if your code is 99% right with a bug that will be rare and non-critical, and fixing it will raise the cost of the software significantly, then perhaps it's better to let it slide. Or just try to throw up some last minute defenses.
Two points.
1: I agree, but it takes a long time (5-10 years) to get your coding skills up to traditional engineering levels.
2: Mechanical devices have much higher tolerances than mathematical ones: if I want a bus that's going to be safe I do some rough calculations and then add 10% to the thickness of all the materials etc...
If I want software that I know is safe I have to make an estimate, double it, then allow ten times the development time to fix the bugs. Even if I had fifty peers reviewing my code, bugs are going to slip in because they don't fully understand what I am writing. If you don't believe me, download the kernel source, download a few APIs and check the kernel against them; I bet you'll find a bug. You could even download the DirectX 9 patch from my website and find some bugs if you want!
If I was writing commercial software I would make sure that all code has full modular testing with wide data sets, and possibly introduce random data sets and leave a few boxes running from now till we ship, just testing the code with valid data and junk. I would also make sure that people were given the opportunity to work on 'pet projects', maybe do some OSS hacking in a more R&D area to help keep their skills sharp.
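Something like the "leave a box running with valid data and junk" idea, as a small Python sketch. parse_amount is a hypothetical input routine invented for the example. The loop feeds it a mix of well-formed amounts and random printable garbage; the only acceptable outcomes are a non-negative integer or a clean ValueError, and any other exception escapes the loop and counts as a bug.

```python
import random
import string


def parse_amount(text):
    """Hypothetical input routine: parse '12.34' into integer cents."""
    whole, sep, frac = text.strip().partition(".")
    if not (whole.isascii() and whole.isdigit()):
        raise ValueError(f"bad amount: {text!r}")
    if sep and not (frac.isascii() and frac.isdigit() and len(frac) <= 2):
        raise ValueError(f"bad amount: {text!r}")
    cents = int(frac.ljust(2, "0")) if sep else 0
    return int(whole) * 100 + cents


def soak(rounds=1000, seed=0):
    """Throw valid data and junk at parse_amount; count the outcomes."""
    rng = random.Random(seed)
    accepted = rejected = 0
    for _ in range(rounds):
        if rng.random() < 0.5:
            # Well-formed input, e.g. "1234.05".
            text = f"{rng.randint(0, 9999)}.{rng.randint(0, 99):02d}"
        else:
            # Junk: up to eight random printable characters.
            size = rng.randint(0, 8)
            text = "".join(rng.choice(string.printable) for _ in range(size))
        try:
            value = parse_amount(text)
        except ValueError:
            rejected += 1  # a clean rejection is fine
            continue
        assert isinstance(value, int) and value >= 0, repr(text)
        accepted += 1
    return accepted, rejected
```

On a spare box you would simply raise `rounds` (or loop forever with a fresh seed each pass, logging the seed) and let it run until ship date.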
thank God the internet isn't a human right.
And I don't mean to just Microsoft-bash; they are just an easy target. Apple does it, most of the major Linux distros I've used do it, it seems like it is just the way the software industry works nowadays. And it is insane.
Apple at least seems to be better about it. With one very notable exception where the contents of my iPod were completely erased, all of the software updates I have gotten from Apple have been flawless and for the most part made the product better. This includes point releases as well as security updates.
I don't know the internals of Apple's development process, but I suspect that they are very disciplined in their QA process. I think Microsoft has driven them to this, because one of the prime differentiating characteristics between Apple and Microsoft software is quality.
The trick for beta testing is to get as many eyes on it as possible who know that this isn't a completed or stable product and are able to try funky things to break it.
I agree with that statement.
However, what you've described takes an enormous amount of time and effort and [background] knowledge, to the extent that "try(ing) funky things to break it" could very well become a full time job. Hell, just spending the time necessary to read the documentation [and surf the web looking for "gotchas"], solely for the purpose of figuring out how to INSTALL a piece of software, is d@mned near a full time job.
But in the real world, people get paid to do full time jobs - in fact, they even get paid to do part-time jobs. And if their job title is something like "Senior Testing Engineer for Quality Control", then they get paid sh1tl0ads of money.