Debugging
Debugging explains the fundamentals of finding and fixing bugs (once a bug has been detected), rather than any particular technology. It's best for developers who are novices or who are only moderately experienced, but even old pros will find helpful reminders of things they know they should do but forget in the rush of the moment. This book will help you fix those inevitable bugs, particularly if you're not a pro at debugging. It's hard to bottle experience; this book does a good job. This is a book I expect to find useful many, many years from now.
The entire book revolves around the "nine rules." After the typical introduction and list of the rules, there's one chapter for each rule. Each of these chapters describes the rule, explains why it's a rule, and includes several "sub-rules" that explain how to apply the rule. Most importantly, there are lots of "war stories" that are both fun to read and good illustrations of how to put the rule into practice.
Since the whole book revolves around the nine rules, it might help to understand the book by skimming the rules and their sub-rules:
- Understand the system: Read the manual, read everything in depth, know the fundamentals, know the road map, understand your tools, and look up the details.
- Make it fail: Do it again, start at the beginning, stimulate the failure, don't simulate the failure, find the uncontrolled condition that makes it intermittent, record everything and find the signature of intermittent bugs, don't trust statistics too much, know that "that" can happen, and never throw away a debugging tool.
- Quit thinking and look (get data first, don't just do complicated repairs based on guessing): See the failure, see the details, build instrumentation in, add instrumentation on, don't be afraid to dive in, watch out for Heisenberg, and guess only to focus the search.
- Divide and conquer: Narrow the search with successive approximation, get the range, determine which side of the bug you're on, use easy-to-spot test patterns, start with the bad, fix the bugs you know about, and fix the noise first.
- Change one thing at a time: Isolate the key factor, grab the brass bar with both hands (understand what's wrong before fixing), change one test at a time, compare it with a good one, and determine what you changed since the last time it worked.
- Keep an audit trail: Write down what you did in what order and what happened as a result, understand that any detail could be the important one, correlate events, understand that audit trails for design are also good for testing, and write it down!
- Check the plug: Question your assumptions, start at the beginning, and test the tool.
- Get a fresh view: Ask for fresh insights, tap expertise, listen to the voice of experience, know that help is all around you, don't be proud, report symptoms (not theories), and realize that you don't have to be sure.
- If you didn't fix it, it ain't fixed: Check that it's really fixed, check that it's really your fix that fixed it, know that it never just goes away by itself, fix the cause, and fix the process.
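As a rough illustration of the "divide and conquer" rule (my own sketch, not code from the book), a binary search over an ordered list of versions finds the first bad one in a logarithmic number of tests. The names `first_bad` and `is_bad` are hypothetical, and the sketch assumes the first version is known good, the last is known bad, and every version after the first bad one is also bad.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(versions: Sequence[T], is_bad: Callable[[T], bool]) -> T:
    """Return the first version for which is_bad() is true.

    Assumes versions[0] is good, versions[-1] is bad, and that once a
    version is bad, every later version is bad too.
    """
    good, bad = 0, len(versions) - 1
    # Invariant: versions[good] is good and versions[bad] is bad.
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(versions[mid]):
            bad = mid   # the failure was introduced at or before mid
        else:
            good = mid  # the failure was introduced after mid
    return versions[bad]
```

For example, `first_bad(list(range(100, 200)), lambda build: build >= 137)` returns `137` after testing at most seven builds instead of up to a hundred.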
This list by itself looks dry, but the detailed explanations and war stories make the entire book come alive. Many of the war stories jump deeply into technical details; some might find the details overwhelming, but I found that they were excellent in helping the principles come alive in a practical way. Many war stories were about obsolete technology, but since the principle is the point, that isn't a problem. Not all the war stories are about computing; there's a funny story involving house wiring, for example. But if you don't know anything about computer hardware and software, you won't be able to follow many of the examples.
After detailed explanations of the rules, the rest of the book has a single story showing all the rules in action, a set of "easy exercises for the reader," tips for help desks, and closing remarks.
There are lots of good points here. One that particularly stands out is "quit thinking and look." Too many people try to "fix" things based on a guess instead of gathering and observing data to prove or disprove a hypothesis. Another principle that stands out is "if you didn't fix it, it ain't fixed"; there are several vendors I'd like to give that advice to. The whole "stimulate the failure, don't simulate the failure" discussion is not as clearly explained as most of the book, but it's a valid point worth understanding.
I particularly appreciated Agans' discussions on intermittent problems (particularly in "Make it Fail"). Intermittent problems are usually the hardest to deal with, and the author gives straightforward advice on how to deal with them. One odd thing is that although he mentions Heisenberg, he never mentions the term "Heisenbug," a common jargon term in software development (a Heisenbug is a bug that disappears or alters its behavior when one attempts to probe or isolate it). At least a note would've been appropriate.
The back cover includes a number of endorsements, including one from somebody named Rob Malda. But don't worry, the book's good anyway :-).
It's important to note that this is a book on fundamentals, and different from most other books related to debugging. There are many other books on debugging, such as Richard Stallman et al.'s Debugging with GDB: The GNU Source-Level Debugger. But these other texts usually concentrate primarily on a specific technology and/or on explaining tool commands. A few (like Norman Matloff's guide to faster, less-frustrating debugging) have a few more general suggestions on debugging, but are nothing like Agans' book. There are many books on testing, like Boris Beizer's Software Testing Techniques, but they tend to emphasize how to create tests to detect bugs, and less how to fix a bug once it's been detected. Agans' book concentrates on the big picture of debugging; these other books are complementary to it.
Debugging has an accompanying website at debuggingrules.com, where you can find various little extras and links to related information. In particular, the website has an amusing poster of the nine rules you can download and print.
No book's perfect, so here are my gripes and wishes:
- The sub-rules are really important for understanding the rules, but there's no "master list" in the book or website that shows all the rules and sub-rules on one page. The end of the chapter about a given rule summarizes the sub-rules for that one rule, but it'd sure be easier to have them all in one place. So, print out the list of sub-rules above after you've read the book.
- The book left me wishing for more detailed suggestions about specific common technology. This is probably unfair, since the author is trying to give timeless advice rather than a "how to use tool X" tutorial. But it'd be very useful to give good general advice, specific suggestions, and examples of what approaches to take for common types of tools (like symbolic debuggers, digital logic probes, etc.), specific widely-used tools (like ddd on gdb), and common problems. Even after the specific tools are gone, such advice can help you use later ones. A little of this is hinted at in the "know your tools" section, but I'd like to have seen much more of it. Vendors often crow about what their tools can do, but rarely explain their weaknesses or how to apply them in a broader context.
- There's probably a need for another book that takes the same rules, but broadens them to solving arbitrary problems. Frankly, the rules apply to many situations beyond computing, but the war stories are far too technical for the non-computer person to understand.
But as you can tell, I think this is a great book. In some sense, what it says is "obvious," but it's only obvious as all fundamentals are obvious. Many sports teams know the fundamentals, but fail to consistently apply them - and fail because of it. Novices need to learn the fundamentals, and pros need occasional reminders of them; this book is a good way to learn or be reminded of them. Get this book.
If you like this review, feel free to see Wheeler's home page, including his book on developing secure programs and his paper on quantitative analysis of open source software / Free Software. You can purchase Debugging: The 9 Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.
cause when i do it, it is often re-bugging
What if someone else fixes it?
Make it fail: Do it again, start at the beginning, stimulate the failure, don't simulate the failure, find the uncontrolled condition that makes it intermittent, record everything and find the signature of intermittent bugs, don't trust statistics too much, know that "that" can happen
Isolate the key factor, grab the brass bar with both hands (understand what's wrong before fixing), change one test at a time, compare it with a good one, and determine what you changed since the last time it worked.
Does anyone else feel dirty after reading this?
"If you think you have things under control, you're not going fast enough." --Mario Andretti
> Change one thing at a time: Isolate the
> key factor, grab the brass bar with both
> hands (understand what's wrong before fixing),
> change one test at a time, compare it with a
> good one, and determine what you changed
> since the last time it worked.
This is helpful with unit tests, too. If I find a bug, I want to figure out which unit test should have caught this and why it didn't. Then I can either fix the current tests, or add new ones to catch this.
Either way, if someone reintroduces that particular bug it'll get caught by the unit tests during the next hourly build.
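A minimal sketch of what I mean, using Python's standard `unittest` module. The function `parse_port` and the bug it once had (accepting port 0) are hypothetical, made up purely for illustration:

```python
import unittest


def parse_port(text: str) -> int:
    """Hypothetical function that once had a bug: it accepted port 0."""
    port = int(text.strip())
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


class ParsePortRegressionTest(unittest.TestCase):
    def test_rejects_port_zero(self):
        # Pins the (hypothetical) bug report: "0" used to be accepted.
        with self.assertRaises(ValueError):
            parse_port("0")

    def test_accepts_valid_port_with_whitespace(self):
        # Guards the behavior that already worked, so the fix can't break it.
        self.assertEqual(parse_port(" 8080\n"), 8080)
```

Run it with a test runner such as `python -m unittest` as part of the build; if someone removes the range check, the first test fails.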
The Army reading list
...are always the worst: bugs which disappear when you look for them. Insert a print statement? The bug disappears. Use a debugger? The bug reappears, but in a different place.
Heisenbugs are almost always caused by buffer overflows. They can often be prevented (at least in Fortran 77/90/95/03) by enabling array-bounds checking at compile time; but before I knew about this, I had a hell of a time tracking them down.
Tubal-Cain smokes the white owl.
(this IS slashdot after all)
1) Check your registry.
LINUX ain't got no registry crap!
2) Check your FAT32/CXFS filesystems.
LINUX is JOURNALLED and can do that in the background!
3) Verify your drivers are current.
LINUX is stable with drivers written in COBOL back in the 50's!
4) Defrag your disks.
Defrag?! You must be a WINDOZE LOOSER!!
5) Check your connections on the back of the PC.
HAHAHA! LINUX does that AUTOMATICALLY!!! LOOSERSSSSS!!!
6) Are your cards well seated? Power down and reseat.
HAHAHAAHA! LINUX can HOTSWAP EVARYTHING EVEN CPUs, LOOSERS!
7) Is your OS up to date? Perform a Windows Update.
HAHAAHAHA!!! LINUX can update itself automatically cuz of its LEET HEURISTICS and COOLNESS that MS aint got, LOOSERS!!!
8) Start in "SAFE MODE"
HAHAHA! What's the other? UNSAFE MODE!?!?! LINUX is always safe, LOOSERS!
9) Reinstall Windows.
HAHAHAH! LINUX NEVER NEEDS INSTALLING! Pour the blood from a freshly sacrificed penguin on the disk and it installs AUTOMATICALLY THROUGH AIR!!!!! LOOOOOOOSERS!!!!!
Trolling is a art,
I've read it and it's a good book, but I would
just borrow it from the library and then print
out the poster to remember the 'rules'.
There's not enough meat to keep it on my
precious shelf space.
-- All that's left of me, is slight insanity, whats on the right, I don't know. -- Bob Mould
...to learn how to debug. I only need my own sloppy code.
Regression test suites (if possible) should be maintained so that when bugs get fixed, they stay fixed.
Just my 2 cents.
I can think of a WHOLE lot of techs and admins who really need to follow number 9 a lot closer.
Especially those Windows admins/techs who think 'restart' is the ultimate fix-all. Though, sadly, I suppose in many cases that's about all you can do with proprietary software. Well, that and beg vendors to fix the problem. (We all know how productive that is....)
Nothing about writing code for a test case that exercises the bug, then rerunning it every time you make a change you think will fix the bug? Seems like a big oversight. Any program of reasonable size is going to require wasting a significant amount of time restarting and re-running to the point of failure, and with every manual check of the result, there's an increasing probability that fallible human will make a mistake.
More programmers need to get Test Infected.
the original bug was a hardware problem.
Soul of a New Machine by Tracy Kidder (book teaser) My favorite chapter was The Case Of The Missing NAND Gate.
"Can there be a Klein bottle that is an efficient and effective beer pitcher?"
I can't tell you the number of times I've heard something along the lines of "Resetting your account will fix the problem." Guess what- it doesn't. Then again, after this, I guess I shouldn't expect much.
Children in the backseats don't cause accidents. Accidents in the back seats cause children.
there's a distinction (in real life) and
in the book between troubleshooting something
that's supposed to work (think TV repair) and
debugging something that's never been made
before (hardware design).
Troubleshooting lends itself more to scripted
debugging, and "real debugging" is a bit more
free-form
-- All that's left of me, is slight insanity, whats on the right, I don't know. -- Bob Mould
- Is it plugged in?
- Are you logged in?
- Is it spelled right?
Works in 9 cases out of 10.
Quidquid latine dictum sit, altum sonatur.
"The most likely source of the current bug is the fix you made to the last one."
These "rules" are great, but nothing beats the mystic power of a little goat blood and chicken bones waved over a misbehaving system.
Without these, the average user might be tempted to try and fix it themselves.... Next thing, my job is being "offshored" to a phone bank in India.
No, the chicken bones and a little incantation will keep my job right here, where it belongs.
10) Hammer.
if 10 fails
11) Shotgun.
Congrats problem solved, human destressed.
-- taking over the world, we are.
Chips have bugs, why do you think there are re-spins? We are talking from a design point here, not a "techie-fix-this-shit" point. Different ballgame.
One thing's clear from looking at that list - spend more time on testing your code.
Unfortunately, speaking as an ex-programmer, time is one luxury that PHBs don't afford their minions. A project needs to be completed and knocked out of the door as soon as possible. The less time spent on unnecessary work, the better.
It is also unfortunate that PC users have been brought up expecting to have buggy software in front of them and expecting to have to reboot/reinstall. What motivation is there to produce bug free code when the users will accept buggy code?
Ho well, at least I run my own company now - master of my own wallet - and can concentrate on quality solutions.
Teaching people how to debug isn't that easy. It requires some experience before they get the hang of it.
I'm a stickler for labeling code often, and tracking changes released to production. Because of this, I often seem to be a stick in the mud when it comes to refactoring.
Heavy refactoring makes your code nicer. But when you have to do a lot of debugging on something that worked before refactoring, you can start to appreciate that keeping the change set manageable is a 'good thing'. (I do financial apps, so this may not work for everyone.)
The thing I see people fail at most is the ability to 'bracket' the problem. Go between code that works and code that doesn't work, filtering the problem down to something simple.
The second thing is the inability of some people to go 'deep' in their debugging. Decompile the java/C#/whatever code, trace through the library calls, whatever.
It's nice to see another good book on the market that seems to cover these topics.
0. If you're a software guy blame it on hardware, if you're a hardware guy blame it on software.
0.1. Blame it on the user.
0.2. Blame it on your colleague.
0.3. Blame it on your manager.
0.4. Yell at the computer and tell it to work dammit!
0.5. Put head on keyboard and sob.
0.6. Read Slashdot.
0.7. Post on Slashdot.
0.8. Call it a feature not a bug.
I find that when troubleshooting systems with which other people have worked longer, I have had better luck just asking them simple facts and troubleshooting myself rather than listening to their wild-ass guesses and having to shoot them down.
You can read a sample chapter from the Debugging Rules book in PDF format by going here. (Requires the free Adobe reader.)
The Big News Page
Market the bug as a "random feature".
Are you Corn Fed?
Troubleshooting is what you do to fix your mom's ethernet card. "Oooh, it's on the bottom PCI slot, has no interrupt line. I'll just move it up one slot..."
Debugging is what you do with an oscilloscope to figure out why a particular circuit design isn't working as anticipated. You don't "troubleshoot" a circuit design. You debug it.
Or, to put it another way, "troubleshooting" is what a tech support monkey does. "Debugging" is what an engineer does.
Get off it. I can't think of a single reason why someone can't "debug" hardware or anything else for that matter. The origin of the word comes from a troubleshooting situation anyways. Why should someone be able to debug a relational database but not a relationship?
postmodernsideshow.com
10. Code is _always_ Beta. It's never done until it's no longer in use or support no longer exists.
9. The better the SDK, the more sophisticated the bugs.
8. There's always more bugs in the other guy's (girl's) code.
7. Declaring code bug-free is asking for it to fail at the worst possible time with the greatest visibility.
6. A good design is as likely to have bugs as a bad one. Bugs are equal opportunity.
5. Debugging time is inversely proportional to coding time.
4. If it works the first time, there's a bug, but you won't find it until you roll it out.
3. Debugging is fun. Really! It's when you run out of bugs that you should wonder if you got them all, that's not fun.
2. The most difficult bugs to find are in the most straightforward looking code.
1. That's not a bug, that's a feature.
A feeling of having made the same mistake before: Deja Foobar
because it's truly true.
Actually,
the first computer "bug" was a hardware bug, as it was a moth that flew into a relay and jammed it. Removing the bug physically was debugging. http://www.maxmon.com/1945ad.htm is a reference.
Besides, when you are building a machine and dealing with Logic Gates - it's the same type of debugging as with software logic.
Maybe if you spend a few weekends dealing with MS "bugs" you would have a new appreciation.
BC
Considering that, as of now, no less than six people have pointed out the wrongness of this statement, can somebody please mod him back down to reality? Thanks.
OP and the mods that modded OP up are morons and obviously not anywhere near the hardware design industry.
Okay, haven't read the book, and I guess dhweeler is distilling the rules down to a soundbyte, but isn't #1 the most important and difficult part of debugging? I mean, if I knew system Foo ver. Bar had such-and-such an idiosyncrasy, I could code around it, but Googling for hours to find the one message board post that lets you Understand The System can be aneurysm-inducing. It's not even always the idiosyncrasies of a system -- the sheer volume of stuff you have to learn about I/O conventions, operating systems, etc., in order to write a useful program in a non-toy language boggles the mind. I'm surprised people are able to write programs in the first place.
Like 15 years ago in my intro CSE class, my first Fortran program, which found "edges" in a text file filled with numbers, did this. Everything looked good. It would compile. But it wouldn't print out its little thing. So I insert statements to print out the status of where it is, and it works! I take out the statements and it doesn't. In/out, in/out. So I go ask the TA for help. He says it's one of the damnedest things he's seen, sorry, Fortran isn't something he's really an expert at.
I have hated Fortran for years, having written a single program in it, based on this.
Interesting and fairly well written.
"Can there be a Klein bottle that is an efficient and effective beer pitcher?"
Make It Fail is pretty hard to do when it comes to race conditions. This has got to be the most frustrating kind of bug. Others are referring to the Heisenbug which comes in a variety of flavors.
Sometimes you don't KNOW when there's multiple threads or processes, or when there are other factors involved.
Have you noticed that a new thread is spawned on behalf of your process when you open a Win32 common file dialog? Have you noticed that MSVC++ likes to initialize your memory to values like 0xCDCDCDCD after operator new, but before the constructor is called? It also overwrites memory with 0xDDDDDDDD after the destructors are called. And that it ONLY does these things when using the DEBUG variant build process? Did you know that .obj and .lib can be incompatible if one expects DEBUG and the other expects non-DEBUG memory management?
Someone on perlmonks.org was just asking about a Heisenbug where just the timing of the debugger threw off his network queries. Add the debugger, it works. Take away the debugger, it fails. I've got a serial-port device which comes with proprietary drivers that seem to have the same sort of race condition.
The top 9 rules mentioned here look great. But you could write a whole book on just debugging common race conditions for the modern multi-threaded soup that passes for operating systems, these days.
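One way to "make it fail" with a suspected race is to force the bad interleaving instead of waiting for it. This is only a sketch in Python, not the Win32 setting I described above, and the names `lost_update_demo` and `deposit` are made up for illustration. A `threading.Barrier` holds both threads after their read, so both write back a stale value every time:

```python
import threading


def lost_update_demo() -> int:
    """Force two threads to interleave a read-modify-write on shared state."""
    balance = {"value": 100}
    both_have_read = threading.Barrier(2)

    def deposit(amount: int) -> None:
        seen = balance["value"]            # read
        both_have_read.wait()              # block until the other thread has read too
        balance["value"] = seen + amount   # write back a result based on a stale read

    threads = [threading.Thread(target=deposit, args=(10,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return balance["value"]
```

Two deposits of 10 should give 120, but `lost_update_demo()` returns 110 because one update is lost. Once the failure is reproducible like this, a candidate fix (say, a lock around the whole read-modify-write) can actually be checked.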
If it's Windows, add:
4. Try rebooting.
That fixes almost half of all Windows problems for me.
I find the best way to uncover bugs is to do a demo for your boss's boss.
If all this should have a reason, we would be the last to know.
probably added a step stating that the problem symptoms and causes should be articulated clearly (probably between #3 and #4) before trying to fix anything. I've seen too many engineers/programmers/technicians list symptoms and attack them individually, only to discover that they were related.
On the surface, this flies in the face of "divide and conquer" - but what I'm really saying here is make sure you have the problem bounded before you attack it.
Also, with Step 9, I would have liked to see more emphasis on ensuring that nothing else is affected by the "fix". Making changes to code to fix a problem is often a one step forward and two steps backwards when you don't completely understand the function of the code that was being changed.
All in all, an excellent book in a little understood area.
myke
Mimetics Inc. Twitter
"AAAHHHHHHHHHHHHH!!!!!!"
A feeling of having made the same mistake before: Deja Foobar
They missed a good one: explain the bug to someone.
If you start explaining the bug to someone, there's a good chance in mid-explanation you'll realize a solution to the problem.
Some school (can't remember which) had a Teddy Bear in their programming consulting office... There was a sign. "Explain it to the bear first, before you talk to a human". Silly as it sounds, people would do it, and a large portion of the time they'd never actually have to consult the staff... by explaining it to the bear, they solved the problem.
Weird, but true.
Retry
Reboot
Reinstall
And that's why I love having source code!
He missed a rule: Explain the bug to someone else.
The second pair of eyes often finds the problem
even if they don't have a clue what you are talking
about.
Besides being highly apocryphal - that was the first use of the word bug in context of computing. It is not the first hardware bug by a long shot. Actually you would have known that if you actually read the page you linked to.
One rule he's missed is very important: Before making a measurement (like printing the value of a variable or changing something about the code), work out what answer you expect to see. Note well - do this before you look at the result. When you see something different, either it's a symptom of the bug, or a symptom of you not yet understanding the system. Resolving this will either improve your understanding or turn up the problem.
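A tiny sketch of that habit in Python; the helper name `check_expectation` is hypothetical. It forces you to write the prediction down before the measurement runs:

```python
from typing import Any, Callable


def check_expectation(label: str, expected: Any, measure: Callable[[], Any]) -> bool:
    """Record the prediction first, then take the measurement and compare."""
    print(f"{label}: expecting {expected!r}")
    actual = measure()  # the measurement happens only after the prediction is logged
    if actual == expected:
        print(f"{label}: confirmed")
        return True
    print(f"{label}: SURPRISE, got {actual!r}")
    return False
```

For instance, `check_expectation("round(2.5)", 3, lambda: round(2.5))` reports a surprise, because Python 3 rounds halves to the nearest even number and returns 2. Here the surprise is a gap in understanding the system rather than the bug itself.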
Squirrel!
Mr. Agans' book presents real-life experiences (or, as he calls them, war stories) along with humor-filled comments and anecdotes.
I found myself chuckling and giggling along while reading this book. Some of what he said brought back my own memories of debugging my own software bug(s), or other people's bug(s) that I had somehow 'inherited' because they left the company or were too busy on other projects to debug their own code. I like the metaphors that he uses to explain ideas or concepts that seem a bit too complicated to understand.
Mr. Agans makes this very clear at the beginning of his book: it is not a cover-it-all book, it is a general-concept book on how to isolate, find, and debug something that has gone wrong. The principles presented by Mr. Agans can be applied to situations from everyday life. He presents examples involving a well pump, a light bulb, and so on.
More experienced software/hardware engineers or more experienced problem solvers who read this book might find it covering bases that they already know, but the humor makes it worthwhile.
4. Divide and conquer: Narrow the search with successive approximation, get the range, determine which side of the bug you're on, use easy-to-spot test patterns, start with the bad, fix the bugs you know about, and fix the noise first.
That's a very useful rule. In nearly 20 years of programming I haven't found any tool or technique that works better than printf / std::cout / MessageBox and logging.
Logging is especially important if your users aren't conveniently in the same building as you. When a customer has a problem I've never seen before, I usually tell them to run the program with the -log switch and send me the log. Nearly always this leads to the problem and I can fix the bug within minutes.
Add logging to your app and you'll increase the number of hours you can sleep.
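Roughly what I mean, sketched in Python with the standard `argparse` and `logging` modules rather than my actual code. The function name `configure_logging`, the logger name `"app"`, and the detail that `-log` takes a file name are all assumptions for this example:

```python
import argparse
import logging


def configure_logging(argv=None) -> logging.Logger:
    """Return the app logger; write a detailed log file only when -log is given."""
    parser = argparse.ArgumentParser(description="Hypothetical app with an opt-in log file")
    parser.add_argument("-log", dest="log_file", metavar="FILE",
                        help="write a detailed debug log to FILE")
    args = parser.parse_args(argv)

    logger = logging.getLogger("app")
    # Stay quiet by default; record everything when the user asks for a log.
    logger.setLevel(logging.DEBUG if args.log_file else logging.WARNING)
    if args.log_file:
        handler = logging.FileHandler(args.log_file, encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

The rest of the program then calls `logger.debug(...)` at the interesting points, and the customer runs it with something like `-log run.log` and mails back the file.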
Well, even though I think most people 'round these parts would agree with me that the book covers the fairly obvious, I will say this: it's absolutely necessary to have an "expert" write these things down because all too often, us developers try to proceed and get blocked by management. At my last job, we had a big problem with WebLogic transaction management, some bizarre confluence of events was causing a HeuristicMixedException to be thrown by the platform--by the way, WebLogic people, thanks a lot for naming this exception this way and taking the time to make sure it gets thrown in no less than six totally unrelated (as far as I can tell) circumstances. I love it when exceptions originate ambiguously, from several sources, and no one part of the platform has authority over the problem.
This was a big enough problem that we had to set up a separate, isolated environment to figure out what was going on. 4 out of the 5 architects involved on the project (no it wasn't a huge project--you can see HME wasn't the only problem here) had cemented ideas about what was going wrong...none of them agreed of course...and we had no less than 3 managers with theories based on the idea that the Earth sweeps through an aether against which all things can be measured.
The biggest issue with this testing environment was keeping everyone's mitts off of it, especially those people who didn't have to ask for permissions to the system (the architects, managers...in other words everyone). And the managers didn't agree that it was particularly important to record every step methodically, or limit the number of people making changes to the system to 1 at a time. Instead, they set up a war room and engaged in what I like to call: Fix By Chaotic Typing. (It's chaotic in the sense that, there are definitely patterns to the activity taking place, but you have to be Stephen Wolfram to find and understand them.)
Needless to say, that didn't work. If I'd had access to this book, an authority willing to put the obvious in print might have bolstered my argument that we needed to take resources OFF this issue, not add more. Alas, it was not to be. The bigwigs decided that, since the current manpower wasn't able to track down this bug, it was time to bring in the high-priced WebLogic consultants. We got some 1st generation WebLogic people, 3 of them eventually, and they came in and immediately set themselves to the task of learning our business, telecommunications. And at a mere $150/hour, why not? (Management decided the bug was non-deterministic at some point and this assembly of people was given the informal team moniker: the Heuristics team. I preferred "the Histrionics team".)
So I eventually teamed up with the lead architect on the project and we solved the problem by subterfuge. We had to intentionally set these people working in a direction--everyone, employees and WebLogic consultants alike--that was so off-the-track they actually didn't interfere with any part of the system likely containing the error. This gave us a reasonable amount of time and space to track down the bug in 3 days' time. At only the loss of 6 weeks and several thousand dollars in expenses alone for the WL consultants.
sev
but have you considered the following argument: shut up.
No,
The first rule of bugs is, you do not talk about bugs. The second rule of bugs is, YOU DO NOT TALK ABOUT BUGS!
-From a memo to Microsoft's new employees
debugging is a state of mind /t
#!/usr/bin/english
Obviously, the author has never used C++Builder.
"Describe the problem to someone else."
;-) Even when the person to whom you're explaining it is smart, alert, and interested, it's almost never them that fixes the bug.
This is so effective that it doesn't require the person to whom you're explaining it to pay attention, or even understand. A manager will do.
The process of describing the behaviour of the program as it ought to be versus the behaviour it is exhibiting forces you to step back and consider only the facts. This in turn is often enough to give you an insight into the disconnect between what's really happening and what you know should be happening.
If you catch yourself saying "that's impossible" when debugging some particularly freaky bit of behaviour, it's definitely time to try this.
The input of the other party is so irrelevant in this process that we used to joke about keeping a cardboard cut-out programmer to save wear and tear on the real ones...
--- These are not words: wierd, genious, rediculous
But still... Someone made money off of that? Heck, look for my new book next week, "Walking to Peoria in 3,976,554 steps", with each step being "Place your rear foot 1.5 feet in front of your front foot."
Karma: Chameleon - mostly influenced by bad '80s New Wave music
What is this "unit test" you refer to? If we consider our customer base to be a "unit", does that count?
Yours Truly (All Belongs To Me),
Bill
A good list. As part of rule 8, it's often extremely helpful to look at the problem from a different level of abstraction than one normally would (e.g., different than you coded, or that you best understand it). This often exposes false assumptions that may be blocking a proper analysis.
Successful debugging is a lot like any hard science, particularly if you are not, and cannot become, familiar with the entire system first. Your "universe" is the failing system. You develop hypotheses (failure modes and potential fixes) and run experiments (test them). You have solved the problem only if you completely close the loop (your fix worked, it worked in the way you expected, your hypothesis completely explains the circumstances, and peer review concurs).
A big part of the "art" is cultivating an attitude of how systems are stressed, and how they may fail under those stresses.
I find that when troubleshooting systems with which other people have worked longer, I have had better luck just asking them simple facts and troubleshooting myself rather than listening to their wild-ass guesses and having to shoot them down.
Yes, but within their guesses are sometimes tidbits of information. Last week we had a complaint from a user that every time they clicked this one button on a form, it set off a certain process that wasn't supposed to happen right then, but we knew that there was no connection between that click event and the process. However, I knew he wasn't imagining it.
After investigating, I found that when he opened the form that the button was on, it loaded a timer object that started ticking away, and 5 seconds later initiated the process. Just happens that it takes about 5 seconds from opening the form to click on the button.
Of course, if I'd written the software... well, whatever.
"I have never let my schooling interfere with my education." - Mark Twain
Here's the story from the Jargon File (under "casting the runes"): "A correspondent from England tells us that one of ICL's most talented systems designers used to be called out occasionally to service machines which the field circus had given up on. Since he knew the design inside out, he could often find faults simply by listening to a quick outline of the symptoms. He used to play on this by going to some site where the field circus had just spent the last two weeks solid trying to find a fault, and spreading a diagram of the system out on a table top. He'd then shake some chicken bones and cast them over the diagram, peer at the bones intently for a minute, and then tell them that a certain module needed replacing. The system would start working again immediately upon the replacement."
- David A. Wheeler (see my Secure Programming HOWTO)
...because I get so much practice.
I agree, explaining it to someone else is a really good way to debug things. It's discussed in the book, it's just not identified as a "rule" per se. You're probably right, it should've been elevated to a rule or subrule. The book does have a funny story that one group used a mannequin and made everyone explain their inexplicable problems to the mannequin (same idea as the Teddy Bear). And yes, it worked for them, too!
- David A. Wheeler (see my Secure Programming HOWTO)
Sounds like rules from Bastard Operator from Hell.
Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
It's funny how much time a typical tester spends testing the code that's OK already (has been debugged/fixed) while not touching the bugs outside of the scope of regression test suite. I would mandate ad-hoc testing with periodical (two to three times during the product cycle, basically at two last milestones and in the final version) regression testing. Not the kind of "we've changed this from int to uint, execute your 10000 test cases" kind of stuff I'm seeing way too often. Testers get bored and become dumb if you make them do it two dozen times, thus they miss the higher-level bugs.
Another thing that's totally wrong with testing is breaking down the application by areas in any kind of strict fashion, especially with things like security or performance. This fucks up the most important thing in the app - namely, how well the pieces work together. Individual features may be OK, but there's a lot of unexplored garbage at the seams, and sooner or later seams start to break revealing fundamental problems and total lack of testing in feature interaction.
I daresay this is the first "-1, Informative" post I've ever seen :-)
Windows Debugging Steps:
;^/
1) Re-boot.
2) Re-install.
3) Re-format, Re-boot, Re-install.
4) Re-peat.
I am not surprised at the number of so-called bugs that turn out to be holes in the specifications or tests. Then I tell the complainant "that's the design specification". Then they say "no, that's not" and give me the updated specification.
In fact, popular bug-tracking databases like Scopus usually merge bugs and enhancement requests together, due to this ambiguity.
See Eric Allen's "Diagnosing Java Code" at IBM's Developerworks for lots of debugging info (& not just for Java).
My favorite quote on the subject of debugging:
55 years later, programmers are still spending a large part of their lives finding bugs and fixing them...
At about $1200 a seat for WinXX, that's about a single day's worth of productivity from a coder. (I'm counting all overhead here).
How many times have you spent days or even weeks looking for that one elusive memory overwrite causing your Heisenbug? Memory-checking tools like Purify find those bugs before they cause failures!!!
There really was a bug based on the phase of the moon. See the Jargon Dictionary for more info: phase of the moon:
phase of the moon n. Used humorously as a random parameter on which something is said to depend. Sometimes implies unreliability of whatever is dependent, or that reliability seems to be dependent on conditions nobody has been able to determine. "This feature depends on having the channel open in mumble mode, having the foo switch set, and on the phase of the moon." See also heisenbug.
True story: Once upon a time there was a program bug that really did depend on the phase of the moon. There was a little subroutine that had traditionally been used in various programs at MIT to calculate an approximation to the moon's true phase. GLS incorporated this routine into a LISP program that, when it wrote out a file, would print a timestamp line almost 80 characters long. Very occasionally the first line of the message would be too long and would overflow onto the next line, and when the file was later read back in the program would barf. The length of the first line depended on both the precise date and time and the length of the phase specification when the timestamp was printed, and so the bug literally depended on the phase of the moon!
The first paper edition of the Jargon File (Steele-1983) included an example of one of the timestamp lines that exhibited this bug, but the typesetter `corrected' it. This has since been described as the phase-of-the-moon-bug bug.
However, beware of assumptions. A few years ago, engineers of CERN (European Center for Nuclear Research) were baffled by some errors in experiments conducted with the LEP particle accelerator. As the formidable amount of data generated by such devices is heavily processed by computers before being seen by humans, many people suggested the software was somehow sensitive to the phase of the moon. A few desperate engineers discovered the truth; the error turned out to be the result of a tiny change in the geometry of the 27km circumference ring, physically caused by the deformation of the Earth by the passage of the Moon! This story has entered physics folklore as a Newtonian vengeance on particle physics and as an example of the relevance of the simplest and oldest physical laws to the most modern science.
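For the timestamp-line bug in the story above, here is a minimal sketch of how the overflow could have been caught when the line was written, rather than when the file was read back. The 80-character width comes from the story; the function name, field names and format string are hypothetical, since the original LISP program's format isn't given.

```c
#include <assert.h>
#include <stdio.h>

#define LINE_WIDTH 80   /* assumed maximum line length, per the story */

/* Formats a header line into buf (capacity LINE_WIDTH + 1).
 * Returns 0 on success, -1 if the text would not fit on one line.
 * The format string is illustrative only. */
int format_header(char buf[LINE_WIDTH + 1], const char *timestamp,
                  const char *moon_phase)
{
    /* snprintf returns the length the full text would have needed,
     * so a value above LINE_WIDTH means the line was truncated. */
    int needed = snprintf(buf, LINE_WIDTH + 1, "Written %s, moon %s",
                          timestamp, moon_phase);
    return (needed >= 0 && needed <= LINE_WIDTH) ? 0 : -1;
}
```

The point is only that the writer checks the length on every call, so the rare long phase description is reported immediately instead of surfacing later as an unreadable file.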
cpeterso
it's only 192 pages and cost < $20USD :)
That "suitably aligned for any use" is critically important - it means that malloc has to return memory in blocks of 8 bytes (on most systems). That means that when you ask for 1 byte, you really get 8 and have 7 bytes of slop to overwrite before you really have a problem.
But if you ask for 8 bytes you have no room for any error...
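To put numbers on that, here is a small sketch of the rounding arithmetic. It assumes the 8-byte granularity mentioned above; real allocators differ, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed allocation granularity, taken from the "8 bytes (on most
 * systems)" figure above. Real allocators may use other values. */
#define ASSUMED_GRANULE 8u

/* Round a request up to the next multiple of the assumed granule. */
size_t rounded_size(size_t request)
{
    return (request + ASSUMED_GRANULE - 1u) / ASSUMED_GRANULE * ASSUMED_GRANULE;
}

/* Bytes past the requested size that still lie inside the rounded block. */
size_t slack_bytes(size_t request)
{
    return rounded_size(request) - request;
}
```

Under that assumption a 1-byte request leaves 7 bytes of slack and an 8-byte request leaves none, which is why an off-by-one overwrite can hide for a long time and then show up when a structure grows to fill its block exactly.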
For whatever reason, Slashdot's code eliminated the '&' that I put in front of x and y in the 3rd and 4th cases, so keep in mind that the assignments should read c = &y and d = &x, as restored below.
int a = 0;
int *b = NULL;
int **c = NULL;
int *d = NULL;
{
    int x = 0;
    int *y;
    y = malloc(sizeof(double));
    a = x;   /* this is fine, we're just passing data */
    b = y;   /* this is OK too, y's block of memory is allocated out of the heap */
    c = &y;  /* WRONG!! y itself is allocated locally. Remember, we're talking
                about the address of y, not the address y is pointing to. The
                address of y, like the address of x, refers to locally
                allocated data. */
    d = &x;  /* this is also wrong, since the address of x points to memory
                that is pushed on the function stack */
}
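For contrast, a minimal sketch of two ways to get data out of a scope without keeping the address of a local. The helper names are hypothetical, not from the post above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: returns heap storage that outlives the call.
 * The caller owns the result and must free() it. */
int *make_heap_int(int value)
{
    int *p = malloc(sizeof *p);
    if (p != NULL) {
        *p = value;
    }
    return p;
}

/* Copies a local value out instead of exposing the local's address. */
void copy_out(int *dest)
{
    int local = 42;
    assert(dest != NULL);
    *dest = local;   /* fine: the value is copied, not the address */
}
```

In both cases nothing that lives on the function stack is referenced after the function returns.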
If it takes more time to write automated test than to execute the test case 5 or 6 times - fuck the automation. Again the main problem with it is that it will test the code that's already "good" without going to unexplored code paths. This gives a tester (and a developer) a false sense of security. Automation suite passes, so there are no more bugs, right? WRONG!
I'm not saying automation is useless. There are certain tests that are nearly impossible to do without automation, I'm just saying that most of the time cost outweighs the benefits.
I can't believe he didn't add that one.
I cannot tell you how many times I've found problems simply by comparing with something similar which works, and changing the thing that works to be more like the thing that doesn't work one step at a time until it quits working. E.g. back up to previous working versions in CVS, and start applying changes until the problem reappears. Sometimes this is easier than actually debugging the problem, especially if lots of people are working on something and the code is stuff you aren't that familiar with. You can often narrow the problem down to a small diff, and the problem will just be plainly visible, or if it isn't plainly visible, at least you have a good clue that might suggest other areas to look at.
Also works for debugging network hardware problems. Or finicky jet ski engines.
It's related to "change one thing at a time" I suppose, but not quite the same.
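If the history is long, the "apply changes until it breaks" search can be done by halving instead of stepping one change at a time. That is a variation on what's described above, not the same procedure. A rough sketch, assuming revisions are numbered, the oldest one works, the newest one fails, and everything after the first failing revision also fails. The breaks_at_37 function is a hypothetical stand-in for "check out revision r, build it, run the test".

```c
#include <assert.h>

/* Returns the first failing revision in (good, bad], given that `good`
 * passes, `bad` fails, and failures persist once they start. */
int first_bad_revision(int good, int bad, int (*is_bad)(int revision))
{
    assert(good < bad);
    while (bad - good > 1) {
        int mid = good + (bad - good) / 2;
        if (is_bad(mid)) {
            bad = mid;    /* failure is at mid or earlier */
        } else {
            good = mid;   /* failure is after mid */
        }
    }
    return bad;
}

/* Hypothetical test oracle: pretend the bug went in at revision 37. */
int breaks_at_37(int revision)
{
    return revision >= 37;
}
```

With 100 revisions this needs about seven build-and-test rounds instead of up to 99, and the diff between the last good and first bad revision is the small diff to stare at.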
"63,000 bugs in the code, 63,000 bugs,
Ya get 1 whacked with a service pack,
Now there's 63,005 bugs in the code!"
-- from a Slashdot post
hehe...how does one solve this problem?
Even a stopped clock gives the right time twice a day.
Wow, I'm glad I'm coding in Java.
Even when garbage collection and VM-managed pointers meant significantly slower performance (not anymore in most situations), it was worth it.
And I sat down years ago and learned Java threading inside and out, and I always manage thread communication very carefully (which is easy - define your thread boundaries well, and there's not much code at risk)... and voila, no Heisenbugs.
I actually like debugging nowadays. It's like being a storybook detective -- you always get your man.
There are only 10 types of people: those who understand decimal, those who don't, and, uh, 8 other types I forget.
You: (insert long story about ice cream and driving)
Mechanic: (translation mode ON) So, you drive to the store and when you get back in your car, sometimes it won't start?
You: Yes (insert long story about ice cream)
Mechanic: (translation mode ON) Sounds like vapor lock. Let me take a look at the carb.
Moral of this story: Good techs can extract the relevant facts from the fluff and quickly diagnose the real problem.
Slashdot needs to do some.
If I click the Read More link, no forum items appear after the text. If I change my threshold, I still see no items. I have to go back to the Slashdot front page, and then click the link beside the Read More link in order to see the responses.
Examine the input data. Often it isn't a bug. Often the program is doing an entirely reasonable thing given the input data. Or perhaps the program mishandled bad input data (in which case it is a bug, but now you know what to look for).
I don't remember the details exactly but we were testing a server for several weeks that was built on some version of the msvc++ debug mfc libraries. They put in some kind of code to track memory allocations numerically. So, let's say you wanted to tell the debugger to stop on the 3238th "new" -- somehow you would set this up using some precompiler defines and it would count up all the memory allocations until #3238 and it would break into the debugger right then. Well, folks -- guess what happened when that counter overflowed (after a few weeks of my program running)? You guessed it:
This program has performed an illegal operation and will be shut down.
With me saying "wtf" and having no idea what's going on except that my program just wants to die after it's been running for too long.
Since /. is the place to do it, and I'm anon, I'll cast aspersions -- what do y'all think, was their convenient forgetting to check for overflow on this integer a deliberate mistake, since you're not supposed to (because of license) ship a product (especially a server) linked to their debug libraries? Kinda gets in the way of testing though...
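For what it's worth, a "break on the Nth allocation" counter doesn't have to fall over when it runs out of range. A minimal sketch, which is emphatically not the MFC debug library's actual code (the struct and function names are invented here): the counter saturates instead of overflowing.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical allocation tracker, for illustration only. */
struct alloc_tracker {
    unsigned long count;     /* allocations seen so far */
    unsigned long break_at;  /* 0 means "no breakpoint requested" */
};

/* Records one allocation. Returns 1 when the requested allocation number
 * is reached, 0 otherwise. The counter stops at ULONG_MAX instead of
 * wrapping, so a process that runs for weeks cannot overflow it. */
int tracker_on_alloc(struct alloc_tracker *t)
{
    assert(t != NULL);
    if (t->count < ULONG_MAX) {
        t->count++;
    }
    return t->break_at != 0 && t->count == t->break_at;
}
```

Once the counter saturates, the breakpoint feature quietly stops being useful, but the program keeps running, which is the behaviour you want from a debugging aid.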
Here's what I've always done.
1. Figure out what the code should do.
2. Figure out what it does.
3. Figure out where 1 and 2 diverge.
If you have a reproducible error, you really don't need any great skills or brilliant thinking to find and fix a bug this way.
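Step 3 can be mechanical if both the expected and the actual behaviour are recorded as traces of values. A small sketch, assuming each trace is just an array of ints captured at the same checkpoints (the function name is made up):

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the first position where the two traces differ,
 * or n if they agree on all n entries. */
size_t first_divergence(const int *expected, const int *actual, size_t n)
{
    assert(expected != NULL && actual != NULL);
    for (size_t i = 0; i < n; i++) {
        if (expected[i] != actual[i]) {
            return i;
        }
    }
    return n;
}
```

The checkpoint just before the returned index is the last place the program was still right, so that's where to start reading code.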
This is Informative!
This issue is a bit more complicated than you think.
I usually test new code from my desk, but the equipment is in the lab where it's (usually) connected via the LAN.
GAAAAAAHHHHH!!!!
One of the most confusing bugs I ever encountered was when I determined the obvious fix and then got the same conditions as before.
The fix looked good and showed as expected in a dump. I finally determined, by careful code checking, that there was another error downstream which produced exactly the same results. The second fix resolved the matter.
This is a very ancient experience which occurred on an IBM 704, noticeably before there was any such thing as an operating system.
I had written an application which, when run, wiped 32K of RAM words clean with the same image in every word. I had to get after-hours computer time and proceeded to insert a print statement followed by an abort in the source code (Fortran) in order to track the problem down.
Turned out to be an input error. The input utility I was employing used the character in column one of the data cards to indicate the type of data.
I finally determined that my assistant, who had prepared the input data for my test case had put a 3 instead of a 4 in the first column of a data card resulting in what would have been a quite reasonable integer value becoming a floating point number in the machine.
That resulted in a very long-running loop which stored the same number in every word of the machine's RAM.
Thereafter, every application that I wrote had an input data check for every integer value to ensure that its value did not exceed the source code dimension of the array into which data were to be written.
This was Fortran around 1960. The Ur days!
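The same check translated to C, as a rough sketch. The table size, names and return codes are invented for illustration, and the original was of course Fortran.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_DIM 100   /* hypothetical declared dimension of the array */

/* Returns 1 if `count` elements fit in an array of `dimension` elements. */
int count_fits(long count, size_t dimension)
{
    return count >= 0 && (unsigned long)count <= dimension;
}

/* Stores `value` in the first `count` slots of the table.
 * Returns 0 on success, -1 if the input count is out of range. */
int fill_table(int table[TABLE_DIM], long count, int value)
{
    assert(table != NULL);
    if (!count_fits(count, TABLE_DIM)) {
        return -1;   /* reject bad input before touching memory */
    }
    for (long i = 0; i < count; i++) {
        table[i] = value;
    }
    return 0;
}
```

A count that arrives as garbage (say, a float's bit pattern read as an integer) fails the range check on the way in, instead of driving a loop across all of memory.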
Murphons are failure particles that radiate from certain individuals. They were named after Captain Murphy of "Murphy's Law" fame.
Debunking the "59 Deceits"
Don't know what planet you're programming on.
Sometimes things can manage to "fix themselves" simply because of problems that started outside of your scope of control. Not too long ago, I was writing some code for my own spin on phpBB, called OmniBB. One night, I was getting all sorts of error pages from the server (Apache ran on Windows). I finally gave up and went to sleep, with visions of computer bugs dancing in my head. The next morning, I tried it again, and lo and behold, it worked! The problem might not be with your code, but a tool that your code relies on. In cases like this, it wouldn't help me to fix the tool (Apache) or get a different one, since it wouldn't be very useful if it didn't work on Apache servers, would it?
> beginning programmers spend more time debugging code than they do writing code,
> so why isn't that activity stressed?
Because you should know what your code is doing. If you can not figure out which piece of code is failing you either have a bad design, where the division of labour is inadequate, or an inadequate error reporting mechanism. When I have a bug and I find the cause, I always try to figure out how the error reporting could be improved. Throw those exceptions, put in those asserts, and print clarification messages for errors whenever you can (probably with an #ifndef NDEBUG around the last). The goal is to be able to diagnose any bug simply by looking at the error message, which I can already do in most cases thanks to the aforementioned practices.
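A minimal sketch of what that can look like in C. The DEBUG_LOG macro and safe_divide routine are hypothetical names, not from any particular codebase. The assert catches programming errors in debug builds, and the message names the routine and the offending value.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Diagnostic output that disappears from release builds (NDEBUG defined). */
#ifndef NDEBUG
#define DEBUG_LOG(...) fprintf(stderr, __VA_ARGS__)
#else
#define DEBUG_LOG(...) ((void)0)
#endif

/* Hypothetical routine: integer division that says why it failed.
 * Returns 0 on success, -1 on invalid operands. */
int safe_divide(int numerator, int denominator, int *result)
{
    assert(result != NULL);   /* caller bug: caught in debug builds */
    if (denominator == 0) {
        DEBUG_LOG("safe_divide: denominator is 0 (numerator=%d)\n", numerator);
        return -1;
    }
    if (numerator == INT_MIN && denominator == -1) {
        DEBUG_LOG("safe_divide: INT_MIN / -1 overflows\n");
        return -1;
    }
    *result = numerator / denominator;
    return 0;
}
```

The failure path leaves *result untouched, so a caller that ignores the return code at least doesn't get a fresh garbage value.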
Didn't you know? A restart cleanses the soul of the computer and therefore the problems go away.
Several posters have pointed out useful rules that didn't make the book author's list, such as comparison with a known good example or explaining the problem to someone else.
It's also worth noting that some of this terrain has also been codified into the "Universal Troubleshooting Process" here.
Tyler
Create semaphore lock/unlock routines that ALSO create/destroy the semaphores, write a for() that loops a million times on these locks/unlocks over and over, and then run 3 of these processes.
:)
No one likes the idea of having to hit the big red switch to bring your IBM AIX system back. I should seriously publish this code to IBM, but do you really think they care very much for their SCO licensed AIX OS right now?
- Write down the problem.
- Think very hard.
- Write down the solution.
It has worked for me many many times.
You can always use a garbage collector in C/C++, or use smart pointers in C++. I guess that still doesn't solve misuse of allocated data. I don't know Java but I'm guessing it's possible to overrun an array in it as well, correct me if I'm wrong.
Try "65 years later..."
the off-by-10 bug strikes again!
You don't "troubleshoot" a circuit design. You debug it.
Actually, I just started designing analog circuits about a month ago. I can tell you with total confidence that my technique is still distinctly troubleshooting: hmm... I wonder if changing these two capacitors from tantalum to polyester film will eliminate the buzz... hack hack hack.
Stop-Prism.org: Opt Out of Surveillance
You seem to have screwed up your own argument. Here, let me help out.
You might have discussed "fools who thank their technology for limiting their control over the computer" or something along those lines. Then you might have had a leg to stand on, because of course there are trade-offs when you give up explicit control over your pointers, for example (a point that I touched on).
But I'll eat my keyboard if you can find me one programmer who scorns a technology purely because it limits their errors.
There are only 10 types of people: those who understand decimal, those who don't, and, uh, 8 other types I forget.
YFI!
There is one that appears to be left out (from the summary, perhaps not from the book - I haven't read it): fix it everywhere.
Once you have found a bug, search the rest of your tree for similar bugs. Chances are that you will find and fix several. This is especially true of bugs caused by bad assumptions.
FYI: This is one of the central audit methodologies of the OpenBSD project. It works much better for the BSDs as they keep the entire system in one CVS tree, rather than scattering it around FTP servers in the forms of tarballs. The whole system is readily available to search for entire classes of bugs.
"65,535 bugs in the code, 65,535 bugs,
Ya get 1 whacked with a service pack,
Now there's 2 bugs in the code!"
Thanks... I am in the middle of debugging a problem with some hardware, trying to track down a lost 400ps of time. Been working on this for the last 4 days... The nine rules apply to software & hardware systems IMHO. Was pondering theories earlier yesterday... should have been reporting symptoms! I'd add one personal favorite. Usually, if you blame the system or the compiler, or some other "Monolithic" entity, it's because it's your fault; it will just take you longer to find it, after you dig out from under your embarrassment for blaming the "system". So... I always assume I personally hosed up until I can prove otherwise. That way I'm pleasantly surprised if it's someone else's bug.
Ross Youngblood
and objects to being debugged.
I find your statement a bit baffling.
To even know that there is a bug, you need to know what the code is supposed to do. How else are you going to know it doesn't do it right? And you also need to know what it actually does. How else are you going to know that what it does isn't right?
How you would attempt to rewrite an application from scratch that you don't know what it's supposed to do is also something I find hard to fully buy in to.
Not that I've invented anything special or anything. It's just common good problem solving practice. In terms of the book rules, I focus on 3, 4, 5, 9, and occasionally 7.
Sure, it can take a lot of time and work sometimes. Days or even weeks. Some things are complicated and take time to do. But I fix my bugs for real. And that, sadly, makes me a fairly small minority among today's programmers.
I've been programming professionally on planet earth for 20 years on both big and small applications, good, bad and awful designwise.
"And don't forget the phase of the moon, or for the truly unlucky, intermittently glitchy hardware."
Tell me about it. I had purchased a high-end SVHS VCR about 13 years ago. When it worked it was a wonder to behold. "When" being the operative word. When I had it at home it would intermittently spin its motors, click a few times, then raise the loading table and buzz loudly. This was with no tape in the machine. It also would occasionally damage tapes. Repeated trips to the shop, and even to the manufacturer, didn't resolve the issue. I last took it to a shop that "claimed" that the heads were shot, and they wanted a lot of money to fix it. I had had it at this point, so I put it in my attic and bought a new one. My current one is about seven years old and, beyond the usual maintenance, has worked well. I think I'll get one of those DVD recorders next.
Corollary 1: Use a garbage-collected, type-safe language (there go a huge mass of bugs), whenever possible.
Corollary 2: Code defensively - if your routine complains the first time it's called with garbage input, you won't have to look at the output of 20 or 30 runs before somebody notices that something's fishy.
Corollary 3: Write modular code, with clean and minimal interfaces, i.e. K.I.S.S.
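As a sketch of Corollary 2, here is a routine that refuses garbage on the first call instead of quietly producing a fishy answer. The function and status names are made up for illustration.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

enum mean_status { MEAN_OK = 0, MEAN_BAD_ARGS, MEAN_BAD_SAMPLE };

/* Computes the arithmetic mean of n samples into *out.
 * Rejects null pointers, empty input, and NaN or infinite samples. */
enum mean_status checked_mean(const double *samples, size_t n, double *out)
{
    if (samples == NULL || out == NULL || n == 0) {
        return MEAN_BAD_ARGS;
    }
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (!isfinite(samples[i])) {
            return MEAN_BAD_SAMPLE;   /* complain now, not 30 runs later */
        }
        sum += samples[i];
    }
    *out = sum / (double)n;
    return MEAN_OK;
}
```

Without the isfinite check a single NaN sample would silently turn every downstream result into NaN, and somebody would eventually notice "something's fishy" far from the cause.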
I have a lot of experience in finding and fixing difficult bugs. In my experience, the most important thing you can do is when you find a bug, stop and think how you could have caught this bug automatically. If you practice this policy, you end up with very solid code. Basically, in the debug build, no function should ever crash the program no matter what garbage you put in the parameters - it should report an error and stop.
I think writing solid code is all in the attitude of the programmers - I had one guy who had a memory overwrite bug that was corrupting some characters in his string table when he called a certain function. Do you know how he fixed it? He wrote some code that put the right characters back over the corrupted ones after the call to this function!!! If you have that attitude, things WILL blow up in your face...
The interactive way to Go -- http://www.playgo.to/iwtg/en/
You forgot:
5) ???
6) Profit!
The Tao of math: The numbers you can count are not the real numbers.
If you overrun or underrun an array you get an ArrayIndexOutOfBoundsException thrown, and the read/write doesn't occur, so no corruption occurs (except that your operation failed).
There is something sublime about knowing when to use a big hammer to fix something. I fixed the oven door by hitting it hard enough near the hinge to fix the problem of a slightly worn part - oh so satisfying! Of course I wouldn't admit to any times I might have used a hammer and completely screwed something up!
Happy moony
What the hell does "grab the brass bar" mean, anyway???
Slashdot quality declines as the number of hot grits posts decreases. - Provolt's Law, Apr-09-2005
5) Re-dhat
Just teach programming courses for a few years! Then you will learn to think of every possible cause for a bug, because otherwise you are only used to looking for bugs that come from mistakes you usually make (say, loops going from 0 to n on an n-sized array), not for mistakes you assume you would never make... and that you actually do make. When you look at someone else's code (or electronic circuit, or any kind of experiment for that matter), you realize that you need this particularly open state of mind to find bugs, especially under the pressure that students are very good at applying (like: OK, I program like an idiot and forgot 90% of what you explained, but I expect you to find the bugs in less than one second in this 1000-line kludge because I need to pass the class). Of course, a good knowledge of all the classic mistakes, debugging tools and switches (array bounds checking as pointed out above, post-mortem core dump analysis, stack printouts after a segv, tests with known results, checks with various optimization flags, since compilers can be buggy) also helps.
Google passes Turing test : see my journal
And I didn't say you should rewrite the application from scratch. I said it would take someone that long if they used your "approach".
Here's how to win the Master's. For each swing...
- Figure out where you want the ball to end up.
- Hit it there.
I'm sure Mike Weir will be grateful for this advice. Then again, maybe not.
Love the .sig!
It's like anti-Japanese protectionism vs. anti-Mexican protectionism.
Joe
http://www.joegrossberg.com
I tend to explain things in an email (it helps if it's to a real person, though you don't need to send it when you're done). Sometimes I'll end up rewriting the entire mail in the process, so that all that's left is the solution; but even if not, it usually clarifies and limits the problem.
My other debugging technique, when I'm really stumped, is to pace up and down the office muttering to myself. Really! I get strange looks, but there's nothing like leg movement and a steady rhythm (careful, now!) for getting you thinking, and each time you turn around it helps you look at the problem slightly differently.
Ceterum censeo subscriptionem esse delendam.