How Do You Deal w/ "Heisenbugs"?
horos1 asks: "I was wondering how people out there deal with 'Heisenbugs': bugs that have no logical, programmatic cause (mostly from C and C++ programs and especially threaded C/C++ programs), that may change or disappear if you modify the state of the program.
For example: we have a multi-threaded C++ app which cores in about 5 places, at a different memory address each time, and the crashes disappear if we turn threading off. They seem to be caused by a memory overrun error, but this too is exceedingly hard to fix with tools like purify, because they tend to give several 'false positives' on memory errors, as well as core when we link with certain libraries. Anyways, this is getting very annoying... Any help with this, as well as pointers on how to deal with bugs like this would be greatly appreciated."
You could use a "Heisenberg Compensator".
printf() statements for debugging multithreaded code are a really bad idea. It's an I/O, and hence blocking call, and can therefore affect the way your threads are scheduled - resulting in mysterious things, e.g. the problem disappears when you use the printf(), but reappears if you comment it out.
Several useful suggestions have already been posted. One additional one: Initialization.
Rig macro wrappers for malloc() and calloc() and the like that initialize new memory to 0xdeadbeef if you have a memory debug flag active. This should make your program crash more abruptly when it starts doing something questionable.
This won't help much for finding the race condition or non-thread-safe code causing the problem, but it may give you some idea of what's being stomped, and make the heisenbug more predictable.
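A minimal sketch of such a wrapper, in case it helps. The names MEM_DEBUG, dbg_poison, dbg_malloc and XMALLOC are made up for illustration, and the flag defaults to on here only so the sketch does something when compiled as-is. It leaves calloc() alone, on the assumption that callers of calloc() may legitimately rely on zeroed memory.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical debug flag; normally you would set it from the build. */
#ifndef MEM_DEBUG
#define MEM_DEBUG 1
#endif

/* Fill n bytes at p with the 32-bit word 0xdeadbeef, repeated.
 * A trailing partial word gets the first few bytes of the pattern. */
void dbg_poison(void *p, size_t n)
{
    const uint32_t word = 0xdeadbeefu;
    unsigned char *b = p;
    size_t i = 0;

    assert(p != NULL || n == 0);
    for (; i + sizeof word <= n; i += sizeof word)
        memcpy(b + i, &word, sizeof word);
    if (i < n)
        memcpy(b + i, &word, n - i);
}

/* malloc() replacement that hands back poisoned memory. */
void *dbg_malloc(size_t n)
{
    void *p = malloc(n);
    if (p != NULL)
        dbg_poison(p, n);
    return p;
}

/* Call XMALLOC() everywhere instead of malloc(). */
#if MEM_DEBUG
#define XMALLOC(n) dbg_malloc(n)
#else
#define XMALLOC(n) malloc(n)
#endif
```

A pointer or index read out of fresh memory then shows up as 0xdeadbeef in the debugger instead of as whatever happened to be lying around, which is usually easier to spot in a core file.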
Patient: "Doctor, when I write multi-threaded programs in C++ they dump core all over the place and I don't understand why!"
Doctor: "Don't do that."
I know it sounds glib, but it really is the heart of the matter. Writing proper multi-threaded programs is difficult and takes a fair amount of skill. If you or your programming staff don't have the skill to do it, you are better off sticking to something less challenging. This isn't meant as a put down, either: there are plenty of simple solutions to problems that, at first blush, may look like they require threads (or dynamic memory allocation, or any of a number of other complex and error prone tactics). You could replace your multiple threads with a polling loop and a switch statement, or spawn completely separate processes and communicate through pipes (you might be surprised how much performance you can get out of either solution compared to the threaded code).
If you feel that you really must use the cool threaded code (or a complex, dynamically allocated data structure, or whatever) then you may need to dedicate a week or two to a careful re-examination of the code and the accompanying re-implementation to eliminate race conditions, memory leaks, mutual exclusion errors, etc. While there are a lot of cool debugging tools out there that can help you find some kinds of errors, there is really no substitute for a deep and thorough understanding of your code. You can either spend the time understanding the mess you have, or try replacing it with something less complex and easier to maintain (but, maybe, harder to extend/scale/etc.).
Here are some of my favorite high-tech problems with low-tech solutions:
- Multi-threading: multiple processes or polling loop with switch statement.
- Search Trees or other tree structures: hash tables.
- Linked lists or other linked structures: dynamically allocated arrays (double when size>=length, halve when size<length-n, like Java vectors)
- Complex decision logic: state machines.
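As a rough illustration of the "dynamically allocated array" item, here is a sketch in C. IntVec and its functions are hypothetical names. It doubles the capacity when the array is full; for shrinking it halves when less than a quarter is in use, which is one common choice of threshold and an assumption on my part rather than the exact rule in the list.

```c
#include <assert.h>
#include <stdlib.h>

#define INTVEC_MIN_CAP 4

typedef struct {
    int    *data;
    size_t  size;      /* elements in use */
    size_t  capacity;  /* elements allocated */
} IntVec;

/* Resize the backing store; 0 on success, -1 if realloc() fails
 * (in which case the old storage is left untouched). */
int intvec_reserve(IntVec *v, size_t new_cap)
{
    int *p = realloc(v->data, new_cap * sizeof *p);
    if (p == NULL)
        return -1;
    v->data = p;
    v->capacity = new_cap;
    return 0;
}

/* Append a value, doubling the capacity when the array is full. */
int intvec_push(IntVec *v, int value)
{
    if (v->size >= v->capacity) {
        size_t new_cap = v->capacity ? v->capacity * 2 : INTVEC_MIN_CAP;
        if (intvec_reserve(v, new_cap) != 0)
            return -1;
    }
    v->data[v->size++] = value;
    return 0;
}

/* Remove and return the last value, halving the capacity once
 * less than a quarter of it is in use. */
int intvec_pop(IntVec *v)
{
    assert(v->size > 0);
    int value = v->data[--v->size];
    if (v->capacity > INTVEC_MIN_CAP && v->size < v->capacity / 4)
        (void)intvec_reserve(v, v->capacity / 2);  /* failing to shrink is harmless */
    return value;
}
```

Start from a zeroed struct (IntVec v = {0};) and free(v.data) when done. Leaving a gap between the grow and shrink thresholds keeps a push/pop sequence sitting on a boundary from reallocating on every call.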
Admittedly, sometimes you really do need to go with the more complex solution, but it is best to avoid it whenever possible. Besides, you can always put the complex stuff in the next version.
Debugging tips
0. If you are running Linux: apply the patch that causes the thread that had the segfault to dump core. The default Linux semantics (under 2.2.x at least) are for the threads to exit STARTING with the one that had the problem. Then, the LAST thread dumps its program counter / stack info into your core file. The result: you get what look like "random" crashes when really they aren't very random.
There is a patch which fixes this behavior.
1. See if you can get your program to crash in a debugger or dump core. From your comments, I presume you are already getting this.
2. Record the places that you get crashes. Each time anyone gets a crash, have them record where, as best as they can.
Try to figure out what's getting overwritten, even if it's not clear when or how.
3. Try to increase the frequency of the crash (e.g. by running on an SMP machine). This usually provides people with more incentive and a better chance to test if any given change really fixed things.
4. The next few hints fall into the category of "reduce the problem code". It sounds to me like you don't have a good feeling as to where the problem is happening. Try to eliminate sections of the code, by any means necessary. Examples include one big lock to force serialization, test programs that only exercise certain modules of your code, etc. I know there is often a temptation to just jump in, but some extra scaffolding to reduce the possible problems is almost always valuable on hard debugging problems.
Reduce the amount of data shared between threads. We are using a message passing interface, where each thread more or less has its own data. This has been a big win. We often copy data before passing on to the next thread, just to make sure.
5. Understand what is and isn't thread safe in your libs. For example, did you know that it is almost impossible to make the C++ std::string thread safe without changing the implementation? That's because the implementation is copy on write. So, even when you're not sharing any data between threads, you are....
I hope some of these help. Threading problems aren't easy, particularly in C++...
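The "one big lock" idea from hint 4 might look something like this with POSIX threads. Everything named here (g_big_lock, suspect_update, worker, run_workers) is a stand-in, not anything from the original poster's app. The point is only that every entry into the suspect code takes the same mutex, so that code effectively runs single-threaded while the rest of the program stays threaded.

```c
#include <assert.h>
#include <pthread.h>

/* One global lock taken around every call into the suspect code. */
pthread_mutex_t g_big_lock = PTHREAD_MUTEX_INITIALIZER;
long g_counter = 0;   /* stand-in for whatever state the threads share */

/* Stand-in for an entry point into the code you suspect. */
void suspect_update(void)
{
    pthread_mutex_lock(&g_big_lock);
    g_counter++;                      /* unsafe without the lock */
    pthread_mutex_unlock(&g_big_lock);
}

void *worker(void *arg)
{
    long iterations = *(const long *)arg;
    for (long i = 0; i < iterations; i++)
        suspect_update();
    return NULL;
}

/* Run nthreads workers to completion and return the final counter. */
long run_workers(int nthreads, long iterations)
{
    pthread_t tid[16];

    assert(nthreads > 0 && nthreads <= 16);
    g_counter = 0;
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tid[i], NULL, worker, &iterations);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return g_counter;
}
```

If the crashes stop with the lock in place, the bug is probably a race somewhere inside the locked region, and you can shrink the region step by step. If they continue, you have at least ruled that code out as the place where the threads collide.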
"Heisenbugs", as you call them, are almost always the result of memory management bugs.
Absolutely. In about 75% of the cases I've seen they were from clobbering something on the stack. In one case I built malloc() and free() wrappers and preceded every array or memory reference with a (#ifdef DEBUG) check to make sure the index was in bounds. I found dozens and dozens of cases where the indexes went out of bounds.
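For anyone who wants to try the same thing, a sketch of that kind of #ifdef DEBUG index check follows. The macro name AT and the helper functions are invented for this example. The caller has to pass the element count explicitly, since C arrays don't carry their length around.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* True if i is a valid index into an array of n elements. */
int index_in_bounds(long i, size_t n)
{
    return i >= 0 && (unsigned long)i < n;
}

/* Return i as an index, or report where the bad access is and abort. */
size_t checked_index(long i, size_t n, const char *file, int line)
{
    assert(file != NULL);
    if (!index_in_bounds(i, n)) {
        fprintf(stderr, "%s:%d: index %ld out of bounds [0, %lu)\n",
                file, line, i, (unsigned long)n);
        abort();
    }
    return (size_t)i;
}

/* AT(arr, n, i) is usable on either side of an assignment.
 * With DEBUG defined every access is checked; otherwise it is a plain
 * subscript with no overhead. */
#ifdef DEBUG
#define AT(arr, n, i) ((arr)[checked_index((long)(i), (n), __FILE__, __LINE__)])
#else
#define AT(arr, n, i) ((arr)[(i)])
#endif
```

Writing AT(x, 100, i) = i * i; in place of x[i] = i * i; makes a debug build stop at the first out-of-range index with a file and line number, instead of quietly stomping on a neighbor.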
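One of my favorites is code sort of like this:

    int main(void)
    {
        int x[100];
        int i;
        for (i = 0; i != 101; i++) {   /* i == 100 writes past the end of x */
            x[i] = i * i;
        }
    }

Here overflowing the array can step on the loop index, depending on how the compiler lays out the stack.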
Another case I saw that my team chased off and on for weeks was one where we didn't initialize one field of a time-related structure.
It probably doesn't impact anyone these days, but we spent an hour one day stripping some 16-bit Windows or DOS code down to just a dang printf("Hello world."); and it cored inside the printf(). Finally we noticed that there were a lot of large arrays declared locally in main(), so the stack was almost completely used up. The next function call would core no matter what.
I have a lot of "can't happen" checks. They never, or rarely, trigger (anymore). I have to maintain code written by someone less paranoid about buggy hardware. For every 100 checks marked "can't happen except on hardware errors" that triggered, I've fixed 110 software bugs, and this is on new, untested hardware where I have found hardware bugs by other means.
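A "can't happen" check can be as small as the sketch below. CANT_HAPPEN, cant_happen_report and state_name are illustrative names, and logging plus counting is just one possible policy; some people prefer to abort on the spot.

```c
#include <assert.h>
#include <stdio.h>

unsigned long g_cant_happen_count = 0;   /* how many checks have fired */

/* Log the impossible condition and where it was seen; always returns 1. */
int cant_happen_report(const char *expr, const char *file, int line)
{
    assert(expr != NULL && file != NULL);
    g_cant_happen_count++;
    fprintf(stderr, "CAN'T HAPPEN: %s at %s:%d\n", expr, file, line);
    return 1;
}

/* Evaluates to nonzero (after logging) if the impossible condition holds. */
#define CANT_HAPPEN(cond) \
    ((cond) ? cant_happen_report(#cond, __FILE__, __LINE__) : 0)

/* Example use: a state variable that is "only ever" 0, 1 or 2. */
const char *state_name(int state)
{
    static const char *const names[] = { "idle", "busy", "done" };

    if (CANT_HAPPEN(state < 0 || state > 2))
        return "corrupt";
    return names[state];
}
```

The check costs one comparison on the normal path, and when something has stomped on the state you get a log line naming the condition rather than a crash three functions later.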
I've also fixed several crashes because I knew the code well enough to know where it should be and what it should be doing. Once I proved it wasn't in that state I had to figure out why not, and from there the fix was easy. Unfortunately, figuring out why I was in the wrong state is hard.
Reproducible problems are easy to fix. If you can crash in one of 5 cases, then splatter printfs all over those areas. Consider writing your own printf which just writes to memory, not the screen, so you don't block. Then when your program crashes you pull that memory from the core file and you know where each function was last. Just knowing what function each thread was in last is a big clue.
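One way to build that memory-only printf is a small ring buffer of formatted lines, roughly as below. The names (trace_printf, g_trace, TRACE_SLOTS) are made up, and the C11 atomic counter is my assumption about how to hand out slots without taking a lock. Two threads can still collide on a slot if the counter wraps all the way around while one of them is mid-write, so treat it as a debugging aid, not a logging system.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

#define TRACE_SLOTS 64   /* how many recent messages are kept */
#define TRACE_LEN   80   /* max length of one message, including the NUL */

/* Global, so it is easy to find and dump from a core file in a debugger. */
char g_trace[TRACE_SLOTS][TRACE_LEN];
atomic_uint g_trace_next = 0;

/* Clear the buffer, e.g. at the start of a test run. */
void trace_reset(void)
{
    memset(g_trace, 0, sizeof g_trace);
    atomic_store(&g_trace_next, 0u);
}

/* printf-style call that formats into the next slot and does no I/O. */
void trace_printf(const char *fmt, ...)
{
    va_list ap;
    unsigned slot;

    assert(fmt != NULL);
    /* Claim a slot atomically; the oldest entry is overwritten on wrap. */
    slot = atomic_fetch_add(&g_trace_next, 1u) % TRACE_SLOTS;

    va_start(ap, fmt);
    vsnprintf(g_trace[slot], TRACE_LEN, fmt, ap);   /* truncates long lines */
    va_end(ap);
}
```

Drop calls like trace_printf("enter parse_header, conn=%d", id); at the top of the suspect functions, then print g_trace and g_trace_next from the core in the debugger to see the last 64 things the program did.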
Finally, code inspections are a must. Get some good programmers who have never seen that section of code and have them inspect it. If nothing else it will ensure that your comments are meaningful before the programmer quits.
My two favorite examples of past debugging: