How Do You Deal w/ "Heisenbugs"?
horos1 asks: "I was wondering how people out there deal with 'Heisenbugs': bugs that have no logical, programmatic cause (mostly from C and C++ programs and especially threaded C/C++ programs), that may change or disappear if you modify the state of the program.
For example: we have a multi-threaded C++ app which cores in about
5 places, at a different memory address each time, and which disappear if we turn threading off. They seem to be caused because of a memory overrun error, but this too is exceedingly hard to fix with tools like purify, because they tend to give several 'false positives' on memory errors, as well as core when we link with certain libraries. Anyways, this is getting very annoying... Any help with this, as well as pointers on how to deal with bugs like this would be greatly appreciated."
Patient: "Doctor, when I write multi-threaded programs in C++ they dump core all over the place and I don't understand why!"
Doctor: "Don't do that."
I know it sounds glib, but it really is the heart of the matter. Writing proper multi-threaded programs is difficult and takes a fair amount of skill. If you or your programming staff don't have the skill to do it, you are better off sticking to something less challenging. This isn't meant as a put down, either: there are plenty of simple solutions to problems that, at first blush, may look like they require threads (or dynamic memory allocation, or any of a number of other complex and error prone tactics). You could replace your multiple threads with a polling loop and a switch statement, or spawn completely separate processes and communicate through pipes (you might be supprised how much performance you can get out of either solution compared to the threaded code).
If you feel that you really must use the cool threaded code (or a complex, dynamically allocted data structure, or whatever) then you may need to dedicate a week or two to a carefull re-examination of the code and the accompanying re-implementation to eliminate race-conditions/memory-leaks/mutual exclusion errors/etc. While there are a lot of cool debugging tools out there, that can help you find some kinds of errors, there is really no substitute for a deep and thorough understanding of your code. You can either spend the time understanding the mess you have, or try replacing it with something less complex but easier to maintain. (but, maybe, harder to extend/scale/etc.)
Here are some of my favorite high-tech problems with low-tech solutions:
- Multi-threading: multiple processes or polling loop with switch statement.
- Search Trees or other tree structures: hash tables.
- Linked lists or other linked structures: dynamically allocated arrays (double when size>=length, halve when size<length-n, like Java vectors)
- Complex decision logic: state machines.
Admittedly, sometimes you really do need to go with the more complex solution, but it is best to avoid it whenever possible. Besides, you can always put the complex stuff in the next version.