Slashdot Mirror


Pet Bugs II - Debugger War Stories

AlphaHelix queries: "A few weeks back there was an article on Pet Bugs, where people were asked about their favorite bugs. I have a different sort of question: what was your greatest debugging challenge? I've been debugging for a long time, from analog circuits all the way up to multi-kLOC multithreaded servers, and I have some pretty grisly war stories, like the time I debugged a problem in a third-party DLL in machine code because the client didn't have the source for it (yay open source). What was your greatest debugging triumph?" The first time Slashdot did this, it was more about bugs that you had encountered (and may not have solved); this one is about bugs in your own project's code and the trials and tribulations you had to go through to get them fixed.

21 of 121 comments

  1. Back in the day... by Pauly · · Score: 5, Funny

    Once I had to debug a program written in MFC... Wait. Sorry. The memory is too painful to recall.

  2. Re:Multi-threading by PD · · Score: 4, Interesting

    Not the most difficult bug I ever encountered, but one that didn't pop right out.

    Project was porting code from Solaris to AIX, multithreaded app. At one point in the code, two threads were started, and they needed to synchronize with each other.

    Anyway, on Solaris, the threads would start and interact properly. On AIX, the system would crash. Turned out that right after a thread was started on Solaris, the scheduler would stop one thread and allow the other one to start up, and from then on both threads existed at the same time as they should.

    Under AIX, the scheduler would start a thread, and that thread would run through to completion before the other one even got started. To fix this, we had to add in a rendezvous point at the top of each thread, so that the first thread would stop and wait for the second one to be created.
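
    The rendezvous point described above can be sketched roughly as follows. This is a minimal illustration in portable C++, not the poster's Solaris/AIX code; the Rendezvous class and run_pair function are hypothetical names invented for the example.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical one-shot rendezvous: no caller gets past arrive_and_wait()
// until the expected number of threads have all called it.
class Rendezvous {
public:
    explicit Rendezvous(int parties) : remaining_(parties) {}

    void arrive_and_wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        if (--remaining_ == 0) {
            all_arrived_.notify_all();  // the last arrival releases the others
        } else {
            all_arrived_.wait(lock, [this] { return remaining_ == 0; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable all_arrived_;
    int remaining_;
};

// Starts two threads that meet at the rendezvous before doing any real work.
// Returns true if each thread could see that its peer had already started.
bool run_pair() {
    Rendezvous gate(2);
    std::atomic<bool> a_started{false};
    std::atomic<bool> b_started{false};
    bool a_saw_b = false;
    bool b_saw_a = false;

    std::thread a([&] {
        a_started = true;
        gate.arrive_and_wait();
        a_saw_b = b_started;
    });
    std::thread b([&] {
        b_started = true;
        gate.arrive_and_wait();
        b_saw_a = a_started;
    });
    a.join();
    b.join();
    return a_saw_b && b_saw_a;
}
```

    With the gate in place, the result no longer depends on whether the scheduler runs the first thread to completion before creating the second, since the first thread simply blocks until its peer shows up.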

  3. Debug versus release bug by ghamerly · · Score: 3, Interesting

    My favorite bug was allocating memory inside of an assert() using Visual C++ (I hate MS tools; I had to use it for work).

    So the gist of the code went something like this:

    ...
    0. int *array;
    1. assert(array = new int[SIZE]);
    2. for (int i = 0; i < SIZE; i++) {
    3. array[i] = i;
    ...
    and the code would segfault on line 3. So I brought it into debug mode, and stepped through. But it worked fine. Back to release mode, and it segfaults.

    To restate, here we have the classic example of something you don't want: it works fine in debug mode, but it bombs in release mode.

    Of course, since I have simplified the code the answer should be obvious -- in release mode, Visual C++'s compiler was stripping out the assert(), and the allocation inside. In debug mode, it left the assert() in, so the allocation worked fine. I had never changed a flag that said I wanted it to strip them, so I assumed it wouldn't. Never trust M$...
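
    A side-effect-free version of the snippet might look like the sketch below. It relies only on standard behaviour (assert() expands to nothing when NDEBUG is defined); the function name make_sequence is hypothetical and not from the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical helper: builds the array 0, 1, ..., size - 1.
std::unique_ptr<int[]> make_sequence(std::size_t size) {
    // The allocation is an ordinary statement, so it survives NDEBUG builds.
    std::unique_ptr<int[]> array(new int[size]);

    // The assert only inspects state; compiling it out changes nothing.
    // (Plain new throws on failure, so this check is purely illustrative.)
    assert(array != nullptr);

    for (std::size_t i = 0; i < size; ++i) {
        array[i] = static_cast<int>(i);
    }
    return array;
}
```

    The rule of thumb is that deleting every assert from the program should leave its behaviour unchanged, which is exactly what a release build does.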

    1. Re:Debug versus release bug by codexus · · Score: 5, Informative

      Never trust M$? I'm sorry but it's clearly documented that the asserts are stripped from the release code. The macro to use for code you want to check in debug mode but still execute in release mode is VERIFY()

      I'm no fan of Microsoft, but it's a bit easy to blame them for your own mistakes.

      --
      True warriors use the Klingon Google
    2. Re:Debug versus release bug by ComputerSlicer23 · · Score: 4, Informative
      Hmmm, you broke the rules from the MS Press books. Both "Writing Solid Code" and "Code Complete" specifically say to never ever have code with side effects inside of an assert statement (more generally, no side effects in debugging code). Both are outstanding books that have led me down the path so I don't have war stories about debugging anymore. That and I don't do embedded programming anymore.

      Good books, MS tools are weird mainly because I like my commandline a bit too much, but they publish some damn fine books about programming.

      This is also the kind of bug that is easily found by reading the output of gcc -E, your best friend when debugging code that has macros anywhere near it.

      I had never changed a flag that said I wanted it to strip them, so I assumed it wouldn't. Never trust M$

      I hate to post a flame, but RTFM. On every compiler or tool you ever use, spend several days reading the manual and all associated docs you can find. Knowing how the compiler works, and how all the tools work, is a hallmark of all the finest programmers I know. I used VC++ a handful of times 5 years ago, and I could have told you the asserts were stripped from release mode. All you have to do is look at the full list of options it puts on the command line. That's relatively easy to find in the menuing system on VC 4.0 (the only version I used). The -DNDEBUG=1 flag turns off asserts.

      Kirby

      PS: Other than the keyboard and mouse I use, I haven't used a Microsoft product on a daily basis in years. It's about craftsmanship, and knowing your tools.

  4. Amiga 500 and Action Replay fun by codexus · · Score: 5, Interesting

    That one is an oldie :) Back in the Amiga days I had made this game that worked fine on my A500 but that stopped working after a while on most other A500s. That was strange as the machines were supposed to be identical and I couldn't run more tests at home.

    So I used the Action Replay cartridge. For those who don't know about Action Replay, those were "hardware debuggers" that plugged into the bus and could stop and restore the execution of the running program. They were very powerful debugging tools.

    After inspecting the contents of the hardware registers thanks to the Action Replay, the result was that on some revisions of the A500 motherboard the audio interrupts had a slightly different timing, which caused an improbable case where the audio samples always stopped playing on offset 0, retriggering an audio interrupt as soon as one was handled.

    The Amiga was so much fun...

    --
    True warriors use the Klingon Google
  5. Video games by inkfox · · Score: 5, Funny
    I don't think I caught the original article; this was a fun one though:

    When we were working on a game title on the Nintendo 64, we were maintaining a parallel version for Windows, as it was substantially easier to develop for. Basically, we created some rendering, sound, controller and I/O code for Windows that duplicated what we'd created on the N64.

    At one point, we were trying to find some problems with the camera behavior, so we created a flying camera object that coincided with the real camera. It looked like an old Hollywood camera, though the lens cap, reels, everything was just flat black. Then, we'd set up a fixed camera and watch what the game camera would be trying to do by observing where the flying camera went.

    Time passed, and we'd forgotten about the added camera altogether. Then, as we were approaching a critical milestone, we went to bring the N64 build up to date... and the screen was black.

    The game seemed to be playing, the menus were there, and the framerate counter was up, but - black. The Z-buffer had a constant value all the way across it, meaning there was some mysterious polygon that was exactly covering the screen all the time.

    We were there until something like 2am, trying to figure out what the hell was going on. We were risking blowing this milestone, and with that, taking on a pretty hefty late delivery penalty.

    We finally figured it out. Stepping back... well, the N64 engine didn't support back-face culling at that point, whereas the Windows engine did. So what's the upshot of the whole thing?

    We'd left the lens cap on.

    --
    Says the RIAA: When you EQ, you're stealing bass!
  6. No contest. Bad DMA. by inkfox · · Score: 5, Interesting
    As a game programmer, and as someone who's banged on device drivers, I have to say there's no contest: The most fun bugs result from errant DMA.

    It's really easy to set up a bad DMA chain on most architectures, and when that happens, it can do wonderful damage that's tough to reproduce.

    One of the more fun things about it is that DMA generally ignores the MMU completely, so you can consistently trash whatever's at some physical address time and again, ignoring all protection.

    Even better, DMA doesn't cause hardware breakpoints, so even if your debugger/system are capable of watching for all writes to a given address or page, it'll still merrily corrupt it.

    Even more fun if data has been corrupted, but the correct data is still in the data cache from a previous access, making failures even more unpredictable, often relying on an interrupt or other random bit of interference clearing the cache.

    On top of all this, a bad DMA chunk may not manifest itself in an obvious way. The program may crash in random sections well before you realize that the DMA you intended didn't happen, or the program may just keep on running with a single-frame graphic glitch or a brief bit of static in the case of DMA meant to go to video or audio hardware. That's easy to miss when you're focussing on the debugger and not the running program.

    I've seen products ship weeks late just because of a single hard-to-find DMA glitch.

    --
    Says the RIAA: When you EQ, you're stealing bass!
  7. Simple, Obvious, NOT! by renehollan · · Score: 4, Informative
    PUSH SP does NOT do the same thing on 8086 and 80286 architectures: in one case it pushes the stack pointer value before decrementing it, and in the other case it pushes it after decrementing it.

    I got stung by that on a Friday before a long weekend in 1984 or 1985. A dirty INT21 hook I was applying to DOS worked on ATs but not on XTs (or was it the other way around?). I had set up a structure on the stack and needed to pass its address to a higher language (prolly K&R C) routine, so PUSH SP seemed like the right thing to do.

    Hardly a complex bug, but one where it is non-obvious that a 286 is not a superset of an 86.
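
    For reference, Intel's documentation describes the difference this way: the 8086/8088 stores the value SP has after the decrement, while the 80286 and later store the value SP had before the instruction. The toy model below only illustrates that difference; it is not real machine code, and the names Cpu and word_stored_by_push_sp are made up for the sketch.

```cpp
#include <cstdint>

enum class Cpu { I8086, I80286 };

// Toy model of PUSH SP: returns the 16-bit word that ends up on the stack
// when SP holds `sp` just before the instruction executes.
std::uint16_t word_stored_by_push_sp(Cpu cpu, std::uint16_t sp) {
    const std::uint16_t decremented = static_cast<std::uint16_t>(sp - 2);
    // 8086/8088: the already-decremented SP is written.
    // 80286 and later: the original SP is written.
    return cpu == Cpu::I8086 ? decremented : sp;
}
```

    Code that treats the pushed word as the address of a structure sitting at the old top of stack is therefore off by two on one of the two families.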

    Then there was the time I had to download a patch to over a thousand embedded controllers spread over a whole country whose problem was that downloading didn't work.... a truck roll to each one was not an option. But, that's another story (bootstrapping the fix was horrendously more complex than finding the bug).

    --
    You could've hired me.
  8. Satellite systems by itwerx · · Score: 4, Interesting

    I worked on code for a satellite ground-control system for a few years. It was all Fortran-5/77 and handled a couple-dozen satellites and ground stations in real-time. The problem was that it was written back in the 60's and the programmers who implemented it had really old slow MV-8000's. There weren't enough spare clock cycles to have decent synchronization between modules so they just depended on different subroutines taking an exact number of cycles to execute so they'd match up with whatever they were talking to. Change a single line in anything and you had to recompile it and time every possible way it could execute. Horrible stuff...

  9. Bug in the CPU by dant · · Score: 5, Interesting

    Once I was writing some C code to run on an old Motorola DSP in an embedded system.

    One particular function kept crashing. My debugging tools were very limited in this environment--basically, I had a total of 4 LEDs that I could blink on and off by inserting function calls into my code. That and a logic analyzer for when things got really nasty.

    Well, things did get really nasty. After reviewing and rewriting that function dozens of times, I finally decided the bug couldn't be in my C code. So I had the compiler spit out the assembly it was generating, brushed up on my DSP assembly, and read through its code on the hunch that there was a bug in the compiler (the compiler was very new and still pretty crappy).

    But after spending a couple of days staring at the assembly, I concluded that it was perfectly fine. What else could be going wrong? I started thinking maybe something was going wrong in the link step or in the process of getting the file transferred down onto the embedded controller.

    I went and learned more than I wanted to know about XCOFF format and used a little binary file editor to see what the linker output was. Again, everything just as it should be.

    I just knew that somehow, what was getting executed was different from what was in the file. So we fired up the logic analyzer, and attached it to the DSP, and set it up to watch the contents of the address bus and data bus at each clock cycle.

    This is incredibly painstaking--you have to look at 32 lines of step-functions to read off the address, and 48 lines of step-functions to read off the data (yes, it was a 48-bit data register; go figure) for EACH OPCODE. This will make your eyes bug out in a hurry.

    But even then--nothing was wrong! The opcodes being loaded into the processor were exactly what they should be. But on this one particular test-and-branch instruction, the processor would just start to go crazy (address and data lines full of random noise; had to be powered down).

    I dug out the processor manual and triple-checked the opcode name and number, addressing mode, and operands. Every bit was correct.

    In utter frustration, we decided to call Motorola to see if we could get some assistance from them. After going through a small maze of transfers, we finally ended up talking to the right person who knew (and quickly told us) that:

    That particular addressing mode, when used with that particular
    opcode, was known to throw the DSP into a hosed state.

    It was a bug in the processor itself. The solution was simply to change my code to use a different addressing mode, and all was well.

  10. Re:String equality in Java by avalys · · Score: 3, Funny

    static final String ANYTHING = "anything; /*...*/
    String x = ANYTHING;
    if (x == ANYGHING) { /*...*/ }


    Ahh, the irony.

    --
    This space intentionally left blank.
  11. One on GCC / No debugger by guerby · · Score: 5, Interesting
    When GNAT (the Ada front-end for GCC) was committed to the GCC CVS repository, there was a bootstrap object file comparison failure.

    2001-10-27 Laurent Guerby <guerby@acm.org>

    * trans.c (gigi): Fix non determinism leading to bootstrap comparison failures for debugging information.

    The culprit was the following line:

    init_gigi_decls (gnat_to_gnu_entity (Base_Type (standard_long_long_float), NULL_TREE, 0),
                     gnat_to_gnu_entity (Base_Type (standard_exception_type), NULL_TREE, 0));

    The two stage compilers were calling gnat_to_gnu_entity in a different order (which C permits, since it leaves the evaluation order of function arguments unspecified), so different debugging IDs were assigned to the two created types, hence the object debugging information comparison failure. Luckily it struck me while I was reading the compiler's entry point and thinking about non-determinism.
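
    The usual way to remove this kind of non-determinism is to sequence the calls through named locals. The sketch below shows the idea with hypothetical stand-ins (next_id, init_decls, TypeIds); it is not the actual GNAT patch.

```cpp
// Hypothetical stand-in for a call that hands out IDs as a side effect.
int next_id(int& counter) { return ++counter; }

struct TypeIds {
    int float_id;
    int exception_id;
};

// Hypothetical stand-in for the callee that receives both results.
TypeIds init_decls(int float_id, int exception_id) {
    return {float_id, exception_id};
}

TypeIds init_in_fixed_order(int& counter) {
    // init_decls(next_id(counter), next_id(counter)) would let the compiler
    // evaluate the two arguments in either order. Separate statements
    // pin the order down.
    const int float_id = next_id(counter);
    const int exception_id = next_id(counter);
    return init_decls(float_id, exception_id);
}
```

    Each statement ends before the next begins, so every compiler assigns the first ID to the same entity.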

    Cute :).

    Laurent

  12. Re:String equality in Java by Anonymous+Brave+Guy · · Score: 3, Insightful

    The really sad part here is that certain types of people rail on C and C++ for having pointers, and consequently being susceptible to null-pointer bugs, comparing at the wrong level of indirection, etc, and yet here is an equally subtle and unfortunate situation in Java, a language much hyped for its improved safety.

    I can see a good reason for low-level languages to provide this level of control, and the consequent risks associated with it. Surely, though, it would be better for most applications if higher level languages prevented such things happening at compile-time, rather than leaving you to clean up the mess in debugging (assuming, of course, that you actually hit the code in question during your testing). Until then, threads like this will forever feature bugs that should never be able to happen...
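
    For comparison, here is the C and C++ form of the same mistake, "comparing at the wrong level of indirection". In Java, == on two String references compares the references while equals() compares the contents; with C strings, == compares addresses and strcmp compares the text. The two helper names are hypothetical.

```cpp
#include <cstring>

// Compares addresses: true only if both pointers refer to the same storage.
bool same_pointer(const char* a, const char* b) { return a == b; }

// Compares contents: true if the two NUL-terminated strings hold the same text.
bool same_text(const char* a, const char* b) { return std::strcmp(a, b) == 0; }
```

    Two separate char arrays that both hold "anything" pass same_text but fail same_pointer, which is the shape of the bug under discussion.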

    --
    If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
  13. Purchased library by topham · · Score: 3, Interesting

    I purchased a library to support multiple serial ports under DOS, way back in 1989/90. I purchased the library because I didn't know assembler well enough, and I specifically wanted flexible support for multi-port serial devices (Digiboards come to mind).

    I wrote up my program, testing it with a single serial port and had success quickly. I expanded it to support multiple ports and had it working, up until it was supposed to actually communicate with both ports at the same time.

    The company which released the library failed to reference anything except the first port in their interrupt handler. Leaving me to trace through unknown assembler code trying to figure out what was wrong with code that worked, but not correctly.

    I sent them a letter complaining about the problem; they sent back disks with that bug patched (exactly the same as my patch) and a couple of other bugs fixed (which didn't affect me).
    I vowed at that point to never trust third-party code. (I did have full source code though, which was nice.)

  14. Going way back... by martyb · · Score: 5, Funny

    Two situations come to mind.

    The first is not so much a specific bug that was difficult to find, as it was the general means that I was forced to use to locate bugs. (Yes, I found quite a few.) Back in the early 80's I was working at IBM in the QA group responsible for testing their VM operating system. We were tasked with taking the existing VM OS and not only were we to improve its performance on multi-processor systems, we were also to improve its reliability by doing extensive reviews and testing. I was responsible for testing the free storage allocator.

    Some background: For those who may not be aware, that whole operating system was written in IBM BAL (Basic Assembler Language) using the 370 instruction set. The VM operating system created a virtual machine environment for each user - thus producing the appearance of a machine, identical in [almost] all respects to running on the bare hardware. Those few differences pertained to some optimizations in the virtual memory management's use of PTLBs (page table look-aside buffers) among others.

    So, I needed to test memory allocation on the bare hardware to make sure that it worked okay. Once that was nailed down, I had to test memory allocation when VM was running on VM. But, there was yet another set of optimizations that I needed to test when a VM was running on VM running on VM (i.e. a "3rd level" VM).

    It was not possible to just issue VM commands to test the various code paths. So, each test consisted of setting hardware breakpoints at the appropriate hex offset, and single stepping through these allocations. By the time I got to testing the 3rd level VM code, I was tracing and debugging these PTLB calculations and allocations, single stepping through instructions in hexadecimal and verifying multiple levels of indirection to memory pages where those calculations were also in hex. Those were the days! Just a year or so out of college, and I had all to myself a multi-million dollar mainframe computer that could support several hundred people!

    The other bug was actually a specific bug that caused much early hair loss. I was working at a place where there were a lot of new employees coming on board. Along with that, new departments were being formed, people were being promoted and moving to different groups, and there were a great many office moves as a result. So, it soon became a problem finding somebody's office. "Gee, wasn't Mary here just last week?"

    Sensing a need, I wrote up a quick REXX program (yes, this was back in the early 80's, too) which did data acquisition through forms and supported the generation of reports sorted by various categories: Name, Department, Room Number, etc. This was pretty straightforward and in a couple days I'd gotten it coded, tested, and all the data populated. As there were only a few hundred people I used a flat file (the other alternative was creating a DB2 database and disk space was very dear back then!)

    Rolled it out and received much positive feedback. Except, there was one person who noticed there was an error in the ordering of room numbers. See, the format was: "Floor: 1, 2, or 3"; then the "building wing: compass direction: N, S, E, or W"; and lastly, "room number: 2-digit number". As this was a rapidly growing organization, managers would be allocated several empty offices in advance for the people they'd hire during the next quarter. Also, the building had just been constructed and some areas and sometimes whole wings were not yet ready for use, so there were many gaps in the data. Here is a selection of the kinds of results I saw for the room numbers, in ascending order:

    • 1E17
    • 1E18
    • 2E18
    • 1E19
    • 1E23
    • ...
    • 1S17
    • 1S18
    • 1S19
    • 2S03
    • 2S05
    • 2S15

    Why are the rooms in the South wing nicely ordered, but things are really messed up for the East wing? I spent HOURS and HOURS trying to figure this one out. See if you can tell what the problem is before reading ahead.

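    The problem? There is a single set of comparison operators (<, =, and >), and they just did the right kind of comparison depending on the data type of the operands. Well, here I was thinking the data was text, but the program was making the comparisons as if these were numbers that contained an exponent part. A room like 1E17 was read as 1 times 10 to the 17th, so 2E18 landed between 1E18 and 1E19, while the South wing rooms (1S17 and so on) are not valid numbers and were compared as text.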
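
    The effect can be reproduced outside REXX. The sketch below is an approximation in C++ of the "numeric if both operands look like numbers, text otherwise" rule the post describes, not REXX's exact semantics; all function names are hypothetical.

```cpp
#include <algorithm>
#include <cstdlib>
#include <string>
#include <vector>

// True if the whole string parses as a number, exponent notation included.
bool parse_number(const std::string& text, double& value) {
    if (text.empty()) return false;
    char* end = nullptr;
    value = std::strtod(text.c_str(), &end);
    return end == text.c_str() + text.size();
}

// Loose comparison: numeric when both operands look like numbers,
// plain text otherwise.
bool loose_less(const std::string& a, const std::string& b) {
    double x = 0.0;
    double y = 0.0;
    if (parse_number(a, x) && parse_number(b, y)) return x < y;
    return a < b;
}

// Always compares the characters, which is what a room list needs.
bool text_less(const std::string& a, const std::string& b) { return a < b; }

// Returns a sorted copy of the room list using the given comparison.
std::vector<std::string> sorted_with(
    std::vector<std::string> rooms,
    bool (*less)(const std::string&, const std::string&)) {
    std::sort(rooms.begin(), rooms.end(), less);
    return rooms;
}
```

    Sorting the East wing rooms 1E19, 2E18, 1E18, 1E17 with loose_less gives 1E17, 1E18, 2E18, 1E19, matching the listing above, while text_less puts 2E18 last. Note that loose_less is only safe as a sort predicate when every entry is numeric or none is; on a mixed list it is not guaranteed to be a consistent ordering.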

  15. Beginner's bug... by jmv · · Score: 3, Interesting

    It's a simple one, but the first time (it happened to me a couple years ago), you really search for it:

    #define square(x) x*x
    ...

    value = 1/square(x);

    Ever since, any C macro I write has a dozen parentheses in it... That's why I like inline functions...
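
    Spelled out, 1/square(x) expands to 1/x*x, which parses as (1/x)*x. A sketch of the broken macro next to the two usual fixes follows; the SQUARE_UNSAFE, SQUARE_SAFE and inverse_square_* names are invented for the example.

```cpp
// The macro from the post: 1 / SQUARE_UNSAFE(x) expands to 1 / x*x.
#define SQUARE_UNSAFE(x) x*x

// Parenthesising each argument and the whole body fixes the precedence problem.
#define SQUARE_SAFE(x) ((x) * (x))

// An inline function avoids textual substitution altogether.
inline double square(double x) { return x * x; }

double inverse_square_unsafe(double x) { return 1 / SQUARE_UNSAFE(x); }
double inverse_square_safe(double x) { return 1 / SQUARE_SAFE(x); }
double inverse_square_inline(double x) { return 1 / square(x); }
```

    Even the parenthesised macro still evaluates its argument twice, so something like SQUARE_SAFE(i++) remains a trap; the inline function has no such problem.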

  16. My greatest S/W bug was a H/W bug.... by BranMan · · Score: 3, Interesting

    My greatest debugging was on the first production run of the upgraded PATRIOT radar transmitter I was working on. The particular unit in question would start fine, run up, warm up the humongous amplifier tubes for the radar, switch into high power (for long range operation and tracking) and BAM! reset itself.

    Looked like a S/W bug to everyone, so I (as the last 'surviving' member of the S/W team at that point) was called in to find it and fix it.

    Well, gathering data was the hard part - I needed to figure out what was happening with scope probes (tracing didn't work, and I couldn't rewrite all the firmware to do any logging or checkpointing). Small catch - the cycle running the system up to high power (where the bug was seen) takes 8+ minutes. Each time.

    So I basically had 7 tries per hour (max) to figure out where to hang a scope probe off a backplane of about 4000 wires to figure out what the heck was going on. While at the same time leafing through 40K of assembler code trying to eye-ball the problem.

    Three solid days of doing that (about 10-12 hours per day) with my boss constantly pestering me for an accurate estimate of how long it would take to fix it (Gee, thanks for that). Did I mention that I was 3 years on this project at this point - and that it was the first project I was on right out of school - and that I'd 'inherited' 2/3rds of the firmware from other developers who'd moved on? Way to be supportive boss.

    Anyhow, I finally figured out it was a H/W fault, not the S/W at all. Turns out a 24 volt PS was "weak". When the 208 3-phase power that runs the transmitter dipped from the load of switching to high power, the 24 volt PS would drop its voltage. Just enough that the 5 volt PS running the logic detected the drop in 24 volt PS voltage, and due to the fail-safes to protect the circuitry - shut itself off!

    Which resets the control logic, brings down the power, steadies the 208 3-phase, bringing the 24 volt PS back in line, starting up the 5 volt PS, and away we go again.

    All found with a couple of scopes. Boy that was fun.

  17. reflection by MillionthMonkey · · Score: 3, Insightful

    Reflection is a nice feature in Java except they made it a pain in the ass to use. I work on a product that is a Java application running on Windows, Linux, Solaris, and Mac (both OSX and Mac OS 9). Because we are still supporting Mac OS 9, we cannot use a Java 2 compiler at all- so we are squeezing the entire tree through Sun's 1.1.8 compiler every night. (So we're still writing Java 1.1 code! In this day and age! If you call any Java 2 method, like add() on a Vector, it breaks the nightly build.)
    Now there are some things that our customers want that absolutely require Java 2, like drag and drop. If you are running Mac OS 9, drag and drop won't work in our program. Sorry. But we have it working for everybody else on all other platforms- by using reflection to access the DnD classes! And the code looks horrible. One line of ordinary code balloons to five lines of incomprehensible gibberish when you use reflection.
    The way I see it there are two primary uses for reflection. One is the use that Sun originally intended- for people writing IDEs, bean containers, debuggers, profilers, etc. The other is for people like us, who are compiling against a fossilized version of the JDK but need to introduce some forward-compatibility and access classes we know are usually there but we can't compile against statically. Sun's attitude is always to tell all customers to upgrade to their latest and greatest version of Java. (Sun's inability to take on the backward-compatibility issue from either a design or a policy perspective is really annoying. It's what killed the whole applet idea. And now their JDK 1.4 compiler is spitting out classes with version numbers that make old software freak out. I still have to find the compiler switch that turns that off.)
    I think it would be cool if Java had a "reflection" keyword with which you could declare a block of code as being dynamically and not statically compiled- so you could write ordinary code in there and the compiler would break it down during a preprocessing step into the required Class/Method/Field gibberish and let you catch something like an "UnsupportedApiException" in a catch block underneath. Of course, the chance of that happening is zero, and even if it did happen, the 1.1.8 compiler wouldn't understand it anyway. Does anyone know if Sun has any plans for introducing a standard for compiler extensions? It strikes me as a move that would involve relinquishing too much control.

  18. Partially working system by ebbe11 · · Score: 3, Interesting
    I spent a couple of days with this bug in an ECG monitor:

    When the users pressed a specific (valid) key sequence rapidly enough, the keys stopped working.

    The monitor used a message passing RTOS and it turned out that a small utility function was the culprit. This function acquired a message buffer, filled it in and sent the message to a specific task. That task was then responsible for releasing the buffer. This worked fine - until someone used this utility function in that very same task. When the key sequence was entered, one task would send a message that in turn would require a call to the utility function in that task. But before that message was processed, another task would call the utility and thus acquire the message buffer. So when the offending task got to run, it would have two messages in its queue: the first message would require it to try to acquire the message buffer already used by the second message.
    Result: deadlock.
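
    One reading of that scenario can be modelled in a few lines. This is a simplified simulation in C++ under the assumption that there is exactly one message buffer; the original was 8086 assembly on an RTOS, and the names BufferPool, Message and task_gets_stuck are hypothetical.

```cpp
#include <deque>

// Hypothetical model: the system has exactly one message buffer.
struct BufferPool {
    bool in_use = false;
    bool try_acquire() {
        if (in_use) return false;
        in_use = true;
        return true;
    }
    void release() { in_use = false; }
};

enum class Message {
    CallUtility,  // the handler must call the utility, which acquires the buffer
    HoldsBuffer   // a message that occupies the buffer until it is handled
};

// Runs the receiving task over its queue. Returns true if the task reaches a
// message it can never finish, i.e. a blocking acquire would hang forever.
bool task_gets_stuck(std::deque<Message> queue, BufferPool& pool) {
    while (!queue.empty()) {
        const Message current = queue.front();
        if (current == Message::CallUtility) {
            // A real blocking acquire would wait here. The only release is in
            // the HoldsBuffer handler, which sits behind this message.
            if (!pool.try_acquire()) return true;
            queue.push_back(Message::HoldsBuffer);  // the utility sends to this task
        } else {
            pool.release();  // the receiver frees the buffer after handling
        }
        queue.pop_front();
    }
    return false;
}
```

    With the buffer already taken and the queue holding CallUtility followed by HoldsBuffer, the task gets stuck; with the same two messages in the opposite order it drains the queue normally, which is why the bug only appeared when the keys were pressed quickly enough.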

    I should add: this was in 8086 assembly...

    --

    My opinion? See above.
  19. UNIVAC 1108 multiprocessor by Animats · · Score: 3, Interesting
    My worst experiences involved making a multiprocessor UNIVAC 1108 system work reliably in the 1970s.

    This was a mainframe. A physically huge mainframe. The two CPUs and memories alone took three rows of cabinets, each row about 25 feet long, connected by cross-cabinets. Then there were about forty cabinets of peripheral gear, including many drums, tape drives, printers, and two desk-sized consoles, one per CPU. All of this gear, though, delivered only 1.25 MIPS, and there was only about 1MB of memory (256K of 36 bit words.)

    The system kept crashing. For each crash, a dump was produced - a stack of paper about two inches thick, with some of the major data structures decoded at the front, followed by the entire contents of memory, in octal. When I arrived for the job, there were two stacks of these six feet high waiting for me.

    So I started in on this, figuring out what had caused each crash, tracking pointers with multicolored pens, and fixing the bugs in the operating system, which was all in assembly. After a while, the most common crashes had been fixed, and I was then spending time on the more difficult problems.

    Some problems required software workarounds for hardware problems. The system clock would make errors when its fan filters were clogged. (Yes, electronics was so big back then that the system clock had multiple muffin fans.) Code was written to deal with this.

    Occasionally, code overlays would be misread from the drums. A checksum crashed the system when this happened, but reread support was added to make that error recoverable.

    The most intractable problem involved data that seemed to be corrupted when written by one processor and read by the other. We looked and looked for race conditions, but even additional locking didn't help. Finally, a hardware consultant was brought in, and he built a custom hardware device that checked that certain bits matched between the processor and one of the memories. This was used during operation, and finally, after several days, the device triggered, the whole system froze with its clock stopped, and we could verify that the neon lamps at the processor end of the data path didn't match those at the memory end.

    Eventually, we had that beast running reliably, with a month or so between crashes. Gradually, the operation was expanded, until there were five mainframes crunching away.