Slashdot Mirror


Pet Bugs II - Debugger War Stories

AlphaHelix queries: "A few weeks back there was an article on Pet Bugs, where people were asked about their favorite bugs. I have a different sort of question: what was your greatest debugging challenge? I've been debugging for a long time, from analog circuits all the way up to multi-kLOC multithreaded servers, and I have some pretty grisly war stories, like the time I debugged a problem in a third-party DLL in machine code because the client didn't have the source for it (yay open source.) What was your greatest debugging triumph?" The first time Slashdot did this, it was more about bugs that you had encountered (and may not have solved). This one is about bugs in your own project's code and the trials and tribulations you had to go through to get them fixed.

6 of 121 comments

  1. Re:Multi-threading by PD · · Score: 4, Interesting

    Not the most difficult bug I ever encountered, but one that didn't pop right out.

    Project was porting code from Solaris to AIX, multithreaded app. At one point in the code, two threads were started, and they needed to synchronize with each other.

    Anyway, on Solaris, the threads would start and interact properly. On AIX, the system would crash. Turned out that right after a thread was started on Solaris, the scheduler would stop one thread and allow the other one to start up, and from then on both threads existed at the same time as they should.

    Under AIX, the scheduler would start a thread, and that thread would run through to completion before the other one even got started. To fix this, we had to add in a rendezvous point at the top of each thread, so that the first thread would stop and wait for the second one to be created.
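    The rendezvous described above might look roughly like the sketch below. It is not the poster's code: it assumes POSIX threads, and the names rendezvous_wait, worker and run_rendezvous_demo are made up for illustration. Each thread checks in at the top of its body and blocks until the other has checked in too, so neither can run to completion before its peer exists.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical two-party rendezvous built from a mutex and a condition variable. */
struct rendezvous {
    pthread_mutex_t lock;
    pthread_cond_t  all_here;
    int             arrived;   /* threads that have reached the rendezvous */
    int             expected;  /* threads that must arrive before any proceeds */
};

static struct rendezvous rv = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 2
};

/* Block the caller until `expected` threads have called this function. */
static void rendezvous_wait(struct rendezvous *r)
{
    pthread_mutex_lock(&r->lock);
    r->arrived++;
    if (r->arrived == r->expected) {
        /* Last one in: wake everybody who is already waiting. */
        pthread_cond_broadcast(&r->all_here);
    } else {
        /* Loop guards against spurious wakeups. */
        while (r->arrived < r->expected)
            pthread_cond_wait(&r->all_here, &r->lock);
    }
    pthread_mutex_unlock(&r->lock);
}

static int peers_seen[2];

static void *worker(void *arg)
{
    int id = *(int *)arg;

    rendezvous_wait(&rv);   /* the rendezvous point at the top of the thread */

    /* Past this point both threads are known to have started. */
    pthread_mutex_lock(&rv.lock);
    peers_seen[id] = rv.arrived;
    pthread_mutex_unlock(&rv.lock);
    return NULL;
}

/* Start two workers; return how many of them saw both threads present. */
int run_rendezvous_demo(void)
{
    pthread_t t[2];
    int ids[2] = {0, 1};

    rv.arrived = 0;
    for (int i = 0; i < 2; i++) {
        int rc = pthread_create(&t[i], NULL, worker, &ids[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);

    return (peers_seen[0] == 2) + (peers_seen[1] == 2);
}
```

    With this in place the result no longer depends on whether the scheduler lets the first thread run ahead: the first thread simply parks in rendezvous_wait until the second one has been created and has checked in.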

  2. Amiga 500 and Action Replay fun by codexus · · Score: 5, Interesting

    That one is an oldie :) Back in the Amiga days I had made this game that worked fine on my A500 but that stopped working after a while on most other A500s. That was strange, as the machines were supposed to be identical, and I couldn't run more tests at home.

    So I used the Action Replay cartridge. For those who don't know about Action Replay, those were "hardware debuggers" that plugged into the bus and could stop and restore the execution of the running program. They were very powerful debugging tools.

    Inspecting the contents of the hardware registers with the Action Replay showed that on some revisions of the A500 motherboard the audio interrupts had slightly different timing. That caused an improbable case where the audio samples always stopped playing at offset 0, retriggering an audio interrupt as soon as one was handled.

    The Amiga was so much fun...

    --
    True warriors use the Klingon Google
  3. No contest. Bad DMA. by inkfox · · Score: 5, Interesting
    As a game programmer, and as someone who's banged on device drivers, I have to say there's no contest: The most fun bugs result from errant DMA.

    It's really easy to set up a bad DMA chain on most architectures, and when that happens, it can do wonderful damage that's tough to reproduce.

    One of the more fun things about it is that DMA generally ignores the MMU completely, so you can consistently trash whatever's at some physical address time and again, ignoring all protection.

    Even better, DMA doesn't cause hardware breakpoints, so even if your debugger/system are capable of watching for all writes to a given address or page, it'll still merrily corrupt it.

    Even more fun if data has been corrupted, but the correct data is still in the data cache from a previous access, making failures even more unpredictable, often relying on an interrupt or other random bit of interference clearing the cache.

    On top of all this, a bad DMA chunk may not manifest itself in an obvious way. The program may crash in random sections well before you realize that the DMA you intended didn't happen, or the program may just keep on running with a single-frame graphics glitch or a brief bit of static in the case of DMA meant to go to video or audio hardware. That's easy to miss when you're focussing on the debugger and not the running program.

    I've seen products ship weeks late just because of a single hard-to-find DMA glitch.

    --
    Says the RIAA: When you EQ, you're stealing bass!
  4. Satellite systems by itwerx · · Score: 4, Interesting

    I worked on code for a satellite ground-control system for a few years. It was all Fortran-5/77 and handled a couple-dozen satellites and ground stations in real-time. The problem was that it was written back in the 60's and the programmers who implemented it had really old slow MV-8000's. There weren't enough spare clock cycles to have decent synchronization between modules so they just depended on different subroutines taking an exact number of cycles to execute so they'd match up with whatever they were talking to. Change a single line in anything and you had to recompile it and time every possible way it could execute. Horrible stuff...

  5. Bug in the CPU by dant · · Score: 5, Interesting

    Once I was writing some C code to run on an old Motorola DSP in an embedded system.

    One particular function kept crashing. My debugging tools were very limited in this environment--basically, I had a total of 4 LEDs that I could blink on and off by inserting function calls into my code. That and a logic analyzer for when things got really nasty.
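    For anyone who hasn't debugged this way, here is a minimal sketch of the LED technique. Everything in it is an assumption: led_port stands in for whatever memory-mapped register drove the LEDs on that board, and led_checkpoint, led_state and suspect_function are hypothetical names. The idea is to treat the four LEDs as a 4-bit checkpoint code, so the pattern left showing after a crash tells you the last checkpoint reached.

```c
#include <stdint.h>

/* Stand-in for a memory-mapped LED register. On real hardware this would be
   a volatile pointer to a board-specific address (hypothetical here). */
static volatile uint8_t led_port;

#define LED_MASK 0x0Fu   /* four LEDs, mapped to the low four bits */

/* Show a 4-bit checkpoint code on the LEDs, preserving the other port bits. */
void led_checkpoint(uint8_t code)
{
    led_port = (uint8_t)((led_port & ~LED_MASK) | (code & LED_MASK));
}

/* Read back what the LEDs are currently showing. */
uint8_t led_state(void)
{
    return (uint8_t)(led_port & LED_MASK);
}

/* A function under suspicion, instrumented with checkpoints. */
int suspect_function(int x)
{
    led_checkpoint(0x1);        /* entered */
    int y = x * 2;
    led_checkpoint(0x2);        /* arithmetic done */
    if (y > 10) {
        led_checkpoint(0x3);    /* took the branch */
        y -= 10;
    }
    led_checkpoint(0xF);        /* returned normally */
    return y;
}
```

    If the board hangs with the LEDs showing 0x2, the failure is somewhere between the second checkpoint and the next one, which narrows the search without a debugger.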

    Well, things did get really nasty. After reviewing and rewriting that function dozens of times, I finally decided the bug couldn't be in my C code. So I had the compiler spit out the assembly it was generating, brushed up on my DSP assembly, and read through its code on the hunch that there was a bug in the compiler (the compiler was very new and still pretty crappy).

    But after spending a couple of days staring at the assembly, I concluded that it was perfectly fine. What else could be going wrong? I started thinking maybe something was going wrong in the link step or in the process of getting the file transferred down onto the embedded controller.

    I went and learned more than I wanted to know about XCOFF format and used a little binary file editor to see what the linker output was. Again, everything just as it should be.

    I just knew that somehow, what was getting executed was different from what was in the file. So we fired up the logic analyzer, and attached it to the DSP, and set it up to watch the contents of the address bus and data bus at each clock cycle.

    This is incredibly painstaking--you have to look at 32 lines of step-functions to read off the address, and 48 lines of step-functions to read off the data (yes, it was a 48-bit data register; go figure) for EACH OPCODE. This will make your eyes bug out in a hurry.

    But even then--nothing was wrong! The opcodes being loaded into the processor were exactly what they should be. But on this one particular test-and-branch instruction, the processor would just start to go crazy (address and data lines full of random noise; had to be powered down).

    I dug out the processor manual and triple-checked the opcode name and number, addressing mode, and operands. Every bit was correct.

    In utter frustration, we decided to call Motorola to see if we could get some assistance from them. After going through a small maze of transfers, we finally ended up talking to the right person who knew (and quickly told us) that:

    That particular addressing mode, when used with that particular
    opcode, was known to throw the DSP into a hosed state.

    It was a bug in the processor itself. The solution was simply to change my code to use a different addressing mode, and all was well.

  6. One on GCC / No debugger by guerby · · Score: 5, Interesting
    When GNAT (the Ada front-end for GCC) was committed into the GCC CVS repository, there was a bootstrap object file comparison failure.

    2001-10-27 Laurent Guerby <guerby@acm.org>

    * trans.c (gigi): Fix non determinism leading to bootstrap comparison failures for debugging information.

    The culprit was the following line:

        init_gigi_decls (gnat_to_gnu_entity (Base_Type (standard_long_long_float), NULL_TREE, 0),
                         gnat_to_gnu_entity (Base_Type (standard_exception_type), NULL_TREE, 0));

    The two stage compilers called gnat_to_gnu_entity in a different order (which the C language permits, since the evaluation order of function arguments is unspecified), so the two created types were assigned different debugging IDs, hence the object debugging information comparison failure. Luckily it struck me while I was reading the compiler's entry point and thinking about non-determinism.
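    The effect is easy to reproduce in miniature. The sketch below is not GCC code: make_entity, init_decls and the ID counter are hypothetical stand-ins for gnat_to_gnu_entity and init_gigi_decls, and sequencing the calls through local variables is one common way to pin the order, assumed here rather than taken from the actual patch.

```c
/* Hypothetical stand-in: each call hands out the next debugging ID,
   the way creating a type entity might. */
static int next_debug_id;

static int make_entity(void)
{
    return ++next_debug_id;
}

struct decls {
    int float_id;
    int exception_id;
};

static struct decls init_decls(int float_id, int exception_id)
{
    struct decls d = { float_id, exception_id };
    return d;
}

/* The order of the two make_entity() calls is unspecified in C, so this
   yields {1, 2} with one compiler and {2, 1} with another. */
struct decls init_unspecified(void)
{
    next_debug_id = 0;
    return init_decls(make_entity(), make_entity());
}

/* Separate statements force a fixed order: the result is always {1, 2}. */
struct decls init_sequenced(void)
{
    next_debug_id = 0;
    int float_id = make_entity();
    int exception_id = make_entity();
    return init_decls(float_id, exception_id);
}
```

    Both versions are valid C and both hand out the IDs 1 and 2. Only the sequenced one guarantees which type gets which ID, and that guarantee is what a bootstrap comparison between differently compiled stages needs.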

    Cute :).

    Laurent