Pet Bugs II - Debugger War Stories
AlphaHelix queries: "A few weeks back there was an article on Pet Bugs, where people were asked about their favorite bugs. I have a different sort of question: what was your greatest debugging challenge? I've been debugging for a long time, from analog circuits all the way up to multi-kLOC multithreaded servers, and I have some pretty grisly war stories, like the time I debugged a problem in a third-party DLL in machine code because the client didn't have the source for it (yay open source.) What was your greatest debugging triumph?" The first time Slashdot did this, it was more about bugs that you had encountered (and may not have solved); this one is about bugs in your own project's code and the trials and tribulations you had to go through to get them fixed.
Once I had to debug a program written in MFC... Wait. Sorry. The memory is too painful to recall.
This one time, at computer camp, I found a bug and stuffed it into my pussy!
Linus Torvalds, widely respected throughout the industry (yay open source!) as a programmer par excellence, has stated in public that debuggers are for wimps.
Thanks to Linus, we of the Free Software community can rest assured in the knowledge that we have the most stable, most secure operating system in the world. And what is it that makes Linux so great? Why, the fact that it was debugged entirely with printf's.
Karma: Good (despite my invention of the Karma: sig)
When we were working on a game title on the Nintendo 64, we were maintaining a parallel version for Windows, as it was substantially easier to develop for. Basically, we created some rendering, sound, controller and I/O code for Windows that duplicated what we'd created on the N64.
At one point, we were trying to find some problems with the camera behavior, so we created a flying camera object that coincided with the real camera. It looked like an old Hollywood camera, though the lens cap, reels, everything was just flat black. Then, we'd set up a fixed camera and watch what the game camera would be trying to do by observing where the flying camera went.
Time passed, and we'd forgotten about the added camera altogether. Then, as we were approaching a critical milestone, we went to bring the N64 build up to date... and the screen was black.
The game seemed to be playing, the menus were there, and the framerate counter was up, but - black. The Z-buffer had a constant value all the way across it, meaning there was some mysterious polygon that was exactly covering the screen all the time.
We were there until something like 2am, trying to figure out what the hell was going on. We were risking blowing this milestone, and with that, taking on a pretty hefty late delivery penalty.
We finally figured it out. Stepping back... well, the N64 engine didn't support back-face culling at that point, whereas the Windows engine did. So what's the upshot of the whole thing?
We'd left the lens cap on.
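For anyone who hasn't met back-face culling: a rough sketch of the test, in Python rather than anything resembling the actual N64 or Windows engine code. The function names, the coordinates, and the two-polygon scene are all made up for illustration. A polygon whose normal points away from the viewer is skipped when culling is on, which is why the lens cap (facing outward, seen from behind by a camera sitting inside the model) never showed up on Windows.

```python
def dot(a, b):
    """Dot product of two 3D vectors given as tuples."""
    return sum(x * y for x, y in zip(a, b))


def is_back_facing(normal, point_on_polygon, camera_pos):
    """True when the polygon's front side points away from the camera."""
    # Vector from the camera to a point on the polygon.
    to_polygon = tuple(p - c for p, c in zip(point_on_polygon, camera_pos))
    # If the normal points the same way as the view ray, we are looking
    # at the back of the polygon.
    return dot(normal, to_polygon) > 0


def polygons_drawn(polygons, camera_pos, cull_back_faces):
    """Names of the polygons a renderer would draw (hypothetical helper)."""
    return [
        name
        for name, normal, point in polygons
        if not (cull_back_faces and is_back_facing(normal, point, camera_pos))
    ]


# Hypothetical scene: the game camera sits at the origin looking down +z,
# with the debug camera model (and its lens cap) at the same spot.
camera_pos = (0.0, 0.0, 0.0)
polygons = [
    ("lens_cap", (0.0, 0.0, 1.0), (0.0, 0.0, 0.1)),   # faces outward, away from the camera
    ("scenery", (0.0, 0.0, -1.0), (0.0, 0.0, 10.0)),  # faces the camera
]

with_culling = polygons_drawn(polygons, camera_pos, cull_back_faces=True)
without_culling = polygons_drawn(polygons, camera_pos, cull_back_faces=False)
print(with_culling)     # the Windows-style build skips the cap
print(without_culling)  # the N64-style build draws it right in front of the lens
```

With culling the cap is dropped and the scene is visible; without it the flat black cap is drawn a hair in front of the camera and covers everything.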
Says the RIAA: When you EQ, you're stealing bass!
static final String ANYTHING = "anything; /*...*/ /*...*/ }
String x = ANYTHING;
if (x == ANYGHING) {
Ahh, the irony.
This space intentionally left blank.
Two situations come to mind.
The first is not so much a specific bug that was difficult to find as it was the general means that I was forced to use to locate bugs. (Yes, I found quite a few.) Back in the early 80's I was working at IBM in the QA group responsible for testing their VM operating system. We were tasked not only with improving the existing VM OS's performance on multi-processor systems, but also with improving its reliability through extensive reviews and testing. I was responsible for testing the free storage allocator.
Some background: For those who may not be aware, that whole operating system was written in IBM BAL (Basic Assembler Language) using the 370 instruction set. The VM operating system created a virtual machine environment for each user, thus producing the appearance of a machine identical in [almost] all respects to running on the bare hardware. Those few differences pertained to some optimizations in the virtual memory management's use of PTLBs (page table look-aside buffers), among others.
So, I needed to test memory allocation on the bare hardware to make sure that it worked okay. Once that was nailed down, I had to test memory allocation when VM was running on VM. But, there was yet another set of optimizations that I needed to test when a VM was running on VM running on VM (i.e. a "3rd level" VM).
It was not possible to just issue VM commands to test the various code paths. So, each test consisted of setting hardware breakpoints at the appropriate hex offset, and single stepping through these allocations. By the time I got to testing the 3rd level VM code, I was tracing and debugging these PTLB calculations and allocations, single stepping through instructions in hexadecimal and verifying multiple levels of indirection to memory pages where those calculations were also in hex. Those were the days! Just a year or so out of college, and I had all to myself a multi-million dollar mainframe computer that could support several hundred people!
The other bug was actually a specific bug that caused much early hair loss. I was working at a place where a lot of new employees were coming on board. Along with that, new departments were being formed, people were being promoted and moving to different groups, and there were a great many office moves as a result. So, it soon became a problem finding somebody's office. "Gee, wasn't Mary here just last week?"
Sensing a need, I wrote up a quick REXX program (yes, this was back in the early 80's, too) which did data acquisition through forms and supported the generation of reports sorted by various categories: Name, Department, Room Number, etc. This was pretty straightforward, and in a couple of days I'd gotten it coded, tested, and all the data populated. As there were only a few hundred people, I used a flat file (the other alternative was creating a DB2 database, and disk space was very dear back then!).
Rolled it out and received much positive feedback. Except, there was one person who noticed there was an error in the ordering of room numbers. See, the format was: "Floor: 1, 2, or 3"; then the "building wing: compass direction: N, S, E, or W"; and lastly, "room number: 2-digit number". As this was a rapidly growing organization, managers would be allocated several empty offices in advance for the people they'd hire during the next quarter. Also, the building had just been constructed, and some areas and sometimes whole wings were not yet ready for use, so there were many gaps in the data. Here is a selection of the kinds of results I saw for the room numbers, in ascending order:
Why are the rooms in the South wing nicely ordered, but things are really messed up for the East wing? I spent HOURS and HOURS trying to figure this one out. See if you can tell what the problem is before reading ahead.
The problem? There is a single set of comparison operators (<, =, and >), and they just do the right kind of comparison depending on the data type of the operands. Well, here I was thinking the data was text, but the program was making the comparisons as if these were numbers that contained an exponent part: a floor digit, the letter E, and a 2-digit room number reads as a number in exponential notation, while the same pattern with N, S, or W does not.
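To make that concrete, here is a small Python imitation of the "compare as numbers if both sides look numeric, otherwise as text" rule described above. It is only a sketch: it is not REXX, Python's float() is standing in for REXX's idea of what counts as a number, the function names are invented, and the room codes are made up to fit the floor/wing/room format (they are not the original data).

```python
def as_number(text):
    """Return the numeric value if text parses as a number, else None.

    Assumption: float() is a stand-in for the language's own number
    check; it accepts exponent notation such as "1E05".
    """
    try:
        return float(text)
    except ValueError:
        return None


def text_compare(a, b):
    """Plain character comparison: -1, 0, or 1."""
    return (a > b) - (a < b)


def loose_compare(a, b):
    """Numeric comparison when both operands look numeric, else text."""
    na, nb = as_number(a), as_number(b)
    if na is not None and nb is not None:
        return (na > nb) - (na < nb)
    return text_compare(a, b)


# Made-up room codes: floor digit, wing letter, 2-digit room number.
for left, right in [("1S03", "2S01"), ("1E05", "2E01")]:
    print(left, right,
          "text:", text_compare(left, right),
          "loose:", loose_compare(left, right))
```

For the South wing pair both comparisons agree. For the East wing pair they disagree: as text, 1E05 sorts before 2E01, but as numbers 1E05 is 100000 and 2E01 is 20, so the loose comparison puts the first-floor room after the second-floor one.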
Found it yet?
"It's too bad that stupidity isn't painful." - Anton LaVey