Slashdot Mirror


Lessons From Your Toughest Software Bugs

Nerval's Lobster writes: Most programmers experience some tough bugs in their careers, but only occasionally do they encounter something truly memorable. In developer David Bolton's new posting, he discusses the bugs that he still remembers years later. One messed up the figures for a day's worth of oil trading by $800 million. ('The code was correct, but the exception happened because a new financial instrument being traded had a zero value for "number of days," and nobody had told us,' he writes.) Another program kept shutting down because a professor working on the project decided to sneak in and do a little DIY coding. While care and testing can sometimes allow you to snuff out serious bugs before they occur, some truly spectacular ones occasionally end up in the release... despite your best efforts.

18 of 285 comments

  1. Compiler optimizer bugs by Dan+East · · Score: 4, Interesting

    Some of the bugs I've beat my head against the wall over the most are compiler bugs. It's easy to have the mindset that the compiler is infallible, and so programmers don't usually debug in a way that tests whether fundamentals like operators are really working right. This was particularly bad developing for Windows CE back around 2000 when you had to build for 3 different processors (ARM, MIPS and SH3). I ran into a number of optimizer bugs, usually related to binary operators. The usual solution was preprocessor directives (pragmas) to disable the optimizer around a specific block of code.
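
    A minimal sketch of that kind of directive, assuming a current MSVC, Clang, or GCC toolchain rather than the Windows CE compilers from the post. pack_flags is a hypothetical stand-in for whatever code the optimizer was mangling:

```cpp
// Switch the optimizer off for the code that follows.
#if defined(__clang__)
#  pragma clang optimize off
#elif defined(_MSC_VER)
#  pragma optimize("", off)
#elif defined(__GNUC__)
#  pragma GCC push_options
#  pragma GCC optimize("O0")
#endif

// Hypothetical function whose bit operations were being miscompiled.
unsigned pack_flags(unsigned high, unsigned low) {
    return ((high & 0xFFu) << 8) | (low & 0xFFu);
}

// Restore the optimization settings given on the command line.
#if defined(__clang__)
#  pragma clang optimize on
#elif defined(_MSC_VER)
#  pragma optimize("", on)
#elif defined(__GNUC__)
#  pragma GCC pop_options
#endif
```

    Keeping the unoptimized region as small as possible limits the performance cost and documents exactly which code the compiler could not be trusted with.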

    --
    Better known as 318230.
    1. Re:Compiler optimizer bugs by Anonymous Coward · · Score: 2, Interesting

      Just after I graduated and I was working at my first job writing my first program ever that was not a homework assignment, I decided to write it as a multi-threaded program. I had a race condition that was causing a data structure to give bad data. Took me almost 30 minutes to track it down. Now that I've gotten better at programming, race conditions take me much less time and rarely involve any debugging.

    2. Re:Compiler optimizer bugs by eulernet · · Score: 3, Interesting

      I had a worse experience: hardware bugs.

      Back in the 90s, I was working on a truck game.
      Strangely, when playing via network, the trucks on some computers sometimes desynchronized.
      I spent one week locating the problem by digging into verbose logs: it was due to the FDIV bug, which was subtly changing the positions of some trucks.

      More recently, I spent a lot of time figuring out why some programs crashed on my computer.
      After a few weeks, I realized that some bits in the RAM were dead: reading them back after a write returned random values.

    3. Re:Compiler optimizer bugs by arglebargle_xiv · · Score: 4, Interesting

      Some of the bugs I've beat my head against the wall over the most are compiler bugs.

      Ah yes, the gift that keeps on giving. Every new version of gcc that gets deployed has new optimizer bugs, to the point that, several years ago, we stopped using -O3 and above, since the small loss in performance (if there even was any) was easier to live with than a long tail of compiler bugs across dozens of different CPU types with every new release ("dozens" may be an under-estimate depending on how you want to count families of ARM, MIPS, Power, and other embedded CPUs).

    4. Re: Compiler optimizer bugs by Anonymous Coward · · Score: 5, Interesting

      A compiler guy here, who used to work for one of the RISC companies. Most compiler bugs are not that difficult to debug. But I worked on instruction scheduling and register allocation, hence always got assigned all the weird bugs. The most memorable one for me was actually a hardware bug. Most people don't realize it, but most commercial microprocessors have a lot of bugs in them. See the published errata and you will find many. A few years after this particular processor generation came on the market, I got assigned a weird crash bug from a commercial DBMS vendor (i.e. a very important customer). It took me forever to figure out, but it turned out to be a bug in the processor that corrupted a particular register (due to the register renaming logic screwing up on a rare combination of instructions), dependent on both the timing and the instruction combination. It became another erratum, and I ended up implementing a workaround. If you notice some benign but odd code sequence a compiler generates, there might be a good reason behind it :)

    5. Re: Compiler optimizer bugs by TheRaven64 · · Score: 3, Interesting

      Most compiler bugs are not that difficult to debug

      Another compiler guy here: Some compiler bugs are not that difficult to debug if you have a reduced test case that triggers the issue. Most are caused by subtle interactions of assumptions in different optimisations, and so go away with very small changes to the code and are horrible to debug. That involves staring at the before and after output of each step in the pipeline to find out exactly where the incorrect code was introduced, which is often not where the bug is, and then backtracking to find what produced the code containing the invalid assumption that was exploited later.

      --
      I am TheRaven on Soylent News
    6. Re:Compiler optimizer bugs by arglebargle_xiv · · Score: 3, Interesting

      Having said that, there was one gcc compiler bug that got me a trip to Europe. A client had spent about three months trying to track down an impossible data corruption bug on their NIOS II embedded device, and eventually flew me over to try and sort it out. Our code is paranoid enough to run checksums on internal memory blocks, and that was reporting a memory-corruption problem. After about a week of work (with half-hour turnaround times on the prototype hardware whenever we made a change) we found that gcc was adjusting some memory offset by 32 bits. Everything looked fine at a high level, e.g. in a debugger, but if you took a cycle-by-cycle memory snapshot then at some stage writes started being out by four bytes. It was only the memory-checksumming code that caught it initially, it knew there was a fault but you couldn't see it using any normal debugging tools. We fixed it by detecting when the memory block had "moved" due to the alignment bug and memcpy'ing it 32 bits over so it was where gcc thought it was.
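
      A rough sketch of the checksummed-memory-block idea, under stated assumptions: the post does not say which checksum or block layout was used, so GuardedBlock, seal, and is_intact are hypothetical names, and 32-bit FNV-1a is swapped in only because it is short.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical guarded block: a payload with its checksum stored alongside.
struct GuardedBlock {
    std::uint8_t data[64];
    std::uint32_t checksum;
};

// 32-bit FNV-1a hash over a byte range (an assumed choice of checksum).
inline std::uint32_t fnv1a(const std::uint8_t* bytes, std::size_t len) {
    std::uint32_t h = 2166136261u;          // FNV offset basis
    for (std::size_t k = 0; k < len; ++k) {
        h ^= bytes[k];
        h *= 16777619u;                     // FNV prime
    }
    return h;
}

// Record the checksum after every legitimate write to the block.
inline void seal(GuardedBlock& b) {
    b.checksum = fnv1a(b.data, sizeof b.data);
}

// Verify before every read; false means something else wrote to the block.
inline bool is_intact(const GuardedBlock& b) {
    return b.checksum == fnv1a(b.data, sizeof b.data);
}
```

      A check like this only says that the block changed behind the program's back, not why, which matches the post: the checksum raised the alarm and a cycle-by-cycle memory snapshot found the cause.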

  2. Passing Parameters with Side Effects by Etherwalk · · Score: 3, Interesting

    I once had a bug where red and blue values were swapping places across thousands of pixels, and it took quite a while to hunt down. It turns out there was a function doSomething called with parameters (pixel[i++],pixel[i++],pixel[i++]) while doing transformations. C and C++ don't specify the order in which function arguments are evaluated, and the compiled code pushed the third parameter onto the stack first, so it was using the red value from the array in the blue spot and vice versa across the entire image.
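
    A minimal sketch of the usual fix: read each value into a named local in its own statement so the order no longer depends on the compiler. Rgb, make_rgb, and read_pixel are hypothetical names standing in for the post's doSomething call.

```cpp
#include <cstddef>

struct Rgb {
    int r;
    int g;
    int b;
};

// Stand-in for the function that received the three channel values.
inline Rgb make_rgb(int r, int g, int b) {
    return Rgb{r, g, b};
}

// Reads three consecutive channel bytes starting at index i and advances i.
inline Rgb read_pixel(const unsigned char* pixel, std::size_t& i) {
    // One side effect per statement: each read completes before the next starts.
    const int r = pixel[i++];
    const int g = pixel[i++];
    const int b = pixel[i++];
    return make_rgb(r, g, b);
}
```

    The three separate statements cost nothing at runtime and give the same result on every compiler and calling convention.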

  3. Compiler bugs are the worst by sectokia · · Score: 3, Interesting

    When ARM first came out on some Philips CPUs it had bugs in the C compiler. The IT department called us hardware engineers in after being stuck on a bug for months. The problem with programmers is that too many of them work at a high level, and they hit a wall at some abstraction layer, usually at assembly code. The other problem with these compiler bugs was that as you removed unrelated code, they went away, because the compiler had pointer corruption issues. So to get the vendor to fix it, you often had to submit an entire copy of your code project. Sometimes we had to submit images of entire machines because the compiler would interact with an IDE and with Windows. These days we use only open source compilers to ensure we aren't held up and can identify and fix problems quickly.

    1. Re:Compiler bugs are the worst by sectokia · · Score: 5, Interesting

      The absolute worst I've had was a soft CPU in an Altera FPGA. It shipped with a C compiler. A programmer came to me to explain how his program would crash if he changed the order in which subroutines were defined. After carefully checking the logic, I found nothing wrong with his code. So I then trawled through the assembly. Again I could find nothing wrong, and thought I was losing my mind. I had to painstakingly check the CPU state after each instruction until I eventually found one instruction that did not set a flag as per the manual, and the assembler matched the manual. It was a fault that would only trigger if you did a certain conditional jump after a certain fetch, increment, then store sequence. It was a bug in the CPU pipeline logic. I learnt a valuable lesson: never trust anything. We wasted a lot of time because we were convinced we must have been the source of the fault.

  4. Debugging Gone Wrong by mlookaba · · Score: 4, Interesting

    Bug 1 (my fault) : Took over working on a financial application that took security identifiers and enriched them with all sorts of useful data. The original programmer had left, and nobody at the company knew anything about how it worked. Soon after, we were troubleshooting an issue reported by a client that the output data wasn't consistent between runs. I grabbed a list of all the unique security IDs I could find (about 100k) and pushed them through a couple of times just to try and replicate the issue. HOWEVER... it turns out the application was actually using the Bloomberg "By Security" interface under the hood. That was a service where you drop a list of IDs onto Bloomberg's FTP server, and they would respond with data... for a fee of $1 per security. The client got an unexpected bill of nearly $200k that month, and I had the most awkward talk ever with my boss. Fortunately, Bloomberg forgave the charges, and it turns out they were actually responsible for the inconsistent data, which was fixed on their end shortly thereafter.

    Bug 2 (not my fault) : A client/server application is returning odd responses to a particular query. Developer (we'll call him "Jason") inserts a switch into the code that dumps this query out to a hardcoded folder on the server. The code then gets checked into production WITH THE SWITCH TURNED ON. It went undetected for nearly a year because the query wasn't terribly high volume. But slowly and steadily, the query files built up over time. Our IT had lots of money to play with, so server space was not an issue. Unfortunately, the number of files was. Server performance degraded steadily, until finally this query would make the server crash every time. When we eventually tracked down the cause, there were millions of files sitting in the same folder on every single server in the group. It took nearly three days just to get the OSs to delete the files without falling over.

  5. More of an update than a bug by coop247 · · Score: 3, Interesting

    First job out of college doing tech support for a big corp. One day thousands of Win2000 computers start taking multiple hours to boot up. Nobody can figure out what the problem is, got like 20 people working on it for almost two weeks.

    After digging through logs and error messages I discover that some idiot who had denied doing anything had sent out an update via our client management software to add a new local user for support purposes. He didn't do this via a script, but rather "recorded" himself adding it to a machine and then sent out a copy of the files and registry entries that had changed. Unbeknownst to this genius, the local security database is a binary (pretty sure encrypted) file that you can't just go copying between machines.

    I put together a script that repaired the local database and fixed the problem in a couple minutes. But we literally had thousands of workers sitting around doing nothing, waiting for computers to boot, for like 2 weeks.

    --
    //TODO: Insert catchy phrase
  6. C library sleep(x) caused code instabilities... by Anonymous Coward · · Score: 2, Interesting

    My favourite head scratcher - back using Motorola's version of Unix, we had a voice response (IVR) application that would poll for activity, and otherwise sit idle using the sleep() call. The code had interrupt handlers (on SIGUSR, iirc) that would perform "real-time" activities as necessary (handling call hang ups, touch tone digit receipt, etc). When running under a load test scenario during a quality cycle, we kept running into scenarios where 1 in 1000 or so instances of our event handlers were NOT completing their work: call hangups went unhandled, digits went missing, etc.

    After MUCH digging, having witnessed our interrupt handling code simply stop executing halfway through a trace, we disassembled the sleep() call and found this jewel: a setjmp on invocation, and a longjmp back to that saved stack location when the SIGALRM timer that it set ran out. The assumption being that while in the sleep() call, no other code would be executing. In reality, if our event handlers were running when the SIGALRM timer ran out, the sleep call did a longjmp, restoring the stack back to its original state and wiping our interrupt handler off the stack.

    When Motorola was confronted, the first reaction was "no, we didn't do that. We're looking at the code." Only when we showed them the disassembled output did they admit there was an issue with the release of software we were using.

    That one took 4 days for me to track down as a junior programmer at the time, some 25 years ago.
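
    For contrast, a minimal sketch of an idle wait that tolerates handlers running in the middle of it. This assumes a POSIX system with nanosleep(), which is not what the Motorola library used; sleep_fully is a hypothetical helper name.

```cpp
#include <cerrno>
#include <time.h>   // POSIX nanosleep() and struct timespec

// Sleeps for the requested time even if signal handlers run meanwhile.
// Returns how many times the sleep was interrupted by a signal.
inline int sleep_fully(long milliseconds) {
    timespec remaining{};
    remaining.tv_sec = milliseconds / 1000;
    remaining.tv_nsec = (milliseconds % 1000) * 1000000L;

    int interruptions = 0;
    // When a handler interrupts the sleep, nanosleep returns -1 with EINTR
    // and stores the unslept time, so the loop resumes where it left off.
    while (nanosleep(&remaining, &remaining) == -1 && errno == EINTR) {
        ++interruptions;
    }
    return interruptions;
}
```

    The handler returns normally and the sleep picks up again afterwards, so no stack frame is ever unwound from underneath running code.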

  7. Incrementing by darkain · · Score: 3, Interesting

    One night while coding half asleep, I wrote the following to increment a variable in C++:

    x = x++;

    The problem with this code is that it is undefined behavior. It looks okay at first glance, but when you consider the machine code that would be built from it, a bit of ambiguity arises. The problem comes from combining the = sign with the ++ operator. Both of them assign to the x variable, but it is not well defined which assignment should happen first/last. The code was being built in both MSVC and GCC environments, each producing the opposite assignment ordering. This was awesome to debug, since the code "worked" on one platform but not the other!
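
    A short sketch of the unambiguous spellings; increment is a hypothetical wrapper used only for illustration. As far as the language rules go, x = x++ was undefined before C++17, and under the C++17 sequencing rules it is defined but leaves x unchanged, so it never does what was intended.

```cpp
// Returns x + 1 using a form with exactly one modification of x.
inline int increment(int x) {
    ++x;            // equivalent alternatives: x += 1;  or  x = x + 1;
    return x;
}
```

    Each of these forms changes x exactly once per statement, so there is nothing left for two compilers to disagree about.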

  8. Re:debugger by Jeremi · · Score: 4, Interesting

    Some people, when trying to analyze a buggy program, think "I know, I'll use a debugger". Now they have two buggy programs to analyze.

    -- a grumpy old programmer

    --
    I don't care if it's 90,000 hectares. That lake was not my doing.
  9. Is return value optimisation a bug? by Anonymous Coward · · Score: 2, Interesting

    Because it stymied me for weeks years back when I first started in C++. I'd written some code that made assumptions about where variables were initialised and what happened when said variables were returned, using some custom stuff in operator= and the constructor. (Irrelevant detail: I wanted to be able to return sub-matrices of a matrix that could be assigned to, in order to overwrite the relevant parts of the full matrix. Think MATLAB, where A([1 2 3], [3 4 5]) = B overwrites part (but not all) of matrix A. And I was fairly new to C++.)

    Worked great without optimisation.
    Broke horribly when optimisation was turned on.

    It was a learning curve, but eventually Google turned up a little thing called return value optimisation (or something-or-other elision, it seems to have a few names). Basically, by design, how code executes (literally what it does) can be a direct function of your optimisation flags. Specifically, which constructors and assignment operators get called, and in what order, when you start returning classes from functions.

    I know it's not technically a bug - after all, it's right there in page 5 billion point 2 of the spec - but still, it marked the end of my "my god C++ is amazeballs and can do no wrong" phase.
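
    A small sketch of the effect, with hypothetical names (Tracked, make_tracked, copy_count). The number of copy-constructor calls it records is not fixed by the source code: it can be zero, one, or two depending on the compiler, its flags, and the language standard in use.

```cpp
// Hypothetical global counter of copy-constructor calls.
inline int& copy_count() {
    static int n = 0;
    return n;
}

struct Tracked {
    int value;
    explicit Tracked(int v) : value(v) {}
    // A copy constructor with a visible side effect. The compiler is
    // allowed to skip calls to it when it elides the copy.
    Tracked(const Tracked& other) : value(other.value) { ++copy_count(); }
};

inline Tracked make_tracked(int v) {
    Tracked t(v);
    return t;   // the compiler may build t directly in the caller's object
}
```

    After Tracked t = make_tracked(7);, t.value is always 7, but copy_count() is whatever the compiler decided. Code that depends on the copy constructor running, as the sub-matrix trick did, can break the moment optimisation is enabled.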

  10. Re:Hardly devastating, but a waste of several hour by Dutch+Gun · · Score: 3, Interesting

    Oh, damn... yeah, done that as well. Frustrating as hell, because it just doesn't make sense until you finally figure out you're not even debugging the code you're working with.

    Other variations of "the impossible is happening" include:

    * Syncing to new code, recompiling, and crashing. Crashes only go away once you force a full rebuild to update stale precompiled headers.
    * Program crashes mysteriously, and is only fixed after the machine is rebooted (likely some process in RAM has been corrupted).
    * When you get automated crash debug reports from hundreds of thousands of customers, you eventually realize that a staggering number of people simply have bad hardware, due to the impossible crashes that occur (e.g. a = b + c; // --- crashes here. all variables are integers).
    * Compiler or hardware bugs - thankfully much more rare than they used to be.

    --
    Irony: Agile development has too much inertia to be abandoned now.
  11. 4000 is greater than 5000 by wolf12886 · · Score: 3, Interesting

    I was working on an embedded system recently that had a 5 minute timer to shut off the machine. We had received customer complaints that the machine occasionally shut off early. The code was a simple while loop that ran some PID controls and on every pass checked "if (run_time > 5 minutes): exit;". I ran the machine in the lab for a while and sure enough, it shut off early once in a while. I looked through, and eventually SCOURED, the code, assuming there was a subtle bug such as clock corruption due to interrupts or some kind of type conversion mistake, but I couldn't find anything. I eventually set up a serial printout from the machine so I could see what was happening. It would run and then print out "5 minutes elapsed, shutting down". No glitches or resets (which is what I was expecting).

    So now I'm staring at this one line, "if (run_time > 5 minutes): exit;", pulling my hair out. Finally, in a moment of insane desperation, I added another line to the while loop: "if (4000 > 5000): print("Something is very wrong!");". I carry the machine to the lab and set it up, and IT PRINTS. Every few minutes or so it pops up on the display. So now I'm just like "fuck everything", how can I possibly run code if I can't even trust the basic principle that the computer will do what I tell it to? So the first thing I do is add triple checks to all critical comparisons. That eliminates the symptoms for now, but I know it's going to cause weird problems forever if I leave it like that.

    OK, so the execution is buggy. I get out the scope and check the power line and various other things and it looks OK, but I notice at this point that the problem never occurs when the machine is running empty, only when it's loaded. So I clip ferrites everywhere you can possibly fit one and spend half a day putting metal covers on everything. As I run the machine this time I'm practically holding my breath: 1 run good, 2, 3. I'm getting super excited at this point, then bam, "Something is very wrong!" prints and I die a little inside.

    After walking out to my car and screaming at the sky for a while, I get back to it. At least I know it has something to do with noise. Since the machine can't possibly be more shielded, I take a look at the schematic. It looks normal, but there's a bunch of funky stuff on the reset line. I ask around and nobody knows why it's there. It's got a regular pull-up resistor, but somebody added a diode in series, and a ferrite bead right before the pin. Due to the voltage drop, MCLR is only being pulled up to 3.9 V instead of 5 V, so that's not good. Then I take a look at the ferrite on the board, and it's sticking off the board with a coil of wire through it, not 2 inches from a brushed motor the size of my fist. It must be acting like a transformer secondary. I shorted the diode and the ferrite and the problem never happened again!
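
    The post doesn't show what the "triple checks" looked like, so the following is only a guess at the idea: evaluate the comparison three times and take the majority. elapsed_voted is a hypothetical name, and volatile is used so the compiler cannot fold the three reads into one.

```cpp
// Returns true if at least two of three independent evaluations of
// (run_time_ms > limit_ms) agree that the limit has been exceeded.
inline bool elapsed_voted(const volatile unsigned long& run_time_ms,
                          unsigned long limit_ms) {
    int votes = 0;
    for (int k = 0; k < 3; ++k) {
        // volatile forces a fresh read of run_time_ms on every pass.
        if (run_time_ms > limit_ms) {
            ++votes;
        }
    }
    return votes >= 2;
}
```

    As the story makes clear, this only papers over the symptom. A CPU being glitched by noise on its reset line can misbehave anywhere, and the real fix was in the hardware.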