Slashdot Mirror


Eric S. Raymond Identifies A Common Programming Trap: 'Shtoopid' Problems (ibiblio.org)

"There is a kind of programming trap I occasionally fall into that is so damn irritating that it needs a name," writes Eric S. Raymond, in a new blog post: The task is easy to specify and apparently easy to write tests for. The code can be instrumented so that you can see exactly what is going on during every run. You think you have a complete grasp on the theory. It's the kind of thing you think you're normally good at, and ought to be able to polish off in 20 LOC and 45 minutes.

And yet, success eludes you for an insanely long time. Edge cases spring up out of nowhere to mug you. Every fix you try drags you further off into the weeds. You stare at dumps from the instrumentation until you're dizzy and numb, and no enlightenment occurs. Even as you are bashing your head against a wall of incomprehension, consciousness grows that when you find the solution, it will be damningly simple and you will feel utterly moronic, like you should have gotten there days ago.

Welcome to programmer hell. This is your shtoopid problem.... If you ever find yourself staring at your instrumentation results and thinking "It...can't...possibly...be...doing...that", welcome to shtoopidland. Here's your mallet, have fun pounding your own head. (Cue cartoon sound effects.)

Raymond's latest experience in shtoopidland came while working on a Python-translating tool, and left him analyzing why there are some programming conundrums that repel solutions. "You're not defeated by what you don't know so much as by what you think you do know," he concludes. So how do you escape?

"[I]nstrument everything. I mean EVERYTHING, especially the places where you think you are sure what is going on. Your assumptions are your enemy; printf-equivalents are your friend. If you track every state change in your code down to a sufficient level of detail, you will eventually have that forehead-slapping moment of why-didn't-I-see-this-sooner that is the terminal characteristic of a shtoopid problem."

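A minimal C sketch of what "printf-equivalents" on every state change can look like. The `TRACE_INT` macro and the `count_words` function are hypothetical names invented for illustration; they are not taken from Raymond's post.

```c
#include <stdio.h>

/* Hypothetical helper: print a variable's name and value, tagged with the
   file and line where the state change happened. */
#define TRACE_INT(var) \
    fprintf(stderr, "%s:%d: %s = %d\n", __FILE__, __LINE__, #var, (var))

/* Counts whitespace-separated words, tracing every state change. */
int count_words(const char *s)
{
    int count = 0;
    int in_word = 0;

    for (; *s != '\0'; s++) {
        int is_space = (*s == ' ' || *s == '\t' || *s == '\n');
        if (!is_space && !in_word) {
            in_word = 1;
            count++;
            TRACE_INT(count);      /* a new word just started */
        } else if (is_space && in_word) {
            in_word = 0;
            TRACE_INT(in_word);    /* the current word just ended */
        }
    }
    return count;
}
```

Calling `count_words("a bb  ccc")` returns 3 and writes one trace line to stderr for each change, including the ones that look too obvious to bother logging.
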
Share your own stories in the comments. Are there any programmers on Slashdot who've experienced their own shtoopid problems?

31 of 189 comments

  1. Not always... by rbeattie · · Score: 4, Interesting

    More times than not, the solution is actually really difficult - you just underestimated the problem. Then you go to github and find a library that shows you how it should be done, and you can't believe it takes so much code to do something that seemed so straightforward.

    --
    Me
    1. Re:Not always... by StormReaver · · Score: 4, Informative

      ... can't believe it takes so much code to do something that seemed so straightforward.

      While that happens too, it is on the other end of the spectrum of what Eric is describing.

    2. Re:Not always... by igny · · Score: 3, Interesting
      The shtoopidest problem I faced was in TransactSQL. Usually, the syntax there is case insensitive, but there is a difference between
      • where timeStamp >= format(watermark,'yyyy-MM-dd hh:mm:ss') --<-- incorrect
      • where timeStamp >= format(watermark,'yyyy-MM-dd HH:mm:ss') --<--correct

      This bug was extremely elusive for me because the code looks fine and watermark in our data is almost never between 00:00:00 and 01:00:00, and that was when the bug sometimes caused missing data in our target tables.
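
The same 12-hour versus 24-hour trap exists in C's `strftime`, where `%I` plays the role of `hh` and `%H` the role of `HH`. The sketch below is an analogy in C, not T-SQL; `format_stamp` and `passes_watermark` are hypothetical helpers and the date is arbitrary.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Formats a fixed, arbitrary date (2017-01-15) with the given time of day.
   Returns the number of characters written, or 0 on failure. */
size_t format_stamp(char *out, size_t cap, const char *pattern,
                    int hour, int min, int sec)
{
    struct tm t;
    memset(&t, 0, sizeof t);
    t.tm_year = 2017 - 1900;
    t.tm_mon = 0;               /* January */
    t.tm_mday = 15;
    t.tm_hour = hour;
    t.tm_min = min;
    t.tm_sec = sec;
    return strftime(out, cap, pattern, &t);
}

/* Nonzero when a row stamped row_stamp would satisfy
   "timeStamp >= watermark" under plain string comparison. */
int passes_watermark(const char *row_stamp, const char *watermark)
{
    return strcmp(row_stamp, watermark) >= 0;
}
```

With a watermark at half past midnight, `"%Y-%m-%d %H:%M:%S"` yields `2017-01-15 00:30:00` while `"%Y-%m-%d %I:%M:%S"` yields `2017-01-15 12:30:00`, so a row stamped 09:00:00 on the same day passes the first comparison and is silently dropped by the second.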

      --
      In theory there is no difference between theory and practice. In practice there is. - Yogi Berra
    3. Re:Not always... by Rei · · Score: 4, Insightful

      Agreed. But I don't really think Eric's "solution" is that helpful. Here's the last two "shtoopid" problems I had (in a CFD model evolution app):

      1) My program (which had a piped child process, which in turn had its own children) was randomly locking up when one of the subprocess's children died. Now, normally that's an eminently solvable problem... except for the fact that it was locking up in a different place each time. I was stuck digging deeper and deeper into pipe magic with no luck. I even went to strace python, but that just added more confusion, as strace was dying at random times, and sometimes not even printing out full lines!

      The problem? The output of my program was running through "tee". I was only seeing the last section of the buffer to be printed out :P The real problem was that the subprocess had simply stopped printing data and so the pipe read was hanging; it was instantly obvious when "tee" wasn't used.

      2) My program would sometimes go into a "subprocess keeps dying" mode. This started out of the blue with no changes to my code. Again, I kept instrumenting more and more, with no luck.

      The problem? I had started, in another window, a shell script that ran on a loop to generate visualization data at regular intervals whenever the process was running. When the visualization data would appear in the middle of a run, it would sometimes interfere with the raw data, due to the way the data processing was set up. But since that visualization data wasn't present when the run started, it took time for the problem to show up, and then would just occur out of the blue.

      The short of this is... if you follow Eric's "instrument everything" solution to "shtoopid" problems, you'll sometimes just dig yourself further into a hole. The problem is that you have a base assumption that's wrong. IMHO, the best solution is to bring a third party in and explain everything about what you're doing and where it's going wrong. Not only can their different perspective add insight, but the very act of having to explain and reproduce everything from scratch (and answer their questions) can help you as well.

      --
      "Who the hell is Nietzche? It's a question stupid people are asking." -- Newscaster, "Jesus Christ Supercop"
    4. Re: Not always... by UnknowingFool · · Score: 3, Interesting

      Sometimes it's a very slight difference between environments that causes problems. We rolled out some code to production that had been fully tested in Dev and Test environments. Things started to break due to SQL errors. Ran the SQL directly on the production database server but it ran fine. Somehow the SQL was getting different results running through the production server than it was on the production database server directly.

      After some investigation the only difference between the production server and other environments was the server used a slightly older database driver. It was a minor version difference. How this caused errors was that in the older db driver all math operations had to be explicit data casts despite what documentation said but the newer driver followed the database documentation. So Integer A / Integer B should be implicitly cast as Integer according to the documentation. However the older driver would cast that as Float for some unknown reason and that would cause errors.

      But this would only happen using the db driver on Production. Testing the SQL directly on Production DB wouldn't have found it. Testing the code and SQL on Dev and Test servers wouldn't have found the bug. The patch notes for the db driver didn't mention the change.

      --
      Well, there's spam egg sausage and spam, that's not got much spam in it.
  2. Re:He really is old, isn't he? by StormReaver · · Score: 5, Insightful

    ...he doesn't simply use a debugger to step through the problematic code?

    That misses the entire point. In the class of problem he is describing, everything looks fine at the debugging level (regardless of how you are debugging). Or better yet: your debugging tools show that something is wrong, yet how the program gets into that state is elusive. You have traced the program execution in excruciating detail, and everything looks great until the very next line of code morphs your perfect execution state into a problematic one for reasons that appear to be impossible. Eventually, you figure out how it's possible, write a small amount of code that you should have written earlier in the process, and fix the problem.

    You then realize the obviousness of the solution, and feel like an idiot for having spent hours, days, weeks, or months figuring it out.

  3. Re:He really is old, isn't he? by ledow · · Score: 4, Interesting

    Ever tried debugging deep-level OS kernel code?

    To be honest, debuggers also introduce just as many differences - I have crafted code (nothing special, fancy or playing tricks) that, when debugged, works entirely differently to non-debugged. Debugging inserts all kinds of stuff into the code that modifies the pointers of all kinds of data by vast amounts, and can make it "pass" whatever it is you wanted to do.

    Also, if you program against many architectures, an architecture-specific bug might be something that you don't have the tools for, despite debugging the code on all your normal platforms. Yes, a debugger is the ultimate solution, but mostly you might just not have that stuff available and it could be days or weeks before you can get it going to the point that you can effectively debug code that you've been working on for 20 years and know inside out.

    Plus many problems are not debuggable - maybe your users are having the issue but you're not, and you can't reproduce, but dozens of your users can, and yet they have almost identical environments to you - the only way to debug that is to set up a full programming, debugging and source environment on their machine - which may be something you don't want to do - or give them an instrumented version of the executable, which may not reproduce the problem.

    I know for a fact that I have programs that work on Linux, Windows, even HTML5 (via emscripten), that also can work on Mac. But for sure I wouldn't be buying a Mac to diagnose problems on that platform until it was absolutely necessary. And I wouldn't be giving my code to users for them to diagnose it.

    But throw in a bunch of printf's and a log and - no matter the architecture or tools available - you can get down to a function, a line, a set of parameters enough to debug before you even need to think "How the fuck am I going to go about getting debug info out of that person/system/architecture?"

    I know I have a C macro that I prefix all functions with. In "normal" mode, it just expands to a function definition. In "debug" mode, it expands to the function, and a bunch of debugging lines for when it enters/leaves each function and the parameters given to it. This means one switch change and the program runs basically identically to how it runs without debugging, churns out a huge log file, doesn't modify any structures, pointers, etc. and which I can skim the bottom of after a crash report to know where and why it crashed, on any architecture, with a compiled binary, without including the full -g debugging shit that basically gives away your source code (or a version of it).
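
A rough sketch of that kind of macro, under stated assumptions: the names `TRACED_FN2` and `TRACE_CALLS` are invented here, and this version only handles functions that take two `int` parameters and return `int`. The poster's actual macro is not shown in the thread.

```c
#include <assert.h>
#include <stdio.h>

#ifdef TRACE_CALLS
/* Debug mode: emit a logging wrapper and move the real body into name##_body. */
#define TRACED_FN2(name, a, b)                                   \
    static int name##_body(int a, int b);                        \
    int name(int a, int b)                                       \
    {                                                            \
        fprintf(stderr, "enter %s(%d, %d)\n", #name, a, b);      \
        int result = name##_body(a, b);                          \
        fprintf(stderr, "leave %s -> %d\n", #name, result);      \
        return result;                                           \
    }                                                            \
    static int name##_body(int a, int b)
#else
/* Normal mode: the macro is nothing more than the function header. */
#define TRACED_FN2(name, a, b) int name(int a, int b)
#endif

/* Hypothetical example function written with the macro. */
TRACED_FN2(scale_percent, value, percent)
{
    assert(percent >= 0);
    return value * percent / 100;
}
```

Building with `-DTRACE_CALLS` turns on the enter/leave log for every function declared this way; building without it compiles the same source to plain functions, so the data layout and call sites do not change between the two modes.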

  4. Assert is your friend by Serif · · Score: 4, Insightful

    Been there, got several wardrobes full of T shirts.

    If unit testing and staring at code for more than a few minutes doesn't solve this kind of problem, then the assertion hammer comes out. Assert everything, especially the things that are so obvious that they don't need an assertion. The bugs just have fewer and fewer places to hide and eventually surrender.

  5. assert()'s for every assumption by jrbrtsn · · Score: 5, Interesting

    Over my 30 year career, I cannot believe how many 'C' programmers I've come across who are unfamiliar with the assert() macro. This macro is essential for trapping all invalid assumptions! Usually it's as simple as:

    if ( ! functionWhichCanFail(a,b,c) ) assert(0);

    Run your program from the debugger, and it will stop when the assert(0) is encountered, giving you full and convenient access to everything needed to hunt down the issue.

    1. Re:assert()'s for every assumption by pauljlucas · · Score: 4, Informative

      That is why the assert macro can be disabled via NDEBUG. You enable asserts during development and testing to catch errors so they do not go unnoticed, then disable them for production.

      --
      If you reply, do so only to what I explicitly wrote. If I didn't write it, don't assume or infer it.
    2. Re:assert()'s for every assumption by UnknownSoldier · · Score: 2

      A "professional" programmer bragging how much he is ignorant.

      *facepalm*

      Hint: If you aren't using all the (language's debugging) tools available then you aren't as smart, or professional, as you think you are. A "real" professional would go "Oh cool, this defensive programming -- an implementation of Fail Fast -- will help in tracking down bugs! Nice!"

      C programmers who don't use assert() are either ignorant, stupid, writing toy programs, or some combination. Period.

    3. Re:assert()'s for every assumption by blackhedd · · Score: 4, Informative

      You got the right way to say it: "Assert all of your assumptions."

      Code rots when it gets modified in ways that don't respect the implicit assumptions made in the past. Have you ever said to yourself, "this function is only called from two places and I know those two places validate the parameters"? When you then write the function without checking parameters, you've made an implicit assumption that makes all the sense in the world at that moment. But someone else (or you in three months) will forget or won't know the assumption, and call that function from somewhere else, with unchecked parameters.

      Either document your implicit assumptions (making them explicit), or (better) assert them. The way to get good at asserting implicit assumptions is first to learn how to recognize when you're making an implicit assumption! That takes skill and practice, but if you don't do it, asserts don't help you.
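
A small sketch of turning implicit assumptions into assertions. The `mean` function and its preconditions are hypothetical, chosen only to show the pattern of stating what the callers are trusted to have checked.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: average of n samples. The two existing callers
   validate their input, so the function itself does no error handling. */
double mean(const double *samples, size_t n)
{
    /* Implicit assumptions, made explicit: */
    assert(samples != NULL);   /* caller always passes a real buffer */
    assert(n > 0);             /* caller never passes an empty set   */

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}
```

A third caller that forgets to check for an empty set now stops at the `assert(n > 0)` line instead of quietly returning the result of a division by zero.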

      And I leave the asserts in the final builds, too. In decades of professional C programming, I've never had a case where the asserts imposed a measurable performance penalty.

      To the people who say "use NDEBUG to disable your asserts for production, because customers hate interruptions": no, don't do that. A violation of an implicit assumption is ALWAYS a bug, and it's always better for it to bite your behind sooner rather than later. The assert tells you exactly what you did wrong. I've had this conversation dozens of times:

      [Angry customer]: Your software crashed!
      [Me]: pop open the syslog and search for the word "assert."
      [Angry customer]: It says "line XXX in file YYY"!
      [Me]: you'll have the fix in two hours.

      Interestingly, there is a handful of classes of bugs that are impervious to asserting your assumptions. The worst of these, in my experience, is the accidentally shadowed variable. But using assert in a disciplined way is incredibly useful.

    4. Re:assert()'s for every assumption by blackhedd · · Score: 2

      Not the way I'd do it. An assert is always to be considered an orthogonal instrumentation of code, not code itself. The code must ALWAYS work according to intent, with or without the assert. The point of the assert is to detect when the intent isn't correct or has changed over time due to violated assumptions.

      Write it this way:

      auto ok = FunctionWhichShouldntEverFail (a,b,c);
      assert (ok);

      In short: always have the assert as its own statement (easily removed or commented out); and only assert values, not functions or expressions with side effects.
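
A short sketch of why the assert should stand alone. When `NDEBUG` is defined, `assert(expr)` expands to nothing and `expr` is never evaluated, so a call placed inside the assert disappears along with the check. `parse_digit` and `sum_digits` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parser: stores the value of digit c in *out, returns 1 on success. */
int parse_digit(char c, int *out)
{
    if (c < '0' || c > '9')
        return 0;
    *out = c - '0';
    return 1;
}

/* Sums the digits of s, which is assumed to contain only '0'..'9'. */
int sum_digits(const char *s)
{
    int total = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        int digit = 0;
        /* Wrong: assert(parse_digit(*s, &digit));
           With NDEBUG the call would vanish and digit would stay 0. */
        int ok = parse_digit(*s, &digit);   /* the call always runs           */
        assert(ok);                         /* only the check can be disabled */
        (void)ok;                           /* avoids an unused warning under NDEBUG */
        total += digit;
    }
    return total;
}
```

`sum_digits("1234")` returns 10 whether or not `NDEBUG` is defined, which is the property the parent comment is asking for: the code works the same with or without the assert.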

  6. printf() may not work for multithreaded problems by jrbrtsn · · Score: 4, Interesting

    A few years ago I had an issue in a multi-threaded program where using printf()'s caused the problem to go away. In order to track the problem down, I ended up writing messages to a buffer in RAM, and dumping the buffer to stdout after the problem occurred.
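
A sketch of the log-to-RAM approach, assuming a C11 compiler with `<stdatomic.h>`. All names here (`ring_log`, `ring_dump`, `LOG_SLOTS`, `LOG_WIDTH`) are invented, and the poster's actual implementation is not shown. Each message claims a slot with one atomic increment and takes no locks, so it perturbs timing far less than `printf`; two writers a full lap apart could still overwrite each other, which this sketch accepts.

```c
#include <stdarg.h>
#include <stdatomic.h>
#include <stdio.h>

#define LOG_SLOTS 64    /* how many recent messages are retained */
#define LOG_WIDTH 96    /* maximum length of one message, including NUL */

static char log_ring[LOG_SLOTS][LOG_WIDTH];
static atomic_uint log_next;    /* total number of messages ever recorded */

/* Formats a message into the next slot; no I/O happens here. */
void ring_log(const char *fmt, ...)
{
    unsigned slot = atomic_fetch_add(&log_next, 1u) % LOG_SLOTS;
    va_list ap;

    va_start(ap, fmt);
    vsnprintf(log_ring[slot], LOG_WIDTH, fmt, ap);
    va_end(ap);
}

/* Writes the retained messages, oldest first, and returns how many were written.
   Intended to be called after the problem has occurred. */
unsigned ring_dump(FILE *out)
{
    unsigned total = atomic_load(&log_next);
    unsigned first = total > LOG_SLOTS ? total - LOG_SLOTS : 0;
    unsigned written = 0;

    for (unsigned i = first; i < total; i++, written++)
        fprintf(out, "%s\n", log_ring[i % LOG_SLOTS]);
    return written;
}
```

Worker threads call `ring_log("state=%d", state)` wherever a `printf` would have gone, and `ring_dump(stdout)` is called once from the failure path.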

  7. So familiar! by jtgd · · Score: 2

    Been there, done that many times. Nothing more frustrating than to see something you know is absolutely impossible! But fairly satisfying when you ultimately find the bug.

    --
    J
  8. Rubber Duck Debugging by Anonymous Coward · · Score: 2, Informative

    I learned long ago to recognize the feeling that comes when I know I'm missing something obvious. When I do that, I grab a coworker, and explain the issue to them. Just explaining it to someone is frequently enough, but sometimes they spot something glaringly obvious that I've missed.

    I spent an hour once trying to find an issue where the difference was between I5 and l5. Yeah, depending on your font and display that may be an easy problem, or a hard one. One of those is a capital i, the other a lowercase L.

  9. Re:Get a better debugger by serviscope_minor · · Score: 4, Insightful

    Or with experience you realise that stepping debuggers are great for some problems and printfs are great for other problems.

    --
    SJW n. One who posts facts.
  10. Re:He really is old, isn't he? by Anonymous Coward · · Score: 2, Insightful

    We called these "Heisenbugs" - attempting to study the bug (via debugger/variable dumps, etc.) causes it to vanish from sight.

  11. Off by one...... by Proudrooster · · Score: 2

    I feel ya brother.. the off by one still gets me 30 years later.
    https://en.wikipedia.org/wiki/...

    I wish we could have an agreement that lists, arrays, elements, and anything put into a list, table, query, associative array, start with an index value of either 0 or 1.

    I don't care just pick one, and don't use two different standards in the same environment.

    1. Re:Off by one...... by Darinbob · · Score: 2

      Don't forget the old Fortran programmer who moved to C/C++ and insists that all of his arrays start at 1, like God intended, and added some helper functions and macros to do this. Pity the developer who inherited that code and had to maintain it.

  12. Offensive by 110010001000 · · Score: 4, Funny

    I find that calling someone "stupid" (even yourself) is offensive and the imagery of "hitting with a mallet" is extremely violent. He shouldn't be allowed to work on open source projects.

  13. Easy prevention: if (10 == variable) by raymorris · · Score: 2

    I've started preventing that by habitually putting the variable on the right side. If I accidentally use = instead of == I'll get a syntax error. It makes that bug impossible by just changing an arbitrary habit.

    if ( 10 == variable )

  14. Sometimes you need to debug optimized code. by Anonymous Coward · · Score: 2, Informative

    So you have a failed assertion. What happened? Fire up the debugger, breakpoint on abort. Breakpoint gets triggered, you get a backtrace. Can't imagine how you got there.

    Days of debugging later...

    The abort function is marked as "noreturn". Consequently instead of calling abort, the compiler saves a few bytes/cycles by jumping to a preexisting abort call, never mind the state of the stack frame. Of course, this single recycled abort call in the whole module is where all backtraces end up. Hooray.

    Now obviously the whole purpose of abort as opposed to exit is to get a core dump. And the whole purpose of a core dump is debugging. And debugging involves backtraces, so abort calls should leave stack and continuation in a useful and recognizable state. So the obvious remedy is not to mark abort as "noreturn". Because you never want to have the stack in a mess when aborting as opposed to exiting.

    Enter your most beloved glibc maintainer of yore. Who refuses to lie to the compiler for any reason at all.

    This shtoopid problem will stick around. -fno-crossjumping for yall.

  15. Re:printf() may not work for multithreaded problem by Wrath0fb0b · · Score: 5, Interesting

    Fun story time related by a colleague. A pretty common piece of software (hint: there's probably one running within a few hundred yards of you) had an elusive bug. But as the parent noted, printf caused the problem to go away, and it was suspected because it caused synchronization on stdout. Unlike the parent, the developers didn't have time to actually implement a buffered-log solution to figure this out, so they did the obviously logical thing -- they replaced all the printf calls with barrier() and shipped it. It's still running like this today.

    Another good one, I worked with someone who would log everything all the time by fprintfing to a high-numbered pipe. When I asked him, he gave a few advantages that still ring partially true (depends on context): first, he said, I can get the log from any running instance without even stopping by d-tracing the system call. But most critically, he said, all the formatting happens in userland and only after the syscall does the kernel actually realize that there's nothing on the other end of the pipe and drop the write. That means, he reasoned, that the release/debug versions would always have very close behavior and would avoid the class of 'bugs that don't reproduce in debug build'. As with the other story, to this day, there's a slew of machines out there, formatting and writing log messages to a pipe that's never open.
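
A POSIX-only sketch of the always-on logging idea. The descriptor number 200 and the name `debug_log` are assumptions made up for illustration; the colleague's real code is not shown. The formatting always happens in user space, and the `write` call simply fails with `EBADF` when nothing is attached to the descriptor.

```c
#include <stdarg.h>
#include <stdio.h>
#include <unistd.h>

#define DEBUG_FD 200    /* hypothetical high-numbered descriptor */

/* Formats the message unconditionally, then hands it to the kernel.
   Returns the number of bytes written, or -1 if the write failed
   (for example because nothing is open on DEBUG_FD). */
int debug_log(const char *fmt, ...)
{
    char line[256];
    va_list ap;
    int len;

    va_start(ap, fmt);
    len = vsnprintf(line, sizeof line, fmt, ap);
    va_end(ap);

    if (len < 0)
        return -1;
    if ((size_t)len >= sizeof line)
        len = (int)sizeof line - 1;     /* message was truncated */

    return (int)write(DEBUG_FD, line, (size_t)len);
}
```

To capture the log, something has to be attached to the descriptor before the program starts, for instance by having the launcher `dup2` a pipe or file onto it. Left unattached, every call still pays the formatting cost and one failed system call, which is the point: the code path is the same whether or not anyone is listening.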

  16. we can do better, but are doing worse by mothlos · · Score: 2

    We have solutions to reduce this sort of problem (at least once you get past the learning curve), but the top programming languages tend to implement very few of them. Reasoning about state is difficult, particularly when that state can be altered in unexpected ways. It is difficult to be confident that your code does what you think it does when you don't have a computer-checked method of specifying your intentions separate from what your code does.

    There are no magic solutions here, at the least you will end up needing to spend more time writing in a specification language and that requires learning how it works. I would say that a gentle introduction to something like this is Elm which has an aim of stripping down typed functional programming into something that doesn't really need a C.S. degree. Here is a video which helps to explain what a better type system can do for your code. If you want to see something a bit more mind-bending check out Idris which has a much more powerful specification language which can prevent things like off-by-one errors or unbounded recursion in many cases. Moving off the scale of usability a bit, there is ATS which is a difficult language, but its specification language is able to make pointer arithmetic safe and doesn't bind you to immutable data structures. Hell, even Rust is full of good ideas that help to avoid these issues. And if fault-tolerant distributed systems are your thing, you need to check out Erlang (or its sibling Elixir) as there are so many great ideas that have been around for decades yet don't get nearly enough exposure.

    This doesn't prevent us all from occasionally falling into this trap, but the theme of the languages listed is to find ways to encourage (or force) you to get the little things right the first time with computer-verified specification and to isolate the search space where problems are likely to occur.

    1. Re:we can do better, but are doing worse by AHuxley · · Score: 2

      Back to Ada.

      --
      Domestic spying is now "Benign Information Gathering"
  17. Re:printf() may not work for multithreaded problem by goose-incarnated · · Score: 2

    A few years ago I had an issue in a multi-threaded program where using printf()'s caused the problem to go away. In order to track the problem down, I ended up writing messages to a buffer in RAM, and dumping the buffer to stdout after the problem occurred.

    Similar story, except that the processor would reboot, clearing all the variables I stored leaving no opportunity to grab all the diagnostics.

    I examined the map, determined what the last address was, added an interrupt handler on the clock that logged the stack pointer ~250/sec (only needed to log the pointer if it was smaller than the existing one) to determine how much margin I had and used that little space between maximum stack and variables to write my diagnostics to.

    Once I had determined the smallest stack address that got used, I wrote my diagnostics into that margin between the stack and the bss. To make sure that the values wouldn't be overwritten on processor startup I could not use actual variables, I had to use a pointer variable that pointed to those ten bytes I could write into. At startup the bootstrap code would grab whatever was in that memory, chuck it via i2c onto another processor, clear the ten bytes, and then proceed with normal bootup.

    When booted from cold that memory held nothing, when rebooted the memory was not cleared (because power was not removed) and thus I had my diagnostics from the previous execution.

    And yes, I found the bug with the help of the diagnostics (don't recall what it was, but that isn't important).

    --
    I'm a minority race. Save your vitriol for white people.
  18. Re:Profilers by Galactic+Dominator · · Score: 2

    (2) Because they don't distinguish between waste (a) and time consuming functionality (b)

    If you are looking for profilers to analyze your code for inefficiencies, then you have a different definition of profiler than I believe most high level users do. Profilers are there to make a representation of where time/cycles are spent in code. It is up to the author to analyze and act upon such information. And profiling is extremely useful provided you understand the code and infrastructure. You are correct in one way though, it is useless for optimization provided you don't know the very basics of programming.

    --
    brandelf -t FreeBSD /brain
  19. != instead of == or vice versa by mark-t · · Score: 2

    I know what it is that I mean for the program to do, but sometimes will type exactly the opposite, all the while continuing to read it the way that I meant it. Even putting an assert in will not help because in close proximity to where I've accidentally created this kind of inverted condition, it is unfortunately quite likely I will repeat the mistake. And again, when I make these kinds of mistakes, I cannot easily find them on my own because I see the code I thought I wrote instead of what is necessarily actually there.

  20. Weeping Angels by Weaselmancer · · Score: 2

    I like to think of those kinds of bugs as Weeping Angels. They only move when you're not looking at them.

    I have about a dozen years experience in MS Embedded CE. There is typically a Release build, and a Debug build. Release will macro out all the debug statements, which changes the execution timing. Enough so that the bug that is biting you is often seen only in Release. Switch to Debug to chase it, and it goes away.

    I had a similar experience recently with a PIC32 project. The devboard they sell has floating inputs on UART1. It never fails in the devboard. It does fail in the board I made. The floating inputs every so often will decide to twitch back and forth rapidly, firing a shitstorm of interrupt requests that crash the firmware. It never dies on the devboard. It occasionally gets twitchy and dies on our board, which is exactly derived from the schematic of the devboard. As an added plus, if you hook up an oscilloscope to the pins that changes impedance, and the float goes away, and the problem goes away. I have no idea how the devboard does not suffer from the same problem.

    --
    Weaselmancer
    rediculous.
  21. Re:He really is old, isn't he? by Darinbob · · Score: 2

    I didn't have a debugger, since the stupid chip gets wonky when turning it on. So compile, load the code, look at the oscilloscope, scratch my head, and repeat. I worried I was doing something stupid like it wasn't really loading my new code but had the old code, but that checked out too. Ask for some help over skype, but get nowhere.

    Stop and stare at it, the change was supposed to be over and done in 5 to 10 minutes and it's been a few hours. Then I see it, I forgot a "~". I wasn't clearing that bit, I was clearing everything but that bit. And that's the first programming related question I tend to ask in interviews, so I felt pretty dumb.
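
The missing "~" in miniature. The flag name `STATUS_READY` and both helper functions are hypothetical; the poster's register and bit are not given.

```c
#include <stdint.h>

#define STATUS_READY (1u << 3)   /* hypothetical flag bit */

/* Intended behavior: clear only the bits in mask, keep everything else. */
uint8_t clear_flag(uint8_t reg, uint8_t mask)
{
    return (uint8_t)(reg & ~mask);
}

/* The bug: without the "~" this keeps only the bits in mask
   and clears everything else. */
uint8_t clear_flag_buggy(uint8_t reg, uint8_t mask)
{
    return (uint8_t)(reg & mask);
}
```

Starting from `0xFF`, `clear_flag(0xFF, STATUS_READY)` gives `0xF7`, while the buggy version gives `0x08`: one character separates "clear that bit" from "clear everything but that bit".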

    This was Thursday. I will be keeping a journal of my senility.