Slashdot Mirror


Protothreads and Other Wicked C Tricks

lwb writes "For those of you interested in interesting hard-core C programming tricks: Adam Dunkels' protothreads library implements an unusually lightweight type of threads. Protothreads are not real threads, but rather something in between an event-driven state machine and regular threads. But they are implemented in 100% portable ANSI C and with an interesting but quite unintuitive use of the switch/case construct. The same trick has previously been used by Simon Tatham to implement coroutines in C. The trick was originally invented by Tom Duff and dubbed Duff's device. You either love it or you hate it!"

11 of 229 comments (clear)

  1. Looks pretty cool by bobalu · · Score: 4, Interesting

    I used a Lifeboat lib back in the late 80's that this reminds me of. Cooperative multitasking. Eventually ported the whole thing to OS/2 and used that threading instead. All the code pretyy much worked as-is.

    --
    The revolution will NOT be televised.
  2. It isn't Duff's device. by mc6809e · · Score: 3, Interesting


    Duff's device is a way of forcing C to do a form of loop unrolling. It has nothing to do with coroutines.

  3. Loop Abuse by wildsurf · · Score: 4, Interesting
    Reminded me of a function I once wrote...

    The PPC architecture has a special-purpose count register with specialized branch instructions relating to it; e.g., the assembly mnemonic 'bdnz' means "decrement the count register by one, and branch if it has not reached zero." I've used this in some pretty weird loops, including this one that broke the Codewarrior 9.3 compiler (fixed in 9.4.) This computes the location of the n'th trailing one in a 32-bit integer. Pardon my weak attempt at formatting this in HTML:

    static uint32 nth_trailing_one(register uint32 p, register uint32 n) {
    register uint32 pd;
    asm {
    mtctr n; bdz end
    top: subi pd, p, 1; and p, p, pd; bdz end
    subi pd, p, 1; and p, p, pd; bdz end
    subi pd, p, 1; and p, p, pd; bdz end
    subi pd, p, 1; and p, p, pd; bdz end
    subi pd, p, 1; and p, p, pd; bdz end
    subi pd, p, 1; and p, p, pd; bdz end
    subi pd, p, 1; and p, p, pd; bdz end
    subi pd, p, 1; and p, p, pd; bdnz top
    end: }

    return __cntlzw(p ^ (p - 1));
    }

    The idea was that the instruction stream should stay as linear as possible; most of the time the branches are not taken, and execution falls through to the next line of code. Ironically (siliconically?), the entire function could probably be implemented in a single cycle in silicon; shoehorning bitwise functions like this into standard instructions tends to be extremely wasteful. Perhaps FPGA's will make an end run around this at some point. I've also tried this function with a dynamically-calculated jump at the beginning, similar to the case statement logic in the article.

    Hmm, I had a point I was trying to make with this post, but now it's escaped my mind... :-)
    --
    Weeks of coding saves hours of planning.
    1. Re:Loop Abuse by wildsurf · · Score: 3, Interesting

      Unless you know that n is usually large, wouldn't this be more efficiently implemented with cntlzw?

      It would be if I were looking for the n'th leading one, but this code is looking for the n'th trailing one. (e.g. for 0b0010011001011100, the 3rd trailing one is in the fifth-lowest bit.) The equivalent code sequence for leading ones is in fact more complicated, requiring three arithmetic instructions and a branch per iteration. (cntlzw, shift, xor, branch).

      I actually use this code as part of an algorithm where I have a very large (e.g. 65k-element) packed single-bit histogram array, and need to find the position of (say) the 1000th set bit. Vector instructions can do a coarse population-count from one end fairly efficiently, but once it's narrowed down to a 32-bit region, it comes down to slicing and dicing. My code operates by clearing the rightmost set bit in each iteration (x & (x - 1)), then at the end, isolating the needed bit (x ^ (x - 1)) and using cntlzw to find its position. To clear the leftmost set bit, you need three instructions: first get its position with cntlzw, then shift 0x80000000 right by that number of bits, and finally XOR to clear the bit. (If there's a shorter sequence, I haven't found it.)

      (oh, and for the troll responder-- you are quite spectacularly wrong. But thanks for the giggle.)

      --
      Weeks of coding saves hours of planning.
  4. It was looking interesting until by achurch · · Score: 4, Interesting

    I got to this little gem:

    The advantage of this approach is that blocking is explicit: the programmer knows exactly which functions that block that which functions the never blocks.

    My English parser thread shut down at that point . . .

    Seriously, this looks like a handy little thing for low-memory systems, though I'd be a bit hesitant about pushing at the C standard like that--the last thing you need is a little compiler bug eating your program because the compiler writers never thought you'd do crazy things to switch blocks like that.

  5. Python by meowsqueak · · Score: 4, Interesting

    Weightless threads in Python:

    http://www-128.ibm.com/developerworks/library/l-py thrd.html

    They are cooperative but far more efficient than Python's own threading model. You can easily create hundreds of thousands of concurrent threads.

  6. You want cool C stuff... by Dr.+Manhattan · · Score: 4, Interesting

    Get the book Obfiscated C and Other Mysteries by Don Libes. Explanations of various Obfuscated C contest entries, and alternate chapters illustrate neat corners of C, including a few things similar to this little library. Occupies a place of honor on my shelf.

    --
    PHEM - party like it's 1997-2003!
  7. Yes, can be useful (depending on platform) by ihavnoid · · Score: 3, Interesting

    As the prothread homepage says, it's for extremely small embedded systems, where there are no operating systems, with tiny amount of memory (You can't use DRAMs on systems that cost something less than $1). Want to use threads on those kind of systems, you have no choice.

    Another advantage is its portability. Small embedded systems, whether they have operating systems or not, usually can't support some fully-blown threading standard. Those operating systems seem to implement some kind of 'specially tuned' thread APIs.

    Using these kind of threads on a full-blown PC (or servers) would have almost no benefit. However, in the embedded software engineer's perspective, it's great to see a ultra-lightweight thread library without any platform-dependent code.

  8. wtf? by DeafByBeheading · · Score: 3, Interesting

    Okay, I'll play the n00b. I understand most of this, but my coding background is not that great, and mostly in C++, Java, and PHP, and I'm having problems with the switch from Duff's Device...


      switch (count % 8)
      {
      case 0: do { *to = *from++;
      case 7: *to = *from++;
      case 6: *to = *from++;
      case 5: *to = *from++;
      case 4: *to = *from++;
      case 3: *to = *from++;
      case 2: *to = *from++;
      case 1: *to = *from++;
            } while (--n > 0);
      }


    What the hell is up with that do { applying only in case zero? It's in several places on the net just like that and Visual Studio compiles this just fine, so it's not an error. I checked K&R, and they don't even hint at what could be going on there... I'm lost. Help?

    --
    Telltale Games: Bone, Sam and Max
  9. Use of this technique in Felix by skaller · · Score: 4, Interesting

    FYI this technique is heavily exploited in the programming language Felix:

    http://felix.sf.net/

    to provide user space threading. The main difference is that all the 'C tricks' are generated automatically by the language translator. If you're using gcc then the switch is replaced by a computed jump (a gcc language extension). On my AMD64/2800 time for creating 500,000 threads and sending each a message is 2 seconds, most of the time probably being consumed by calls to malloc, so the real thread creation and context switch rate is probably greater than Meg/sec order .. just a tad faster than Linux. Both MLton and Haskell also support this style of threading with high thread counts and switch rates (although the underlying technology is different).

    --
    John Skaller mailto:skaller@users.sf.net
  10. Re:I guess the idea is it's extremely portable. by GlassHeart · · Score: 3, Interesting
    There isn't a program that can't be written as a state machine; but most programs expressed this way are difficult to understand and maintain.

    My experiences contradict your statement. State machines are both easy to implement, and easy to debug, if you do it the right way. I have seen many entirely wrong implementations, including one where you can go from any of about two dozen "states" to any other. I have seen some that just switch states when they feel like it, or switch states based on complex decisions, which makes debugging difficult. Put another way, you can make a "state machine" degenerate into something else, and nullify its benefits, if you refuse to follow the rules.

    A well-implemented state machine has an important characteristic: it is clear to see why you are where you are. This means that state transitions are checked (against unexpected events) and traced, so debugging the machine is literally a matter of reading a log that looks like this:

    in state 0, received event A so went to state 2
    in state 2, received event B so went to state 1
    in state 1, received event C so went to state 5
    in state 5, ignoring event A
    in state 5, received unexpected event C

    and so on. In this particular example, the question to answer is why we're not handling event C properly in state 5, or why we went to state 5 in the first place. Either should be pretty obvious when you consult the original design. The fix is likewise obvious. Figuring out state machines, in my experience, has always been easier than figuring out multi-threaded code.

    This isn't to say that all programs should be implemented as a state machine. Simple Unix-style pipe programs, for instance, are generally unsuitable. If you don't know how to design a state machine properly, it's also going to be unsuitable.