Slashdot Mirror


The Most Expensive One-Byte Mistake

An anonymous reader writes "Poul-Henning Kamp looks back at some of the bad decisions made in language design, specifically the C/Unix/Posix use of NUL-terminated text strings. 'The choice was really simple: Should the C language represent strings as an address + length tuple or just as the address with a magic character (NUL) marking the end? ... Using an address + length format would cost one more byte of overhead than an address + magic_marker format, and their PDP computer had limited core memory. In other words, this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day; but this one had quite atypical economic consequences.'"

9 of 594 comments (clear)

  1. The Road Not Taken by symbolset · · Score: 5, Insightful

    Two roads diverged in a yellow wood,
    And sorry I could not travel both
    And be one traveler, long I stood
    And looked down one as far as I could
    To where it bent in the undergrowth;

    Then took the other, as just as fair,
    And having perhaps the better claim,
    Because it was grassy and wanted wear;
    Though as for that the passing there
    Had worn them really about the same,

    And both that morning equally lay
    In leaves no step had trodden black.
    Oh, I kept the first for another day!
    Yet knowing how way leads on to way,
    I doubted if I should ever come back.

    I shall be telling this with a sigh
    Somewhere ages and ages hence:
    Two roads diverged in a wood, and I—
    I took the one less traveled by,
    And that has made all the difference.

    - Robert Frost, 1920

    --
    Help stamp out iliturcy.
    1. Re:The Road Not Taken by looie · · Score: 5, Insightful

      Not sure where you took your "poetical exegesis" class, but you should ask for a refund.

      The narrator as "vain, shallow individual" is entirely a character pulled out of your hindquarters, as there is nothing in the text of the poem to lead to that conclusion.

      The poem is simply a reflection on how we, as individuals, make choices in life. Some of us choose to take the direction taken by most of those around us. That might be university, family, job, retirement in FL. Some of us choose to turn aside from that direction and try another path. Programming a PDP to play "Space Travel," for example. Or writing an operating system "just for fun."

      Frost's suggestion is that these choices of path may seem insignificant at the time -- both paths being nearly the same; but that, as "way leads on to way," there's no going back and thus we may find ourselves down a path that leads to unexpected places. When Linus Torvalds wrote linux, he could not know that "the path less traveled" would lead to fame and fortune, literally. The college kids who created Slashdot could not know it would make them rich.

      In fact, the point of the poem is exactly that it does matter which path you take. But that you don't always know how your choice is going to turn out. Frost himself might have continued his career as a teacher, a stable and certain means of supporting his family. Instead, he chose to focus on his poetry. He took a chance. And it worked well for him.

      mp

      --
      "The secret to strong security: less reliance on secrets." -- Whitfield Diffie
    2. Re:The Road Not Taken by cecille · · Score: 5, Insightful

      Would anyone care you join me
      in flicking a few pebbles in the direction
      of teachers who are fond of asking the question:
      "what is the poet trying to say?"

      as if Thomas Hardy and Emily Dickinson
      had struggled but ultimately failed in their efforts -
      inarticulate wretches that they were,
      biting their pens and staring out of the windows for a clue.

      Yes, it seems that Whitman, Amy Lowell
      and the rest could only try and fail,
      but we in Mrs. Parker's third-period English class
      here at Springfield High will succeed

      with the help of these study questions
      in saying what the poor poet could not,
      and we will get all this done before
      that orgy of egg salad and tuna fish known as lunch.

      -- from Billy Collins "The Effort"

      --
      ...no two people are not on fire.
  2. Missed the point by mgiuca · · Score: 5, Informative

    Interesting, but I think this article largely misses the point.

    Firstly, it makes it seem like the address+length format is a no-brainer, but there are quite a lot of problems with that. It would have had the undesirable consequence of making a string larger than a pointer. Alternatively, it could be a pointer to a length+data block, but then it wouldn't be possible to take a suffix of a string by moving the pointer forward. Furthermore, if they chose a one-byte length, as the article so casually suggests as the correct solution (like Pascal), it would have had the insane limit of 255-byte strings, with no compatible way to have a string any longer. (Though a size_t length would make more sense.) Furthermore, it would be more complex for interoperating between languages -- right now, a char* is a char*. If we used a length field, how many bytes would it be? What endianness? Would the length be first or last? How many implementations would trip up on strings > 128 bytes (treating it as a signed quantity)? In some ways, it is nice that getaddrinfo takes a NUL-terminated char* and not a more complicated monster. I'm not saying this makes NUL-termination the right decision, but it certainly has a number of advantages over addr+length.

    Secondly, this article puts the blame on the C language. It misses the historical step of B, which had the same design decision (by the same people), except it used ASCII 4 (EOT) to terminate strings. I think switching to NUL was a good decision ;)

    Hardware development, performance, and compiler development costs are all valid. But on the security costs section, it focuses on the buffer overflow issue, which is irrelevant. gets is a very bad idea, and it would be whether C had used NUL-terminated strings or addr+len strings. The decision which led to all these buffer overflow problems is that the C library tends to use a "you allocate, I fill" model, rather than an "I allocate and fill" model (strdup being one of the few exceptions). That's got nothing to do with the NUL terminator.

    What the article missed was the real security problems caused by the NUL terminator. The obvious fact that if you forget to NUL-terminate a string, anything which traverses it will read on past the end of the buffer for who knows how long. The author blames gets, but this isn't why gets is bad -- gets correctly NUL-terminates the string. There are other, sneaky subtle NUL-termination problems that aren't buffer overflows. A couple of years back, a vulnerability was found in Microsoft's crypto libraries (I don't have a link unfortunately) affecting all web browsers except Firefox (which has its own). The problem was that it allowed NUL bytes in domain names, and used strcmp to compare domain names when checking certificates. This meant that "google.com" and "google.com\0.malicioushacker.com" compared equal, so if I got a certificate for "*.com\0.malicioushacker.com" I could use it to impersonate any legitimate .com domain. That would have been an interesting case to mention rather than merely equating "NUL pointer problem" with "buffer overflow".

    1. Re:Missed the point by Anonymous Coward · · Score: 5, Informative
    2. Re:Missed the point by dbc · · Score: 5, Informative

      Oh, Lordy, if you had ever programmed in a language with a 255 character limit for strings you would praise $DIETY every time you use a C string. Dealing with length limited strings is the largest PITA of any senseless and time-wasting programming task.

      Suppose C had originally had a length for strings? The only thing that makes sense is for the string length count to be the same size as a pointer, so that it could effectively be all of memory. A long is, by C language definition, large enough to hold a pointer that has been cast into it. So string length computations all become longs. Not such a big deal for most of life... until.... 64 bit addressing. Then all sorts of string breakage occurs.

      The bottom line is that in an application programming language strings need to be atomic, as they are in Python. You just should not care how strings are implemented, and you should never worry about a length limit. The trouble is, C is a systems programming language, so it is imperative that the language allow direct access to bit-level implementation. If you chose to use a systems programming language for application programming, well, then it sucks to be you. So why did we do that for so long? Because all the other alternatives were worse.

      Hell, I've used languages where the statement separator was a 12-11-0-7-8-9 punch. (Bonus points if you can tell me what that is and how to make one.) So a NUL terminated string looks positively modern compared to that.

  3. Maybe a better candidate by phantomfive · · Score: 5, Interesting
    C. A. R. Hoare, the inventor of Quicksort, also invented the NULL pointer. Something he apologized for:

    I call it my billion-dollar mistake. It was the invention of the null reference in 1965. At that time, I was designing the first comprehensive type system for references in an object oriented language (ALGOL W). My goal was to ensure that all use of references should be absolutely safe, with checking performed automatically by the compiler. But I couldn't resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years.

    --
    "First they came for the slanderers and i said nothing."
  4. PHK wide of the mark by epine · · Score: 5, Insightful

    Normally I tend to agree with what I've read from PHK, but this one seems wide of the mark. If you involve a *real* C guru in the discussion, I don't think there would be much sentiment toward nixing the sentinel.

    C makes a big deal about the equivalence of pointers and arrays. Plus in C a string also represents every suffix string.

    char string [] = { 't', 'e', 's', 't', '\0' };
    char* cdr_string = string + 1;

    Perfectly valid, as God intended. A string with a length prefix is a hybrid data structure. What is the size of the length structure up front? It can be interesting in C to sort all suffixes of a string, having only one copy of the string itself. Try that with length prefix strings. (The trivial algorithm is far from ideal for large or degenerate character sequences, but it does provide insight into position trees and the Burrows-Wheeler transform.)

    Nor would I blame all the stupid coding errors on the '\0' terminator convention. In C, a determined idiot can mess up just about anything, unless the compiler takes over and does things for you, a la Pascal by another name. If that had been the bias, would be all be using C now, or some other language? Repeat after me: Generativity Rocks. Nanny languages usually manage to bork generativity over. Correct Programming Made Easy never strays far from the subtitle Composition Made Difficult.

    No one who ever read Dijkstra and took him serious ever made a tiny fraction of the stupid mistakes blamed on hapless zero.

    If you want to point to a real steaming pile, strcpy() was designed by a moron with a bad hang-over and no copy of Dijkstra within a 100 mile radius. It was tantamount to declaring "you don't really need to test your preconditions ... what kind of sissy would do that?"

    C is a nice design, as evidenced by how seamlessly the STL was grafted onto C++ at the abstraction layer (at the syntax layer, not so much). The problem with C was always a communication problem. To use C well one must test preconditions on operation validity. To use algebra well one must test preconditions on operation validity.

    Where does PHK lay the blame for the algebraist who made it possible to divide both side of an equation by zero, or multiply an inequality by -1? Preferably with the complete moron who doesn't check preconditions on the validity of the operation. Two thousand years later, now we have a better solution?

    PHK is right about cache hierarchies. By the time cache hierarchies arrived, we had C++ with entirely different string representations.

    For some reason I've never been keen on having a programmer who can't manage to correctly test the precondition for buffer overflow making deep design decisions about little blocks of lead in the radiation path.

    And it's not even much of a burden. As Dijkstra observed, for many algorithms, once you have all your preconditions right and you've got a provable variant, there's often very little left to decide. It actually makes the design of many algorithms simpler in the mode of divide and conquer: first get your preconditions and variant right (you're now half done and you've barely begun to think hard), *then* worry about additional logic constraints (or performance felicitous sequencing of legal alternatives).

    The coders who first try to get their logical requirements correct and then puzzle out the preconditions do indeed make the original task more difficult than not bothering with preconditions at all, supposing there's some kind of accurate measure over crap solutions, which I refuse to concede.

  5. Re:The Road Not Taken by other Slashdot FPers by o'reor · · Score: 5, Funny

    I for one welcome that refreshing new way of writing "Frost's pissed."

    --
    In Soviet Russia, our new overlords are belong to all your base.