The Most Expensive One-Byte Mistake
An anonymous reader writes "Poul-Henning Kamp looks back at some of the bad decisions made in language design, specifically the C/Unix/Posix use of NUL-terminated text strings. 'The choice was really simple: Should the C language represent strings as an address + length tuple or just as the address with a magic character (NUL) marking the end? ... Using an address + length format would cost one more byte of overhead than an address + magic_marker format, and their PDP computer had limited core memory. In other words, this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day; but this one had quite atypical economic consequences.'"
- Robert Frost, 1920
Help stamp out iliturcy.
Interesting, but I think this article largely misses the point.
Firstly, it makes it seem like the address+length format is a no-brainer, but there are quite a lot of problems with that. It would have had the undesirable consequence of making a string larger than a pointer. Alternatively, it could be a pointer to a length+data block, but then it wouldn't be possible to take a suffix of a string by moving the pointer forward. Furthermore, if they chose a one-byte length, as the article so casually suggests as the correct solution (like Pascal), it would have had the insane limit of 255-byte strings, with no compatible way to have a string any longer. (Though a size_t length would make more sense.) Furthermore, it would be more complex for interoperating between languages -- right now, a char* is a char*. If we used a length field, how many bytes would it be? What endianness? Would the length be first or last? How many implementations would trip up on strings > 128 bytes (treating it as a signed quantity)? In some ways, it is nice that getaddrinfo takes a NUL-terminated char* and not a more complicated monster. I'm not saying this makes NUL-termination the right decision, but it certainly has a number of advantages over addr+length.
Secondly, this article puts the blame on the C language. It misses the historical step of B, which had the same design decision (by the same people), except it used ASCII 4 (EOT) to terminate strings. I think switching to NUL was a good decision ;)
Hardware development, performance, and compiler development costs are all valid. But on the security costs section, it focuses on the buffer overflow issue, which is irrelevant. gets is a very bad idea, and it would be whether C had used NUL-terminated strings or addr+len strings. The decision which led to all these buffer overflow problems is that the C library tends to use a "you allocate, I fill" model, rather than an "I allocate and fill" model (strdup being one of the few exceptions). That's got nothing to do with the NUL terminator.
What the article missed was the real security problems caused by the NUL terminator. The obvious fact that if you forget to NUL-terminate a string, anything which traverses it will read on past the end of the buffer for who knows how long. The author blames gets, but this isn't why gets is bad -- gets correctly NUL-terminates the string. There are other, sneaky subtle NUL-termination problems that aren't buffer overflows. A couple of years back, a vulnerability was found in Microsoft's crypto libraries (I don't have a link unfortunately) affecting all web browsers except Firefox (which has its own). The problem was that it allowed NUL bytes in domain names, and used strcmp to compare domain names when checking certificates. This meant that "google.com" and "google.com\0.malicioushacker.com" compared equal, so if I got a certificate for "*.com\0.malicioushacker.com" I could use it to impersonate any legitimate .com domain. That would have been an interesting case to mention rather than merely equating "NUL pointer problem" with "buffer overflow".
I call it my billion-dollar mistake. It was the invention of the null reference in 1965. At that time, I was designing the first comprehensive type system for references in an object oriented language (ALGOL W). My goal was to ensure that all use of references should be absolutely safe, with checking performed automatically by the compiler. But I couldn't resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years.
"First they came for the slanderers and i said nothing."
Slashdot is lost.
Help stamp out iliturcy.
Come on , this is complete rubbish___8^)_#;3,2,.3root>^$)(^(943hellomax0984)_))1..l2l2_}[[}{
hmm. marker character, or a length.
Marker: same type as string, so no need to worry about bit size, start/stop bits or other extraneous. String can be any size and only restricted by available memory. (given the ability to swap darn near unlimited pages in current hardware.... and the ability to virtualize across computers... this means strings have a potentially <i>infinite</i> limit)
Length: What's the size? What byte order? What bit size? How will this affect communications between platforms?
IMO, C and the null terminated string -saved- more than it cost. It's entirely (theoretically anyway) possible - given the kind of code I've seen in browsers and server code -that the web couldn't have existed without some of these assumptions. The "streaming" so core to unix depends on this... how else does one know when one hits the end of a file or a buffer?
When you mark cost, know what you pay. Not all costs are negative.
FTA:
We learn from our mistakes, so let me say for the record, before somebody comes up with a catchy but totally misleading Internet headline for this article, that there is absolutely no way Ken, Dennis, and Brian could have foreseen the full consequences of their choice some 30 years ago, and they disclaimed all warranties back then. For all I know, it took at least 15 years before anybody realized why this subtle decision was a bad idea, and few, if any, of my own IT decisions have stood up that long.
In other words, Ken, Dennis, and Brian did the right thing.
Jesus was all right but his disciples were thick and ordinary. -John Lennon
It probably wasn't about the bytes. The factors are:
1. Complexity. Without exception, every variable in C is an integer, a pointer or a struct. A null terminated string is a pointer to a series of integers -- barely one step more complex than a single integer. To keep the string length, you'd have to employ a struct. That or you'd have to create a magic type for strings that's on the same level as integers, pointers and structs. And you don't want to use a magic type because then you can't edit it as an array. Simplicity was important in C -- keep it close to the metal.
2. Computational efficiency. Many if not most operations on strings don't need to know how long they are. So why suffer the overhead of keeping track? That makes string operations on null terminated strings on average faster than string operations on a string bounded by an integer.
3. Bytes. It's only one extra byte with a magic type or an advanced topic struct. In both cases with an assumption that the maximum possible length on which the standard string functions will work is 64kb. If you're talking about a more mundane struct then you're talking about an int and a pointer to a block of memory which has an extra set of malloc overhead. That's a lot of extra bytes, not just one.
For the kind of language C aimed to be -- a replacement for assembly language -- the choice of null terminated strings was both obvious and correct.
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
They don't look the same to me, these days the "IT" decisions are taken by the MBA type guys, with the sole purpose of maximizing their chances to get more visibility, "exceed objectives" and get a larger bonus/promotion/whatever. Sure they're rational too but what do they have in common with CS?
Programmer for 20+ years here, BS and MS in CS. I used to share such opinions. Then I went to business school. I really enjoyed business school in part because I was constantly amused by how ignorant and wrong I had been regarding such opinions. May I be bold enough to suggest that the portrayal of MBAs in popular and nerd cultures are about as accurate as the portrayal of programmers in popular and non-nerd cultures.
None of the above should be interpreted to mean that business school makes one appreciate Dilbert any less. Dilbert is actually pretty popular with MBA types and their professors as well.
Normally I tend to agree with what I've read from PHK, but this one seems wide of the mark. If you involve a *real* C guru in the discussion, I don't think there would be much sentiment toward nixing the sentinel.
C makes a big deal about the equivalence of pointers and arrays. Plus in C a string also represents every suffix string.
char string [] = { 't', 'e', 's', 't', '\0' };
char* cdr_string = string + 1;
Perfectly valid, as God intended. A string with a length prefix is a hybrid data structure. What is the size of the length structure up front? It can be interesting in C to sort all suffixes of a string, having only one copy of the string itself. Try that with length prefix strings. (The trivial algorithm is far from ideal for large or degenerate character sequences, but it does provide insight into position trees and the Burrows-Wheeler transform.)
Nor would I blame all the stupid coding errors on the '\0' terminator convention. In C, a determined idiot can mess up just about anything, unless the compiler takes over and does things for you, a la Pascal by another name. If that had been the bias, would be all be using C now, or some other language? Repeat after me: Generativity Rocks. Nanny languages usually manage to bork generativity over. Correct Programming Made Easy never strays far from the subtitle Composition Made Difficult.
No one who ever read Dijkstra and took him serious ever made a tiny fraction of the stupid mistakes blamed on hapless zero.
If you want to point to a real steaming pile, strcpy() was designed by a moron with a bad hang-over and no copy of Dijkstra within a 100 mile radius. It was tantamount to declaring "you don't really need to test your preconditions ... what kind of sissy would do that?"
C is a nice design, as evidenced by how seamlessly the STL was grafted onto C++ at the abstraction layer (at the syntax layer, not so much). The problem with C was always a communication problem. To use C well one must test preconditions on operation validity. To use algebra well one must test preconditions on operation validity.
Where does PHK lay the blame for the algebraist who made it possible to divide both side of an equation by zero, or multiply an inequality by -1? Preferably with the complete moron who doesn't check preconditions on the validity of the operation. Two thousand years later, now we have a better solution?
PHK is right about cache hierarchies. By the time cache hierarchies arrived, we had C++ with entirely different string representations.
For some reason I've never been keen on having a programmer who can't manage to correctly test the precondition for buffer overflow making deep design decisions about little blocks of lead in the radiation path.
And it's not even much of a burden. As Dijkstra observed, for many algorithms, once you have all your preconditions right and you've got a provable variant, there's often very little left to decide. It actually makes the design of many algorithms simpler in the mode of divide and conquer: first get your preconditions and variant right (you're now half done and you've barely begun to think hard), *then* worry about additional logic constraints (or performance felicitous sequencing of legal alternatives).
The coders who first try to get their logical requirements correct and then puzzle out the preconditions do indeed make the original task more difficult than not bothering with preconditions at all, supposing there's some kind of accurate measure over crap solutions, which I refuse to concede.
The problem with C isn't strings. It's arrays. Strings are just a special case of arrays.
Understand that when C came out, it barely had types. "structs" were not typed; field names were just offsets. All fields in all structs, program-wide, had to have unique names. There was no "typedef". There was no parameter type checking on function calls. There were no function pointers. All parameters were passed as "int" or "float", including pointers and chars. Strong typing and function prototypes came years later, with ANSI C.
This was rather lame, even for the late 1970s. Pascal was much more advanced at the time. Pascal powered much of the personal computer revolution, including the Macintosh. But you couldn't write an OS in Pascal at the time; it made too many assumptions about object formats. In particular, arrays had descriptors which contained length information, and this was incompatible with assembly-language code with other conventions. By design, C has no data layout conventions built into the language.
Why was C so lame? Because it had to run on PDP-11 machines, which were weaker than PCs. On a PC, at least you had 640Kb. On a PDP-11, you had 64Kb of data space and (on the later PDP-11 models) 64Kb of code space, for each program. The C compiler had to be crammed into that. That's why the original C is so dumb.
The price of this was a language with a built in lie - arrays are described as pointers. The language has no idea how big an array is, and there's not even a way to usefully talk about array size in C. This is the fundamental cause of buffer overflows. Millions of programs crash every day because of that problem.
That's how we got into this mess.
As I point out occasionally, the right answer would have been array syntax like
int read(int fd, char[n]& buf, size_t n);
That says buf is an array of length n, passed by reference. There's no array descriptor and no extra overhead, but the language now says what's actually going on. The classic syntax,
int read(int fd, char* buf, size_t n);
is a lie - you're not passing a pointer by value, you're passing an array by reference.
C++ tries to wallpaper over the problem by hiding it under a layer of templates, but the mold always seeps through the wallpaper when a C pointer is needed to call some API.
After 25 years of using C, I don't mind the strings being terminated by nulls. If you want to do something else, just don't include string.h.
Terminating with a null is only a convention - the C language itself has no concept of strings. As others point out, it is either an array of bytes or a pointer to bytes.
it isn't forced on to you - you don't have to follow it.
1. with NULL-terminated strings, there's no distinction (other than in the string.h and related library) between a char * and a other_type *. Inventing a "string" type in C (not C++) would have made the compiler more complex (see footnote **)
2. because char * is no different than other_type* , I can pass the address in the middle of the string char * for processing. Not so much for a std::string. How does it matter? Well, take parsing for example (the most trivial strtok) not only that one will need an extra string-len prefix, but you'll need to keep a separate "curr_pos".
If you have a NULL-terminated char* string, one can invent/use a std::string (or GString, or NSString, or Pascal-string). The reverse is not true: having the compiler accepting only Pascal-strings, it's not possible to start using the NULL-terminated convention.
many uses are much easier and faster when we know the length and for others few things beat a null-terminated string.
While in other cases (when you pass a std::string by-value and invoke the copy constructor, which tends to happen a lot), you have a hefty performance penalty.
Footnote ** - Dennis M. Ritchie on the C history.
C treats strings as arrays of characters conventionally terminated by a marker. Aside from one special rule about initialization by string literals, the semantics of strings are fully subsumed by more general rules governing all arrays, and as a result the language is simpler to describe and to translate than one incorporating the string as a unique data type.
Questions raise, answers kill. Raise questions to stay alive.
TFA suggests the decision was to save a byte, but I don't believe that's the main reason it happened.
If you're traversing a string anyway, what happens is that when you load the data into your register (which you'll be doing anyway, for whatever reason you're traversing the string), you get a status flag set "for free" if it's zero, so that's your loop test right there. Branch if zero. If you have to compare an offset to a length on every iteration, then now you're having to store this offset in another register (great, like I have lots of registers to spare on 1970s CPUs?) and compare (i.e. subtract) to the length which is stored in memory (great, a memory access) or another register (oh great, I need to use another register in the 1970s?!) and the code is bigger and slower.
It's easy to laugh these days about anyone caring about how many clock cycles a loop takes and whether it uses 2 registers or 4 registers, but this stuff used to be pretty important (and more recently than the 1970s). Kids these days: if you weren't there, you just don't know what it was like.
BTW, I have a hunch K & R didn't know they were building such an eternal legacy. It's reasonable to speculate that this is still going to be part of systems a hundred years from now, but in 1970 you would have been a mad man to suggest such a thing. (Not that this invalidates TFA's point at all; I'm just making excuses for K&R I guess.)
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day
Actually the tradeoff may not have been rational.
Actually, the choice was rational (at least, on purpose) - you see, it's not about a single byte, it's about a new data type.
C treats strings as arrays of characters conventionally terminated by a marker. Aside from one special rule about initialization by string literals, the semantics of strings are fully subsumed by more general rules governing all arrays, and as a result the language is simpler to describe and to translate than one incorporating the string as a unique data type. Some costs accrue from its approach: certain string operations are more expensive than in other designs because application code or a library routine must occasionally search for the end of a string, because few built-in operations are available, and because the burden of storage management for strings falls more heavily on the user.
Questions raise, answers kill. Raise questions to stay alive.
Actually the tradeoff may not have been rational. The storage bytes saved may have been offset by the extra code bytes necessary for handling unknown length strings.
Not really, no. Having written basic library code for both, it usually requires more code to handle Pascal-style (length+data) strings than C-style (data+null) strings. You save quite a bit of code ("quite a bit" being relative, but I've had to squeeze code into 208 bytes of RAM before) by using the C-style strings most of the time.
"Convictions are more dangerous enemies of truth than lies."
Aside from your apparent confusion between NULL-terminated (0x00) and NUL-terminated ('\0') I completely agree.
Which would seem to imply you have reason to believe the GP is incorrect in his interpretation of the poem.
Please enlighten us with your insights.
Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
If they had gone with the embedded length option we'd be sitting around bitching about how short-sighted it was to use just two bytes for the length. Including how Dennis Ritchie supposedly said "64K strings should be enough for anybody".
I for one welcome that refreshing new way of writing "Frost's pissed."
In Soviet Russia, our new overlords are belong to all your base.
But you reply to him knowing he won't read you? ( Or, if I let my paranoia run, you're AC and you reply so you have a link to your prior post and can check answers... And so you'ld be trolling.)
(\__/) This is Lapinator
(='.'=) copy it in your sig
(")_(") so it can take over the world
http://poetrypages.lemon8.nl/life/roadnottaken/roadnottaken.htm Robert Frost on his own poetry: "One stanza of 'The Road Not Taken' was written while I was sitting on a sofa in the middle of England: Was found three or four years later, and I couldn't bear not to finish it. I wasn't thinking about myself there, but about a friend who had gone off to war, a person who, whichever road he went, would be sorry he didn't go the other. He was hard on himself that way."
The narrator as "vain, shallow individual" is entirely a character pulled out of your hindquarters, as there is nothing in the text of the poem to lead to that conclusion.
Ahem.
The ironic interpretation, widely held by critics,[2][3] is that the poem is instead about making personal choices and rationalizing our decisions, whether with pride or with regret.
Source: http://en.wikipedia.org/wiki/The_Road_Not_Taken_(poem)
I'm tempted to bookmark this response as a great example of why engineers should not fear breadth requirements. (I'm assuming anyone with such a low Slashdot ID works in engineering...)
The ironic interpretation is widely held because it's supported not only by the text, but also Frost's own statements, and the broader context of his work--in which seemingly simple descriptive verse hides darker, more complex themes. (A major reason why he is held in such high regard.) This particular poem is a common subject for lessons on critical analysis of literature. The key starting point is that first-person narrators are not necessarily reliable.
Build a man a fire, he's warm for one night. Set him on fire, and he's warm for the rest of his life.