The Most Expensive One-Byte Mistake
An anonymous reader writes "Poul-Henning Kamp looks back at some of the bad decisions made in language design, specifically the C/Unix/Posix use of NUL-terminated text strings. 'The choice was really simple: Should the C language represent strings as an address + length tuple or just as the address with a magic character (NUL) marking the end? ... Using an address + length format would cost one more byte of overhead than an address + magic_marker format, and their PDP computer had limited core memory. In other words, this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day; but this one had quite atypical economic consequences.'"
- Robert Frost, 1920
Help stamp out iliturcy.
Interesting, but I think this article largely misses the point.
Firstly, it makes it seem like the address+length format is a no-brainer, but there are quite a lot of problems with that. It would have had the undesirable consequence of making a string larger than a pointer. Alternatively, it could be a pointer to a length+data block, but then it wouldn't be possible to take a suffix of a string by moving the pointer forward. Furthermore, if they chose a one-byte length, as the article so casually suggests as the correct solution (like Pascal), it would have had the insane limit of 255-byte strings, with no compatible way to have a string any longer. (Though a size_t length would make more sense.) Furthermore, it would be more complex for interoperating between languages -- right now, a char* is a char*. If we used a length field, how many bytes would it be? What endianness? Would the length be first or last? How many implementations would trip up on strings > 128 bytes (treating it as a signed quantity)? In some ways, it is nice that getaddrinfo takes a NUL-terminated char* and not a more complicated monster. I'm not saying this makes NUL-termination the right decision, but it certainly has a number of advantages over addr+length.
Secondly, this article puts the blame on the C language. It misses the historical step of B, which had the same design decision (by the same people), except it used ASCII 4 (EOT) to terminate strings. I think switching to NUL was a good decision ;)
Hardware development, performance, and compiler development costs are all valid. But on the security costs section, it focuses on the buffer overflow issue, which is irrelevant. gets is a very bad idea, and it would be whether C had used NUL-terminated strings or addr+len strings. The decision which led to all these buffer overflow problems is that the C library tends to use a "you allocate, I fill" model, rather than an "I allocate and fill" model (strdup being one of the few exceptions). That's got nothing to do with the NUL terminator.
What the article missed was the real security problems caused by the NUL terminator. The obvious fact that if you forget to NUL-terminate a string, anything which traverses it will read on past the end of the buffer for who knows how long. The author blames gets, but this isn't why gets is bad -- gets correctly NUL-terminates the string. There are other, sneaky subtle NUL-termination problems that aren't buffer overflows. A couple of years back, a vulnerability was found in Microsoft's crypto libraries (I don't have a link unfortunately) affecting all web browsers except Firefox (which has its own). The problem was that it allowed NUL bytes in domain names, and used strcmp to compare domain names when checking certificates. This meant that "google.com" and "google.com\0.malicioushacker.com" compared equal, so if I got a certificate for "*.com\0.malicioushacker.com" I could use it to impersonate any legitimate .com domain. That would have been an interesting case to mention rather than merely equating "NUL pointer problem" with "buffer overflow".
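To make the certificate case concrete, here is a minimal sketch in C. It is my own illustration, not the code of any real crypto library; struct counted_name and both function names are made up for the example. The name in the certificate arrives as bytes plus a length, and the bug is comparing it with a function that stops at the first NUL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted name, the way a certificate field might carry it. */
struct counted_name {
    const char *bytes;
    size_t      len;
};

/* Broken check: strcmp stops at the first NUL, so everything after an
   embedded NUL in the certificate name is silently ignored. */
bool names_equal_cstr(struct counted_name cert, const char *host)
{
    assert(cert.bytes != NULL && host != NULL);
    return strcmp(cert.bytes, host) == 0;
}

/* Length-aware check: the lengths must match before the bytes are compared. */
bool names_equal_counted(struct counted_name cert, const char *host)
{
    assert(cert.bytes != NULL && host != NULL);
    size_t host_len = strlen(host);
    return cert.len == host_len && memcmp(cert.bytes, host, host_len) == 0;
}
```

With a certificate name of "google.com\0.malicioushacker.com", the first function says it matches "google.com" and the second says it does not.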
When you look at std::string it uses both, and is better for it; many uses are much easier and faster when we know the length and for others few things beat a null-terminated string.
That's the way it happens in Soviet Russia, too.
Seriously, though, it's hard to know what language you as a system administrator should use for something like a data logger that has to run continuously (or cron every minute or so) other than C, but then there's the security problem that some user will come up with some weird filename hack to subvert the system.
I'm not a lawyer, but I play one on the Internet. Blog
Doesn't the magic marker method give you string lengths limited only by available memory and not by the size of the piece of memory devoted to length?
I wouldn't call this a mistake. The paradigm of programming in C is largely based on nuances like this. It makes you write code in a certain way that, in my opinion, is better suited for certain situations. The alternative mentioned in the summary would have made it a bit closer to OO programming as far as strings go, which one can argue would have been better, but I prefer to have differences like this in lower-level languages.
I call it my billion-dollar mistake. It was the invention of the null reference in 1965. At that time, I was designing the first comprehensive type system for references in an object oriented language (ALGOL W). My goal was to ensure that all use of references should be absolutely safe, with checking performed automatically by the compiler. But I couldn't resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years.
"First they came for the slanderers and i said nothing."
They don't look the same to me, these days the "IT" decisions are taken by the MBA type guys, with the sole purpose of maximizing their chances to get more visibility, "exceed objectives" and get a larger bonus/promotion/whatever. Sure they're rational too but what do they have in common with CS?
Slashdot is lost.
Which is worse? Having it be O(N) to get a string length and having inexperienced programmers get confused and make mistakes? Or capping your maximum string length at 0xffff?
I'll take the former, please. I do a lot of string manipulation in C and when you're used to it, it's actually not that bad to get right and still be efficient. And it provides a useful shibboleth to detect people who are no good at C. :-) Just think of how much harder it would be to interview a C programmer if you couldn't give them a crazy string manipulation problem.
Come on , this is complete rubbish___8^)_#;3,2,.3root>^$)(^(943hellomax0984)_))1..l2l2_}[[}{
this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day
Actually the tradeoff may not have been rational. The storage bytes saved may have been offset by the extra code bytes necessary for handling unknown length strings. Perhaps this is actually an example of premature optimization, optimizing things before proper profiling and analysis has shown the problem exists and the proposed solution is beneficial.
If the source string is NUL terminated, however, attempting to access it in units larger than bytes risks attempting to read characters after the NUL. If the NUL character is the last byte of a VM (virtual memory) page and the next VM page is not defined, this would cause the process to die from an unwarranted "page not present" fault.
On all modern computers, the page size is a power of 2, cleanly divisible by 32, 64, 128, 256, etc. Modern computers have a terrible penalty (sometimes including SIGBUS) for memory accesses which aren't aligned on the native word size. Throw those two facts together and you can't accidentally read past the VM page.
Do you even lift?
These aren't the 'roids you're looking for.
Unchecked boundary conditions, in the case of the Therac-25 an overflow of a one-byte counter, are a fatal flaw in poorly written software. In older 8-bit apps, this could wind up with random unexplained crashes. Well, in this case it caused people to be exposed to high doses of radiation over large areas of their bodies and cost people their lives. (And when I learned about this, it was when I decided I was much happier working on web/e-commerce stuff than working on embedded systems programming.)
hmm. marker character, or a length.
Marker: same type as string, so no need to worry about bit size, start/stop bits or other extraneous details. String can be any size and only restricted by available memory. (Given the ability to swap darn near unlimited pages in current hardware... and the ability to virtualize across computers... this means strings have a potentially infinite limit.)
Length: What's the size? What byte order? What bit size? How will this affect communications between platforms?
IMO, C and the null terminated string -saved- more than it cost. It's entirely (theoretically anyway) possible - given the kind of code I've seen in browsers and server code -that the web couldn't have existed without some of these assumptions. The "streaming" so core to unix depends on this... how else does one know when one hits the end of a file or a buffer?
When you mark cost, know what you pay. Not all costs are negative.
The real problem with the addr+len approach is that now every string becomes a struct, or a structptr.
This means that when passing a string to a function, either the string takes up two register/stack slots, or you're passing around a const-ptr (but the contents of the struct are not const), which means one more memory access due to pointer indirection.
x86 and the PDP-11 are register-starved. The x86 has 8 registers, with 4 or 5 available as general-purpose registers.
The PDP-11 was similar with 8 registers total as well.
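A minimal sketch of what that looks like in C, assuming a made-up struct str_slice (nothing like it exists in the standard library). Passed by value it occupies two slots where a char* occupies one, and taking a suffix means updating two fields instead of bumping one pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical address+length string; not part of any standard library. */
struct str_slice {
    const char *ptr;
    size_t      len;
};

/* Build a slice from an ordinary C string (costs one O(n) strlen). */
struct str_slice slice_from_cstr(const char *s)
{
    assert(s != NULL);
    struct str_slice out = { s, strlen(s) };
    return out;
}

/* Suffix starting at offset `skip`, clamped to the end of the slice.
   Two fields change here; with a plain char* it is just `s + skip`. */
struct str_slice slice_suffix(struct str_slice s, size_t skip)
{
    if (skip > s.len)
        skip = s.len;
    s.ptr += skip;
    s.len -= skip;
    return s;
}
```

The upside, of course, is that slice_suffix can clamp, which `s + skip` on a bare pointer cannot.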
FTA:
We learn from our mistakes, so let me say for the record, before somebody comes up with a catchy but totally misleading Internet headline for this article, that there is absolutely no way Ken, Dennis, and Brian could have foreseen the full consequences of their choice some 30 years ago, and they disclaimed all warranties back then. For all I know, it took at least 15 years before anybody realized why this subtle decision was a bad idea, and few, if any, of my own IT decisions have stood up that long.
In other words, Ken, Dennis, and Brian did the right thing.
Jesus was all right but his disciples were thick and ordinary. -John Lennon
Wow, this place has come a long way from a simple news for nerds site. Now, the authors are placing disclaimers specifically addressed to us :)
There's a 68.71% chance you're right.
From the article:
> Another candidate could be IBM's choice of Bill Gates over Gary Kildall to supply the operating system for its personal computer. The damage from this decision is still accumulating at breakneck speed [...]
This is the kind of factual, objective and unbiased content that gives credibility to an article.
lucm, indeed.
It probably wasn't about the bytes. The factors are:
1. Complexity. Without exception, every variable in C is an integer, a pointer or a struct. A null terminated string is a pointer to a series of integers -- barely one step more complex than a single integer. To keep the string length, you'd have to employ a struct. That or you'd have to create a magic type for strings that's on the same level as integers, pointers and structs. And you don't want to use a magic type because then you can't edit it as an array. Simplicity was important in C -- keep it close to the metal.
2. Computational efficiency. Many if not most operations on strings don't need to know how long they are. So why suffer the overhead of keeping track? That makes string operations on null terminated strings on average faster than string operations on a string bounded by an integer.
3. Bytes. It's only one extra byte with a magic type or an advanced topic struct. In both cases with an assumption that the maximum possible length on which the standard string functions will work is 64kb. If you're talking about a more mundane struct then you're talking about an int and a pointer to a block of memory which has an extra set of malloc overhead. That's a lot of extra bytes, not just one.
For the kind of language C aimed to be -- a replacement for assembly language -- the choice of null terminated strings was both obvious and correct.
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
Normally I tend to agree with what I've read from PHK, but this one seems wide of the mark. If you involve a *real* C guru in the discussion, I don't think there would be much sentiment toward nixing the sentinel.
C makes a big deal about the equivalence of pointers and arrays. Plus in C a string also represents every suffix string.
char string [] = { 't', 'e', 's', 't', '\0' };
char* cdr_string = string + 1;
Perfectly valid, as God intended. A string with a length prefix is a hybrid data structure. What is the size of the length structure up front? It can be interesting in C to sort all suffixes of a string, having only one copy of the string itself. Try that with length prefix strings. (The trivial algorithm is far from ideal for large or degenerate character sequences, but it does provide insight into position trees and the Burrows-Wheeler transform.)
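For anyone who wants to try the suffix trick, here is a rough sketch of the trivial algorithm using nothing but qsort and strcmp. The function names are mine. Every entry in the output array is just a pointer into the one copy of the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: each element is a pointer into the same text. */
static int cmp_suffix(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Fill `out` with pointers to the suffixes of `text` and sort them.
   At most `cap` suffixes (the first `cap` start positions) are used.
   Returns the number of suffixes stored. No characters are copied. */
size_t sort_suffixes(const char *text, const char **out, size_t cap)
{
    assert(text != NULL && out != NULL);
    size_t n = strlen(text);
    if (n > cap)
        n = cap;
    for (size_t i = 0; i < n; i++)
        out[i] = text + i;      /* a suffix is just the pointer moved forward */
    qsort(out, n, sizeof out[0], cmp_suffix);
    return n;
}
```

For "banana" this yields a, ana, anana, banana, na, nana. As noted above, the naive version degrades badly on long or repetitive input, since every comparison may rescan a long common prefix.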
Nor would I blame all the stupid coding errors on the '\0' terminator convention. In C, a determined idiot can mess up just about anything, unless the compiler takes over and does things for you, a la Pascal by another name. If that had been the bias, would we all be using C now, or some other language? Repeat after me: Generativity Rocks. Nanny languages usually manage to bork generativity over. Correct Programming Made Easy never strays far from the subtitle Composition Made Difficult.
No one who ever read Dijkstra and took him seriously ever made a tiny fraction of the stupid mistakes blamed on hapless zero.
If you want to point to a real steaming pile, strcpy() was designed by a moron with a bad hang-over and no copy of Dijkstra within a 100 mile radius. It was tantamount to declaring "you don't really need to test your preconditions ... what kind of sissy would do that?"
C is a nice design, as evidenced by how seamlessly the STL was grafted onto C++ at the abstraction layer (at the syntax layer, not so much). The problem with C was always a communication problem. To use C well one must test preconditions on operation validity. To use algebra well one must test preconditions on operation validity.
Where does PHK lay the blame for the algebraist who made it possible to divide both sides of an equation by zero, or multiply an inequality by -1? Preferably with the complete moron who doesn't check preconditions on the validity of the operation. Two thousand years later, now we have a better solution?
PHK is right about cache hierarchies. By the time cache hierarchies arrived, we had C++ with entirely different string representations.
For some reason I've never been keen on having a programmer who can't manage to correctly test the precondition for buffer overflow making deep design decisions about little blocks of lead in the radiation path.
And it's not even much of a burden. As Dijkstra observed, for many algorithms, once you have all your preconditions right and you've got a provable variant, there's often very little left to decide. It actually makes the design of many algorithms simpler in the mode of divide and conquer: first get your preconditions and variant right (you're now half done and you've barely begun to think hard), *then* worry about additional logic constraints (or performance felicitous sequencing of legal alternatives).
The coders who first try to get their logical requirements correct and then puzzle out the preconditions do indeed make the original task more difficult than not bothering with preconditions at all, supposing there's some kind of accurate measure over crap solutions, which I refuse to concede.
Man, it's a sad day on Slashdot when PHK says something and no one says that BSD is dead! You wingnuts are losing your edge.
Null termination sounds lovely when you're a teenager writing assembly and doing register allocation by hand, but it's obviously bad once you've seriously thought about runtimes, like after taking an algorithms class.
I spent my formative programming years primarily writing code in assembly and I resent that statement. :-) Runtime is always in one's mind and optimizing for speed is the desired goal. Optimizing for size is something that is merely forced upon us by circumstances beyond our control. No true assembly programmer, nor any true Scotsman, would prioritize size over speed if avoidable.
The problem with C isn't strings. It's arrays. Strings are just a special case of arrays.
Understand that when C came out, it barely had types. "structs" were not typed; field names were just offsets. All fields in all structs, program-wide, had to have unique names. There was no "typedef". There was no parameter type checking on function calls. There were no function pointers. All parameters were passed as "int" or "float", including pointers and chars. Strong typing and function prototypes came years later, with ANSI C.
This was rather lame, even for the late 1970s. Pascal was much more advanced at the time. Pascal powered much of the personal computer revolution, including the Macintosh. But you couldn't write an OS in Pascal at the time; it made too many assumptions about object formats. In particular, arrays had descriptors which contained length information, and this was incompatible with assembly-language code with other conventions. By design, C has no data layout conventions built into the language.
Why was C so lame? Because it had to run on PDP-11 machines, which were weaker than PCs. On a PC, at least you had 640 KB. On a PDP-11, you had 64 KB of data space and (on the later PDP-11 models) 64 KB of code space, for each program. The C compiler had to be crammed into that. That's why the original C is so dumb.
The price of this was a language with a built in lie - arrays are described as pointers. The language has no idea how big an array is, and there's not even a way to usefully talk about array size in C. This is the fundamental cause of buffer overflows. Millions of programs crash every day because of that problem.
That's how we got into this mess.
As I point out occasionally, the right answer would have been array syntax like
int read(int fd, char[n]& buf, size_t n);
That says buf is an array of length n, passed by reference. There's no array descriptor and no extra overhead, but the language now says what's actually going on. The classic syntax,
int read(int fd, char* buf, size_t n);
is a lie - you're not passing a pointer by value, you're passing an array by reference.
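For comparison, C99 did add a notation that goes part of the way: a parameter can be written as `char buf[static n]`, which tells the reader (and the compiler) that buf points at at least n elements. This is only a sketch of that existing syntax, with a made-up function name. The parameter is still adjusted to a pointer, nothing is checked at run time, and compilers that define __STDC_NO_VLA__ may reject it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper. `buf` is declared as "at least n chars", so the
   size has to come first in the parameter list. The bound documents the
   contract; it is not enforced at run time. */
size_t fill_buffer(size_t n, char buf[static n], char value)
{
    assert(n > 0);              /* the array bound must be greater than zero */
    for (size_t i = 0; i < n; i++)
        buf[i] = value;
    return n;
}
```

Passing a buffer shorter than n is undefined behavior, and some compilers will warn when they can see the mismatch at the call site.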
C++ tries to wallpaper over the problem by hiding it under a layer of templates, but the mold always seeps through the wallpaper when a C pointer is needed to call some API.
PHK's articles are worth reading... always.
Second: there is a /. Sensation Prevention Section where he explains that NUL-terminated strings were the correct choice at the time; it just caused some unforeseen consequences.
Die and rot in nullhell! The verbosity and work-arounds they force...
Table-ized A.I.
David Gay, scarred by Pascal "strings"
PS: I've often wondered the same about that other decried C feature, the preprocessor.
After 25 years of using C, I don't mind the strings being terminated by nulls. If you want to do something else, just don't include string.h.
Terminating with a null is only a convention - the C language itself has no concept of strings. As others point out, it is either an array of bytes or a pointer to bytes.
it isn't forced on to you - you don't have to follow it.
It would have been more urgent to find out where an allocated part of RAM ends.
Or just like Integers and floats, strings could have been their very own basic type. Essentially leave the implementation of it to the compiler, so it can do range checks. Most C-programmers seem to believe that this is done already.
BTW range checks on integers don't cost anything anymore. I've benchmarked some real-life code using large arrays (doing statistics on it) and range checks didn't cause any slowdown. Essentially the compare operation can be done in parallel with the memory read operation... that is, when your language supports that at all.
TFA suggests the decision was to save a byte, but I don't believe that's the main reason it happened.
If you're traversing a string anyway, what happens is that when you load the data into your register (which you'll be doing anyway, for whatever reason you're traversing the string), you get a status flag set "for free" if it's zero, so that's your loop test right there. Branch if zero. If you have to compare an offset to a length on every iteration, then now you're having to store this offset in another register (great, like I have lots of registers to spare on 1970s CPUs?) and compare (i.e. subtract) to the length which is stored in memory (great, a memory access) or another register (oh great, I need to use another register in the 1970s?!) and the code is bigger and slower.
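Here is the same loop written both ways in C, as a rough illustration (function names are mine). What a given compiler and CPU make of it will vary, so treat the comments as the general shape of the argument, not a measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Sentinel style: the byte that was just loaded doubles as the loop test. */
size_t count_spaces_nul(const char *s)
{
    assert(s != NULL);
    size_t spaces = 0;
    char c;
    while ((c = *s++) != '\0')      /* load, test for zero, branch */
        if (c == ' ')
            spaces++;
    return spaces;
}

/* Counted style: an index and a bound are both live for the whole loop,
   and each iteration needs an explicit compare of one against the other. */
size_t count_spaces_len(const char *s, size_t len)
{
    assert(s != NULL);
    size_t spaces = 0;
    for (size_t i = 0; i < len; i++)
        if (s[i] == ' ')
            spaces++;
    return spaces;
}
```

Both return the same answer for the same text. The difference is only in what has to be kept around while the loop runs.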
It's easy to laugh these days about anyone caring about how many clock cycles a loop takes and whether it uses 2 registers or 4 registers, but this stuff used to be pretty important (and more recently than the 1970s). Kids these days: if you weren't there, you just don't know what it was like.
BTW, I have a hunch K & R didn't know they were building such an eternal legacy. It's reasonable to speculate that this is still going to be part of systems a hundred years from now, but in 1970 you would have been a mad man to suggest such a thing. (Not that this invalidates TFA's point at all; I'm just making excuses for K&R I guess.)
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
If employing the backslash as the path separator in DOS was not the worst idea, then I do not know what was...
you fail at slashdot AC.
1) Slashdot is about the discussion of the news for nerds, not just having everything
2) posting a snide remark as AC means you won't get discussion
double fail at slashdot, enjoy.
Including you.
I'm a bit surprised that PHK didn't mention Y2K as an example of a design choice that made sense at the time but required very expensive mitigation.
The problem isn't so much '\0' vs counted strings, it's in-band signalling in general. The telcos found this out in the 1970s with 2600 Hz whistles (and, eventually, fixed it), while the general computing world continues to use it, and in fact is busy inventing new and more complex ways of doing it all the time. String overruns, SQL injection, XSS, and many others are all examples of exploiting in-band signalling. The worst offender of the lot must be XML, which so thoroughly confuses what's data and what's control information that we'll still be trying to sort out the mess for decades to come. If you could remove in-band signalling, you'd also suddenly deal with a significant chunk of the OWASP perpetual top ten.
His reply: "I'd spell creat with an e"
one byte - but a world of errors
Been there, done that, paid for the T-shirt
and didn't get it
If they had gone with the embedded length option we'd be sitting around bitching about how short-sighted it was to use just two bytes for the length. Including how Dennis Ritchie supposedly said "64K strings should be enough for anybody".
Another candidate could be IBM's choice of Bill Gates over Gary Kildall to supply the operating system for its personal computer. The damage from this decision is still accumulating at breakneck speed, with StuxNet and the OOXML perversion of the ISO standardization process being exemplary bookends for how far and wide the damage spreads.
As it happens, I researched this one for years, and I think the root cause of that one is Gates being an aggressive businessman who considered business as war, which helped it win the IBM PC deal, but also led to many of MS's evils over the 1990s.
A general convention (used primarily in the libraries) was to consider a "string" being a set of characters ended by a zero-byte.
By convention in C, the last character in a character array should be a `\0' because most programs that manipulate character arrays expect it.
-- Brian W. Kernighan
Slashdot, fix the reply notifications... You won't get away with it...
I can easily imagine a situation where on an MCU with 64 bytes of ram, the additional counter you need to maintain not NUL terminated strings is an issue (e.g. when sending out data in an interrupt routine...).
I for one welcome that refreshing new way of writing "Frost's pissed."
In Soviet Russia, our new overlords are belong to all your base.
...the more the complaints about bad language features.
The real problem is that many people suck at coding, and worse, many code without a proper and thorough understanding of what they do.
There are no bad tools. There are misused ones.
He's underestimated the space cost. It's not just lose the NUL byte and gain an integer length, just encoding the current length in the type doesn't protect you from buffer overflows. For that you also need a maxstrlen integer.
So the size would be a length int, a maxlen int and the bytes of the string. But you're lucky, you see malloc holds an integer too for the length of the memory block it will free back into the pool, this can be overloaded with the maxstrlen field. So you're back to two bytes of overhead for a malloc'd string.
Constant strings are another matter, they don't have the malloc header so something else would have to be done. Probably the easiest would be to set the malloc length to zero, it's a constant after all so doesn't need to be freed or overwritten. That does mean 4.5 bytes overhead (including alignment) though.
The problem comes with buffers on the stack. These have a fixed chunk of memory allocated and so can't be expanded. But the malloc header we have for the strings explicitly can be expanded. There are two solutions; we could spend two more bytes and add a pointer to the base string structure then malloc the bytes for the string itself independently. A malloc'd string would be malloc'd in two pieces, doubling the malloc overhead, and we'd have to do something about freeing this malloc'd space when the buffer goes out of scope ... I'm stuck, this isn't C anymore.
So the second choice goes right back to the start, the malloc size (maxstrlen) is now a hard limit; no library routine can expand it after it's been created. A string on the stack has a fake malloc header and the current string length (overhead 4.5 bytes). Of course there's the problem as to what to do if the string is too big; there are no exceptions, just truncation. (more, different, bugs)
So that's it: the additional overhead is three bytes, not one, plus the alignment overhead. Two of the bytes (and the alignment) disappear with malloc'd strings but the added complexity stays.
And you don't even get dynamic strings.
But you reply to him knowing he won't read you? (Or, if I let my paranoia run, you're AC and you reply so you have a link to your prior post and can check answers... And so you'd be trolling.)
(\__/) This is Lapinator
(='.'=) copy it in your sig
(")_(") so it can take over the world
What are today's costly mistakes?
This is all just my personal opinion.
PHK actually hints at 2 things: that strings should have been length+array, and that the compiler should know about it.
The first assertion is subject to discussion and has its serious issues (strings would have become foreign to other C arrays, what integer size etc.).
The second point is I think more clear-cut. As it is, the C compiler knows mostly nothing about strings, which means that it's easy to design a different strings library and use it instead of strcpy() et al (cf. c++ strings). The only constraint is that you have to present a zero terminated string to system and foreign libraries interfaces.
Embedding the string structure in the compiler would have ossified the choice, making C a much less adaptable language, in contrast with its other features, and a fault of style.
The article is complete BS. Anyone who has programmed strings in Pascal knows what a PITA it is programming strings with a 255 character limit. I despised this while working in that hell of a language.
I don't believe the decision was based on memory concerns.
C was originally designed to be a portable assembler.
Most micro processors clear the carry flag and set the zero flag if a 0x00 byte is loaded into a register (or a 0x0000 word etc.) or moved.
That means loops like this:
are basically a 1:1 transformation from assembly into C.
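The snippet itself is not shown here, so as a guess at the kind of loop meant, this is the classic copy idiom, wrapped in a function whose name is mine. The assignment yields the byte just stored, and the loop ends when that byte is zero, which is exactly the load / store / branch-if-zero sequence described above.

```c
#include <assert.h>
#include <stddef.h>

/* Copy a NUL-terminated string, terminator included.
   The caller must make sure dst has room for it; nothing is checked. */
char *copy_cstr(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    char *start = dst;
    while ((*dst++ = *src++) != '\0')
        ;                       /* all the work happens in the loop condition */
    return start;
}
```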
Also keep in mind: having every string starting with a length indicator would make typical unix file handling and piping between utilities a little bit more complicated. ... should every line now start with 2 bytes length indicator? In which endian format? Or should they stay plain text but while reading lines the "readline()" call is counting .... (to which line terminator ?? \n??) and updating the size?
Text files
Bottom line: using "0x00" as string terminator was pretty elegant. After all it allowed performant algorithms on strings and kept the library more simple. That reminds me: how many "structs" are defined in the old standard C library?
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
There is nothing wrong with null terminated str^&%^&GShgayuat65a6 7gxhsvxhshxsgyuy6d5656565^&%&^&ZCVZCVZCVBAVCAF FAGAAAYSTWafgsgfsgd6565^%^^
http://michaelsmith.id.au
In the section on performance costs, the article states that a multi-byte string copy risks crossing a VM page boundary - potentially causing a "page not present" fault - if the NUL character was the last byte of the page.
This is simply incorrect.
No memory transfer within an aligned multi-byte string copy could ever straddle a page boundary, even if the NUL character was the last byte of the page. And performance considerations should preclude a non-aligned copy, even assuming the hardware could handle non-aligned memory accesses.
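The arithmetic behind that claim can be checked directly. A small sketch, with function names of my own choosing, assuming the page size and the access width are both powers of two and the width is no larger than a page: an access that starts on a multiple of its own width always ends inside the page it started in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if x is a nonzero power of two. */
bool is_pow2(size_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* True if the access covering [addr, addr + width) touches more than
   one page. Compares the page number of the first and last byte. */
bool straddles_page(uintptr_t addr, size_t width, size_t page_size)
{
    assert(is_pow2(page_size) && width > 0);
    return (addr / page_size) != ((addr + width - 1) / page_size);
}
```

With 4096-byte pages, an aligned 8-byte read at address 4088 ends at 4095 and stays on the page, while an unaligned 8-byte read at 4092 spills into the next one. An aligned word read may still pick up bytes after the NUL, but only bytes on the same page.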
'The choice was really simple: Should the C language represent strings as an address + length tuple or just as the address with a magic character (NUL) marking the end?
Why?
It's a false dilemma. You need arbitrary length null terminated strings for streams, and if anyone gave a damn in the last 30 years someone would have grafted an address/length extension on top of C's current stuff.
The mistake was simple binary thinking, both at design time and in the current article.
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
Please ignore the null terminated string passed to printf(), but the point is clear: you can define a character array that contains a length followed by character values, and that does not have a terminating null added by the compiler. It can also contain embedded null characters if you like.
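#include <stdio.h>

/* Length byte (7) followed by 7 characters. The array is exactly 8 bytes,
   so the compiler appends no terminating NUL (valid in C, not in C++). */
char mystring[8] = "\x07One\0Two";

int main(void)
{
    /* sizeof yields a size_t, so it is printed with %zu. */
    printf("Mystring is %zu bytes, and %d characters long\n",
           sizeof(mystring), mystring[0]);
    return 0;
}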
I think my point stands. C does not force you to use null terminated strings (but I do agree that it is easy to).
They could have offered both solutions: a high level slow API that used structs with lengths and a low level faster API with null terminated strings. The high level API would be used for string manipulation, and the low level API for hacking strings. The string array in the struct could contain the NUL terminated character.
For anyone who has done assembly-language programming, C looks less like a "language" and more like an assembler with a whole bunch of pre-defined macros.
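A minimal sketch of that hybrid, with made-up names (struct hl_string and the hl_ functions are not from any real library): the struct carries the length, and the data is still kept NUL-terminated so it can be handed to any existing char* interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical "high level" string: counted, but data stays NUL-terminated. */
struct hl_string {
    size_t len;     /* number of characters, not counting the NUL */
    char  *data;    /* heap buffer of len + 1 bytes */
};

/* Initialise from a C string. Returns 0 on success, -1 if out of memory. */
int hl_from_cstr(struct hl_string *s, const char *src)
{
    assert(s != NULL && src != NULL);
    size_t len = strlen(src);
    char *p = malloc(len + 1);
    if (p == NULL)
        return -1;
    memcpy(p, src, len + 1);            /* copy the terminator too */
    s->len = len;
    s->data = p;
    return 0;
}

/* Append a C string. The known length means no rescan of the old contents. */
int hl_append(struct hl_string *s, const char *suffix)
{
    assert(s != NULL && s->data != NULL && suffix != NULL);
    size_t add = strlen(suffix);
    char *p = realloc(s->data, s->len + add + 1);
    if (p == NULL)
        return -1;                      /* original string is left intact */
    memcpy(p + s->len, suffix, add + 1);
    s->data = p;
    s->len += add;
    return 0;
}

/* Release the buffer and reset the struct. */
void hl_free(struct hl_string *s)
{
    assert(s != NULL);
    free(s->data);
    s->data = NULL;
    s->len = 0;
}
```

After hl_from_cstr(&s, "foo") and hl_append(&s, "bar"), s.len is 6 and s.data can go straight to puts() or open(). The cost is one size_t per string plus the allocation.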
Just like MS-DOS isn't an operating system but merely a collection of subroutines.
"...their PDP computer had limited core memory."
Unlike modern computers, which have unlimited memory.
The "limited memory" apologia doesn't fly any better than the choice of two-digit year fields that led to the Y2K problem. At the time, DEC saw the advantage of string descriptors and made them available on exactly the same PDP11 computers (sometimes as an option, as suggested by other comments in this thread).
I am sure that there are many other solipsists out there.
C strings are an instrument for the devil, just saying.
Did Ken, Dennis, and Brian choose wrong with NUL-terminated text strings?
The author of TFA does not like that choice for the C system programming language.
But tries to demonstrate that it is an objectively "wrong" choice with weak and plain false arguments.
The hardware development cost argument is weak. The fastest CPUs around have a very rich instruction set. It may be hacky and ugly, but where is the evidence of a noticeable burden on CPU performance or cost due to the additional instructions to handle 0 in the input? And they are pretty handy instructions anyway.
The compiler argument is uninformed about how compilers work, and are permitted to work by the C standard. There is simply nothing there.
The gets(3) argument does not have anything to do with the NUL-terminated strings.
It has to do with the fact that gets is for most uses a broken API that should seldom if ever be used.
The FreeBSD libc bcopy/memcpy argument is plain false.
If the program is correct C, there is no "unwarranted page not present" fault.
One can make mistakes by mixing memcpy and C strings if one is not careful, but it's exactly the same with any attempt to read past the end of an array.
If the confused author is trying to copy C Strings around, he can use the C String functions.
TFA speaks more about the ignorance of the author than the value of the NUL-terminated string choice for the C programming language, which can still be debated, but not on these grounds.
I see now that perpenso pointed out the same solution a few replies above this one. Disregard.
It might be of note that JOE (an old standby editor shipped with Linux and BSD distributions) used both systems at once. When JOE allocated a new string, it allocated room for the string plus an additional int prepending it. It would then return a pointer to the data past the int, the beginning of the character data. You could use that pointer just like any other char * with usual C functions, but if you wanted to find the length of the string in constant time (as opposed to linear time, which strlen operates in), you just needed to do ((int *)p)[-1]. It worked very well, though obviously it limits what sort of pointer arithmetic you can do with strings since you wouldn't want to ever end up with a pointer to the middle of a string and assume the length is still preceding it.
--we have to make choices between options even when we have been unable to tell the difference between them.
--even though we can't tell the difference and so can't anticipate the consequences in advance where the choice takes us DOES, often, make all the difference.
--We often rationalize the choice and convince ourselves that there IS a basis for the choice aside from an effective flip of the coin.
--As to the sigh -- one wonders - What might have been?
====
Robert Frost does not apply here. The immediate difference was the saving of one Byte - or not. That difference was real - not imaginary. And, it mattered to them.
One can never anticipate all of the consequences of any choice. The time it takes me to stop and say hello will affect who I run into around the corner. A future spouse - or a bullet.
I find his hypothesis a little weak:
I spent the 80's doing assembly language programming and I used both NUL terminated strings and length based strings. It just depended on the situation. Sure I could do a quick test of the accumulator using BEQ which looks at the zero flag and if set exit the loop, or I could load a counter register with the length and do a decrement during each iteration and BEQ test on the counter instead. Pardon my foggy memory since I haven't used assembly language exclusively in twenty eight years (1983!).
This is a very speculative paper that is trying to place the blame for poor programming practices on a programming language that gives the programmer plenty of rope to hang themselves with.
These comments are my own and do not necessarily reflect the views or opinions of my employer or colleagues...
The problem with C strings is the same problem everyone has with C and assembler. It requires you to be absolutely competent. If you're not, it does nothing to catch your mistakes. Blaming the current problems on other people's "poor choices" is just rubbish.
The vulnerabilities to specially crafted attacks aren't mistakes. They were design choices that were correct given the knowledge the designers had at the time. Times have changed and nobody wants to pay to redo the code. I can just as easily craft a stack overflow using length type strings.
The author is short sighted or is deliberately making up something controversial to gain attention. In either case slashdot will you please ignore flag him?
- I've got bad karma because I won't parrot everyone else's opinion
Poul-Henning Kamp, is a non computer engineer.
'C' was primarily written to write the UNIX operating system. The fact that it was great for so many other things was a plus. If you're writing an OS, you want the control of a near-assembly language like 'C'. For the rest of us, in the 21st century, there are great scripting languages where strings are managed for us.
JAPH.
Brilliant, five internets for you!
To have a right to do a thing is not at all the same as to be right in doing it
Furthermore, if they chose a one-byte length, as the article so casually suggests as the correct solution (like Pascal), it would have had the insane limit of 255-byte strings, with no compatible way to have a string any longer.
A pretty significant correction to your post. The article says "Using an address + length format would cost one more byte of overhead than an address + magic_marker format". If it costs one byte more, and the magic marker is no longer used, then that means there are TWO bytes available for the length, which would allow strings of up to 65535 characters.
Then one could reserve the length value of 0xffff to indicate a 32 bit length value, allowing strings of 2^32 length.
If it was a single-byte length then it would have exactly the same storage requirements as the NUL-terminated method. So the design choice was A) limiting strings to 255 bytes, B) using a NUL-terminated string, or C) using an extra byte for strings of up to 65535 characters. The article says it was a choice between B and C, and they chose B. Option A, which is what you refute, wasn't even an option at all, which is why it wasn't discussed.
Better known as 318230.
It's nothing more than convention to utilize a nul terminated character stream as a string, but that's not actually part of the language.
Further, there's nothing prohibiting a person from implementing their own, higher-level string notion that *does* utilize the address-length paradigm for representing strings. Meanwhile, if they had direct support for that in the language, they would have either had to drop their current notion of character pointers completely, or else added what is fundamentally an entirely new type to the language. The former solution would have been undesirable because it sacrifices generality. Further, neither solution would really fit the C paradigm of the data types corresponding to the most widely used native machine language types.
File under 'M' for 'Manic ranting'
The best solution for strings I have seen is the implementation of "Long String" for Delphi. It has the best of both worlds: it has reference counting with copy-on-write, it has a 32-bit length, and it is also null terminated. Since it also has a null terminator, it is very easy to communicate with C code and the Windows API: all you do is pass the pointer directly.
You know, kids, there was such a thing as a computer, and programming languages, prior to the advent of C and UNIX.
The NUL-terminated string convention was a very common one. All of DEC's operating systems used it, including on the PDP-10.
It was considered to be a big improvement over FORTRAN Hollerith strings, where the programmer had to count the characters in the string, e.g.,
7FOO BAR
was the FORTRAN way of saying
"FOO BAR"
This article is way off base.
C was primarily a portable assembler. Complaining about string handling in C makes almost as much sense as complaining about string handling in assembler. C provided a convenient way to port system-level code (e.g., compilers, operating systems, etc.). As such, it was not intended that C would be used by today's legions of hacks. Instead of blaming K&R, why doesn't the author blame the universities for teaching C as if it were a high-level language?
If there is blame to be laid for the null-terminated string, it's not with the language but with the string handling libraries. If the string handling libraries had defined a string as a struct that carried a length, then few people would have used the null-terminated construct.
I'll contribute a gold star as well.
whois gawk date unzip strip find touch finger mount join nice man top fsck grep eject more yes exit umount sleep dump
-1 :: Off-topic Poetry Nazi
Join the Slashcott! Feb 10 thru Feb 17!
He didn't pick the road less traveled. As you note, he earlier said both appeared to be equally worn. Picking the road less traveled is a rationalization for the decision made years later, looking back on the event.
It is funny, the most famous line is basically a lie.
So with the literalist view you may now take that road that you think is less traveled in order to be nonconformist, but in your later years with the ironic view you'll look back on that decision and maybe rationalize your choice in an entirely different way.
A C string can be as long as you want and never requires more than length+1 bytes of storage. A length+data scheme would need unlimited lengths to be as flexible with preferably a 1-byte overhead for short strings at least. I guess you could do something UTF-8-ish and add extra length bytes as needed for longer strings, but then you'd need a formula to figure out how much storage the string would require. Sounds a little messy to me.
This is already +5 Hilarious, but I wish I had mod points to give it anyway. LMAO
Bab72 (Not my real name)
I actually think that computer science would benefit from more sage retrospectives on the path not taken, where one does not necessarily end the analysis with the smoking gun.
Buffer allocations in the C language family tend to be static. No matter what you do in the privacy of your own buffer, the boundaries (front and back) are firm: whether poaching from your neighbour's apple tree, making a spectacle of indecent display, or committing an access fault triggering a core dump of yellow police tape and chalk outlines.
Traditionally it takes more creativity to get arrested in the front yard. In the back yard, most programmers have no standard of conduct whatsoever.
wildcat (just_the_ammunition);
safecat (back_fence, the_usual_suspects,
...);
It's true, you might know that the unfenced wildcat() is OK due to some prudent arithmetic three loops up, as etched into stone tablets by a guru of right thinking (at the optimal dosage point between second and third coffee) embodied in an immutable marble monument of nil maintenance.
Usually in an API the front fence is de facto whatever src position is supplied as the current working position; where the algorithm swings both ways, both ought to be passed explicitly, in addition to the working position. Helper functions in a tightly-crafted C runtime library might sanely presume that this condition holds. A chosen few among us are well suited to stone work where efficiency matters.
The major fault with the C language was failing to provide an "Ordinary People" set of buffer management routines (anything that clobbers memory) where the back fence comes first in every function signature.
For variable-length strings that are not NULL-terminated in the string itself, do they use NULL-termination for the byte-length? If not, does this also introduce buffer overflow vulnerabilities? Since this is something that would be set and modified internally, I don't think it would be subject to the same vulnerabilities unless you could trick the code into thinking a string was a different length when it tries to set its length.
"Using an address + length format would cost one more byte of overhead than an address + magic_marker format" Why? If the string length = magic_marker length, they would be the same, no? Depending on the charset, it could be the same or even better to use the address + length format...
Have a length field that's expandable without limit: 1 byte for lengths 0 - 254; if the length is > 254, the first length byte contains 255 as a flag and the length is in the following 2 bytes; if the length is > 65024, these bytes contain 65025 as a flag and the length is in the following 4 bytes - etc. in aeternum. The length field is always minimal compared to the length of the data. (For lengths < 255 it takes no more space than a null terminator solution.)
Mundus Vult Decipi
With sized strings you need to know the length of the complete string before you begin streaming. So you'd stream the size first, followed by the content of the string. Not good if your string could be very long and memory is expensive.
But with null terminated strings you can keep on appending almost ad-infinitum, using whatever business logic you like, until you finally end it with a null.
The summary misrepresents the value of using the null terminator.
With a one-byte length, strings are limited to 255 characters. Is that good? Would you never want to have a longer string? If you want to have a 2-byte counter, do you now have to create a whole new long-string data type and overload every library function? It doesn't seem scalable.
On the other hand, the null terminator is a single byte no matter what and can support strings of arbitrary length.
Of course, there are disadvantages. For instance, computing length and concatenating strings take longer.
But don't act like using a byte-sized length field is fundamentally superior.
The problem is not with the C language. NULL terminated strings are just fine for printing status messages and suchlike, which is all they were intended for. The problem is using C to write text-bashing programs. In C, you have to spend a lot of time and effort checking string lengths, allocating and deallocating buffers, worrying about character sets and funny characters ("magic cookies", anyone?), dealing with byte order, and all sorts of other cruft that should be handled by the compiler.
IMHO, the first really useful language that was designed for text bashing was PERL, or perhaps Microsoft BASIC (I've used SED and AWK. Bleagh. I've not used SNOBOL so I can't say anything about it.)
Welcome to the Turing Tarpit, where everything is possible but nothing interesting is easy.
Where oh where are my mod points. I agree that Robert Frost was mocking hipsters, and you've summed it up quite well.
http://poetrypages.lemon8.nl/life/roadnottaken/roadnottaken.htm Robert Frost on his own poetry: "One stanza of 'The Road Not Taken' was written while I was sitting on a sofa in the middle of England: Was found three or four years later, and I couldn't bear not to finish it. I wasn't thinking about myself there, but about a friend who had gone off to war, a person who, whichever road he went, would be sorry he didn't go the other. He was hard on himself that way."
One performance advantage of NULL-terminated strings is you can trivially maintain two independent representations of the same string, one of which has a static prefix.
char *str2 = str1 + prefix_length;
In Java, strings are represented as an object that has a char[], a start index in the array, and a length. This representation gives you all of the advantages of the string-with-length-baked-in design and the representation sharing that you describe.
Any non-reallocating modification of one string instantly affects the other.
Which is a form of aliasing, one of the major sources of software bugs. Don't do that!
Are you adequate?
The narrator as "vain, shallow individual" is entirely a character pulled out of your hindquarters, as there is nothing in the text of the poem to lead to that conclusion.
Ahem.
The ironic interpretation, widely held by critics,[2][3] is that the poem is instead about making personal choices and rationalizing our decisions, whether with pride or with regret.
Source: http://en.wikipedia.org/wiki/The_Road_Not_Taken_(poem)
I'm tempted to bookmark this response as a great example of why engineers should not fear breadth requirements. (I'm assuming anyone with such a low Slashdot ID works in engineering...)
The ironic interpretation is widely held because it's supported not only by the text, but also Frost's own statements, and the broader context of his work--in which seemingly simple descriptive verse hides darker, more complex themes. (A major reason why he is held in such high regard.) This particular poem is a common subject for lessons on critical analysis of literature. The key starting point is that first-person narrators are not necessarily reliable.
Build a man a fire, he's warm for one night. Set him on fire, and he's warm for the rest of his life.
Don't you remember Y2K?
Really, what a dumbass article--surprising considering the source... Pascal had the 1-byte length at the beginning, and the 255-byte limit caused ***FAR*** more problems than the supposed issues with null termination. Hell, the old Mac & Win APIs supported formatted string-like things with 2-byte length, and that limit on string length caused plenty of issues.
So where is it cast in stone that an author cannot write a lib that uses a count + pointer for strings?
There is no reason that the string-managing data engine in an application cannot do it right (better/differently) and then hand known safe strings to those functions not yet rewritten.
It would be a bit of work, but hey, if it is important ....
It is not necessary to start with exec() and args.
It is not necessary to attack text files, but like end-of-line converters it would be a modest task to convert .txt to .ett (Enhanced TexT) or some such thing....
Truth is stranger than fiction, but it is because Fiction is obliged to stick to possibilities; Truth isn't. Mark Twain.
On a related note, PHK points out that Ben Franklin totally screwed the pooch by defining current flow from plus to minus.
1. Using the NULL character allows for strings more than 255 bytes long;
2. Using the NULL character makes it quicker to append strings (strcat) - no need to update the length byte(s);
3. Using the NULL character saves more than 1 byte when you change architectures, and you don't have to worry about byte-padding when calculating storage;
4. You don't have different types of strings with different maximum lengths (255 bytes, 64k bytes, etc.) and code to deal with interfacing between the types.
IOW, the article's claim is wrong.