I agree it sounds a bit like grasping to start bringing up environmental concerns here. But really, why not bring up the environment? Even if there is only a small environmental impact, why should there be any impact at all? It all comes down to the fundamental principle of "if you don't need to do something, don't do it," or even more basically, the principle of simplicity. "Always-on" DRM is an unnecessary waste of power, bandwidth and server load. (Especially since I'm pretty sure this isn't just a heartbeat ping; the Battle.net server will actually be running the game state.)
Casual LAN copyright infringement is still copyright infringement. I've done it, too, but I'm not going to try and claim it as some kind of right.
I'm anti-piracy. I buy games. I don't receive pirated games and I don't give out pirated games. But in a LAN situation, is it really "piracy" if we all play off one copy? 1990-2000 says it isn't. In fact, Warcraft II (1995) and StarCraft (1998) say "go ahead, play a LAN with one copy, but if you want to go off and play on your own, you have to buy it." There is an explicit feature for this, called "spawn", in which Blizzard encourages players to share around a limited version that can only join multiplayer games, for this very purpose. If I go to a LAN and eight people are going to play a game one time, am I really expected to purchase a full copy of that game just for a couple of hours (given that my friend already purchased a copy?)
To put it another way, if eight of us go to a friend's house to watch a movie that my friend legally paid for, are we expected to each purchase the DVD just so we can all sit down and watch the movie one time? "Of course not!" any sane person would say. It is a basic right that when I purchase a DVD, I can watch it with my friends or family without them each having to purchase a separate copy. So why is the situation different when it comes to PC games? Moreover, why are we not outraged that this "right" has been taken away from us, and instead we have people arguing on forums that LAN piracy is bad?
The main argument against LAN piracy is "sure, it's OK for everyone to play at a LAN, but the problem is that once everyone has pirated the game, they won't delete it -- they'll go home and play it on their own. If they want to do that, they should pay for it." I absolutely agree. That's why "spawn" was such a genius invention -- if the option is available, I don't see why honest players would choose piracy over a legitimate "spawn" (certainly much nicer than having to download cracks and viruses). I think "spawn" is the ideal compromise between game companies and customers. But a compromise is not what they're after, which is why Blizzard is glad everybody has forgotten all about spawn, and public opinion has now come so far along that people argue "of course they had to ban LAN, because there was too much piracy!"
I think we are that stupid. I don't understand it, but most opinion I read online seems to be "Hiss! So much for 'don't be evil', Google are creepy sinister overlords. Good thing those hip folks from Apple are giving them a good kicking."
Something is really fucked up, because every story I read re-affirms my gut feeling that Google is still basically doing a lot of good for the world, in particular for the tech industry (sure, they are so huge now that they slip up every once in a while), while Apple (who I see as Google's biggest competitor) is expanding without even a promise of being "good", taking away customer control, enforcing lock-in and suing everybody they can. Yet somehow, they seem to get all the good press. At the end of the day, we must be just as stupid as birds: "SHINY!!! WANT."
It's one thing to use the money you make on one product to fund the development of another. It's quite different to abuse a monopoly position of one product to get into the market of another. (Specifically, the former is legal, the latter is not.)
Your Android vs IE analogy is therefore flawed.
If Microsoft made a shitload of money selling Windows, and then decided to use that money to build a web browser and make it available as a download on their website for anybody to install if they wanted to, that would be fine. Just as Google made a shitload of money selling search (ads), and then decided to use the money to build an operating system and made it available as a download on their site.
However, what Microsoft did was they built a web browser, and integrated it into the operating system. They bundled it into the OS so that everybody used it by default, and they integrated it so it couldn't easily be removed. That was anti-competitive, because they used the massive monopoly of Windows to gain a monopoly in the browser market. The equivalent would have been if Google had somehow installed Android onto the phone of anybody who used Google Search. Ridiculous... but the point is that Google in no way used their search market share to establish Android.
The fact that they used the money they made from search to build Android is perfectly valid. Companies very often make a loss on some products and subsidise it with profits from other products.
Surely Commander Keen is old enough that they don't need to protect the assets any more. Just give the entire game away for free.
Messy source code? Love it. Doesn't compile? Doesn't matter. Don't touch a thing. Just release everything onto the Internet. It will be like an archaeological find. I would love to have the original Keen code, even if it doesn't compile, just to see what it's like. Someone will get it working and ported to Linux within 24 hours.
It's beside the point, but the "security bug in Blowfish" is nothing of the kind. It is actually a security bug in a specific implementation of Blowfish, namely crypt_blowfish.
Okay, when I said "Blowfish" I should have said "crypt_blowfish". Aside from that, did I say anything incorrect?
I originally said there could (for example) be problems with a length field because some people would use a signed value and other people would use an unsigned value. The GP said he didn't think people would fail to test values greater than 128. I was pointing out that it happens all the time that people don't test edge cases, and crypt_blowfish is a perfect example.
OK, firstly, I design programming languages and I agree with you in principle. My programming language allows NUL as a character in strings, and so should all modern languages. It is a valid character, as you say. It has a Unicode code point, U+0000.
However, in the context of C, there are historical reasons and technical reasons (simplicity is a priority) why I defended the NUL. Note that I didn't say it was the correct decision, just that it isn't exactly clear-cut that it was the wrong decision. When I say "text should not contain a 0 byte" I don't mean "programming languages should not accommodate text that does contain a 0 byte"; I mean it is never necessary for a non-binary text string to contain a 0 byte, so it is acceptable, I would argue, for any program to drop such characters. Therefore, it doesn't matter too much in practice that C doesn't allow this character in its strings.
Would you defend a file system that did not accurately store the length of files, that instead used an end of file marker? Before you respond too quickly, note that history is littered with those file systems.
Absolutely not, because that is a file system and files are binary things. They are a sequence of octets, with no octet value being any more special than any other. Those file systems which litter history should be dead. That is different, however, to a text string data type that doesn't really need to store those characters. I didn't say it was ideal, but it's acceptable. (Clearly it's acceptable, because we still use C today.)
Not quite true. You can't have binary strings -- true. So I think of text strings and binary "strings" as being two completely different things (as they should be -- any modern language like Python 3 or Java does distinguish between them). A text string is what a char* and the string.h library are for in C. For a binary "string", you should not use a NUL-terminated char*; you should keep the length yourself and use binary manipulation functions like memcpy.
However, what you say about ASCII isn't true. Assuming you aren't going to be using the code point 0, many character encodings work fine with NUL-terminated strings: ASCII, Latin-1 and most importantly, UTF-8. This means that you can represent any Unicode string without a 0 byte (as long as your string doesn't include the NUL character). If you are using UTF-16, then you'll have a wchar_t* instead of a char*, and your terminator won't be the byte 0, it will be the wchar_t 0, so UTF-16 works fine as well.
They are simpler IFF you never use the null value. Do you have any files on your system which have NUL bytes in them? Hint: yes.
Yes -- this is a good reason not to use NUL-terminated strings (which, once again, TFA missed). Remember: I never said NUL terminated strings were good, just that the article missed the point by blaming NUL strings for a different, unrelated problem, and not actually picking up on any of the problems with NUL strings.
If you need a 0 byte in your strings, then this won't work. However, to be technically correct, strings should contain text, and text should not contain a 0-byte. What about binary strings? Those should absolutely not be stored as NUL-terminated. Remember, nothing in C forces you to use NUL-terminated strings -- it just means you should not use the string.h functions on binary strings. Instead, you MUST separately keep the length around, as you do for an array of ints. Think of a binary string as an "array of chars" and not a NUL-terminated string, and there *shouldn't* be any trouble. (Yet as I pointed out with the MS certificate bug, there can still be trouble.)
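To make the text/binary distinction concrete, here is a minimal sketch in C. The helper bytes_equal and the two sample buffers are made up for illustration. It only shows that the string.h functions stop at the first 0 byte, while a comparison that carries the length separately does not.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: compare two binary buffers using explicit lengths,
 * the same way you would handle an array of ints. */
int bytes_equal(const unsigned char *a, size_t alen,
                const unsigned char *b, size_t blen)
{
    assert(a != NULL && b != NULL);
    return alen == blen && memcmp(a, b, alen) == 0;
}

/* Sample binary data with a 0 byte in the middle (illustrative only). */
const unsigned char blob_a[] = { 'a', 'b', 0, 'c', 'd' };
const unsigned char blob_b[] = { 'a', 'b', 0, 'x', 'y' };
```

Treated as C strings, both buffers look like "ab": strlen reports 2 and strcmp reports them equal. Passing sizeof blob_a and sizeof blob_b to bytes_equal compares all five bytes and finds the difference.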
string concat (string a, string b, string c) {
    string ret = strnew( strlen(a)+strlen(b)+strlen(c) );
    strfill(ret, 0, a);
    strfill(ret, strlen(a), b);
    strfill(ret, strlen(a)+strlen(b), c);
    return ret;
}
What's so hard about that?
Nothing was hard about it. It's just that you had to invent two new library functions (strnew and strfill) which are much higher-level than other C library functions (with the possible exception of strdup, which combines allocation and copying). You are now saying to your C users (in the hypothetical "C with length-delimited strings" language) "you must never manually manipulate your own strings -- only ever use these library functions." That is antithetical to the way C works. C programmers want absolute control over the representation of everything. If you want a higher-level language, use a higher-level language.
Then that would be incompatible, yes (and as I said in the original post, there was historically such an issue, as B chose 4 as the string terminator!) But that is just one potentially-incompatible design decision, versus the four I listed in my post for length-delimited strings. Other issues for length-delimitation: do you put the length in a struct with the pointer, or in the buffer with the data? Do you make it variable-width or fixed-width? If it's variable-width, do you use 0 or 1 as an extension bit? Do you limit the length to a maximum of 32 or 64 bits, or allow arbitrarily long length fields? If you limit it, what is the limit? If you don't limit it, how do implementations cope when the length is too long to fit in their standard 'size_t' type?
Again, I'm sure all of the above questions have sensible answers, but my original point stands: it is *not* *straight* *forward* and undoubtedly there would be at least as much confusion and bugs with a length-delimited string as there would be with a NUL-terminated string.
Well C++ includes a class that is pretty much exactly what you ask for. It wouldn't make sense for C to include that, as the whole point is that C gives you the ability to manipulate data however you want. If C included that, it would be criticised for having two incompatible string types. If it only included that, it would be criticised for not being low-level enough (the programmer is forced to call all these inefficient string manipulation functions that do bounds checking).
You might ask why C doesn't include closures and list comprehensions: if you want high-level language features, then C isn't the language for you.
I think the bigger point that's missed is that if a size field were used, you'd still have the same buffer overflow problem if someone simply specified a size that didn't match the allocated memory, same as strncpy will happily try to keep writing to a buffer if you give it bad size information.
Exactly. The real problem* is that C lets programmers fabricate data however they want.
*I say "problem" but it really is the whole point of C. It is a dangerous and powerful tool. To make it less dangerous would make it less powerful, and if you wanted such a language, there are plenty available.
Is it, or are you just used to dealing with NUL-terminated strings?
Nope, they are simpler. Re-read all of the questions I asked regarding design decisions that could be made around address+length formatted strings and tell me that they are just as simple. Now I think higher-level languages should be using lengths, because their libraries abstract the details (e.g., C++ or Java). But in a language where programmers fabricate their own strings, simplicity is best.
That's what libraries are for:-)
Well, let's assume a hypothetical universe in which C is still exactly the same C, only with length-delimited strings (still the same level of safety, still malloc and free, still pretty much the same library, only the string functions are implemented differently, etc). Could you write a library that abstracts over the string representation without ever requiring the user to manually read or write the string? I think if you did that (and certainly, C++ does that), you would have a much higher-level library. That isn't what C is good for. C is for when you need low-level access to the underlying representation.
The beauty of using C (and there aren't many) is that you can write your own efficient string manipulation code. For example, if you know you are going to concatenate three strings, you can allocate enough space for all three, then manually copy the bytes over and seal it with a NUL. In C++, you would probably have a stringstream and push each of the strings onto the end, but it would mean the library is internally adjusting lengths and so on -- the programmer can't make the code do exactly what he asks; there is a layer of abstraction. So you could change C's string representation and then provide a high-level API for manipulating it, but someone is going to get pissed off that the library doesn't do exactly what he wants, and dive down and do it himself. It would be very un-C-like to provide that API.
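For what it's worth, the three-string case described above looks roughly like this in today's C. This is a sketch, not a hardened routine: the name concat3 is made up, and it assumes the three lengths don't overflow size_t when added.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Concatenate three NUL-terminated strings into one freshly allocated
 * buffer. Returns NULL if allocation fails; the caller frees the result. */
char *concat3(const char *a, const char *b, const char *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    size_t la = strlen(a), lb = strlen(b), lc = strlen(c);
    char *ret = malloc(la + lb + lc + 1);   /* +1 for the terminator */
    if (ret == NULL)
        return NULL;

    memcpy(ret, a, la);                     /* copy the bytes by hand */
    memcpy(ret + la, b, lb);
    memcpy(ret + la + lb, c, lc);
    ret[la + lb + lc] = '\0';               /* seal it with a NUL */
    return ret;
}
```

One allocation, three copies, one terminator store, and nothing happens that the programmer didn't write.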
To put it another way, if you were going to provide a high-level string API for C and tell programmers "never ever manipulate strings on your own; use this library," then you might as well use NUL-terminated strings anyway, since the library will handle it, and programmers will never make a mistake. But again, that would be very un-C-like.
So once again, it comes down to this: NUL-terminated strings aren't the problem with C. C is the problem with C: the fact that it gives programmers a lot of power. You might argue that we should stop using C to write programs that don't need that speed or power. But there's no point arguing that C should have been a higher-level language, because then it wouldn't be C.
Note that my post was not necessarily saying that NUL was the right decision. Just that it isn't a no-brainer -- going the other route has a lot of complications.
What is so undesirable about making a string larger than a pointer?
It would mean that the C library would need to declare a "string" struct instead of using char*. Now rather than passing a char* as an argument, you would have to decide whether it's worth passing the two-word "string" struct by value, or a string* pointer (allowing it to fit into a register). It makes things more complicated.
Also, have a look at how mysql deals with varchars. There is no 255 byte limit - when length exceeds that value, you just go to 2 bytes of length, etc. Your arguments about what type of integer to use conveniently ignore conventions like network order. In short, it is not too hard to solve.
No, it isn't too hard to solve. But it is non-trivial. Dealing with NUL is significantly simpler than dealing with length fields, and there are significantly fewer sources for confusion. Remember that in C, programmers fabricate their own strings (there is a minimal string library, but often you will see people just allocating memory for strings, populating them, and storing a '\0' on the end). If you wanted the standard to use a variable-length length as you suggest, you would need to make sure that all the programmers correctly store and parse variable-length strings. Of course they could get it right, but there are lots of ways they could get it wrong. The same applies to NUL.
Here's a question: How much memory do you allocate for a string of N bytes? The NUL-termination answer: N + 1. The answer for your mysql variable-length length scheme: N + (N < 128 ? 1 : N < 16384 ? 2 : N < 2097152 ? 3 :.....) -- yes there is a correct answer, but it is much more complicated for the everyday programmer to deal with.
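Spelled out in C, the two answers look like this. It is a sketch of the 7-bits-per-length-byte scheme that the thresholds above imply. That scheme is my assumption for the sake of argument and not necessarily how MySQL actually lays out a VARCHAR, and the function names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for a variable-width length field that carries 7 bits of
 * length per byte (the eighth bit being the "more bytes follow" flag). */
size_t length_prefix_bytes(size_t n)
{
    size_t count = 1;
    while (n >= 128) {      /* does not fit in the 7 bits of this byte */
        n >>= 7;
        count++;
    }
    return count;
}

/* Allocation size for N bytes of text, NUL-terminated. */
size_t alloc_size_nul(size_t n)
{
    assert(n <= SIZE_MAX - 1);
    return n + 1;
}

/* Allocation size for N bytes of text with the length stored in-buffer. */
size_t alloc_size_prefixed(size_t n)
{
    size_t hdr = length_prefix_bytes(n);
    assert(n <= SIZE_MAX - hdr);
    return n + hdr;
}
```

Both are computable, but only one of them is something you can get right without a helper function.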
Do you really think the state of programming was so bad back then that people wouldn't test 129 byte strings?
I think the state of programming is so bad now that people wouldn't test it. A major security bug in Blowfish was found just last month, caused precisely by a signed/unsigned char mismatch.
Where did you stop reading?
The only security issues mentioned were buffer overruns, with gets taking most of the blame. As I said above, only some NUL errors are buffer overruns and only some buffer overruns are NUL errors, and gets errors are not anything to do with NUL.
As a nitpicky pedantic note though, if C had gone with length+string format, then other languages would have been written around the C standard, since most of them were written around the C standards to begin with to increase interoperability in the first place.
Yes, but perhaps the simplicity was partly why it caught on. The reason I raised all of the "what about..." questions was to illustrate just how many small variations in an address+length standard there could have been. Even if C had made a decision on all of those, how many implementations would have gotten it wrong?
Not just implementations, but individual programs. Assume a hypothetical universe in which C doesn't use NUL-terminated strings, but is still a low-level, unsafe language in general: how would this have been any different? Unlike C++ or Java, in C, programs manually construct strings. So we wouldn't have people forgetting to NUL-terminate strings. We would instead have people forgetting to set the length field, or setting the wrong length, or being given a 257-byte string and writing a "1" in the length field due to wraparound (granted, that wouldn't often be a security risk, just a bad result). If they had decided to use a variable-length length field, people would have found some way to screw that up. I'm sure hackers would have found a way to inject a long length into a short string and thus read past the end.
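The 257-byte wraparound is easy to reproduce. A minimal sketch, assuming a hypothetical Pascal-style format with a single unsigned length byte (both function names are made up):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What a careless program would do: store the length in one byte.
 * The conversion silently keeps only the low 8 bits. */
uint8_t naive_length_byte(size_t n)
{
    return (uint8_t)n;
}

/* A checked version reports failure instead of wrapping around. */
int checked_length_byte(size_t n, uint8_t *out)
{
    assert(out != NULL);
    if (n > UINT8_MAX)
        return -1;          /* length does not fit in the field */
    *out = (uint8_t)n;
    return 0;
}
```

naive_length_byte(257) yields 1, so a reader of that string would see one byte of data and ignore the other 256. The checked version is only one extra comparison, but somebody has to remember to write it.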
At the end of the day, the problem is that C lets programmers do whatever they want with memory, not the NUL terminator. And you can't really say "they should have designed it better," because it is rather the point of C that it lets you do this.
They could have but they didn't (e.g., in Pascal, where strings actually are limited to 255 bytes). So, history has made some worse string representations than C.
Interesting, but I think this article largely misses the point.
Firstly, it makes it seem like the address+length format is a no-brainer, but there are quite a lot of problems with that. It would have had the undesirable consequence of making a string larger than a pointer. Alternatively, it could be a pointer to a length+data block, but then it wouldn't be possible to take a suffix of a string by moving the pointer forward. Furthermore, if they chose a one-byte length, as the article so casually suggests as the correct solution (like Pascal), it would have had the insane limit of 255-byte strings, with no compatible way to have a string any longer. (Though a size_t length would make more sense.) Furthermore, it would be more complex for interoperating between languages -- right now, a char* is a char*. If we used a length field, how many bytes would it be? What endianness? Would the length be first or last? How many implementations would trip up on strings > 128 bytes (treating it as a signed quantity)? In some ways, it is nice that getaddrinfo takes a NUL-terminated char* and not a more complicated monster. I'm not saying this makes NUL-termination the right decision, but it certainly has a number of advantages over addr+length.
Secondly, this article puts the blame on the C language. It misses the historical step of B, which had the same design decision (by the same people), except it used ASCII 4 (EOT) to terminate strings. I think switching to NUL was a good decision;)
Hardware development, performance, and compiler development costs are all valid. But on the security costs section, it focuses on the buffer overflow issue, which is irrelevant. gets is a very bad idea, and it would be whether C had used NUL-terminated strings or addr+len strings. The decision which led to all these buffer overflow problems is that the C library tends to use a "you allocate, I fill" model, rather than an "I allocate and fill" model (strdup being one of the few exceptions). That's got nothing to do with the NUL terminator.
What the article missed was the real security problems caused by the NUL terminator. The obvious one is that if you forget to NUL-terminate a string, anything which traverses it will read on past the end of the buffer for who knows how long. The author blames gets, but this isn't why gets is bad -- gets correctly NUL-terminates the string. There are other sneaky, subtle NUL-termination problems that aren't buffer overflows. A couple of years back, a vulnerability was found in Microsoft's crypto libraries (I don't have a link unfortunately) affecting all web browsers except Firefox (which has its own). The problem was that it allowed NUL bytes in domain names, and used strcmp to compare domain names when checking certificates. This meant that "google.com" and "google.com\0.malicioushacker.com" compared equal, so if I got a certificate for "*.com\0.malicioushacker.com" I could use it to impersonate any legitimate .com domain. That would have been an interesting case to mention rather than merely equating "NUL terminator problem" with "buffer overflow".
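The comparison problem fits in a few lines of C. This is only an illustration of the strcmp behaviour, not the actual code from any crypto library; the function names and the evil_name constant are made up, and wildcard matching is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Compare a requested host name with a certificate name the way a naive
 * checker would: strcmp stops at the first 0 byte in either argument. */
int names_match_strcmp(const char *host, const char *cert_name)
{
    assert(host != NULL && cert_name != NULL);
    return strcmp(host, cert_name) == 0;
}

/* Length-aware comparison: the certificate name's full length counts,
 * so bytes after an embedded NUL are not ignored. */
int names_match_len(const char *host, size_t host_len,
                    const char *cert_name, size_t cert_len)
{
    assert(host != NULL && cert_name != NULL);
    return host_len == cert_len && memcmp(host, cert_name, host_len) == 0;
}

/* Certificate name with an embedded NUL, as in the example above. */
const char evil_name[] = "google.com\0.malicioushacker.com";
```

names_match_strcmp("google.com", evil_name) reports a match. names_match_len, given sizeof evil_name - 1 as the certificate name's length, does not, because the lengths differ before any bytes are compared.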
I find it ironic that Max Schaefer, who said those comments and develops Torchlight, was one of the two guys that originally created Diablo.
Interesting that Torchlight 2 will actually support multiplayer this time around!
well, there are other games to play for times like that.
You mean "hacking a server-side save system is non-trivial"?
Obligatory SMBC.
Thanks. I'm not very good at short summaries.
Yes -- this is a good reason not to use NUL-terminated strings (which, once again, TFA missed). Remember: I never said NUL terminated strings were good, just that the article missed the point by blaming NUL strings for a different, unrelated problem, and not actually picking up on any of the problems with NUL strings.
If you need a 0 byte in your strings, then this won't work. However, to be technically correct, strings should contain text, and text should not contain a 0 byte. What about binary strings? Those should absolutely not be stored as NUL-terminated. Remember, nothing in C forces you to use NUL-terminated strings -- it just means you should not use the str* functions from string.h on binary strings. Instead, you MUST separately keep the length around, as you do for an array of ints. Think of a binary string as an "array of chars" and not a NUL-terminated string, and there *shouldn't* be any trouble. (Yet as I pointed out with the MS certificate bug, there can still be trouble.)
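To make the "array of chars plus a length" idea concrete, here is a minimal sketch. The helper name copy_bytes is made up for illustration; the only real library call is memcpy. The point is that the caller carries the length, so embedded 0 bytes are just data.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy a binary buffer whose length the caller tracks.
   Returns 1 on success, 0 if the destination is too small. */
int copy_bytes(unsigned char *dst, size_t dst_cap,
               const unsigned char *src, size_t src_len)
{
    if (src_len > dst_cap)
        return 0;               /* refuse rather than overrun dst */
    memcpy(dst, src, src_len);  /* copies every byte, including any 0 bytes */
    return 1;
}
```

Calling strlen on the same buffer would stop at the first 0 byte, which is exactly why the str* functions are the wrong tool for binary data.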
Nothing was hard about it. It's just that you had to invent two new library functions (strnew and strfill) which are much higher-level than other C library functions (with the possible exception of strdup, which combines allocation and copying). You are now saying to your C users (in the hypothetical "C with length-delimited strings" language) "you must never manually manipulate your own strings -- only ever use these library functions." That is antithetical to the way C works. C programmers want absolute control over the representation of everything. If you want a higher-level language, use a higher-level language.
Then that would be incompatible, yes (and as I said in the original post, there was historically such an issue, as B chose 4 as the string terminator!) But that is just one potentially-incompatible design decision, versus the four I listed in my post for length-delimited strings. Other issues for length-delimitation: do you put the length in a struct with the pointer, or in the buffer with the data? Do you make it variable-width or fixed-width? If it's variable-width, do you use 0 or 1 as an extension bit? Do you limit the length to a maximum of 32 or 64 bits, or allow arbitrarily long length fields? If you limit it, what is the limit? If you don't limit it, how do implementations cope when the length is too long to fit in their standard 'size_t' type?
Again, I'm sure all of the above questions have sensible answers, but my original point stands: it is *not* straightforward, and undoubtedly there would be at least as much confusion and as many bugs with a length-delimited string as there are with a NUL-terminated string.
Well C++ includes a class that is pretty much exactly what you ask for. It wouldn't make sense for C to include that, as the whole point is that C gives you the ability to manipulate data however you want. If C included that, it would be criticised for having two incompatible string types. If it only included that, it would be criticised for not being low-level enough (the programmer is forced to call all these inefficient string manipulation functions that do bounds checking).
You might ask why C doesn't include closures and list comprehensions: if you want high-level language features, then C isn't the language for you.
Exactly. The real problem* is that C lets programmers fabricate data however they want.
*I say "problem" but it really is the whole point of C. It is a dangerous and powerful tool. To make it less dangerous would make it less powerful, and if you wanted such a language, there are plenty available.
Nope, they are simpler. Re-read all of the questions I asked regarding design decisions that could be made around address+length formatted strings and tell me that they are just as simple. Now I think higher-level languages should be using lengths, because their libraries abstract the details (e.g., C++ or Java). But in a language where programmers fabricate their own strings, simplicity is best.
Well, let's assume a hypothetical universe in which C is still exactly the same C, only with length-delimited strings (still the same level of safety, still malloc and free, still pretty much the same library, only the string functions are implemented differently, etc). Could you write a library that abstracts over the string representation without ever requiring the user to manually read or write the string? I think if you did that (and certainly, C++ does that), you would have a much higher-level library. That isn't what C is good for. C is for when you need low-level access to the underlying representation.
The beauty of using C (and there aren't many) is that you can write your own efficient string manipulation code. For example, if you know you are going to concatenate three strings, you can allocate enough space for all three, then manually copy the bytes over and seal it with a NUL. In C++, you would probably have a stringstream and push each of the strings onto the end, but it would mean the library is internally adjusting lengths and so on -- the programmer can't make the code do exactly what he asks; there is a layer of abstraction. So you could change C's string representation and then provide a high-level API for manipulating it, but someone is going to get pissed off that the library doesn't do exactly what he wants, and dive down and do it himself. It would be very un-C-like to provide that API.
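As a rough sketch of the three-string case described above: concat3 is a name I made up, and I'm ignoring the (unlikely) possibility that the three lengths overflow size_t when added.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a freshly malloc'd a+b+c, or NULL on
   allocation failure. The caller must free() the result. */
char *concat3(const char *a, const char *b, const char *c)
{
    size_t la = strlen(a), lb = strlen(b), lc = strlen(c);
    char *out = malloc(la + lb + lc + 1);   /* one allocation, +1 for the NUL */
    if (out == NULL)
        return NULL;
    memcpy(out, a, la);                     /* copy each piece exactly once */
    memcpy(out + la, b, lb);
    memcpy(out + la + lb, c, lc);
    out[la + lb + lc] = '\0';               /* seal it with a NUL */
    return out;
}
```

One allocation, three copies, one terminator: nothing happens that the programmer didn't write.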
To put it another way, if you were going to provide a high-level string API for C and tell programmers "never ever manipulate strings on your own; use this library," then you might as well use NUL-terminated strings anyway, since the library will handle it, and programmers will never make a mistake. But again, that would be very un-C-like.
So once again, it comes down to this: NUL-terminated strings aren't the problem with C. C is the problem with C: the fact that it gives programmers a lot of power. You might argue that we should stop using C to write programs that don't need that speed or power. But there's no point arguing that C should have been a higher-level language, because then it wouldn't be C.
Note that my post was not necessarily saying that NUL was the right decision. Just that it isn't a no-brainer -- going the other route has a lot of complications.
It would mean that the C library would need to declare a "string" struct instead of using char*. Now rather than passing a char* as an argument, you would have to decide whether it's worth passing the two word "string" struct, or a string* pointer (allowing it to fit into a register). It makes things more complicated.
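For illustration only, this is roughly what such a struct and the two calling styles might look like. The type and function names are invented; nothing like this exists in the standard library.

```c
#include <stddef.h>

/* Hypothetical address+length string type. */
struct lstring {
    const char *data;   /* pointer to the bytes (not NUL-terminated) */
    size_t      len;    /* number of bytes at data */
};

/* Option 1: pass the whole struct by value (two words on typical ABIs). */
size_t lstring_len_by_value(struct lstring s)
{
    return s.len;
}

/* Option 2: pass a pointer to it (one word, but an extra indirection). */
size_t lstring_len_by_pointer(const struct lstring *s)
{
    return s->len;
}
```

With char* there is no such choice to make: the argument is one pointer and that's it.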
No, it isn't too hard to solve. But it is non-trivial. Dealing with NUL is significantly simpler than dealing with length fields, and there are significantly fewer sources for confusion. Remember that in C, programmers fabricate their own strings (there is a minimal string library, but often you will see people just allocating memory for strings, populating them, and storing a '\0' on the end). If you wanted the standard to use a variable-length length as you suggest, you would need to make sure that all the programmers correctly store and parse variable-length strings. Of course they could get it right, but there are lots of ways they could get it wrong. The same applies to NUL.
Here's a question: How much memory do you allocate for a string of N bytes? The NUL-termination answer: N + 1. The answer for your MySQL-style variable-length length scheme: N + (N < 128 ? 1 : N < 16384 ? 2 : N < 2097152 ? 3 : .....) -- yes there is a correct answer, but it is much more complicated for the everyday programmer to deal with.
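Spelled out as code, the two answers look something like this. I'm assuming the prefix stores 7 bits of length per byte, which is what the thresholds above imply; that is my reading of the scheme, not necessarily what MySQL actually does. All function names are made up.

```c
#include <stddef.h>

/* NUL-terminated: always one extra byte. */
size_t alloc_size_nul(size_t n)
{
    return n + 1;
}

/* Bytes needed to store n as a variable-length prefix,
   assuming 7 bits of length per prefix byte. */
size_t length_prefix_bytes(size_t n)
{
    size_t bytes = 1;
    while (n >= 128) {      /* more than 7 bits left: need another byte */
        n >>= 7;
        bytes++;
    }
    return bytes;
}

/* Length-prefixed: payload plus a prefix whose size depends on n. */
size_t alloc_size_prefixed(size_t n)
{
    return n + length_prefix_bytes(n);
}
```

It isn't hard, but it is a loop and a second function where the NUL version is "+ 1".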
I think the state of programming is so bad now that people wouldn't test it. A major security bug in a Blowfish implementation was found just last month, caused precisely by a signed/unsigned char mismatch.
The only security issues mentioned were buffer overruns, with gets taking most of the blame. As I said above, only some NUL errors are buffer overruns and only some buffer overruns are NUL errors, and gets errors are not anything to do with NUL.
Good point.
Yes, but perhaps the simplicity was partly why it caught on. The reason I raised all of the "what about..." questions was to illustrate just how many small variations in an address+length standard there could have been. Even if C had made a decision on all of those, how many implementations would have gotten it wrong?
Not just implementations, but individual programs. Assume a hypothetical universe in which C doesn't use NUL-terminated strings but is still a low-level, unsafe language in general: how would this have been any different? Unlike C++ or Java, in C, programs manually construct strings. So we wouldn't have people forgetting to NUL-terminate strings. We would instead have people forgetting to set the length field, or setting the wrong length, or being given a 257-byte string and writing a "1" in a one-byte length field due to wraparound (granted, that wouldn't often be a security risk, just a bad result). If they had decided to use a variable-length length field, people would have found some way to screw that up. I'm sure hackers would have found a way to inject a long length into a short string and thus read past the end.
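The 257-byte wraparound is easy to reproduce. This sketch assumes 8-bit chars (true on essentially every platform you'll meet) and uses invented function names; the second function shows the check people would forget to write.

```c
#include <limits.h>
#include <stddef.h>

/* The bug: silently truncates the length to one byte (257 becomes 1). */
unsigned char store_len_u8(size_t n)
{
    return (unsigned char)n;
}

/* The fix: refuse lengths that don't fit. Returns 1 on success, 0 if n
   is too large for a one-byte length field. */
int store_len_u8_checked(size_t n, unsigned char *out)
{
    if (n > UCHAR_MAX)
        return 0;
    *out = (unsigned char)n;
    return 1;
}
```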
At the end of the day, the problem is that C lets programmers do whatever they want with memory, not the NUL terminator. And you can't really say "they should have designed it better," because it is rather the point of C that it lets you do this.
They could have but they didn't (e.g., in Pascal, where strings actually are limited to 255 bytes). So, history has made some worse string representations than C.
Thanks! +1
+1
That's exactly how I read the title too.
What is wrong with all of us?
You assume that every PC gamer owns a console. That's a pretty weird assumption.
If you release a game on PC, I may buy it. If you release it only on console, there is a 0% chance I will buy it. Surely I'm not alone.
Interesting, but I think this article largely misses the point.
Firstly, it makes it seem like the address+length format is a no-brainer, but there are quite a lot of problems with that. It would have had the undesirable consequence of making a string larger than a pointer. Alternatively, it could be a pointer to a length+data block, but then it wouldn't be possible to take a suffix of a string by moving the pointer forward. Furthermore, if they chose a one-byte length, as the article so casually suggests as the correct solution (like Pascal), it would have had the insane limit of 255-byte strings, with no compatible way to have a string any longer. (Though a size_t length would make more sense.) Furthermore, it would be more complex for interoperating between languages -- right now, a char* is a char*. If we used a length field, how many bytes would it be? What endianness? Would the length be first or last? How many implementations would trip up on strings of 128 bytes or more (treating a one-byte length as a signed quantity)? In some ways, it is nice that getaddrinfo takes a NUL-terminated char* and not a more complicated monster. I'm not saying this makes NUL-termination the right decision, but it certainly has a number of advantages over addr+length.
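The suffix trick mentioned above, as a small sketch (suffix_from is an invented name): with NUL termination, a suffix is just the same pointer moved forward, sharing the original storage and the original terminator.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns a pointer to the suffix of s that starts
   skip bytes in. No copy is made; if skip runs past the end, the result
   points at the terminating NUL (i.e. the empty string). */
const char *suffix_from(const char *s, size_t skip)
{
    size_t len = strlen(s);
    return s + (skip < len ? skip : len);
}
```

With a pointer to a length+data block, the same operation needs a new block with its own length field, which means an allocation and a copy.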
Secondly, this article puts the blame on the C language. It misses the historical step of B, which had the same design decision (by the same people), except it used ASCII 4 (EOT) to terminate strings. I think switching to NUL was a good decision ;)
Hardware development, performance, and compiler development costs are all valid. But on the security costs section, it focuses on the buffer overflow issue, which is irrelevant. gets is a very bad idea, and it would be whether C had used NUL-terminated strings or addr+len strings. The decision which led to all these buffer overflow problems is that the C library tends to use a "you allocate, I fill" model, rather than an "I allocate and fill" model (strdup being one of the few exceptions). That's got nothing to do with the NUL terminator.
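Here is a small sketch of the two models, with invented function names and a fixed string standing in for real input. In the first, the caller owns the buffer, so the callee is only safe if it is told the capacity (which is precisely what gets is not). In the second, the callee sizes the buffer itself, as strdup does.

```c
#include <stdlib.h>
#include <string.h>

/* "You allocate, I fill": returns 1 on success, 0 if buf is too small. */
int fill_greeting(char *buf, size_t cap)
{
    static const char msg[] = "hello";
    if (cap < sizeof msg)           /* sizeof msg includes the NUL */
        return 0;
    memcpy(buf, msg, sizeof msg);
    return 1;
}

/* "I allocate and fill": returns a malloc'd copy, or NULL on failure.
   The caller must free() it. */
char *make_greeting(void)
{
    static const char msg[] = "hello";
    char *out = malloc(sizeof msg);
    if (out != NULL)
        memcpy(out, msg, sizeof msg);
    return out;
}
```

Neither function cares how the string is terminated; the overflow risk lives entirely in who knows the buffer size.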
What the article missed was the real security problems caused by the NUL terminator. The obvious one is that if you forget to NUL-terminate a string, anything which traverses it will read on past the end of the buffer for who knows how long. The author blames gets, but this isn't why gets is bad -- gets correctly NUL-terminates the string. There are other sneaky, subtle NUL-termination problems that aren't buffer overflows. A couple of years back, a vulnerability was found in Microsoft's crypto libraries (I don't have a link unfortunately) affecting all web browsers except Firefox (which has its own). The problem was that it allowed NUL bytes in domain names, and used strcmp to compare domain names when checking certificates. This meant that "google.com" and "google.com\0.malicioushacker.com" compared equal, so if I got a certificate for "*.com\0.malicioushacker.com" I could use it to impersonate any legitimate .com domain. That would have been an interesting case to mention rather than merely equating "NUL pointer problem" with "buffer overflow".
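The comparison problem is easy to demonstrate. This is only a sketch of the failure mode, not the actual code from any crypto library; the function names are mine.

```c
#include <stddef.h>
#include <string.h>

/* Buggy: strcmp stops at the first 0 byte, so anything after an
   embedded NUL is ignored. */
int names_equal_strcmp(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* Safer: compare explicit lengths first, then every byte. */
int names_equal_len(const char *a, size_t alen,
                    const char *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}
```

Given "google.com" and "google.com\0.malicioushacker.com", the first function reports a match and the second does not, provided the caller passes the real byte lengths (for instance from the certificate's own length field) rather than strlen.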