C++ requires you to recompile even if you add a private non-virtual member function. This should not be necessary as it does not change the interface, including not changing the memory layout or size of the object.
I have certainly made bogus cut&pasted implementations just because I wanted to avoid a recompile, when a new private method that did the common part would have been useful.
I would like to see the ability to do "class X {int A,B,C; void foo(); ...}" in a header file. Notice the ellipsis. This means "there is more stuff but if you have a pointer to the class you can call foo()". Right now doing "class X;" is the same as "class X {...};" so it appears the compilers would be perfectly capable of it. This would avoid a LOT of recompilation, and also make C++ more useful for long-term libraries where the calling code wants to work without recompilation.
It is useful to do operator overloading, as there are many things that have *not* been written (such as all kinds of variations on matrix math).
I do think it is overused however, and C++ is somewhat unreadable. I would like the following:
1. They cannot be virtual
2. Limit the set. For equality tests you should only be able to overload = for free, automatically. Do whatever is obvious for the other items.
3. Output should be done by a print(ostream) method instead of operator<<. Then ostream.format("The answer is %x, %x, %12x", object1, object2, object3); would call object1.print(ostream), etc. The 'x' and '12' would be accessible from the ostream by the print functions. Hopefully the compiler features would be useful elsewhere too. The C++ i/o syntax is quite unreadable and lots of programmers resort to printf if they want to control number base or precision.
The problem with C for structures is that the necessary data for structures is visible, making it ugly. Basically you have to do at least this:
struct Object { TypeInfo* info; };
// this does { i->info = info; return i; }, may be a macro by using compiler features:
extern Object* setInfo(Object*, TypeInfo*);

struct X { Object base; ...};
extern TypeInfo XInfo;
#define NEW_X (X*)setInfo((Object*)malloc(sizeof(X)), &XInfo)

struct Y { X base; ...};
extern TypeInfo YInfo; // something else puts a pointer to XInfo in this!
#define NEW_Y (Y*)setInfo((Object*)malloc(sizeof(Y)), &YInfo)
You can arrange a lot more macros, but it is never going to get pretty.
That syntax problem is fixed (the same way C++ does it) in most C compilers (unless you turn on pedantic mode).
Of course, you might argue that SSL certs shouldn't be relied on for identification, but that's what users have been told to do; look for the little padlock, make sure it says "paypal.com" etc.
THEN DON'T SHOW THE PADLOCK FOR SELF-SIGNED CERTIFICATES
My god, you people are so incredibly stubborn. You will repeat this "reason" over and over and over and over, no matter how many times people like me point out the TRIVIAL way to completely fix your objection to self-signed certificates.
See above comment about the browser complaining if the SSL certificate changes.
BULL.
There are both WP7 and Android phones being sold for equivalent prices at my closest T-Mobile store. They are right there in front, displayed side by side, along with BlackBerrys and a lot of less-smart phones. The labels clearly state which are running Android or Windows. The users can see and choose. The hardware looks to me to be pretty much identical (in fact the phone I have came in both versions).
The people have a choice and it is pretty obvious how they have spoken.
It is possible to make it match overlong sequences, since they are patterns just like anything else.
However it probably should not. If anything interprets overlong sequences as anything other than erroneous encodings then it is a bug in that section. You can't transfer the blame to the regexp matching. Conversely you certainly do not want to match overlong encodings if the further step does not interpret them that way.
You seem to misunderstand where the efficiency comes from. It has NOTHING to do with Latin-1. The regexp is in fact compiled into numerous lookup tables that are indexed by the bytes. It is more efficient to match against Egyptian or any of the others using tens of thousands of 256-entry lookup tables than to match using only a dozen or so 2^21 entry tables. In fact all practical UTF-16 or UCS-4 regexp compilers work by splitting the codes into bit slices and using those rather than the code points as indexes into tables. The main advantage UTF-8 has is that you do not have to allocate memory for the converted copy, and much more obvious and useful handling of the overlong and other errors.
I'm sorry I was talking about compiled tables used to implement fast regexp and searches, not direct implementations of character matching.
An example of the kind of lookup table I mean: if you wanted to match all characters in a character class, you could make an 8-bit lookup table for the first UTF-8 byte. Each entry either points to another lookup table for the second byte, or is an "all true" or an "all false" indicator. In fact any practical method of matching subsets of Unicode works something like this, which is why I really don't see any advantage in translating to UTF-32. The UTF-8 bytes are actually somewhat balanced toward frequency so that the most common characters are found with fewer lookups.
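A minimal sketch of such a two-level table, assuming a class of [a-zà-ÿ] (ASCII a-z plus U+00E0..U+00FF, which UTF-8 encodes as 0xC3 followed by 0xA0..0xBF). The names first, second, build_tables and match_class are made up for illustration, and a real matcher would need more levels for 3- and 4-byte sequences.

```c
#include <stddef.h>
#include <string.h>

/* First-byte table entries: a fixed verdict, or "consult the next table". */
enum { ALL_FALSE = 0, ALL_TRUE = 1, NEXT_TABLE = 2 };

static unsigned char first[256];   /* indexed by the lead byte   */
static unsigned char second[256];  /* indexed by the second byte */

/* Build the tables for the class [a-z] plus U+00E0..U+00FF. */
static void build_tables(void)
{
    int b;

    memset(first, ALL_FALSE, sizeof first);
    memset(second, 0, sizeof second);
    for (b = 'a'; b <= 'z'; b++)
        first[b] = ALL_TRUE;
    first[0xC3] = NEXT_TABLE;
    for (b = 0xA0; b <= 0xBF; b++)
        second[b] = 1;
}

/* Returns the number of bytes matched at s (1 or 2), or 0 for no match.
   s must be nul-terminated; no code point is ever decoded. */
static size_t match_class(const unsigned char *s)
{
    switch (first[s[0]]) {
    case ALL_TRUE:   return 1;
    case NEXT_TABLE: return second[s[1]] ? 2 : 0;
    default:         return 0;
    }
}
```

Both tables together take 512 bytes, and an ASCII character is decided with a single lookup.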
Yes I agree that the use of "length" is misleading (even for ASCII it was probably misleading as soon as there were proportional fonts).
A more important source of confusion is the huge amount of dated documentation that says "character" when it should say "byte". This leads the clueless to think that "character" is very important and they must always count these for all measurements, no matter how hard or ill-defined it is. For instance the Linux man page says strchr "returns a pointer to the first occurrence of the character c in the string" (and the argument c is an integer). The correct documentation is "returns a pointer to the first occurrence of the byte c in the string". You can imagine the horrors that somebody with determination, rudimentary knowledge of UTF-8, and just enough programming talent to be dangerous could cause if they decided to "fix" the library to make it obey the incorrect documentation.
You seem to have read my post exactly backwards.
The ONLY time there should be any interest in "glyphs" is when strings are rendered, which is very, very rare. In addition it is impossible to get the actual Nth glyph without examining the entire string, and it can vary depending on the font, on the font layout software, and on whatever your concept of the serial order of the glyphs is.
If string[n] exists it should return the nth code unit from the encoding. I very much recommend against implementing it at all because it makes the marching morons write "string[n] = tolower(string[n])" and other horrors that make I18N impossible.
you don't have to constantly spend time composing characters from UTF-8 to compare to regex values and such.
Not true, you do not need to decode UTF-8 except for the character-set (square brackets in most syntaxes) match, and only if that set contains non-ASCII. This can be done at the moment the matching is done, both for parsing the pattern and for reading the matched string, and there are plenty of regexp libraries that do so. Also pattern matching of 8-bit sequences can take a lot of advantage of lookup tables (you can fit approximately 2^13 8-bit lookup tables into the space that a UTF-32 lookup table would take) so it is usually advantageous to compile a UTF-8 regexp into a more complex one that is byte-based, if it is used multiple times.
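As a small illustration of matching without decoding, here is a sketch that searches for a non-ASCII literal with plain strstr. It relies on the fact that in valid UTF-8 the encoding of one character never appears inside the encoding of another, so a byte-wise match is also a character-wise match. The helper name find_utf8 is hypothetical.

```c
#include <string.h>

/* Returns the byte offset of needle in haystack, or -1 if it is absent.
   Both arguments are assumed to be valid, nul-terminated UTF-8. */
static long find_utf8(const char *haystack, const char *needle)
{
    const char *p = strstr(haystack, needle);
    return p ? (long)(p - haystack) : -1;
}
```

The offset returned is a byte offset, which is what every later copy or slice of the buffer needs anyway.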
Yes that would work. I don't like adding test variables, but this is equivalent to most other "how do I avoid a goto" solutions. I think it does indicate a type of control flow that should be supported by a language somehow.
The number of bytes in the UTF-8 string is about as accurate of a guess as to how wide a string will print.
To actually measure a string you need to add up the widths of all the glyphs and escapements and handle kerning and compositing characters. I suspect such a function would have "width" in its name.
The 0/1/2 type measurements that are used to emulate older fixed-pitch terminals that did Japanese encodings where all Japanese characters were double-width are occasionally useful, however I very much doubt that strlen_utf8() is returning that value.
No, test_1 is false after foo is done, so this will not work. After foo is done the only thing that should be done is bar, the tests test_1 and test_2 may produce garbage or even crash (though I think in most examples they will just waste time).
I admit the exact problem is difficult to state as there are a lot of conditions and I probably did not list them all, but I have always ended up putting a goto in to fix it, and think it would be nice if some language came up with a solution.
You misunderstood.
The result of strlen_utf8() is useless. It could be the number of Unicode code points, or code points if the characters are decomposed (or any of the 4 types of normalization) or code units if converted to UTF-16 (with/without normalization), or it could be a glyph count, or a count of what people familiar with the languages would count as "characters", or it could be schemes where double-width characters count as 2 and invisible ones as zero to emulate old terminals, or any of a million other possibilities, without even adding questions of what to do with UTF-8 encoding errors. Thinking this value has some purpose is a sure sign that you do not understand Unicode and have not been doing serious string manipulation in your software.
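To make the ambiguity concrete, here is a sketch of one plausible strlen_utf8 (the name count_code_points and the choice to count code points are assumptions; the function under discussion may do something else). It counts every byte that is not a continuation byte.

```c
#include <stddef.h>

/* Counts code points by skipping UTF-8 continuation bytes (10xxxxxx).
   Assumes valid, nul-terminated UTF-8. */
static size_t count_code_points(const char *s)
{
    size_t n = 0;

    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

The same visible "é" yields 1 when stored precomposed as "\xC3\xA9" (U+00E9) and 2 when stored decomposed as "e\xCC\x81" (U+0065 followed by U+0301), so the count depends on the normalization, not on anything the reader sees.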
Why they did not (and still don't) have named break/continue is a mystery. I would say 90% of the uses of goto are because of this. I also suspect that if the early C compiler could do a goto, it could have done a named break/continue.
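A sketch of the typical case, with a hypothetical find_in_grid function: the goto does exactly what a named break on the outer loop would do.

```c
#include <stddef.h>

/* Returns 1 and stores the position if target occurs in the grid, else 0. */
static int find_in_grid(int grid[3][3], int target, size_t *row, size_t *col)
{
    size_t r, c;

    for (r = 0; r < 3; r++) {
        for (c = 0; c < 3; c++) {
            if (grid[r][c] == target)
                goto found;   /* what "break outer;" would express */
        }
    }
    return 0;

found:
    *row = r;
    *col = c;
    return 1;
}
```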
Another use of goto is code like this; does anybody know of a plausible construct for it in any programming language? In this example bar is a complex mess of code that refers to dozens of local variables, thus making it into a function or duplicating it would make the code much more complex and hard to understand. In addition test_2 will crash or produce an undefined result if test_1 is true.
if (test_1) {
    foo;
    goto TEST_2;
} else if (test_2) {
TEST_2:
    bar;
}
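For comparison, the usual goto-free rewrite adds a flag variable. This is only a sketch: foo, bar and test_2 below are hypothetical stand-ins that count their calls, and do_bar is the extra test variable.

```c
/* Hypothetical stand-ins for the conditions and actions in the example. */
static int foo_calls, bar_calls, test_2_calls;

static void foo(void) { foo_calls++; }
static void bar(void) { bar_calls++; }
static int test_2(void) { test_2_calls++; return 0; }

static void run(int test_1)
{
    int do_bar = 0;          /* the extra "test variable" */

    if (test_1) {
        foo();
        do_bar = 1;          /* replaces "goto TEST_2" */
    } else if (test_2()) {
        do_bar = 1;
    }
    if (do_bar)
        bar();
}
```

As in the goto version, test_2 is never evaluated when test_1 is true, and bar appears only once in the source.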
Some of the earliest C functions had setjmp/longjmp and these did work as your "primitive precursor to exceptions".
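A minimal sketch of that usage, with hypothetical names (on_error, parse_digit, all_digits): longjmp plays the role of throw and the non-zero return from setjmp plays the role of catch.

```c
#include <setjmp.h>

static jmp_buf on_error;

static void parse_digit(char ch)
{
    if (ch < '0' || ch > '9')
        longjmp(on_error, 1);   /* "throw": unwinds to the setjmp call */
}

/* Returns 0 if every character is a digit, -1 otherwise. */
static int all_digits(const char *s)
{
    if (setjmp(on_error) != 0)
        return -1;              /* "catch": reached via longjmp */
    for (; *s; s++)
        parse_digit(*s);
    return 0;
}
```

Unlike real exceptions, nothing is cleaned up on the way out, so any resource acquired between the setjmp and the longjmp has to be released by hand.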
I would agree that UTF-16 instead of UTF-8 is the most costly current-day mistake.
Some other aspects of Unicode design might also be big mistakes, such as multiple ways of representing the "same" characters. This is going to be like case-independent filenames (does lower-case æ match uppercase or not?), but with a million different and very complex possible "case foldings" and everybody disagreeing about which set they use. Possibly they can learn from Unix and make different byte streams always mean different strings, but I have my doubts that some system programmers are that intelligent, since they also think UTF-16 is a good idea.
The 32-bit time_t is an old mistake, not a modern one. I think a bigger mistake in time is to keep underestimating how fine of a division you want of a second (original Unix had 1) and to keep using powers of 10 for these divisions, which do not translate losslessly into floating point formats. Linux and POSIX have about a dozen ways of representing time, each with a different unit for the sub-second portion, and all of them powers of 10.
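The floating point issue can be seen with a small sketch (the helper exact_in_float is hypothetical): a binary fraction of a second such as 1/1024 survives a trip through a narrower float format, while a decimal one such as 1/1000 does not.

```c
/* Returns 1 if x survives a round trip through single precision. */
static int exact_in_float(double x)
{
    float f = (float)x;
    return (double)f == x;
}
```

Here exact_in_float(1.0 / 1024.0) is true because 1/1024 is a power of two, and exact_in_float(1.0 / 1000.0) is false because 1/1000 has no finite binary expansion and is rounded differently at each precision.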
Another mistake is to not have a primitive atomic-file-create call on Unix or Windows. This would appear to create an empty file that you could write to, but until the file is closed it would not appear in the file system; instead all processes would either not see the file or see exactly the previous version of the file. Every program wants this and in fact I think a system could be made where this is the only way to write a file, but the file systems do not support it directly; instead you must use very complex workarounds (or simpler workarounds that are flawed). There was a huge stink because new Linux filesystems made these workarounds fail, and the hostile reactions of some of the Linux gurus lead me to believe that this mistake is not going to be addressed soon.
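A sketch of the common workaround on POSIX systems: write to a temporary file, flush it to disk, then rename it over the target. The function name replace_file and the ".tmp" suffix are assumptions of this sketch, and real code would need a unique temporary name and usually an fsync of the containing directory as well.

```c
#include <stdio.h>
#include <unistd.h>

/* Writes data to "<path>.tmp" and renames it over path.
   Returns 0 on success, -1 on failure. */
static int replace_file(const char *path, const char *data)
{
    char tmp[4096];
    FILE *f;
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);

    if (n < 0 || n >= (int)sizeof tmp)
        return -1;
    f = fopen(tmp, "w");
    if (!f)
        return -1;
    /* Push the data to the disk before the rename makes it visible. */
    if (fputs(data, f) == EOF || fflush(f) != 0 || fsync(fileno(f)) != 0) {
        fclose(f);
        remove(tmp);
        return -1;
    }
    if (fclose(f) != 0 || rename(tmp, path) != 0) {
        remove(tmp);
        return -1;
    }
    return 0;
}
```

Readers see either the old contents or the new contents, never a half-written file, but it takes four system calls and a scratch name to get what a single primitive could provide. The fsync step is the one that simpler versions tend to leave out.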
Related to this, glibc and Windows refusing to put strlcpy into the standard library is probably a cause of enormous numbers of nul termination bugs, since programmers are lazy and will do stupid things because they don't have this call.
Although I agree, since backslash had a very long use as the escape character, you could say that the choice of slash by Multics/Unix was bad, too.
Period would have made a lot more sense, it would match the way hierarchies are indicated in virtually all programming languages.
The problem was that period made so much sense that it had already been incorporated into existing primitive filename conventions (usually as a separator between the name and the type, often called an "extension"). Unix had to be able to copy sets of files from other machines, which meant period had to be preserved, and thus could not be used for directory separator. (although it would be interesting if foo.c and foo.o meant a directory called foo with c and o files in it, I assume the overhead of creating many 1-entry directories for typical sets of files was considered prohibitive).
The PDP-11 did in fact set the non-zero flag for a very large set of operations, including memory to memory copy. So in fact the GP is correct, this is probably the most efficient way to copy a string on a PDP-11.
Well, actually, if both ends are completely random, and they bring a M-F cable, there is a 1/2 chance that the cable will work (as it works with both the M+F and F+M situations). Only in the F+F and M+M situations would you have to go back and get another cable.
So it seems like you can make a 1/2 chance of bringing the correct cable, like you said. If you assume that the work of returning the unused cable is non-zero (imagine you have to stand in line at Fry's return counter) then this could easily be a rational decision.
( A really good computer scientist might bring one, citing that while the worst case is two trips, the average case is one)
Seems to me the average number of trips would be 3/2. There is only a 1/3 chance that the cable they bring will work, and a 2/3 chance of two trips.
You may be confusing yourself with your previous clever example of why only two cables would be needed, but one of the possibilities requires *both* cables.
The Windows function is "strcpy_s" and is much less useful than strlcpy. In particular it is defined to actually "throw an exception" using a complex mechanism (since C does not have exceptions). This is really stupid, since it just turns a *possible* buffer overflow exploit into a *guaranteed* denial of service (anybody who actually did the work of catching the exception would also have been able to prevent the buffer overflow).
In reality nobody uses the exception mechanism; everyone relies on the default behavior. But this is also useless. It puts a nul at the start of the destination and returns an error indication. strlcpy instead puts the portion of the string that fits in the destination, and returns the length of the source. This makes the result useful for many purposes (for instance if the following code would never have looked at more than a buffer's worth of bytes anyway), and the return value is useful for allocating the correct-sized buffer.
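A sketch of those semantics, written as a hypothetical stand-in called copy_truncate so that it does not collide with a libc that already provides strlcpy:

```c
#include <stddef.h>
#include <string.h>

/* Copies at most size-1 bytes, always nul-terminates when size > 0,
   and returns strlen(src) so the caller can detect truncation. */
static size_t copy_truncate(char *dst, const char *src, size_t size)
{
    size_t len = strlen(src);

    if (size > 0) {
        size_t n = len < size - 1 ? len : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}
```

A return value greater than or equal to size means the copy was truncated, and the return value plus one is the buffer size that would have held the whole string.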
The _s functions are typical designed-by-committee crap and can be ignored. The correct function is strlcpy. Both Microsoft and glibc maintainers should be ashamed at their behavior in not supporting strlcpy.
I think strncpy was intended to convert null-terminated strings to fixed-length null padded strings, as used in many places in the Unix kernel at the time it was invented, like filenames.
Yes, that is exactly its purpose (see the dirent structure from early Unix for the most obvious example of a fixed-sized buffer).
The fact that strncpy sort-of works as a safe strcpy and that its name is very similar to strcpy is a really unfortunate mistake and has led to unbelievable problems.
A proper safe strcpy is strlcpy. But it is not in the Linux libc because there are idiot savants in control of that.
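The fixed-width behavior of strncpy can be seen in a short sketch. The 8-byte field and the name set_name are assumptions made for illustration.

```c
#include <string.h>

/* Fills an 8-byte fixed-width name field, the job strncpy was made for. */
static void set_name(char field[8], const char *name)
{
    strncpy(field, name, 8);
}
```

With a short name such as "ab" the remaining six bytes are all set to nul, which is what a fixed-width record wants. With a name of eight or more bytes the field holds the first eight bytes and no terminator at all, which is what bites anyone treating the result as a C string.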
That would NEVER have been considered in the 1970's.
Such a structure would either have to be passed by copying it into two registers or stack locations, or a pointer to it would have to be passed resulting in a two-level indirection to get at the characters. We are talking about 1/2 MHz machines with 16K of memory and perhaps 512 bytes of space for the stack. C did not support passing structures by value at that time.
What the article is proposing is more like:
struct string
{
    short length; // two bytes
    char buffer[];
};
where the allocated memory is actually sizeof(length)+length.
I do not think such a structure was EVER considered either. Nobody would have thought to "waste" a byte on the very few strings that were longer than 255. The choices were to use a terminating character, or use a single byte length. At the time many languages thought it was worth saving yet another byte by making the high bit in the last character indicate the end of the string (since nobody would ever need more than 127 different characters, right?).
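For reference, a sketch of how that layout is allocated in modern C. The constructor string_new is hypothetical, and the flexible array member is a C99 feature that was not available to the compilers being discussed.

```c
#include <stdlib.h>
#include <string.h>

struct string {
    short length;      /* two bytes on the machines discussed */
    char buffer[];     /* flexible array member (C99) */
};

/* Allocates one block holding the length followed by the bytes.
   Returns NULL on a negative length or allocation failure. */
static struct string *string_new(const char *bytes, short length)
{
    struct string *s;

    if (length < 0)
        return NULL;
    s = malloc(sizeof *s + (size_t)length);
    if (s) {
        s->length = length;
        memcpy(s->buffer, bytes, (size_t)length);
    }
    return s;
}
```

The length and the characters live in a single allocation, so one pointer reaches both, and the contents may include nul bytes.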