Slashdot Mirror


The Most Expensive One-Byte Mistake

An anonymous reader writes "Poul-Henning Kamp looks back at some of the bad decisions made in language design, specifically the C/Unix/Posix use of NUL-terminated text strings. 'The choice was really simple: Should the C language represent strings as an address + length tuple or just as the address with a magic character (NUL) marking the end? ... Using an address + length format would cost one more byte of overhead than an address + magic_marker format, and their PDP computer had limited core memory. In other words, this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day; but this one had quite atypical economic consequences.'"
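For readers who want the two layouts from the summary side by side, here is a minimal C sketch. The `counted_str` type and the helper names are hypothetical, made up for illustration rather than taken from Kamp's article. It also uses a full `size_t` length field, not the single extra byte the original trade-off was about.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical "address + length" string; an illustrative assumption. */
struct counted_str {
    const char *data;  /* address of the first byte */
    size_t      len;   /* number of bytes, stored explicitly */
};

/* Build a counted view over an existing NUL-terminated string. */
static struct counted_str counted_from_cstr(const char *s)
{
    assert(s != NULL);
    struct counted_str cs = { s, strlen(s) };  /* one scan, done once */
    return cs;
}

/* "Address + magic marker": the length is found by scanning for NUL. */
static size_t terminated_len(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

/* "Address + length": the length is a field read, with no scan. */
static size_t counted_len(struct counted_str cs)
{
    return cs.len;
}
```

With a buffer such as `"ab\0cd"`, `terminated_len` stops at the embedded NUL, while a `counted_str` built with `len = 5` still describes all five bytes. That is roughly the difference the article is arguing about.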

594 comments

  1. The Road Not Taken by symbolset · · Score: 5, Insightful

    Two roads diverged in a yellow wood,
    And sorry I could not travel both
    And be one traveler, long I stood
    And looked down one as far as I could
    To where it bent in the undergrowth;

    Then took the other, as just as fair,
    And having perhaps the better claim,
    Because it was grassy and wanted wear;
    Though as for that the passing there
    Had worn them really about the same,

    And both that morning equally lay
    In leaves no step had trodden black.
    Oh, I kept the first for another day!
    Yet knowing how way leads on to way,
    I doubted if I should ever come back.

    I shall be telling this with a sigh
    Somewhere ages and ages hence:
    Two roads diverged in a wood, and I—
    I took the one less traveled by,
    And that has made all the difference.

    - Robert Frost, 1920

    --
    Help stamp out iliturcy.
    1. Re:The Road Not Taken by IICV · · Score: 3, Interesting

      Everyone misunderstands that poem.

      Robert Frost had a fairly depressing outlook on life, and the point of the poem is that it doesn't matter what road you take.

      I mean, just pay attention to the narrative tense in the last stanza, the one people take to be so life-affirming and "do something different!". The narrator isn't saying "I did this, and I know it was important"; he's saying "I did this, and I think that in the future I'm going to tell people it was important".

      The narrator is a vain, shallow individual who frets about insignificant decisions like this, thinking that they will have some gigantic impact on his life, and then later on blows those choices up to be of earthshattering proportions. This is all despite the fact that half the poem is about how the roads are effectively identical; and in the end, he doesn't even tell us what was important about the path he took, just that it was the "one less traveled by" (which makes no sense! They were "just as fair", they had been "worn ... really about the same", they "both that morning equally lay".)

      Basically, if we apply this poem to the current situation, what it's saying is that in alternate 2011 we'd have an article about how null-terminated strings would have been better than Pascal strings. It doesn't matter what path you take, if you're the right kind of person you'll always blow up the significance of it in your mind later.

    2. Re:The Road Not Taken by j.+andrew+rogers · · Score: 2, Informative

      As a nitpick, this poem is not from 1920. I have an original copy that was inscribed by the owner in 1919.

      According to Wikipedia, the original poetry was published in 1916. The 1920 version was a second edition.

    3. Re:The Road Not Taken by jhoegl · · Score: 1

      Perhaps it means that regardless of which path you take, the one you do take (the decision you make) should be analyzed and reflected upon to verify it is in line with what you wish to accomplish.
      Of course, I could continue and counter your vain and shallow remarks and how they reflect upon a person taking a literal interpretation of a poem and scrutinizing it with the inability of the author to respond.
      But I digress.

    4. Re:The Road Not Taken by symbolset · · Score: 1

      OK fine. I quoted the 1920 publication because I didn't have access to the priors. All of these are out of copyright. Did I get the author right at least?

      --
      Help stamp out iliturcy.
    5. Re:The Road Not Taken by Anonymous Coward · · Score: 0

      I'm pretty sure that's what grandparent post meant. So perhaps not everyone misunderstands that.

    6. Re:The Road Not Taken by definate · · Score: 1

      Good point. Just read the wiki, and quickly googled around, and given the additional evidence on top of the poem, and what Robert Frost himself alluded to, it makes perfect sense.

      Also, this is a way better ending.

      --
      This is my footer. There are many like it, but this one is mine.
    7. Re:The Road Not Taken by billstewart · · Score: 4, Funny

      Yes, you got the author right. The trick is that in the 1920 edition, he's taking the other road...

      --

      Bill Stewart
      New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
    8. Re:The Road Not Taken by QuantumG · · Score: 1

      Where are threads like this in the space stories?

      --
      How we know is more important than what we know.
    9. Re:The Road Not Taken by Anonymous Coward · · Score: 1

      ...where Han shoots second.

    10. Re:The Road Not Taken by Player+03 · · Score: 1

      I was expecting that to end with a bunch of gibberish and possibly a segmentation fault.

    11. Re:The Road Not Taken by Player+03 · · Score: 0

      I was expecting this to end in a bunch of gibberish and possibly a segmentation fault.

    12. Re:The Road Not Taken by FrootLoops · · Score: 1

      I think the poem is written a little poorly, in the following sense: the narrator appears to switch between perspectives several times. The first perspective is someone who honestly cares about which road should be taken, and who is convinced that there is a difference. The second perspective is a cynic who wants to mock the first person. To be clear, I'd say the first perspective shows up in lines 3-8 and 14-15, with the cynic in 9-13 and 16-20. The switching lends itself to a misreading of the poem.

    13. Re:The Road Not Taken by Anonymous Coward · · Score: 0

      He just stood there wondering which road to walk on until someone walked by; then, being antisocial, he took the way they didn't and called it the one less traveled by.

    14. Re:The Road Not Taken by Anonymous Coward · · Score: 0

      His DLC prices for new poem content are outrageous.

    15. Re:The Road Not Taken by cgomezr · · Score: 1

      Actually, I think the point is that often in life we just don't have the information to make a rational decision, so we rationalize it afterwards.

      It's not actually that "it doesn't matter what road you take", it's more like it's impossible to know what road to take, and it's impossible to know if there will or will not be consequences. But we like to act as if we actually knew.

    16. Re:The Road Not Taken by Anonymous Coward · · Score: 2, Funny

      Clearly the poem was originally two different poems written by two different people: call them person 'H' for 'honest' and person 'C' for 'cynic'. At some later date, the 'H' text and the 'C' text were merged, with modifications, by scribes. In truth, we can't be sure whether the person traditionally held to be the author, Frost, was 'H' or 'C', or whether he, in fact, wrote any part of the text at all.

    17. Re:The Road Not Taken by looie · · Score: 5, Insightful

      Not sure where you took your "poetical exegesis" class, but you should ask for a refund.

      The narrator as "vain, shallow individual" is entirely a character pulled out of your hindquarters, as there is nothing in the text of the poem to lead to that conclusion.

      The poem is simply a reflection on how we, as individuals, make choices in life. Some of us choose to take the direction taken by most of those around us. That might be university, family, job, retirement in FL. Some of us choose to turn aside from that direction and try another path. Programming a PDP to play "Space Travel," for example. Or writing an operating system "just for fun."

      Frost's suggestion is that these choices of path may seem insignificant at the time -- both paths being nearly the same; but that, as "way leads on to way," there's no going back and thus we may find ourselves down a path that leads to unexpected places. When Linus Torvalds wrote Linux, he could not know that "the path less traveled" would lead to fame and fortune, literally. The college kids who created Slashdot could not know it would make them rich.

      In fact, the point of the poem is exactly that it does matter which path you take. But that you don't always know how your choice is going to turn out. Frost himself might have continued his career as a teacher, a stable and certain means of supporting his family. Instead, he chose to focus on his poetry. He took a chance. And it worked well for him.

      mp

      --
      "The secret to strong security: less reliance on secrets." -- Whitfield Diffie
    18. Re:The Road Not Taken by dward90 · · Score: 1

      It sounds like, to me, you're saying that "The Road Less Traveled" is actually Robert Frost pre-mocking hipsters in 1920. I actually quite like this interpretation.

      --
      My other sig is clever.
    19. Re:The Road Not Taken by Hatta · · Score: 1, Flamebait

      Everyone misunderstands every poem because poetry has no fixed meaning. It means whatever you want it to mean.

      If people wish to be plainly understood, they write in prose. If they wish to have their meaning debated while not actually communicating anything, they write poetry.

      --
      Give me Classic Slashdot or give me death!
    20. Re:The Road Not Taken by tibit · · Score: 1

      I only have a problem when people say "xxx misunderstands that poem". It's art. I don't give shits nor giggles about what the author wanted to achieve. If it needs explaining, or if I lack whatever context, too freaking bad. It's art. Everyone consumes it differently, some with chopsticks, some with bare hands, and some in microgravity. Yes, of course, every poem was written for a reason. Some authors will forget what it was, though, and some don't care if anyone else knows. If the author achieves some popularity within literary circles, there will be someone to write an analytical paper, or a biography. Then you can read what the paper's author/biographer thought that the author thought when writing the poem. I usually treat those 3rd person literary accounts with a sand bucket, not just a grain of sand.

      I do agree somewhat with your analysis, but it's -- I think -- just a coincidence.

      --
      A successful API design takes a mixture of software design and pedagogy.
    21. Re:The Road Not Taken by tibit · · Score: 1

      And that is, ladies, gentlemen, and basement dwellers, why we need posts by people with 4 digit UIDs.

      --
      A successful API design takes a mixture of software design and pedagogy.
    22. Re:The Road Not Taken by tehcyder · · Score: 1

      Well done, I hope whoever modded the OP up now regrets their stupidity.

      How anyone can interpret the lines " I took the one less traveled by,/ And that has made all the difference." to mean that it made no difference which one he took, is totally beyond me.

      "All the difference" is precisely the opposite of "no difference".

      --
      To have a right to do a thing is not at all the same as to be right in doing it
    23. Re:The Road Not Taken by tehcyder · · Score: 1

      Everyone misunderstands every poem because poetry has no fixed meaning. It means whatever you want it to mean.

      If people wish to be plainly understood, they write in prose. If they wish to have their meaning debated while not actually communicating anything, they write poetry.

      That is the sort of rubbish you get on slashdot when people with no understanding of a particular art form try to treat it as though it's maths, I'm afraid.

      If you seriously think that poems like the Divine Comedy, Paradise Lost, Prelude or the Waste Land do not actually communicate anything, you have presumably never read, and certainly never understood them.

      Poetry, like all arts, functions in the realm of what Keats called "Negative Capability, that is when man is capable of being in uncertainties, Mysteries, doubts without any irritable reaching after fact & reason."

      --
      To have a right to do a thing is not at all the same as to be right in doing it
    24. Re:The Road Not Taken by brusk · · Score: 2

      The meaning of the poem lies in the NUL character at the end.

      --
      .sig withheld by request
    25. Re:The Road Not Taken by coldsalmon · · Score: 1

      To me, the salient point has always been that we are doomed to a tiny fraction of possible experience. In that sense, the poem is inapposite here since it does not speak to the relative merits of one path over the other. The point is not that one path is superior to another, but that we always lose something by making any choice. Here is an interpretation of the poem by XKCD: http://xkcd.com/584/

    26. Re:The Road Not Taken by cecille · · Score: 5, Insightful

      Would anyone care to join me
      in flicking a few pebbles in the direction
      of teachers who are fond of asking the question:
      "what is the poet trying to say?"

      as if Thomas Hardy and Emily Dickinson
      had struggled but ultimately failed in their efforts -
      inarticulate wretches that they were,
      biting their pens and staring out of the windows for a clue.

      Yes, it seems that Whitman, Amy Lowell
      and the rest could only try and fail,
      but we in Mrs. Parker's third-period English class
      here at Springfield High will succeed

      with the help of these study questions
      in saying what the poor poet could not,
      and we will get all this done before
      that orgy of egg salad and tuna fish known as lunch.

      -- from Billy Collins "The Effort"

      --
      ...no two people are not on fire.
    27. Re:The Road Not Taken by corbettw · · Score: 1

      Poetry is about emotion, not ideas or meaning. Poets, like all artists, seek to get a response from people. And because everyone is an individual, with their own prejudices, personal histories, and personalities, that response is going to be different from person to person. There's no "right" response to art; even indifference is a response.

      This is why I personally love Pollock and other post-modern, abstract art: it's virtually guaranteed to make people stop and look at it, even if only to deride it as pointless.

      --
      God invented whiskey so the Irish would not rule the world.
    28. Re:The Road Not Taken by Hatta · · Score: 1

      I won't claim I have any understanding of poetry. But it's not for lack of trying. I was a really active participant in all my English classes from middle school through college, trying to figure out how this is supposed to work. Nobody has ever adequately explained to me why poets don't just come out and say what they mean, instead of hiding it behind layers of metaphor.

      I ask questions like "what does this mean?" "how did you figure that out?" "how do you know that is right, and that it doesn't mean this other thing?" "how do you know that it means anything at all, and the author isn't just having a good laugh at your expense?" Never once have I received a satisfactory explanation.

      At some point you just have to conclude that there's nothing to be found. It's all bullshit. The people who like poetry prefer the appearance of meaning to actual meaning. That's all there is to it.

      Poetry, like all arts, functions in the realm of what Keats called "Negative Capability, that is when man is capable of being in uncertainties, Mysteries, doubts without any irritable reaching after fact & reason."

      This is what I'm talking about. It certainly sounds impressive, but communicates nothing.

      --
      Give me Classic Slashdot or give me death!
    29. Re:The Road Not Taken by MuValas · · Score: 1

      You're ignoring the rest of the poem and focusing on the last two lines. You are overlaying your own beliefs onto the poem, and saying that one path was more traveled than the other. In fact, the poem states:
      "Then took the other, as just as fair",
      and "Though as for that the passing there / Had worn them really about the same"
      and "And both that morning equally lay / In leaves no step had trodden black."
      There is not one path that is "taken by most of those around us." In fact, the only indication that they are different is that one is talked about as having undergrowth, whereas the other was a bit "grassy". In other words, he actually took the one that seemed slightly nicer, which is pretty much the opposite of what people state the poem is about.

      The poem is all about rationalizing your choices, something we are very good at doing, and why not? It's nearly impossible in most cases to figure out what choice was the best one in hindsight.

    30. Re:The Road Not Taken by tgd · · Score: 1

      Two roads diverged in a yellow wood,
      And sorry I could not travel both
      And be one traveler, long I stood
      And looked down one as far as I could
      To where it bent in the undergrowth;

      Then took the other, as just as fair,
      And having perhaps the better claim,
      Because it was grassy and wanted wear;
      Though as for that the passing there
      Had worn them really about the same,

      And both that morning equally lay
      In leaves no step had trodden black.
      Oh, I kept the first for another day!
      Yet knowing how way leads on to way,
      I doubted if I should ever come back.

      I shall be telling this with a sigh
      Somewhere ages and ages hence:
      Two roads diverged in a wood, and I—
      I took the one less traveled by,
      And that has made all the difference.

      - Robert Frost, 1920

      I sure hope that is out of copyright, you potential thief!

    31. Re:The Road Not Taken by DrBoumBoum · · Score: 1

      Actually, I think the real question is: why did the chicken cross that road?

    32. Re:The Road Not Taken by Anonymous Coward · · Score: 0

      > Poetry is about emotion, not ideas or meaning

      Egad, you really have no idea what you're talking about, do you?

    33. Re:The Road Not Taken by doug141 · · Score: 1

      How anyone can interpret the lines " I took the one less traveled by,/ And that has made all the difference." to mean that it made no difference which one he took, is totally beyond me.

      pay attention to the narrative tense in the last stanza. The narrator isn't saying "I took a path that was different and important"; he's saying "I took one of two equal paths, and I think that IN THE FUTURE I'm going to tell people it was different and important".

    34. Re:The Road Not Taken by Tarsir · · Score: 1

      Nobody has ever adequately explained to me why poets don't just come out and say what they mean, instead of hiding it behind layers of metaphor.

      Because hiding the meaning is precisely the point. It's not supposed to be a dissertation with a well supported thesis; it's a clever little puzzle that people enjoy composing and analyzing.

    35. Re:The Road Not Taken by Hatta · · Score: 1

      Then what's the point of teaching it in school? If its only redeeming factor is that "eh, it's kinda fun for some people", then let them do it on their own time.

      What really gets me is when English teachers say things like "there's no right answer", and then proceed to mark your paper off for not having the right answer. It's like the whole thing is some sort of trap or cruel joke.

      --
      Give me Classic Slashdot or give me death!
    36. Re:The Road Not Taken by barlevg · · Score: 1

      The Wiki seems to indicate that this issue has been debated now for generations. I don't think we're going to settle it today.

    37. Re:The Road Not Taken by networkBoy · · Score: 0

      Yeah, but he was at the bottom of his class. Only 4 people were below him.

      --
      whois gawk date unzip strip find touch finger mount join nice man top fsck grep eject more yes exit umount sleep dump
    38. Re:The Road Not Taken by Quirkz · · Score: 1

      If people wish to be plainly understood, they write in prose.

      I've definitely read more than enough unintelligible prose. Much of it accidentally so, but plenty of it intentionally vague. Joyce is an easy example, but it's not even remotely uncommon.

    39. Re:The Road Not Taken by Waffle+Iron · · Score: 3

      Whose code is this I think I know
      'Tis filled with buffer overflows
      His pointer is not stopping here
      As the megs of garbage data grow

      My CPU must think it queer
      To scan for null bytes not found here
      Between the stack and blocks of code
      Canary values, segfault near

      It gives the PC bell a quake
      To ask if there is some mistake
      The only other sound's the sweep
      Of swapping pages disk head shake

      The stack is swelling very fast
      But allocated buffer's past
      And megs to fill before a crash
      And megs to fill before a crash

    40. Re:The Road Not Taken by N0Man74 · · Score: 1

      pay attention to the narrative tense in the last stanza. The narrator isn't saying "I took a path that was different and important"; he's saying "I took one of two equal paths, and I think that IN THE FUTURE I'm going to tell people it was different and important".

      He didn't say the paths were two equal paths. He said that the two paths looked equal for as far as he could see from the point where the roads diverged, as he said "And looked down one as far as I could."

      Also, "hence" means from now on, not just some future.

      I don't really get this line of thinking. Are you suggesting that we should infer that he is writing a poem about how his choice was inconsequential, but he is going to go forth and mislead people into thinking that it did have consequence?

    41. Re:The Road Not Taken by Hatta · · Score: 1

      Good point. Prose is a necessary, but not sufficient condition for intelligibility.

      --
      Give me Classic Slashdot or give me death!
    42. Re:The Road Not Taken by OldSoldier · · Score: 1

      The narrator as "vain, shallow individual" is entirely a character pulled out of your hindquarters, as there is nothing in the text of the poem to lead to that conclusion.

      I've heard this too. I believe it was an NPR story. That story (whoever it was) was relayed by a friend of Frost's who said Frost was irritated at a colleague who behaved just as the commenter posted which inspired Frost (in part?) to write that poem. Sorry I can't find the link.

    43. Re:The Road Not Taken by SuricouRaven · · Score: 0

      Nope. That poem is copyrighted (In the US, anyway) until 70 years after the death of the author. That will be in 2033.

    44. Re:The Road Not Taken by bab72 · · Score: 1

      I never thought I'd see the day that we'd be discussing poetry on /.

      --
      Bab72 (Not my real name)
    45. Re:The Road Not Taken by doug141 · · Score: 1

      "hence" means from now on, not just some future.

      I don't really get this line of thinking. Are you suggesting that we should infer that he is writing a poem about how his choice was inconsequential, but he is going to go forth and mislead people into thinking that it did have consequence?

      The poem says "I shall be telling this with a sigh/ Somewhere ages and ages hence". Sounds like some distant future to me.

      It's not that his choice was "inconsequential." It's that he had no basis for deciding one path or the other based on information available at the time.

      Based on this interpretation, the thinking is: His choice was a random pick (not necessarily inconsequential) based on the available data, and he says long from now he's going to tell it with some.... embellishment.

    46. Re:The Road Not Taken by cforciea · · Score: 1

      I personally like the other other interpretation where the narrator is an optimist. He hasn't yet gone down the path long enough to really know that it has made any positive difference at all, but he's talking to himself in the last stanza and getting excited about the possibilities of the path he has chosen. Interestingly, this makes the poem itself less optimistic and more contemplative than the more common interpretation.

      Really, though, it is hard to claim that everyone misunderstands the poem unless you have a source for the claim that one interpretation is what was intended by the author. In the absence of other evidence, I can't see claiming that one interpretation is the "right" one when the poem itself is so clearly ambiguous (and possibly intentionally so?).

      Disclaimer: I know little about the life and works of Robert Frost, so maybe there is other evidence and I just don't know about it.

    47. Re:The Road Not Taken by omnichad · · Score: 1

      What really gets me is when English teachers say things like "there's no right answer", and then proceed to mark your paper off for not having the right answer. It's like the whole thing is some sort of trap or cruel joke.

      That very thing happened to me, even in college. I suppose while there's no right answer, there's plenty of wrong ones. I just don't know how to tell the difference.

    48. Re:The Road Not Taken by Anonymous Coward · · Score: 0

      Does it matter what the poet/author meant? Perhaps to a scholar, but the beauty of poetry is its ambiguity and its openness to many meanings. More than one author/poet has admitted that they learned that something was in their poem that they had not "meant", because some reader saw that.

    49. Re:The Road Not Taken by drawfour · · Score: 2

      Since it was published before 1923, it's already in the public domain. See the footnote at the bottom of the wikisource page for the poem, and then you can follow the links from there if you care to read more.

    50. Re:The Road Not Taken by jthill · · Score: 1

      I think the poem makes more sense if you focus on his criteria for choosing rather than this specific choice. You can't fit a lifetime of choices into a poem, but you can exemplify how a man makes choices. Faced with a matter of pure preference, given the slightest evidence that one way is less traveled than another this one will take it, as having "perhaps the better claim". This specific moment is I imagine the moment he became self-aware.

      --
      As always, all IMO. Insert "I think" everywhere grammatically possible.
    51. Re:The Road Not Taken by hedronist · · Score: 1

      Brilliant comment ... simply brilliant. I was going to try to come up with something about a null hypothesis, but I am stunned into silence by the beauty of this.

    52. Re:The Road Not Taken by dohnut · · Score: 1

      I tend to agree that the point is that the path you choose is ultimately inconsequential to the end that it "made all the difference." Where the difference being, clearly (in my mind), one's uniqueness. It also does not imply that the writer is better or worse for having chosen said path. The "sigh" he makes could be regret for himself or for those who chose the other path.

      Since everyone is unique, clearly the path(s) we have chosen "made all the difference" regardless of whether or not those paths were the ones less traveled.

      So, I don't think he's embellishing. He's simply stating a universal truth and it is just as true whether the statement be made at the moment he chose or "ages and ages hence." The choices you make in life are what ultimately define you.

      --
      Stupider like a fox! - H.S.
    53. Re:The Road Not Taken by Mr.+Slippery · · Score: 1

      Because hiding the meaning is precisely the point. It's not supposed to be a dissertation with a well supported thesis; it's a clever little puzzle that people enjoy composing and analyzing.

      No, it is not. A poem is most certainly not a riddle, and any so-called poet who attempts to pervert poetry in such a manner ought to be keel-hauled.

      A poem is an expression with emotional content, an attempt to illustrate or convey a state of consciousness. As Emerson tells us, "For it is not metres, but a metre-making argument that makes a poem,--a thought so passionate and alive that like the spirit of a plant or an animal it has an architecture of its own, and adorns nature with a new thing. The thought and the form are equal in the order of time, but in the order of genesis the thought is prior to the form. The poet has a new thought; he has a whole new experience to unfold; he will tell us how it was with him, and all men will be the richer in his fortune."

      A poem does not "hide" behind metaphor, it uses metaphor as a means of communication. Now, in order to understand a metaphor, you need some background knowledge about the metaphier; but that's not the poet hiding anything from you. When I say "Oh, that's like Darmok and Jalad at Tanagra," and you don't know who Darmok and Jalad are, we have a communications fail, but not because I'm hiding anything. Maybe I shouldn't expect you to know who Darmok and Jalad are, or maybe your education has been deficient, but there's no attempt to create a puzzle.

      (Some recent thoughts about poetry here.)

      --
      Tom Swiss | the infamous tms | my blog
      You cannot wash away blood with blood
    54. Re:The Road Not Taken by smelch · · Score: 1

      Yes, the problem I've always had with poetry is that the meanings I get are often very, very different from the rest of my peers. I usually say something rational based on the content and the world when asked about the meaning of a poem that makes perfect sense to me, and get very lukewarm responses. Then somebody else will say they take the meaning as something wild-assed derived from a feeling they had when they read it, not based entirely within the context, and logically not sound. If their answer really reflected the meaning of the text, then the text is full of broken metaphors, and this is called clever, or a good observation.

      No, it's not. It's a tangent sprung up from the original topic. If you are going to use a poem to start a discussion, fine. I understand that. But when you ask about the meaning of a poem in a classroom it often sounds like stoners guessing at the meaning of life, based not on what they've seen of the world, but whatever damned thing pops into their mind. When they reach an idea that sounds "deep" they conclude that must be the meaning. It doesn't make sense.

      --
      If I can just reach out with my words and touch a butthole, just one, it will all be worth it.
    55. Re:The Road Not Taken by geekoid · · Score: 1

      You might have a point if the author himself had never commented on the poem, but he did and we know what he meant.
      Anyone who says otherwise is probably sitting in a coffee shop finding 'deep meaning' in their latte.

      --
      The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
    56. Re:The Road Not Taken by geekoid · · Score: 1

      *cough1916cough*

      --
      The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
    57. Re:The Road Not Taken by blair1q · · Score: 1

      And carrying a BFG that wasn't available in the original.

    58. Re:The Road Not Taken by quickOnTheUptake · · Score: 1

      It would be funny, if not so close to truth.

      --
      Mod points: Guaranteed to remove your sense of humor.
      Side effects may include gullibility and temporary retardation
    59. Re:The Road Not Taken by Tarsir · · Score: 1

      (...) it's a clever little puzzle that people enjoy composing and analyzing.

      No, it is not. A poem is most certainly not a riddle, and any so-called poet who attempts to pervert poetry in such a manner ought to be keel-hauled.

      That was a metaphor. I meant that relative to a poet's other options for communicating, like essays and dissertations and other prose, a poem is a puzzle because it doesn't go to the same lengths to set forth ideas clearly and avoid ambiguity. I see to my chagrin that Hatta, the GP, said almost exactly that in a previous post. You're right though; neither of us considered the requirement to convey emotional content, which prose might not do as effectively as a poem.

    60. Re:The Road Not Taken by eyenot · · Score: 1

      I read and re-read the parent and have to agree. The Frost poem is about an individual whose self-absorption leaves them without an ability to make decisions about the external world. They're so pent up in fascination with the fact that first impressions can be broken by comparisons, that they fail to even relate the decision that was ultimately made. Instead, they go a step further in their demented anxiety and begin self-absorbing to the point of getting hung up on things that haven't even happened, yet. Then they go on to suppose that they'll never get back to the reality of immediacy and relevance. The actual action taken post-narration by the narrator could have been to sit down on the spot and die, for all you know, as a perfect example of analytical paralysis -- which is what the poem IS about, the antithesis of decision-making, not (as you clearly located affixed to a surface someplace betwixt your haunches, detached with a wet sucking noise and hurled at your audience without care or compassion) "a reflection on how we, as individuals, make choices".

      --
      "Stratigraphically the origin of agriculture and thermonuclear destruction will appear essentially simultaneous" -- Lee
    61. Re:The Road Not Taken by Anonymous Coward · · Score: 0

      Seems obligatory in this case.
      http://xkcd.com/451/

    62. Re:The Road Not Taken by Mr.+Slippery · · Score: 1

      I meant that relative to a poet's other options for communicating, like essays and dissertations and other prose, a poem is a puzzle because it doesn't go to the same lengths to set forth ideas clearly and avoid ambiguity

      Ah, but ambiguity is what makes both quantum computing and poetry work. An essay, ideally, means only one sharp well-defined thing, where a poem (or a literary work of prose) can simultaneously hold many meanings.

      "The tao that can be told / is not the eternal Tao," says the ol' master in a famous Chinese poem, and "The facts are useful and real . . . . they are not my dwelling . . . . I enter by them to an area of the dwelling" says crazy Uncle Walt. These statements are ambiguous not because Lao Tzu or Whitman were careless, or because they were creating puzzles, or because they were not capable essayists (can't speak for Lao Tzu, but Whitman could write clear prose), but because ambiguity was part and parcel of their subject matter.

      --
      Tom Swiss | the infamous tms | my blog
      You cannot wash away blood with blood
    63. Re:The Road Not Taken by LinuxIsGarbage · · Score: 1

      Barely a 4 digit UID. And barely posts.

    64. Re:The Road Not Taken by IICV · · Score: 1

      I think the poem makes more sense if you focus on his criteria for choosing rather than this specific choice.

      No, it doesn't. What was his criteria for choosing one path over the other? That it was "less traveled by".

      Which path was less traveled by? Neither.

      Though as for that the passing there
      Had worn them really about the same

      The poem is simpler and "nicer" if you reduce everything down to a single life affirming sound bite, but then what's the point of having an entire poem?

      If it still doesn't make sense to you, pretend he's talking about socks. There's the black socks, and the navy socks. He picks the ones he uses less, and says that "ages and ages hence, I shall be telling this with a sigh; two socks diverged in a drawer, and I - I took the ones less traveled by. And that has made all the difference."

      You'd laugh at how pretentious and shallow he was being!

      And that's the point of this poem.

    65. Re:The Road Not Taken by Anonymous Coward · · Score: 0

      I like how the unicorn shot lasers out of his eyes at the robot council and traveled back in time, too!

    66. Re:The Road Not Taken by jthill · · Score: 1

      Besides lalalaing past my (and others') point, you're pretending the rest of the poem doesn't exist and now rewriting its metaphor?

      Sorry, I think there's something about what's really in this poem that disturbs you, deeply.

      --
      As always, all IMO. Insert "I think" everywhere grammatically possible.
    67. Re:The Road Not Taken by hoggoth · · Score: 1

      The real wisdom is that most people, upon reaching a fork in the road, sit down and watch TV, eat some Bon-Bons, and forget where they had meant to go until it gets dark, then go to sleep, only to repeat the same thing the next day.

      Take either path. He says they look equally fair. Just pick one and GO. Your destiny lies on the path, not at the fork.

      To use an above poster's example: Linus Torvalds sat down and wrote Linux instead of reading about what other people were doing.

      Ok, gotta go, I have a hundred tabs open to different Slashdot articles I want to read...

      --
      - For the complete works of Shakespeare: cat /dev/random (may take some time)
    68. Re:The Road Not Taken by IICV · · Score: 1

      You said "focus on his criteria for choosing; by doing that we can see this that and the other".

      My response was:

      1. His criteria for choosing is that he "took the road less traveled by"
      2. But he also said that "the passing there/Had worn them really about the same"
      3. Therefore, his criteria for choosing was meaningless; both roads were equally traveled by.
      4. Furthermore, from my original reply: he doesn't actually know that taking this particular path will make much of a difference; he talks about how in the future he imagines he'll tell people that, but it hasn't actually happened yet.

      How is that "lalalaing" past your point? You said "if A then B C D"; I said "well, A isn't true and here's why". I think that meets your point head on, honestly.

      And yes, there is something in this poem that disturbs me deeply: it's the fact that people take it as a shallow, affirming soundbite just because they like the couplet at the end, and completely skip over the rest of it. There's some interesting observations in there, about the way indecision can make unimportant things loom in the imagination and the way everyone is a hero in his own story, but nobody ever sees them because they're too busy sitting around feeling self-satisfied that they take the road less traveled by - when the point of the poem is that in the end, it doesn't matter!

    69. Re:The Road Not Taken by IICV · · Score: 2

      Frost's suggestion is that these choices of path may seem insignificant at the time -- both paths being nearly the same; but that, as "way leads on to way," there's no going back and thus we may find ourselves down a path that leads to unexpected places. When Linus Torvalds wrote linux, he could not know that "the path less traveled" would lead to fame and fortune, literally. The college kids who created Slashdot could not know it would make them rich.

      In fact, the point of the poem is exactly that it does matter which path you take. But that you don't always know how your choice is going to turn out. Frost himself might have continued his career as a teacher, a stable and certain means of supporting his family. Instead, he chose to focus on his poetry. He took a chance. And it worked well for him.

      You know, just because you have a positive interpretation of the poem doesn't mean that it's more supported by the text than a negative one.

      Basically, Robert Frost was trolling, and you got bit by it. Why do you think he ended the poem with such a great couplet? Even though it makes such little sense in the context of the rest of the poem? (the dude's thinking about how he's going to talk about it in the future, the rest of the poem is about how the paths are equal) Because he knew it would catch people's attention, and that then they'd look in to the poem some more and see the dissonance. We've just gotten to the point where the popular interpretation is so positive people just ignore the incongruities.

      Look: the narrator is vain and shallow because he's dithering about a minor choice in his life, and in his head it's this giant, life-altering moment.

      Imagine if someone said to you "It took me a long time to decide if I should wear navy socks or black socks this morning" - you'd think they were kinda silly for even thinking about it.

      If they said "I'll tell people about this day - the day I wore black socks, and not navy socks - I'll tell them with a sigh, that I took the pair less traveled by, and that has made all the difference", you'd think they were, well, vain and shallow. Their choice in socks is the most important thing ever! Oh em gee!

      Look, all those things you got out of the poem, those positive life-affirming things about making choices and stuff - that's all great. It really is. There's definitely a place for that in everyone's life.

      But that's not what this poem is about. You're projecting what you want to see onto the poem, instead of taking it in as a blank slate and seeing what the author wrote.

      I mean, I know what that's like. I was disappointed the first time I read the poem. I'd heard people - people like you, in fact - talk about how it's all about taking the road less traveled and being your own person and taking chances, so when I realized I could just read it on my own I was kinda excited, I thought it was gonna be awesome with him thinking about it and then striking out on his own.

      But no, he doesn't! I was prepared for the "road less traveled" to be some third option, not going through either of the paths but striking out on his own. Instead, the narrator sits and dithers and thinks about it and just picks one basically at random. I mean, what bullshit is this? If you're going to take the road less traveled, then it damn well better be less traveled! If you're picking from two clearly laid out choices that other people have walked through, that's not the road less traveled - there's no chances, there's no being your own person, none of that stuff. You're just picking which footsteps to follow in, and then rationalizing it afterwards as having been "the road less traveled".

      The last lines are ironic, and Robert Frost is spinning in his grave singing the trolololo song.

      (funnily enough, the poem also predicts hipsters before they were popular)

    70. Re:The Road Not Taken by jthill · · Score: 1

      Dude, you just did it again. Your nr 3 is non sequitur. You want that "meaningless" result so hard it shuts off your brain, and your rant about self-satisfied people is straight out of left field. You brought that all yourself. People will use any excuse for being self-satisfied, there's nothing special about this one. Fatheaded people take pride in stupidity, in intellect, in talent, in mendacity, any character trait at all. But "telling this with a sigh, ages and ages hence" is just superbly ambiguous. It's a Rorschach test, the important thing is what you see in it. One thing I see in it is the possibility that he knows he's let himself in for the ire of dyed-in-the-wool, abusive conformists. But conscious mockery or Palinesque unconscious parody? I just don't see that in the poem, the character trait he's discovering is legitimately consequential.

      Did I just work myself around to saying it could encompass the mockery you're seeing? I think I did.

      --
      As always, all IMO. Insert "I think" everywhere grammatically possible.
    71. Re:The Road Not Taken by rhalstead · · Score: 1

      Frost had it right. Each and every decision we make will in some way affect the rest of our lives. Even though a decision may be as simple as a fork in the road, it may have a significant impact later on, as did EOF.

    72. Re:The Road Not Taken by actionslacks42 · · Score: 1

      I took a class in college that tried to explain this, of course by reading James Joyce. Being a lover of prose, I did find it frustrating to read a story and think I understood it and the characters, only to go to class and find out that it meant something completely different, with a lot of the meaning hidden in literary references to things I had never read.

      The interesting part of the class though for me was the introduction to the concept of Abductive Reasoning.
      http://en.wikipedia.org/wiki/Abductive_reasoning

      "The term refers to the process of arriving at an explanatory hypothesis. Peirce said that to abduce a hypothetical explanation a from an observed surprising circumstance b is to surmise that a may be true because then b would be a matter of course"

      The answer is 42.
      Question?
      * The answer to the meaning of life.
      * No stupid. It is obviously 41 + 1.
      * No it is 42 + 0
      * Etc...

      For us literal minded people at least we have an equation.

    73. Re:The Road Not Taken by pgn674 · · Score: 1

      He's my first cousin, 5 times removed, or my great-great-great grandmother's cousin. Related by blood.

  2. Missed the point by mgiuca · · Score: 5, Informative

    Interesting, but I think this article largely misses the point.

    Firstly, it makes it seem like the address+length format is a no-brainer, but there are quite a lot of problems with that. It would have had the undesirable consequence of making a string larger than a pointer. Alternatively, it could be a pointer to a length+data block, but then it wouldn't be possible to take a suffix of a string by moving the pointer forward. Furthermore, if they chose a one-byte length, as the article so casually suggests as the correct solution (like Pascal), it would have had the insane limit of 255-byte strings, with no compatible way to have a string any longer. (Though a size_t length would make more sense.) Furthermore, it would be more complex for interoperating between languages -- right now, a char* is a char*. If we used a length field, how many bytes would it be? What endianness? Would the length be first or last? How many implementations would trip up on strings > 128 bytes (treating it as a signed quantity)? In some ways, it is nice that getaddrinfo takes a NUL-terminated char* and not a more complicated monster. I'm not saying this makes NUL-termination the right decision, but it certainly has a number of advantages over addr+length.

    Secondly, this article puts the blame on the C language. It misses the historical step of B, which had the same design decision (by the same people), except it used ASCII 4 (EOT) to terminate strings. I think switching to NUL was a good decision ;)

    Hardware development, performance, and compiler development costs are all valid. But on the security costs section, it focuses on the buffer overflow issue, which is irrelevant. gets is a very bad idea, and it would be whether C had used NUL-terminated strings or addr+len strings. The decision which led to all these buffer overflow problems is that the C library tends to use a "you allocate, I fill" model, rather than an "I allocate and fill" model (strdup being one of the few exceptions). That's got nothing to do with the NUL terminator.

    What the article missed was the real security problems caused by the NUL terminator. The obvious fact that if you forget to NUL-terminate a string, anything which traverses it will read on past the end of the buffer for who knows how long. The author blames gets, but this isn't why gets is bad -- gets correctly NUL-terminates the string. There are other, sneaky subtle NUL-termination problems that aren't buffer overflows. A couple of years back, a vulnerability was found in Microsoft's crypto libraries (I don't have a link unfortunately) affecting all web browsers except Firefox (which has its own). The problem was that it allowed NUL bytes in domain names, and used strcmp to compare domain names when checking certificates. This meant that "google.com" and "google.com\0.malicioushacker.com" compared equal, so if I got a certificate for "*.com\0.malicioushacker.com" I could use it to impersonate any legitimate .com domain. That would have been an interesting case to mention rather than merely equating "NUL pointer problem" with "buffer overflow".
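
    A minimal sketch of that failure mode, not the actual library code: names_equal_cstr and names_equal_counted are hypothetical helpers, and the names are the ones from the example above. The first stops at the first NUL; the second treats the name as counted bytes, so an embedded NUL is just another byte.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Compares as C strings: everything after the first NUL is ignored. */
    bool names_equal_cstr(const char *a, const char *b) {
        assert(a != NULL && b != NULL);
        return strcmp(a, b) == 0;
    }

    /* Compares as counted byte strings: embedded NULs are ordinary bytes. */
    bool names_equal_counted(const char *a, size_t alen,
                             const char *b, size_t blen) {
        assert(a != NULL && b != NULL);
        return alen == blen && memcmp(a, b, alen) == 0;
    }
    ```

    Fed "google.com" and the full bytes of "google.com\0.malicioushacker.com", the strcmp version reports a match and the counted version does not.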

    1. Re:Missed the point by Anonymous Coward · · Score: 5, Informative
    2. Re:Missed the point by mgiuca · · Score: 1

      Thanks! +1

    3. Re:Missed the point by MrEricSir · · Score: 1

      "...it would have had the insane limit of 255-byte strings, with no compatible way to have a string any longer."

      Compatible with what? Seems to me they could have just used continuation bit for the size field, much the way UTF-8 works to store non-ASCII characters.

      --
      There's no -1 for "I don't get it."
    4. Re:Missed the point by snowgirl · · Score: 3, Interesting

      Not to mention the argument for "because space was at a premium" is specious, because either you had an 8-bit length prepended to the string, or you had an 8-bit special value appended to the end of the string. Both ways result in the same space usage.

      From what I read in the summary, (didn't read TFA) this whole thing sounds like a propaganda piece supporting the idea that we should use length+string, by presenting it as "this should have been a no-brainer but the idiots making C screwed up."

      As a nitpicky pedantic note though, if C had gone with length+string format, then other languages would have been written around the C standard, since most of them were written around the C standards to begin with to increase interoperability in the first place.

      --
      WARNING! This girl exceeds the MAXIMUM SAFE standards established by the FDA for BRATTINESS
    5. Re:Missed the point by Anonymous Coward · · Score: 0

      Null termination sounds lovely when you're a teenager writing assembly and doing register allocation by hand, but it's obviously bad once you've seriously thought about runtimes, like after taking an algorithms class. You shouldn't need to traverse strings to determine their lengths.

      I'd agree that C's elegance stems partially from pointers, meaning address+length strings must be implemented higher up, meaning C++. oy!

    6. Re:Missed the point by snowgirl · · Score: 2

      "...it would have had the insane limit of 255-byte strings, with no compatible way to have a string any longer."

      Compatible with what? Seems to me they could have just used continuation bit for the size field, much the way UTF-8 works to store non-ASCII characters.

      This would still make the strings incompatible, because you would only have a 127-byte string length before the "continuation bit" comes into play and you need to switch to a 15-bit string length. All the previous code written with longer-than-127-byte strings would be incompatible.

      --
      WARNING! This girl exceeds the MAXIMUM SAFE standards established by the FDA for BRATTINESS
    7. Re:Missed the point by e9th · · Score: 2

      My personal fave is strncpy(), which will silently not terminate the string if the buffer is too small, but if you give it a huge buffer it punishes you by NUL padding the string all the way to the end of the buffer.
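
      For what it's worth, a common workaround looks roughly like the sketch below. copy_terminated is a hypothetical wrapper, not a standard function: it leaves room for the terminator, writes it unconditionally, and reports truncation. The NUL padding of short strings still happens inside strncpy.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>
      #include <string.h>

      /* Hypothetical wrapper: copy src into dst[0..size), always terminating.
         Returns true if the whole source fit, false if it was truncated. */
      bool copy_terminated(char *dst, size_t size, const char *src) {
          assert(dst != NULL && src != NULL && size > 0);
          strncpy(dst, src, size - 1);   /* pads with NULs if src is short */
          dst[size - 1] = '\0';          /* strncpy alone would not do this */
          return strlen(src) < size;
      }
      ```

      Copying "abcdef" into a 4-byte buffer this way yields "abc" plus a terminator and a false return, instead of four unterminated bytes.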

    8. Re:Missed the point by snowgirl · · Score: 4, Informative

      I'm correcting myself here... apparently they weren't considering going with a 255-byte limit, but a 65535-byte limit, which would have increased the size overhead by one.

      --
      WARNING! This girl exceeds the MAXIMUM SAFE standards established by the FDA for BRATTINESS
    9. Re:Missed the point by dirtyhippie · · Score: 1

      What is so undesirable about making a string larger than a pointer?

      Also, have a look at how mysql deals with varchars. There is no 255 byte limit - when length exceeds that value, you just go to 2 bytes of length, etc. Your arguments about what type of integer to use conveniently ignore conventions like network order. In short, it is not too hard to solve. Do you really think the state of programming was so bad back then that people wouldn't test 129 byte strings?

      And no, the article didn't miss the "real security problems" caused by null termination. Where did you stop reading?

    10. Re:Missed the point by mgiuca · · Score: 2

      They could have but they didn't (e.g., in Pascal, where strings actually are limited to 255 bytes). So, history has made some worse string representations than C.

    11. Re:Missed the point by mgiuca · · Score: 2

      Good point.

      As a nitpicky pedantic note though, if C had gone with length+string format, then other languages would have been written around the C standard, since most of them were written around the C standards to begin with to increase interoperability in the first place.

      Yes, but perhaps the simplicity was partly why it caught on. The reason I raised all of the "what about..." questions was to illustrate just how many small variations in an address+length standard there could have been. Even if C had made a decision on all of those, how many implementations would have gotten it wrong?

      Not just implementations, but individual programs. Assuming that in this hypothetical universe in which C doesn't use NUL terminated strings, but still assuming that C is a low-level unsafe language in general, how would this have been any different? Unlike C++ or Java, in C, programs manually construct strings. So we wouldn't have people forgetting to NUL-terminate strings. We would instead have people forgetting to set the length field, or setting the wrong length, or being given a 257-byte string and writing a "1" in the length field due to wraparound (granted, that wouldn't often be a security risk, just a bad result). If they had decided to use a variable-length length field, people would have found some way to screw that up. I'm sure hackers would have found a way to inject a long length into a short string and thus read past the end.
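
      The wraparound case can be sketched in a few lines, assuming a hypothetical one-byte length field (the function names are made up for illustration):

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>

      /* What ends up in a one-byte length field if nobody checks the range. */
      uint8_t unchecked_length_byte(size_t n) {
          return (uint8_t)n;             /* conversion reduces n modulo 256 */
      }

      /* Checked variant: refuses lengths that do not fit in one byte. */
      bool checked_length_byte(size_t n, uint8_t *out) {
          assert(out != NULL);
          if (n > UINT8_MAX) return false;
          *out = (uint8_t)n;
          return true;
      }
      ```

      A 257-byte string goes through the unchecked version as length 1, which is exactly the sort of thing somebody would have forgotten to check.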

      At the end of the day, the problem is that C lets programmers do whatever they want with memory, not the NUL terminator. And you can't really say "they should have designed it better," because it is rather the point of C that it lets you do this.

    12. Re:Missed the point by dbc · · Score: 5, Informative

      Oh, Lordy, if you had ever programmed in a language with a 255 character limit for strings you would praise $DIETY every time you use a C string. Dealing with length limited strings is the largest PITA of any senseless and time-wasting programming task.

      Suppose C had originally had a length for strings? The only thing that makes sense is for the string length count to be the same size as a pointer, so that it could effectively be all of memory. A long was, by the convention of the day (though never guaranteed by the language standard), large enough to hold a pointer that has been cast into it. So string length computations all become longs. Not such a big deal for most of life... until.... 64 bit addressing. Then all sorts of string breakage occurs.

      The bottom line is that in an application programming language strings need to be atomic, as they are in Python. You just should not care how strings are implemented, and you should never worry about a length limit. The trouble is, C is a systems programming language, so it is imperative that the language allow direct access to bit-level implementation. If you chose to use a systems programming language for application programming, well, then it sucks to be you. So why did we do that for so long? Because all the other alternatives were worse.

      Hell, I've used languages where the statement separator was a 12-11-0-7-8-9 punch. (Bonus points if you can tell me what that is and how to make one.) So a NUL terminated string looks positively modern compared to that.

    13. Re:Missed the point by alta · · Score: 1

      It was all good to the end there, then you started sending me these coded ssl certs, and I think you just hacked my computer. damn you smart people and your buffer overflows.

      --
      Do not meddle in the affairs of sysadmins, for they are subtle, and quick to anger.
    14. Re:Missed the point by Rakishi · · Score: 1

      Fail, just fail.

      Also, have a look at how mysql deals with varchars. There is no 255 byte limit

      Before MySQL 5.0.3 the limit was 255 and 65535 afterward.

      when length exceeds that value, you just go to 2 bytes of length, etc.

      It does this because each column defines the maximum length for the varchar, and the number of bytes used for the length is fixed for each column. This, however, is also overhead: the size of the length field needs to be stored for each variable. In C this means that each variable now has even more overhead (the actual amount depending on how you encode such information).

    15. Re:Missed the point by mgiuca · · Score: 1

      Note that my post was not necessarily saying that NUL was the right decision. Just that it isn't a no-brainer -- going the other route has a lot of complications.

      What is so undesirable about making a string larger than a pointer?

      It would mean that the C library would need to declare a "string" struct instead of using char*. Now rather than passing a char* as an argument, you would have to decide whether it's worth passing the two word "string" struct, or a string* pointer (allowing it to fit into a register). It makes things more complicated.
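
      Roughly what that would look like, as a sketch only: struct counted_str and its helpers are hypothetical, not anything from a real C library. It shows the two-word value that would have to be passed around, and that with an address + length pair a suffix is still cheap, because both fields are adjusted without copying.

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <string.h>

      /* Hypothetical address + length string: two words instead of one. */
      struct counted_str {
          const char *data;
          size_t      len;
      };

      /* Wrap an existing NUL-terminated string (walks it once for the length). */
      struct counted_str cs_from_cstr(const char *s) {
          assert(s != NULL);
          struct counted_str r = { s, strlen(s) };
          return r;
      }

      /* Suffix starting at offset n: adjust both fields, no copying. */
      struct counted_str cs_suffix(struct counted_str s, size_t n) {
          assert(n <= s.len);
          struct counted_str r = { s.data + n, s.len - n };
          return r;
      }
      ```

      Every function taking a string then has to choose between this struct by value and a pointer to it, which is the extra decision described above.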

      Also, have a look at how mysql deals with varchars. There is no 255 byte limit - when length exceeds that value, you just go to 2 bytes of length, etc. Your arguments about what type of integer to use conveniently ignore conventions like network order. In short, it is not too hard to solve.

      No, it isn't too hard to solve. But it is non-trivial. Dealing with NUL is significantly simpler than dealing with length fields, and there are significantly fewer sources for confusion. Remember that in C, programmers fabricate their own strings (there is a minimal string library, but often you will see people just allocating memory for strings, populating them, and storing a '\0' on the end). If you wanted the standard to use a variable-length length as you suggest, you would need to make sure that all the programmers correctly store and parse variable-length strings. Of course they could get it right, but there are lots of ways they could get it wrong. The same applies to NUL.

      Here's a question: How much memory do you allocate for a string of N bytes? The NUL-termination answer: N + 1. The answer for your mysql variable-length length scheme: N + (N < 128 ? 1 : N < 16384 ? 2 : N < 2097152 ? 3 : .....) -- yes there is a correct answer, but it is much more complicated for the everyday programmer to deal with.
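
      Spelled out as code, the formula above might look like the sketch below. It assumes 7 payload bits per prefix byte with the high bit as a continuation flag, stored least significant group first; that ordering and all the function names are assumptions for illustration, not MySQL's actual on-disk format.

      ```c
      #include <assert.h>
      #include <stddef.h>

      /* Bytes needed for a length prefix carrying 7 bits per byte. */
      size_t length_prefix_bytes(size_t n) {
          size_t bytes = 1;
          while (n >= 128) {
              n >>= 7;
              bytes++;
          }
          return bytes;
      }

      /* Write the prefix for n into out, least significant group first.
         Returns the number of bytes written. */
      size_t encode_length(size_t n, unsigned char *out) {
          assert(out != NULL);
          size_t i = 0;
          while (n >= 128) {
              out[i++] = (unsigned char)((n & 0x7F) | 0x80);  /* more follows */
              n >>= 7;
          }
          out[i++] = (unsigned char)n;                        /* last group */
          return i;
      }

      /* Total allocation for an N-byte string under this scheme. */
      size_t counted_alloc_size(size_t n) {
          return n + length_prefix_bytes(n);
      }
      ```

      It is not much code, but it is noticeably more than N + 1, and every program that builds strings by hand would need to get it right.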

      Do you really think the state of programming was so bad back then that people wouldn't test 129 byte strings?

      I think the state of programming is so bad now that people wouldn't test it. A major security bug in Blowfish was just found last month caused precisely because of a signed/unsigned char mismatch.

      Where did you stop reading?

      The only security issues mentioned were buffer overruns, with gets taking most of the blame. As I said above, only some NUL errors are buffer overruns and only some buffer overruns are NUL errors, and gets errors are not anything to do with NUL.

    16. Re:Missed the point by phantomfive · · Score: 1

      And no, the article didn't miss the "real security problems" caused by null termination. Where did you stop reading?

      The point was: the security problems the article mentions (buffer overflows/underflows) aren't actually caused by NUL-terminated strings; they are caused by buffers that are allocated too small. If the buffer is too small, it won't matter if the string is measured at the beginning or terminated at the end. (It can be fixed by measuring the size of the buffer, but that is a different topic).

      However there is a real security problem, as the GP described, although it really was a problem of mixing two standards, instead of a problem with NUL termination.

      The whole issue of which is better is a lot like big-endian or little-endian byte order: there are arguments both ways, but really it doesn't matter all that much.

      --
      "First they came for the slanderers and i said nothing."
    17. Re:Missed the point by dirtyhippie · · Score: 1

      Sigh... mysql's varchar is just one example of breaking the 255 byte "limit". There are limitless ways to do this. See for example http://en.wikipedia.org/wiki/Variable-length_quantity

    18. Re:Missed the point by Anonymous Coward · · Score: 0

      it would have had the insane limit of 255-byte strings

      I understood the following to mean 2 bytes for the length - 1 byte saved for no magic marker = 1 byte extra. So, 32K or 64K bytes total

      Using an address + length format would cost one more byte of overhead than an address + magic_marker format

    19. Re:Missed the point by arth1 · · Score: 3, Informative

      That's still an arbitrary limit.

      The advantages that I see for counted length are:
      - it makes copying easier - you know beforehand how much space to allocate, and how much to copy.
      - it makes certain cases of strcmp() faster - if the length doesn't match, you can assume the strings are different.
      - It makes reverse searches faster.
      - You can put binary in a string.
      But that must be weighed against the disadvantages, like not being able to take advantage of CPUs' zero-test conditions, but instead having to maintain a counter which eats up a valuable register. Or having to convert text blocks to print them. Or not being well suited for piped text or multiple threads; you can't just spew the text into an already nulled area, and it will be valid as it comes in; you have to update a text length counter for every byte you make available. And... and...
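
      To make the reverse-search point concrete, a small sketch (both function names are hypothetical): with a known length the search can start at the end, while the NUL-terminated version has to find the end first.

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <string.h>

      /* Counted string: start at the known end and walk backwards. */
      const char *find_last_counted(const char *s, size_t len, char c) {
          assert(s != NULL || len == 0);
          while (len > 0) {
              len--;
              if (s[len] == c) return s + len;
          }
          return NULL;
      }

      /* NUL-terminated string: strlen must walk the whole string first. */
      const char *find_last_cstr(const char *s, char c) {
          assert(s != NULL);
          return find_last_counted(s, strlen(s), c);
      }
      ```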

      Getting a free strlen() is NOT an advantage, by the way. In fact, that became a liability when UTF-8 arrived. With a library strlen() function, all you had to do was update the library, but when the compiler was hardcoded to just return the byte count, that wasn't an option. Sure, one could go to UTF-16 instead, but then there's a lot of wasted space.
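
      As an illustration of why a byte count is not a character count under UTF-8, here is a sketch of a code point counter. utf8_codepoints is a hypothetical helper, and it assumes the input is already valid UTF-8.

      ```c
      #include <assert.h>
      #include <stddef.h>

      /* Count code points in a NUL-terminated UTF-8 string by skipping
         continuation bytes (those of the form 10xxxxxx). Assumes valid UTF-8. */
      size_t utf8_codepoints(const char *s) {
          assert(s != NULL);
          size_t count = 0;
          for (; *s != '\0'; s++) {
              if (((unsigned char)*s & 0xC0) != 0x80) count++;
          }
          return count;
      }
      ```

      For a string holding "h", a two-byte "é", and "llo", the byte length is 6 while this returns 5.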

      All in all, having worked with both systems, I find more advantages with null-termination.

      There's also a third system for text - linked lists. It doesn't have the disadvantage of an artificial string length limit, and allows for easy cuts and pastes, and even COW speedups, but requires far more advanced (and thus slower) routines and housekeeping, and has many of the same disadvantages as byte-counted text. Some text processors have used this as a native string format, due to the specific advantages.

      I'd still take NUL-terminated for most purposes.

    20. Re:Missed the point by dirtyhippie · · Score: 1

      Dealing with NUL is significantly simpler than dealing with length fields, and there are significantly fewer sources for confusion.

      Is it, or are you just used to dealing with NUL-terminated strings?

      If you wanted the standard to use a variable-length length as you suggest, you would need to make sure that all the programmers correctly store and parse variable-length strings. Of course they could get it right, but there are lots of ways they could get it wrong.

      That's what libraries are for :-)

      Here's a question: How much memory do you allocate for a string of N bytes? The NUL-termination answer: N + 1. The answer for your mysql variable-length length scheme: N + (N < 128 ? 1 : N < 16384 ? 2 : N < 2097152 ? 3 : .....) -- yes there is a correct answer, but it is much more complicated for the everyday programmer to deal with.

      Again I say: libc.
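
      For what it's worth, a minimal sketch of how a library could hide that arithmetic, assuming the 7-bits-per-byte length prefix implied by the 128 / 16384 / 2097152 thresholds above. The names varlen_header_size and varlen_alloc_size are made up for illustration, not part of any real libc:

      ```c
      #include <stddef.h>

      /* Hypothetical helper: number of prefix bytes needed to store the
         length n when each prefix byte carries 7 bits of length. */
      static size_t varlen_header_size(size_t n)
      {
          size_t bytes = 1;
          while (n >= 128) {
              n >>= 7;
              bytes++;
          }
          return bytes;
      }

      /* Hypothetical helper: total allocation for n payload bytes. */
      static size_t varlen_alloc_size(size_t n)
      {
          return varlen_header_size(n) + n;
      }
      ```

      The caller would then just ask for varlen_alloc_size(N) bytes, much as it asks for N + 1 today.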

      I think the state of programming is so bad now that people wouldn't test it. A major security bug in Blowfish was just found last month caused precisely because of a signed/unsigned char mismatch.

      Heh, a fair point. But if string handling is done in a library by the developer of the OS, and they don't get it right, nobody's going to buy their OS. "Joe average" programmer doesn't have to do it at all, they just call the moral equivalent of strlen(), strdup(), strchr(), strpbrk() etc.

    21. Re:Missed the point by msobkow · · Score: 0

      The C language is just a meta-assembler for the PDP instruction set that hung around a lot longer than the machine. It's not an abstract language, as anyone who coded for a PDP can tell you.

      Poor guy. I guess sooner or later he's going to have to learn how to manage his memory and understand how the underlying physical hardware works. That must be a real toughie for anyone who learned to "program" in the Java/C# world.

      I think the bigger point that's missed is that if a size field were used, you'd still have the same buffer overflow problem if someone simply specified a size that didn't match the allocated memory, same as strncpy will happily try to keep writing to a buffer if you give it bad size information. What you really want to do is use a higher level language like C++ with StringBuffer and MemoryBuffer objects that keep track of not only the in-use size, but the allocated size of a buffer.

      Oh yeah, those objects do exist. Doh!

      Maybe he should RTFM.

      --
      I do not fail; I succeed at finding out what does not work.
    22. Re:Missed the point by Anonymous Coward · · Score: 0

      The reason I raised all of the "what about..." questions was to illustrate just how many small variations in an address+length standard there could have been.

      What if they had chosen 255 as a string terminator?

    23. Re:Missed the point by Anonymous Coward · · Score: 0

      Do you really think the state of programming was so bad back then that people wouldn't test 129 byte strings?

      I think the state of programming is so bad now that people wouldn't test it. A major security bug in Blowfish was just found last month caused precisely because of a signed/unsigned char mismatch.

      Another problem with C. It does not natively define an 8-bit datatype so people use the char type when they need an 8-bit value.

    24. Re:Missed the point by 0123456 · · Score: 1

      There is no 255 byte limit - when length exceeds that value, you just go to 2 bytes of length, etc.

      So you have a 255 byte string. You append one byte to it. What do you do now?

      Are you really suggesting that people should have to move all the bytes of the string one further along so they can increase the length field to two bytes, and then append the new character, and that programmers who can't remember to put a 0 at the end of a string can do that without screwing up?

      Sure, you can force everyone to use library calls for all their string operations, but C was intended to be cheap, dirty and fast, which is why there is so much direct string access in C code. If you told them they'd have to use library calls they'd just write their own code instead for better performance, and get it wrong anyway.

    25. Re:Missed the point by MrEricSir · · Score: 2

      If we were to switch now, is that the compatibility you're referring to? Well sure.

      But nobody's talking about switching now, the point of the topic is that C should have been designed differently. In those days there was very little backwards compatibility to worry about.

      --
      There's no -1 for "I don't get it."
    26. Re:Missed the point by mgiuca · · Score: 2

      Is it, or are you just used to dealing with NUL-terminated strings?

      Nope, they are simpler. Re-read all of the questions I asked regarding design decisions that could be made around address+length formatted strings and tell me that they are just as simple. Now I think higher-level languages should be using lengths, because their libraries abstract the details (e.g., C++ or Java). But in a language where programmers fabricate their own strings, simplicity is best.

      That's what libraries are for :-)

      Well, let's assume a hypothetical universe in which C is still exactly the same C, only with length-delimited strings (still the same level of safety, still malloc and free, still pretty much the same library, only the string functions are implemented differently, etc). Could you write a library that abstracts over the string representation without ever requiring the user to manually read or write the string? I think if you did that (and certainly, C++ does that), you would have a much higher-level library. That isn't what C is good for. C is for when you need low-level access to the underlying representation.

      The beauty of using C (and there aren't many) is that you can write your own efficient string manipulation code. For example, if you know you are going to concatenate three strings, you can allocate enough space for all three, then manually copy the bytes over and seal it with a NUL. In C++, you would probably have a stringstream and push each of the strings onto the end, but it would mean the library is internally adjusting lengths and so on -- the programmer can't make the code do exactly what he asks; there is a layer of abstraction. So you could change C's string representation and then provide a high-level API for manipulating it, but someone is going to get pissed off that the library doesn't do exactly what he wants, and dive down and do it himself. It would be very un-C-like to provide that API.
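
      A rough sketch of that three-string case, for illustration only. concat3 is a hypothetical helper, not a standard function, and the caller is assumed to free() the result:

      ```c
      #include <stdlib.h>
      #include <string.h>

      /* Hypothetical helper: concatenate three NUL-terminated strings into
         one freshly allocated buffer. Returns NULL if malloc fails. */
      static char *concat3(const char *a, const char *b, const char *c)
      {
          size_t la = strlen(a), lb = strlen(b), lc = strlen(c);
          char *out = malloc(la + lb + lc + 1);   /* +1 for the NUL */
          if (out == NULL)
              return NULL;
          memcpy(out, a, la);
          memcpy(out + la, b, lb);
          memcpy(out + la + lb, c, lc);
          out[la + lb + lc] = '\0';               /* seal it with a NUL */
          return out;
      }
      ```

      One allocation, three copies, one terminator, and nothing else happening behind the programmer's back.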

      To put it another way, if you were going to provide a high-level string API for C and tell programmers "never ever manipulate strings on your own; use this library," then you might as well use NUL-terminated strings anyway, since the library will handle it, and programmers will never make a mistake. But again, that would be very un-C-like.

      So once again, it comes down to this: NUL-terminated strings aren't the problem with C. C is the problem with C: the fact that it gives programmers a lot of power. You might argue that we should stop using C to write programs that don't need that speed or power. But there's no point arguing that C should have been a higher-level language, because then it wouldn't be C.

    27. Re:Missed the point by mini+me · · Score: 1

      The C string has its place, but what I never understood is why the C standard library hasn't also included a string type. Something like the following with all of the accompanying bound checking functions to go along with it.


      struct string
      {
          size_t length;
          char *buffer;
      };

      There are several third party libraries that do just that, but it seems like something worthy of being there out of the box.
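
      As a hedged sketch of what one such bounds-checked function might look like: this assumes an extra capacity field on top of the struct above, and string_append is an invented name, not something from any standard:

      ```c
      #include <stddef.h>
      #include <string.h>

      struct string
      {
          size_t length;    /* bytes currently in use */
          size_t capacity;  /* bytes allocated (assumed extra field) */
          char *buffer;
      };

      /* Hypothetical bounds-checked append. Assumes length <= capacity.
         Returns 0 on success, -1 if the n bytes would not fit. */
      static int string_append(struct string *dst, const char *src, size_t n)
      {
          if (n > dst->capacity - dst->length)
              return -1;
          memcpy(dst->buffer + dst->length, src, n);
          dst->length += n;
          return 0;
      }
      ```

      Tracking the allocated size separately from the in-use size is what lets the function refuse a write instead of trusting the caller.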

    28. Re:Missed the point by mgiuca · · Score: 2

      I think the bigger point that's missed is that if a size field were used, you'd still have the same buffer overflow problem if someone simply specified a size that didn't match the allocated memory, same as strncpy will happily try to keep writing to a buffer if you give it bad size information.

      Exactly. The real problem* is that C lets programmers fabricate data however they want.

      *I say "problem" but it really is the whole point of C. It is a dangerous and powerful tool. To make it less dangerous would make it less powerful, and if you wanted such a language, there are plenty available.

    29. Re:Missed the point by mgiuca · · Score: 2

      Well C++ includes a class that is pretty much exactly what you ask for. It wouldn't make sense for C to include that, as the whole point is that C gives you the ability to manipulate data however you want. If C included that, it would be criticised for having two incompatible string types. If it only included that, it would be criticised for not being low-level enough (the programmer is forced to call all these inefficient string manipulation functions that do bounds checking).

      You might ask why C doesn't include closures and list comprehensions: if you want high-level language features, then C isn't the language for you.

    30. Re:Missed the point by mgiuca · · Score: 1

      Then that would be incompatible, yes (and as I said in the original post, there was historically such an issue, as B chose 4 as the string terminator!) But that is just one potentially-incompatible design decision, versus the four I listed in my post for length-delimited strings. Other issues for length-delimitation: do you put the length in a struct with the pointer, or in the buffer with the data? Do you make it variable-width or fixed-width? If it's variable-width, do you use 0 or 1 as an extension bit? Do you limit the length to a maximum of 32 or 64 bits, or allow arbitrarily long length fields? If you limit it, what is the limit? If you don't limit it, how do implementations cope when the length is too long to fit in their standard 'size_t' type?

      Again, I'm sure all of the above questions have sensible answers, but my original point stands: it is *not* *straight* *forward* and undoubtedly there would be at least as much confusion and bugs with a length-delimited string as there would be with a NUL-terminated string.

    31. Re:Missed the point by inflex · · Score: 1

      Yeah, I take a stab at strncpy and a lot of others in my little open-source/aka-incomplete book about C... "C of Peril" - http://www.pldaniels.com/c-of-peril/

    32. Re:Missed the point by yuhong · · Score: 2

      I think it was intended to convert null-terminated strings to fixed-length null padded strings, as used in many places in the Unix kernel at the time it was invented, like filenames.

    33. Re:Missed the point by stderr_dk · · Score: 4, Insightful

      Poor guy. I guess sooner or later he's going to have to learn how to manage his memory and understand how the underlying physical hardware works. That must be a real toughie for anyone who learned to "program" in the Java/C# world.

      Yeah, clearly PHK doesn't know anything about memory allocation. (Except for the malloc library he wrote for FreeBSD...)

      Maybe he should RTFM.

      I don't have a FreeBSD system at hand, but I wouldn't be surprised if the malloc page was written by PHK.

      --
      alias sudo="echo make it yourself #" ; # https://pipedot.org/~stderr & http://soylentnews.org/~stderr
    34. Re:Missed the point by Graff · · Score: 1

      So you have a 255 byte string. You append one byte to it. What do you do now?

      Are you really suggesting that people should have to move all the bytes of the string one further along so they can increase the length field to two bytes, and then append the new character, and that programmers who can't remember to put a 0 at the end of a string can do that without screwing up?

      I would think that the more logical way to handle this is to handle the string in chunks. The first byte (8 bits) of the string is the length, if all of the length bits are set then at offset 256 from the length byte you have another length byte and another section of the string. Rinse, repeat.

      Yes, for very large strings this isn't as efficient as simply using 16 bits at the front of the string (65536 positions in 2 bytes vs 512 positions in 2 bytes) but for small strings - probably the most common case - it is just about as efficient as a null-terminated string.
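
      A minimal sketch of reading the total length under that scheme, assuming a length byte of 255 means "this chunk holds 255 bytes and another length byte follows at offset 256". chunked_length is a hypothetical name:

      ```c
      #include <stddef.h>

      /* Hypothetical helper: walk the chunk headers and add up the lengths.
         A header below 255 marks the final chunk. */
      static size_t chunked_length(const unsigned char *s)
      {
          size_t total = 0;
          for (;;) {
              unsigned char n = *s;
              total += n;
              if (n != 0xFF)
                  return total;
              s += 256;   /* skip this header plus its 255 data bytes */
          }
      }
      ```

      Under this assumption a string of exactly 255 bytes needs a trailing zero-length chunk, and finding the length costs one memory hop per 255 bytes rather than one per byte.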

    35. Re:Missed the point by EvanED · · Score: 1

      Getting a free strlen() is NOT an advantage, by the way. In fact, that became a liability when UTF-8 arrived. With a library strlen() function, all you had to do was update the library, but when the compiler was hardcoded to just return the byte count, that wasn't an option.

      Update the library to... do what? Take into account multi-byte sequences? strlen doesn't and shouldn't do this, and the goal of the count field of counted strings should absolutely not be to count the number of characters -- it should be the buffer size. And UTF-8 doesn't change that.
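
      To make the byte count versus character count distinction concrete, here is a small sketch. utf8_codepoints is a hypothetical helper that assumes well-formed UTF-8 and counts every byte that is not a continuation byte (10xxxxxx); strlen() keeps reporting bytes, which is what you need for buffer sizing:

      ```c
      #include <stddef.h>
      #include <string.h>

      /* Hypothetical helper: count code points in well-formed UTF-8 by
         skipping continuation bytes, which all look like 10xxxxxx. */
      static size_t utf8_codepoints(const char *s)
      {
          size_t count = 0;
          for (; *s != '\0'; s++) {
              if (((unsigned char)*s & 0xC0) != 0x80)
                  count++;
          }
          return count;
      }
      ```

      For a string containing one two-byte character among four ASCII ones, strlen() gives 6 and utf8_codepoints() gives 5. Neither number is the count of user-visible characters once combining marks are involved.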

    36. Re:Missed the point by dirtyhippie · · Score: 1

      Is it, or are you just used to dealing with NUL-terminated strings?

      Nope, they are simpler. Re-read all of the questions I asked regarding design decisions that could be made around address+length formatted strings and tell me that they are just as simple. Now I think higher-level languages should be using lengths, because their libraries abstract the details (e.g., C++ or Java). But in a language where programmers fabricate their own strings, simplicity is best.

      They are simpler IFF you never use the null value. Do you have any files on your system which have NUL bytes in them? Hint: yes. See for example http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=574104 . I would actually argue that the additional complexity in the design of the address+length formatted strings would wind up resulting in simpler code because with a good system library implementing it, nobody would ever be tempted to just do it themselves. C might be a low level language, but that doesn't mean every wheel has to be reinvented over and over again.

      As for your example of concatenating 3 strings: of course you can copy it yourself without a stringstream or StringBuffer or what have you. Just make yourself a string of the right size, then fill the byte ranges by hand.

      string concat (string a, string b, string c) {
          string ret = strnew( strlen(a)+strlen(b)+strlen(c) );
          strfill(ret, 0, a);
          strfill(ret, strlen(a), b);
          strfill(ret, strlen(a)+strlen(b), c);
          return ret;
      }

      What's so hard about that?

    37. Re:Missed the point by dirtyhippie · · Score: 1

      To put it another way, if you were going to provide a high-level string API for C and tell programmers "never ever manipulate strings on your own; use this library," then you might as well use NUL-terminated strings anyway, since the library will handle it, and programmers will never make a mistake. But again, that would be very un-C-like.

      So once again, it comes down to this: NUL-terminated strings aren't the problem with C. C is the problem with C: the fact that it gives programmers a lot of power. You might argue that we should stop using C to write programs that don't need that speed or power. But there's no point arguing that C should have been a higher-level language, because then it wouldn't be C.

      If you had a mistake-less library, you wouldn't want it nul-terminated. It makes certain simple operations like strlen() take more than constant time, for example. The better implementation for performance is length+data, every time.

    38. Re:Missed the point by rossdee · · Score: 1

      64K text should be enough for anybody?

      OK so Bill Gates didn't say that, but somebody at Microsoft did - after all Notepad was limited to 64K files even when Windows machines had gigabytes of ram.

    39. Re:Missed the point by mwvdlee · · Score: 1

      Although it might have been hardware-optimized years later, that way of storing strings would have been really inefficient back when C was designed.

      --
      Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
    40. Re:Missed the point by mgiuca · · Score: 2

      They are simpler IFF you never use the null value. Do you have any files on your system which have NUL bytes in them? Hint: yes.

      Yes -- this is a good reason not to use NUL-terminated strings (which, once again, TFA missed). Remember: I never said NUL terminated strings were good, just that the article missed the point by blaming NUL strings for a different, unrelated problem, and not actually picking up on any of the problems with NUL strings.

      If you need a 0 byte in your strings, then this won't work. However, to be technically correct, strings should contain text, and text should not contain a 0-byte. What about binary strings? Those should absolutely not be stored as NUL-terminated. Remember, nothing in C forces you to use NUL-terminated strings -- it just means you should not use the string.h functions on binary strings. Instead, you MUST separately keep the length around, as you do for an array of ints. Think of a binary string as an "array of chars" and not a NUL-terminated string, and there *shouldn't* be any trouble. (Yet as I pointed out with the MS certificate bug, there can still be trouble.)
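
      A sketch of the "array of chars plus a length" approach, under the assumption that the caller carries the length around itself. struct blob and blob_equal are illustrative names only:

      ```c
      #include <stddef.h>
      #include <string.h>

      /* Illustrative binary string: the length travels with the pointer. */
      struct blob
      {
          const unsigned char *data;
          size_t len;
      };

      /* Compare two binary strings byte for byte. A length mismatch
         settles the question without touching the data at all. */
      static int blob_equal(struct blob a, struct blob b)
      {
          return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
      }
      ```

      With two buffers that share the prefix "ab" followed by a 0 byte, strcmp() stops at the 0 and reports them equal, while blob_equal() compares every byte.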

      string concat (string a, string b, string c) {
              string ret = strnew( strlen(a)+strlen(b)+strlen(c) );
              strfill(ret, 0, a);
              strfill(ret, strlen(a), b);
              strfill(ret, strlen(a)+strlen(b), c);
              return ret;
      }

      What's so hard about that?

      Nothing was hard about it. It's just that you had to invent two new library functions (strnew and strfill) which are much higher-level than other C library functions (with the possible exception of strdup, which combines allocation and copying). You are now saying to your C users (in the hypothetical "C with length-delimited strings" language) "you must never manually manipulate your own strings -- only ever use these library functions." That is antithetical to the way C works. C programmers want absolute control over the representation of everything. If you want a higher-level language, use a higher-level language.

    41. Re:Missed the point by Anonymous Coward · · Score: 0

      this

      It's actually very useful when you are dealing with fixed-size buffers used to pass strings around. You must deal with the non-NUL-terminated case, but you also never pass any uninitialized memory.

    42. Re:Missed the point by Anonymous Coward · · Score: 0

      They actually meant the string "pointer" would have a 16-bit length, for a max string length of 65535. This uses 2 bytes for the length instead of the one null byte used for termination, for a one-byte difference. So you're wrong about the length being limited to 255.

    43. Re:Missed the point by arglebargle_xiv · · Score: 1

      Hell, I've used languages where the statement separator was a 12-11-0-7-8-9 punch. (Bonus points if you can tell me what that is and how to make one.) So a NUL terminated string looks positively modern compared to that.

      I've found that end-of-card was an even more effective delimiter than your $FF.

    44. Re:Missed the point by Anonymous Coward · · Score: 0

      No, he wrote "one more byte", so as the zero terminator would go, the length would actually be two bytes, i.e. maximum string length of 65535.

    45. Re:Missed the point by beelsebob · · Score: 1

      Very nicely put. Just to add to this a little. The "correct" solution to avoiding this problem is to use a higher level language that abstracts strings away from their internal representation, and stops the user from fucking up by either having the wrong length or forgetting to null terminate. Of course this gets back to the original point of the article: the people who originally sorted out C's string representation had extremely limited RAM to work with, and hence wanted to go the low level dangerous way, rather than the high level safe way.

      Lesson:
      Use C when you need something low level.
      Use something higher level when you don't.

      Finally, even when using C, use abstractions, if you have the space/time for them.

    46. Re:Missed the point by dkf · · Score: 1

      The bottom line is that in an application programming language strings need to be atomic, as they are in Python.

      They also need to be defined in terms of abstract characters, which Python doesn't get quite right. There most certainly shouldn't be multiple types of string from the perspective of the programmer, and there definitely shouldn't be multiple types of string literal. (Python 3 is better than Python 2.* in this regard, and both beat Perl at a practical level.)

      C though? It does what it does. As long as you don't mistake a 'char*' for a real string or a 'char' for a real character (or fluff the buffer handling) you're OK.

      --
      "Little does he know, but there is no 'I' in 'Idiot'!"
    47. Re:Missed the point by dkf · · Score: 1

      But that must be weighed against the disadvantages, like not being able to take advantage of CPUs zero test conditions, but instead having to maintain a counter which eats up a valuable register.

      But was that a feature added because lots of code was using NUL-terminated strings? (Hardware and software have co-evolved.)

      --
      "Little does he know, but there is no 'I' in 'Idiot'!"
    48. Re:Missed the point by JasterBobaMereel · · Score: 1

      C does not have strings .... It has a char and a pointer to a char - this means you have low level control of everything (and you have to cope with the consequences)

      C++ Does have strings, they are objects/classes and so the implementation is hidden and can be changed arbitrarily - This means you don't need to worry about the implementation, but lose some of your control ...

      Pick a language for what it is good at, don't complain when the hand-holding language won't let you do something, or the low level language won't stop you ...

      --
      Puteulanus fenestra mortis
    49. Re:Missed the point by Anonymous Coward · · Score: 0

      What is so undesirable about making a string larger than a pointer?

      A "string" is a composite data type, not a fundamental one. It's a set of rules and assumptions about how you arrange and work with the basic data type (usually a 'char'). Underneath the hood, a string is simply a pointer to a linear array of values.

      Imagine a scenario where you are storing single byte values; 0-255 is the range each position in the string array can represent. So if you use the first byte to represent length, you're at a 255 limit. Using your suggestion, we'll add another byte for representing that. Now, where do we add that byte. We can't add it at position -1 without reallocating and copying the entire array in memory. We can't insert it between position 0 and 1, because again we have to reallocate and then shift the data. We will have to implement the strings in such a fashion that they can grow at the end of the array. In which case, why go through all that extra effort of using a length indicator when you can just null-terminate?

      There's actually a lot more to the whole discussion, but that's a scratch on the surface for you to think about.
      As for "security issues" those are, as usual, a result of poor coding. All the issues presented by a null-term'd string can show up with a length-indicated string as well. If it really bothers you, C has this really nice feature called "overloading". Use it and build your own string functions which use length-indication instead of null termination.

    50. Re:Missed the point by Pieroxy · · Score: 1

      C++ Does have strings, they are objects/classes and so the implementation is hidden and can be changed arbitrarily - This means you don't need to worry about the implementation, but lose some of your control ...

      You can't change the way Strings are stored in C++ without breaking things pretty badly for many programs. The problem with C++ is that it's based on C, and you can tinker with memory behind the scene. In Java, you can change the way Strings are stored in memory while being 100% sure you won't break a thing (if your implementation works obviously.)

    51. Re:Missed the point by snowgirl · · Score: 1

      Getting a free strlen() is NOT an advantage, by the way. In fact, that became a liability when UTF-8 arrived. With a library strlen() function, all you had to do was update the library, but when the compiler was hardcoded to just return the byte count, that wasn't an option. Sure, one could go to UTF-16 instead, but then there's a lot of wasted space.

      You've made an error here. There are some Unicode codepoints that are 32 bits long in UTF-16 (surrogate pairs). So, no. Not even with UTF-16 can you get a hardcoded version. However, if you go to UTF-32, then you will have enough space to ensure that every codepoint is represented in the same size value, and thus it's a simple matter of (buffersize in bytes / 4) = number of codepoints. (Note: still not the total number of "characters" as some are combining... and thus an "a" + "combining accent" is one character)
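
      A small sketch of why the UTF-16 unit count and the codepoint count differ. utf16_codepoints is a hypothetical helper and assumes well-formed input, i.e. every low surrogate is preceded by its high surrogate:

      ```c
      #include <stddef.h>
      #include <stdint.h>

      /* Hypothetical helper: count code points in well-formed UTF-16.
         Low (trailing) surrogates 0xDC00..0xDFFF only complete a pair,
         so they are not counted on their own. */
      static size_t utf16_codepoints(const uint16_t *s, size_t units)
      {
          size_t count = 0;
          for (size_t i = 0; i < units; i++) {
              if (s[i] < 0xDC00 || s[i] > 0xDFFF)
                  count++;
          }
          return count;
      }
      ```

      For example, 'A' followed by U+1F600 (stored as the pair D83D DE00) is 3 units and 6 bytes in UTF-16 but only 2 codepoints, whereas in UTF-32 it is 8 bytes and 8 / 4 = 2.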

      "But who would waste all that space just to make calculations faster?" Actually, Perl represents all Unicode in UTF-32 (native byte-alignment) and renders out to UTF-8 when printed. Precisely because space is relatively cheap, and processing time is still often the current limiting factor.

      --
      WARNING! This girl exceeds the MAXIMUM SAFE standards established by the FDA for BRATTINESS
    52. Re:Missed the point by mcvos · · Score: 2

      Allow me to summarize that for the tl;dr crowd:

      C's "everything is a pointer" approach gives you the power to easily do lots of cool stuff, and adding length to a string would break that elegance. But using NUL-terminated strings creates a lot of security problems, not merely limited to buffer overflows, which are really caused by C's backward memory allocation.

    53. Re:Missed the point by Zorpheus · · Score: 1

      The big advantage of Pascal strings, which use a counted length, over C strings was: In Pascal you don't need to think about buffer sizes. The compiler can just handle everything since it knows how much memory is needed for the string, and reserves only the memory really needed for it.

    54. Re:Missed the point by snowgirl · · Score: 3, Insightful

      If we were to switch now, is that the compatibility you're referring to? Well sure.

      But nobody's talking about switching now, the point of the topic is that C should have been designed differently. In those days there was very little backwards compatibility to worry about.

      And if it had been decided to be 1-byte length + data, and everyone used it like that, and assumed that the full 8-bits are available for the length, then when we switch to variable-byte length encoding, it would create an incompatibility. The incompatibility I speak of is the hypothetical one switching from 1-byte fixed-length length encoding to variable-byte length encoding.

      "They could have just used variable length encoding from the beginning." Sure, and they could have programmed everything in Java from the start... the idea of a variable length encoding would have been over-engineering the problem that they were facing.

      --
      WARNING! This girl exceeds the MAXIMUM SAFE standards established by the FDA for BRATTINESS
    55. Re:Missed the point by Anonymous Coward · · Score: 0

      Also: If you make it variable width, must the width exactly match the log2 of the length or can it be larger?
      Do you have to support something insane like a length value of 1 encoded as 4096 bits?

    56. Re:Missed the point by TheRaven64 · · Score: 1

      A long is, by C language definition, large enough to hold a pointer that has been cast into it

      No it isn't. Such a type was not added until C99, which defined intptr_t and uintptr_t. If you make this assumption, then your code will break on Win64 and on some mainframe systems.
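
      A minimal sketch of the C99 way, assuming the implementation provides the (technically optional) uintptr_t. pointer_roundtrips is just an illustrative name:

      ```c
      #include <stdint.h>

      /* Illustrative check: store a pointer in an integer wide enough
         for it, convert it back, and confirm nothing was lost. */
      static int pointer_roundtrips(void *p)
      {
          uintptr_t bits = (uintptr_t)p;
          return (void *)bits == p;
      }
      ```

      Doing the same through a long can truncate on Win64, where long stays at 32 bits while pointers are 64 bits.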

      --
      I am TheRaven on Soylent News
    57. Re:Missed the point by TheRaven64 · · Score: 2

      They also need to be defined in terms of abstract characters

      How big is an abstract character? When C was created, the choices were basically ASCII or EBCDIC, so 7 bits. Then you got 8-bit encodings, but they were all incompatible. When OpenStep was published, Objective-C strings were defined as ordered collections of unicode characters, which were 16-bit values. Modern versions of the unicode specification require more than 16 bits for the entire range, so you end up needing 32 bits for each character. In a high-level language, you can just have a character type that you periodically redefine to be bigger and let the VM / runtime sort it out. In a low-level language like C, you break the ABI every time you do that.

      --
      I am TheRaven on Soylent News
    58. Re:Missed the point by jeremyp · · Score: 1

      Furthermore, if they chose a one-byte length, as the article so casually suggests as the correct solution (like Pascal), it would have had the insane limit of 255-byte strings, with no compatible way to have a string any longer. (Though a size_t length would make more sense.)

      Actually, they were suggesting a two byte length, hence the one byte of overhead because a length + data string does not need a terminator. Two bytes would have been adequate on a PDP 11 because addresses were only 16 bit.

      --
      All I want is a secure system where it's easy to do anything I want. Is that too much to ask ~~ Randall Munroe
    59. Re:Missed the point by TheRaven64 · · Score: 1

      The C standard defines the standard library, not just the language. This includes things like qsort() and wchar_t, but doesn't define a string_t...

      --
      I am TheRaven on Soylent News
    60. Re:Missed the point by lurcher · · Score: 1

      The only thing that makes sense is for the string length count to be the same size as a pointer, so that it could effectively be all of memory.

      Other than the segmented environments where a pointer doesn't cover all the memory. But that is just a nit pick.

    61. Re:Missed the point by Anonymous Coward · · Score: 0

      Furthermore, if they chose a one-byte length, as the article so casually suggests as the correct solution (like Pascal), it would have had the insane limit of 255-byte strings, with no compatible way to have a string any longer.

      The article says it would "cost one more byte" - one more than the NUL terminator, which makes a two-byte length. That would be a size_t on a 16-bit computer.

    62. Re:Missed the point by TheRaven64 · · Score: 1

      You can in Objective-C. NSString is a class cluster, which means that you can create a new storage system as long as it can return the length and the character at a given index. For efficiency, you also want to define a method that fetches a range of characters. LibICU defines a very similar abstraction. In fact, I've written implementations of both in terms of the other. This is one place where a slightly higher-level language has a big performance win. In C++, you have to copy string data to pass it to and from ICU, while in Objective-C you just implement an ICU UText abstract type in terms of NSString and then an NSString in terms of UText for the result. For a small string, there's not much difference, but when you're doing a unicode regex search on a large body of text, then the C++ approach results in a lot of cache churn and redundant copying.

      The problem is not that C++ is based on C, it's that it uses bad abstractions and optimises in the wrong places for its common uses. Oh, the Objective-C solution isn't perfect either, by the way. It was standardised when unicode text was still using 16-bit characters. Newer versions of the unicode standard support more characters (something like 2^24, I think), so you need to resort to UTF-16 surrogate pairs.

      --
      I am TheRaven on Soylent News
    63. Re:Missed the point by TheRaven64 · · Score: 1

      I know it's new, only being 12 years old, but uint8_t and int8_t have been part of C since C99...

      --
      I am TheRaven on Soylent News
    64. Re:Missed the point by Anonymous Coward · · Score: 0

      strncpy() does not pad. It terminates on either exhausting the buffer, or having copied a NUL.

    65. Re:Missed the point by Zorpheus · · Score: 1

      Correcting myself, it's the runtime engine doing this, not the compiler.
      Strings in pascal are just normal variables. For example you can write something like:
      string a,b;
      a = "Hello ";
      b = "world!";
      If (a+b == "Hello world!") ....
      (that is not pascal code, it's just for illustration)

    66. Re:Missed the point by Anonymous Coward · · Score: 0

      Never mind security - it's just incredibly inconvenient that a string can never contain a null byte. You can't safely read a file into a string - or even just a single line unless you sanity-check it first. You can't use a string to store a bitmap image, or a received packet of data, or even the result returned from a database unless you're absolutely sure the database has the same NUL-byte restriction.
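A minimal C sketch of the embedded-NUL problem described above. The names `payload`, `visible_length`, and `copy_counted` are hypothetical, chosen only for illustration; the sample bytes stand in for a file or packet that happens to contain a zero byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A 7-byte payload with a zero byte in the middle, e.g. part of a
 * binary file or network packet (hypothetical sample data). */
static const unsigned char payload[7] = { 'a', 'b', 'c', 0, 'd', 'e', 'f' };

/* What the C string functions believe the length is: they stop at the
 * first zero byte, so everything after it is silently ignored. */
size_t visible_length(const unsigned char *buf)
{
    assert(buf != NULL);
    return strlen((const char *)buf);
}

/* Copying with an explicit length keeps every byte, zero or not. */
void copy_counted(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, n);
}
```

Here `visible_length(payload)` reports only the bytes before the zero, while `copy_counted` with `sizeof payload` preserves all seven, which is why binary data is normally handled with the `mem*` functions and a separately tracked length.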

      What's amazing is that the C standard library string functions are so often used in new code. I guess this is because none of the alternative string libraries have grabbed enough mindshare to be accepted as standard. And, as so often, because programmers divide into those who aren't aware of the traps and gotchas, and those who think they are elite enough to avoid them.

    67. Re:Missed the point by Anonymous Coward · · Score: 0

      C99 fixed that already. C++ however still has that issue.

    68. Re:Missed the point by Anonymous Coward · · Score: 0

      It's funny that people mention Pascal and 255 byte string limit and at the same time they complain about limited memory that justifies the NULL delimited strings. Either the memory is limited and you can't afford strings longer than 20-30 chars or the memory isn't limited and you are using 32 bit string length like the one in any Pascal that was in use after 1990.

      What everyone forgets is that as soon as strings become dynamically allocated you are now wasting the extra byte for NULL because you still have to keep track of the size of allocated memory and you have buffer overrun as a built in feature due to handling strings as NULL delimited arrays.

      In "new" Pascal it is easy to use strings for memory allocations to store any data and the benefit is automatic memory management - when a string goes out of scope the memory is released.

    69. Re:Missed the point by catmistake · · Score: 1

      There is no tuple. There is only NUL.

    70. Re:Missed the point by jeremyp · · Score: 3, Interesting

      That's still an arbitrary limit.

      An arbitrary limit equal to the virtual machine size of the computer that was originally targeted.

      The advantages that I see for counted length are:
      - it makes copying easier - you know beforehand how much space to allocate, and how much to copy.
      - it makes certain cases of strcmp() faster - if the length doesn't match, you can assume the strings are different.
      - It makes reverse searches faster.
      - You can put binary in a string.

      - It all but eliminates the possibility of buffer overruns for strings.
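Two of the advantages listed above (the early exit on mismatched lengths and the cheap reverse search) can be sketched in C roughly as follows. The type `counted_str` and both function names are hypothetical, not taken from any existing library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted-string view: a pointer plus an explicit length. */
typedef struct {
    const char *data;
    size_t len;
} counted_str;

/* Equality can reject mismatched lengths before touching the bytes. */
bool counted_equal(counted_str a, counted_str b)
{
    if (a.len != b.len)
        return false;
    return a.len == 0 || memcmp(a.data, b.data, a.len) == 0;
}

/* Reverse search starts at the known end; no scan for a terminator.
 * Returns the index of the last occurrence of c, or -1 if absent. */
long counted_rfind(counted_str s, char c)
{
    assert(s.len == 0 || s.data != NULL);
    for (size_t i = s.len; i > 0; i--) {
        if (s.data[i - 1] == c)
            return (long)(i - 1);
    }
    return -1;
}
```

Because the length travels with the pointer, the same two functions also work on data containing zero bytes.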

      But that must be weighed against the disadvantages, like not being able to take advantage of CPUs zero test conditions, but instead having to maintain a counter which eats up a valuable register.

      But lots of CPUs have an instruction a bit like "decrement register and jump if not zero" which can be used for length+data strings.

      Or not being well suited for piped text or multiple threads; you can't just spew the text into an already nulled area, and it will be valid as it comes in;

      With modern character encodings, you can't guarantee that whatever string format you use. Couple that with the fact that streamed data tends to be read and written in blocks with a length parameter anyway, and the whole advantage is gone. This is why almost all modern languages have some variation on length + data for their strings and utilities for manipulating raw byte buffers.

      Getting a free strlen() is NOT an advantage, by the way. In fact, that became a liability when UTF-8 arrived. With a library strlen() function, all you had to do was update the library, but when the compiler was hardcoded to just return the byte count, that wasn't an option.

      Except that strlen() has always and still does count the number of C chars before the null byte. This is enshrined in the C99 standard. UTF-8 has not changed the implementation of strlen(). Also, gcc and probably many other compilers will normally optimise things like strlen() to a few lines of assembler rather than a call to libc, so you'd have to recompile anyway if it does change.

      Sure, one could go to UTF-16 instead, but then there's a lot of wasted space.

      All in all, having worked with both systems, I find more advantages with null-termination.

      There's also a third system for text - linked lists. It doesn't have the disadvantage of an artificial string length limit, and allows for easy cuts and pastes, and even COW speedups, but requires far more advanced (and thus slower) routines and housekeeping, and has many of the same disadvantages as byte-counted text. Some text processors have used this as a native string format, due to the specific advantages.

      I'd still take NULL-terminated for most purposes.

      Most modern languages have a proper string type and I would always take that over null terminated char sequences. You can bet that Java's internal implementation of String uses length+data.

      --
      All I want is a secure system where it's easy to do anything I want. Is that too much to ask ~~ Randall Munroe
    71. Re:Missed the point by Anonymous Coward · · Score: 0

      Page 25, item 5.2, I can't see if there's really a space in the example.

      Otherwise, I mostly knew what was in the book, but it's a good idea anyway. Keep the good work.

    72. Re:Missed the point by mgiuca · · Score: 1

      Thanks. I'm not very good at short summaries.

    73. Re:Missed the point by Joce640k · · Score: 1

      Null termination sounds lovely when you're a teenager writing assembly and doing register allocation by hand, but it's obviously bad once you've seriously thought about runtimes, like after taking an algorithms class. You shouldn't need to traverse strings to determine their lengths.

      Going the other way has lots of problems, too.

      eg. Given this line of text: "Content-Length: 12345", how do you pass a pointer to the "12345" to a function which converts strings to integers?

      As a design decision, it was FAR better to trade some runtime efficiency for the flexibility that null-terminated strings give you.

      You'd know that if you'd written some real programs instead of just sitting through an algorithms class.
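The "Content-Length" case above can be sketched in C as follows. The helper name `content_length` is hypothetical; `strncmp` and `strtol` are the standard library calls.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse the decimal value after "Content-Length: ".
 * Returns -1 if the header does not start with that prefix. */
long content_length(const char *header)
{
    static const char prefix[] = "Content-Length: ";
    size_t plen = sizeof prefix - 1;

    assert(header != NULL);
    if (strncmp(header, prefix, plen) != 0)
        return -1;

    /* header + plen is itself a valid NUL-terminated string ("12345"),
     * so it can go straight to strtol() with no copy. */
    return strtol(header + plen, NULL, 10);
}
```

The point of the sketch is that any pointer into a NUL-terminated string is again a NUL-terminated string, so a suffix costs nothing. A length-prefixed block does not have that property, although a separate pointer-plus-length pair would.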

      --
      No sig today...
    74. Re:Missed the point by Anonymous Coward · · Score: 0

      From TFA:
      "Slashdot Sensation Prevention Section

      We learn from our mistakes, so let me say for the record, before somebody comes up with a catchy but totally misleading Internet headline for this article, that there is absolutely no way Ken, Dennis, and Brian could have foreseen the full consequences of their choice some 30 years ago, and they disclaimed all warranties back then. For all I know, it took at least 15 years before anybody realized why this subtle decision was a bad idea, and few, if any, of my own IT decisions have stood up that long.

      In other words, Ken, Dennis, and Brian did the right thing."

    75. Re:Missed the point by Joce640k · · Score: 1

      You can't change the way Strings are stored in C++ without breaking things pretty badly for many programs.

      Sure you can.... you just need a suitable conversion operator and the compiler will do the rest, generating temporary std::strings as needed.

      It may not be the most efficient way but I assume you have a *very* good reason for not using std::string so the tradeoff would need to be measured for your particular case.

      --
      No sig today...
    76. Re:Missed the point by Anonymous Coward · · Score: 0

      As for buffer overflows I'd rather kill that bastard who invented downwards growing stacks.

    77. Re:Missed the point by smpoole7 · · Score: 1

      "Not to mention the argument ... is specious ..."

      Excellent point. A byte at the front or a byte marking the end, same thing either way.

      Great points in this whole thread, as far as I'm concerned. I'm more of a hardware guy, I guess, but I love C because it's light, fast and fun to use. When I need OOP, I use C++. And speaking as more of a hardware guy, the article also completely and utterly misses the most glaringly obvious point -- namely, that the worst mistake of all time was standardizing on the 80x86 series of processors. The fact that the latest incarnations of the 80x86 family have trouble with end-marked memory segments is more a function of the processor, not the programming language.

      To quote Andy Schulman from memory (who in turn, I believe, was quoting someone else): "no other processor has so arcane a protection mechanism."

      And to quote one of the creators of BASIC (again, from memory): "only time will tell how thoroughly unfortunate it was that the 80x86 series was chosen for the PC family." (And remember, BASIC uses length info for strings.)

      The bitwise alignment that the article refers to is the result of the 80x86 family growing over the years, adding features, while needing to maintain reverse compatibility with older code. C is by no means the only language affected by it, and the article bemoaning the fact that a C string must be "scanned" in order to determine its length is silly. Every language/processor combination has limitations (try compiling for a little embedded processor some time!). There will always be tradeoffs.

      --
      Cogito, igitur comedam pizza.
    78. Re:Missed the point by Dog-Cow · · Score: 1

      That is not true. In Windows 95 through Windows ME, there was a 64K limit because GDI was still 16 bits, and the text control used by Notepad was (and is) implemented on top of GDI. Windows NT (all the way back to 3.1) never had this limit. By the time Windows machines had gigabytes of RAM, Windows 2000 and Windows XP were the dominant versions in use.

    79. Re:Missed the point by Anonymous Coward · · Score: 0

      No, a zero test is the basic component of all comparator tests (a == b is evaluated as a-b==0), and inequalities are implemented as sign tests (a > b is evaluated as a-b>0). The reason for this is that the arithmetic units are required anyway, so having to include a specific n-bit AND port is a waste of space.

      Besides, the most efficient counter is a countdown to zero. Why? Because of the zero test. Even without the zero test, you would use the overflow bit to detect the zero crossing (and compensate for the obi-wan error).

    80. Re:Missed the point by bjourne · · Score: 2

      "But who would waste all that space just to make calculations faster?" Actually, Perl represents all Unicode in UTF-32 (native byte-alignment) and renders out to UTF-8 when printed. Precisely because space is relatively cheap, and processing time is still often the current limiting factor.

      Which is not always true either. Often the limiting factor is the space in the cpu cache, since ram access is relatively expensive.

    81. Re:Missed the point by Carewolf · · Score: 1

      You never use std::string in C++, you would need a very good reason to do so, since no one else does. You use either C-strings or one of the abundance of string implementations; there is at least one for each framework or API. The good thing about all the string implementations is that most of them are very similar to, but much better than, std::string, and usually they can be constructed with no memory copy, which makes conversion between them completely free at runtime.

    82. Re:Missed the point by Anonymous Coward · · Score: 0

      FYI: The C language does not define that a "long" be the same size as a pointer, the choice is platform-specific, which is why another poster has mentioned size_t

    83. Re:Missed the point by Anonymous Coward · · Score: 0

      The choice of null terminated strings is MUCH older than C or even B.

      On the PDP-1, strings were declared as ASCIZ (ascii null terminated). This was a packed format due to the 12 bit word values, the first byte in first word, second byte in second word, third byte split (4 bits each) and put in the high order 4 bits of each word, high order 4 bits in the first word, low order in the second. If any byte was NULL, then the string was terminated. Another declaration (ASCII) was the same format (null terminated) but not packed.

    84. Re:Missed the point by alangmead · · Score: 1

      Saying "PDP" instruction set makes them sound all the same. C is very similar to the PDP-11 instruction set. Unfortunately, it was produced after C was developed. I strain to find similarities between the PDP-7 instruction set and C. Most of the PDP-11-isms that people see in C are the post-(increment|decrement) instruction variations and the MOV variations that dereference an address register.

    85. Re:Missed the point by Lonewolf666 · · Score: 1

      Firstly, it makes it seem like the address+length format is a no-brainer, but there are quite a lot of problems with that. It would have had the undesirable consequence of making a string larger than a pointer. Alternatively, it could be a pointer to a length+data block, but then it wouldn't be possible to take a suffix of a string by moving the pointer forward. Furthermore, if they chose a one-byte length, as the article so casually suggests as the correct solution (like Pascal), it would have had the insane limit of 255-byte strings, with no compatible way to have a string any longer. (Though a size_t length would make more sense.) Furthermore, it would be more complex for interoperating between languages -- right now, a char* is a char*. If we used a length field, how many bytes would it be? What endianness? Would the length be first or last? How many implementations would trip up on strings > 128 bytes (treating it as a signed quantity)? In some ways, it is nice that getaddrinfo takes a NUL-terminated char* and not a more complicated monster. I'm not saying this makes NUL-termination the right decision, but it certainly has a number of advantages over addr+length.

      A lot of those questions can be handled by a clear specification of the string types. Things like "Would the length be first or last?" or "how long is the length field" obviously belong in the specification (and someone who does not think that far should not be allowed to design programming languages other people have to use).

      Some more of these are a general problem if you mix languages and port to other systems. Sure, "a char* is a char*" works for strings. But how many programs handle only strings? Most likely, you have some integers as well, and then the endianness of the length field (which is probably some sort of integer) comes back as a more general question about the endianness of integers. So you have to deal with that problem anyway.

      So I think the downsides of Pascal style strings are overrated. My two cents on how a useful specification for the 16bit systems of the 1970s could look:
      1) A "string" consists of a length field at the beginning and ASCII characters after the length field.
      2) The length field is an unsigned 16bit integer. Endianness depends on the architecture (whatever the CPU normally uses). For exchange between computers, the application programmer is responsible for creating a unified format.

      This string format would be only one byte longer for the same number of characters (you don't need the NUL byte at the end) and allow pretty long strings of 64kByte.
      If that extra byte really hurts, add a second "shortstring" data type with 8bit unsigned integer as length field.
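A rough C sketch of the two-point specification above, assuming a 16-bit length in native byte order followed directly by the characters. The type `lstring` and the function names are hypothetical, and the flexible array member is a C99 feature that would not have existed in the 1970s.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: 16-bit unsigned length in native byte order,
 * followed directly by the characters, with no terminator. */
typedef struct {
    uint16_t len;
    char data[];   /* C99 flexible array member */
} lstring;

/* Build an lstring from n bytes; returns NULL if n exceeds 65535 or
 * allocation fails. The caller releases it with free(). */
lstring *lstring_new(const char *bytes, size_t n)
{
    assert(bytes != NULL || n == 0);
    if (n > UINT16_MAX)
        return NULL;
    lstring *s = malloc(sizeof *s + n);
    if (s == NULL)
        return NULL;
    s->len = (uint16_t)n;
    if (n > 0)
        memcpy(s->data, bytes, n);
    return s;
}

/* Storage cost in bytes: 2 for the length plus the characters,
 * versus n + 1 for the NUL-terminated form. */
size_t lstring_storage(size_t n)
{
    return sizeof(uint16_t) + n;
}
```

For a five-character string this comes to 7 bytes against 6 for the NUL-terminated form, which is the one-byte difference the comment refers to.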

      --
      C - the footgun of programming languages
    86. Re:Missed the point by Chrisq · · Score: 1

      and the goal of the count field of counted strings should absolutely not be to count the number of characters -- it should be the buffer size. And UTF-8 doesn't change that.

      That is really only a convention. In practice you need both length functions for different purposes.

    87. Re:Missed the point by John+Bresnahan · · Score: 1

      The C standard defines the standard library, not just the language.

      But the language and standard library predates the standard by decades

    88. Re:Missed the point by Chrisq · · Score: 1

      That's still an arbitrary limit.

      Why not have a length for the length field .... and then just in case that is not enough a length for this field ... and then ...

    89. Re:Missed the point by Anonymous Coward · · Score: 0

      I'm surprised that you missed the most important point as well. See, when dealing with C character arrays, there are two size values that matter: the allocated size and the actual string length. Functions writing to a string need the allocated size, while functions reading a string need its actual length. Or are you going to suggest that for each and every character array, the allocated size must always be the actual length? In that case, you are inviting a massive overhead for all string manipulations. Try implementing strsplit() in C using both NUL-terminated and length-defined strings and you'll see what the problem is.
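The distinction between allocated size and actual length can be sketched in C like this. The type `strbuf` and the function `strbuf_append` are hypothetical names used only to show both values being tracked side by side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical buffer type that tracks both sizes separately. */
typedef struct {
    char  *data;  /* caller-provided storage */
    size_t cap;   /* allocated size: what writers must respect */
    size_t len;   /* bytes in use: what readers need */
} strbuf;

/* Append n bytes; refuse (and change nothing) if they would not fit. */
bool strbuf_append(strbuf *b, const char *src, size_t n)
{
    assert(b != NULL && b->len <= b->cap);
    if (n > b->cap - b->len)
        return false;
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return true;
}
```

Writers consult `cap`, readers consult `len`, and neither has to scan the data to find out where it ends.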

      The "1-byte saved" storage argument is facetious and misleading. Computational complexity is a much stronger argument, especially on the sub-1 MHz machines of the day.

    90. Re:Missed the point by Lonewolf666 · · Score: 1

      If you choose strings with length count, the format of the length field should be fixed in the specification. If necessary, use multiple string types. As in
      "shortstring": 8bit unsigned length field, holds up to 255 characters
      "string": 16bit unsigned length field, holds up to 65535 characters
      "longstring": 32 bit unsigned length field, holds up to 2^32-1 characters

      In general, I think making the length of "integer" depend on the architecture is a stupid idea, because of breakage when the size of the address space changes. Better to have a separate type "pointer" and add larger integer types as needed. That way, application programmers have a stable definition and systems programmers have a pointer type that adapts in size.

      --
      C - the footgun of programming languages
    91. Re:Missed the point by Anonymous Coward · · Score: 0

      Wikipedia is your friend: "most special characters had two or three punches (zone [12,11,0,or none] + digit [2-7] + 8);"
      But then the trail ends. It is a special character, I can get as far as that. But which it is specifically, I can't find.

      The world of cards is fascinating! : http://www.cs.uiowa.edu/~jones/cards/codes.html

    92. Re:Missed the point by e9th · · Score: 1
      RTFM, dolt.

      If the length of src is less than n, strncpy() pads the remainder of dest with null bytes.
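The padding behaviour quoted from the manual is easy to check with a small C sketch. The helper name `zero_bytes_after_strncpy` is hypothetical; it poisons the buffer first so that any zero bytes found afterwards must have been written by `strncpy()`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count how many trailing bytes of dest are zero after
 * strncpy(dest, src, n). dest must have room for n bytes. */
size_t zero_bytes_after_strncpy(char *dest, const char *src, size_t n)
{
    assert(dest != NULL && src != NULL);
    memset(dest, 'X', n);          /* poison so padding is visible */
    strncpy(dest, src, n);

    size_t zeros = 0;
    while (zeros < n && dest[n - 1 - zeros] == '\0')
        zeros++;
    return zeros;
}
```

With an 8-byte buffer and the source "abc", the remaining five bytes come back zeroed. With a source of 8 or more characters there is no zero byte at all, so the result is not a terminated string.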

    93. Re:Missed the point by scharkalvin · · Score: 1

      The problem with a NULL terminated string (or ANY terminating character) is that it limits strings to ASCII characters. You can't have binary strings because the terminating character is always a valid string character.

    94. Re:Missed the point by gtall · · Score: 1

      "how do you pass a pointer to the "12345" to a function which converts strings to integers?" Easy, pass a pointer to the string and the function is smart enough to start after the count.

      You'd know that if you'd written some real programs instead of just sitting through an algorithms class.

    95. Re:Missed the point by Anonymous Coward · · Score: 0

      The trouble is, C is a systems programming language, so it is imperative that the language allow direct access to bit-level implementation.

      Lisp was used as both a systems and application programming language, and it seems not to have these problems.

    96. Re:Missed the point by Anonymous Coward · · Score: 0

      *Whoosh*
      That was part of his point, that WASN'T a buffer overflow.

    97. Re:Missed the point by arth1 · · Score: 2

      But that must be weighed against the disadvantages, like not being able to take advantage of CPUs zero test conditions, but instead having to maintain a counter which eats up a valuable register.

      But lots of CPUs have an instruction a bit like "decrement register and jump if not zero" which can be used for length+data strings.

      Um, that's pretty much what I said, isn't it? The "instead having to maintain a counter which eats up a valuable register" part.

    98. Re:Missed the point by mgiuca · · Score: 1

      Not quite true. You can't have binary strings -- true. So I think of text strings and binary "strings" being two completely different things (as they should be -- any modern language like Python 3 or Java does distinguish between them). A text string is what a char* is for, in C, and the string.h library. A binary "string", you should not use a NUL-terminated char*, you should keep the length yourself and use binary manipulation libraries like memcpy.

      However, what you say about ASCII isn't true. Assuming you aren't going to be using the code point 0, many character encodings work fine with NUL-terminated strings: ASCII, Latin-1 and most importantly, UTF-8. This means that you can represent any Unicode string without a 0 byte (as long as your string doesn't include the NUL character). If you are using UTF-16, then you'll have a wchar_t* instead of a char*, and your terminator won't be the byte 0, it will be the wchar_t 0, so UTF-16 works fine as well.

    99. Re:Missed the point by arth1 · · Score: 1

      Couple that with the fact that streamed data tends to be read and written in blocks with a length parameter anyway, and the whole advantage is gone.

      find /dev /var/run -type c -o -type p

    100. Re:Missed the point by Rockoon · · Score: 1

      However, to be technically correct, strings should contain text, and text should not contain a 0-byte.

      Why not?

      These C "strings" are precisely the reason that ascii 0 is treated as special, and not the other way around.

      C doesn't have a string type, so programmers hacked up something dirt-simple to implement them. Now people like you think that a byte with a value of 0 actually SHOULD be special, even though it's yet one more value for the byte to take on.

      Would you defend a file system that did not accurately store the length of files, that instead used an end of file marker? Before you respond too quickly, note that history is littered with those file systems. Nobody uses them anymore because they were, like C's "strings", just something hacked up to be "simple."

      Every language that has ever taken strings seriously has opted for the minimum of a length + pointer structure. That tells us that ascii 0 is not special as you suggest. Not at all.

      --
      "His name was James Damore."
    101. Re:Missed the point by zzsmirkzz · · Score: 0

      the idea of a variable length encoding would have been over-engineering the problem that they were facing.

      You say this like it is a bad thing. Over-engineering, as you put it, would probably have been the best thing they could have done if they went this route because, as you pointed out, 255-byte strings are rather short. It doesn't take much forward-thinking to anticipate the problem and correct it. I would call that engineering, not over-engineering.

    102. Re:Missed the point by mgiuca · · Score: 1

      OK, firstly, I design programming languages and I agree with you in principle. My programming language allows NUL as a character in strings, and so should all modern languages. It is a valid character, as you say. It has a Unicode code point, U+0000.

      However, in the context of C, there are historical reasons and technical reasons (simplicity is a priority) why I defended the NUL. Note that I didn't say it was the correct decision, just that it isn't exactly clear-cut that it was the wrong decision. When I say "text should not contain a 0 byte" I don't mean "programming languages should not accommodate text that does contain a 0 byte"; I mean it is never necessary for a non-binary text string to contain a 0 byte, so it is acceptable, I would argue, for any program to drop such characters. Therefore, it doesn't matter too much in practice that C doesn't allow this character in its strings.

      Would you defend a file system that did not accurately store the length of files, that instead used an end of file marker? Before you respond too quickly, note that history is littered with those file systems.

      Absolutely not, because that is a file system and files are binary things. They are a sequence of octets, with no octet value being any more special than any other. Those file systems which litter history should be dead. That is different, however, to a text string data type that doesn't really need to store those characters. I didn't say it was ideal, but it's acceptable. (Clearly it's acceptable, because we still use C today.)

    103. Re:Missed the point by Big_Breaker · · Score: 1

      That's a pretty long string - generally ~20 pages of single spaced text and would rarely cause trouble. You could also just have a "big_string" type that could be longer just as we have int and long_int.

      When I first learned C I thought the string style was odd. My first instinct was to have the length as the first byte or two as well but you learn the way it is and go with it.

      Anyone is free to reimplement the string libraries with this alternative, just as everyone is free to switch from a qwerty to a Dvorak keyboard. In both cases most people are content with things as they are, imperfect as they are.

    104. Re:Missed the point by Anonymous Coward · · Score: 0

      A long is, by C language definition, large enough to hold a pointer that has been cast into it.

      That's actually not true. The C99 standard says a long must be at least 32 bits in size and that it must be at least as big as an int, but that's it. The C99 standard defines the intptr_t type as the integer type that can store a pointer (which is typically typedef'd to long, of course). Actually Wikipedia says that IA-64 Windows follows the IL32P64 model, which breaks your assumption that longs can hold pointers.
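A short C sketch of the portable alternative mentioned above: converting through `uintptr_t` from `<stdint.h>` (the unsigned sibling of `intptr_t`, optional in C99 but widely available) rather than through `long`. The function names `roundtrip` and `long_holds_pointer` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round-trip a pointer through uintptr_t, the (optional) C99 integer
 * type meant for this purpose. */
void *roundtrip(void *p)
{
    uintptr_t bits = (uintptr_t)p;
    return (void *)bits;
}

/* Reports whether long happens to be wide enough on this platform;
 * true on typical LP64 Unix systems, false on 64-bit Windows. */
int long_holds_pointer(void)
{
    return sizeof(long) >= sizeof(void *);
}
```

The first function is guaranteed to return a pointer that compares equal to the original; the second only reports what the current platform happens to do.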

      The POSIX standard may require that long is big enough to hold a pointer, I'm not sure. POSIX does have some extra restrictions that the C standard doesn't have, but I'm not as familiar with it.

    105. Re:Missed the point by KiloByte · · Score: 2

      On any sane platform, "long" has that property, and can also hold the machine word or more. Win64 is not sane.

      Because of it, the standard had to add intptr_t which is the only type portably known to be of same width as void* and char* (but _not_ necessarily the same as a pointer to any other type!). Of course, MSVC doesn't follow standards and doesn't have this type nor stdint.h/inttypes.h at all.

      Thus, your code will not work there, and will cut pointers. To make things worse, it will actually work if your pointers are within the first 2GB of address space and not on the stack.

      I do fully agree with your core point, though: Pascal strings would suffer from all these problems as well, and not only.

      --
      The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
    106. Re:Missed the point by KiloByte · · Score: 1

      BSD has strlcpy() which works sanely, but Ulrich Drepper refused all requests to add it to glibc.

      --
      The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
    107. Re:Missed the point by EvanED · · Score: 1

      Oh, I fully agree. But the one that matters for safety purposes (at least sidestepping the "a multi-byte sequence got cut off in the middle" problem) is the byte count, not the logical character count. And safety is why, the argument goes, we'd have been better off with counted strings.

      (In fact, even "get me the number of logical characters" isn't unambiguous. Is it the number of Unicode code points, or the number of glyphs? I.e. do you count "combining acute accent" followed by "e" as one or two characters? It's two code points, but to the character it's only one. So there's at least three.)

    108. Re:Missed the point by Rakishi · · Score: 1

      Fail once again, do you even know what you're talking about or do you just google whatever comes first then post it?

      That scheme has none of the advantages mentioned in the article over a NUL ending. The length is encoded across the data stream and so is not known beforehand. Furthermore it adds a constant 12.5% overhead which is even more than the overhead from other schemes. It's actually rather close to a NUL ending approach to integers which don't have a set of reserved unused bytes.

    109. Re:Missed the point by KiloByte · · Score: 2

      UTF-16 is not fixed width. It combines all disadvantages of UTF-8 and UCS4 while having no advantages of either.

      --
      The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
    110. Re:Missed the point by blane.bramble · · Score: 2

      No, that is just the stupid. The obvious fix is to store the length of the string *after* the string itself.

    111. Re:Missed the point by Anonymous Coward · · Score: 0

      DOS calls did not NUL terminate. Instead they used a '$' character to terminate a string. That is a far worse choice. C++ creates a string class that reduces the issues with C style strings, now that we have ample memory and CPU processing power. I think gets() would still cause buffer overflows for strings more than 255 characters, even if the string format were a length octet+array tuple.

    112. Re:Missed the point by jelle · · Score: 1

      Sigh? So you don't think it's a big deal that for example for a strcat() of two 100-byte strings you now need to copy _both_ strings, because the length field at the beginning now takes up more space, requiring the first string to be moved? How is that efficient?

      C doesn't _require_ you to use NUL terminated strings, you can do whatever you feel is cute. A 'char *' is nothing but a pointer, which you can use as a string, but you don't have to, nor the reverse. If you like prefixing a string with a length (vlc coded or not) in your code, you can feel free to do so... Or use C++ and use std::string, because many (all?) implementations of it store the size separately. And yes, std::string has a "max_size()"...

      If you don't use NUL terminated octet strings, you just won't be able to easily use many C libraries, because the rest of the world uses NUL terminated strings. And the reason for that is not a 'mistake', the author is wrong.

      --
      --- Hindsight is 20/20, but walking backwards is not the answer.
    113. Re:Missed the point by Anonymous Coward · · Score: 0

      12-11-0-8-9 == 120 base 10 - 78 base 16 - LE 360 instruction, according to my System/370 Reference Summary

    114. Re:Missed the point by solkimera · · Score: 2

      Java's String implementation is: char array, offset, length. That way when doing a substring, the resulting string uses the same char array.

    115. Re:Missed the point by Anonymous Coward · · Score: 0

      A 12-11-0-7-8-9 punch is clearly a set of holes punched in a column on an ibm card.

    116. Re:Missed the point by pilott · · Score: 1

      Wrong. strncpy always writes exactly n bytes (implementation details aside, the end result is that every byte of the buffer is either a copy of the corresponding source byte or 0). RTFM

    117. Re:Missed the point by ceswiedler · · Score: 2

      Are you suggesting strlen() should return the number of UTF-8 characters, not the number of bytes? That's insane... the entire point of UTF-8 is that stuff like strlen() can treat it as a narrow string. If you want to have a function for returning the number of printable characters in a UTF-8 string, that's going to be a separate function, and isn't any easier or harder with sized strings vs. null-terminated strings.

    118. Re:Missed the point by microbox · · Score: 1

      Sure, one could go to UTF-16 instead, but then there's a lot of wasted space

      Actually, if you think about it, this might not work either. It depends on whether you want to count delete characters or not. Most instances when you want the number of characters in a string, you want to consider delete characters -- and just think for a moment how you might need to do that.

      --

      Like all pain, suffering is a signal that something isn't right
    119. Re:Missed the point by kriston · · Score: 1

      "I think the state of programming is so bad now that people wouldn't test it. A major security bug in Blowfish was just found last month caused precisely because of a signed/unsigned char mismatch."

      It's beside the point, but the "security bug in Blowfish" is nothing of the kind. It is actually a security bug in a specific implementation of Blowfish, namely crypt_blowfish, that originally comes from the John the Ripper software.

      --

      Kriston

    120. Re:Missed the point by jelle · · Score: 1

      What you're looking for is 'strlcpy()' (or on Windows maybe 'strncpy_s()'). BSD has strlcpy(), I don't know why glibc (Linux) doesn't have it by default. Actually, I do sort of know: The glibc project leader doesn't like it: http://sources.redhat.com/ml/libc-alpha/2002-01/msg00002.html

      Ulrich left it out because he either doesn't like the (BSD) people who came up with strlcpy() (http://www.gratisoft.us/todd/papers/strlcpy.html), or it is what he implies, that he is convinced it encourages lazy programming and causes unexpected failure modes if the programmer forgets that strlcpy() can truncate a string (well he literally says something else, but if he literally means what he says then he just plainly doesn't understand where source strings can come from in the real world).

      Besides the fact that strlcpy() has a return value that indicates truncation, he (incorrectly) assumes that ignored truncation can only be bad. I say incorrectly, because there are plenty of cases where a string truncation by strlcpy() is no problem. One important one of them is when the source string is only too long for the buffer if it is either erroneous (and a truncated version of the string is not more problematic later on), or it is specially crafted to try to cause a buffer overflow.... Honestly, I think the missing strlcpy() in glibc is more about personalities than anything else. Oh well...

      The irony is that, as a result, some (many?) programmers are now using strncpy() and assuming it does the same thing as strlcpy(), causing more problems than strlcpy() even could have caused in Ulrich's imagination.

      Because strncpy() does not NUL-terminate in case of overflow, the destination is not guaranteed to be a string. It clearly was not meant to have a string as destination, but just a buffer (that's what the man page says too btw). It pads with zeros to prevent 'leakage' of previous data in the destination, in case later the destination buffer is forwarded/stored as a whole (not as a string).

      I think the main problem with that function is its name, it does not do what the name implies. A better name would be memstrcpy(dst,src,count), because it really is a memcopy where the source is a string and the destination is a buffer (it will never read past the end of the source string but always write the entire destination buffer).

      I wonder how many people don't have a clue about this. For example, even "{char buf[10]; strncpy(buf,something,9);}" is still not safe, because buf[9] is uninitialized and therefore not guaranteed to be NUL... If you want to use strncpy() and the destination buffer is later treated as a string, you really have to make sure yourself that the last char in your buffer is NUL. When you do use strncpy(), you can check afterwards if the string did fit (it's going to be NUL if it did, not NUL if it didn't), and take appropriate action... But that's a lot of code if you just want a string copy with length truncation, aka strlcpy()...
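
      The strncpy() pattern described above can be sketched roughly as follows. This is only an illustration: copy_checked() is a made-up helper name, not a standard function, and it assumes the destination is meant to be used as a string afterwards.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of any libc): copy src into dst with
 * strncpy(), report whether the whole source fit, and make sure dst is
 * always a valid NUL-terminated string afterwards. */
bool copy_checked(char *dst, size_t dst_size, const char *src)
{
    if (dst_size == 0)
        return false;                 /* no room even for the terminator */

    /* strncpy() writes exactly dst_size bytes: the source bytes followed
     * by NUL padding, or dst_size source bytes and no NUL at all. */
    strncpy(dst, src, dst_size);

    /* The last byte is NUL only if the source (plus terminator) fit. */
    bool fit = (dst[dst_size - 1] == '\0');

    dst[dst_size - 1] = '\0';         /* force termination on truncation */
    return fit;
}
```

      With a char buf[10], copy_checked(buf, sizeof buf, "short") reports that the string fit, while a 13-character source is cut to 9 characters plus the terminator and the function reports truncation.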

      --
      --- Hindsight is 20/20, but walking backwards is not the answer.
    121. Re:Missed the point by rubycodez · · Score: 1

      http://www.bitsavers.org/pdf/ibm/27xx/GA27-3005-3-2780_Data_Terminal_Description_Aug71.pdf

      all the meaningful combinations are on pages 10-13. EBCDIC had various combinations involving zero to six punches, you'd have to backspace and overpunch for many of them.

      My university got rid of the keypunches, readers, verifiers and sorters my sophomore year, 1983; I had used them to help my roommate who was a poor typist, he had an intro to computer class of some sort as a requirement for his Business Management degree and had to write some COBOL programs of a few dozen lines. Funny, to check his logic he would lay the cards in a line on the floor, with "branches" and "loops" looking like spurs on a model railroad.

    122. Re:Missed the point by spitzak · · Score: 1

      Absolutely agree, the FA gets it all wrong. At no time, on machines with 8K of memory, did anybody EVER consider using a 16-bit length. The choices were between using 1 byte for a string terminator, or one byte for a string length. The choices were either unlimited strings but one character cannot be in them, or strings limited to 255 characters. There were no other choices considered, anybody suggesting wasting more than one byte on every string constant would have been ignored.

      Some alternatives were worse than either Pascal or C, such as using the high bit in the last character to indicate the end of the string. There was also strong desire to compress strings to 6 bits or less per character (such compression is why filenames and identifiers were often case-insensitive and restricted to number, letter, and underscore). The fact that such things were seriously considered and implemented should show just how insane it would have been at that time to consider using 16 bits for the length.

      Anybody using Pascal at the time will tell you that C strings are a HUGE, HUGE, INFINITE win over the length limitation of Pascal. The RIGHT decision was made.

      Even if 16-bit numbers were somehow used, they would have been stored at the start of the string (NOBODY would ever consider passing anything larger than a pointer as an argument, C did not support passing structures by value at that time because it was considered prohibitively expensive). Since the PDP-11 could put these at odd addresses, code would certainly have taken advantage of this to pack strings together, making portability to machines that needed the length on even-only much more difficult. And attempting to change 16 bit to 32-bit length would have produced a huge fight, and probably resulted in two string constant types because of the need to support huge piles of back-compatibility.

      However I disagree that his example of embedded nuls is the worst. The gets style where buffers overflowed have been responsible for far more breaches and crashes than embedded nuls. Embedded nuls only break if there is actually a different function which uses something else to measure the string, such inconsistency is a source of bugs no matter what two schemes are used.

    123. Re:Missed the point by Anonymous Coward · · Score: 0

      They don't suggest to have taken a one-byte length, as this would consume exactly the same amount of memory as a zero-terminated string: instead of the zero at the end you store the length at the beginning.

      This implies that they were thinking about a two-byte length, which indeed needs an additional byte compared to a zero-terminated string.

    124. Re:Missed the point by spitzak · · Score: 1

      Python 3 does NOT get it right or better. It gets it MUCH worse and this is why everybody doing serious work is sticking to Python 2.x

      A Python "unicode" string cannot store arbitrary data. The only way to convert a UTF-8 string to it is to do a lossy conversion because errors must turn into codes that match things that correct UTF-8 encoding turns into as well. We should all know that lossy conversion is a serious flaw but too many people, including Theo, believe some magic pixie is going to make it physically impossible for invalid encodings to appear in a byte array. Nobody thinks there is magic making the byte zero not be able to be stored into the memory used by a C string, and the fact that people act as though this much more complex rule is going to be enforced by the laws of nature indicates a serious disconnect from reality, possibly due to an extreme desire to convince themselves that all the work they did to make their "wide characters" was not wasted.

      The idea that "Unicode" is done by making a list of "characters" that can each be individually and atomically looked at is what is preventing I18N from ever working. Unicode contains precomposed and decomposed characters, it contains joins and splits and direction indicators and language tags and BOM symbols, and many languages have varying ideas of what order the various glyphs are in a word. Every single person who has foisted "wide characters" and "UCS-2" and what Windows and Python call "Unicode strings" (which are often UCS-4 or UTF-16 except when it is mistakenly interpreted as UCS-2) is actively hurting internationalization and adoption of Unicode, probably more than the most redneck USA-First programmer could ever have done. And the shameful thing is that they believe they are helping, while making it harder!

      Strings should be arrays of bytes with a length (maybe the length should be in bits). Do you want to look at the "characters"? You use an ITERATOR!!!! You can use DIFFERENT iterators depending on how you want the "characters": ie precomposed, decomposed, all kinds of normalization, and they can correctly decode UTF-8 (or UTF-16) in-place and they can report encoding errors exactly and unambiguously so they are preserved and cannot be used for security violations. The expression string[n] should be ILLEGAL and undefined (this would also fix the reason C++ std::string cannot share memory between identical strings). The only things you can do is copy and concatenate strings, and split them at an iterator.
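
      A rough idea of what such an iterator over a byte array with an explicit length might look like in C is sketched below. Everything here (struct utf8_iter, utf8_next(), UTF8_INVALID) is a hypothetical name chosen for illustration, and the sketch only covers plain code-point decoding, not normalization or composition.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical iterator over a byte string that carries its own length. */
struct utf8_iter {
    const unsigned char *p;    /* next byte to decode */
    const unsigned char *end;  /* one past the last byte */
};

/* Sentinel for "this byte did not decode"; not a valid code point. */
#define UTF8_INVALID 0xFFFFFFFFu

/* Decode one code point and advance. On any malformed sequence, consume
 * exactly one byte and return UTF8_INVALID, so the offending byte can be
 * located and preserved. The caller must check it->p != it->end first. */
uint32_t utf8_next(struct utf8_iter *it)
{
    unsigned char b = *it->p;
    uint32_t cp, min;
    int extra;

    if (b < 0x80) { it->p++; return b; }
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; min = 0x10000; }
    else { it->p++; return UTF8_INVALID; }   /* stray or illegal lead byte */

    if (it->end - it->p <= extra) { it->p++; return UTF8_INVALID; }
    for (int i = 1; i <= extra; i++) {
        unsigned char c = it->p[i];
        if ((c & 0xC0) != 0x80) { it->p++; return UTF8_INVALID; }
        cp = (cp << 6) | (c & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        it->p++;
        return UTF8_INVALID;
    }
    it->p += extra + 1;
    return cp;
}
```

      Because an error consumes a single byte, a caller walking the bytes 'A', C3 A9, FF sees U+0041, U+00E9, and then one UTF8_INVALID at the exact position of the bad byte.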

    125. Re:Missed the point by spitzak · · Score: 1

      Although I mostly agree with you, I have to say that if you think strlen() should return the number of "characters" in UTF-8 then you are seriously mistaken. Such idiot-savant thinking has caused more trouble than anything else in getting Unicode to work.

      strlen returns the number of fixed-sized units in the string. All other answers are useless.

    126. Re:Missed the point by spitzak · · Score: 1

      No you do not need the number of "characters", EVER!!!!

      Thinking that Unicode is made of a string of "characters" is WRONG, WRONG, WRONG!

      Read up on precomposed and decomposed characters, and all the other nuances, before you make an idiot statement like this again.

      A simple one: do you count the BOM at the start of a UTF-16 string as one of the characters? Answer: it does not matter what your answer is, it will be wrong in half the situations.

    127. Re:Missed the point by spitzak · · Score: 1

      Are you sure Perl does not just store 8-bit strings?

      Transcoding UTF-8 to UTF-32 is a huge WASTE of processing time, not a saving. You are seriously over-estimating the number of times that somebody needs to get to the N'th code point without looking at the N-1 code points before it. Possibly by many orders of magnitude. All other operations on a string do not take any more time with UTF-8 than UTF-32.

    128. Re:Missed the point by TemporalBeing · · Score: 1

      The trouble is, C is a systems programming language, so it is imperative that the language allow direct access to bit-level implementation.

      Lisp was used as both a systems and application programming language, and it seems not to have these problems.

      But you need an AI to program Lisp, so your point is?

      --
      Truth is like the sun. You can shut it out for a time, but it ain't goin' away. - Elvis Presley (source: imdb.com)
    129. Re:Missed the point by TemporalBeing · · Score: 1

      Of course, MSVC doesn't follow standards and doesn't have this type nor stdint.h/inttypes.h at all.

      From what I understand, MSVC in VS2010 does have stdint.h. I don't know how complete it is, but they at least finally added it.

      --
      Truth is like the sun. You can shut it out for a time, but it ain't goin' away. - Elvis Presley (source: imdb.com)
    130. Re:Missed the point by TemporalBeing · · Score: 1

      "how do you pass a pointer to the "12345" to a function which converts strings to integers?" Easy, pass a pointer to the string and the function is smart enough to start after the count.

      You'd know that if you'd written some real programs instead of just sitting through an algorithms class.

      Except that doesn't work in an address+length world. You have to build the pointer in address+length format, but you can't do that without traversing it.

      --
      Truth is like the sun. You can shut it out for a time, but it ain't goin' away. - Elvis Presley (source: imdb.com)
    131. Re:Missed the point by Anonymous Coward · · Score: 0

      BCPL was the predecessor that used one-byte counted strings. It's true that you cannot easily suffix counted strings without writing a count in the middle of the string; but you cannot easily prefix null-terminated strings without writing a null in the middle of the string. Back in the days of PDP-11's with 128KB of memory, there were very few strings longer than 255 characters. (I know, I wrote a program and had to re-write it overnight to use 2-byte-count strings, in BCPL, when I found my dataset of 10,000 strings had a few that were longer than 255 characters.)

    132. Re:Missed the point by spitzak · · Score: 1

      That would NEVER have been considered in the 1970's.

      Such a structure would either have to be passed by copying it into two registers or stack locations, or a pointer to it would have to be passed resulting in a two-level indirection to get at the characters. We are talking about 1/2 MHz machines with 16K of memory and perhaps 512 bytes of space for the stack. C did not support passing structures by value at that time.

      What the article is proposing is more like:

          struct string
          {
              short length; // two bytes
              char buffer[];
          };

      where the allocated memory is actually sizeof(length)+length.

      I do not think such a structure was EVER considered either. Nobody would have thought to "waste" a byte on the very few strings that were longer than 255. The choices were to use a terminating character, or use a single byte length. At the time many languages thought it was worth saving yet another byte by making the high bit in the last character indicate the end of the string (since nobody would ever need more than 127 different characters, right?).
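
      For readers who want to see that layout in modern C, here is a minimal sketch of allocating it. It assumes a C99 flexible array member and uses uint16_t in place of short; the names counted_string and counted_string_new() are invented for the example.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative version of the layout above: a two-byte count followed
 * directly by the characters, with no terminator. */
struct counted_string {
    uint16_t length;    /* two bytes, so at most 65535 characters */
    char     buffer[];  /* C99 flexible array member */
};

/* Allocate sizeof(length) + len bytes and copy len bytes of data in.
 * Returns NULL if len does not fit in 16 bits or if malloc() fails. */
struct counted_string *counted_string_new(const char *data, size_t len)
{
    if (len > UINT16_MAX)
        return NULL;

    struct counted_string *s = malloc(sizeof *s + len);
    if (s == NULL)
        return NULL;

    s->length = (uint16_t)len;
    memcpy(s->buffer, data, len);   /* embedded NUL bytes are just data */
    return s;
}
```

      The caller frees the result with free(). Note that the 16-bit count caps the length in the same way a one-byte count caps it at 255.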

    133. Re:Missed the point by spitzak · · Score: 1

      I think strncpy was intended to convert null-terminated strings to fixed-length null padded strings, as used in many places in the Unix kernel at the time it was invented, like filenames.

      Yes, that is exactly its purpose (see the dirent structure from early Unix for the most obvious example of a fixed-sized buffer).

      The fact that strncpy sort-of works as a safe strcpy and that its name is very similar to strcpy is a real unfortunate mistake and has led to unbelievable problems.

      A proper safe strcpy is strlcpy. But it is not in the Linux libc because there are idiot savants in control of that.

    134. Re:Missed the point by spitzak · · Score: 1

      The Windows function is "strcpy_s" and is much less useful than strlcpy. In particular it is defined to actually "throw an exception" using a complex mechanism (since C does not have exceptions). This is really stupid, since it just turns a *possible* buffer overflow exploit into a *guaranteed* denial of service (anybody who actually did the work of catching the exception would also have been able to prevent the buffer overflow).

      In reality nobody uses the exception mechanism, and everybody relies on the default behavior. But this is also useless. It puts a nul at the start of the destination and returns an error indication. strlcpy instead puts the portion of the string that fits in the destination, and returns the length of the source. This makes the result useful for many purposes (for instance if the following code would never have looked at more than the buffer size of bytes anyway), and the return value is useful for allocating the correct-sized buffer.

      The _s functions are typical designed-by-committee crap and can be ignored. The correct function is strlcpy. Both Microsoft and glibc maintainers should be ashamed at their behavior in not supporting strlcpy.
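
      The strlcpy() contract described above (copy what fits, always terminate, return the source length) can be illustrated with a small stand-in. This is a sketch, not the BSD implementation, and it is named my_strlcpy() to avoid clashing with a libc that already provides the real function.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for the strlcpy() contract. Copies at most
 * dst_size - 1 bytes, always NUL-terminates when dst_size > 0, and
 * returns strlen(src) so the caller can detect truncation. */
size_t my_strlcpy(char *dst, const char *src, size_t dst_size)
{
    size_t src_len = strlen(src);

    if (dst_size > 0) {
        size_t n = src_len < dst_size - 1 ? src_len : dst_size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return src_len;   /* a value >= dst_size means the copy was truncated */
}
```

      Copying "hello, world" into an 8-byte buffer leaves "hello, " in the buffer and returns 12, which both signals the truncation and tells the caller that a 13-byte buffer would have been enough.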
       

    135. Re:Missed the point by Your.Master · · Score: 1

      The point of the article was what the ideal string encoding would be, with the argument that the magic terminator wasn't it. We can think of an infinite number of bad encodings, but I'm pretty sure the article was trying to come up with a good one.

      But anyway, the article argued that it would cost one more byte to do address + length. The entire premise of the article, then, is of a 2-byte length. 65k characters ought to be enough for anyone... or in any case, it's not unreasonable to require a library for ultra-massive buffers like that (such a library could also allow it to be non-contiguous, which the rare cases of ultra-long strings can benefit from, memory-wise).

    136. Re:Missed the point by Anonymous Coward · · Score: 0

      Hell, I've used languages where the statement separator was a 12-11-0-7-8-9 punch.

      Neat. Thanks for the history lesson (via google).

    137. Re:Missed the point by Your.Master · · Score: 1

      Every time? No.

      One performance advantage of NULL-terminated strings is you can trivially maintain two independent representations of the same string, one of which has a static prefix.

      char *str2 = str1 + prefix_length;

      Any non-reallocating modification of one string instantly affects the other.

      Is that kind of academic? You betcha'. Most of this is. If your performance is seriously gated by strlen, then you should use counted strings for that operation even in C (nobody will stop you).
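
      To make the aliasing concrete, here is a small sketch built around the "Content-Length: 12345" example used elsewhere in this thread. skip_prefix() is a made-up helper, and the point is only that the suffix view costs no copy and shares storage with the original.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return a view of str that skips a known prefix,
 * or NULL if str does not start with that prefix. Nothing is copied;
 * the result points into the same storage as str. */
const char *skip_prefix(const char *str, const char *prefix)
{
    size_t prefix_length = strlen(prefix);

    if (strncmp(str, prefix, prefix_length) != 0)
        return NULL;
    return str + prefix_length;
}
```

      Given char header[] = "Content-Length: 12345", the call skip_prefix(header, "Content-Length: ") yields a pointer that reads as "12345", and overwriting header[20] changes what that pointer sees as well.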

    138. Re:Missed the point by mini+me · · Score: 1

      That would NEVER have been considered in the 1970's

      That doesn't explain why it doesn't come standard in 2011. C++ gets to have two different kinds of strings, why can't C?

    139. Re:Missed the point by Anonymous Coward · · Score: 0

      Oops, you're right. It's strncat that doesn't pad. But strncpy does. Whoever thought that one up should be shot.

      Um. So let me rephrase : 'strncpy shouldn't pad'. But I guess we agree on that. ;-)

    140. Re:Missed the point by arth1 · · Score: 1

      The system strlen() should of course behave like it always has, but you should be able to easily replace it with a function that does (like Kragen's strlen_utf8()) for when you need the length in actual characters and not the number of bytes.

      That's not so simple on systems where the assumption has always been that the two are the same and you can just use the char counter.
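
      For reference, a code-point counter of the kind being argued about can be sketched in a few lines. This is not Kragen's strlen_utf8(), just a plain illustration under the assumption that the input is valid UTF-8; the name utf8_code_points() is invented here.

```c
#include <stddef.h>

/* Count UTF-8 code points by counting every byte that is not a
 * continuation byte (bit pattern 10xxxxxx). Assumes s is valid UTF-8:
 * it does not detect encoding errors, and a code point is not the same
 * thing as a glyph or a terminal column. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

      The precomposed "\xC3\xA9" counts as 1 while the decomposed "e\xCC\x81" counts as 2, even though both render as the same letter, which is the ambiguity other posters in this thread are objecting to.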

    141. Re:Missed the point by blair1q · · Score: 1

      It's not just specious, it's completely stupid.

      Having a length parameter limits the length of your strings. As you note, a 1-byte length field is no different from a 1-byte terminator in terms of space.

      But a 1-byte length field means you'll have to go through hoops to have a string longer than 255 bytes.

      A terminator means you have to go through hoops to escape that character if you need it to appear in the string, but that's a problem you only have to solve once, whereas the length field has to be re-implemented every time someone makes a bigger string.

    142. Re:Missed the point by darkwing_bmf · · Score: 1

      my_string : strings := "Content-Length: 12345";
      content_length : integer := my_function (my_string (17 .. my_string'length));

    143. Re:Missed the point by spazdor · · Score: 1, Informative

      would have, could have.

      --
      DRM: Terminator crops for your mind!
    144. Re:Missed the point by AuMatar · · Score: 1

      Because everyone gets C strings, and introducing it would be no gain. C provides a minimal library on purpose. If you want a new string type, roll your own or use one of dozens of already written libraries.

      --
      I still have more fans than freaks. WTF is wrong with you people?
    145. Re:Missed the point by Rockoon · · Score: 1

      You are confusing a String type with a Text type.

      It's not the rest of the world's fault that C programmers decided that a String is an array of Character. The Character part is one of the problems.. and is contrary to what was considered a String prior to C.

      --
      "His name was James Damore."
    146. Re:Missed the point by Darinbob · · Score: 1

      The fundamental concept hidden behind almost all of computer science is the tradeoff. If someone says "clearly A is better than B" then they're being superficial. It's better to say "A is preferred over B in this particular context and this particular goal". Even bubblesort has its uses and is the clear winner in some contexts.

      So when anyone says counted strings are better than null terminated strings, they are demonstrating ignorance or bias. There is a tradeoff here, there is no clear winner between the techniques. It depends upon the context and it depends upon what your goals are.

      There is no ideal string encoding that works for all contexts and goals.

    147. Re:Missed the point by Darinbob · · Score: 1

      C allows you to use whatever style of strings you want. You just have to do the work yourself, write your own libraries, etc. There are counted string libraries already if someone wants to use them.

    148. Re:Missed the point by Darinbob · · Score: 1

      And there already is a standard function for number of characters in a string, mblen(). I've found this stuff is vastly simpler to use than wchar_t and saves a ton of space.

    149. Re:Missed the point by spitzak · · Score: 1

      You misunderstood.

      The result of strlen_utf8() is useless. It could be the number of Unicode code points, or code points if the characters are decomposed (or any of the 4 types of normalization) or code units if converted to UTF-16 (with/without normalization), or it could be a glyph count, or a count of what people familiar with the languages would count as "characters", or it could be schemes where double-width characters count as 2 and invisible ones as zero to emulate old terminals, or any of a million other possibilities, without even adding questions of what to do with UTF-8 encoding errors. Thinking this value has some purpose is a sure sign that you do not understand Unicode and have not been doing serious string manipulation in your software.

    150. Re:Missed the point by Estanislao+Martínez · · Score: 1

      UTF-16 is not fixed width. It combines all disadvantages of UTF-8 and UCS4 while having no advantages of either.

      ...unless you're encoding Chinese, Japanese or Korean text, in which case it's much more compact than UTF-8.

    151. Re:Missed the point by darkwing_bmf · · Score: 2

      Really? How is Ada less powerful than C?

    152. Re:Missed the point by arth1 · · Score: 1

      The result of strlen_utf8() is useless. It could be the number of Unicode code points, or code points if the characters are decomposed (or any of the 4 types of normalization) or code units if converted to UTF-16 (with/without normalization), or it could be a glyph count, or a count of what people familiar with the languages would count as "characters", or it could be schemes where double-width characters count as 2 and invisible ones as zero to emulate old terminals, or any of a million other possibilities, without even adding questions of what to do with UTF-8 encoding errors. Thinking this value has some purpose is a sure sign that you do not understand Unicode and have not been doing serious string manipulation in your software.

      Yet the bumblebee flies.
      When the purpose is to (for example) break text so it actually fits on a fixed width output device, it sure works well. The algorithm is simple enough that you know what it does, which deflates your objections.

      Those who plug their fingers in their ears and say "NA-NA, can't do it, can't do it", and then don't do it are the ones causing problems.

      Which reminds me, when, by the way, will we get UTF-8 support on slashdot?

    153. Re:Missed the point by MrEricSir · · Score: 1

      "The incompatibility I speak of is the hypothetical one switching from 1-byte fixed-length length encoding to variable-byte length encoding."

      So you're comparing one thing that didn't happen to another thing that didn't happen, and calling that an incompatibility? Seems like a pointless exercise.

      --
      There's no -1 for "I don't get it."
    154. Re:Missed the point by Anonymous Coward · · Score: 0

      UTF-16 is not fixed width. It combines all disadvantages of UTF-8 and UCS4 while having no advantages of either.

      ...unless you're encoding Chinese, Japanese or Korean text, in which case it's much more compact than UTF-8.

      If you're only encoding CJK characters, yes. ASCII-heavy formats like HTML and modern XML-based word processor documents still typically come out smaller in UTF-8, even when all the "text" is CJK.

      As for fixed-widthness ... well, hands up anyone who has ever actually used a single character outside the BMP ...?

    155. Re:Missed the point by spitzak · · Score: 1

      The number of bytes in the UTF-8 string is about as accurate of a guess as to how wide a string will print.

      To actually measure a string you need to add up the widths of all the glyphs and escapements and handle kerning and compositing characters. I suspect such a function would have "width" in its name.

      The 0/1/2 type measurements that are used to emulate older fixed-pitch terminals that did Japanese encodings where all Japanese characters were double-width are occasionally useful, however I very much doubt that strlen_utf8() is returning that value.

    156. Re:Missed the point by WhiteDragon · · Score: 1

      Poor guy. I guess sooner or later he's going to have to learn how to manage his memory and understand how the underlying physical hardware works. That must be a real toughie for anyone who learned to "program" in the Java/C# world.

      Yeah, clearly PHK doesn't know anything about memory allocation. (Except for the malloc library he wrote for FreeBSD...)

      Maybe he should RTFM.

      I don't have a FreeBSD system at hand, but I wouldn't be surprised if the malloc page was written by PHK.

      hehe, WTFM :-D

      --
      Did you mount a military-grade, variable-focus MASER on an unlicensed artificial intelligence?
    157. Re:Missed the point by Anonymous Coward · · Score: 0

      The bottom line is that in an application programming language strings need to be atomic, as they are in Python.

      Please stop abusing Python for application programming.

      I am sick and tired of launching crucial system administration tools on my computer, only to have to wait 30 seconds or more before the GUI eventually appears. These painfully slow programs are almost invariably written in Python; the handful of programs that still perform well are almost invariably written in C or C++.

      I'm sure it runs really fast on your turbocharged desktop, but some of us are using economical, environmentally-friendly, low-powered computing devices, and our daily experience is that Python is most emphatically not fast enough to be used for general-purpose application development.

    158. Re:Missed the point by shutdown+-p+now · · Score: 1

      In Standard Pascal, strings are char arrays (no special type), and arrays are not max(255).

      In Borland dialects of Pascal (and derivatives), string was made a special primitive type, with byte-prefixed-length representation. But that was ages ago. All modern dialects use int32-prefixed-length instead.

    159. Re:Missed the point by shutdown+-p+now · · Score: 1

      ... this is why everybody doing serious work is sticking to Python 2.x

      We have already argued the correct handling of UTF-8 on /. with you in the past, so I won't touch on this topic. But as for the statement quoted above, I'm pretty sure that you're wrong. Most people stick to Python 2.x simply because it is better supported and/or they have 2.x-only dependencies and/or their hoster only offers 2.x. You're the only person from whom I've heard complaints about Python 3 handling of Unicode strings so far.

    160. Re:Missed the point by mgiuca · · Score: 1

      It's beside the point, but the "security bug in Blowfish" is nothing of the kind. It is actually a security bug in a specific implementation of Blowfish, namely crypt_blowfish

      Okay, when I said "Blowfish" I should have said "crypt_blowfish". Aside from that, did I say anything incorrect?

      I originally said there could (for example) be problems with a length field because some people would use a signed value and other people would use an unsigned value. The GP said he didn't think people would fail to test values greater than 128. I was pointing out that it happens all the time that people don't test edge cases, and crypt_blowfish is a perfect example.

    161. Re:Missed the point by Anonymous Coward · · Score: 0

      Couple that with the fact that streamed data tends to be read and written in blocks with a length parameter anyway, and the whole advantage is gone.

      find /dev /var/run -type c -o -type p

      Which produces blocks of characters (paths). The individual characters are meaningless until you have the whole block, and find knows the length of the path before it wrote it to the stream but Unix lacks the capacity to transfer that information without layering another program and protocol on top.

      Streams are always evil, bad and wrong anyway. There is no circumstance (outside of a hardware buffer for something like a sound card) where streaming couldn't be replaced with a packet/message based communication that would be as-fast/faster and easier to work with.

    162. Re:Missed the point by jrumney · · Score: 1

      Furthermore, if they chose a one-byte length, as the article so casually suggests as the correct solution

      To be one byte larger than the NUL terminated version, as the article states, it would have to be a two byte length, still arbitrarily limiting the length of strings, but to 65535 characters, not 255.
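
      The arithmetic can be written out as a small sketch. The function names are made up for illustration, and only one-byte and two-byte prefixes are considered.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for n characters plus the terminating NUL. */
size_t nul_terminated_size(size_t n)
{
    return n + 1;
}

/* Bytes needed for n characters behind a length prefix of prefix_bytes bytes. */
size_t length_prefixed_size(size_t n, size_t prefix_bytes)
{
    return n + prefix_bytes;
}

/* Longest string a prefix can describe; only 1 and 2 byte prefixes handled. */
size_t max_prefixed_length(size_t prefix_bytes)
{
    return prefix_bytes == 1 ? UINT8_MAX : UINT16_MAX;
}
```

      A one-byte prefix costs exactly what the NUL costs; only the two-byte prefix is one byte more, and it caps strings at 65535.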

    163. Re:Missed the point by KiloByte · · Score: 1

      As a roguelike player/developer, I can tell you many of the glyphs above BMP are hawt.

      I see people abusing "mathematical" letters a lot, too.

      Not to mention Chinese and Japanese family names -- some people care about writing them properly as well.

      --
      The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
    164. Re:Missed the point by Anonymous Coward · · Score: 0

      Poul-Henning Kamp, the author of Varnish, never misses an opportunity to promote himself – usually at the expense of (much more talented) others.

      Yesterday, it was a rant against Knuth in "You're Doing It Wrong" (PHK's contribution is merely about taking advantage of "locality" – and G-WAN does it much better):
      http://queue.acm.org/detail.cfm?id=1814327

      Today it is the turn of Ken Thompson, Dennis Ritchie, and Brian Kernighan, because PHK is unhappy with C (this may explain why Varnish sucks).

      Let's review PHK's article pertinence (published on the ACM and Slashdot sites, nothing less):

      0) "I have not found any record of the decision, which I admit is a weak point in its candidacy: I do not have proof that it was a conscious decision."

      All those who have found started by searching in the first place:

      http://cm.bell-labs.com/who/dmr/chist.html

      "In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled `*e'. This change was made partially to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator."
      – Dennis M. Ritchie

      BCPL being the parent of B which was the parent of C, it was a "conscious" decision and it was not "weak".

      PHK is wrong twice for a single point (and guilty of not doing his homework) – a good start.

      1) "Using an address + length format would cost one more byte of overhead than an address + magic_marker format"

      Right, if you limit the length of your strings (which zero-terminated strings do not do).

      And, unlike those he criticizes, PHK is obviously not considering the computational and bookkeeping overheads necessary to maintain the length field.

      2) "If the source string is NUL terminated, however, attempting to access it in units larger than bytes risks attempting to read characters after the NUL"

      The ending byte(s) problem is the same with an extra 'length' character.
      So much for "the solution"...

      3) "gets(3), which 'assume the buffer will be large enough', are a problem"

      Typical junk from a sub-standard programmer: using the wrong tool for the job.

      Maybe PHK would feel safer with C# or Java: they trade efficiency for pampering users – at the cost of gazillions of critical security holes in the language implementation.

      4) "Experience shows that such proposals go nowhere because backwards compatibility with the PDP/11 and the finite number of programs written are much more important than the ability to write the potentially infinite number of programs in the future in an efficient and secure way."

      And this is (shamelessly) written by the author of Varnish (the "Web server accelerator") which is much slower than mere "Web servers" like G-WAN, Nginx, Lighttpd, etc.

      http://gwan.com/imgs/apache-traffic-server_g-wan_lighttpd_nginx_varnish.png

      "PHK-the-politician" would be better inspired to spend some time learning how to write decent code. All he could risk would be then to become technically relevant, one day, maybe – if he ever started to work hard.

      Pierre.

      PS: Interestingly, anything posted with my email address as a comment for PHK's article is classified as SPAM ("Spam activity has been detected. Your comment has not been posted.") and rejected.

      "PHK-the-great" in action. No wonder why Varnish is used by MSFT "strategic partners"... the same causes lead to the same results: junk.

    165. Re:Missed the point by snowgirl · · Score: 1

          struct string
          {
              short length; // two bytes
              char buffer[];
          };

      First, there's anachronism here. The "char foo[];" notation is new, far newer than the first C implementations were.

      Also, the size_t (or rather, "int") in the original implementations was 16 bits, and the distinction between "short" and "long" did not occur until they introduced 32-bit machines. I mean, really... when the processor had 8-bit and 16-bit registers, why do you need two flavors of "int"?

      --
      WARNING! This girl exceeds the MAXIMUM SAFE standards established by the FDA for BRATTINESS
    166. Re:Missed the point by rjstanford · · Score: 1

      Although - ideally - if your language of choice was trying to have a String type, it would allow you to use string[n] to refer to the "standard" (glyph at the n'th position of the precomposed iterator), and let you use a library to see the String as a byte[]. Since, as you mentioned, you should always use the glyphs as glyphs and never as bytes, unless you were performing actual IO on them (which is comparatively rare in application development).

      --
      You're special forces then? That's great! I just love your olympics!
    167. Re:Missed the point by rjstanford · · Score: 1

      I'd still take NULL-terminated for most purposes.

      When massively hardware constrained, and when all programmers were good (and programs 100% debuggable/testable), that did make sense.

      Using a length marker rather than a NUL makes 50% of your string handling slightly more complicated, and in exchange it makes a relatively common error that often has truly disastrous security and stability consequences completely disappear. That's a good deal, at least from a "software engineering" standpoint. That doesn't mean that you can afford to take the deal, but if/when you can, you probably should.
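
      A minimal sketch of that trade, assuming a hypothetical counted buffer type (struct counted_buf and counted_append are invented names, not part of any standard library). Every append has to do a little extra bookkeeping, and in exchange an oversized write is refused instead of running past the end.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical counted buffer: capacity and length travel with the data. */
struct counted_buf {
    size_t cap;   /* total bytes available in data */
    size_t len;   /* bytes currently in use */
    char  *data;  /* not NUL-terminated */
};

/* Appends n bytes from src. Returns 0 on success, or -1 if they would
   not fit, in which case the buffer is left unchanged. */
int counted_append(struct counted_buf *b, const char *src, size_t n)
{
    if (n > b->cap - b->len)   /* len <= cap is an invariant, so no underflow */
        return -1;
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}
```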

      --
      You're special forces then? That's great! I just love your olympics!
    168. Re:Missed the point by snowgirl · · Score: 1

      Are you sure Perl does not just store 8-bit strings?

      Transcoding UTF-8 to UTF-32 is a huge WASTE of processing time, not a saving. You are seriously over-estimating the number of times that somebody needs to get to the N'th code point without looking at the N-1 code points before it. Possibly by many orders of magnitude. All other operations on a string do not take any more time with UTF-8 than UTF-32.

      After digging around a lot, I came across the answer, using Devel::Peek, I was able to determine, no. You're correct, it uses UTF-8 internally.

      There are other reasons to transcode to UTF-32 beyond just getting the N'th code point though. For instance by using UTF-32 internally, you ensure that you don't generate any overlong representations when generating UTF-8. As well, you don't have to constantly spend time composing characters from UTF-8 to compare to regex values and such. (Testing for the property IS_HEBREW means you have to turn each character into an integer value and then test the character properties.)

      --
      WARNING! This girl exceeds the MAXIMUM SAFE standards established by the FDA for BRATTINESS
    169. Re:Missed the point by snowgirl · · Score: 1

      I was actually going to comment about all this myself, then after I had written a whole paragraph of convoluted backtracking based on what is a glyph vs. character vs. codepoint, I decided to give up. lol.

      --
      WARNING! This girl exceeds the MAXIMUM SAFE standards established by the FDA for BRATTINESS
    170. Re:Missed the point by snowgirl · · Score: 1

      "The incompatibility I speak of is the hypothetical one switching from 1-byte fixed-length length encoding to variable-byte length encoding."

      So you're comparing one thing that didn't happen to another thing that didn't happen, and calling that an incompatibility? Seems like a pointless exercise.

      Someone else presented the original hypothetical... I just pointed out that the second hypothetical would have occurred and produced incompatibilities.

      --
      WARNING! This girl exceeds the MAXIMUM SAFE standards established by the FDA for BRATTINESS
    171. Re:Missed the point by rjstanford · · Score: 1

      Amazing how many of those confusions would have also been eliminated with one additional byte - the value returned by strsize() would have been much more obvious. As with the difference between weight and mass, when two different concepts commonly share the same value it's easy to confuse them - but terribly confusing later when the differences become critical.

      --
      You're special forces then? That's great! I just love your olympics!
    172. Re:Missed the point by rjstanford · · Score: 1

      C does not have strings .... It has a char and a pointer to a char - this means you have low level control of everything (and you have to cope with the consequences)

      Actually, I'd say it's worse than that; it has a byte - which is confusingly named char, but no Character. It also has a series of routines designed to work on arbitrary length byte arrays, some of which confusingly start with the letters str. Then C coders get to sniff at anyone who thinks that char is a Character, or that a char* (which is what the str...() routines all work on) is a String.

      --
      You're special forces then? That's great! I just love your olympics!
    173. Re:Missed the point by rjstanford · · Score: 1

      C doesn't _require_ you to use NUL terminated strings, you can do whatever you feel is cute. A 'char *' is nothing but a pointer, which you can use as a string, but you don't have to, nor the reverse.

      If C had used byte instead of char, and something like bzs (for null terminated byte sequence) instead of str in its standard library names, we'd all be much happier. Not only would people not think that a char* was a string (since it's an argument to strcpy), but one of the evolutions of C might have presented a real string type designed at a time when memory was less precious than it once was.

      Oh well.

      --
      You're special forces then? That's great! I just love your olympics!
    174. Re:Missed the point by rjstanford · · Score: 1

      This is true for many real "Strings". It's not true at all for arbitrary length blocks of bytes that don't happen to contain a NUL (also called "Strings" by many C programmers, a practice encouraged by the names of the str___() functions). And one thing that C does really well, as it should, is manipulate tons of arbitrary length blocks of bytes.

      --
      You're special forces then? That's great! I just love your olympics!
    175. Re:Missed the point by Anonymous Coward · · Score: 0

      Runtime engine?
      Pascal is compiled.

    176. Re:Missed the point by Anonymous Coward · · Score: 0

      Theo? ITYM Guido.

    177. Re:Missed the point by spitzak · · Score: 1

      you don't have to constantly spend time composing characters from UTF-8 to compare to regex values and such.

      Not true, you do not need to decode UTF-8 except for the character-set (square brackets in most syntaxes) match, and only if that set contains non-ascii. This can be done at the moment the matching is done, both for parsing the pattern and for reading the matched string, and there are plenty of regexp libraries that do so. Also pattern matching of 8-bit sequences can take a lot of advantage of lookup tables (you can fit approximately 2^13 8-bit lookup tables into the space that a UTF-32 lookup table would take) so it is usually advantageous to compile a UTF-8 regexp into a more complex one that is byte-based, if it is used multiple times.

    178. Re:Missed the point by spitzak · · Score: 1

      You seem to have read my post exactly backwards.

      The ONLY time there should be any interest in "glyphs" is when strings are rendered. Which is very, very, rarely. In addition it is impossible to get the actual Nth glyph without examining the entire string, and it can vary depending on the font, on the font layout software, and on whatever your concept of the serial order of the glyphs is.

      If string[n] exists it should return the nth code unit from the encoding. I very much recommend against implementing it at all because it makes the marching morons write "string[n] = tolower(string[n])" and other horrors that make I18N impossible.

    179. Re:Missed the point by spitzak · · Score: 1

      Yes I agree that the use of "length" is misleading (even for ASCII it was probably misleading as soon as there were proportional fonts).

      A more important source of confusion is huge amounts of dated documentation that says "character" when it should say "byte". This leads the clueless to think that "character" is very important and they must always count these for all measurements, no matter how hard or ill-defined it is. For instance, the Linux man page says strchr "returns a pointer to the first occurrence of the character c in the string" (and the argument c is an integer). The correct documentation is "returns a pointer to the first occurrence of the byte c in the string". But you can imagine the horrors that somebody with determination, rudimentary knowledge of UTF-8, and just enough programming talent to be dangerous could produce if they decided to "fix" the library to make it obey the incorrect documentation.
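
      A small sketch of what "byte, not character" means in practice. The wrapper first_byte_offset is a made-up name; strchr itself is the standard function. In UTF-8, "é" is the two bytes C3 A9, so a search for the byte 0xA9 lands in the middle of that character.

```c
#include <stddef.h>
#include <string.h>

/* Returns the byte offset of the first byte equal to `byte` in the
   NUL-terminated string s, or -1 if that byte does not occur. */
long first_byte_offset(const char *s, unsigned char byte)
{
    const char *hit = strchr(s, byte);
    return hit ? (long)(hit - s) : -1L;
}
```

      Called on "caf\xC3\xA9" with 0xA9, this reports offset 4, the second byte of the "é", which is exactly what a byte search should do and not what a "character" search would mean.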

    180. Re:Missed the point by snowgirl · · Score: 1

      I suppose that it is true that one can decompose any multibyte code range match into an 8-bit match sequence...

      But LUTs don't have to be the entire range, and the LUT doesn't need to exceed the size of the Unicode space. Why can't I use a LUT that is based on character_class[codepoint >> 7]? And then only contains 0x2200 elements?
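
      A rough sketch of that table, under the assumption that the classes of interest fall on 128-code-point boundaries (the CJK Unified Ideographs block U+4E00..U+9FFF does). The class ids and function names are hypothetical; a class that is not block-aligned would need a second level.

```c
#include <stdint.h>

#define BLOCK_SHIFT 7                               /* 128 code points per entry */
#define BLOCK_COUNT (0x110000u >> BLOCK_SHIFT)      /* 0x2200 entries */

enum { CLASS_OTHER = 0, CLASS_CJK = 1 };            /* hypothetical class ids */

static uint8_t character_class[BLOCK_COUNT];        /* zero-initialised: CLASS_OTHER */

/* Marks every 128-code-point block touched by [first, last] with cls.
   Both bounds are assumed to lie within U+0000..U+10FFFF. */
void mark_blocks(uint32_t first, uint32_t last, uint8_t cls)
{
    for (uint32_t b = first >> BLOCK_SHIFT; b <= (last >> BLOCK_SHIFT); b++)
        character_class[b] = cls;
}

/* O(1) lookup; anything past U+10FFFF is reported as CLASS_OTHER. */
int class_of(uint32_t codepoint)
{
    if (codepoint > 0x10FFFFu)
        return CLASS_OTHER;
    return character_class[codepoint >> BLOCK_SHIFT];
}
```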

      --
      WARNING! This girl exceeds the MAXIMUM SAFE standards established by the FDA for BRATTINESS
    181. Re:Missed the point by spitzak · · Score: 1

      I'm sorry I was talking about compiled tables used to implement fast regexp and searches, not direct implementations of character matching.

      An example of a lookup table like I was saying is if you wanted to match all characters in a character class, you could make an 8-bit lookup table for the first UTF-8 byte. Each entry either points to another lookup table for the second byte, or an "all true" or an "all false" indicator. In fact any practical method of matching subsets of Unicode works something like this which is why I really don't see any advantage in translating to UTF-32. The UTF-8 bytes are actually somewhat balanced toward frequency so that the most common characters are found with fewer lookups.
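
      A toy version of such a table, assuming the class is the Hebrew block U+0590..U+05FF, which UTF-8 encodes as D6 90..D6 BF and D7 80..D7 BF. The names are invented for the sketch, and the input is assumed to be well-formed UTF-8; a real regexp compiler would generate tables like these rather than hand-write them.

```c
#include <stddef.h>
#include <string.h>

enum verdict { ALL_FALSE = 0, ALL_TRUE, SECOND_LEVEL };

static unsigned char first_level[256];   /* holds enum verdict values */
static unsigned char second_d6[256];     /* second bytes accepted after 0xD6 */

/* Builds the tables for the Hebrew block U+0590..U+05FF. */
void build_hebrew_tables(void)
{
    memset(first_level, ALL_FALSE, sizeof first_level);
    memset(second_d6, 0, sizeof second_d6);
    first_level[0xD6] = SECOND_LEVEL;    /* only D6 90..D6 BF is Hebrew */
    first_level[0xD7] = ALL_TRUE;        /* every D7 80..D7 BF is Hebrew */
    for (int b = 0x90; b <= 0xBF; b++)
        second_d6[b] = 1;
}

/* Returns 1 if the well-formed UTF-8 sequence starting at p encodes a
   code point in the Hebrew block, else 0. */
int is_hebrew(const unsigned char *p)
{
    switch (first_level[p[0]]) {
    case ALL_TRUE:     return 1;
    case SECOND_LEVEL: return second_d6[p[1]];
    default:           return 0;
    }
}
```

      ASCII and every other lead byte are rejected after a single lookup, and no code point is ever assembled.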

    182. Re:Missed the point by snowgirl · · Score: 1

      I'm sorry I was talking about compiled tables used to implement fast regexp and searches, not direct implementations of character matching.

      An example of a lookup table like I was saying is if you wanted to match all characters in a character class, you could make an 8-bit lookup table for the first UTF-8 byte. Each entry either points to another lookup table for the second byte, or an "all true" or an "all false" indicator. In fact any practical method of matching subsets of Unicode works something like this which is why I really don't see any advantage in translating to UTF-32. The UTF-8 bytes are actually somewhat balanced toward frequency so that the most common characters are found with fewer lookups.

      As I said, you can reinterpret a match for a single unicode codepoint into a match for the UTF-8 sequence that would be equivalent. However, it would fail to match overlong sequences, so if you slip up, one might be able to get around your regex blocking access to any http path that includes ".." by using overlong codes. I know of at least one implementation that made an error of this sort, and introduced a vulnerability. True, scrubbing overlong codes probably isn't as resource (cpu time) intensive as transcoding to UTF-32, but they are still in the same O(n) category. The LUTs you're talking about vs the LUTs that I envision being available for UTF-32 still both have O(1) efficiency as well. (One can use a LUT on the MSB of a UTF-32 value as well (or rather knowing that no UTF-32 value over 0x10FFFF is valid, (codepoint >> 12) for a 9-bit index and an upper bound of 272). Of course, chopping up the pie can be done any number of ways to be more efficient for the specific data set intended.)
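
      For reference, a hedged sketch of a decoder that refuses overlong forms, so that C0 AE cannot sneak through as a ".". The name utf8_decode and its return convention are choices made for this example, not any particular library's API.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes one UTF-8 sequence from s (n bytes available).
   Returns the number of bytes consumed and stores the code point in *out,
   or returns 0 for malformed input: truncated, bad continuation byte,
   overlong form, surrogate, or above U+10FFFF. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len;
    uint32_t cp;

    if (n == 0) return 0;
    if (s[0] < 0x80)                { len = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else return 0;                  /* stray continuation or invalid lead byte */

    if (n < len) return 0;
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < min_for_len[len]) return 0;          /* overlong encoding */
    if (cp > 0x10FFFF) return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* UTF-16 surrogates */
    *out = cp;
    return len;
}
```

      The overlong check is a single comparison per sequence, so the pass stays O(n) either way.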

      You're also making an assumption about frequency that is deeply dependent upon Latin-1 being very common. There are a number of languages for which this assumption does not apply. Specifically, if we were running lookups on Gothic, Cuneiform, Egyptian hieroglyphics, etc, the process would actually perform 4 times more lookups than simple Latin-1 for nearly every character in the string.

      --
      WARNING! This girl exceeds the MAXIMUM SAFE standards established by the FDA for BRATTINESS
    183. Re:Missed the point by agbinfo · · Score: 1

      Agreed.

      Also, isn't the \0 terminated string a library implementation more than a language implementation? The char* points to a string of chars. The effective size of the string is library dependent. If you don't like that strlen() looks for a terminating \0, just use a different library.

      The fact that the std library still uses \0 terminated strings seems to indicate that it wasn't such a bad decision.

      Now that I've commented, I'll go read the article and maybe change my mind.

    184. Re:Missed the point by spitzak · · Score: 1

      It is possible to make it match overlong sequences, since they are patterns just like anything else.

      However it probably should not. If anything interprets overlong sequences as anything other than erroneous encodings then it is a bug in that section. You can't transfer the blame to the regexp matching. Conversely you certainly do not want to match overlong encodings if the further step does not interpret them that way.

      You seem to misunderstand where the efficiency comes from. It has NOTHING to do with Latin-1. The regexp is in fact compiled into numerous lookup tables that are indexed by the bytes. It is more efficient to match against Egyptian or any of the others using tens of thousands of 256-entry lookup tables than to match using only a dozen or so 2^21 entry tables. In fact all practical UTF-16 or UCS-4 regexp compilers work by splitting the codes into bit slices and using those rather than the code points as indexes into tables. The main advantage UTF-8 has is that you do not have to allocate memory for the converted copy, and much more obvious and useful handling of the overlong and other errors.

  3. Why not both? by Anonymous Coward · · Score: 0

    When you look at std::string it uses both, and is better for it; many uses are much easier and faster when we know the length and for others few things beat a null-terminated string.

    1. Re:Why not both? by MrEricSir · · Score: 1

      The wonderfully-named GString in GLib works the same way.

      The downside to this approach however is it requires some extra steps when retrieving a string from a C-based API. And of course if the external C-based library has a string handling bug, you're back to square one.

      --
      There's no -1 for "I don't get it."
    2. Re:Why not both? by c0lo · · Score: 2
      I'll argue that's the correct decision at such a low level as C.

      1. with NULL-terminated strings, there's no distinction (other than in the string.h and related library) between a char * and a other_type *. Inventing a "string" type in C (not C++) would have made the compiler more complex (see footnote **)
      2. because char * is no different than other_type* , I can pass the address in the middle of the string char * for processing. Not so much for a std::string. How does it matter? Well, take parsing for example (the most trivial strtok) not only that one will need an extra string-len prefix, but you'll need to keep a separate "curr_pos".
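
      Point 2 can be illustrated with a short sketch. The helper names are invented for the example; the only library calls are isspace and strchr.

```c
#include <ctype.h>
#include <string.h>

/* Returns a pointer to the first non-space character of s. The result
   points into the same storage and is itself a valid NUL-terminated
   string, so no copy and no separate position index are needed. */
const char *skip_spaces(const char *s)
{
    while (isspace((unsigned char)*s))
        s++;
    return s;
}

/* Returns the text after the first '=' in s, or NULL if there is none. */
const char *value_part(const char *s)
{
    const char *eq = strchr(s, '=');
    return eq ? skip_spaces(eq + 1) : NULL;
}
```

      Given "key =  value", value_part returns a pointer seven bytes into the original array, and that pointer can be handed to any other str function as-is.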

      If you have a NULL-terminated char* string, one can invent/use a std::string (or GString, or NSString, or Pascal-string). The reverse is not true: having the compiler accepting only Pascal-strings, it's not possible to start using the NULL-terminated convention.

      many uses are much easier and faster when we know the length and for others few things beat a null-terminated string.

      While in other cases (when you pass a std::string by-value and invoke the copy constructor, which tends to happen a lot), you have a hefty performance penalty.

      Footnote ** - Dennis M. Ritchie on the C history.

      C treats strings as arrays of characters conventionally terminated by a marker. Aside from one special rule about initialization by string literals, the semantics of strings are fully subsumed by more general rules governing all arrays, and as a result the language is simpler to describe and to translate than one incorporating the string as a unique data type.

      --
      Questions raise, answers kill. Raise questions to stay alive.
    3. Re:Why not both? by Psychotria · · Score: 2

      Aside from your apparent confusion between NULL-terminated (0x00) and NUL-terminated ('\0') I completely agree.

    4. Re:Why not both? by EvanED · · Score: 1

      While in other cases (when you pass a std::string by-value and invoke the copy constructor, which tends to happen a lot), you have a hefty performance penalty.

      And calling strcpy a bunch of times for the hell of it also does.

      What sort of situations are you running into when you are passing a std::string by value and you don't need the copy for correctness (because you're going to modify one)?

    5. Re:Why not both? by c0lo · · Score: 1

      And calling strcpy a bunch of times for the hell of it also does.

      What sort of situations are you running into when you are passing a std::string by value and you don't need the copy for correctness (because you're going to modify one)?

      If I need a copy, then (and only then) I need a copy and I'll be paying the price for a strcpy.

      --
      Questions raise, answers kill. Raise questions to stay alive.
    6. Re:Why not both? by EvanED · · Score: 1

      Um, so do that for strings too.

      Let me rephrase: why aren't you passing by const reference except where you need a copy?

    7. Re:Why not both? by c0lo · · Score: 1

      Example 1. Have you tried to use a std::vector (as a collection of constant strings)? Granted, you are able to switch to other constructs (like vector>), but ... this is one example of the perils of "Pascal-strings" and one of the unintended consequence as an added complexity to deal with in the "language" if Pascal strings would be the only way to do it.

      Example 2. Try boost::spirit. There are tricks to convince it NOT to use std::string as grammar attributes, but by default the "string by copy" is used.

      My points:
      1. sometimes you need to use constructs (or libraries) that work "by copy"
      2. if you have NUL-terminated strings as the "natural type" in the language, you can create your own constructs for Pascal-strings (length prefixed). If your one and only representation for the string is the P-string, you can't go and use the C-style string.

      --
      Questions raise, answers kill. Raise questions to stay alive.
  4. Hehe, ACM mentions Slashdot by Compaqt · · Score: 1

    That's the way it happens in Soviet Russia, too.

    Seriously, though, it's hard to know what language you as a system administrator should use for something like a data logger that has to run continuously (or cron every minute or so) other than C, but then there's the security problem that some user will come up with some weird filename hack to subvert the system.

    --
    I'm not a lawyer, but I play one on the Internet. Blog
    1. Re:Hehe, ACM mentions Slashdot by phantomfive · · Score: 1

      Isn't the filename hack a problem for any language? My simple method for avoiding it is putting all filenames in single-quotes, and filtering out any single-quotes from user input.

      --
      "First they came for the slanderers and i said nothing."
  5. not just a memory issue by Anonymous Coward · · Score: 0

    Doesn't the magic marker method give you string lengths limited only by available memory and not by the size of the piece of memory devoted to length?

    1. Re:not just a memory issue by Anonymous Coward · · Score: 0

      As pointed out in one of the comments on TFA: no, string length is not limited to available memory. It allows for arbitrary-length strings, aka stream processing, even strings larger than available memory.

  6. Mistake? by Anonymous Coward · · Score: 0

    I wouldn't call this a mistake. The paradigm of programming in C is largely based on nuances like this. It makes you write code in a certain way that, in my opinion, is better suited for certain situations. The alternative mentioned in the summary would have made it a bit closer to OO programming as far as strings go, which one can argue would have been better, but I prefer to have differences like this in lower-level languages.

  7. Maybe a better candidate by phantomfive · · Score: 5, Interesting
    C. A. R. Hoare, the inventor of Quicksort, also invented the NULL pointer. Something he apologized for:

    I call it my billion-dollar mistake. It was the invention of the null reference in 1965. At that time, I was designing the first comprehensive type system for references in an object oriented language (ALGOL W). My goal was to ensure that all use of references should be absolutely safe, with checking performed automatically by the compiler. But I couldn't resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years.

    --
    "First they came for the slanderers and i said nothing."
    1. Re:Maybe a better candidate by AK+Marc · · Score: 0

      A project is never finished.

      Sadly, I bet you make a mint as a project manager. I've never worked with a competent project manager. Quips like that are directly opposite of all project management best practices, but believed by all project managers I've worked with "in the wild." A project must end or it's a program, not a project. A program never ends. A project must have a defined end before it starts or it is not a project. But being wrong about everything must make one a good project manager, since I've never seen a project manager (usually making three times the next highest paid on their team) that was ever right on anything, ever.

    2. Re:Maybe a better candidate by GrandTeddyBearOfDoom · · Score: 1

      The problem is people programming in low level languages who lack the mental discipline to do so. You may give excuses of time, lack of anyone better, etc. but fundamentally low level programming requires a disciplined and trained mind, and we never gave training such minds the priority it deserved: we just produced programmers the quick and easy way.

      Hoare has nothing to apologise for. If NULL references weren't there, we'd be forced to jump backward somersaults through random hoops in order to achieve what they manage, which is to temporarily divorce reference from meaning. This is crucial to human thought and it was correct to have it in C, it just gives undisciplined programmers sufficient rope to very artistically hang themselves. What should have been written was a stdref library that abstracts all functionality of C references besides the NULL pointer and programmers taught that by default. Trying to take NULL out of C would be like trying to take 0 out of mathematics. Have a go, but don't expect things to be anywhere near as elegant.

      --
      -- The Grand Teddy Bear has Spoken: "Windows 8 Source Code Available NOW! more disgusting than your pr..."
    3. Re:Maybe a better candidate by phantomfive · · Score: 1

      Sadly, I bet you make a mint as a project manager.

      lol I hope someday I can find out. I am but a programmer, and I grabbed a quote from fortune because I was tired of my other sig.

      since I've never seen a project manager (usually making three times the next highest paid on their team) that was ever right on anything, ever.

      Maybe the problem is you suck as a programmer. Just a thought.

      --
      "First they came for the slanderers and i said nothing."
    4. Re:Maybe a better candidate by RamblinWreck33 · · Score: 1

      without null, I can't imagine what would have been the case for implementing so many programming constructs today. Most languages have some type of isempty() which can be seen as a continuation of null, and I for one wouldn't want to implement any sort of list without it. Programming in assembly, you don't get NULLs (at least not MIPS) and that's one of the difficulties (among many).

    5. Re:Maybe a better candidate by phantomfive · · Score: 1

      As much as I hate to say it, C# has an interesting feature where most variable-types can not be set to NULL, unless you specifically declare them to be 'nullable' (when you declare the variable, not the type). This avoids a large number of situations where NULL is more of a hindrance than a help, and in the cases where you actually need it, it's available.

      Objective C has a standard object called 'nil,' which you can insert into lists and such, but can be sent messages. This prevents several types of NULL pointer bugs, although the language also does allow traditional NULL pointers.

      You can most certainly have NULL pointers in assembly, even MIPS. Everything in C compiles down to assembly, after all.

      --
      "First they came for the slanderers and i said nothing."
    6. Re:Maybe a better candidate by caerwyn · · Score: 1

      nil and NULL are identical; they're just casts of 0. The reason that you can send messages to nil is that the objc_msgSend() function (the runtime bit that does the actual message lookup and call on objects) does a NULL check for you and immediately returns 0 if you're messaging nil/NULL.

      --
      The ringing of the division bell has begun... -PF
    7. Re:Maybe a better candidate by mike.mondy · · Score: 2

      Huh? Not allowing a "null" pointer or otherwise illegal pointer value makes no sense. Either the pointer represented by all zeros is a valid pointer or it's an illegal value. If it's not treated as somehow special or illegal, it's by definition valid. Which would not be nearly as useful as having it be illegal. At best, it would be treated like any other random bit pattern in a pointer -- maybe pointing to legal memory and maybe not. In most languages close to assembly, a valid all zeros pointer would probably point to the beginning of memory; in virtual memory systems, it would probably point to the beginning of the process's space. IIRC, Algol had "references" that were not pointers. Actually Algol is the language with pass-by-name and thunking where it's infamously impossible to write a swap(a,b) function that could handle something like swap(A[i], i).

      The word reference doesn't always mean exactly the same thing as pointer (See C++). I imagine Hoare did not mean "null pointer" when he said "null reference".

      Trivia: Multics had multiple illegal pointers. I think they were 0 for null, -1 for new-process, and -2 for disconnect process. (Terminal login sessions had a single process). The "new_process" command that threw away your (single) munged process and gave you a fresh clean one was implemented in maybe two lines of code looking vaguely like: declare pointer shiva = addr(-1); *shiva = 666;

    8. Re:Maybe a better candidate by phantomfive · · Score: 1

      Check out C#. It does a pretty good job. Just because you don't see how it can make sense, doesn't mean it makes no sense.

      --
      "First they came for the slanderers and i said nothing."
    9. Re:Maybe a better candidate by RightSaidFred99 · · Score: 1

      But C# has null. Just because you have value types/nullable/(int?) doesn't mean null can be easily dispensed with in a language with the concept of "a reference to something". That's just a difference between value types and reference types, and C# supports both.

      Unless you want to pass huge objects around, you "need" references. I put quotes around need as I'm sure there's some fancy solution, but why bother.

    10. Re:Maybe a better candidate by AK+Marc · · Score: 1

      I do suck as a programmer, that's why I don't do that for a living. But I work on ISP-sized networks on a regular basis, so I deal with project management on a daily basis and I have a master's degree in IT project management, so I am very accurate in identifying the projects that will fail before they even start. And any competent project manager could do so and either fix the problems before they get unmanageable or refuse to take a job that's a guaranteed failure. But they never do. Telling the person signing the checks that they are incompetent without getting fired takes more skill than any PM I've ever seen.

    11. Re:Maybe a better candidate by Barefoot+Monkey · · Score: 1

      nil and NULL are identical; they're just casts of 0. The reason that you can send messages to nil is that the objc_msgSend() function (the runtime bit that does the actual message lookup and call on objects) does a NULL check for you and immediately returns 0 if you're messaging nil/NULL.

      Be careful about thinking of null as 0. Null pointers and zero pointers are different concepts that just happen to coincide sometimes. Although it's usually the convention on x86 that zero pointers are considered null, it's not strictly the case in general. Quite often there's a completely different convention, like 0 being a valid pointer and -1 being null. Part of the confusion comes from the fact that C uses the '0' character as a symbol for null to avoid having an extra keyword. The C code "char *pointer = 0;" doesn't necessarily give the pointer a zero value; it assigns whatever value the compiler is configured to treat as null.
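      A minimal sketch of that point, assuming only standard C: the literal 0 in a pointer context is a null pointer constant, so the three usual spellings of the null test agree whatever bit pattern the implementation uses for a null pointer. The function name is_null is made up for illustration.

```c
#include <stddef.h>

/* Returns 1 if p is a null pointer, 0 otherwise.
 * Each test compares against the null pointer, not against
 * "the address whose bits are all zero". */
int is_null(const char *p)
{
    int by_literal = (p == 0);     /* 0 here is a null pointer constant */
    int by_macro   = (p == NULL);  /* NULL expands to one as well */
    int by_not     = !p;           /* !p is defined as (p == 0) */

    /* On a conforming implementation all three have the same value. */
    return by_literal && by_macro && by_not;
}
```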

    12. Re:Maybe a better candidate by Zenin · · Score: 1

      It's a damn shame he never filed a patent on the NULL pointer, he could have made bank!

      Of course, if the new "patent reform" law passes he'll have another chance to be "first to file" the new patent on the NULL pointer and qualify to sue nearly everyone that's touched code in the last forty years. Whoohoo!!

      --
      My /. uid is better then your /. uid
    13. Re:Maybe a better candidate by jeremyp · · Score: 1

      You can't insert nil into an Objective-C collection. You are thinking of NSNull.

      nil is effectively the same as C's NULL but the message dispatch system catches the case where the receiver is nil and effectively turns the dispatch into a no-op (except for setting the return value to 0).

      --
      All I want is a secure system where it's easy to do anything I want. Is that too much to ask ~~ Randall Munroe
    14. Re:Maybe a better candidate by Anonymous Coward · · Score: 0

      'caused a billion dollars' true but it's probably still not even a fraction of the cost lost due to the human operators being lazy, distracted, talking etc.

    15. Re:Maybe a better candidate by Dog-Cow · · Score: 1

      I believe that the GP was referring to Nil, not nil. Nil (with the uppercase N) is an object in Objective C. NULL/nil cannot, in fact, be added to standard collection classes, such as NSArray or NSDictionary (as a key), but Nil can be.

    16. Re:Maybe a better candidate by phantomfive · · Score: 1

      lol I'm sorry your job sucks, but I can do nothing for you.

      --
      "First they came for the slanderers and i said nothing."
    17. Re:Maybe a better candidate by Anonymous Coward · · Score: 0

      I don't see how this is a bad idea. A pointer can become invalid. If you *know* it is going to be invalid, why not set it to a "known invalid" state instead of leaving it invalid and letting the program choke on it because there's no way to check for its validity.

    18. Re:Maybe a better candidate by quietwalker · · Score: 1

      The only reason he's calling it a mistake is because he's a computer scientist. There's a deep desire to deliberately keep theory and practical usage separate. That's why a pure implementation of scheme, for example, has no mechanisms for input or output. It's why some languages like APL ignored the concept of using syntactical sugar, and instead required mathematical symbols and a specialty keyboard which did not actually exist when the language was created. It was never really meant to be run - only written on paper.

      So, there's a deliberate difference between computer scientists and computer programmers. I recently explained it to my managers like this: Say someone needs to connect two systems with a cable, but they didn't say whether the ends were male or female. A computer scientist will figure out the number of possible combinations and bring 4 cables. A competent computer scientist will eliminate the redundancy, and bring 3. A computer programmer will bring 2 (since m-m connects to f-f to produce m-f).

      ( A really good computer scientist might bring one, citing that while the worst case is two trips, the average case is one - if he understands the domain. A programmer will do the same thing for the same reason, but have no logical justification.)

      Don't get me wrong, living in the land of theory lets one produce great results - look at quicksort, as mentioned by the parent, but sometimes the desire to escape to pure mathematics produces practical issues.

    19. Re:Maybe a better candidate by DetriusXii · · Score: 1

      without null, I can't imagine what would have been the case for implementing so many programming constructs today. Most languages have some type of isempty() which can be seen as a continuation of null, and I for one wouldn't want to implement any sort of list without it. Programming in assembly, you don't get NULLs (at least not MIPS) and that's one of the difficulties (among many).

      I like using Scala's Option monad (it's the Maybe monad in Haskell). Nulls are bad because the compiler can't usually check against them, and sometimes you can't tell if a called function will return NULL as its return value. With the Maybe monad, defined as Maybe q = Just q | Nothing, the compiler will check when Maybe is being passed around and give warnings saying that the Nothing case should be handled. It works elegantly to handle issues where sometimes NULL has some semantic meaning but also catches NULL errors at compile time rather than catching NULL errors at run time.

    20. Re:Maybe a better candidate by spitzak · · Score: 1

      ( A really good computer scientist might bring one, citing that while the worst case is two trips, the average case is one)

      Seems to me the average number of trips would be 3/2. There is only a 1/3 chance that the cable they bring will work, and a 2/3 chance of two trips.

      You may be confusing yourself with your previous clever example of why only two cables would be needed, but one of the possibilities requires *both* cables.

    21. Re:Maybe a better candidate by Have+Brain+Will+Rent · · Score: 1

      With all due respect to Hoare (and he deserves lots of it) the nul byte isn't the problem. The problem is almost always people who don't know how to program (or do and are too lazy) and thereby create problems - entirely preventable problems.

      We still see software updates being made to fix bugs originating in, for example, buffers getting overwritten by strings too long for the buffer, because the programmer is too lazy to spend 2 minutes putting in a check. Instead of that one person putting in 2 extra minutes millions of people get to spend 1 minute installing the fix and the debuggers spend god knows how much time trying to track down the source of the problem in the first place.

      To anticipate some of the usual objections:

      "it will make the code inefficient" - let's talk about the 99.9% of the cases where that doesn't matter
      "I don't have time for that" - yeah, you do, you aren't that important buddy
      etc.

      --
      The tyrant will always find a pretext for his tyranny - Aesop
    22. Re:Maybe a better candidate by spitzak · · Score: 1

      Well, actually, if both ends are completely random, and they bring a M-F cable, there is a 1/2 chance that the cable will work (as it works with both the M+F and F+M situations). Only in the F+F and M+M situations would you have to go back and get another cable.

      So it seems like you can make a 1/2 chance of bringing the correct cable, like you said. If you assume that the work of returning the unused cable is non-zero (imagine you have to stand in line at Fry's return counter) then this could easily be a rational decision.
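      The arithmetic can be checked by enumeration. This is only a sketch under the comment's own assumption that each end is independently male or female with equal probability; the function name is hypothetical.

```c
/* Expected number of trips when bringing a single M-F cable.
 * a and b are the genders of the two ends (0 or 1); the cable fits
 * on the first trip only when the ends differ, otherwise a second
 * trip is needed. */
double expected_trips_mf_cable(void)
{
    int total = 0, trips = 0;
    for (int a = 0; a < 2; a++) {
        for (int b = 0; b < 2; b++) {
            total++;
            trips += (a != b) ? 1 : 2;
        }
    }
    return (double)trips / total;   /* (1 + 1 + 2 + 2) / 4 */
}
```

      Two of the four cases need one trip and two need two, so the average is 1.5 trips.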

    23. Re:Maybe a better candidate by Have+Brain+Will+Rent · · Score: 1

      Sigh... should have proofread one more time (maybe I'm making my own point about lazy programmers - yeah, yeah, that's the ticket)... In the parent "the nul byte isn't " should have read "things like the nul byte and null pointer aren't".... mea culpa

      --
      The tyrant will always find a pretext for his tyranny - Aesop
    24. Re:Maybe a better candidate by Estanislao+Martínez · · Score: 1

      without null, I can't imagine what would have been the case for implementing so many programming constructs today. Most languages have some type of isempty() which can be seen as a continuation of null, and I for one wouldn't want to implement any sort of list without it.

      The fundamental mistake that most programming languages with a "null" value make is that they fail to distinguish between what ought to be two different types: basic types and nullable types. The basic idea is that Foo and nullable Foo should be two different types, such that the first one is always guaranteed to refer to some Foo, and only the second one may be null. You can always cast a value of one of these two types to the other; the cast from Foo to nullable Foo always succeeds (and is in fact a noop), but the cast from nullable Foo to Foo will produce a runtime error if the runtime value is null.

      To address your example, a singly linked list is simply a record that contains two fields: a content field and a nullable field for the next record. If the content field is not of a nullable type, then none of the list elements can be null.
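      A minimal C sketch of that record, with the caveat that C itself cannot express a non-nullable pointer type: here the content is stored by value so it cannot be null, and next is the only field allowed to be NULL. The names node and list_length are illustrative.

```c
#include <stddef.h>

/* One list cell: `content` is stored by value, so it can never be
 * "null"; only `next` may be NULL, marking the end of the list. */
struct node {
    int content;
    struct node *next;   /* the one nullable field */
};

/* Count the cells; the only null check needed is on `next`
 * (and on the head, for the empty list). */
size_t list_length(const struct node *head)
{
    size_t n = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        n++;
    return n;
}
```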

      This stuff is trivial to implement. It doesn't eliminate null pointer errors, but it makes them happen precisely at the points where you needed to have null checking. As things are in most languages today, a null pointer will often get passed to and returned from dozens of functions or methods before it reaches one that actually uses it and gets stuck with an error that was caused somewhere else.

      You may want to look at how data types work in Haskell, which has a more general solution that lumps nullability with enumerated and union types.

    25. Re:Maybe a better candidate by AK+Marc · · Score: 1

      My job is fine. Every project manager sucks and my interactions with them suck as well. I like designing things with multi-million dollar price tags, even if part of a project that will fail because of incompetent management. I still do my part well and enjoy it, even if the whole of the implementation will be late and over budget.

    26. Re:Maybe a better candidate by shutdown+-p+now · · Score: 1

      As much as I hate to say it, C# has an interesting feature where most variable-types cannot be set to NULL, unless you specifically declare them to be 'nullable' (when you declare the variable, not the type). This avoids a large number of situations where NULL is more of a hindrance than a help, and in the cases where you actually need it, it's available.

      Wish this was true, but it isn't, quite.

      C#'s type system divides all types into reference types and value types. Object, string etc are reference types; int, long, DateTime etc are value types. Unlike Java - which also has a similar system - C# allows the user to define new types in both categories.

      Now, all reference types are implicitly nullable. If you write "String s", it can be null. This is largely because null is the default value, used whenever something is needed for initialization purposes (e.g. for static fields, or for array elements).

      All value types are not nullable by default (same as Java). The default value for a value type is, effectively, "all bits zero". However, there is also a generic value type Nullable<T>, for which C# has syntactic sugar - T?. Now this one can have either null value, or an actual value of type T.

      However, since all reference types are nullable - and most types are reference types - this doesn't help all that much in practice.

    27. Re:Maybe a better candidate by shutdown+-p+now · · Score: 1

      The word reference doesn't always mean exactly the same thing as pointer (See C++). I imagine Hoare did not mean "null pointer" when he said "null reference".

      In the context of ALGOL, it does mean the same thing. And it's precisely what Hoare meant. For an example of how null references are handled properly, see type "option" in ML, "Option" in Scala, or Maybe in Haskell.

      The point is that the value domain of the type should not by default have some special magic value for which most operations are suddenly invalid (and will fail at runtime!). It's defeating the purpose of type checking. The proper way to do so is to force the programmer to 1) explicitly mark all places where nulls may occur, and 2) check for null before trying to do anything with a nullable value. This is best done at the type system level, where for every type T, you have a distinct type that represents "T or null".

      Note that this all has little to do with physical representation. For pointers/references, it's logical to represent a nullable pointer in the same way as a non-nullable one, and treat all-bits-zero as null. For something like nullable int, you'd need a separate flag to indicate null. But this is something that the compiler should take care of, not the programmer. The point is that type system should not let you do crazy stuff without explicit unsafe casts.
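      As a sketch of the "separate flag" representation for a nullable int, written by hand in C since the language will not generate it for you. All names here (opt_int, some_int, no_int, opt_int_or) are hypothetical, and C cannot force callers to check the flag the way an option type in ML, Scala or Haskell does.

```c
#include <stdbool.h>

/* Hypothetical "int or null": the flag carries the nullness that a
 * plain int has no spare bit pattern for. */
struct opt_int {
    bool has_value;
    int  value;      /* meaningful only when has_value is true */
};

struct opt_int some_int(int v)
{
    struct opt_int o = { true, v };
    return o;
}

struct opt_int no_int(void)
{
    struct opt_int o = { false, 0 };
    return o;
}

/* Unwrap with an explicit fallback, so the caller has to decide
 * what "null" means at this point. */
int opt_int_or(struct opt_int o, int fallback)
{
    return o.has_value ? o.value : fallback;
}
```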

    28. Re:Maybe a better candidate by shutdown+-p+now · · Score: 1

      Null is not an inherent difference between value and reference types. The difference is that, for reference types, whenever you use them in declaration, there's always an implicit level of indirection there. But it does not require that the value domain of the reference includes null.

    29. Re:Maybe a better candidate by shutdown+-p+now · · Score: 1

      The problem isn't null. The problem is that every reference type gets null by default.

    30. Re:Maybe a better candidate by phantomfive · · Score: 1

      Oh yeah, I misunderstood it.

      --
      "First they came for the slanderers and i said nothing."
    31. Re:Maybe a better candidate by caerwyn · · Score: 1

      Hmm, yes, that might be a better reading of his intention. That said, it's probably worth pointing out that 'nil' is used *far* more often than 'Nil'; few (if any) standard Cocoa APIs return Nil. The only time I've really seen it used is specifically to be placed into collections.

      --
      The ringing of the division bell has begun... -PF
    32. Re:Maybe a better candidate by caerwyn · · Score: 1

      They might be different in theory; they are not in practice. In addition, in Objective-C, C, and C++, NULL is, in fact, (void *)0L. This is not something that is likely to *ever* change, given the absolutely enormous body of code that assumes that (!pointer) is identical to (pointer == NULL); this is not something limited to x86.

      Also, char *pointer = 0; being anything other than 0 is rubbish. There are systems for which 0 can contain valid data, and therefore you must be able to assign 0 to a pointer.

      --
      The ringing of the division bell has begun... -PF
    33. Re:Maybe a better candidate by Estanislao+Martínez · · Score: 1

      I don't see how this is a bad idea. A pointer can become invalid. If you *know* it is going to be invalid, why not set it to a "known invalid" state instead of leaving it invalid and letting the program choke on it because there's no way to check for its validity.

      You're missing the point. The important idea here is that a reference that might be invalid should not have the same type as one that's guaranteed to be valid, and that the latter one should be the common case. I.e., the language should not force you to accept a nullable pointer just because you want a pointer; you should be able to write code that assumes that the pointer is not null, and the compiler should enforce that such code can't be called with a null pointer. If the caller has a nullable pointer, then the caller's going to have to prove it's not null before passing it to your function.

    34. Re:Maybe a better candidate by Barefoot+Monkey · · Score: 1

      They might be different in theory; they are not in practice.

      No, it is a purely practical matter. It's the difference between code working reliably and not.

      In addition, in Objective-C, C, and C++, NULL is, in fact, (void *)0L.

      NULL is a macro whose definition is implementation-defined. It is most commonly defined as 0 or 0L, but I have also seen (void *)0 and even (char *)0. I just looked at string.h from gcc 4.3.4 and it is defined as 0. You're looking at this and thinking (void *)0 is reinterpreting the number 0 as a pointer. It is not, and none of the languages that you mention allow you to convert integers into pointers. 0L, in this context, is implicitly a pointer which may or may not have a numeric value of 0 (look at DOS - it had at least different kinds of pointers with different sizes, some of which were tuples and therefore didn't evaluate to single numeric values at all).

      This is not something that is likely to *ever* change, given the absolutely enormous body of code that assumes that (!pointer) is identical to (pointer == NULL); this is not something limited to x86.

      It changes constantly, but it's more of an implementation thing than a platform thing. (Of course, the platform plays a big role in the implementer's decision.)

      Also, char *pointer = 0; being anything other than 0 is rubbish. There are systems for which 0 can contain valid data, and therefore you must be able to assign 0 to a pointer.

      Those are the very systems which tend to define null as something other than a pointer to memory at address 0. C, C++ and Objective-C do not allow you to assign integer values to pointers anyway - literal zero integers implicitly evaluate to null pointers (whatever those are) which can be anything. Let me give you an example.

      int zero = 0;
      int null_is_zero = (0 == *(void **)(&zero)); /* nonzero if and only if null is defined as zero. Might break if size of int and void* are different. */
      char *pointer = 0; /* pointer is null, which is who-knows-where */
      pointer = zero; /* error. zero has a value of 0, but there is no implicit integer-to-pointer conversion. */
      pointer = 1; /* error - you can't assign integers to pointers */
      pointer = *(char **)(&zero); /* dangerous because sizeof(char*) might be different to sizeof(int). pointer now points to memory with address 0, and may or may not be a null pointer */

    35. Re:Maybe a better candidate by Anonymous Coward · · Score: 0

      You didn't address access to an invalid pointer (for instance: memory you don't own) at all with that statement, only access to a specific invalid pointer (null).

  8. "typical and rational IT or CS decision" by NapalmV · · Score: 0

    They don't look the same to me, these days the "IT" decisions are taken by the MBA type guys, with the sole purpose of maximizing their chances to get more visibility, "exceed objectives" and get a larger bonus/promotion/whatever. Sure they're rational too but what do they have in common with CS?

    1. Re:"typical and rational IT or CS decision" by perpenso · · Score: 2, Insightful

      They don't look the same to me, these days the "IT" decisions are taken by the MBA type guys, with the sole purpose of maximizing their chances to get more visibility, "exceed objectives" and get a larger bonus/promotion/whatever. Sure they're rational too but what do they have in common with CS?

      Programmer for 20+ years here, BS and MS in CS. I used to share such opinions. Then I went to business school. I really enjoyed business school in part because I was constantly amused by how ignorant and wrong I had been regarding such opinions. May I be bold enough to suggest that the portrayal of MBAs in popular and nerd cultures are about as accurate as the portrayal of programmers in popular and non-nerd cultures.

      None of the above should be interpreted to mean that business school makes one appreciate Dilbert any less. Dilbert is actually pretty popular with MBA types and their professors as well.

    2. Re:"typical and rational IT or CS decision" by RyuuzakiTetsuya · · Score: 1

      kind of agree with you and the parent.

      I mean, I think what the parent's saying about management is largely true. They're out to save their own asses, but you're right that there's something else going on behind the scenes. They really do need to look after our own interests too. It wouldn't add up if it turned out somehow you had a rockstar manager but completely incompetent boobs for subordinates.

      My old boss was friggin' awesome on this point. She made it a point to highlight the accomplishments of her team and not just present it as if she was some sort of managerial genius.

      --
      Non impediti ratione cogitationus.
    3. Re:"typical and rational IT or CS decision" by g4b · · Score: 1

      it is always the realization of exactly that particular point: it's never about you alone.

      so actually the problem which sometimes seems to fade into our Wesen (essence / mind) while thinking about management is mostly based on personality issues we unconsciously perceive in them.

      humans always tend to see positions of decision as something higher, than the positions of creativity or creation (or taking the waste away).
      i can't think of any better picture than the one of god being sad about his folks wanting a king like other nations, because kings tend to forget that they serve the whole; instead they start to think the whole serves them.

      we all have our prides. and we all tend to forget the value of others. management is important. but there is no sense in paying them more than double salaries, or endorsing their ego further. it's a job like every other, and it can be a no-brainer job with a lot of social interaction or self-presentation only. the best managers are the ones who know they are supported by their department and know it is not about them; it is their job, so it becomes about them for those whom they represent. being above others means to serve them.
      But we made a mythos out of this job in the last 50 years. And many management types stop listening and only want to crawl up the ass of the next best guy. This makes the drink bitter.

      It's not the science behind it, or any other reason of "we dont need them and would do it better anyway" like "nerdy" cultures might think (and we are part of).

      (btw. i am not so sure about "popular culture" (who is that?) condemning nerds at all - it always seemed, like, we do that to ourselves, frankly. and if you think everybody hates you, girls dont want a guy who has brainz as muscles and stuff, no wonder we react so arrogantly to our friendly business school mates, telling them how stupid they are and what they do is trivial and stuffz.)

  9. This was modded offtopic by symbolset · · Score: 2, Insightful

    Slashdot is lost.

    --
    Help stamp out iliturcy.
    1. Re:This was modded offtopic by BenJCarter · · Score: 1

      Amen.

      --
      For in politics, as in religion, it is equally absurd to aim at making proselytes by fire and sword. - Publius
    2. Re:This was modded offtopic by Crimson+Wing · · Score: 0

      Fixed. It's since been modded +5 Insightful. :)

      Maybe there's hope yet.

      --
      Sig? What's that? Oh, 'signature'...and it's supposed to be witty? Right...
    3. Re:This was modded offtopic by Anonymous Coward · · Score: 0

      I don't think it works if the same person who posted it complains.

    4. Re:This was modded offtopic by badboy_tw2002 · · Score: 0

      So you posted, complained about the mod, and then posted again to congratulate yourself. a) get out more, b) if we're never again subject to your so poignant and timely wit, _that's a good thing_.

    5. Re:This was modded offtopic by Anonymous Coward · · Score: 0

      * refresh *

      Nope, it's still here.

    6. Re:This was modded offtopic by An+ominous+Cow+art · · Score: 1

      Yeah. It's pretty obvious to me that Frost was writing about a CPU lamenting its poor branch prediction implementation.

    7. Re:This was modded offtopic by hey! · · Score: 1

      Sounds more to me like he was lamenting his investment in the company that made the innovative CPU after it was crushed in the market by Intel's marketing clout backing the x86 architecture ("and that has made all the difference.").

      --
      Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
  10. Which is worse? by Anonymous Coward · · Score: 0

    Which is worse? Having it be O(N) to get a string length and having inexperienced programmers get confused and make mistakes? Or capping your maximum string length at 0xffff?

    I'll take the former, please. I do a lot of string manipulation in C and when you're used to it, it's actually not that bad to get right and still be efficient. And it provides a useful shibboleth to detect people who are no good at C. :-) Just think of how much harder it would be to interview a C programmer if you couldn't give them a crazy string manipulation problem.

    1. Re:Which is worse? by The+Dawn+Of+Time · · Score: 1

      And of course, in reality, where people aren't actually tested before providing bad code that performs these tasks poorly and has terrible effects on society at large, your attitude is approximately as useful as a purse to a fish.

    2. Re:Which is worse? by snowgirl · · Score: 1

      Which is worse? Having it be O(N) to get a string length and having inexperienced programmers get confused and make mistakes? Or capping your maximum string length at 0xffff?

      I'll take the former, please. I do a lot of string manipulation in C and when you're used to it, it's actually not that bad to get right and still be efficient. And it provides a useful shibboleth to detect people who are no good at C. :-) Just think of how much harder it would be to interview a C programmer if you couldn't give them a crazy string manipulation problem.

      I don't know why you have to bring sibolezes into this...

      --
      WARNING! This girl exceeds the MAXIMUM SAFE standards established by the FDA for BRATTINESS
    3. Re:Which is worse? by Anonymous Coward · · Score: 0

      Just think of how unnecessary it would be to interview C programmers if they didn't have to solve crazy string manipulation problems.

      FTFY.

    4. Re:Which is worse? by Dadoo · · Score: 1

      I guess my question would be: what if I want a string that contains NULs? (Yes, I've had this situation, before.)

      --
      Sit, Ubuntu, sit. Good dog.
    5. Re:Which is worse? by AuMatar · · Score: 1

      Since NUL is an unprintable character with no meaning, there's no reason to do that. Now you may have needed a byte pointer with NULs in it, but that's not the same as needing a string with it.

      --
      I still have more fans than freaks. WTF is wrong with you people?
    6. Re:Which is worse? by c0lo · · Score: 1

      And of course, in reality, where people aren't actually tested before providing bad code that performs these tasks poorly and has terrible effects on society at large, your attitude is approximately as useful as a purse to a fish.

      Hmmm... let's not stop mid-way.

      in reality, where people aren't actually tested before providing bad code that performs these tasks poorly and has terrible effects on society at large, you are approximately as useful as a purse to a fish.

      FTFY: in a world that doesn't give a dam' about professionalism, being a professional is useless.

      --
      Questions raise, answers kill. Raise questions to stay alive.
    7. Re:Which is worse? by c0lo · · Score: 1

      I guess my question would be: what if I want a string that contains NULs? (Yes, I've had this situation, before.)

      Then you want a char array, not a string. How do you solve it when you need an int array?

      --
      Questions raise, answers kill. Raise questions to stay alive.
    8. Re:Which is worse? by smellotron · · Score: 1

      I guess my question would be: what if I want a string that contains NULs? (Yes, I've had this situation, before.)

      Pass around a size_t with your pointer and use the mem*() family of functions instead of str*().
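      A minimal sketch of that advice. The struct bytes wrapper and bytes_equal are hypothetical names; memcmp() is the standard call. Because the length travels with the pointer, embedded NUL bytes are just data.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical counted buffer: pointer plus explicit length. */
struct bytes {
    const unsigned char *data;
    size_t len;
};

/* Equality over the full length, using memcmp() instead of strcmp(),
 * so comparison does not stop at the first NUL byte. */
int bytes_equal(struct bytes a, struct bytes b)
{
    return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

      With the buffers { 'a', 0, 'b' } and { 'a', 0, 'c' }, strcmp() reports a match because it stops at the NUL, while bytes_equal() sees the difference in the third byte.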

    9. Re:Which is worse? by Anonymous Coward · · Score: 0

      She's a benjamite, kill her!

    10. Re:Which is worse? by Anonymous Coward · · Score: 0

      Look at the UNICODE_STRING struct and related routines in the NT API, which you use when writing an NT driver. The other great thing about it is C doesn't force you into C strings, it's only an idiom.

    11. Re:Which is worse? by Dadoo · · Score: 1

      Crap. I forgot to keep up on this. On the off chance that you'll see this message, I'll post a reply, anyway.

      Since NUL is an unprintable character with no meaning, there's no reason to do that.

      That may be true in the context of C, but it isn't in the real world. When you're dealing with serial communication, for instance, NUL is hardly meaningless. Being forced to use special code to deal with NULs makes some stuff more difficult than it needs to be.

      --
      Sit, Ubuntu, sit. Good dog.
    12. Re:Which is worse? by Dadoo · · Score: 1

      Then you want a char array, not a string.

      So I can use functions like strcpy and strcmp on character arrays?

      --
      Sit, Ubuntu, sit. Good dog.
    13. Re:Which is worse? by AuMatar · · Score: 1

      If you have a possibility of a NUL, you shouldn't use string functions- you should use the memxxx functions (memcpy, memcmp, etc). Because at that point it isn't a string- it's a data format that can contain strings and other things. Trying to handle it as a string can be a quick way of getting up and running, but it'll cause breaks as you can see. You're better off thinking of serial communication as a byte stream, and parsing it into strings as needed.

      --
      I still have more fans than freaks. WTF is wrong with you people?
  11. Whatever by Old+Wolf · · Score: 4, Funny

    Come on , this is complete rubbish___8^)_#;3,2,.3root>^$)(^(943hellomax0984)_))1..l2l2_}[[}{

    1. Re:Whatever by Anonymous Coward · · Score: 0

      he-freaking-larious!

    2. Re:Whatever by Anonymous Coward · · Score: 0

      This is the most interesting article to appear on Slashdot in years....

  12. Actually tradeoff may not have been rational by perpenso · · Score: 1

    this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day

    Actually the tradeoff may not have been rational. The storage bytes saved may have been offset by the extra code bytes necessary for handling unknown length strings. Perhaps this is actually an example of premature optimization, optimizing things before proper profiling and analysis has shown the problem exists and the proposed solution is beneficial.

    1. Re:Actually tradeoff may not have been rational by c0lo · · Score: 2

      this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day

      Actually the tradeoff may not have been rational.

      Actually, the choice was rational (at least, on purpose) - you see, it's not about a single byte, it's about a new data type.

      C treats strings as arrays of characters conventionally terminated by a marker. Aside from one special rule about initialization by string literals, the semantics of strings are fully subsumed by more general rules governing all arrays, and as a result the language is simpler to describe and to translate than one incorporating the string as a unique data type. Some costs accrue from its approach: certain string operations are more expensive than in other designs because application code or a library routine must occasionally search for the end of a string, because few built-in operations are available, and because the burden of storage management for strings falls more heavily on the user.

      --
      Questions raise, answers kill. Raise questions to stay alive.
    2. Re:Actually tradeoff may not have been rational by osu-neko · · Score: 2

      Actually the tradeoff may not have been rational. The storage bytes saved may have been offset by the extra code bytes necessary for handling unknown length strings.

      Not really, no. Having written basic library code for both, it usually requires more code to handle Pascal-style (length+data) strings than C-style (data+null) strings. You save quite a bit of code ("quite a bit" being relative, but I've had to squeeze code into 208 bytes of RAM before) by using the C-style strings most of the time.

      --
      "Convictions are more dangerous enemies of truth than lies."
    3. Re:Actually tradeoff may not have been rational by perpenso · · Score: 1

      Actually the tradeoff may not have been rational. The storage bytes saved may have been offset by the extra code bytes necessary for handling unknown length strings.

      Not really, no. Having written basic library code for both, it usually requires more code to handle Pascal-style (length+data) strings than C-style (data+null) strings. You save quite a bit of code ("quite a bit" being relative, but I've had to squeeze code into 208 bytes of RAM before) by using the C-style strings most of the time.

      That depends on the programming language and the CPU architecture to a degree. When programming in assembly, as one may very well do in a library, one may be able to use CPU based string instructions - copy or compare or set, increment pointers, decrement count, branch/repeat depending on zero flag, etc in one or two instructions. These require a length up front so Pascal-style may be a better fit.

    4. Re:Actually tradeoff may not have been rational by perpenso · · Score: 1

      That depends on the programming language and the CPU architecture to a degree. When programming in assembly, as one may very well do in a library, one may be able to use CPU based string instructions - copy or compare or set, increment pointers, decrement count, branch/repeat depending on zero flag, etc in one or two instructions. These require a length up front so Pascal-style may be a better fit.

      OK that was a little x86 centric. IIRC with the PDP-11 and 68K the move instruction will set the zero flag so the count and decrement is not needed on a copy. However the decrement is embedded into a specialized branch instruction so it's the same number of instructions in the loop either way.

    5. Re:Actually tradeoff may not have been rational by jeremyp · · Score: 1

      That's very interesting but it's also rubbish or at least not as bad as Dennis Ritchie makes out. What would be added?

      1. a new data type "string" perhaps. At the syntactic level it would behave like all the other data types (integer types, floating point types, struct/union types)

      2. A handful of operators for manipulating strings. For the most part these would be existing operators overloaded (in much the same way as the arithmetic operators are overloaded between integer and floating point types) e.g. string concatenation could be +, extracting single characters would be [], the length could use sizeof. Extracting substrings would need a new operator.

      Casting strings to and from other types might be tricky, but could just be banned for the most part.

      So in terms of parsing, one new type among the many is needed and one new operator. Code generation would require new bits, of course, but no more so than the code generation for manipulating floating point numbers.

      --
      All I want is a secure system where it's easy to do anything I want. Is that too much to ask ~~ Randall Munroe
    6. Re:Actually tradeoff may not have been rational by KiloByte · · Score: 1

      It seems perfectly rational to me:
      * works with array types (as c0lo said)
      * no crippling limit of 255 characters if length is 1
      * no waste for sane lengths, memory was at a great premium at the time
      * no urge to have even weirder limits, like 65535 for 2-byte length
      * all code works the same no matter if strings are short or long
      * many string operations were more efficient: with Pascal strings you need to hold both the pointer and an offset inside every loop that goes over the string, this made things more complex on tiny computers of the time. On the other hand, some other operations were worse, so this goes neutral.

      --
      The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
    7. Re:Actually tradeoff may not have been rational by perpenso · · Score: 1

      It seems perfectly rational to me: * works with array types (as c0lo said)

      See jeremyp's post regarding adding a new type. I don't agree with everything, for example the sizeof part, but otherwise it demonstrates that a string type may have been a reasonable choice.

      * no crippling limit of 255 characters if length is 1 * no urge to have even weirder limits, like 65535 for 2-byte length

      Length 1 seems a bit of a red herring, I doubt it was a serious consideration. size_t would be much more likely, if so the max length of a string would be the max length of any array.

      * no waste for sane lengths, memory was at a great premium at the time

      Having done microcontroller work I appreciate this sentiment; however, I think it is overblown in this context. String processing would seem a minor activity in the vast majority of the apps of that time, also today. For those that were string focused the programmer could simply not use the built in string type and use an array of chars with an end marker. **IF** that made sense, for some apps with custom string handling the length is desired. What would (s)he be lacking, constant string assignment and stdlib functions? The former seems a minor consideration and the latter represents a handful of very simple functions. Some programs would benefit from the string type, other programs would benefit from the current implementation. I personally think it is likely that the larger group would have been the former.

      * all code works the same no matter if strings are short or long

      This is also true with the likely size_t based length.

      * many string operations were more efficient: with Pascal strings you need to hold both the pointer and an offset inside every loop that goes over the string, this made things more complex on tiny computers of the time. On the other hand, some other operations were worse, so this goes neutral.

      That doesn't seem entirely correct. Since the offset is a known constant value the offset can be embedded in the addressing mode. 68K and x86 include such addressing modes. I don't recall if PDP-11 did, I lean toward not. Even without an embedded offset the pointer or offset could be adjusted by sizeof(size_t) outside the loop, note this is all below the app programmer's view. It's something for compiler code generation when indexing a string data type and for stdlib string function implementors.

    8. Re:Actually tradeoff may not have been rational by shutdown+-p+now · · Score: 1

      This has nothing to do with the issue at hand. A length-prefixed implementation also means that "strings are arrays". Just that, instead of having a NUL postfix, you have a length prefix.

    9. Re:Actually tradeoff may not have been rational by c0lo · · Score: 1

      This has nothing to do with the issue at hand. A length-prefixed implementation also means that "strings are arrays". Just that, instead of having a NUL postfix, you have a length prefix.

      Hmmmm.... is it? Let's check whether the two are equivalent: in a length-prefix implementation, how are you going to store/pass as a param the "tail substring of a constant string" without copying the original string?
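
      To make the question concrete, here is a sketch of the two copy-free cases. All names (tail_cstr, struct span, tail_span) are hypothetical. With the length stored in memory in front of the characters, a tail has no prefix of its own to point at; the second form only works because the length is carried beside the pointer instead:

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <string.h>

      /* NUL-terminated: the tail of a string is just a pointer into the same
         bytes, because the terminator is shared with the original. */
      const char *tail_cstr(const char *s, size_t skip)
      {
          assert(s != NULL && skip <= strlen(s));
          return s + skip;
      }

      /* Hypothetical address + length pair kept outside the character data. */
      struct span {
          const char *ptr;
          size_t      len;
      };

      /* A tail is still copy-free here, but only because the length lives
         beside the pointer rather than in front of the characters. */
      struct span tail_span(struct span s, size_t skip)
      {
          assert(skip <= s.len);
          struct span t = { s.ptr + skip, s.len - skip };
          return t;
      }
      ```

      Both functions return a view into the original bytes; neither one allocates or copies.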

      --
      Questions raise, answers kill. Raise questions to stay alive.
  13. Error! by larry+bagina · · Score: 1

    If the source string is NUL terminated, however, attempting to access it in units larger than bytes risks attempting to read characters after the NUL. If the NUL character is the last byte of a VM (virtual memory) page and the next VM page is not defined, this would cause the process to die from an unwarranted "page not present" fault.

    On all modern computers, the page size is a power of 2, cleanly divisible by 32, 64, 128, 256, etc. Modern computers have a terrible penalty (sometimes including SIGBUS) for memory accesses which aren't aligned on the native word size. Throw those two facts together and you can't accidentally read past the VM page.
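
    The arithmetic behind that claim can be checked with a small sketch. crosses_page is a hypothetical helper, and the 4096-byte page size used in the note below is an assumption for illustration. It only speaks to the aligned case the paragraph describes:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Returns 1 if an access of `width` bytes starting at `addr` touches more
       than one page of `page_size` bytes, 0 otherwise. */
    int crosses_page(uintptr_t addr, uintptr_t width, uintptr_t page_size)
    {
        assert(width > 0 && page_size > 0);
        uintptr_t first = addr / page_size;
        uintptr_t last  = (addr + width - 1) / page_size;
        return first != last;
    }
    ```

    With 4096-byte pages, an aligned 8-byte read at address 4088 ends at byte 4095 and stays inside the page, while an unaligned 8-byte read at 4092 spills into the next one.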

    --
    Do you even lift?

    These aren't the 'roids you're looking for.

    1. Re:Error! by shutdown+-p+now · · Score: 1

      Modern computers have a terrible penalty (sometimes including SIGBUS) for memory accesses which aren't aligned on the native word size.

      Say, what? This is trivially demonstrated wrong by creating a string (with no null terminator) that is exactly as long as a memory page.

  14. Worst mistake: Therac-25 by Anonymous Coward · · Score: 0

    Unchecked boundary conditions, in the case of the Therac-25 an overflow of a one-byte counter, are a fatal flaw in poorly written software. In older 8-bit apps, this could wind up with random unexplained crashes. Well, in this case it caused people to be exposed to high doses of radiation over large areas of their bodies and cost people their lives. (And when I learned about this, I decided I was much happier working on web/e-commerce stuff than working on embedded systems programming.)
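
    A minimal illustration of how a one-byte counter wraps. This is not the Therac-25 source; the function name and the "zero means skip the check" convention are assumptions made up for the example:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Illustration only: a one-byte counter is incremented on every pass,
       and a value of zero is (wrongly) taken to mean "nothing to check". */
    int check_skipped_after(unsigned passes)
    {
        uint8_t counter = 0;
        for (unsigned i = 0; i < passes; i++) {
            counter++;              /* wraps from 255 back to 0 */
        }
        return counter == 0;        /* 1 means the check would be bypassed */
    }
    ```

    After 255 passes the counter is still nonzero; on the 256th it wraps to zero and the function reports that the check would be skipped.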

  15. The cost of a byte - or was that the value? by Teunis · · Score: 2

    hmm. marker character, or a length.

    Marker: same type as string, so no need to worry about bit size, start/stop bits or other extraneous. String can be any size and only restricted by available memory. (given the ability to swap darn near unlimited pages in current hardware.... and the ability to virtualize across computers... this means strings have a potentially infinite limit)

    Length: What's the size? What byte order? What bit size? How will this affect communications between platforms?

    IMO, C and the null terminated string -saved- more than it cost. It's entirely (theoretically anyway) possible - given the kind of code I've seen in browsers and server code -that the web couldn't have existed without some of these assumptions. The "streaming" so core to unix depends on this... how else does one know when one hits the end of a file or a buffer?

    When you mark cost, know what you pay. Not all costs are negative.

    1. Re:The cost of a byte - or was that the value? by Sloppy · · Score: 1

      Length: What's the size? What byte order? What bit size? How will this affect communications between platforms?

      These aren't hard questions, IMHO. Just say the length is an int (or an unsigned int), and then assuming you didn't freak out when someone asked all those very same questions about ints, then you should be fairly happy with the result.

      --
      As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
    2. Re:The cost of a byte - or was that the value? by Arlet · · Score: 1

      Instead of an int, shouldn't that be a size_t ?

    3. Re:The cost of a byte - or was that the value? by EvanED · · Score: 1

      String can be any size and only restricted by available memory.

      It's not like you can't get that with counted strings. If you're in the "infinite" limit case, then you're already doing something very different than just treating a block of memory as a string, and so you can either use a terminated string in that (very unusual) case or allow for a variable-sized count field.

      What's the size? What byte order? What bit size? How will this affect communications between platforms?

      The size of the counter I'll grant you -- IMO this may be the biggest reason that I'm glad for historical reasons that C didn't go with count fields. (I'm worried that we'd still be using 2-byte fields or something nowadays.) But I think you're overstating the problems with it... you already have to worry about all of those problems.

      It's entirely (theoretically anyway) possible - given the kind of code I've seen in browsers and server code -that the web couldn't have existed without some of these assumptions. The "streaming" so core to unix depends on this... how else does one know when one hits the end of a file or a buffer?

      I don't buy that one iota.

      So first, "how do you know when you hit the end of a file"? That's not signaled by null in the first place, so the same way you do now. End of a buffer? Because you reached the count.

      Second, if there were a situation where you'd frequently not know the size of the data a priori, nothing would stop you from changing the protocol to include a terminator in that instance. (You could use this to still provide something like find's -print0 and xargs' -0 if you didn't want lengths to show up on standard out.)

      Third, think about what your assertion basically boils down to: that you can't do web programming in languages that give you counted strings. And of course that's crazy.

      Personally I think there's something you don't see much in this debate: there are actually three pieces of information that matter: the string data, the length of the string, and the size of the buffer. It's always necessary to track the first, but any time you want to extend the length of the string you have to track the third. (And that's a fair bit.) In my ideal world, C's "standard" string representation (supported by the language-provided APIs) would have been like that. (Windows has it right.)
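
      A sketch of what tracking all three pieces might look like. The type and function names (struct strbuf, strbuf_append) are hypothetical, and the doubling growth policy is just one common choice:

      ```c
      #include <assert.h>
      #include <stdlib.h>
      #include <string.h>

      /* Hypothetical string type carrying all three pieces of information. */
      struct strbuf {
          char  *data;   /* the characters (not necessarily NUL-terminated) */
          size_t len;    /* bytes currently in use */
          size_t cap;    /* bytes allocated */
      };

      /* Append n bytes, growing the buffer when they would not fit.
         Returns 0 on success, -1 on failure (the buffer is left unchanged). */
      int strbuf_append(struct strbuf *s, const char *src, size_t n)
      {
          assert(s != NULL && (src != NULL || n == 0));
          if (n > s->cap - s->len) {
              size_t new_cap = s->cap ? s->cap : 16;
              while (new_cap - s->len < n) {
                  if (new_cap > (size_t)-1 / 2) return -1;  /* would overflow */
                  new_cap *= 2;
              }
              char *p = realloc(s->data, new_cap);
              if (p == NULL) return -1;
              s->data = p;
              s->cap = new_cap;
          }
          if (n > 0) memcpy(s->data + s->len, src, n);
          s->len += n;
          return 0;
      }
      ```

      Starting from an empty { NULL, 0, 0 }, each append knows without scanning both where the string ends (len) and whether it still fits (cap).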

    4. Re:The cost of a byte - or was that the value? by c0lo · · Score: 1

      Length: What's the size? What byte order? What bit size? How will this affect communications between platforms?

      Adding: how do you pass along the tail substring? (like: I parsed to here, take over from now on. Oh, yes, on top of the length, deal with another offset).

      --
      Questions raise, answers kill. Raise questions to stay alive.
    5. Re:The cost of a byte - or was that the value? by Anonymous Coward · · Score: 0

      Length: What's the size? What byte order? What bit size? How will this affect communications between platforms?

      In no way at all. Communication between platforms is done with file formats and/or protocols that presumably are well defined.
      There are plenty of file formats and protocols out there that store the string length before the string data. (Possibly more than those that use a termination code. What's the size of the termination code, b.t.w.? What is the value?)

      Not that it really matters. C does not prevent anyone from using their own format to store strings since it is an application specific problem. ANSI-C provides functions for working with both null-terminated strings and arbitrary length data-blocks and generally you should use the data-format that is best for solving your problem.

      Heck, AmigaOS uses both C-type strings and BCPL-type strings in the same system. (Devices use BCPL-type strings for some odd reason.)

    6. Re:The cost of a byte - or was that the value? by Anonymous Coward · · Score: 0

      Yes, but no one really cares.

    7. Re:The cost of a byte - or was that the value? by Arlet · · Score: 1

      It would cause extra bugs on systems where int != size_t.

    8. Re:The cost of a byte - or was that the value? by w_dragon · · Score: 1

      So if an old system with a 16-bit int wants to transfer a string to a new system with a 32-bit int you handle that how exactly?

    9. Re:The cost of a byte - or was that the value? by Sloppy · · Score: 1

      The same way you transfer an int.
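
      One common way to do that is to fix the width and byte order on the wire instead of sending whatever the native int happens to be. A sketch, assuming a 4-byte big-endian length field; the width and the helper names put_len32 and get_len32 are assumptions, not something this thread specifies:

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* Write a length as four bytes, most significant first, regardless of
         the sender's native int size or byte order. */
      void put_len32(unsigned char out[4], uint32_t len)
      {
          assert(out != NULL);
          out[0] = (unsigned char)(len >> 24);
          out[1] = (unsigned char)(len >> 16);
          out[2] = (unsigned char)(len >> 8);
          out[3] = (unsigned char)len;
      }

      /* Read the same four bytes back on the receiving side. */
      uint32_t get_len32(const unsigned char in[4])
      {
          assert(in != NULL);
          return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
                 ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
      }
      ```

      A 16-bit-int sender and a 32-bit-int receiver then agree on the field, as long as the sender's strings fit the agreed width.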

      --
      As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
  16. register starvation by tabrisnet · · Score: 1

    The real problem with the addr+len approach is that now every string becomes a struct, or a structptr.

    This means that when passing a string to a function, either the string takes up two register/stack slots, or you're passing around a const-ptr (but the contents of the struct are not const), which means one more memory access due to pointer indirection.

    x86 and the PDP-11 are register-starved. The x86 has 8 registers, with 4 or 5 available as general-purpose registers.
    The PDP-11 was similar with 8 registers total as well.

    1. Re:register starvation by perpenso · · Score: 1

      My assembly class in college was on a PDP-11. I've done quite a bit of x86 assembly over the years. I'm confused as to why you think a pascal style string structure pointer requires any more registers or stack than a C character pointer. In assembly if I want the length I would reference a size_t at the pointer address, and if I want text I would reference a char at pointer+offset where offset is sizeof(size_t).

    2. Re:register starvation by EvanED · · Score: 1

      x86's register situation, while not nearly as good as it should be (even x64 isn't all that good), is not nearly as bad as it seems. First, register renaming does a bit to help, but my understanding is that x86 chips pull a special trick: they are able to specially detect most reads and writes to the top several stack slots and redirect those accesses to a register as well. (It's been a while since I've read that, and I forget where.)

      (BTW, your "4 or 5" is a little low: it's really 6 or 7 registers that are generally available. You definitely get eax through edx, esi, and edi. That's 6. If you turn on frame pointer optimization, you've also got ebp.)

    3. Re:register starvation by Anonymous Coward · · Score: 0

      Please... Please ... Mod Parent +1 informative. Please.... its actually brilliant.

    4. Re:register starvation by EvanED · · Score: 1

      OK, now think about how you would compile a loop; say, strcpy. With null-terminated strings, you can do this with one source register, one destination register, and one temp register to hold the value you pull out of memory.

      With counted strings, you "need" one more. You either need to store the base and offset of the source string, or the current address and the ending address... then you also need the temp and destination register.
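
      In C terms the two loops look roughly like this. The function names are hypothetical, and how many registers a real compiler ends up using depends on the target and the optimizer; this only shows which values are live in each loop:

      ```c
      #include <assert.h>
      #include <stddef.h>

      /* NUL-terminated copy: live values are src, dst and the byte just loaded. */
      char *copy_terminated(char *dst, const char *src)
      {
          assert(dst != NULL && src != NULL);
          char *d = dst;
          char c;
          do {
              c = *src++;     /* the temp holds the byte ... */
              *d++ = c;       /* ... and doubles as the loop condition */
          } while (c != '\0');
          return dst;
      }

      /* Counted copy: src, dst, the byte in flight, plus a remaining count. */
      char *copy_counted(char *dst, const char *src, size_t n)
      {
          assert(dst != NULL && src != NULL);
          char *d = dst;
          while (n-- > 0) {
              *d++ = *src++;
          }
          return dst;
      }
      ```

      The counted version is also the only one of the two that can copy data with a NUL in the middle.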

    5. Re:register starvation by perpenso · · Score: 1

      The GP was referring to registers being consumed during parameter passing not function implementation.

      Regarding function implementation, registers are preserved so they are all available as working registers inside the function. Needing one additional register for the count does not seem troublesome. Inlining something like strcpy would complicate things but if you are inlining you are probably optimizing for speed. If so you may want to use the CPUs built-in string capabilities and these generally want a count up front, for example rep movs.

    6. Re:register starvation by Anonymous Coward · · Score: 0

      Not quite right either, with PIC (often considered a requirement for shared libraries) you do not get ebx.
      So it is 5 to 7.

    7. Re:register starvation by JSBiff · · Score: 1

      Why couldn't the string just be a pointer to a block of memory (just like a current string pointer), where the very first sizeof(size_t) bytes of the block contains the length of the string, and where the string data starts at location strPtr + sizeof(size_t)?

      In that case, you would only copy the strlen value into a register when you actually needed to use the value (e.g. when you are testing to see if you're at the end of the string)? In some scenarios, if you are testing (or changing) that value often, you might keep it in a register; but, you don't *have* to keep it in the register/stack when you aren't interested in it?
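
      A sketch of that layout, with hypothetical names (pstr_new, pstr_len, pstr_data). The length is read out with memcpy only when somebody asks for it, and the characters sit sizeof(size_t) bytes into the same allocation:

      ```c
      #include <assert.h>
      #include <stdlib.h>
      #include <string.h>

      /* Hypothetical layout: one allocation holding a size_t length followed
         immediately by the characters. The "string" is a pointer to that block. */
      void *pstr_new(const char *bytes, size_t len)
      {
          assert(bytes != NULL || len == 0);
          if (len > (size_t)-1 - sizeof(size_t)) return NULL;  /* size overflow */
          unsigned char *block = malloc(sizeof(size_t) + len);
          if (block == NULL) return NULL;
          memcpy(block, &len, sizeof len);                      /* length prefix */
          if (len > 0) memcpy(block + sizeof(size_t), bytes, len);
          return block;
      }

      /* Load the length only when it is needed. */
      size_t pstr_len(const void *s)
      {
          size_t len;
          assert(s != NULL);
          memcpy(&len, s, sizeof len);
          return len;
      }

      /* The characters start sizeof(size_t) bytes past the block pointer. */
      const char *pstr_data(const void *s)
      {
          assert(s != NULL);
          return (const char *)s + sizeof(size_t);
      }
      ```

      Passing such a string to a function costs one pointer, the same as a char * does today; the caller releases it with free().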

    8. Re:register starvation by Anonymous Coward · · Score: 0

      they are able to specially detect most reads and writes to the top several stack slots and redirect those accesses to a register as well. (It's been a while since I've read that, and I forget where.)

      How convenient. Are you sure you're not confusing yourself with L1 cache?

    9. Re:register starvation by EvanED · · Score: 1

      No, I'm not sure. I did say that it's been a while since I read that. :-) But given the x86->microoperation translation that chip front ends do (at least Intel's), it seems entirely plausible.

      It may even have just been that anywhere in the L1 cache is only a cycle or two slower than registers nowadays, or something like that.

      The main thing I remember was taking away the impression of what I said before: that x86 register pressure is not nearly as bad in real terms as it looks.

    10. Re:register starvation by Rockoon · · Score: 1

      On x86 you don't need the temp register that the ascii 0 strings need if you are dealing with a counted string.

      ..and as another poster pointed out, the CPU's string instructions require an up-front count.

      Furthermore, that's a naive implementation of an ascii 0 string copy routine. A decent implementation would use a pointer, a static displacement (stored in a register, of course), and the temp. This requires only one pointer increment instead of two, at the cost of only a single subtraction (disp = dest - src) before the copy loop begins.

      Do you still want to let implementation details drive your argument? I'm guessing that no, you actually don't want the implementation details to be a part of this discussion since they do not actually support your argument.

      --
      "His name was James Damore."
    11. Re:register starvation by tabrisnet · · Score: 1

      I'll admit to not being an assembly programmer, but for one, I was referring more to what they would have done back then... when register renaming wasn't available. Nowadays many programmers don't care about cycle counting... I know I haven't had to. Writing in perl or C++/STL leaves me too far away from the metal to even know how many cycles I'm wasting.

      But since the question is one of yesteryear, lack of regs could be a question. And I was under the impression you weren't supposed to use esi and edi as tmp stores or as the output of an expr. They're ptrs into stack and data-area, respectively.

    12. Re:register starvation by tabrisnet · · Score: 1

      I meant that if you were to actually pass the struct as not a structptr, but as the 2 values themselves. After all, you could optimize for fewer regs, or fewer memory accesses. Pointer indirection used to be considered a problem. Albeit not anymore.

      Yes, that would be an ABI change, but this was back when they could choose the ABI as they wished... so we can throw many current assumptions of ABI out the window.

    13. Re:register starvation by _0xd0ad · · Score: 1

      OK, now think about how you would compile a loop; say, strcpy. With null-terminated strings, you can do this with one source register, one destination register, and one temp register to hold the value you pull out of memory.

      I'd use the temp register to hold the # of bytes remaining, and do the copy with XOR. :P

    14. Re:register starvation by _0xd0ad · · Score: 1

      Never mind, I commented too quickly. I was confused by needing an extra register but I'd forgotten you can't do a direct memory-to-memory copy.

    15. Re:register starvation by EvanED · · Score: 1

      Yes, that's true: the problem was rather more acute historically. A lot of C's decisions where we'd say "things would be way better if they had done things this way" were probably correct at the time; I think that null-terminated strings are another.

      And I was under the impression you weren't supposed to use esi and edi as tmp stores or as the output of an expr.

      They are totally fine to use as temporary registers. The main catch is that the string instructions (like stosd, usually prefixed with rep) use those registers as part of their contract. But it's to do stuff you'd have in a register anyway.

      If you compile this function:

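      int get(void);                            /* defined elsewhere, so the calls can't be folded away */
      void use(int, int, int, int, int, int);   /* likewise */

      void foo(void) {
        int a = get(), b = get(), c = get(), d = get(), e = get(), f = get();
        use(a, b, c, d, e, f);
      }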

      with GCC -O3 (this is with -m32, otherwise it will use the new x64 registers), you'll get this out (remember, in AT&T syntax, data flows left-to-right):

        call get
        movl %eax, -20(%ebp) // a is stored in memory at ebp-20
        call get
        movl %eax, -16(%ebp) // b is stored in memory at ebp-16
        call get
        movl %eax, %edi // c is stored in edi
        call get
        movl %eax, %esi // d is stored in esi
        call get
        movl %eax, %ebx // e is stored in ebx
        call get
        movl %ebx, 16(%esp) // populate the argument slots (this is e)
        movl %esi, 12(%esp) // d
        movl %edi, 8(%esp) // c
        movl %eax, 20(%esp) // eax still holds the return from the last call, so this is f (don't know why it's so late)
        movl -16(%ebp), %eax // ebp-16 is b; load it, then write back to the argument slot
        movl %eax, 4(%esp)
        movl -20(%ebp), %eax // ebp-20 is a; load it, then write back to the argument slot
        movl %eax, (%esp)
        call use

      In other words, GCC has no compunction about using edi and esi as temporary storage for c and d. I don't know why a and b go into memory though, to be honest, and not into ecx and edx. The registers are open. It's a little strange. :-)

    16. Re:register starvation by EvanED · · Score: 1

      And BTW, Intel's compiler (version 11) produces code that's more like what I'd expect. For the six temporaries, it uses registers for everything but one (not sure what is going on there):

      call get
      movl %eax, %edi // a
      call get
      movl %eax, (%esp) // b
      call get
      movl %eax, %esi // c
      call get
      movl %eax, %ebp // d
      call get
      movl %eax, %ebx // e
      call get
      pushl %eax // eax holds f from the call
      pushl %ebx // ebx was e
      pushl %ebp // ebp was d
      pushl %esi // esi was c
      movl 16(%esp), %eax // b was in memory for whatever reason
      pushl %eax
      pushl %edi // edi is a
      call use

    17. Re:register starvation by perpenso · · Score: 1

      I apologize if I am missing something obvious, the coffee has not kicked in yet, but I don't see the need for indirection. I would expect a string type to be implemented as a length immediate followed by a character array - a single block of memory, not as a length and a character pointer. I suppose there are advantages to both implementations but I suspect the former to be the more likely. Especially so if the string type were a built-in type.

    18. Re:register starvation by tabrisnet · · Score: 1

      I've seen this done in serialization formats, but never seen it actually done in a program.

      So you're saying you want native-size-word (int) followed by the string chars, so you'd malloc(strlen+4) for the dynamic string, and then in the reader read the first 4 bytes and set the ptr to stringPtr=basePtr+(void *)(sizeof(size_t)). It just doesn't feel natural in C as we know it. But then again, I'm sure they could have defined it to be natural. it would be an implied struct.

    19. Re:register starvation by perpenso · · Score: 1

      No need for all that casting, as you mention it would be implied. If this is done in the context of a string type then the details are hidden. You address characters with the normal [] operator just like today, the compiler automatically generates code for the offset. About the only thing new, other than a string class, would be some operator for referencing the length.

    20. Re:register starvation by tabrisnet · · Score: 1

      It seems that the intent there is to make C into something a lot like C++/STL. Once you've done it once, nothing to prevent Complex types and others being builtins... but this seems to bring C further from its "portable ASM" status.

      Not that I'm entirely sure this would be bad... and the struct could be just a typedef. Now all we need is operator overloading, which is also a C++ism. Maybe a C++ minus the type-strictness, and minus all of the automatic copying. and the templates.

  17. Slashdot Sensation Prevention Section by gmhowell · · Score: 4, Informative

    FTA:

    We learn from our mistakes, so let me say for the record, before somebody comes up with a catchy but totally misleading Internet headline for this article, that there is absolutely no way Ken, Dennis, and Brian could have foreseen the full consequences of their choice some 30 years ago, and they disclaimed all warranties back then. For all I know, it took at least 15 years before anybody realized why this subtle decision was a bad idea, and few, if any, of my own IT decisions have stood up that long.

    In other words, Ken, Dennis, and Brian did the right thing.

    --
    Jesus was all right but his disciples were thick and ordinary. -John Lennon
    1. Re:Slashdot Sensation Prevention Section by nitehawk214 · · Score: 1

      Wow, this place has come a long way from a simple news for nerds site. Now, the authors are placing disclaimers specifically addressed to us :)

      And we ignore them anyhow, since the editors don't RTFA.

      --
      I'm a good cook. I'm a fantastic eater. - Steven Brust
  18. Slashdot Sensation Prevention Section by Target+Practice · · Score: 1

    Wow, this place has come a long way from a simple news for nerds site. Now, the authors are placing disclaimers specifically addressed to us :)

    --
    There's a 68.71% chance you're right.
  19. Fair and balanced by lucm · · Score: 0

    From the article:
    > Another candidate could be IBM's choice of Bill Gates over Gary Kildall to supply the operating system for its personal computer. The damage from this decision is still accumulating at breakneck speed [...]

    This is the kind of factual, objective and unbiased content that gives credibility to an article.

    --
    lucm, indeed.
  20. Got it wrong by Spazmania · · Score: 3, Insightful

    It probably wasn't about the bytes. The factors are:

    1. Complexity. Without exception, every variable in C is an integer, a pointer or a struct. A null terminated string is a pointer to a series of integers -- barely one step more complex than a single integer. To keep the string length, you'd have to employ a struct. That or you'd have to create a magic type for strings that's on the same level as integers, pointers and structs. And you don't want to use a magic type because then you can't edit it as an array. Simplicity was important in C -- keep it close to the metal.

    2. Computational efficiency. Many if not most operations on strings don't need to know how long they are. So why suffer the overhead of keeping track? That makes string operations on null terminated strings on average faster than string operations on a string bounded by an integer.

    3. Bytes. It's only one extra byte with a magic type or an advanced topic struct. In both cases with an assumption that the maximum possible length on which the standard string functions will work is 64kb. If you're talking about a more mundane struct then you're talking about an int and a pointer to a block of memory which has an extra set of malloc overhead. That's a lot of extra bytes, not just one.

    For the kind of language C aimed to be -- a replacement for assembly language -- the choice of null terminated strings was both obvious and correct.

    --
    Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
    1. Re:Got it wrong by PCM2 · · Score: 2

      Beyond those points:

      It is interesting to compare C's approach with that of two nearly contemporaneous languages, Algol 68 and Pascal [Jensen 74]. Arrays in Algol 68 either have fixed bounds, or are `flexible:' considerable mechanism is required both in the language definition, and in compilers, to accommodate flexible arrays (and not all compilers fully implement them.) Original Pascal had only fixed-sized arrays and strings, and this proved confining [Kernighan 81]. Later, this was partially fixed, though the resulting language is not yet universally available.

      C treats strings as arrays of characters conventionally terminated by a marker. Aside from one special rule about initialization by string literals, the semantics of strings are fully subsumed by more general rules governing all arrays, and as a result the language is simpler to describe and to translate than one incorporating the string as a unique data type. Some costs accrue from its approach: certain string operations are more expensive than in other designs because application code or a library routine must occasionally search for the end of a string, because few built-in operations are available, and because the burden of storage management for strings falls more heavily on the user. Nevertheless, C's approach to strings works well.

      And that's coming from Dennis Ritchie, who was there.

      --
      Breakfast served all day!
    2. Re:Got it wrong by Homburg · · Score: 1

      To keep the string length, you'd have to employ a struct.

      No, strings with a listed length would also be pointers to a series of integers - it's just that, instead of giving a value special semantics (0 as end of string), you give a position in the series special semantics (store the length in the first two bytes). In both cases, you need your string-handling functions to be aware of whatever the convention is.
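
      A minimal sketch of that convention, assuming a two-byte little-endian length prefix followed by the characters. The names lp_make, lp_len and lp_data are hypothetical and not taken from any real library:

      ```c
      #include <assert.h>
      #include <stdlib.h>
      #include <string.h>

      /* Hypothetical layout: bytes 0-1 hold the length (little-endian),
         bytes 2.. hold the characters. No terminator is stored. */
      unsigned char *lp_make(const char *text)
      {
          size_t n = strlen(text);
          assert(n <= 0xFFFF);               /* two bytes cap the length at 65535 */
          unsigned char *s = malloc(2 + n);
          if (s == NULL)
              return NULL;
          s[0] = (unsigned char)(n & 0xFF);
          s[1] = (unsigned char)(n >> 8);
          memcpy(s + 2, text, n);
          return s;
      }

      /* O(1): read the prefix instead of scanning for a terminator. */
      size_t lp_len(const unsigned char *s)
      {
          return (size_t)s[0] | ((size_t)s[1] << 8);
      }

      /* The characters start right after the two-byte prefix. */
      const unsigned char *lp_data(const unsigned char *s)
      {
          return s + 2;
      }
      ```

      Every function that touches such a string has to agree on the prefix width and byte order, just as every C string function has to agree on the terminator.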

      Computational efficiency. Many if not most operations on strings don't need to know how long they are. So why suffer the overhead of keeping track? That makes string operations on null terminated strings on average faster than string operations on a string bounded by an integer.

      I don't know that that's true. Operations that do need to know the length of the string could be quicker, and I'm not sure that these cases are less frequent. What are the common cases you are thinking of where C-style strings are faster?

    3. Re:Got it wrong by Arlet · · Score: 1

      (store the length in the first two bytes)

      So 65536 byte strings should be enough for anybody ?

      Operations that do need to know the length of the string could be quicker, and I'm not sure that these cases are less frequent. What are the common cases you are thinking of where C-style strings are faster?

      C-style strings are simpler. That's the biggest advantage. For the few cases where performance matters, you can always define your own string type.

    4. Re:Got it wrong by EvanED · · Score: 1

      So why suffer the overhead of keeping track?

      Because you usually need to for correctness anyway, to make sure you don't overflow your buffers.

      For the kind of language C aimed to be -- a replacement for assembly language -- the choice of null terminated strings was both obvious and correct.

      For the kind of language C aimed to be, it sure as hell gets used in a lot of inappropriate venues. Like OS kernels.

    5. Re:Got it wrong by EvanED · · Score: 1

      I don't know that that's true. Operations that do need to know the length of the string could be quicker, and I'm not sure that these cases are less frequent.

      So I will back up the OP in a couple of small respects here: it is still possible to track the length yourself, and you can do all the operations that do need to know the length of the string in a different way using that information. (E.g. if you have s1 and s2, the length of s1 is n, and you want to concatenate them, you can just do strcpy(s1+n, s2) instead of strcat(s1, s2). Or whatever the invocations are.)

      You get the O(1) strlen operation when you need it, and don't suffer the overhead of maintaining the counts when you don't. The only problem arises with modules that aren't written this way: you know the length of the string, foo() needs to know the length of the string, but you don't control the implementation of foo() and it's written without a way for you to tell it the length.
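
      As a rough sketch of tracking the length yourself: the helper below is hypothetical (append_known is not a standard function), and it swaps strcpy for memcpy plus an explicit capacity check, which the comment above does not ask for but which the known length makes cheap.

      ```c
      #include <assert.h>
      #include <string.h>

      /* Appends src to the NUL-terminated string in buf, whose current length
         the caller already knows. Returns the new length, or (size_t)-1 if the
         result would not fit in cap bytes (terminator included). */
      size_t append_known(char *buf, size_t cap, size_t len, const char *src)
      {
          size_t add = strlen(src);
          assert(len < cap);                 /* precondition: buf holds a valid string */
          if (add >= cap - len)
              return (size_t)-1;
          memcpy(buf + len, src, add + 1);   /* +1 copies the terminator too */
          return len + add;
      }
      ```

      The existing contents of buf are never rescanned; only src is walked, and the caller keeps the returned length for the next call.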

      The second bit is that there are a couple representations you could use for a string. First, you could have a pointer to a block containing a count and a pointer to the actual data. This adds a level of indirection to each access, and it adds more allocation and deallocation overhead. Tiny amount, but nonzero, and it's on every access, including reads. (This is even more true back before optimizers would have been able to do stuff like save the address of the real block, hoist that out of the loop, and only do it once. Though I guess you could still do that manually.)

      The second representation is to have a pointer to the string, where the block is prefixed by the count. However, you then can't create a string that's a suffix of the original without copying the whole string. (With C's representation, if p is a non-empty string, then p+1 is also a string. This also makes things like iterating through a string quite nice.)

      I've spent some time thinking about this in the past, and I've developed a reasonably strong opinion of how a C-like language would "best" handle strings, but there are substantial benefits and substantial drawbacks to every option. (I like a variant of counted strings ... but I also have some fairly unconventional and strong opinions on some programming language and OS fronts as well. :-))

    6. Re:Got it wrong by osu-neko · · Score: 2

      What are the common cases you are thinking of where C-style strings are faster?

      void strcpy(char *d, const char *s)
      {
          while ((*d++ = *s++));
      }

      Challenge: come up with the equivalent for pascal-style strings in a way that doesn't compile into at least twice as much code.

      In fact, aside from strlen, are there any string functions that aren't made at least twice as complex by using P-style instead of C-style strings? Most of strlib.c can be implemented as one-liners, assuming C-style strings.

      --
      "Convictions are more dangerous enemies of truth than lies."
    7. Re:Got it wrong by Mark+J+Tilford · · Score: 1

      C-style arrays have the property that (array + offset), provided offset is within the array, can be treated as a shorter array of the same type.
      Giving a value within the string special semantics preserves that property; giving a position within that string special semantics does not.

      --
      -----------
      100% pure freak
    8. Re:Got it wrong by Dog-Cow · · Score: 1

      As proof that C is perfectly appropriate for OS kernels, one simply has to look at the most common kernels. Name one that is written in a language other than C. Linux, Windows, Mach, *BSD... All in C. Even OpenBSD is in C, which one would think an odd choice considering the stated goal of OpenBSD.

      If anything, one might argue that an OS kernel is the only appropriate place for C.

    9. Re:Got it wrong by Spazmania · · Score: 1

      In C, that "special semantic" to store the length in the first bytes looks like this:

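      #include <stdlib.h>

      typedef struct {
          unsigned short max;     /* capacity of s[] */
          unsigned short length;  /* characters currently stored */
          char s[1];              /* really max bytes; runs past the struct */
      } string;

      string *newstring(unsigned short max) {
          string *s;
          s=malloc(sizeof(string)+(sizeof (char)*(max-1)));
          if (s==NULL) return NULL;
          s->max=max;
          s->length=0;
          return s;
      }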

      Because the character array runs on past the official end of the structure, this is a very advanced topic in C. Hence much more complex than a simple character array.
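
      For comparison, C99 later gave this idiom an official spelling, the flexible array member. The sketch below is only an illustration of that feature; the names fstring and new_fstring are hypothetical.

      ```c
      #include <assert.h>
      #include <stdlib.h>

      typedef struct {
          unsigned short max;     /* capacity of s[] in bytes */
          unsigned short length;  /* bytes currently in use */
          char s[];               /* C99 flexible array member */
      } fstring;

      /* Allocates the header and max bytes of character storage in one block. */
      fstring *new_fstring(unsigned short max)
      {
          fstring *p = malloc(sizeof *p + max);
          if (p == NULL)
              return NULL;
          p->max = max;
          p->length = 0;
          return p;
      }
      ```

      The header and the characters still share one allocation, so a single free() releases both.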

      --
      Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
    10. Re:Got it wrong by Spazmania · · Score: 1

      Before C, OS kernels were laboriously written in assembly language. The kernel is an excellent place for the use of C.

      --
      Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
    11. Re:Got it wrong by Rockoon · · Score: 1

      Just wish to point out that Ritchie is citing Kernighan here with regards to Pascal's strings, which is akin to Ballmer citing Gates with regards to Macintosh.

      --
      "His name was James Damore."
    12. Re:Got it wrong by Rockoon · · Score: 1

      You do realize that pascal strings can be copied with a single x86 assembler instruction, right?

      rep movsb

      Your argument doesn't sound like you know that, though. You seem to think that it takes more assembler instructions than the C code. The absolute smallest implementation of that C code's loop on x86 would be 4 instructions:

      copyloop:
      lodsb
      stosb
      cmp al, 0
      jnz copyloop

      I assure you that this is significantly slower than the pascal version too.
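
      The same contrast can be sketched at the C level. The function names here are hypothetical, and whether memcpy ends up as rep movsb depends entirely on the compiler, library and target, so this shows the shape of the two loops and makes no timing claim.

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <string.h>

      /* Counted copy: the length is known up front, so the whole job is a
         single block move that the library or compiler is free to optimize. */
      void copy_counted(char *dst, const char *src, size_t len)
      {
          memcpy(dst, src, len);
      }

      /* Terminated copy: each byte is stored and then tested before moving on. */
      size_t copy_terminated(char *dst, const char *src)
      {
          size_t i = 0;
          while ((dst[i] = src[i]) != '\0')
              i++;
          return i;   /* characters copied, excluding the NUL */
      }
      ```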

      --
      "His name was James Damore."
    13. Re:Got it wrong by Anonymous Coward · · Score: 0

      And x86 instructions had exactly what relevance back when the null-terminated C string decision was made?

      Also, x86 compiler optimizers at least as far back as Borland Turbo C would recognize such constructs and emit the REP instructions.

      - T

    14. Re:Got it wrong by PCM2 · · Score: 1

      Just wish to point out that Ritchie is citing Kernighan here with regards to Pascal's strings, which is akin to Ballmer citing Gates with regards to Macintosh.

      Really? You just compared the creators of C to Ballmer and Gates? Are we to infer that you actually liked Pascal's 255-character strings?

      --
      Breakfast served all day!
    15. Re:Got it wrong by dfghjk · · Score: 1

      "No, strings with a listed length would also be pointers to a series of integers..."

      Interesting that you deleted a portion of the claim so you could argue with it, then ignored the fact that you, yourself, created a struct as a counterexample to the claim that a struct was needed. Defining the first two bytes as meaning something special means that it is either explicitly a struct or "magic type" as the OP said.

    16. Re:Got it wrong by shutdown+-p+now · · Score: 1

      As proof that C is perfectly appropriate for OS kernels, one simply has to look at the most common kernels. Name one that is written in a language other than C

      The fact that few OS kernels are written in something other than C does not mean that C is "perfectly appropriate" for kernels. It just means that either better alternatives don't exist, or that they died for other reasons. The latter is, in fact, true - there were several other system programming languages devised, some of which - most notably Wirth's later creations, from Modula-2 on - were in many ways superior to C. Both Modula and Oberon had OS kernels written in them. Unfortunately, the rest of the world was already hooked on C thanks to Unix, and you needed something significantly better to take over (and maybe not even that would have worked).

      C has many, many flaws. The problem with it is that it's "good enough" (meaning that you learn to work around things that you can, and learn to live with the rest), and that it's so ubiquitous that uprooting it is nigh impossible now.

    17. Re:Got it wrong by Anonymous Coward · · Score: 0

      Challenge: come up with the equivalent for pascal-style strings in a way that doesn't compile into at least twice as much code.

      In fact, aside from strlen, are there any string functions that aren't made at least twice as complex by using P-style instead of C-style strings? Most of strlib.c can be implemented as one-liners, assuming C-style strings.

      1) You are asking people to implement the standard string library from scratch. What is the point of a string library if you're just going to recode it from scratch over and over?

      2) You are claiming that short obtuse code constructs justify the null terminator. By that logic Perl one-liners are heaven rather than hell.

      3) You obviously haven't tried implementing the C library for Pascal strings otherwise you'd know the answer and wouldn't be so overly confident that a decision made to save space whilst costing CPU (remember the space-time tradeoff?) would somehow be more efficient.

      Let's go with this:
      PascalString * strcat(PascalString * dest, PascalString const * src)
      {
              size_t i;
              char *d = dest->ptr + dest->length;
              dest->length += src->length;
              for (i = 0 ; i < src->length ; ++i) d[i] = src->ptr[i];
              return dest;
      }

      Things get better if you also include the buffer length in the string; then you can check for overflow and set errno. This code is LESS complex than the null version.
      char * strcat(char * dest, char const * src)
      {
              char *d = dest;
              while (*d) ++d;
              while ((*d++ = *src++));
              return dest;
      }

      (PS. No bullshit about the null version being shorter and therefore "simpler", that shit is opaque to anyone who isn't thoroughly familiar with operator precedence rules)
      By the by, the Pascal version of strcat is O(n) whereas the C string version is O(n+m).
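
      One way the "include the buffer length" variant might look, as a sketch only: BoundedString and bounded_cat are hypothetical names, the choice of ERANGE as the error code is an assumption, and the two strings are assumed not to overlap.

      ```c
      #include <assert.h>
      #include <errno.h>
      #include <stddef.h>
      #include <string.h>

      /* Hypothetical counted string with an explicit capacity. */
      typedef struct {
          char  *ptr;       /* storage, not NUL-terminated */
          size_t length;    /* bytes in use */
          size_t capacity;  /* bytes available at ptr */
      } BoundedString;

      /* Appends src to dest. Returns dest, or NULL with errno set to ERANGE
         (and dest unchanged) if the result would exceed dest->capacity. */
      BoundedString *bounded_cat(BoundedString *dest, const BoundedString *src)
      {
          assert(dest->length <= dest->capacity);
          if (src->length > dest->capacity - dest->length) {
              errno = ERANGE;
              return NULL;
          }
          memcpy(dest->ptr + dest->length, src->ptr, src->length);
          dest->length += src->length;
          return dest;
      }
      ```

      The overflow test is a single comparison because both lengths are already at hand; neither string is scanned.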

    18. Re:Got it wrong by Anonymous Coward · · Score: 0

      As proof that C is perfectly appropriate for OS kernels, one simply has to look at the most common kernels. Name one that is written in a language other than C. Linux, Windows, Mach, *BSD... All in C. Even OpenBSD is in C, which one would think an odd choice considering the stated goal of OpenBSD.

      If anything, one might argue that an OS kernel is the only appropriate place for C.

      Windows, huh? This might interest you.

      Windows is written in a C/C++ hybrid; parts of the upper subsystems use classes from C++ (multimedia/sound at least). It also doesn't use C strings (see the link), instead using UNICODE_STRING structures, which are basically Pascal strings, for all text processing [including, most importantly, file paths]. Oops.

    19. Re:Got it wrong by Your.Master · · Score: 1

      That's not necessarily a fair example. Strcpy isn't safe unless you can guarantee that the size of the destination buffer is greater than the number of characters up to and including the null terminator. Typically that means you have to call strlen anyway (or carry the string length with you as though it were a counted string).
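
      A sketch of what that guarantee costs in practice, using a hypothetical helper (copy_checked is not a standard function):

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <string.h>

      /* Copies src into dst only if it fits, terminator included.
         Returns 0 on success, -1 if dst_size is too small (dst is untouched). */
      int copy_checked(char *dst, size_t dst_size, const char *src)
      {
          size_t need = strlen(src) + 1;   /* the scan a safe strcpy caller owes anyway */
          if (need > dst_size)
              return -1;
          memcpy(dst, src, need);
          return 0;
      }
      ```

      Once strlen has run, the copy itself no longer needs to look for the terminator, which is the point being made about carrying the length along.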

  21. PHK wide of the mark by epine · · Score: 5, Insightful

    Normally I tend to agree with what I've read from PHK, but this one seems wide of the mark. If you involve a *real* C guru in the discussion, I don't think there would be much sentiment toward nixing the sentinel.

    C makes a big deal about the equivalence of pointers and arrays. Plus in C a string also represents every suffix string.

    char string [] = { 't', 'e', 's', 't', '\0' };
    char* cdr_string = string + 1;

    Perfectly valid, as God intended. A string with a length prefix is a hybrid data structure. What is the size of the length structure up front? It can be interesting in C to sort all suffixes of a string, having only one copy of the string itself. Try that with length prefix strings. (The trivial algorithm is far from ideal for large or degenerate character sequences, but it does provide insight into position trees and the Burrows-Wheeler transform.)
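
    A rough sketch of the trivial suffix-sorting algorithm mentioned above, assuming plain strcmp ordering; sort_suffixes and cmp_suffix are hypothetical names. Each suffix is just a pointer into the one copy of the text:

    ```c
    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>

    /* qsort comparator over an array of char pointers. */
    static int cmp_suffix(const void *a, const void *b)
    {
        const char *const *pa = a;
        const char *const *pb = b;
        return strcmp(*pa, *pb);
    }

    /* Fills out[0..n-1] with pointers to every suffix of text, sorted
       lexicographically. n must equal strlen(text); no characters are copied. */
    void sort_suffixes(const char *text, const char **out, size_t n)
    {
        assert(n == strlen(text));
        for (size_t i = 0; i < n; i++)
            out[i] = text + i;
        qsort(out, n, sizeof out[0], cmp_suffix);
    }
    ```

    For "banana" this yields a, ana, anana, banana, na, nana, all sharing the original seven bytes of storage.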

    Nor would I blame all the stupid coding errors on the '\0' terminator convention. In C, a determined idiot can mess up just about anything, unless the compiler takes over and does things for you, a la Pascal by another name. If that had been the bias, would we all be using C now, or some other language? Repeat after me: Generativity Rocks. Nanny languages usually manage to bork generativity over. Correct Programming Made Easy never strays far from the subtitle Composition Made Difficult.

    No one who ever read Dijkstra and took him seriously ever made a tiny fraction of the stupid mistakes blamed on hapless zero.

    If you want to point to a real steaming pile, strcpy() was designed by a moron with a bad hang-over and no copy of Dijkstra within a 100 mile radius. It was tantamount to declaring "you don't really need to test your preconditions ... what kind of sissy would do that?"

    C is a nice design, as evidenced by how seamlessly the STL was grafted onto C++ at the abstraction layer (at the syntax layer, not so much). The problem with C was always a communication problem. To use C well one must test preconditions on operation validity. To use algebra well one must test preconditions on operation validity.

    Where does PHK lay the blame for the algebraist who made it possible to divide both sides of an equation by zero, or multiply an inequality by -1? Preferably with the complete moron who doesn't check preconditions on the validity of the operation. Two thousand years later, now we have a better solution?

    PHK is right about cache hierarchies. By the time cache hierarchies arrived, we had C++ with entirely different string representations.

    For some reason I've never been keen on having a programmer who can't manage to correctly test the precondition for buffer overflow making deep design decisions about little blocks of lead in the radiation path.

    And it's not even much of a burden. As Dijkstra observed, for many algorithms, once you have all your preconditions right and you've got a provable variant, there's often very little left to decide. It actually makes the design of many algorithms simpler in the mode of divide and conquer: first get your preconditions and variant right (you're now half done and you've barely begun to think hard), *then* worry about additional logic constraints (or performance felicitous sequencing of legal alternatives).

    The coders who first try to get their logical requirements correct and then puzzle out the preconditions do indeed make the original task more difficult than not bothering with preconditions at all, supposing there's some kind of accurate measure over crap solutions, which I refuse to concede.

    1. Re:PHK wide of the mark by EvanED · · Score: 3, Insightful

      If you want to point to a real steaming pile, strcpy() was designed by a moron with a bad hang-over and no copy of Dijkstra within a 100 mile radius. It was tantamount to declaring "you don't really need to test your preconditions ... what kind of sissy would do that?"

      To play Devil's advocate, strcpy cannot check its precondition. You can't tell whether a pointer you're given is valid, or how much space is left in the buffer.

      (Well, I guess you could go make malloc record far more information than it otherwise has to, and make strcpy grovel through that and some other data, but even I don't think that'd have been worth it. And I'm pretty far on the side of "why the heck are we using languages that are as unsafe as C".)

    2. Re:PHK wide of the mark by EvanED · · Score: 1

      Hmm, looking at your post again, I think you may have been singling that out for exactly that reason. If so, never mind.

      (Though if so, I'll point out that much of C is designed around "you don't need to check your preconditions" -- from general array bound accesses to union accesses to casts to all sorts of stuff. strcpy is basically exactly in line with the rest of the language.)

    3. Re:PHK wide of the mark by Carewolf · · Score: 1

      To play Devil's advocate, strcpy cannot check its precondition. You can't tell whether a pointer you're given is valid, or how much space is left in the buffer.

      No, but you could tell it. The buffer size might not be written to memory anywhere, but if you as the programmer don't know it, then you are doing something wrong.

    4. Re:PHK wide of the mark by Anonymous Coward · · Score: 0

      I imagine the idea is that strcpy() should not exist.

    5. Re:PHK wide of the mark by gplus · · Score: 1

      I think that the point TFA is trying to make is that the world would be a better place if the NUL-terminated C string didn't exist.

      Personally, I agree with that sentiment. Because a lot of people, including me, are sloppy, lazy coders...

      Also, I think that PHK distinctly says that he doesn't blame anyone for the giant one-byte mistake.

    6. Re:PHK wide of the mark by EvanED · · Score: 1

      OK, but then why single out strcpy? How's it different from almost any other operation?

    7. Re:PHK wide of the mark by Richard_J_N · · Score: 1

      Remember that many functions are very much more efficient if they know the string length. For example, strlen() and strcat() both have to walk the string 1 byte at a time. This would easily compensate for the overhead of having an initial length field. Anyway, why not have both? Keep the null-terminated strings for the close-to-the-metal code, and implement a sensible string library for human-scale text processing, where most strings are tens to hundreds of characters in length.

    8. Re:PHK wide of the mark by steelfood · · Score: 1

      Problems don't stem from the concept of null-terminated strings; they stem from how everything else around them is written, based on their benefits and limitations. In the case of strcpy, the problems weren't taken into consideration, and thus the result turned out poorly.

      Effectively, null-termination doesn't result in bad things happening; it's the lack of understanding of what it means for everything else that does. And that probably extends to everything.

      --
      "If a nation expects to be ignorant and free in a state of civilization, it expects what never was and never will be."
    9. Re:PHK wide of the mark by Anonymous Coward · · Score: 0

      I don't think God intended anything about computers; He did not invent them, Man did.

    10. Re:PHK wide of the mark by Anonymous Coward · · Score: 0

      Nothing is free, but composition (or efficiency) isn't damned by safety... just look at Oberon_F http://www.oberon.ch/blackbox.html

    11. Re:PHK wide of the mark by shutdown+-p+now · · Score: 1

      C is a nice design, as evidenced by how seamlessly the STL was grafted onto C++ at the abstraction layer

      What's seamless about .c_str()?

    12. Re:PHK wide of the mark by sribe · · Score: 1

      Repeat after me: Generativity Rocks. Nanny languages usually manage to bork generativity over. Correct Programming Made Easy never strays far from the subtitle Composition Made Difficult.

      Too bad that most of the people that need to understand that, are not smart enough to understand. Good point, but shouting in the wilderness...

  22. BSD is dead! by Anonymous Coward · · Score: 0

    Man, it's a sad day on Slashdot when PHK says something and no one says that BSD is dead! You wingnuts are losing your edge.

    1. Re:BSD is dead! by Zontar+The+Mindless · · Score: 1

      Man, it's a sad day on Slashdot when PHK says something and no one says that BSD is dead! You wingnuts are losing your edge.

      On the plus side, we've not seen any Candlejack posts latel

      --
      Il n'y a pas de Planet B.
  23. As an assembly language programmer I resent that by perpenso · · Score: 1

    Null termination sounds lovely when you're a teenager writing assembly and doing register allocation by hand, but it's obviously bad once you've seriously thought about runtimes, like after taking an algorithms class.

    I spent my formative programming years primarily writing code in assembly and I resent that statement. :-) Runtime is always in one's mind and optimizing for speed is the desired goal. Optimizing for size is something that is merely forced upon us by circumstances beyond our control. No true assembly programmer, nor any true Scotsman, would prioritize size over speed if avoidable.

  24. The trouble is arrays, not strings. by Animats · · Score: 3, Interesting

    The problem with C isn't strings. It's arrays. Strings are just a special case of arrays.

    Understand that when C came out, it barely had types. "structs" were not typed; field names were just offsets. All fields in all structs, program-wide, had to have unique names. There was no "typedef". There was no parameter type checking on function calls. There were no function pointers. All parameters were passed as "int" or "float", including pointers and chars. Strong typing and function prototypes came years later, with ANSI C.

    This was rather lame, even for the late 1970s. Pascal was much more advanced at the time. Pascal powered much of the personal computer revolution, including the Macintosh. But you couldn't write an OS in Pascal at the time; it made too many assumptions about object formats. In particular, arrays had descriptors which contained length information, and this was incompatible with assembly-language code with other conventions. By design, C has no data layout conventions built into the language.

    Why was C so lame? Because it had to run on PDP-11 machines, which were weaker than PCs. On a PC, at least you had 640Kb. On a PDP-11, you had 64Kb of data space and (on the later PDP-11 models) 64Kb of code space, for each program. The C compiler had to be crammed into that. That's why the original C is so dumb.

    The price of this was a language with a built in lie - arrays are described as pointers. The language has no idea how big an array is, and there's not even a way to usefully talk about array size in C. This is the fundamental cause of buffer overflows. Millions of programs crash every day because of that problem.
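
    A small illustration of the array size getting lost at a function boundary; the two function names are hypothetical and exist only for this sketch:

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Inside the callee, buf is just a pointer: sizeof reports the size of
       the pointer itself, not of the caller's array. */
    size_t size_seen_by_callee(char *buf)
    {
        return sizeof buf;
    }

    /* In the scope where the array is declared, its full size is known. */
    size_t size_seen_by_owner(void)
    {
        char buf[100];
        return sizeof buf;
    }
    ```

    The size is available only in the scope that declares the array; once the array is passed to a function, the callee has no way to recover it unless the caller sends it along separately.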

    That's how we got into this mess.

    As I point out occasionally, the right answer would have been array syntax like

    int read(int fd, char[n]& buf, size_t n);

    That says buf is an array of length n, passed by reference. There's no array descriptor and no extra overhead, but the language now says what's actually going on. The classic syntax,

    int read(int fd, char* buf, size_t n);

    is a lie - you're not passing a pointer by value, you're passing an array by reference.

    C++ tries to wallpaper over the problem by hiding it under a layer of templates, but the mold always seeps through the wallpaper when a C pointer is needed to call some API.

    1. Re:The trouble is arrays, not strings. by Anonymous Coward · · Score: 0

      int read(int fd, char[n]& buf, size_t n);

      That says buf is an array of length n, passed by reference. There's no array descriptor and no extra overhead, but the language now says what's actually going on.

      How do you figure there's no overhead? There's now an extra argument that needs to be passed to the callee so it can be evaluated at runtime. And what is the compiler to do with the type information, other than to verify naive const overindexing of the array?

      I agree it would be a cool concept but the cost is not zero and the compile-time error checking it would provide would be negligible. Having the array size available for evaluation at runtime would be cool, but it's just syntactic sugar added to the standard:

      int read(int fd, char *buf, size_t buf_max, size_t n);

    2. Re:The trouble is arrays, not strings. by EvanED · · Score: 1

      This is not strictly the same thing, but don't underestimate the problems that are caused (particularly for automatic analysis) by the fact that you can't tell apart something that is semantically a pointer to a single character vs a pointer to an array.

    3. Re:The trouble is arrays, not strings. by Anonymous Coward · · Score: 0

      You mean "int or double". The float type existed, but parameters were all double.

      dom

    4. Re:The trouble is arrays, not strings. by hey+hey+hey · · Score: 2

      Why was C so lame? Because it had to run on PDP-11 machines, which were weaker than PCs. On a PC, at least you had 640Kb. On a PDP-11, you had 64Kb of data space and (on the later PDP-11 models) 64Kb of code space, for each program.

      Your relative comparisons are a bit off. The Altair from 1975 (the first versions of C were finished around 1973) had a whopping 1KB of memory. The mini computers of the day ran rings around what PCs there were, both in raw power and in memory.

    5. Re:The trouble is arrays, not strings. by Anonymous Coward · · Score: 0

      The C language and Pascal were both developed in the 1968 to 1973 time frame for the PDP-x series of computers. PCs did not come along until later. By 1980 the computer hardware landscape was much different than it had been ten years earlier.

    6. Re:The trouble is arrays, not strings. by Anonymous Coward · · Score: 0

      You don't remember. A PC was limited to 64k as defined by the segment registers. You get 640k by switching segment register values.

      Second, you also forget that string types were tried.

      VAX-11 hardware supported multiple string types (it had to). Not just length must be stored, but also storage type, and allocation length. And all lengths had to be supported in 16 bit/32 bit sizes. Storage type was dynamic, static, constant (not the same as static which was compile time). I seem to remember there were 4 bits for the storage type.

      No. A null terminated string is much simpler, and faster.

    7. Re:The trouble is arrays, not strings. by Anonymous Coward · · Score: 0

      Your pointing out a better alternative reminds me of the Go language http://golang.org/. In Go, arrays are restricted so many of these problems go away. To compensate for the limitations this imposes, they came up with what is called a "slice" which is a subsequence of the array. Slices allow the flexibility we need to "slice and dice" arrays but retains the ability to ensure that you never read or write outside the bounds of the original array.
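
      A rough C analog of the idea, not Go's actual implementation (Go slices also carry a capacity, which is omitted here). The slice type and both helpers are hypothetical:

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>

      /* Hypothetical slice: a view into someone else's array plus its length. */
      typedef struct {
          char  *data;
          size_t len;
      } slice;

      /* Narrows s to the half-open range [lo, hi). Returns false, leaving *out
         untouched, if the range does not lie inside s. */
      bool slice_sub(slice s, size_t lo, size_t hi, slice *out)
      {
          if (lo > hi || hi > s.len)
              return false;
          out->data = s.data + lo;
          out->len = hi - lo;
          return true;
      }

      /* Bounds-checked read: stores the byte in *ch and returns true, or
         returns false if i is past the end of the slice. */
      bool slice_get(slice s, size_t i, char *ch)
      {
          if (i >= s.len)
              return false;
          *ch = s.data[i];
          return true;
      }
      ```

      Because every sub-slice is derived from a checked range of its parent, no chain of slicing can reach outside the original array.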

    8. Re:The trouble is arrays, not strings. by Animats · · Score: 1

      Right. This is the short explanation. I have a longer document which makes a more detailed proposal, and includes slices, with Python-like slice syntax for C.

      Microsoft went down this road, with their "source-code annotation language" for C. Essentially the same information as I'm talking about is written, but as comments which can be machine-checked by tools. This never caught on outside Microsoft.

    9. Re:The trouble is arrays, not strings. by Animats · · Score: 1

      The C language and Pascal were both developed in the 1968 to 1973 time frame for the PDP-x series of computers.

      No, Pascal was originally developed, by Wirth, for the CDC 6600, which was considered a "supercomputer" at the time.

    10. Re:The trouble is arrays, not strings. by dfghjk · · Score: 1

      "Pascal powered much of the personal computer revolution, including the Macintosh."

      Not sure what this claim has to do with anything, but it is absurd.

    11. Re:The trouble is arrays, not strings. by Anonymous Coward · · Score: 0

      Yeah, there were a whole lot of PCs with 640kB around in 1969-1970... You also say late-70s... as if the difference of "only" 10 years is insignificant. Hell, you are equating the endpoints of a period that saw the development of TCP/IP, DOS, Alto, the Apple as if it was all just one static period. Wow.

    12. Re:The trouble is arrays, not strings. by shutdown+-p+now · · Score: 1

      int read(int fd, char[n]& buf, size_t n);

      But that's essentially what they did in C99.

    13. Re:The trouble is arrays, not strings. by Anonymous Coward · · Score: 0

      ...come to think of it, the parking ticket machine of today has more computing power than the PDP-11 of yesterday...

    14. Re:The trouble is arrays, not strings. by Anonymous Coward · · Score: 0

      There are so many programming languages. At the same time as C was becoming so very popular, I was working with CORAL66. For this discussion, the interesting thing is that strings had a length field, exactly one byte, so limited to 255 bytes of data. Strings were only really used for small pieces of text. Programmers never used them for arbitrary buffers.

      The hardware was also different, for example the natural word size was 16 bits and there were special instructions to access 8-bit offsets. Just to be clear: address 0x1001 pointed to 16 bits in memory and 0x1002 to the next 16 bits. 8-bit accesses required specific code.

      I think that C has had a large impact on the design of hardware. If it had been differently designed, quite a lot of hardware today would have been different to match that.

      The above are just differences, not necessarily a benefit. The point of the remark is that there is quite a lot more flexibility available than is obvious at first sight and this discussion should allow for that.

  25. First: READ TFA by Hymer · · Score: 1

    PHK's articles are worth reading... always.

    Second: there is a /. Sensation Prevention Section where he explains that NUL-terminated strings were the correct choice at the time; they just had some unforeseen consequences.

    1. Re:First: READ TFA by dutchd00d · · Score: 1

      Second: there is a /. Sensation Prevention Section...

      <Snort>Sure, that'll work</Snort>

  26. String Nulls in SQL by Tablizer · · Score: 1

    Die and rot in nullhell! The verbosity and work-arounds they force...

    1. Re:String Nulls in SQL by shutdown+-p+now · · Score: 1

      SQL NULL has nothing whatsoever to do with C null-terminated strings.

    2. Re:String Nulls in SQL by Tablizer · · Score: 1

      It's an example of a vast one-byte mistake.

  27. Were nul-terminated strings essential? by davidgay · · Score: 1
    The real question nobody has addressed here: if C had gone for length+characters for its strings, would it have succeeded?

    David Gay, scarred by Pascal "strings"
    PS: I've often wondered the same about that other decried C feature, the preprocessor.

  28. Well I differ in my view. by hamster_nz · · Score: 3, Informative

    After 25 years of using C, I don't mind the strings being terminated by nulls. If you want to do something else, just don't include string.h.

    Terminating with a null is only a convention - the C language itself has no concept of strings. As others point out, it is either an array of bytes or a pointer to bytes.

    it isn't forced on to you - you don't have to follow it.

    1. Re:Well I differ in my view. by Anonymous Coward · · Score: 0

      char *foo="This is a string";

      C most definitely has a concept of strings.

    2. Re:Well I differ in my view. by Anonymous Coward · · Score: 0

      Except it would be nice if the compiler had support for writing string constants that are "Pascal" strings. GCC actually can, but only the 255 length variant.

    3. Re:Well I differ in my view. by Anonymous Coward · · Score: 0

      Except that string literals are terminated with zero.

    4. Re:Well I differ in my view. by Anonymous Coward · · Score: 0

      Uh. Are you saying that string literals don't exist in C? That's a weird thing to think.

    5. Re:Well I differ in my view. by Anonymous Coward · · Score: 0

      I think that's a bit misleading. C does favor the null-terminated string. If you put a string in your source, "like this", as you know, its type is a char* to a null-terminated string. So it's only natural to use that as the string type.

    6. Re:Well I differ in my view. by hamster_nz · · Score: 1

      No, C has an easy way to initialize byte arrays with constants, and pathologically adds nulls to the end of them.

      char mystring[8] = "\x07One\0Two";

      Amazing - I've just defined a "Length byte plus data" string much like Pascal uses - and sizeof(mystring) returns '8', so it has no terminating null.

      It is just a convention that can be rejected if needed.
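
      A small sketch that checks the claim, for anyone who wants to try it. The helpers pstr_len and pstr_data are hypothetical names, not anything from the standard library. It relies on the C rule that an array sized exactly to the literal's characters gets no trailing '\0' (C++ rejects this initializer).

```c
#include <stddef.h>
#include <string.h>

/* Exactly 8 bytes: one length byte followed by 7 data bytes.
   The array has no room for the literal's implicit terminator, so C omits it. */
static const char mystring[8] = "\x07One\0Two";

/* Hypothetical helper: the length is read from the first byte, not found by scanning. */
static size_t pstr_len(const char *p)
{
    return (unsigned char)p[0];
}

/* Hypothetical helper: pointer to the data, which may contain embedded NULs. */
static const char *pstr_data(const char *p)
{
    return p + 1;
}
```

      Note that strlen() on the data would stop at the embedded NUL after "One", while pstr_len() reports all 7 bytes.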

    7. Re:Well I differ in my view. by Anonymous Coward · · Score: 0

      I beg to disagree.

        printf("Hello World.\n");
                        ^^^^^ Hey, look, C's concept of a string!

    8. Re:Well I differ in my view. by tgv · · Score: 1

      I pity you for the number of replies that apparently only read up until the point where they get excited, and conveniently overlook the last part, the Anonymous Cowards...

    9. Re:Well I differ in my view. by Anonymous Coward · · Score: 0

      After 25 years of using C, I don't mind the strings being terminated by nulls. If you want to do something else, just don't include string.h.

      Terminating with a null is only a convention - the C language itself has no concept of strings. As others point out, it is either an array of bytes or a pointer to bytes.

      it isn't forced on to you - you don't have to follow it.

      Don't C compilers zero-terminate built-in exe strings?

    10. Re:Well I differ in my view. by Anonymous Coward · · Score: 0

      Except string constants, if one uses them (in the age of internationalization, perhaps one doesn't).

    11. Re:Well I differ in my view. by Lord_Naikon · · Score: 1

      Too bad you still have to explicitly specify the length of the string. Twice.

    12. Re:Well I differ in my view. by Anonymous Coward · · Score: 0

      If you want to write a nontrivial program, i.e. one that uses ANSI C functions like "fopen", you don't have a choice but to use null terminated strings because that's what those calls expect.

    13. Re:Well I differ in my view. by Anonymous Coward · · Score: 0

      char *str = "are you sure?";

    14. Re:Well I differ in my view. by Anonymous Coward · · Score: 0

      Not quite true:

      char *s = "Hello";

      and we're looking at a NUL terminated string, and if it was an array, sizeof would say 6.
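
      To make the array-versus-pointer distinction concrete, a minimal sketch (the names arr and ptr are only for illustration):

```c
#include <stddef.h>
#include <string.h>

/* Array initialized from the literal: 5 characters plus the terminating NUL. */
static const char arr[] = "Hello";

/* Pointer to a literal: sizeof reports the size of the pointer, not the text. */
static const char *ptr = "Hello";
```

      Here sizeof arr is 6 and strlen(arr) is 5, while sizeof ptr is just sizeof(char *), whatever that is on the platform.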

    15. Re:Well I differ in my view. by hamster_nz · · Score: 1

      Agreed. It has got me worried that so many people screamed "you are so wrong" and the email is also marked "+4 informative"... I guess their brains are stuck in the "C source" level of abstraction and believe C actually has strings. No wonder there is so much crappy code out there!

      Picking up my copy of K&R Edition 2.. Page 30

      getline puts the character '\0' at the end of the array it is creating, to mark the end of the string of characters. This convention is also used by the C language: when a string constant like "hello\n" appears in a C program it is stored as an array of characters containing the characters of the string and terminated with a '\0' to mark the end.

      K&R says that null terminated strings is a convention, and that at runtime string constants are actually stored as arrays of characters.

    16. Re:Well I differ in my view. by hamster_nz · · Score: 1

      Yes, but

      char unterminated[10] = "Not always";

    17. Re:Well I differ in my view. by Anonymous Coward · · Score: 0

      Not true. The compiler supports string literals and converts them to NULL-terminated arrays of chars. It doesn't do this with other statically initialized data.

    18. Re:Well I differ in my view. by shutdown+-p+now · · Score: 2

      it isn't forced on to you - you don't have to follow it.

      It's forced in practice by the fact that the entire standard library, and all third-party libraries, all produce and consume null-terminated strings.

      What's far worse is that, since C FFI is the lowest common denominator that we have across various languages, null-terminated strings become the standard way to marshal strings between libraries written in different languages. This means many things: for one, no embedded nulls, which is bad for many scenarios where handling them is desired.

      For another, it means that high-level languages and frameworks often have to take C representation into account when designing their own strings, just so that they can be efficiently converted to a C string. For example, in Qt and .NET, strings have separately stored length, but they're also null-terminated just so that (assuming the string has no embedded nulls, which is otherwise valid) a pointer to the first character is a valid C string, and can be used to call some C API. This is especially sad when the library in question is itself written in another language that, in fact, has its own string representation which supports embedded nulls.
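
      A rough sketch of that hybrid layout in C, assuming nothing about how Qt or .NET actually implement it. The type struct cstr and its functions are hypothetical names. The length is stored separately, and one extra zero byte is appended so the data pointer can be handed to a C API.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted string that is also NUL-terminated. */
struct cstr {
    size_t len;   /* number of data bytes, embedded NULs included */
    char  *data;  /* len bytes followed by one extra '\0' */
};

/* Copy len bytes and append the terminator. Returns 0 on success, -1 on allocation failure. */
static int cstr_init(struct cstr *s, const char *bytes, size_t len)
{
    s->data = malloc(len + 1);
    if (s->data == NULL)
        return -1;
    memcpy(s->data, bytes, len);
    s->data[len] = '\0';
    s->len = len;
    return 0;
}

static void cstr_free(struct cstr *s)
{
    free(s->data);
    s->data = NULL;
    s->len = 0;
}
```

      With an embedded NUL in the data, s.len still reports the full length, but any C function given s.data sees only the part before the first zero byte.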

    19. Re:Well I differ in my view. by Anonymous Coward · · Score: 0

      Right, and when you do "something else", how do you assign "Hello, world\n" to 'struct String foo;' ? The C language has a definite concept of string literals, and that concept matches with string.h functions.

      (Even in C++03 it's still a bit of a mess: you can assign "Hello, world\n" to 'std::string foo;' but it's not the same type)

  29. Missing the point by Casandro · · Score: 1

    It would have been more urgent to find out where an allocated part of RAM ends.

    Or just like Integers and floats, strings could have been their very own basic type. Essentially leave the implementation of it to the compiler, so it can do range checks. Most C-programmers seem to believe that this is done already.

    BTW range checks on integers don't cost anything anymore. I've benchmarked some real-life code using large arrays (doing statistics on it) and range checks didn't cause any slowdown. Essentially the compare operation can be done in parallel with the memory read operation.... that is when your language supports that at all.

  30. Faster loops by Sloppy · · Score: 4, Insightful

    TFA suggests the decision was to save a byte, but I don't believe that's the main reason it happened.

    If you're traversing a string anyway, what happens is that when you load the data into your register (which you'll be doing anyway, for whatever reason you're traversing the string), you get a status flag set "for free" if it's zero, so that's your loop test right there. Branch if zero. If you have to compare an offset to a length on every iteration, then now you're having to store this offset in another register (great, like I have lots of registers to spare on 1970s CPUs?) and compare (i.e. subtract) to the length which is stored in memory (great, a memory access) or another register (oh great, I need to use another register in the 1970s?!) and the code is bigger and slower.
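
    In C terms, the two loop shapes look roughly like this. It is only a sketch: counting spaces is an arbitrary stand-in for "whatever reason you're traversing the string", the function names are made up, and what a given compiler emits for either loop is not something this shows.

```c
#include <stddef.h>

/* Sentinel form: the byte just loaded doubles as the loop test. */
static size_t count_spaces_nul(const char *s)
{
    size_t n = 0;
    char c;
    while ((c = *s++) != '\0')
        if (c == ' ')
            n++;
    return n;
}

/* Counted form as described above: an index is kept and compared with a length. */
static size_t count_spaces_len(const char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (s[i] == ' ')
            n++;
    return n;
}
```

    The counted version needs the extra state (i and len) that the paragraph above complains about, but it also keeps going past an embedded zero byte, which the sentinel version cannot do.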

    It's easy to laugh these days about anyone caring about how many clock cycles a loop takes and whether it uses 2 registers or 4 registers, but this stuff used to be pretty important (and more recently than the 1970s). Kids these days: if you weren't there, you just don't know what it was like.

    BTW, I have a hunch K & R didn't know they were building such an eternal legacy. It's reasonable to speculate that this is still going to be part of systems a hundred years from now, but in 1970 you would have been a mad man to suggest such a thing. (Not that this invalidates TFA's point at all; I'm just making excuses for K&R I guess.)

    --
    As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
    1. Re:Faster loops by Rockoon · · Score: 1

      If you're traversing a string anyway, what happens is that when you load the data into your register (which you'll be doing anyway, for whatever reason you're traversing the string), you get a status flag set "for free" if it's zero

      The most popular architecture with flags is the x86 line, and it most certainly does NOT do what you are claiming. None of the memory copying instructions (mem to mem, mem to reg, reg to mem, or even reg to reg) alter the flags at all, ever, and that's by design.

      Which architecture does what you claim? Seriously. I know it isn't PDP-11, x86, 6502, or 65816.

      Please tell us.

      --
      "His name was James Damore."
    2. Re:Faster loops by Alioth · · Score: 1

      That stuff is *still* important: for every PC sold with gigs of RAM, hundreds of microcontroller-based devices are sold, where register use matters, where memory use matters, where each clock cycle matters. For instance, take the ATtiny16 microcontroller: an 8-bit "system on a chip" with 1K of flash ROM and 64 *bytes* of RAM. Yet C is a useful and productive language to write code for such a device.

    3. Re:Faster loops by Anonymous Coward · · Score: 1

      It was always my impression that C was designed in part to generate efficient PDP-11 assembly code. The PDP-11 had a post-increment addressing mode, and could do memory-to-memory copies, which meant that

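              const char *sp = s;               /* source string */
              char *dp = d;                     /* destination buffer */
              while ((*dp++ = *sp++) != '\0')   /* copy up to and including the NUL */
                      ;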

      could be translated into:

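              MOV #S, R1              ; address of S (source)
              MOV #D, R2              ; address of D (destination)
      LOOP:
              MOVB (R1)+, (R2)+       ; copy one byte, post-incrementing both pointers
              BNE LOOP                ; loop until the byte just copied was zero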

      (Sorry, it's been almost forty years since I did any PDP-11 assembly language, and I don't remember the syntax any more).

      In the days of counting bytes and clock cycles, this kind of efficiency was a gem. It was one of the things that made it possible to write operating systems code in C (a "higher-level" language), as opposed to assembly language (which was the norm).

    4. Re:Faster loops by Sloppy · · Score: 1

      Which architecture does what you claim? Seriously. I know it isn't PDP-11, x86, 6502, or 65816.

      PDP-11 was a very long time ago for me and I only wrote a few hundred lines of assembly on it, so maybe you're right. I know I could test a register for zeroness pretty damn fast, though -- if it wasn't "free" it was faster than just about anything else the processor was capable of doing.

      But 6502, are you serious? Load the accumulator with a zero and the Z flag is set. Load it with any other value and Z flag is clear. It's that easy. LDA followed by BNE or BEQ and that's your way out of the loop.

      x86 is irrelevant to 1970s programmers.

      --
      As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
    5. Re:Faster loops by pruss · · Score: 1

      I think the two pieces of code would be something like:

      NUL-terminated:

          MOV SI, [pointer] ; load pointer
          CLD ; clear direction flag for traversing
      loop_top:
          LODSB ; get byte pointed to by SI, increase SI
          OR AL,AL
          JZ done
          ... do whatever you need to do with AL, restoring it if necessary
          JMP loop_top
      done:

      length-prefixed:
          MOV SI, [pointer] ; load pointer
          CLD ; clear direction flag for traversing
          LODSW ; get length (N.B. expensive if not word aligned), increase SI by 2
          MOV CX,AX ; copy length to CX
          OR CX,CX ; check to see if length is zero
          JZ done ; get out if it is
      loop_top:
          LODSB ; get byte pointed to by SI, increase SI
          ... do whatever you need to with AL
          LOOP loop_top ; decrement CX and jump if non-zero
      done:

      The length-prefixed code has more setup code. Notice the kludgy separate case check for zero-length. Moreover, the length-prefixed code uses an extra 16-bit register (CX) for the countdown. On the other hand, the length-prefixed code has a lot less work within the loop. The looping structure is all handled by the LOOP opcode, while the NUL-terminated needs three instructions to control the looping.

      The LOOP opcode is 16 cycles when the jump happens, while the OR is 3, the JZ is 4 when the jump doesn't happen and the JMP is 11 if the loop isn't too large. So, the length-based loop uses 16 cycles for loop control, while the NUL-based loop uses 18. Two cycles isn't a big difference, and will be made up for if the NUL-based loop does something useful with the extra 16-bit register.

      However, some simpler operations would be faster on a length-prefixed arrangement. For instance, for a string comparison one could use the cool REPE CMPSB, and for string copying one could use REP MOVSB. Both of these require knowing the length ahead of time. The NUL-terminated equivalents would require a conventional loop.

      That said, I like the NUL-terminated strings simply on aesthetic grounds: there is no need for a string type in the language. It's all just a matter of the library, and if you don't like the standard string library, you can write your own, for instance storing strings as struct {unsigned length; char* string;}. (Maybe better would be struct {unsigned length; char string[0];}, but such extendable-length structs weren't quite standard.)
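
      A sketch of the second layout in C. C99 later standardized the "extendable" struct as a flexible array member, written char string[] rather than char string[0]. The names struct lstring, lstring_new and lstring_equal are hypothetical, chosen only for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Counted string: the length is stored in front of the bytes, with no terminator. */
struct lstring {
    unsigned length;
    char     string[];   /* C99 flexible array member */
};

/* Allocate header and data in one block. Returns NULL on allocation failure. */
static struct lstring *lstring_new(const char *bytes, unsigned length)
{
    struct lstring *s = malloc(sizeof *s + length);
    if (s == NULL)
        return NULL;
    s->length = length;
    memcpy(s->string, bytes, length);
    return s;
}

/* Equality can reject on length first, then compare a known number of bytes. */
static int lstring_equal(const struct lstring *a, const struct lstring *b)
{
    return a->length == b->length &&
           memcmp(a->string, b->string, a->length) == 0;
}
```

      Since the whole object is one malloc block, a plain free() releases it.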

    6. Re:Faster loops by pruss · · Score: 1

      I noticed that my length-based example could collapse two lines of setup code to one. Replace OR CX,CX and JZ done with:
          JCXZ done

      It's been a long time. :-(

    7. Re:Faster loops by spitzak · · Score: 1

      The PDP-11 did in fact set the non-zero flag for a very large set of operations, including memory to memory copy. So in fact the GP is correct, this is probably the most efficient way to copy a string on a PDP-11.

    8. Re:Faster loops by jedwidz · · Score: 1

      So if not free, it's still a relatively cheap TST (or whatever) instruction after reading the data into a register. GP's point stands.

    9. Re:Faster loops by sribe · · Score: 1

      It's easy to laugh these days about anyone caring about how many clock cycles a loop takes and whether it uses 2 registers or 4 registers, but this stuff used to be pretty important (and more recently than the 1970s).

      People have been laughing about that for over 20 years by my count. Yet I still care. And you should hear the feedback I get from users when they try a competitor's product. Yeah, it still counts. It means you can be fast with data sets that those guys back then could hardly dream of ;-)

    10. Re:Faster loops by Anonymous Coward · · Score: 0

      You forgot coding in the seventies. You didn't store a current position, length and offset. That's just redundant. You just decremented the "remaining length" on every iteration, got the free zero check, and then used "branch if not zero". Takes one more register, not two, but you've eliminated a test on a variable read from memory.
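
      In C, the countdown form looks something like the following. This is a sketch only; copy_counted is a made-up name, and whether the decrement and the branch share a flag test depends on the target machine and the compiler.

```c
#include <stddef.h>

/* Copy exactly `remaining` bytes. The only loop state besides the two
   pointers is the count, which runs down to zero. */
static void copy_counted(char *dst, const char *src, size_t remaining)
{
    while (remaining != 0) {
        *dst++ = *src++;
        remaining--;
    }
}
```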

      And as for the legacy part: remember C came from Bell Labs. AT&T was running seriously old tech at that point; the whole fad of disposable electronics wasn't as prevalent as it is now. K&R had every reason to believe their code would find its way into new AT&T hardware (it did, 4ESS/5ESS) which will be around for a while (at least until the next decade, if not longer).

    11. Re:Faster loops by Sloppy · · Score: 1

      You didn't store a current position, length and offset. That's just redundant. You just decremented the "remaining length" on every iteration, got the free zero check, and then used "branch if not zero". Takes one more register, not two

      You're exactly right. I realized that a few minutes after my post, thinking, "Oh geez, someone is going to call me on my shitty inefficient programming" but a day went by and it looked like I was getting away with it. And then.. damn you. Oh well.

      --
      As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
    12. Re:Faster loops by badkarmadayaccount · · Score: 1

      If you want bare metal, use Ada.

      --
      I know tobacco is bad for you, so I smoke weed with crack.
  31. MS path separator (backslash) by Anonymous Coward · · Score: 0

    If employing backslash as the path separator in DOS was not the worst idea, then I do not know what was...

    1. Re:MS path separator (backslash) by spitzak · · Score: 1

      Although I agree, since backslash had a very long use as the escape character, you could say that the choice of slash by Multics/Unix was bad, too.

      Period would have made a lot more sense, it would match the way hierarchies are indicated in virtually all programming languages.

      The problem was that period made so much sense that it had already been incorporated into existing primitive filename conventions (usually as a separator between the name and the type, often called an "extension"). Unix had to be able to copy sets of files from other machines, which meant period had to be preserved, and thus could not be used for directory separator. (although it would be interesting if foo.c and foo.o meant a directory called foo with c and o files in it, I assume the overhead of creating many 1-entry directories for typical sets of files was considered prohibitive).

  32. Re:Why do I need a subject? by sortius_nod · · Score: 0

    you fail at slashdot AC.

    1) Slashdot is about the discussion of the news for nerds, not just having everything
    2) posting a snide remark as AC means you won't get discussion

    double fail at slashdot, enjoy.

  33. Everyone misunderstands that poem. by symbolset · · Score: 1

    Including you.

    --
    Help stamp out iliturcy.
    1. Re:Everyone misunderstands that poem. by mwvdlee · · Score: 2

      Which would seem to imply you have reason to believe the GP is incorrect in his interpretation of the poem.
      Please enlighten us with your insights.

      --
      Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
    2. Re:Everyone misunderstands that poem. by symbolset · · Score: 0

      OK fine. It's art. It is what it means to you. That is the point of art.

      --
      Help stamp out iliturcy.
    3. Re:Everyone misunderstands that poem. by Rosy+At+Random · · Score: 1

      To me it means that we should all choose to kill as many people as possible. Hey, it's art, damnit, and my interpretation is just as valid.

      --
      Would you like a slice of toast?
    4. Re:Everyone misunderstands that poem. by Chryana · · Score: 1

      That might be a good definition if we were discussing a painting made by throwing paint at a canvas or the picture of an unmade bed, but in this case it is not. I think I'll go for the interpretation you disagreed with rather than your completely inane one.

  34. Y2k as most expensive mistake by Anonymous Coward · · Score: 0

    I'm a bit surprised that PHK didn't mention Y2k as an example of a design choice that made sense at the time but required very expensive mitigation.

    1. Re:Y2k as most expensive mistake by Arancaytar · · Score: 1

      Because that one was a two-byte mistake.

    2. Re:Y2k as most expensive mistake by Dog-Cow · · Score: 1

      Not so. Simply storing the century in another byte would have mitigated any y2k issues. Of course, that would only work until the year 25500, but I think we'd be OK in general.
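
      As a sketch of the arithmetic, assuming a hypothetical record with one byte for the century and one for the two-digit year (the names stored_date and full_year are made up):

```c
/* Hypothetical date record: one extra byte holds the century. */
struct stored_date {
    unsigned char century;  /* 0..255, e.g. 19 or 20 */
    unsigned char yy;       /* two-digit year, 0..99 */
};

/* Reconstruct the full year from the two stored bytes. */
static unsigned full_year(struct stored_date d)
{
    return d.century * 100u + d.yy;
}
```

      With an 8-bit century the largest representable year is 255 * 100 + 99 = 25599.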

  35. The problem is in-band signalling in general by arglebargle_xiv · · Score: 1

    The problem isn't so much '\0' vs counted strings, it's in-band signalling in general. The telcos found this out in the 1970s with 2600 Hz whistles (and, eventually, fixed it), while the general computing world continues to use it, and in fact is busy inventing new and more complex ways of doing it all the time. String overruns, SQL injection, XSS, and many others are all examples of exploiting in-band signalling. The worst offender of the lot must be XML, which so thoroughly confuses what's data and what's control information that we'll still be trying to sort out the mess for decades to come. If you could remove in-band signalling, you'd also suddenly deal with a significant chunk of the OWASP perpetual top ten.
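
    For the '\0' case specifically, a small C sketch of the mismatch. The type struct span and the function inband_length are hypothetical. The span carries its length out of band, while inband_length shows what a consumer sees if it treats the first zero byte in the data as the end marker.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical out-of-band representation: the length travels beside the data. */
struct span {
    const char *data;
    size_t      len;
};

/* In-band view: the first zero byte inside the data is taken as the end marker. */
static size_t inband_length(struct span s)
{
    const char *nul = memchr(s.data, '\0', s.len);
    return nul ? (size_t)(nul - s.data) : s.len;
}
```

    For the 10 bytes "user\0admin", the out-of-band length is 10 but the in-band reader stops at 4, so two layers of the same program can disagree about what the value is.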

    1. Re:The problem is in-band signalling in general by Jon+Stone · · Score: 1

      This is a fundamental problem. Instructions and data are intermingled in memory and on disk. Buffer overflows exploit this by tricking computers into executing data as code. Most interpreted languages support an eval() like procedure that takes data and interprets it as code. On the topic of interpreted languages - is a Perl script data or code?

      Things like the NX bit in newer CPUs help but don't solve the problem.

    2. Re:The problem is in-band signalling in general by arglebargle_xiv · · Score: 1

      On the topic of interpreted languages - is a Perl script data or code?

      That leads to a related problem, that we have Turing machines everywhere. For example what should be a flat technical document (PDF) has no less than three complete programming environments built into it, even more if you include stuff like MHEG content embedded in video streams in the PDF. It seems like no technology has really "arrived" until it's programmable in some way, and with that programmability comes exploitability.

  36. Might not have been the most costly but... by rcpitt · · Score: 1
    Ken Thompson, one of the original creators of the UNIX system and the C language was asked what he'd do differently if he were redesigning the UNIX system.

    His reply: "I'd spell creat with an e"

    one byte - but a world of errors

    --
    Been there, done that, paid for the T-shirt
    and didn't get it
  37. Be careful what you wish for... by dutchd00d · · Score: 2

    If they had gone with the embedded length option we'd be sitting around bitching about how short-sighted it was to use just two bytes for the length. Including how Dennis Ritchie supposedly said "64K strings should be enough for anybody".

    1. Re:Be careful what you wish for... by lederhosen · · Score: 1

      If they had gone with the embedded length option we'd be sitting around bitching about how short-sighted it was to use just two bytes for the length. Including how Dennis Ritchie supposedly said "64K strings should be enough for anybody".

      One extra byte was clearly meant for that time; nowadays you would probably use 7 extra bytes (depending on memory model). This would of course not be hard coded.

    2. Re:Be careful what you wish for... by Anonymous Coward · · Score: 0

      If they had gone with the embedded length option we'd be sitting around bitching about how short-sighted it was to use just two bytes for the length. Including how Dennis Ritchie supposedly said "64K strings should be enough for anybody".

      No, we wouldn't. The standard library would have provided a set of rope functions, so that when you started playing with really long strings you'd just switch to ropes. More simply, you can just place a bunch of strings in a linked list with a light API over the top to obscure it, but you shouldn't need to worry about that too often. IMO, having really long strings indicates a design flaw: you've loaded too much data or are using the wrong data structure [for a word processor you should have something like pages that contain paragraphs that contain segments (formatting: size, bold, etc) that contain strings, for example].

  38. On the MS part... by yuhong · · Score: 1

    Another candidate could be IBM's choice of Bill Gates over Gary Kildall to supply the operating system for its personal computer. The damage from this decision is still accumulating at breakneck speed, with StuxNet and the OOXML perversion of the ISO standardization process being exemplary bookends for how far and wide the damage spreads.

    As it happens, I researched this one for years, and I think the root cause of that one is Gates being an aggressive businessman who considered business as war, which helped it win the IBM PC deal, but also led to many of MS's evils over the 1990s.

  39. C strings? by hcs_$reboot · · Score: 1
    C language has no string. It has only char arrays, or pointers to char.
    A general convention (used primarily in the libraries) was to consider a "string" being a set of characters ended by a zero-byte.

    By convention in C, the last character in a character array should be a `\0' because most programs that manipulate character arrays expect it.

    -- Brian W. Kernighan

    --
    Slashdot, fix the reply notifications... You won't get away with it...
    1. Re:C strings? by Arlet · · Score: 1

      C has constant strings.

  40. Really low end systems: MCUs by drolli · · Score: 1

    I can easily imagine a situation where on an MCU with 64 bytes of ram, the additional counter you need to maintain not NUL terminated strings is an issue (e.g. when sending out data in an interrupt routine...).

  41. Re:The Road Not Taken by other Slashdot FPers by o'reor · · Score: 5, Funny

    I for one welcome that refreshing new way of writing "Frost's pissed."

    --
    In Soviet Russia, our new overlords are belong to all your base.
  42. The worse the programmer... by Tanuki64 · · Score: 1

    ...the more the complaints about bad language features.

    1. Re:The worse the programmer... by Anonymous Coward · · Score: 0

      Poul-Henning Kamp is a better programmer than any 10 people with a slashdot account.

    2. Re:The worse the programmer... by SlashV · · Score: 1

      The worse the programmer... ...the more the complaints about bad language features.

      That may be true, but that doesn't make those particular features 'good' language features.

  43. Let's face it. by byrtolet · · Score: 1
    C just does not have strings. NUL terminated strings is just a hack and a good one.

    The real problem is that many people suck at coding, and worse, many code without a proper and thorough understanding of what they do.

    There are no bad tools. There are misused ones.

  44. Comment removed by account_deleted · · Score: 1

    Comment removed based on user account deletion

  45. Sorry, he's wrong by rdebath · · Score: 1

    He's underestimated the space cost. It's not just a matter of losing the NUL byte and gaining an integer length; just encoding the current length in the type doesn't protect you from buffer overflows. For that you also need a maxstrlen integer.

    So the size would be a length int, a maxlen int and the bytes of the string. But you're lucky, you see malloc holds an integer too for the length of the memory block it will free back into the pool, and this can be overloaded with the maxstrlen field. So you're back to two bytes of overhead for a malloc'd string.

    Constant strings are another matter, they don't have the malloc header so something else would have to be done. Probably the easiest would be to set the malloc length to zero, it's a constant after all so doesn't need to be freed or overwritten. That does mean 4.5 bytes overhead (including alignment) though.

    The problem comes with buffers on the stack. These have a fixed chunk of memory allocated and so can't be expanded. But the malloc header we have for the strings explicitly can be expanded. There are two solutions; we could spend two more bytes and add a pointer to the base string structure then malloc the bytes for the string itself independently. A malloc'd string would be malloc'd in two pieces, doubling the malloc overhead, and we'd have to do something about freeing this malloc'd space when the buffer goes out of scope ... I'm stuck, this isn't C anymore.

    So the second choice goes right back to the start: the malloc size (maxstrlen) is now a hard limit; no library routine can expand it after it's been created. A string on the stack has a fake malloc header and the current string length (overhead 4.5 bytes). Of course there's the problem as to what to do if the string is too big; there are no exceptions, just truncation. (more, different, bugs)

    So that's it: the additional overhead is three bytes, not one, plus the alignment overhead. Two of the bytes (and the alignment) disappear with malloc'd strings but the added complexity stays.

    And you don't even get dynamic strings.
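
    A minimal sketch of the length-plus-maximum layout described above, assuming a hypothetical bstr type with 16-bit len and cap fields and hypothetical bstr_init/bstr_append helpers (none of these come from the article or any real library). It only illustrates the fixed-capacity, truncate-on-overflow behaviour:

    ```c
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical counted string: current length, hard capacity, then the bytes. */
    typedef struct {
        unsigned short len;   /* bytes currently in use */
        unsigned short cap;   /* fixed at creation; never grows */
        char data[32];        /* storage; a stack instance cannot be expanded */
    } bstr;

    /* Initialise an empty string whose capacity is the size of its storage. */
    static void bstr_init(bstr *s)
    {
        s->len = 0;
        s->cap = (unsigned short)sizeof s->data;
    }

    /* Append n bytes, truncating at cap. Returns the number of bytes dropped. */
    static size_t bstr_append(bstr *s, const char *src, size_t n)
    {
        size_t room = (size_t)(s->cap - s->len);
        size_t take = n < room ? n : room;
        memcpy(s->data + s->len, src, take);
        s->len = (unsigned short)(s->len + take);
        return n - take;
    }
    ```

    A non-zero return value is the only signal that data was lost, which is the "more, different, bugs" trade-off in practice.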

    1. Re:Sorry, he's wrong by darkwing_bmf · · Score: 1

      Constant strings are another matter, they don't have the malloc header so something else would have to be done. Probably the easiest would be to set the malloc length to zero, it's a constant after all so doesn't need to be freed or overwritten. That does mean 4.5 bytes overhead (including alignment) though.

      For constant strings the compiler could keep track of the size. You could use even less memory than C.

    2. Re:Sorry, he's wrong by rdebath · · Score: 1

      Only if you use an extra pointer to point at the bytes of the string (so you can manufacture a string header anywhere at run time) or you copy your byte array to somewhere where there is some clear space before the string so that the library functions can find the length value where they expect it.

      If you don't you'd have to have a complete set of special functions to use the different format constant strings or a special function call to convert the byte array plus length into a proper "string" structure.

      In the end you'd waste more space dealing with the difference than you would just adding the fake string header in the first place.

    3. Re:Sorry, he's wrong by darkwing_bmf · · Score: 1

      That's only if the string is passed to a separately compiled function. You're talking about implementation details. But even assuming an extra pointer is needed, those really don't take up a lot of space in comparison to the character data itself. Certainly today the benefits gained by knowing the length of the string far outweigh any drawbacks in memory usage. And even back when memory was scarce... well, if one byte is that much concern then even C is too much. Do it in assembly and pre-calculate your character array storage needs.

  46. Re:Why do I need a subject? by Darfeld · · Score: 2

    But you reply to him knowing he won't read you? (Or, if I let my paranoia run, you're AC and you reply so you have a link to your prior post and can check answers... And so you'd be trolling.)

    --
    (\__/) This is Lapinator
    (='.'=) copy it in your sig
    (")_(") so it can take over the world
  47. How about looking at current mistakes? by thaig · · Score: 1

    What are today's costly mistakes?

    --
    This is all just my personal opinion.
    1. Re:How about looking at current mistakes? by Migala77 · · Score: 1

      - Oracle
      - Patents
      - Oracle's patents

    2. Re:How about looking at current mistakes? by fritsd · · Score: 1

      Good question!
      What do you people think of the following ones:
      UTF-16 and UTF-32 instead of UTF-8 (N.B. UTF-8 plays nice with the NUL-terminated strings)
      32-bit time_t (we're all going to be busy in 2037 :-) )
      (maybe off-topic) choosing MSOOXML instead of ODF 1.2

      --
      To be, or not to be: isn't that quite logical, Slashdot Beta?
    3. Re:How about looking at current mistakes? by fritsd · · Score: 1

      that's only a costly mistake for the USA and Japan

      --
      To be, or not to be: isn't that quite logical, Slashdot Beta?
    4. Re:How about looking at current mistakes? by spitzak · · Score: 1

      I would agree that UTF-16 instead of UTF-8 is the most costly current-day mistake.

      Some other aspects of Unicode design might also be big mistakes, such as multiple ways of representing the "same" characters. This is going to be like case-independent filenames (does lower-case æ match uppercase or not?), but with a million different and very complex possible "case foldings" and everybody disagreeing about which set they use. Possibly they can learn from Unix and make different byte streams always mean different strings, but I have my doubts that some system programmers are that intelligent, since they also think UTF-16 is a good idea.

      The 32-bit time_t is an old mistake, not a modern one. I think a bigger mistake in time is to keep underestimating how fine of a division you want of a second (original unix had 1) and to keep using powers of 10 for these divisions, which do not translate loss-less into floating point formats. Linux and Posix have about a dozen ways of representing time, each with a different unit for the sub-second portion, and all of them powers of 10.

      Another mistake is to not have a primitive atomic-file-create call on Unix or Windows. This would appear to create an empty file that you could write to, but until the file is closed it would not appear in the file system; instead all processes would either not see the file or see exactly the previous version of the file. Every program wants this, and in fact I think a system could be made where this is the only way to write a file, but the file systems do not support it directly; instead you must use very complex workarounds (or simpler workarounds that are flawed). There was a huge stink because new Linux filesystems made these workarounds fail, and the hostile reactions of some of the Linux gurus lead me to believe that this mistake is not going to be addressed soon.
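
      The usual workaround is to write a temporary file next to the target and rename() it into place. A minimal POSIX sketch follows, assuming a hypothetical replace_file helper; it ignores details such as EINTR, preserving the original file's permissions, and syncing the containing directory:

      ```c
      #define _POSIX_C_SOURCE 200809L
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>

      /* Replace `path` with `len` bytes of `data`, or leave the old file untouched.
         Returns 0 on success, -1 on failure. */
      static int replace_file(const char *path, const char *data, size_t len)
      {
          char tmp[4096];
          size_t off = 0;
          int fd;

          /* The temporary must be on the same filesystem for rename() to be
             atomic, so it is created next to the target. */
          if (snprintf(tmp, sizeof tmp, "%s.XXXXXX", path) >= (int)sizeof tmp)
              return -1;
          fd = mkstemp(tmp);
          if (fd < 0)
              return -1;

          while (off < len) {
              ssize_t n = write(fd, data + off, len - off);
              if (n < 0)
                  goto fail;
              off += (size_t)n;
          }
          if (fsync(fd) != 0)          /* flush data before the name switches over */
              goto fail;
          if (close(fd) != 0) {
              fd = -1;
              goto fail;
          }
          fd = -1;
          if (rename(tmp, path) != 0)  /* readers see the old or the new file */
              goto fail;
          return 0;

      fail:
          if (fd >= 0)
              close(fd);
          unlink(tmp);
          return -1;
      }
      ```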

      Related to this, glibc and Windows refusing to put strlcpy into the standard library is probably a cause of enormous numbers of nul termination bugs, since programmers are lazy and will do stupid things because they don't have this call.
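
      For reference, a sketch of what a strlcpy-style copy does, written as a hypothetical copy_bounded so it does not clash with platforms that do ship strlcpy. It assumes the BSD semantics: always NUL-terminate when the size is non-zero, and return the length of the source so the caller can detect truncation.

      ```c
      #include <stddef.h>
      #include <string.h>

      /* Copy src into dst (size bytes in total), always terminating if size > 0.
         Returns strlen(src); a result >= size means the copy was truncated. */
      static size_t copy_bounded(char *dst, const char *src, size_t size)
      {
          size_t srclen = strlen(src);
          if (size > 0) {
              size_t n = srclen < size - 1 ? srclen : size - 1;
              memcpy(dst, src, n);
              dst[n] = '\0';
          }
          return srclen;
      }
      ```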

  48. Not the language, the library and the system by medoc · · Score: 1

    PHK actually hints at 2 things: that strings should have been length+array, and that the compiler should know about it.

    The first assertion is subject to discussion and has its serious issues (strings would have become foreign to other C arrays, what integer size etc.).

    The second point is I think more clear-cut. As it is, the C compiler knows mostly nothing about strings, which means that it's easy to design a different strings library and use it instead of strcpy() et al (cf. c++ strings). The only constraint is that you have to present a zero terminated string to system and foreign libraries interfaces.

    Embedding the string structure in the compiler would have ossified the choice, making C a much less adaptable language, in contrast with its other features, and a fault of style.

    1. Re:Not the language, the library and the system by MichaelSmith · · Score: 1

      A descriptor based string API would be pretty simple to write. There is one in VMS for example.

    2. Re:Not the language, the library and the system by shutdown+-p+now · · Score: 1

      As it is, the C compiler knows mostly nothing about strings

      But it does. It takes your string literal, "abc", and lays it out in memory as a null-terminated string.

  49. Pascal string hell by Anonymous Coward · · Score: 0

    The article is complete BS. Anyone who has programmed strings in pascal knows what a PITA
    it is programming strings with a 255 character limit. I despised this while working in that hell of
    a language.

    1. Re:Pascal string hell by MichaelSmith · · Score: 1

      String descriptors could be coded with arbitrary width length fields. It could use an extension bit for example.
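
      One way to read "extension bit" is a variable-width length prefix: each prefix byte carries 7 bits of the length, and the high bit says another byte follows. A sketch under that assumption (the put_len/get_len names are hypothetical, and this is not the VMS descriptor format):

      ```c
      #include <stddef.h>

      /* Write `len` as a variable-width prefix into out (up to 10 bytes for a
         64-bit size_t). Returns the number of prefix bytes written. */
      static size_t put_len(unsigned char *out, size_t len)
      {
          size_t i = 0;
          while (len >= 0x80) {
              out[i++] = (unsigned char)((len & 0x7F) | 0x80);  /* extension bit set */
              len >>= 7;
          }
          out[i++] = (unsigned char)len;                          /* last byte: bit clear */
          return i;
      }

      /* Read a prefix written by put_len. Stores the length in *len and returns
         the number of prefix bytes consumed. */
      static size_t get_len(const unsigned char *in, size_t *len)
      {
          size_t i = 0, value = 0;
          unsigned shift = 0;
          while (in[i] & 0x80) {
              value |= (size_t)(in[i++] & 0x7F) << shift;
              shift += 7;
          }
          value |= (size_t)in[i++] << shift;
          *len = value;
          return i;
      }
      ```

      Strings of up to 127 bytes still cost a single prefix byte, and longer ones pay one more byte per additional 7 bits of length.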

    2. Re:Pascal string hell by darkwing_bmf · · Score: 1

      Not all fixed string languages are Pascal.

  50. The decision was different! by angel'o'sphere · · Score: 1

    I don't believe the decision was based on memory concerns.

    C was originally designed to be a portable assembler.

    Most micro processors clear the carry flag and set the zero flag if a 0x00 byte is loaded into a register (or a 0x0000 word etc.) or moved.

    That means loops like this:

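    char *strcpy(char *dest, const char *src) {
        char *start = dest;
        /* copy bytes up to and including the terminating NUL */
        while ((*dest++ = *src++) != '\0')
            ;
        return start;
    }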

    are basically a 1:1 transformation from assembly into C.

    Also keep in mind: having every string starting with a length indicator would make typical unix file handling and piping between utilities a little bit more complicated.
    Text files ... should every line now start with 2 bytes length indicator? In which endian format? Or should they stay plain text but while reading lines the "readline()" call is counting .... (to which line terminator ?? \n??) and updating the size?

    Bottom line: using "0x00" as string terminator was pretty elegant. After all, it allowed performant algorithms on strings and kept the library simpler. That reminds me: how many "structs" are defined in the old standard C library?

    --
    Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
    1. Re:The decision was different! by bussdriver · · Score: 1

      The mistake would have been pascal style strings putting the length byte onto the array, therefore making strings into a complex datatype. It would have been 1 length byte (like Pascal, of similar age.) This would have limited the length of strings and quickly led people to using Byte arrays for their string work or ugly hacks to handle mixed string sizes. Remember, they weren't running 8 bit words back then so they'd have to make this new complex string structure AND a byte datatype since text was a byte and we need byte sized units ;-p
      This wasn't just a design decision but also created more work; when terminators would allow you to combine bytes and strings.

      255 length strings are long enough for everybody... biggest reason I went to C over Pascal-- I hated length byte string limitations! The limitation would have made a LOT of people use byte arrays instead. We can handle huge length integers on everything today so terminators are less useful. 64bits are long enough for most people; 128bit being unlikely to be an issue. This would have been insane to do on systems around 16bit on something used so frequently.

      I agree with the assembly perspective but it merely would add a bias to making decisions it does not make it the deciding factor. There are plenty of macro like things C did but it does make a few abstractions in how it tackled problems which could have gone in other directions -- nothing fancy because the compilers back then were more like automated assembler macro tools. What I could see them geeking out over is making the perfect macro sets where you could combine them into clever ways-- I can see the strong appeal of just how clever some of these "bad" C programming techniques are when a simple direct translation is involved. When designing the language the mindset would be to make the smallest set of powerful assembly macros so again it makes the length byte seem like the lesser solution (from that perspective.)

      Now with hindsight and existence of software engineering practices, the bias is towards avoiding bugs etc and we just ignore speed and figure someday the compiler will figure it out or the runtime system will dynamically tweak it. This kind of thinking probably would get you fired back then while now their thinking gets you fired (goto was not EVIL back then.)

      Their assembly background and love of goto is why goto still exists. Goto allows for things otherwise impossible in the language to be done, like a backdoor to the constraints of the high level abstractions -- if they thought more like today, goto would be dead and we'd have some smarter language features; such as:

      a break/continue that can go up multiple blocks (and not just loops... think about how many if else trees could be pruned!) I really wish somebody would fix this already!

      A primitive precursor to exceptions (named goto? limited scope goto? a fancier switch? dispatch tables? multiple return values? nested function definitions or shared function frames?)

    2. Re:The decision was different! by Anonymous Coward · · Score: 0

      The mistake would have been pascal style strings putting the length byte onto the array therefore making strings into a complex datatype

      Strings are already a complex data type. They're arrays of characters. Characters are (usually) a simple data type but C doesn't treat strings like simple arrays, thus the need for a terminator character like null.

      What it does allow is processing of arbitrary-length strings, but you still need a fixed-length array or buffer to store strings while handling them, and you need to keep track of the size of that buffer, so why not just do that with the string itself?

    3. Re:The decision was different! by spitzak · · Score: 1

      Why they did not (and still don't) have named break/continue is a mystery. I would say 90% of the uses of goto are because of this. I also suspect that if the early C compiler could do a goto, it could have done a named break/continue.

      Another use of goto is code like this; does anybody know of a plausible construct for it in any programming language? In this example bar is a complex mess of code that refers to dozens of local variables, thus making it into a function or duplicating it would make the code much more complex and hard to understand. In addition test_2 will crash or produce an undefined result if test_1 is true.

          if (test_1) {
              foo;
              goto TEST_2;
          } else if (test_2) {
          TEST_2:
              bar;
          }

      Some of the earliest C functions had setjmp/longjmp and these did work as your "primitive precursor to exceptions".

    4. Re:The decision was different! by darkwing_bmf · · Score: 1

      Text files ... should every line now start with 2 bytes length indicator?

      Text files are not the same as "strings". "String length" languages can handle standard ascii text files just as well or better than C.

    5. Re:The decision was different! by darkwing_bmf · · Score: 1

      if (test_1) {
            foo;
      }
      if (test_1 || test_2) {
            bar;
      }

      Something like that?

    6. Re:The decision was different! by spitzak · · Score: 1

      No, test_1 is false after foo is done, so this will not work. After foo is done the only thing that should be done is bar, the tests test_1 and test_2 may produce garbage or even crash (though I think in most examples they will just waste time).

      I admit the exact problem is difficult to state as there are a lot of conditions and I probably did not list them all, but I have always ended up putting a goto in to fix it, and think it would be nice if some language came up with a solution.

    7. Re:The decision was different! by shutdown+-p+now · · Score: 1

      By far the most frequent use of goto in plain C code, in my experience, is to skip the rest of the function and execute the cleanup block after checking an error code returned from a function call. Basically, what would be try/finally in Java. I often wonder why they don't consider adding this as a proper language feature in C1x.

      But, yes, labelled break/continue is much more readable than goto.
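
      The cleanup pattern mentioned above usually looks something like this sketch. The count_bytes function and its buffer size are made up for illustration; the point is the single exit path.

      ```c
      #include <stdio.h>
      #include <stdlib.h>

      /* Hypothetical example: count the bytes in a file using a heap buffer.
         Returns the byte count, or -1 on any error. */
      static long count_bytes(const char *path)
      {
          long total = -1;
          size_t n;
          char *buf = NULL;
          FILE *f = fopen(path, "rb");
          if (f == NULL)
              goto cleanup;

          buf = malloc(4096);
          if (buf == NULL)
              goto cleanup;

          total = 0;
          while ((n = fread(buf, 1, 4096, f)) > 0)
              total += (long)n;
          if (ferror(f))
              total = -1;

      cleanup:                 /* every path, success or failure, ends here */
          free(buf);           /* free(NULL) is a no-op */
          if (f != NULL)
              fclose(f);
          return total;
      }
      ```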

    8. Re:The decision was different! by darkwing_bmf · · Score: 1

      int bar_flag = 0; /* set once foo has run */
      if (test_1) {
            foo; /* test_1 is now garbage */
            bar_flag = 1;
      }
      if (bar_flag || test_2) {
            bar;
      }

      Would a simple "do this thing" flag not work? It does take an extra assignment and comparison but that might be faster than reading an address from memory and jumping to that address.

    9. Re:The decision was different! by spitzak · · Score: 1

      Yes that would work. I don't like adding test variables, but this is equivalent to most other "how do I avoid a goto" solutions. I think it does indicate a type of control flow that should be supported by a language somehow.

    10. Re:The decision was different! by bussdriver · · Score: 1

      You are missing this general point in the details of his simple example.

      There are examples of where goto is beneficial that are more complex and the gains are more than just saving an additional variable test and additional variable.

      BTW, conditionals still have impact today when you blow your long instruction pipeline on a bad branch prediction... especially if you have many branches nested near each other where the branch prediction quickly becomes useless. Blowing 20 cycles on a conditional doesn't bother most people anyhow; but I'm just mentioning it so somebody doesn't think 1 more flag test case is only just 2 more ops.

      I argue that exceptions are an evolved form of goto; just as a while loop is an evolved form of goto (jump on condition). The language provides an abstraction for a goto situation that prevents some human error and limits a lot of flexibility but it is worth it for the labor saved as well as the portability; which is why we have while loops and then had exception handling; it may not have been in C but C++ forced the need for exceptions to solve the issue of nested constructor errors (I can't think of a better solution to that problem.) When I was a kid, I wanted some sort of jump dispatch table like an evolved switch statement, exceptions partially fit the bill but for complex nests of conditionals (if / else) exceptions didn't work so great.

      Logic is not a hierarchy, it is a graph (as in C.S.) and C's constructs as well as many other languages fail to represent these graphs; requiring work around solutions which can add more overhead, complexity, and confusion due to their indirect mapping of the logic. Perhaps a state machine definition syntax would facilitate this? (but editing would probably be just as or more confusing... fancy editors however could generate editable graphs from such a definition.) The state of the machine would be your position in the logic graph. The right tools could make this really powerful... we've been using ascii text-only languages for far too long... we could at least leverage some unicode features... for example: union, intersection (sets), Pi, almost equal, less than or equal, root, not (instead of !), ratio (for fractions; the division op is saved for when it is cast into another type)

    11. Re:The decision was different! by bussdriver · · Score: 1

      Because the C language construct for strings likely would have been like Pascal-- with the 1st byte indicating the extremely short length of 255 characters. This forces that size limit upon you.

      I wouldn't say C strings are complex; C doesn't really do much for strings. It does character arrays which are actually byte arrays but were called chars because in those days characters were only 1 byte in size; char was picked to be a byte. (at least as a kid I didn't have a byte data type.) I can think of the compiler only formatting strings constants into this format-- all the other stuff is standard library functions for working with strings; not the language technically but the built-in run-time library (which could have all string functions removed or replaced with pascal styled ones.)

      char should have been byte; the string terminator convention could have been a typedef of the byte data type. It only would have made it slightly easier to migrate to 2 byte chars if this was the case. char must remain 1 byte in size. To do serious strings today you must use a massive unicode library with a complex data type and that could be implemented either way.

      Anyhow the point is-- a buffer is whatever size you want; a data type which contains its own length locks you into the definition of that data type for the life of its use.

      Length has some speed advantages and some limitations. One can see how cool a tight bit of assembly which handled any size and was more reusable would be favored and seem suitable for promotion over something more involved and requiring more labor to scale.

      The one byte "mistake" would help with some problems but it would still exist in other forms. Any terminating data stream is going to have these issues and like I said, it is quite likely that developers would have used it to get around language string length constraints; not to mention all the other uses for terminators outside of strings.

    12. Re:The decision was different! by darkwing_bmf · · Score: 1

      I'm not against GOTO, I just showed that it wasn't necessary or even very useful in that particular example. All modern languages have constructs for conditional jumps but very few things *require* an unconditional jump. When in doubt it's usually best to avoid GOTO.

  51. Come on by MichaelSmith · · Score: 1

    There is nothing wrong with null terminated str^&%^&GShgayuat65a6 7gxhsvxhshxsgyuy6d5656565^&%&^&ZCVZCVZCVBAVCAF FAGAAAYSTWafgsgfsgd6565^%^^

  52. "page not present" statement is incorrect by lostdistance · · Score: 1

    In the section on performance costs, the article states that a multi-byte string copy risks crossing a VM page boundary - potentially causing a "page not present" fault - if the NUL character was the last byte of the page.

    This is simply incorrect.

    No memory transfer within an aligned multi-byte string copy could ever straddle a page boundary, even if the NUL character was the last byte of the page. And performance considerations should preclude a non-aligned copy, even assuming the hardware could handle non-aligned memory accesses.
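
    The arithmetic behind that claim can be checked mechanically. The sketch below assumes 4096-byte pages and an 8-byte word (both are assumptions; real sizes vary by platform) and uses hypothetical helper names:

    ```c
    #include <stdint.h>

    #define PAGE_BYTES 4096u   /* assumed page size */
    #define WORD_BYTES 8u      /* assumed width of one multi-byte load */

    /* True if a WORD_BYTES-wide access starting at addr touches two pages. */
    static int straddles_page(uintptr_t addr)
    {
        return addr / PAGE_BYTES != (addr + WORD_BYTES - 1) / PAGE_BYTES;
    }

    /* Count the aligned word offsets within one page whose access straddles
       the page boundary. Because WORD_BYTES divides PAGE_BYTES, none do. */
    static unsigned aligned_straddles(void)
    {
        unsigned count = 0;
        for (uintptr_t a = 0; a < PAGE_BYTES; a += WORD_BYTES)
            count += (unsigned)straddles_page(a);
        return count;
    }
    ```

    Only an unaligned access, such as one starting 4 bytes before the end of a page, can reach into the next page.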

  53. False dilemma by vlm · · Score: 1

    'The choice was really simple: Should the C language represent strings as an address + length tuple or just as the address with a magic character (NUL) marking the end?

    Why?

    It's a false dilemma. You need arbitrary length null terminated strings for streams, and if anyone gave a damn in the last 30 years someone would have grafted an address/length extension on top of C's current stuff.

    The mistake was simple binary thinking, both at design time and in the current article.

    --
    "Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
    1. Re:False dilemma by shutdown+-p+now · · Score: 1

      You need arbitrary length null terminated strings for streams

      Why? If you try to read the entire stream into memory, then you're already doing it wrong (for one thing, it may be infinite - consider /dev/random).

  54. I reject 'C's strings and substitute my own... by hamster_nz · · Score: 1

    Please ignore the null terminated string passed to printf(), but the point is clear: you can define a character array that contains a length followed by character values, without a terminating null added by the compiler. It can also contain embedded null characters if you like.

    #include <stdio.h>

    char mystring[8] = "\x07One\0Two";
    int main(int c, char *v[])
    {
        /* sizeof yields a size_t, so it is printed with %zu */
        printf("Mystring is %zu bytes, and %d characters long\n",
                sizeof(mystring), mystring[0]);
        return 0;
    }

    I think my point stands. C does not force you to use null terminated strings (but I do agree that it is easy to).

    1. Re:I reject 'C's strings and substitute my own... by Anonymous Coward · · Score: 0

      N00b - sizeof doesn't need parentheses when applied on an object. ;)
      BTW, that's pretty much a non-argument. You could also emulate strings in Python or Java using arrays of bytes. Is it a good idea? No.

    2. Re:I reject 'C's strings and substitute my own... by hamster_nz · · Score: 1

      Well shoot me if I am Old Skool.

      Pick up your copy of "The C Programming Language Edition 2" from 1988, and look at page 135 - sizeof operator can be used either with or without brackets.

      And then look at section 8.7. The malloc implementation uses "sizeof(Header)" a few times. If it is good enough for K&R then it is good enough for me.

      These pesky youngsters.... get off my lawn! :-)

  55. They could have offered both. by master_p · · Score: 1

    They could have offered both solutions: a high level slow api that used structs with lengths and a low level faster api with null terminated strings. The high level api would be used for string manipulation, and the low level api for hacking strings. The string array in the struct could contain the NUL terminating character.

    1. Re:They could have offered both. by darkwing_bmf · · Score: 1

      They could have offered both solutions. But I'd argue the "known length" string operations would be faster than the "search for null" ones.

    2. Re:They could have offered both. by shutdown+-p+now · · Score: 1

      Coincidentally, this is precisely the situation in C++ - std::string knows its length, but its buffer is also null-terminated so that you can easily pass it around as a C string.

  56. C is just a macro assembler anyway by Anonymous Coward · · Score: 0

    For anyone who has done assembly-language programming, C looks less like a "language" and more like an assembler with a whole bunch of pre-defined macros.

    Just like MS-DOS isn't an operating system but merely a collection of subroutines.

  57. How medieval! by Archtech · · Score: 1

    "...their PDP computer had limited core memory."

    Unlike modern computers, which have unlimited memory.

    The "limited memory" apologia doesn't fly any better than the choice of two-digit year fields that led to the Y2K problem. At the time, DEC saw the advantage of string descriptors and made them available on exactly the same PDP11 computers (sometimes as an option, as suggested by other comments in this thread).

    --
    I am sure that there are many other solipsists out there.
  58. C strings by Windwraith · · Score: 1

    C strings are an instrument for the devil, just saying.

  59. TFA: a rebuttal by sick_soul · · Score: 1

    Did Ken, Dennis, and Brian choose wrong with NUL-terminated text strings?

    The author of TFA does not like that choice for the C system programming language.
    But tries to demonstrate that it is an objectively "wrong" choice with weak and plain false arguments.

    The hardware development cost argument is weak. The fastest CPUs around have a very rich instruction set. It maybe hacky and ugly but where is the evidence of a noticeable burden on CPU performance or cost due to the additional instructions to handle 0 in the input? And they are pretty handy instructions anyway.

    The compiler argument is uninformed about how compilers work, and are permitted to work by the C standard. There is simply nothing there.

    The gets(3) argument does not have anything to do with the NUL-terminated strings.
    It has to do with the fact that gets is for most uses a broken API that should seldom if ever be used.

    The FreeBSD libc bcopy/memcpy argument is plain false.
    If the program is correct C, there is no "unwarranted page not present" fault.
    One can make mistakes by mixing memcpy and C Strings if one is not careful, but it's exactly the same with any attempt to read past the end of an array.
    If the confused author is trying to copy C Strings around, he can use the C String functions.

    TFA speaks more about the ignorance of the author than the value of the NUL-terminated string choice for the C programming language, which can still be debated, but not on these grounds.

    1. Re:TFA: a rebuttal by shugah · · Score: 1

      Judging by the 10,000 words before he got to Null terminated strings, the author of TFA doesn't seem to have too many problems with wasted bytes.

      --
      If you aren't part of the solution, then there is good money to be made prolonging the problem
    2. Re:TFA: a rebuttal by darkwing_bmf · · Score: 1

      The compiler argument is uninformed about how compilers work, and are permitted to work by the C standard.

      When discussing the rationalizations for whether or not something should have been in the C standard, referencing the C standard for justification is circular logic.

  60. N/M - already covered above by JSBiff · · Score: 1

    I see now that perpenso pointed out the same solution a few replies above this one. Disregard.

  61. Joe's Own Editor by Anonymous Coward · · Score: 0

    It might be of note that JOE (an old standby editor shipped with Linux and BSD distributions) used both systems at once. When JOE allocated a new string, it allocated room for the string plus an additional int prepending it. It would then return a pointer to the data past the int, the beginning of the character data. You could use that pointer just like any other char * with usual C functions, but if you wanted to find the length of the string in constant time (as opposed to linear time, which strlen operates in), you just needed to do ((int *)p)[-1]. It worked very well, though obviously it limits what sort of pointer arithmetic you can do with strings since you wouldn't want to ever end up with a pointer to the middle of a string and assume the length is still preceding it.
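
    A sketch of the hidden-length layout described above. The lstr_* names are hypothetical and this is not JOE's actual source; it only illustrates storing an int ahead of the characters and handing out a pointer past it:

    ```c
    #include <stdlib.h>
    #include <string.h>

    /* Allocate a copy of src with its length stored in an int just before the
       character data. Returns a pointer to the characters, or NULL on failure. */
    static char *lstr_new(const char *src)
    {
        size_t n = strlen(src);
        int *block = malloc(sizeof(int) + n + 1);
        char *p;
        if (block == NULL)
            return NULL;
        block[0] = (int)n;            /* hidden header */
        p = (char *)(block + 1);      /* what the caller sees */
        memcpy(p, src, n + 1);        /* keep the NUL so C functions still work */
        return p;
    }

    /* Constant-time length: read the int that precedes the characters. */
    static int lstr_len(const char *p)
    {
        return ((const int *)p)[-1];
    }

    /* The pointer handed out is not the one malloc returned, so step back first. */
    static void lstr_free(char *p)
    {
        if (p != NULL)
            free((int *)p - 1);
    }
    ```

    As the comment notes, lstr_len is only valid on the pointer returned by lstr_new, never on a pointer into the middle of the string.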

  62. The point of the poem is that by Anonymous Coward · · Score: 0

    --we have to make choices between options even when we have been unable to tell the difference between them.
    --even though we can't tell the difference and so can't anticipate the consequences in advance where the choice takes us DOES, often, make all the difference.
    --We often rationalize the choice and convince ourselves that there IS a basis for the choice aside from an effective flip of the coin.
    --As to the sigh -- one wonders - What might have been?

    ====
    Robert Frost does not apply here. The immediate difference was the saving of one Byte - or not. That difference was real - not imaginary. And, it mattered to them.

    One can never anticipate all of the consequences of any choice. The time it takes me to stop and say hello will affect who I run into around the corner. A future spouse - or a bullet.

  63. I question his premise. by Bill_the_Engineer · · Score: 1

    I find his hypothesis a little weak:

    As far as I can determine from my research, however, the address + length format was preferred by the majority of programming languages at the time, whereas the address + magic_marker format was used mostly in assembly programs. As the C language was a development from assembly to a portable high-level language, I have a hard time believing that Ken, Dennis, and Brian gave it no thought at all.

    I spent the 80's doing assembly language programming and I used both NUL terminated strings and length based strings. It just depended on the situation. Sure I could do a quick test of the accumulator using BEQ which looks at the zero flag and if set exit the loop, or I could load a counter register with the length and do a decrement during each iteration and BEQ test on the counter instead. Pardon my foggy memory since I haven't used assembly language exclusively in twenty eight years (1983!).

    This is a very speculative paper that is trying to place the blame for poor programming practices on a programming language that gives the programmer plenty of rope to hang themselves with.

    --
    These comments are my own and do not necessarily reflect the views or opinions of my employer or colleagues...
  64. It's easier to blame someone else by jsprenkle · · Score: 1

    The problem with C strings is the same problem everyone has with C and assembler. It requires you to be absolutely competent. If you're not, it does nothing to catch your mistakes. Blaming the current problems on other people's "poor choices" is just rubbish.

    The vulnerabilities to specially crafted attacks aren't mistakes. They were design choices that were correct given the knowledge the designers had at the time. Times have changed and nobody wants to pay to redo the code. I can just as easily craft a stack overflow using length type strings.

    The author is short-sighted or is deliberately making up something controversial to gain attention. In either case, Slashdot, will you please flag him to be ignored?

    --
    - I've got bad karma because I won't parrot everyone else's opinion
    1. Re:It's easier to blame someone else by darkwing_bmf · · Score: 1

      The problem with C strings is the same problem everyone has with C and assembler. It requires you to be absolutely competent. ...
      The vulnerabilities to specially crafted attacks aren't mistakes. They were design choices that were correct given the knowledge the designers had at the time. Times have changed and nobody wants to pay to redo the code

      According to what you wrote, the problem with C strings is not that the original programmer has to be absolutely competent, it's that that burden is placed on all future generations who use the code as well.

  65. You are correct sir. by bobs666 · · Score: 1

    Poul-Henning Kamp is not a computer engineer.

    'C' was primarily written to write the UNIX operating system. The fact that it was great for so many other things was a plus. If you're writing an OS, you want the control of a near-assembly language like 'C'. For the rest of us, in the 21st century, there are great scripting languages where strings are managed for us.

    JAPH.

  66. Re:The Road Not Taken by other Slashdot FPers by tehcyder · · Score: 1

    Brilliant, five internets for you!

    --
    To have a right to do a thing is not at all the same as to be right in doing it
  67. Correction by Dan+East · · Score: 1

    Furthermore, if they chose a one-byte length, as the article so casually suggests as the correct solution (like Pascal), it would have had the insane limit of 255-byte strings, with no compatible way to have a string any longer.

    A pretty significant correction to your post. The article says "Using an address + length format would cost one more byte of overhead than an address + magic_marker format". If it costs one byte more, and the magic-marker is no longer used, then that means there are TWO bytes available for the length, which would allow strings of up to 65535 characters.

    Then one could reserve the length value of 0xffff to indicate a 32 bit length value, allowing strings of 2^32 length.

    If it was a single-byte length then it would have exactly the same storage requirements as the NULL terminated method. So the design choice was A) limiting strings to 255 bytes, B) using a NULL terminated string, or C) using an extra byte for strings of up to 65535 characters. The article says it was a choice between B and C, and they chose B. Option A, which is what you refute, wasn't even an option at all, which is why it wasn't discussed.

    --
    Better known as 318230.
  68. It's a friggen character pointer, not a string by mark-t · · Score: 1

    It's nothing more than convention to utilize a nul terminated character stream as a string, but that's not actually part of the language.

    Further, there's nothing prohibiting a person from implementing their own, higher-level string notion that *does* utilize the address-length paradigm for representing strings. Meanwhile, if they had direct support for that in the language, they would have either had to drop their current notion of character pointers completely, or else added what is fundamentally an entirely new type to the language. The former solution would have been undesirable because it sacrifices generality. Further, neither solution would really fit the C paradigm of the data types corresponding to the most widely used native machine language types.
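
    As a rough illustration of the "roll your own" point above, here is a sketch of an address + length string type layered on plain character pointers. The type and function names (str_view, sv_from_cstr, sv_sub, sv_equal) are hypothetical, not from any standard library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical address + length string: no terminator required. */
typedef struct {
    const char *data;
    size_t      len;
} str_view;

/* Wrap an ordinary NUL-terminated string; the length is measured once. */
str_view sv_from_cstr(const char *s)
{
    str_view v = { s, strlen(s) };
    return v;
}

/* Substring without copying; start and count are clamped to the source. */
str_view sv_sub(str_view v, size_t start, size_t count)
{
    if (start > v.len)
        start = v.len;
    if (count > v.len - start)
        count = v.len - start;
    str_view out = { v.data + start, count };
    return out;
}

/* Equality compares lengths first, then the bytes. */
int sv_equal(str_view a, str_view b)
{
    return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

    Note that a str_view taken from the middle of a buffer is generally not NUL-terminated, so it cannot be handed to the standard str* functions without copying.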

  69. Delphi has smart string solution by kbg · · Score: 1

    The best solution for strings I have seen is the implementation of "Long String" for Delphi. It has the best of both worlds. It has reference counting with copy on write, it has a 32-bit length field, and it is also null terminated. Since it also has a null terminator it is very easy to communicate with C code and the Windows API: all you do is pass the pointer directly.

  70. NUL-terminated strings predate C and UNIX by years by Anonymous Coward · · Score: 0

    You know, kids, there was such a thing as a computer, and programming languages, prior to the advent of C and UNIX.

    The NUL-terminated string convention was a very common one. All of DEC's operating systems used it, including on the PDP-10.

    It was considered to be a big improvement over FORTRAN Hollerith strings, where the programmer had to count the characters in the string, e.g.,

    7HFOO BAR

    was the FORTRAN way of saying

    "FOO BAR"

  71. C was not designed for today's legions of hacks by Anonymous Coward · · Score: 0

    This article is way off base.

    C was primarily a portable assembler. Complaining about string handling in C makes almost as much sense as complaining about string handling in assembler. C provided a convenient way to port system-level code (e.g., compilers, operating systems, etc.). As such, it was not intended that C would be used by today's legions of hacks. Instead of blaming K&R, why doesn't the author blame the universities for teaching C as if it were a high-level language?

    If there is blame to be laid for the null terminated string, it's not with the language but with the string handling libraries. If the string handling libraries had defined a string as a struct that carried a length, then few people would have used the null-terminated construct.

  72. Re:The Road Not Taken by other Slashdot FPers by networkBoy · · Score: 1

    I'll contribute a gold star as well.

    --
    whois gawk date unzip strip find touch finger mount join nice man top fsck grep eject more yes exit umount sleep dump
  73. All of you poetry nazis are being modded off-topic by sl4shd0rk · · Score: 1

    -1 :: Off-topic Poetry Nazi

    --
    Join the Slashcott! Feb 10 thru Feb 17!
  74. There's also the rationalization by Quila · · Score: 1

    He didn't pick the road less traveled. As you note, he earlier said both appeared to be equally worn. Picking the road less traveled is a rationalization for the decision made years later, looking back on the event.

    It is funny, the most famous line is basically a lie.

    So with the literalist view you may now take that road that you think is less traveled in order to be nonconformist, but in your later years with the ironic view you'll look back on that decision and maybe rationalize your choice in an entirely different way.

  75. Encoding length by ekc · · Score: 1

    A C string can be as long as you want and never requires more than length+1 bytes of storage. A length+data scheme would need unlimited lengths to be as flexible with preferably a 1-byte overhead for short strings at least. I guess you could do something UTF-8-ish and add extra length bytes as needed for longer strings, but then you'd need a formula to figure out how much storage the string would require. Sounds a little messy to me.

  76. Re:Out of mod points by bab72 · · Score: 1

    This is already +5 Hilarious, but I wish I had mod points to give it anyway. LMAO

    --
    Bab72 (Not my real name)
  77. Ordinary People buffer management by epine · · Score: 1

    I actually think that computer science would benefit from more sage retrospectives on the path not taken, where one does not necessarily end the analysis with the smoking gun.

    Buffer allocations in the C language family tend to be static. No matter what you do in the privacy of your own buffer, the boundaries (front and back) are firm: whether poaching from your neighbour's apple tree, making a spectacle of indecent display, or committing an access fault triggering a core dump of yellow police tape and chalk outlines.

    Traditionally it takes more creativity to get arrested in the front yard. In the back yard, most programmers have no standard of conduct whatsoever.

    safecat (back_fence, the_usual_suspects, ...);
    wildcat (just_the_ammunition);

    It's true, you might know that the unfenced wildcat() is OK due to some prudent arithmetic three loops up, as etched into stone tablets by a guru of right thinking (at the optimal dosage point between second and third coffee) embodied in an immutable marble monument of nil maintenance.

    Usually in an API the front fence is de facto whatever src position is supplied as the current working position; where the algorithm swings both ways, both fences ought to be passed explicitly, in addition to the working position. Helper functions in a tightly-crafted C runtime library might sanely presume that this condition holds. A chosen few among us are well suited to stone work where efficiency matters.

    The major fault with the C language was failing to provide an "Ordinary People" set of buffer management routines (anything that clobbers memory) where the back fence comes first in every function signature.
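
    One way to read the "back fence comes first" idea, as a sketch only. The comment gives no real signature for safecat, so the parameter order and return convention here are assumptions: the fence is one past the last writable byte, dst is the current working position, and the function returns the new working position or NULL on truncation.

```c
#include <stddef.h>

/* Hypothetical bounded append with the back fence as the first argument.
   back_fence points one past the last writable byte of the buffer;
   dst is the current write position (where the terminator currently sits).
   The result is always NUL-terminated. Returns the new write position,
   or NULL if src did not fit (or dst was already at or past the fence). */
char *safecat(char *back_fence, char *dst, const char *src)
{
    if (dst == NULL || dst >= back_fence)
        return NULL;
    while (*src != '\0' && dst < back_fence - 1)
        *dst++ = *src++;                 /* never touches the last byte */
    *dst = '\0';                         /* last byte is reserved for this */
    return (*src == '\0') ? dst : NULL;  /* NULL signals truncation */
}
```

    Because the return value is the new working position, calls can be chained without rescanning the buffer, and a NULL result passed back in simply stays NULL.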

  78. Question about variable-length strings by Anonymous Coward · · Score: 0

    For variable-length strings that are not NULL-terminated in the string itself, do they use NULL-termination for the byte-length? If not, does this also introduce buffer overflow vulnerabilities? Since this is something that would be set and modified internally, I don't think it would be subject to the same vulnerabilities unless you could trick the code into thinking a string was a different length when it tries to set its length.

  79. address + length address + magic_marker? by stoyannn · · Score: 1

    "Using an address + length format would cost one more byte of overhead than an address + magic_marker format" Why? If the string length = magic_marker length, they would be the same, no? Depending on the charset, it could be the same or even better to use the address + length format...

  80. Simple solution by i · · Score: 1

    Have a length field that's expandable without limit: 1 byte for lengths 0 - 254; if the length is >= 255, the first length byte contains 255 as a flag and the length is in the following 2 bytes; if the length is >= 65535, these bytes contain 65535 as a flag and the length is in the following 4 bytes - etc. ad aeternum. The length field is here always minimal compared to the length of data. (For lengths < 255 it takes no more space than a null terminator solution.)
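
    A sketch of such an escaping length prefix, under some assumptions of my own: a first byte below 255 is the length itself, 255 is a flag for a 2-byte length, and a 2-byte value of 65535 is a flag for a 4-byte length. The byte order (big-endian), the stop at 32 bits, and the function names put_len and get_len are all illustrative choices, not part of the proposal.

```c
#include <stddef.h>
#include <stdint.h>

/* Write the length prefix for len into out (room for 7 bytes is enough).
   Returns the number of prefix bytes written: 1, 3 or 7. */
size_t put_len(uint8_t *out, uint32_t len)
{
    size_t n = 0;
    if (len < 0xFF) {                    /* short string: one byte */
        out[n++] = (uint8_t)len;
        return n;
    }
    out[n++] = 0xFF;                     /* flag: 2-byte length follows */
    if (len < 0xFFFF) {
        out[n++] = (uint8_t)(len >> 8);
        out[n++] = (uint8_t)len;
        return n;
    }
    out[n++] = 0xFF;                     /* flag: 4-byte length follows */
    out[n++] = 0xFF;
    out[n++] = (uint8_t)(len >> 24);
    out[n++] = (uint8_t)(len >> 16);
    out[n++] = (uint8_t)(len >> 8);
    out[n++] = (uint8_t)len;
    return n;
}

/* Read a prefix written by put_len; stores the length in *len and
   returns how many prefix bytes were consumed. */
size_t get_len(const uint8_t *in, uint32_t *len)
{
    if (in[0] != 0xFF) {
        *len = in[0];
        return 1;
    }
    uint32_t v = ((uint32_t)in[1] << 8) | in[2];
    if (v != 0xFFFF) {
        *len = v;
        return 3;
    }
    *len = ((uint32_t)in[3] << 24) | ((uint32_t)in[4] << 16)
         | ((uint32_t)in[5] << 8)  |  (uint32_t)in[6];
    return 7;
}
```

    The prefix costs 1 byte up to length 254, 3 bytes up to 65534, and 7 bytes beyond that, so short strings pay no more than a NUL terminator would.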

    --
    Mundus Vult Decipi
  81. Streaming (to file) by LongearedBat · · Score: 1

    With sized strings you need to know the length of the complete string before you begin streaming. So you'd stream the size first, followed by the content of the string. Not good if your string could be very long and memory is expensive.

    But with null terminated strings you can keep on appending almost ad infinitum, using whatever business logic you like, until you finally end it with a null.

    1. Re:Streaming (to file) by darkwing_bmf · · Score: 1

      With sized strings you need to know the length of the complete string before you begin streaming. So you'd stream the size first, followed by the content of the string. Not good if your string could be very long and memory is expensive.

      With sized strings you do know the length of the complete string. That's the whole point. And it's even better with expensive memory because you only need to allocate the given size and not some "I'm not sure how big this is going to be, I hope it doesn't try to write past the end of the buffer" size.

    2. Re:Streaming (to file) by LongearedBat · · Score: 1

      Yes of course, when you're working within memory alone.

      However, when you stream out of memory (say, to disk) then you can keep on appending short string fragments to the file until you're done. You really didn't (past tense) want to build the contents of that stream in memory, because you might run out of memory before completing (what might be) a very long string. And with streaming you usually couldn't jump back X number of bytes (to rewrite the length of a string). With files, you had to start writing from beginning to end, one way, no jumps.

      That issue alone could have been enough to tip the scales in favour of null terminated strings, as not being able to write really long strings to file may have been seen as a serious design flaw. Though, in retrospect that might not be so obvious.

      Not saying that it's better to use null terminated strings. Personally I always preferred sized strings, but I'm saying that streaming may be one of the factors that led them to choose null termination over pre-sizing.

    3. Re:Streaming (to file) by darkwing_bmf · · Score: 1

      Are you making this up as you go? There's nothing about knowing the length of the string that makes writing to files harder. In fact it makes it easier because you can write out chunks at a time instead of one byte at a time.

    4. Re:Streaming (to file) by Anonymous Coward · · Score: 0

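      #include <ostream>

      void writeToFile(std::ostream &fle, int count) {
          // fle << <length of string>;  // needed first for sized strings, but not known yet
          for (int i = 1; i <= count; i++) {
              fle << i;       // the number as text
              fle << ", ";
          }
          fle << '\0';        // NUL terminator ends the string
      }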

      No problems with null termination. Just keep streaming out.

      But with sized strings, you ought to first write out the length of the finished string. What's the length of the string that should be written out?

    5. Re:Streaming (to file) by darkwing_bmf · · Score: 1

      You don't write out the length of the string to the file just because you're using a different programming language. When implementing a design, your goal should be to make the output the same regardless of language. And if you think fixed length string languages need to "read in" a separate length then you need to learn some programming languages that don't have C in their name.

  82. One BYTE or one COUNTER by Theovon · · Score: 1

    The summary misrepresents the value of using the null terminator.

    With a one-byte length, strings are limited to 255 characters. Is that good? Would you never want to have a longer string? If you want to have a 2-byte counter, do you now have to create a whole new long-string data type and overload every library function? It doesn't seem scalable.

    On the other hand, the null terminator is a single byte no matter what and can support strings of arbitrary length.

    Of course, there are disadvantages. For instance, computing length and concatenating strings take longer.
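
    To make the cost difference concrete, here is a small sketch of an append that tracks the length alongside a NUL-terminated buffer, so neither the length query nor the append has to scan for the terminator the way strlen and strcat do. The counted_buf type and cb_append function are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical buffer that remembers its own length.
   Invariant: len < cap and buf[len] == '\0'. */
typedef struct {
    char  *buf;
    size_t len;   /* characters currently stored, excluding the NUL */
    size_t cap;   /* total bytes available in buf */
} counted_buf;

/* Append n bytes of src. Constant-time positioning: the write starts at
   buf + len, with no scan. Returns 0 on success, -1 if it will not fit. */
int cb_append(counted_buf *b, const char *src, size_t n)
{
    if (n >= b->cap - b->len)            /* must leave one byte for the NUL */
        return -1;
    memcpy(b->buf + b->len, src, n);
    b->len += n;
    b->buf[b->len] = '\0';               /* stay usable as a C string */
    return 0;
}
```

    With plain strcat, each append first walks the whole destination to find its end, so building a long string from many small pieces does work proportional to the square of the final length.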

    But don't act like using a byte-sized length field is fundamentally superior.

  83. Not C's Fault by StormyMonday · · Score: 1

    The problem is not with the C language. NULL terminated strings are just fine for printing status messages and suchlike, which is all they were intended for. The problem is using C to write text-bashing programs. In C, you have to spend a lot of time and effort checking string lengths, allocating and deallocating buffers, worrying about character sets and funny characters ("magic cookies", anyone?), dealing with byte order, and all sorts of other cruft that should be handled by the compiler.

    IMHO, the first really useful language that was designed for text bashing was PERL, or perhaps Microsoft BASIC (I've used SED and AWK. Bleagh. I've not used SNOBOL so I can't say anything about it.)

    --
    Welcome to the Turing Tarpit, where everything is possible but nothing interesting is easy.
  84. Re:Robert Frost pre-mocking hipsters in 1920 by DocSavage64109 · · Score: 1

    Where oh where are my mod points. I agree that Robert Frost was mocking hipsters, and you've summed it up quite well.

  85. Frost comments on his own poem by doug141 · · Score: 2

    http://poetrypages.lemon8.nl/life/roadnottaken/roadnottaken.htm Robert Frost on his own poetry: "One stanza of 'The Road Not Taken' was written while I was sitting on a sofa in the middle of England: Was found three or four years later, and I couldn't bear not to finish it. I wasn't thinking about myself there, but about a friend who had gone off to war, a person who, whichever road he went, would be sorry he didn't go the other. He was hard on himself that way."

  86. Java strings by Estanislao+Martínez · · Score: 1

    One performance advantage of NULL-terminated strings is you can trivially maintain two independent representations of the same string, one of which has a static prefix.

    char *str2 = str1 + prefix_length;

    In Java, strings are represented as an object that has a char[], a start index in the array, and a length. This representation gives you all of the advantages of the string-with-length-baked-in design and the representation sharing that you describe.

    Any non-reallocating modification of one string instantly affects the other.

    Which is a form of aliasing, one of the major sources of software bugs. Don't do that!
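
    A small C sketch of the suffix sharing quoted above and of the aliasing it creates. The helper name suffix_of is invented for illustration; the clamping to the string's length is an added safety assumption, not part of the quoted one-liner.

```c
#include <stddef.h>
#include <string.h>

/* Return a pointer to the part of s after its first prefix_length
   characters. No copy is made: the result shares storage with s.
   The offset is clamped so the result never points past the NUL. */
char *suffix_of(char *s, size_t prefix_length)
{
    size_t len = strlen(s);
    return s + (prefix_length < len ? prefix_length : len);
}
```

    For example, with char str1[] = "prefix:value", suffix_of(str1, 7) points at "value" inside str1, and writing str1[7] = 'V' makes the suffix read "Value" as well. That shared storage is both the performance win and the aliasing hazard.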

  87. This is why there are breadth reqs... by snowwrestler · · Score: 2

    The narrator as "vain, shallow individual" is entirely a character pulled out of your hindquarters, as there is nothing in the text of the poem to lead to that conclusion.

    Ahem.

    The ironic interpretation, widely held by critics,[2][3] is that the poem is instead about making personal choices and rationalizing our decisions, whether with pride or with regret.

    Source: http://en.wikipedia.org/wiki/The_Road_Not_Taken_(poem)

    I'm tempted to bookmark this response as a great example of why engineers should not fear breadth requirements. (I'm assuming anyone with such a low Slashdot ID works in engineering...)

    The ironic interpretation is widely held because it's supported not only by the text, but also Frost's own statements, and the broader context of his work--in which seemingly simple descriptive verse hides darker, more complex themes. (A major reason why he is held in such high regard.) This particular poem is a common subject for lessons on critical analysis of literature. The key starting point is that first-person narrators are not necessarily reliable.

    --
    Build a man a fire, he's warm for one night. Set him on fire, and he's warm for the rest of his life.
  88. Most Expensive One-Byte Mistake? by Anonymous Coward · · Score: 0

    Don't you remember Y2K?

  89. stupid article by sribe · · Score: 1

    Really, what a dumbass article--surprising considering the source... Pascal had the 1-byte length at the beginning, and the 255-byte limit caused ***FAR*** more problems than the supposed issues with null termination. Hell, the old Mac & Win APIs supported formatted string-like things with 2-byte length, and that limit on string length caused plenty of issues.

  90. So write a better string lib... by niftymitch · · Score: 1

    So where is it cast in stone that an author cannot
    write a lib that uses a count + pointer for strings?

    There is no reason that the string managing data engine in an application
    cannot do it right (better/ differently) and then hand known safe strings to
    those functions not yet rewritten.

    It would be a bit of work but hey if it is important ....

    It is not necessary to start with exec() and args.
    It is not necessary to attack text files but like
    end of line converters it would be a modest task
    to convert .txt to .ett ( Enhanced TexT) or some such
    thing....

    --
    Truth is stranger than fiction, but it is because Fiction is obliged to stick to possibilities; Truth isn't. Mark Twain.
  91. Most expensive one bit mistake by Keyboarder · · Score: 1

    On a related note, PHK points out that Ben Franklin totally screwed the pooch by defining current flow from plus to minus.

  92. Using the NULL character was the right decision. by tomhudson · · Score: 1

    1. Using the NULL character allows for strings more than 255 bytes long;
    2. Using the NULL character makes it quicker to append strings (strcat) - no need to update the length byte(s);
    3. Using the NULL character saves more than 1 byte when you change architectures, and you don't have to worry about byte-padding when calculating storage;
    4. You don't have different types of strings with different maximum lengths (255 bytes, 64k bytes, etc.) and code to deal with interfacing between the types.

    IOW, the article's claim is wrong.