Slashdot Mirror


XML 1.1 Spec Hits Some Snags

oever writes "News.com reports that the new XML 1.1 specification defines a new newline character, making it incompatible with the 1.0 specification. Apparently, IBM has been pushing the new character to avoid having to modify their software, thereby invalidating everybody else's XML software."

257 comments

  1. It's only a candidate specification. by tomhudson · · Score: 5, Insightful
    This specification is being put forth as a W3C Candidate Recommendation of XML 1.1.

    If you don't like it, keep in mind that you CAN bitch about it and help change this.

    1. Re:It's only a candidate specification. by Anonymous Coward · · Score: 0

      Re: your helpful hint, man screen.

    2. Re:It's only a candidate specification. by Anonymous Coward · · Score: 0, Funny

      Main Screen Turn-On.

      We get signal.

      Someone set us up a special character.

      It's you.

      Good evening gentlemen

      W3C are belong to us.

      You are on a path to destruction

  2. MS Office by Shinsei · · Score: 1, Interesting

    I wonder if this will have any impact on MS plans for making the next generation of Office. AFAIK, they're planning to make all the applications work together through XML... Then again, it is "only" a newline character... :P

    --
    God does not play dice - Albert Einstein
  3. Sounds like IBM is trying the MS approach by as400as2 · · Score: 0, Troll

    Well Microsoft did it with Java and C, why can't IBM do it with XML? Think about it!

    1. Re:Sounds like IBM is trying the MS approach by Timesprout · · Score: 1

      They are already well down the path to this approach in the Java arena with their SWT implementation, with a lot more success than MS it must be said.

      --
      Do not try to read the dupe, thats impossible. Instead, only try to realize the truth
      What truth?
      There is no dupe
    2. Re:Sounds like IBM is trying the MS approach by Anonymous Coward · · Score: 0

      MS gets bashed for not playing nicely with others. Maybe it's time to start looking at the other major players in our industry and make sure they don't follow suit.

    3. Re:Sounds like IBM is trying the MS approach by Luke-Jr · · Score: 1

      There's a difference here though. Microsoft did not create C or Java yet they changed it anyway (actually, AFAIC they just forked new languages). W3C, however, made XML so should be able to change it if they wish. Backward compatibility WOULD be nice though... I guess one could possibly use the XML version and work with both versions if they decide to break the compatibility.

      --
      Luke-Jr
  4. What? No newline? by Distinguished+Hero · · Score: 3, Funny

    That explains why my website was one long paragraph.

    --
    Uttering logically derived and empirically supported truths to the disciples of the orthodox establishment.
    1. Re:What? No newline? by elodan · · Score: 1
      It would, but then newlines don't show up in HTML.
      You need to use <br /> ...remember?
      And maybe <p />

      Go on... mod me down for nitpicking... I don't care.
      Paedophilia may be bad, but pedantry is fine :-)

    2. Re:What? No newline? by dossen · · Score: 5, Informative
      Where did you get that awful idea of using />? The proper way is to enclose the paragraph in p-tags like so:
      <p> Your text here</p>

      Using the paragraph tags as large linebreaks is a very bad habit from the bad old days of the web. Please head to W3C and study the recent standards, and validate your documents before publishing (using a validator, not a browser).
      Ohh... And this is actually an issue about XML 1.1 Unicode support, so worrying about HTML is quite premature (XHTML is still XML 1.0, and will remain so until XML 1.1 becomes a standard, or recommendation in W3C-speak).
    3. Re:What? No newline? by Craig+Davison · · Score: 0
      Before XHTML, the p tag had no closing tag. The end of a paragraph was defined by the beginning of the next one. So that wasn't necessarily a "bad habit" back then.

      </p> was still allowed, but all it did was start a new paragraph. Hence, <p>hello</p>there<p>slashdot</p>
      Would have displayed as 3 paragraphs. BTW I don't know what slashcode put that last ; in there.

    4. Re:What? No newline? by PainKilleR-CE · · Score: 4, Informative

      Before XHTML, the p tag had no closing tag. The end of a paragraph was defined by the beginning of the next one. So that wasn't necessarily a "bad habit" back then.

      This is not correct. The P tag ALWAYS had a closing tag, but it was not REQUIRED. Here's the P tag section for HTML 2.0:
      http://www.w3.org/MarkUp/html-spec/L2index.html#P


      </p> was still allowed, but all it did was start a new paragraph.


      Actually, it just ends the current paragraph. You can't use the closing tag without the beginning tag and still have well-formed HTML. Since it wasn't required, though, there are probably many different methods of handling it as far as HTML parsers/browsers are concerned. The only reason it would start a new paragraph is because it designates the end of the current paragraph, and another paragraph is what typically follows the end of a paragraph within a document.

      --
      -PainKilleR-[CE]
    5. Re:What? No newline? by dossen · · Score: 1

      Thanks for the tip, I was under the impression that <p></p> was also "The Right Way" in HTML 4. And reading the spec I get the impression that it was intended that way (end tag optional, cannot contain block-level elements). But you are absolutely right that only XHTML enforces this use of the tag.

      I do think it makes a lot more sense the way XHTML has it. But that is also to be expected, since XHTML has had the chance to learn from the mistakes of the past.

    6. Re:What? No newline? by Anonymous Coward · · Score: 0

      Or perhaps you momentarily forgot that you were writing The Autumn of the Patriarch.

    7. Re:What? No newline? by psamuels · · Score: 1
      I do think it makes a lot more sense the way XHTML has it.

      I, on the other hand, think this whole XML thing is a mistake. The whole point of computers is to automate work. The syntactical difference between SGML and XML (or between HTML and XHTML, if you'd rather frame it that way) is chiefly to make the document easier to parse at the expense of making it harder to hand-write / hand-edit.

      Sure, I often use </p> tags - as I have done since 1995 or so - but that's because I'm an A-R pedant. Requiring them is silly. It just makes busy-work. Unless you think everyone should switch from vi to Dreamweaver just to edit a stupid HTML file.

      There's a reason most people don't feel the need to add </li> tags to their list elements, or </td> to their table elements. (And these tags, unlike <p>, were always considered containers, as opposed to separators.)

      SGML parsing was a solved problem. Sure it had its annoyances but it could be done. XML may be a step forward in terms of validation without a DTD, but for well-known, slow-moving standards like HTML, what's the point?

      --
      "How can you claim that you are anti-crack, while still writing a window manager?" — Metacity README
  5. Simple Solution by Anonymous Coward · · Score: 3, Interesting

    Why don't they make new-lines overridable? Then IBM can put the override at the beginning of their files.

    1. Re:Simple Solution by Anonymous Coward · · Score: 0, Insightful

      All right Einstein, so how do you know when the line describing the overriding ends?

    2. Re:Simple Solution by brsett · · Score: 1

      Mainly because I don't think you can put newlines or whitespace in qualified names (elements, attributes, tags, directives, etc), so there is no need to make newlines overridable. There is no issue with character data and attribute values, as those are overridable.

      Mainly this is a non-issue, but for programmers using free tools, notably xerces, this is a change in the spec we probably want. You see, IBM is THE major contributor to our free XML parser, and they will almost definitely implement this change in xerces whether the standard says to or not. This makes all the other big companies break their parsers as well, so they're still compatible with us (the people using xerces). So while maybe IBM isn't a good guy in this deal, the W3C is doing a good job handling a problem that would primarily plague those of us with the least resources to handle it.

  6. Read the Unicode spec.... by vidarh · · Score: 5, Informative

    Unicode 3.2 defines 0x85 as a newline character. This change just makes XML follow the Unicode spec, which isn't unreasonable considering that the parser is expected to use Unicode internally (or to act as if it does).

    1. Re:Read the Unicode spec.... by Anonymous Coward · · Score: 0

      I hope you're wrong, 0x85 is still useful, at least for us French speaking people. This is a grave (à) for us.

    2. Re:Read the Unicode spec.... by Anonymous Coward · · Score: 5, Insightful

      Like the man says, read the Unicode specification! Unicode defines a far wider range of characters than simple 7 or 8bit ASCII text can cover, and the à is simply mapped into another Unicode byte pair. You won't lose the ability to use à in your XML documents, you just use Unicode.

    3. Re:Read the Unicode spec.... by gorilla · · Score: 5, Informative

      It's more complicated than that. Unicode has
      2029 - paragraph separator
      2028 - line separator
      000D - CR
      000A - LF
      0085 - NEL (Next Line)
      Any of these could be interpreted as the end of a logical line.
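
      A quick way to poke at that list, as a rough sketch in Python (the `CANDIDATES` table and the `describe` helper are made up for illustration; only `unicodedata.category` and `str.splitlines` come from the standard library):

```python
import unicodedata

# Code points from the list above, with the labels used there.
CANDIDATES = {
    "\u2029": "paragraph separator",
    "\u2028": "line separator",
    "\r": "CR",
    "\n": "LF",
    "\x85": "NEL",
}

def describe(ch):
    """Return 'U+XXXX <general category>' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.category(ch)}"

for ch, label in CANDIDATES.items():
    print(describe(ch), label)

# Python's str.splitlines() treats every one of these as a line boundary.
sample = "one\x85two\u2028three\u2029four\rfive\nsix"
print(sample.splitlines())
```

      Whether a given tool agrees with that behaviour is a separate question; this only shows what one runtime does with the five code points.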

    4. Re:Read the Unicode spec.... by FooBarWidget · · Score: 3, Interesting

      Doesn't this make XML files uneditable with most editors, like vi, pico and gedit? They all use \n (byte 10) as newline character.

    5. Re:Read the Unicode spec.... by vidarh · · Score: 5, Informative

      And if you read the XML 1.1 spec you'll see that all of the characters you've listed above except for 0x2029 are interpreted by XML 1.1 as the end of a logical line.

    6. Re:Read the Unicode spec.... by khuber · · Score: 3, Funny
      Yes, it's true. Unicode omits French. They couldn't find anything worthwhile to read in French so they just dropped it.

      -Kevin

    7. Re:Read the Unicode spec.... by gorilla · · Score: 2

      Yup, but the original posting suggested that 0085 was the only correct encoding in Unicode, when it's actually one of many possible correct encodings.

    8. Re:Read the Unicode spec.... by Anonymous Coward · · Score: 2, Informative

      Ah, dude, the newline is the least of your problems. 2-byte Unicode characters are not exactly backwards compatible with 1-byte (or, pedantically, 7-bit) ASCII.

    9. Re:Read the Unicode spec.... by vidarh · · Score: 2

      No, I didn't. I specifically wrote that Unicode defines 0x85 as a newline character, not the newline character.

    10. Re:Read the Unicode spec.... by julesh · · Score: 5, Informative

      No, not really. It means that *some* XML files can't be edited with these editors. But then that was true already; some might have used \r or some other of the list of characters.

      What it *does* mean is that editors on other systems than Unix are able to edit XML files. It means I can create an XML file in DOS 'edit' which uses \r\n, or on a mac with an editor that might use \r, or on (apparently) an IBM system where the standard text editors use \u85.

      This is absolutely essential. It does however mean that in order to support *all* XML files, you need to recognise *all* of those line endings. As always, it's easier to support a subset, but harder to support everything. However the fact that existing software works at all is very important, so I think they're moving in the right direction.

    11. Re:Read the Unicode spec.... by innate · · Score: 5, Informative

      There are no 2-byte Unicode characters, only encodings (such as UTF-16) which use two or more bytes to represent each character. Some Unicode characters, those not in the Basic Multilingual Plane (BMP), require more than two bytes to represent in UTF-16.

      And 7-bit ASCII is a strict subset of UTF-8 encoding. UTF-8 encodes each character to one or more bytes, with characters up to 127 defined the same as in ASCII. If your text is strict 7-bit ASCII, it is also a UTF-8 file.

      You could also use UTF-32 (UCS-4), which represents each character as 4 bytes, but that is overkill for most applications.

      The main problem with multibyte encodings such as Shift-JIS and Big 5 was lead-byte detection: you couldn't jump into the middle of a string and determine if you were looking at the only, first, or second byte of a character. You had to start parsing at the beginning of the string in order to synchronize your character detection. Unicode has done away with this by strictly defining the lead byte ranges in such a way that there is never any ambiguity.
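
      To make the ASCII-subset and lead-byte points concrete, here is a small sketch for UTF-8 specifically (the helper names `is_continuation` and `resync` are invented for this example, not taken from any library):

```python
def is_continuation(byte):
    """UTF-8 continuation bytes always have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80

def resync(data, pos):
    """Back up from an arbitrary byte offset to the start of its character."""
    while pos > 0 and is_continuation(data[pos]):
        pos -= 1
    return pos

encoded = "déjà vu".encode("utf-8")

# Strict 7-bit ASCII text produces identical bytes in ASCII and UTF-8.
print("vu".encode("ascii") == "vu".encode("utf-8"))

# Offset 2 is the second byte of the two-byte 'é'; resync finds its lead byte.
start = resync(encoded, 2)
print(start, encoded[start:].decode("utf-8"))
```

      No scan from the beginning of the string is needed: looking at the top two bits of a byte is enough to tell whether it starts a character.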

      --
      No, I don't want to explore the Recycle Bin.
    12. Re:Read the Unicode spec.... by kalidasa · · Score: 3, Informative

      Uh, dude, Unicode characters are not necessarily 2-byte. UTF-8 characters are 1 byte if ASCII (real ASCII, the "pedantic" 128-character set), 2 bytes or more if not (up to 4 bytes for any Unicode code point; the original ISO 10646 scheme allowed up to 6). UTF-16 characters are either 2 bytes or 4 bytes. UTF-32 characters are all 4 bytes. Read the spec.
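
      For anyone who wants to check the sizes rather than read the spec, a minimal sketch (the sample characters and the `encoded_lengths` helper are picked for illustration; the codec names are standard Python ones):

```python
# One character from each size class.
SAMPLES = {
    "A": "ASCII letter",
    "à": "Latin-1 range",
    "€": "inside the BMP, above U+07FF",
    "\U0001D11E": "outside the BMP (musical G clef)",
}

def encoded_lengths(ch):
    """Bytes needed for one character in UTF-8, UTF-16 and UTF-32, without a BOM."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )

for ch, label in SAMPLES.items():
    print(f"U+{ord(ch):04X}", encoded_lengths(ch), label)
```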

    13. Re:Read the Unicode spec.... by smallpaul · · Score: 2

      There are all kinds of funny characters that XML does not support. You can't use zero-width or non-breaking spaces as whitespace either. You can't use circled numerals as numbers. etc. NEL is only special because IBM is a big enough company to get attention paid to its legacy software. "Legacy software" and "unicode" seem like an odd mix to me anyhow. Couldn't NEL problems be fixed in a transcoder?

  7. So? by Anonymous Coward · · Score: 5, Informative
    1.0 : To simplify the tasks of applications, the characters passed to an application by the XML processor must be as if the XML processor normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.

    1.1 : To simplify the tasks of applications, the characters passed to an application by the XML processor must be as if the XML processor normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating all of the following to a single #xA character:

    • the two-character sequence #xD #xA
    • the two-character sequence #xD #x85
    • the single character #x85
    • the single character #x2028
    • any #xD character that is not immediately followed by #xA or #x85.


    I don't get it, what's the problem here? Surely the 1.1 spec simply extends the available EOL characters. It certainly doesn't remove any existing characters that are present in the 1.0 spec. How does it break backwards compatibility?
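
    The two rules quoted above can be sketched side by side. This is only an illustration of the quoted text, not how any real parser does it; the names `_EOL_10`, `_EOL_11` and `normalize_line_ends` are made up here:

```python
import re

# XML 1.0 rule as quoted: CR LF, and any CR not followed by LF, become LF.
_EOL_10 = re.compile("\r\n|\r")

# XML 1.1 rule as quoted: additionally CR NEL, lone NEL and U+2028.
# Two-character sequences come first so they are consumed as a unit.
_EOL_11 = re.compile("\r\n|\r\x85|\x85|\u2028|\r")

def normalize_line_ends(text, version="1.0"):
    """Translate every recognised line break to a single #xA."""
    pattern = _EOL_11 if version == "1.1" else _EOL_10
    return pattern.sub("\n", text)

sample = "a\r\nb\rc\x85d\r\x85e\u2028f"
print(repr(normalize_line_ends(sample, "1.0")))
print(repr(normalize_line_ends(sample, "1.1")))
```

    With the 1.0 rule the #x85 and #x2028 characters pass through untouched; with the 1.1 rule they come out as #xA, which is the only place the two versions differ on this input.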
    1. Re:So? by Trusty+Penfold · · Score: 2, Insightful

      If you used those characters for something in a 1.0 compatible file, they will be trashed with a 1.1 compliant processor.

    2. Re:So? by Anonymous Coward · · Score: 2, Interesting

      Fair point, but how many people do you think have actually used those characters in an XML1.0 document?

      IBM would appear to be right, too, when they note that an application should look at the version identifier which is present at the top of the XML stream.

    3. Re:So? by Trusty+Penfold · · Score: 0, Informative

      0x85 is à (a grave). So everyone in France?

    4. Re:So? by vidarh · · Score: 5, Informative

      In which character set? Certainly not in Unicode, so if anyone used 0x85 as à in XML documents using any Unicode encoding they've messed up. à (latin letter a with grave) is 0xE0, and À (lating chapital letter a with grave) is 0xC0.

    5. Re:So? by Saib0t · · Score: 1
      0x85 is à (a grave). So everyone in France?

      Not only French people but English people too, if they could spell the French expressions they use properly (like déjà vu) :-)
      --

      One shall speak only if what one has to say is more beautiful than silence
    6. Re:So? by Anonymous Coward · · Score: 1, Informative

      Thanks, I was just trying to eyeball the Unicode database files myself then. Indeed, as you say, à is not 0x85 in Unicode, and as the major point of XML 1.1 is to add Unicode support, I don't see what the problem is with using 0x85 as an EOL character for the 1.1 specification.

    7. Re:So? by vidarh · · Score: 5, Informative
      To expand on that, in ISO-Latin in general and certainly ISO-Latin-1, and thus by extension Unicode (which maps to ISO-Latin-1 for code points in the range 0x00 to 0xff), the area 0x80 to 0x9f was on purpose not used for displayable glyphs in order not to cause interoperability problems with 7bit systems if an 8bit text was moved between systems and the 8bit was stripped off.

      So unless you are using a non-Unicode, non-ISO-Latin encoding there are no printable characters in that range, and if you're using another character you will need to remap the characters before considering any of the rules in the XML spec anyway, since those rules refer to the unicode codepoints.

    8. Re:So? by Anonymous Coward · · Score: 0

      It is wrong to be French.

    9. Re:So? by Sir+Tristam · · Score: 5, Informative
      0x85 is à (a grave). So everyone in France?
      No, you're looking at the extended ASCII chart. What this is talking about is Unicode. A Unicode 0x0085 is the control character NEL (http://www.unicode.org/charts/PDF/U0080.pdf, page 3). NEL is NExt Line.
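
      The confusion is easy to reproduce: the same byte means different things depending on the character set. A small sketch, assuming the "extended ASCII chart" in question is IBM code page 437 (the `decode_0x85` helper is made up for this example):

```python
def decode_0x85(codec):
    """Decode the single byte 0x85 under the given codec."""
    return bytes([0x85]).decode(codec)

for codec in ("cp437", "cp1252", "latin-1"):
    ch = decode_0x85(codec)
    print(f"{codec:8} -> U+{ord(ch):04X} {ch!r}")

# Going the other way: in Latin-1, and as a Unicode code point, à is 0xE0.
print("à".encode("latin-1").hex())
```

      Only the latin-1 reading lines up with Unicode code points, and there byte 0x85 is the control character, not a letter.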

      Chris Beckenbach

    10. Re:So? by operagost · · Score: 1

      If my keyboard had the proper characters, I would. I get tired of doing ALT + num.

      --

      Gamingmuseum.com: Give your 3D accelerator a rest.
    11. Re:So? by Anonymous Coward · · Score: 0

      Oh thank goodness. I was afraid *real* people might be inconvenienced by this...

    12. Re:So? by p3d0 · · Score: 1

      Gotta love those lating chapital letters.

      --
      Patrick Doyle
      I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
    13. Re:So? by greenius · · Score: 4, Informative

      Hex dump of this message:

      30 78 38 35 20 69 73 20 e0 20 20 28 61 20 67 72 : 0x85 is à (a gr
      61 76 65 29 2e 20 20 53 6f 20 65 76 65 72 79 6f : ave). So everyo
      6e 65 20 69 6e 20 46 72 61 6e 63 65 3f 0a 09 09 : ne in France?...

      --
      I copied this sig from someone else (but where did they get it from?)
    14. Re:So? by Luke-Jr · · Score: 1

      Well obviously Slashdot is using Extended ASCII then. Extended ASCII is not any form of Unicode.

      --
      Luke-Jr
    15. Re:So? by greenius · · Score: 1

      No... Slashdot does not specify a character set. This means that browsers should use the default (iso-8859-1), which maps to the first sections of Unicode, in which a_grave is 0xe0 and not 0x85 as others on this thread were trying to say.

      --
      I copied this sig from someone else (but where did they get it from?)
    16. Re:So? by Anonymous Coward · · Score: 1, Informative
      So unless you are using a non-Unicode, non-ISO-Latin encoding there are no printable characters in that range, and if you're using another character you will need to remap the characters before considering any of the rules in the XML spec anyway, since those rules refer to the unicode codepoints.

      But, unfortunately, the very commonly used Windows codepage 1252 puts printable glyphs in that code point range. Rather, recent versions of CP 1252 (not all versions of CP1252 are the same) have put things like trademark signs in that character codepoint range. Keep in mind that Microsoft still dominates the computer software industry more so than IBM and their attempts to keep things open and standards body approved.

    17. Re:So? by Citizen+of+Earth · · Score: 2

      the area 0x80 to 0x9f was on purpose not used for displayable glyphs in order not to cause interoperability problems

      So what about MS-ISO-8859-1? What character did MS put at 0x85?

    18. Re:So? by PainKilleR-CE · · Score: 1

      But, unfortunately, the very commonly used Windows codepage 1252 puts printable glyphs in that code point range. Rather, recent versions of CP 1252 (not all versions of CP1252 are the same) have put things like trademark signs [dkuug.dk] in that character codepoint range. Keep in mind that Microsoft still dominates the computer software industry more so than IBM and their attempts to keep things open and standards body approved.

      Interestingly enough, the chart you linked only goes up to 007E and mostly conforms with Unicode 3.0 (even though it's Unicode 1.0). 007F is the first of the control codes, delete, in the section this applies to.

      --
      -PainKilleR-[CE]
    19. Re:So? by spitzak · · Score: 2
      The Microsoft assignments should be made official, in my opinion.

      The idea of protecting systems from the high bit being stripped is obsolete. UTF-8 (and also all 16 or 32 bit encodings) will cause exactly the same havoc with such systems because they will use bytes that when the high bit is stripped produce C0 control characters. Therefore the idea of protecting the C1 range is irrelevant as soon as you talk about any encoding with more than 256 characters.

      These slots must be filled in and Microsoft's assignments are quite a reasonable selection of punctuation marks and symbols that are in very common use. The euro symbol is particularly important today. Even if the assignments are not ideal, these assignments are used so much that it is impossible to argue for any other assignments.
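
      The high-bit argument can be checked with one character. A sketch only; `strip_high_bit` and `c0_controls` are names invented for this example:

```python
def strip_high_bit(data):
    """Simulate a 7-bit channel by clearing bit 7 of every byte."""
    return bytes(b & 0x7F for b in data)

def c0_controls(data):
    """Return the bytes that fall in the C0 control range 0x00-0x1F."""
    return [b for b in data if b < 0x20]

euro_utf8 = "€".encode("utf-8")
stripped = strip_high_bit(euro_utf8)

print(euro_utf8.hex(), "->", stripped.hex())
print(c0_controls(stripped))
```

      The UTF-8 continuation byte 0x82 turns into the control code 0x02 once the high bit is gone, so leaving 0x80 to 0x9f unassigned does nothing to protect this text on a 7-bit system.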

    20. Re:So? by Anonymous Coward · · Score: 0

      WTF is "Extended ASCII"? Did you fall out of a timewarp from 1981?

    21. Re:So? by Anonymous Coward · · Score: 0
      From the cited page:
      .9 /x82 U201A SINGLE LOW-9 QUOTATION MARK
      f2 /x83 U0192 LATIN SMALL LETTER F WITH HOOK
      :9 /x84 U201E DOUBLE LOW-9 QUOTATION MARK
      .3 /x85 U2026 HORIZONTAL ELLIPSIS
      //- /x86 U2020 DAGGER
      //= /x87 U2021 DOUBLE DAGGER
      1/ /x88 U02C6 MODIFIER LETTER CIRCUMFLEX ACCENT
      %0 /x89 U2030 PER MILLE SIGN
      S /x8A U0160 LATIN CAPITAL LETTER S WITH CARON
      1 /x8B U2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK
      OE /x8C U0152 LATIN CAPITAL LIGATURE OE
      '6 /x91 U2018 LEFT SINGLE QUOTATION MARK
      '9 /x92 U2019 RIGHT SINGLE QUOTATION MARK
      "6 /x93 U201C LEFT DOUBLE QUOTATION MARK
      "9 /x94 U201D RIGHT DOUBLE QUOTATION MARK
      sb /x95 U2022 BULLET
      -N /x96 U2013 EN DASH
      -M /x97 U2014 EM DASH
      1? /x98 U02DC SMALL TILDE
      TM /x99 U2122 TRADE MARK SIGN
      s /x9A U0161 LATIN SMALL LETTER S WITH CARON
      /1 /x9B U203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
      oe /x9C U0153 LATIN SMALL LIGATURE OE
      In case you are having a bit of difficulty comparing hexadecimal numbers: 0x82 is greater than 0x7E and is also greater than 0x7F. The parent post was pointing out the problems with codes in the 0x80 to 0x9f range, also known as the C1 control characters. Yes, Windows CP 1252 is a superset of 7 bit ASCII; as such it is compliant with ASCII when restricted to 7 bit characters, also with ISO 8859-1 but only when restricted to 7 bit characters, and also with Unicode but only when restricted to 7 bit single byte characters. CP 1252 is not the same as ISO 8859-1, nor is it the same as any of the recognized Unicode representations. Unfortunately it is very widespread and has useful characters like Euros in the C1 range.
  8. version naming by tomzyk · · Score: 3, Insightful

    Typically, don't version-naming schemes imply something along the lines of: versions x.0, x.1, x.2, etc... are all compatible. And if the next version is NOT compatible, then it should be labeled as "(x+1).0"?

    I guess there's no law stating that this must always be the case, but if these two specifications are NOT compatible, then it would make sense that they would name the new one XML2.0 no?

    --
    Karma: NaN
    1. Re:version naming by jaredcoleman · · Score: 1

      Or you can just give it a fancy name like Palladium or something...

    2. Re:version naming by Fweeky · · Score: 4, Interesting
      if these two specifications are NOT compatible, then it would make sense that they would name the new one XML2.0 no?

      Not really. The change isn't exactly huge; it makes XML a bit more consistent with regard to UTF, but I don't see it breaking anything other than for those who both:
      • Failed to specify a prologue (and hence charset, meaning they accepted the default utf-8), and;
      • Actually used #x85 or #x2028 to encode anything useful other than newline.

      TBH if you were that lax in specifying your XML version and character set, and then made use of non-printable characters that actually had known uses in the default charset, you deserve everything you get.
  9. End of Line by notestein · · Score: 3, Funny

    If we don't allow the IBM EOL in XML 1.1 ...
    How will we ever communicate with Master Control Program?

    End Of Line

    1. Re:End of Line by Maax · · Score: 1

      Sark: What kind of character is it?
      MCP: It's not any kind of character, Sark. It's a NEL.
      Sark: A NEL?!?
      MCP: What's the matter, Sark? You look nervous.

    2. Re:End of Line by Luke-Jr · · Score: 1

      LOL... But NEL is a character too. Not ASCII, but it's still a character. :)

      --
      Luke-Jr
    3. Re:End of Line by AJWM · · Score: 2

      Not to worry. Master Control Program (MCP) is a Burroughs operating system, not IBM.

      --
      -- Alastair
  10. One tiny little update ??? by Dave21212 · · Score: 5, Interesting

    Considering what some other vendors have done to standards, one tiny addition (which is an improvement) proposed by IBM shouldn't be a big deal. Sure, it feeds the news hounds, but seriously, compare the scale of the impact of one desirable change to all the suffering caused by other such changes in emerging standards (Microsoft's in particular).

    IBM has contributed so much, it's only natural that some changes might be characterized in the news as benefitting them more than other parties. Is anyone that worried about adding a new EOL character in 1.1 that XML 1.0 "chokes" on ?

    --
    "Whoever would overthrow the liberty of a nation must begin by subduing the freeness of speech."--Benjamin Franklin
    1. Re:One tiny little update ??? by PainKilleR-CE · · Score: 5, Insightful

      IBM has contributed so much, it's only natural that some changes might be characterized in the news as benefitting them more than other parties. Is anyone that worried about adding a new EOL character in 1.1 that XML 1.0 "chokes" on ?

      and, as an IBM rep pointed out in the article, XML documents are supposed to specify what version they're using at the top of the document. Any proper XML parser should read that it's 1.0 and interpret the newline character as 1.0 would.
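
      Roughly what that looks like, as a sketch rather than anything lifted from a real parser (the regex and the three function names are assumptions made up for this example, and real declaration parsing has more cases than this):

```python
import re

# Matches the version in a leading XML declaration, e.g. <?xml version="1.1"?>
_XML_DECL = re.compile(r"""^<\?xml\s+version\s*=\s*(["'])(\d+\.\d+)\1""")

def declared_version(document):
    """Return the declared XML version; a missing declaration is treated as 1.0."""
    match = _XML_DECL.match(document)
    return match.group(2) if match else "1.0"

def line_end_sequences(document):
    """Pick the line-end sequences to normalize, based on the declaration."""
    if declared_version(document) == "1.1":
        return ("\r\n", "\r\x85", "\x85", "\u2028", "\r")
    return ("\r\n", "\r")

def check_supported(document, supported=("1.0",)):
    """A hypothetical 1.0-only parser refuses versions it does not know."""
    version = declared_version(document)
    if version not in supported:
        raise ValueError(f"unsupported XML version {version}")

doc_10 = '<?xml version="1.0"?>\n<a/>'
doc_11 = '<?xml version="1.1"?>\n<a/>'
print(declared_version(doc_10), line_end_sequences(doc_10))
print(declared_version(doc_11), line_end_sequences(doc_11))
```

      A parser that understands both versions uses the declaration to choose the rule set; one that only knows 1.0 can still read the declaration and bail out cleanly instead of misreading the document.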

      --
      -PainKilleR-[CE]
    2. Re:One tiny little update ??? by smallpaul · · Score: 3, Insightful

      Considering what some other vendors have done to standards, one tiny addition (which is an improvement) proposed by IBM shouldn't be a big deal.

      Two wrongs make a right?

      IBM has contributed so much, it's only natural that some changes might be characterized in the news as benefitting them more than other parties.

      I don't know what that means. This change was requested by IBM and only IBM. As far as I know, no IBM customers have even stood up and asked for it publicly (I could be wrong).

      Is anyone that worried about adding a new EOL character in 1.1 that XML 1.0 "chokes" on ?

      Obviously some people are. Let's keep in mind that there are millions of XML parsers out there and they work together in large part because there is only one version of XML. Now there are two and it will take years to roll out the new parsers universally.

    3. Re:One tiny little update ??? by smallpaul · · Score: 3, Interesting

      And what will an XML 1.0 parser ("millions served") do with an XML 1.1 document? When your IBM mainframe serves up 1.1 data with NEL to my Windows 98 with IE 5.5, IE will complain that the document is not well-formed. This means that there is a period of time where the XML world is split. It will be a LONG time before these mainframe users will be able to use NEL and confidently send the data to anyone else. It might have been cheaper to just fix the software.

    4. Re:One tiny little update ??? by PainKilleR-CE · · Score: 5, Informative

      The problem is that the data with the NEL character already exists, and is already generating these types of errors when it's translated into XML (or when XML is generated on these mainframes). From the original change proposal by IBM:

      Problem areas include:

      * Processing XML documents or DTDs generated on OS/390 systems, with XML 1.0 compliant parsers.
      * Processing XML documents or DTDs, using native OS/390 system tools.
      * Processing XML documents or DTDs retrieved from OS/390 database or file systems, in non-OS/390 environments.

      XML documents that contain [NEL] characters are declared invalid or not well-formed by XML 1.0 compliant parsers.


      Essentially 'just fix the software' involves operating system-level changes as well as possibly changes to most software that interprets NEL characters on that OS. As it stands, they're going to have a problem anyway, and it's probably best to simply add the change to the XML standard to fix what was essentially an omission in the 1.0 standard.

      --
      -PainKilleR-[CE]
    5. Re:One tiny little update ??? by p3d0 · · Score: 2

      Where did you get the idea that the software is broken?

      --
      Patrick Doyle
      I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
    6. Re:One tiny little update ??? by p3d0 · · Score: 3, Informative
      Now there are two and it will take years to roll out the new parsers universally.
      You can't blame this on the NEL issue. XML 1.1 will arrive sooner or later regardless of how the NEL issue ends up.
      --
      Patrick Doyle
      I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
    7. Re:One tiny little update ??? by PainKilleR-CE · · Score: 1

      Obviously some people are [xml.com]. Let's keep in mind that there are millions of XML parsers out there and they work together in large part because there is only one version of XML. Now there are two and it will take years to roll out the new parsers universally.

      Besides the fact that eventually someone's going to pitch a bitch over the fact that XML is not Unicode 3.0-compliant if they don't adopt this change, XML itself would've been updated eventually with something that required new parsers to be written (or updates to existing parsers).

      This one change is a very minor update for most parsers and could be done in a matter of minutes for some parsers. Full XML 1.1 compliance for existing 1.0 parsers could take anywhere from a few days from the final specification to a couple of months depending on the complexity and the developer. Any parser that isn't updated by the end of next year probably never will be updated, and shouldn't remain in use.

      If multiple versions of XML-type language specifications were a problem, SGML and HTML would've died out a long time ago.

      --
      -PainKilleR-[CE]
    8. Re:One tiny little update ??? by poot_rootbeer · · Score: 3, Insightful

      And what will an XML 1.0 parser ("millions served") do with an XML 1.1 document?

      It should reject it as unsupported, and you should upgrade your parser to one that supports the 1.1 standard.

      I think everyone agrees that the XML standards should be backwards-compatible, but you seem to be asserting the idea that it should be FORWARD-compatible and that a parser written today must correctly handle all future revisions that might ever be made, which is ludicrous.

    9. Re:One tiny little update ??? by smallpaul · · Score: 2

      I think everyone agrees that the XML standards should be backwards-compatible, but you seem to be asserting the idea that it should be FORWARD-compatible and that a parser written today must correctly handle all future revisions that might ever be made, which is ludicrous.

      No, I'm not saying that parsers should be forwards-compatible. I'm saying that there is very little call by any actual users for a new version of XML with these features. I'll defer to Rusty for details.

    10. Re:One tiny little update ??? by smallpaul · · Score: 2

      Essentially 'just fix the software' involves operating system-level changes as well as possibly changes to most software that interprets NEL characters on that OS.

      That is certainly not true. Producing XML is always a process undertaken by either a human being (creating it by hand) or a computer (generating it). In the former case, one can change NELs to \n's with the mainframe equivalent of sed. You could build this into the text editor or make it a one-line postprocess. In the latter case, you ALREADY have to do all kinds of character escaping and transcoding to get your data into XML in the first place. Do it there. You certainly don't have to change the OS kernel.
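      For the generated-XML case, the fix can live in the transcoding step that already exists. A minimal sketch in Python, assuming the mainframe data is in EBCDIC code page 037 (the function name and the choice of code page are illustrative assumptions, not anything from the article):

```python
NEL = "\u0085"  # Unicode NEXT LINE, the character at issue


def ebcdic_to_xml_text(raw: bytes, codepage: str = "cp037") -> str:
    """Decode EBCDIC bytes and turn every NEL into a plain line feed."""
    text = raw.decode(codepage)
    return text.replace(NEL, "\n")


# EBCDIC bytes for "<a/>" followed by the EBCDIC newline byte 0x15.
sample = "<a/>".encode("cp037") + b"\x15"
converted = ebcdic_to_xml_text(sample)
utf8_bytes = converted.encode("utf-8")  # ready to hand to an XML 1.0 parser
```

      Python's cp037 codec decodes byte 0x15 to U+0085, so a single replace covers it here. Other EBCDIC code pages may map the newline byte differently.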

    11. Re:One tiny little update ??? by smallpaul · · Score: 2

      Where did you get the idea that the software is broken?

      The software is broken because it purports to generate XML but does not conform to the XML standard. Or perhaps it is a human being generating the XML. In which case, they should run the equivalent of "dos2unix": "mainframe2XML" and they'll have XML they can share with everyone else.

    12. Re:One tiny little update ??? by PainKilleR-CE · · Score: 1

      You're right, I wasn't really thinking about it correctly. It just begs the question of how many applications utilize this particular NEL character on OS/390. Of course, compliance with Unicode specifications should really be something that the XML specification strives for, either way.

      --
      -PainKilleR-[CE]
    13. Re:One tiny little update ??? by mikemulvaney · · Score: 2
      I think everyone agrees that the XML standards should be backwards-compatible

      Except IBM, and the people writing the standards, apparently.

      -Mike

    14. Re:One tiny little update ??? by smallpaul · · Score: 2

      Just FYI, I don't really care much about this issue. XML 1.1 will cause someone out there pain but probably not me. I just wanted to see both sides of the argument represented.

    15. Re:One tiny little update ??? by xigxag · · Score: 2

      And what will an XML 1.0 parser ("millions served") do with an XML 1.1 document?

      Really, that question isn't as insightful as you were thinking it was. Does a 2.0 browser know what to do with modern HTML4/CSS/XHTML? Does your BIOS from 1999 consider an 80GB partition on a 200GB HD to be "well-formed"? Point is that improvements are inevitable and they nearly always break forwards-compatibility to a greater or lesser extent. When XML 1.1 final appears, people will be directed to upgrade their browsers. End of story.

      --
      There are two kinds of people: 1) those who start arrays with one and 1) those who start them with zero.
    16. Re:One tiny little update ??? by smallpaul · · Score: 2

      Can you make the case that XML 1.1 is a sufficient "improvement" to require millions of people to upgrade their browsers, and other XML-consuming applications? Do you really think anyone cares about NEL or Mongolian tag names that much?

    17. Re:One tiny little update ??? by xigxag · · Score: 2

      There is no case to be made because there's no requirement that millions upgrade their browsers. The beauty of the various *ML standards is that you don't have to use 'em if they aren't important to you. For example, even though XHTML has superseded HTML4 as a standard, it hasn't obsoleted it. In fact, the XML 1.1 Candidate Recommendation page itself is still written in HTML 4.01, (although the W3C's homepage has been revised to XHTML.) And Slashdot gets by with still using HTML 3.2. Those people (e.g. Mongolians) who need to use XML 1.1 or to access it will find themselves upgrading sooner, the rest of us will eventually receive an upgrade along with the latest Security Update Of The Week.

      --
      There are two kinds of people: 1) those who start arrays with one and 1) those who start them with zero.
    18. Re:One tiny little update ??? by rodgerd · · Score: 2

      Actually, a well-formed HTML 4.0 page will be parsable and can be rendered by old browsers; XHTML possibly even more so (except for those stupid
      tags). HTML isn't too bad on that front.

  11. XML rant. by Anonymous Coward · · Score: 2, Funny

    <rant about="IBM" why="'FOR BREAKING COMPATIBILLITY">IBM SUCKS</rant>

    1. Re:XML rant. by russellh · · Score: 5, Funny

      parse error, line 1: no trailing '

      --
      must... stay... awake...
    2. Re:XML rant. by Anonymous Coward · · Score: 0

      What? No. It's within the "", so it's not part of the syntax parsing. :)

      -Alex

    3. Re:XML rant. by psamuels · · Score: 1
      parse error, line 1: no trailing '

      <reply content="So in other words your parser can't handle this sentence?" />

      --
      "How can you claim that you are anti-crack, while still writing a window manager?" — Metacity README
  12. What about poor old Acorn users? by PhilHibbs · · Score: 3, Informative

    I don't know if it's still the same in RiscOS, but the BBC Micro used 0x0A+0x0D as its end-of-line marker. Why doesn't XML support this? If 1.1 is going to modify the end-of-line specification, then this is the perfect time to correct this glaring omission.

    1. Re:What about poor old Acorn users? by Anonymous Coward · · Score: 0

      I like how you get +5 Informative for this :) Methinks that someone misses the joke.

      As it is, XML 1.1 does have 0x0D 0x0A as a newline, so there may be some endian issues between your 6509 and the XML spec. Does the XML spec define any sort of byte ordering?

    2. Re:What about poor old Acorn users? by david+duncan+scott · · Score: 1
      They used 0A-0D? The whole damned world (well, except for IBM, I guess) uses 0D-0A or just 0A, and they used the backwards sequence?

      Is this somehow related to driving on the left side of the road?

      --

      This next song is very sad. Please clap along. -- Robin Zander

    3. Re:What about poor old Acorn users? by PhilHibbs · · Score: 1

      When did 0D-0A appear? Was it Microsoft, or does it pre-date them? (CP/M? I can't remember)

      Have you ever used a mechanical typewriter? When you pull the carriage return lever, it first ratchets the roller up one line, then pulls the carriage back to the beginning of the line. So, 0A-0D is a closer representation of a typewriter! This is why it seemed odd to me when I moved from the BBC Micro onto more mainstream systems. 0A on its own just seemed weird to me at first.

    4. Re:What about poor old Acorn users? by PhilHibbs · · Score: 1
      I like how you get +5 Informative for this :)
      Yeah, I was rather astonished! It's down to 4 now, though, and not unfairly!

      Does the XML spec define any sort of byte ordering?
      The XML spec says this:

      "Entities encoded in UTF-16 must begin with the Byte Order Mark described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors must be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents."

      I think that means, you have to encode byte ordering info in the document.
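      A rough sketch of what "use this character to differentiate" could look like in Python. The function name is made up for illustration, and it only handles the UTF-8 versus UTF-16 case the quoted passage describes; it ignores encoding declarations and other encodings entirely:

```python
import codecs


def sniff_encoding(data: bytes) -> str:
    """Guess an encoding from a leading byte order mark, defaulting to UTF-8."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    return "utf-8"


little = codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "<a/>".encode("utf-16-be")
plain = "<a/>".encode("utf-8")
```

      So the byte order is carried by the first two bytes of the document itself, which is why CR+LF versus LF+CR is a separate question from endianness.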
    5. Re:What about poor old Acorn users? by gorilla · · Score: 4, Insightful

      The Atom, which the BBC was based upon, came out in 1979. At that time there wasn't an IBM PC, and the world was very diverse. You could choose Apples (0D), CP/M (0D,0A), Unix (0A), Primos (8A), VMS - which is either records or 0D. Also, it was quite rare for files to be shared amongst different systems - a file created on an Apple would stay on an Apple forever. A decision which looks strange in 2002 looks as sensible as any other option in 1979.

    6. Re:What about poor old Acorn users? by Anonymous Coward · · Score: 0

      Also, it was quite rare for files to be shared amongst different systems

      Indeed. It was still considered a novelty that an Amiga could read and write MS-DOS, Mac and various other disks as late as 1990. How did BBS users cope with all these different character sets though?

    7. Re:What about poor old Acorn users? by ErroneousBee · · Score: 2, Funny

      And what about old telex machine users that used x0D0D0A (CRCRLF) in case the carriage got stuck?

      You may think this is funny, but there is still some software out there that assumes a mechanical teletype at the far end, sending weather bulletins to Africa, football scores to the Beeb, etc.

      --
      **TODO** Steal someone elses sig.
    8. Re:What about poor old Acorn users? by gorilla · · Score: 2

      Mainly BBS's were for the same OS user. You'd have an Apple BBS or a CP/M BBS or whatever.

    9. Re:What about poor old Acorn users? by Luke-Jr · · Score: 1

      0D-0A == Windows
      0A == UNIX
      0D == Mac (pre OS X?)
      It was just a matter of time until something used 0A-0D...

      --
      Luke-Jr
    10. Re:What about poor old Acorn users? by Fastolfe · · Score: 1

      Byte order would only matter if the data consisted of multi-byte characters. While one might think that 0D0A is a multi-byte newline, it's really a two-character combination (CR+LF) that together mean a logical newline. I really doubt the Acorn is using multi-byte characters here where byte ordering mattered.

      Byte ordering is really a subject of encoding, though, and not something XML is supposed to have to worry about, though XML does allow for a prologue of sorts to allow parsers to "figure out" byte ordering for cases when they are using multi-byte characters.

      I believe MacOS also used a newline convention of 0A0D.

    11. Re:What about poor old Acorn users? by spitzak · · Score: 2
      I think the standard should read that a sequence containing a single 0x0a preceded and followed by any number of 0x0d's should count as a single newline. A 0x0a or a 0x0d by itself also counts as a newline. This should handle all the possibilities listed.

      I used to think 0x0d could be treated as whitespace but it sounds like older Macintoshes messed up that idea.
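      The proposed rule fits in one regular expression. This is only a sketch of that suggestion in Python, not what either XML version specifies, and the function name is invented for the example:

```python
import re

# One LF with any number of CRs on either side, or else a lone CR.
LENIENT_NEWLINE = re.compile(r"\r*\n\r*|\r")


def normalize_lenient(text: str) -> str:
    """Collapse every newline variant matched above into a single LF."""
    return LENIENT_NEWLINE.sub("\n", text)


examples = {
    "crlf": normalize_lenient("a\r\nb"),      # CP/M, DOS
    "lfcr": normalize_lenient("a\n\rb"),      # BBC Micro
    "crcrlf": normalize_lenient("a\r\r\nb"),  # telex
    "cr": normalize_lenient("a\rb"),          # old Mac
}
```

      The greedy match means a CR sitting between two LFs gets attached to the first one, so some inputs are inherently ambiguous under this rule. Two CRLFs in a row still come out as two newlines.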

    12. Re:What about poor old Acorn users? by david+duncan+scott · · Score: 2
      Yes, of course I've used a manual typewriter (or maybe not "of course", but then, I used to dial the telephone and wind my watch, too, and I don't suppose most Slashdotters would say that.) Wish I still had my old Underwood, or even the little Olivetti. Thing is, I don't think any computer has ever used a manual typewriter, and I think that the 0D-0A sequence was that used by teletypes (I'm not sure about Decwriters.)

      However, I have neither a TTY nor a TTY manual around here anywhere. Anybody?

      --

      This next song is very sad. Please clap along. -- Robin Zander

  13. New Line Character Vote by Anonymous Coward · · Score: 0, Funny

    IBM will make the new newline character a '$'

    (Just kidding)

    1. Re:New Line Character Vote by markhb · · Score: 1

      Actually, it's an '@'... or it looks like one.

      --
      Save Maine's economy: write stuff down. All comments are exclusively my own, not my employer.
    2. Re:New Line Character Vote by Luke-Jr · · Score: 1

      That's almost like the DC1 protocol which uses | for EOL...

      --
      Luke-Jr
    3. Re:New Line Character Vote by Nerdy · · Score: 1

      Offtopic -- beware

      Ah, Direct Connect. That thing still won't die...

  14. Here's A Good Point by LISNews · · Score: 5, Interesting
    From the article, which kind of put it into perspective for me:


    "The truth is that there are a lot of IBM mainframe systems out there, and they're very important," said Ronald Schmelzer, an analyst with ZapThink. "The truth is that this is not really for IBM's benefit, it's for IBM's customers' benefit. And I think that's fair. An international standard shouldn't change for the benefit of a company's future project, but it's clear that end-of-line characters are not a strategic business strategy for IBM."

    1. Re:Here's A Good Point by Fnkmaster · · Score: 3, Funny

      [/me raises hand]
      Ummm, sir, could you explain to me exactly what a "strategic business strategy" is?

    2. Re:Here's A Good Point by xneilj · · Score: 1

      It won't buy them any real advantage in the future market place. They're not going to be able to squeeze this in and suddenly make money from it.

      As many people have said, IBM want this in simply because a lot of their customers need it in and at the end of the day, it saves customers effort.

      IBM probably gain relatively little from this change themselves, but without it they'll have a lot of mainframe customers complaining - that's all they're trying to avoid.

      You should note that it's NOT just IBM and its customers who'll benefit. Anyone who needs to develop products which interoperate with mainframes will be happy to see this happen.

      --
      rm -rf / is the evil of all root
    3. Re:Here's A Good Point by Anonymous Coward · · Score: 1, Interesting

      Now let's take IBM out of this phrase and replace it with Microsoft, wonder how fast the responses would change from "it's not that bad" to "who do they think they are?"... Mod this as a troll if you will, but it's something to ponder.

      'Here's A Good Point (Score:5)
      by LISNews on Friday October 18, @10:28AM (#4478295)
      (User #150412 Info | http://www.lisnews.com)
      From the article, which kind of put it into perspective for me:

      "The truth is that there are a lot of Microsoft systems out there, and they're very important," said Ronald Schmelzer, an analyst with ZapThink. "The truth is that this is not really for Microsoft's benefit, it's for Microsoft's customers' benefit. And I think that's fair. An international standard shouldn't change for the benefit of a company's future project, but it's clear that end-of-line characters are not a strategic business strategy for Microsoft."

    4. Re:Here's A Good Point by Joe+Tie. · · Score: 1

      Ummm, sir, could you explain to me exactly what a "strategic business strategy" is?

      Cromulant?

      --
      Everything will be taken away from you.
    5. Re:Here's A Good Point by curunir · · Score: 2

      The uninformed would probably be saying that. But people who actually follow the XML specs would realize that MS has contributed a lot to the development of XML. IIRC, XSDs were Microsoft's idea and they're a big improvement over DTDs.

      I would prefer that Microsoft participate in *more* standards decisions. I only have problems with Microsoft when they decide to create their own proprietary and closed standard (SMB, Exchange, etc).

      If standards are public, then everyone is free to compete on the implementations of those standards. If Microsoft ends up being the best at implementing the standard, then more power to them (and I would hope that people would buy their products). It's only when Microsoft prevents others from creating a competing implementation that I take issue.

      --
      "Don't blame me, I voted for Kodos!"
    6. Re:Here's A Good Point by SpaceLifeForm · · Score: 2

      \n, meet \r

      --
      You are being MICROattacked, from various angles, in a SOFT manner.
    7. Re:Here's A Good Point by brsett · · Score: 1

      I don't know who contributed XSDs, and they are indeed a huge improvement over DTDs, but the impacts would be different if MS was promoting this change. Xerces would not be at risk of being non standards compliant, only MSXML would, and thus I wouldn't care. IBM should get more consideration by the w3c because they provide the parser used by more developers than any other one (despite the fact that msxml is probably installed on more machines).

      Hopefully ACE will eventually have a decent XML implementation (tho based on the code I saw, they still appear to be quite far away), and at least c/C++ programmers will be free to do as they wish, but until then C'est La Vie, I cast my lot with Xerces and IBM.

  15. Considering ... by DigitalDreg · · Score: 5, Insightful

    That IBM gave the world SGML and XML by derivative ....

    That a lot of useful data exists on IBM mainframes ....

    That EBCDIC doesn't "cleanly" map into Unicode by design like ASCII/UTF-8 does ...

    That this benefits IBM users and customers, not IBM because there is no strategic market position related to new-line characters ...

    That this was a recommendation reached by a group ...

    Let it live and get a life.

  16. How hard can it be... by meringuoid · · Score: 0, Redundant

    ... if this goes through, to write a batch converter to fix the newline characters?

    $ fixnewline *.xml

    Shouldn't take too much Perl... hell, a shell script could probably do it. Or am I missing something?
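    For what it's worth, here is roughly what such a tool could look like in Python rather than Perl. It is a sketch under assumptions: the files are taken to be UTF-8, the NEL is taken to be U+0085, and `fix_newlines` is a made-up name:

```python
import sys
from pathlib import Path

NEL = "\u0085"


def fix_newlines(path: Path) -> int:
    """Rewrite a UTF-8 file in place, replacing NEL with LF; return the count."""
    text = path.read_bytes().decode("utf-8")
    count = text.count(NEL)
    if count:
        path.write_bytes(text.replace(NEL, "\n").encode("utf-8"))
    return count


def main(argv):
    # Report how many NELs were replaced in each file named on the command line.
    for name in argv:
        print(name, fix_newlines(Path(name)))


if __name__ == "__main__":
    main(sys.argv[1:])
```

    Saved as a script (the filename is up to you), it would be run with the XML files as arguments, much like the one-liner above. The catch raised elsewhere in the thread still applies: converting files is easy, changing the software that produces them is the real work.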

    --
    Real Daleks don't climb stairs - they level the building.
  17. 2 line summary by Shagg · · Score: 5, Insightful

    1) XML 1.0 does not follow the Unicode spec
    3) XML 1.1 makes a change so that it does follow the spec

    What's the complaint again?

    --
    Unix is user friendly, it's just selective about who its friends are.
    1. Re:2 line summary by Anonymous Coward · · Score: 0

      you cant count, 1 then 3? huh? its 1,2,3

    2. Re:2 line summary by mookie-blaylock · · Score: 1

      No, the original poster was right. Basically, the logic follows this simple format (you may have seen it before). 1. Change newline character so it follows Unicode spec. 2. ????? 3. PROFIT!!!! I'm sure he just left out #2 for the sake of brevity.

      --
      I am not Herbert.
    3. Re:2 line summary by Anonymous Coward · · Score: 0

      You're an odd little thing, aren't you? It's still only 2 lines.

  18. The Balder Cult by Anonymous Coward · · Score: 0

    from the article:
    Although not referring specifically to the Mallinson case, he added it may be necessary to "weed out" employees who did not live up to Microsoft's code of behaviour.

    I'd like to see someone at Microsoft do another 'ape routine', personally.

  19. News?? by ceeam · · Score: 2, Funny

    In other news - Julius Caesar stabbed and died.
    Anyway - for how long 1.1 draft has been out?

  20. What do they mean, "XML 1.0 chokes"? by st.+augustine · · Score: 5, Insightful

    Does anyone have a link to a page explaining what's really going on? Last I heard, XML doesn't even have a concept of newlines -- most of the time all white space gets normalized (collapsed). The only problem that I could see is if the character wasn't part of the spec for white space. Now, people may have written XML software that chokes, but I think that's a slightly different story. So is the problem that the new character shows up as bogus text content in elements? And is that true for all XML processing software, or does software that relies on a proper Unicode engine not have the problem? What's the deal?

    --

    -- Some things are to be believed, though not susceptible to rational proof.
    1. Re:What do they mean, "XML 1.0 chokes"? by Anonymous Coward · · Score: 0

      Where it comes into play most is in *XML related* technologies like XSLT. Let's say, for example, you're using XSLT to transform an XML document into a Python code snippet, then your stylesheet has to make use of newlines.

      For anyone who's relying on XSLT for core business applications, the outcome of this spec is a really big deal.

    2. Re:What do they mean, "XML 1.0 chokes"? by Dapnant · · Score: 2, Informative

      Actually your understanding of whitespace in XML is almost completely incorrect.

      Many XML applications ignore whitespace (after parsing). XML parsers are prohibited from deleting any whitespace that might be part of the data in a document. The xml:space attribute allows a document to indicate places where the author or encoder encourages normalization of space in some way.

      This is all clearly explained in the standard itself (W3c XML pages).

    3. Re:What do they mean, "XML 1.0 chokes"? by Fastolfe · · Score: 2

      XML 1.0 and XML 1.1 require that parsers normalize what they consider "new lines" to a standard 0x0a newline character. For those applications where whitespace and newlines are somewhat significant, on those platforms that use a native newline other than what XML 1.0 allowed as a newline character, you had to create a hacked up "binary" file. In other words, you could not create this file as a text file on that platform, because the newlines would be treated as arbitrary binary data in the actual XML document and would not get normalized.

      All this change does is tells XML 1.1 parsers to recognize this Unicode newline as another valid newline, so that when it normalizes newlines in the XML data, it knows to honor this one as well. This lets people on those IBM mainframe platforms store XML as a native text document instead of trying to hack up something that is unmaintainable except through specialized tools. Unless people are doing stupid/incorrect stuff with their character sets and encodings, this should not affect anyone else.
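      To make the difference concrete, here is a small Python sketch of the line-end normalization step as I read the 1.0 spec and the 1.1 candidate text. The regexes and function name are illustrative and this is not a parser:

```python
import re

# XML 1.0: CR LF and a lone CR both become LF.
XML10_EOL = re.compile("\r\n|\r")
# XML 1.1 adds CR NEL, NEL (U+0085) and LINE SEPARATOR (U+2028) to that list.
XML11_EOL = re.compile("\r\n|\r\u0085|\u0085|\u2028|\r")


def normalize_eol(text: str, version: str = "1.0") -> str:
    """Apply the line-end normalization of the given XML version."""
    pattern = XML11_EOL if version == "1.1" else XML10_EOL
    return pattern.sub("\n", text)


doc = "line1\r\nline2\u0085line3"
as_10 = normalize_eol(doc, "1.0")  # NEL survives as ordinary character data
as_11 = normalize_eol(doc, "1.1")  # NEL is treated like any other newline
```

      Under 1.0 rules the NEL is passed through untouched, which is exactly the "arbitrary binary data" situation described above.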

    4. Re:What do they mean, "XML 1.0 chokes"? by st.+augustine · · Score: 2

      Actually your understanding of whitespace in XML is almost completely incorrect.
      Actually it's not, but I'll admit that my "most of the time" was -- I was mixing thinking about attribute values with thinking about element content. Outside attribute values, though, I repeat, I don't understand what the problem is supposed to be. If you've got a Unicode NEL in your UTF-8 encoded XML document, by and large the parser's going to pass it through to your application. What will happen is that it'll fail to get marked as white space. I'm sure there are people for whom that's hugely relevant, but I bet for the majority of XML applications it's not. If I'm wrong, somebody post a link.

      This comment about external parsed entities is much more relevant. I don't use them in any of my XML applications, but I suppose they could get someone in trouble. However, I still don't see how this qualifies as choking.

      This is all clearly explained in the standard itself (W3c XML pages).
      Dude, you can't just say "it's clearly explained" and point to an enormous mountain of documents. Why didn't you point to the relevant part of the spec?
      --

      -- Some things are to be believed, though not susceptible to rational proof.
  21. Complain by almeida · · Score: 3, Informative

    Don't like it? Complain. Comments can be viewed here.

  22. CRLF in EBCDIC by spacefight · · Score: 4, Interesting

    is 0x156C in my programming area, 'nough said. EBCDIC is still live. Did you know that about 90% of today's enterprise data is stored in EBCDIC chars? You better update the XML specs :)

    1. Re:CRLF in EBCDIC by Waffle+Iron · · Score: 2
      Did you know that about 90% of todays enterprise data is stored in EBCDIC chars?

      That sounds like a tautology, because I bet you define "enterprise data" as any data that is stored in a mainframe.

    2. Re:CRLF in EBCDIC by Anonymous Coward · · Score: 0

      no, "enterprise data" is any data stored in EBCDIC format.

    3. Re:CRLF in EBCDIC by njdj · · Score: 2

      Did you know that about 90% of todays enterprise data is stored in EBCDIC

      Nearly right. Did you know that about 90% of enterprise data is out of date?

    4. Re:CRLF in EBCDIC by Mittermeyer · · Score: 2

      Spaceflight, let's do a parade of dinosaurs up and down right in front of all these penguin serverheads. Then we'll eat THEIR eggs.

      --
      ________________________________________ History Must Not Fall Into The Wrong Hands ___________________________________
    5. Re:CRLF in EBCDIC by spacefight · · Score: 1

      My nick is spacefight. Anyway, I've nothing against penguin driven servers, but as I am in the mainframe biz since one year, coding C and HLASM, I'll attend you for this parade :)

    6. Re:CRLF in EBCDIC by spacefight · · Score: 1

      Out of date? Not a bit. Banks, Airlines, generally transaction processing systems are on Mainframes and this data is not a single bit out of date.

  23. New Newline Character? by RAMMS+EIN · · Score: 4, Interesting

    Anybody care to explain to me _why_ we need so many different newline characters | sequences? I see a point in having a single \x0a character, because a newline is one character. I see a point in having \x0a\x0d and \x0d\x0a, because they represent more accurately how a typewriter does it (and conform better to the original ASCII standard, I think). However, one of these is kind of redundant, and history seems to have decided that this is \x0a\x0d. But why, for goodness's sake, do we need all those others??? Why is it that people always do things their own way instead of following standards that work fine???

    --
    Please correct me if I got my facts wrong.
    1. Re:New Newline Character? by PainKilleR-CE · · Score: 2, Interesting

      I see a point in having \x0a\x0d and \x0d\x0a, because they represent more accurately how a typewriter does it (and conform better to the original ASCII standard, I think). However, one of these is kind of redundant, and history seems to have decided that this is \x0a\x0d. But why, for goodness's sake, do we need all those others??? Why is it that people always do things their own way instead of following standards that work fine???

      Because ASCII doesn't work for the character sets that over 50% of the world's (literate) population reads and writes. Hence the Unicode standard, which of course tries to make its overlap with ASCII compatible when possible.

      --
      -PainKilleR-[CE]
    2. Re:New Newline Character? by Anonymous Coward · · Score: 0

      > and history seems to have decided that this is \x0a\x0d

      I ask in all seriousness: Who uses that?

      I think "\r\n" (a.k.a. "\x0d\x0a") is used far more often.

    3. Re:New Newline Character? by RAMMS+EIN · · Score: 2

      ``Because ASCII doesn't work for the character sets that over 50% of the world's (literate) population reads and writes. Hence the Unicode standard, which of course tries to make its overlap with ASCII compatible when possible.''
      Note that I said ``standards that work fine''. ASCII doesn't work fine, because it can only represent languages that use the latin alphabet with no accents decently. However, the newline/LF and CR in ASCII work fine, no matter what language, as far as I can see. So my question remains...why add to that? Does this new newline character serve another function besides indicating a newline, so that it is semantically different from the old newline?

      ---
      Know Your Sysadmin

      --
      Please correct me if I got my facts wrong.
    4. Re:New Newline Character? by Anonymous Coward · · Score: 0

      Apparently, the BBC micro did...

    5. Re:New Newline Character? by gorilla · · Score: 5, Informative

      Each of them has a different function. 000A and 000D are for compatibility with ASCII. 0085 is for a unified character to replace the 000D 000A pair used on some OS's. However, some programs (eg notepad) use line breaks when they really mean paragraph separators, so Unicode defined two codes which mean REAL line separator, and REAL paragraph separator. This report explains it quite clearly.
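      A quick way to see the distinction is to ask Python's `unicodedata` module how Unicode classifies each character. The dictionary labels are just labels picked for this example:

```python
import unicodedata

LINE_BREAKS = {
    "LF": "\u000a",
    "CR": "\u000d",
    "NEL": "\u0085",
    "LS": "\u2028",  # LINE SEPARATOR
    "PS": "\u2029",  # PARAGRAPH SEPARATOR
}

# General category of each code point: Cc is a control character,
# Zl and Zp are the dedicated line and paragraph separator categories.
categories = {name: unicodedata.category(ch) for name, ch in LINE_BREAKS.items()}

# str.splitlines() treats every one of these as a line boundary.
pieces = "a\u0085b\u2028c\u2029d\ne\rf".splitlines()
```

      The first three are plain control codes inherited from older standards, while the last two have their own separator categories.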

    6. Re:New Newline Character? by Anonymous Coward · · Score: 0

      Dude - who told you that ASCII is THE standard?

      I mean it's OK for dinky little PC's and all, but there is a whole universe of standard encodings that have as much (or more) history as ASCII

      Just be glad no one is saying that we should all use only EBCDIC. They are just saying that EBCDIC should be given the same respect as ASCII.

      Unicode tries to respect all (or at least lots) of standard encodings that predate it.

      Seems reasonable to me.

    7. Re:New Newline Character? by PainKilleR-CE · · Score: 1

      However, the newline/LF and CR in ASCII work fine, no matter what language, as far as I can see. So my question remains...why add to that? Does this new newline character serve another function besides indicating a newline, so that it is semantically different from the old newline?

      Actually, if you look up the specification for C1 control codes (where this particular NEL comes from), you'll see that it's quite archaic, and that the primary reason for this particular nextline control code is because it's mode-dependent and can behave differently based on the current mode of the device in use. In other words, it's a newline that might not be displayed on the screen depending on the mode of the display, sort of like deciding where a newline shows up if the user has selected the 'word wrap' option in their editor, though a little more complex.

      That being said, it's all dependent on the device implementation, and the specification is so old that most devices probably either ignore it or treat it as a simple newline.

      --
      -PainKilleR-[CE]
    8. Re:New Newline Character? by Lucas+Membrane · · Score: 2
      The excess of characters exists because these were thought up to control hardware, ie printers. The 0A character is a linefeed. It tells the printer to advance the paper vertically one line. On some printers, it did not change the print position (ie the column). The 0D character is the carriage return. It tells the printer to return the print position to the left margin (what it does for right-to-left languages, IDK). The CR could be used to overstrike print, because it does not cause vertical motion of the print position or paper.

      0D + 0A equals what has to happen at the end of a normal line when printing on one of these character printers. Why is the 0D usually first? Because it often took longer for the print head to move all the way across to the left margin than it did to advance the paper one line. Printers could do the linefeed (0A) while the carriage return was already in progress without waiting for the carriage return to complete -- producing a net increase of speed on slow printers. Mechanical typewriters did the linefeed first, so that you could save time by interrupting the carriage return before it was completed when you wanted to print indented text.

  24. This is what the process is for by cheezfreek · · Score: 1

    This is why there is a process for International Standards (for the moment, let's ignore fast-tracking of standards, which isn't being done for XML 1.1). If one company wants something, they can feel free to propose it, but the other members of the committee can vote it down. If this really does cause a significant incompatibility with a previous version, then the committee will realize it and vote it down. So, no one company can push something down the others' throats.

  25. Karma, please by p3d0 · · Score: 2, Redundant
    --
    Patrick Doyle
    I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
  26. how hard is it to do parsing? by Wild+Bill+Hickock · · Score: 1

    I am wondering how hard is it for IBM or anybody else out there to write a small parsing utility to change the end of line character to the corrected one. As a computer science student I can say that this is not hard. Now on the other hand IBM probably has many xml files that would need to change but then again, we are talking about IBM. They have the power to do that.

    1. Re:how hard is it to do parsing? by vidarh · · Score: 3, Informative

      This isn't about changing existing files, but about changing software that is in production use on mainframes. And it isn't IBM, but IBM's mainframe customers and anyone dealing with their mainframe-using customers that would get the work.

    2. Re:how hard is it to do parsing? by Anonymous Coward · · Score: 0


      I'll assume from that you've never had significant issues with CR/LF problems.

      I have. When they come up in a mixed environment it complicates things. It is a royal pain in the ass.

      Samba shares: files are transferred as binary, and as such the line endings match the original source computer. Later the file is transferred to a unix share and then imported through a unix process which, unfortunately like all unix processes, expects a different format... so, you add a conversion to the process, only the file is 2 gigs and the conversion takes too long... blah blah..
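For what it's worth, the conversion step described above does not have to load the whole file. The following is only a sketch of one way to do it, assuming the problem is CRLF line endings that need to become LF; the function name and chunk size are made up for illustration.

```python
import io


def crlf_to_lf(src, dst, chunk_size=1 << 20):
    """Copy src to dst, rewriting CRLF pairs as LF, one chunk at a time."""
    held_cr = False  # True when the previous chunk ended in a CR
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        if held_cr:
            chunk = b"\r" + chunk
            held_cr = False
        if chunk.endswith(b"\r"):
            # The matching LF may be the first byte of the next chunk,
            # so hold this CR back until we have seen what follows it.
            chunk = chunk[:-1]
            held_cr = True
        dst.write(chunk.replace(b"\r\n", b"\n"))
    if held_cr:
        dst.write(b"\r")  # a trailing lone CR is kept unchanged
```

Both arguments are binary file objects, so memory use stays at roughly one chunk no matter how large the file is. Lone CRs are deliberately left alone here; whether that is right depends on where the files come from.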

    3. Re:how hard is it to do parsing? by arkanes · · Score: 2

      We use an IBM mainframe, and transferring data from our (modern, working, fast) PCs to our (old, slow, not working) mainframe is ALREADY a huge pain in the ass. Thank god we didn't decide to use XML.

  27. Re:Who cares by Anonymous Coward · · Score: 0

    since they are not industry STANDARDS it doesnt matter

    windows is not an industry standard either.

  28. The whole thing seems silly by streak · · Score: 3, Insightful

    As has been pointed out by many people, this whole issue is silly since 1.1 actually follows the Unicode spec more closely...
    (so whose code is broken now? huh?)

    Personally I don't see the big deal over XML itself. It's just a way of organizing data hierarchically and expressing it in a nice format (aka a TREE)

    I still don't know how people manage to write 500+ page books on it.
    Maybe I'm just completely stupid -- please, someone enlighten me to the great wonder that is XML.

    1. Re:The whole thing seems silly by master_p · · Score: 1

      XML is very useful for exchanging data without knowing beforehand what kind of data that is. The only prearrangement should be for the two communicating apps to have common definitions, i.e. an 'apple' for app 1 should also be an 'apple' for app 2 and not something else.

      By expressing the format in pure text, there is no need to deal with problems coming from binary representations: an XML document can be taken from a little-endian machine and be transferred to a big-endian machine very easily.

      And writing new apps for the XML data is a breeze: there is no need to have documentation for the format of the data, since XML data are self-describing.

      For example, if the Microsoft Word document format was XML, then Open Office would have no problem supporting the MS Word format.

      If XML data get corrupted, they can always be edited with a simple text editor. Binary data can't be edited so easily.

      There is a multitude of advantages of XML over traditional binary data representation. The natural way of handling data is pure text, and it is something that humans understand. Until now, a computer programmer was required to conform to the machine's peculiarities; with XML, the machine is the one that needs to adjust its behaviour according to the user.

      In my view, the problem is with the Unicode standard. Why are there so many similar (in functionality) characters?

    2. Re:The whole thing seems silly by putaro · · Score: 1

      While I agree that all of the hype about XML is out-of-control and a lot of the things said about XML don't hold up when you start to use it (describes the data - NOT!) and it can be a ROYAL pain in the ass, I find myself using it a lot because it _is_ defined.

      For example, I recently integrated our web store with our fulfillment vendor here in Japan. I suggested XML but their system is only set up to work with CSV files (I need to visit their IT operations - I smell Excel on Windoze - what can you expect from guys who pack boxes all day long). Now CSV files are all fine and well, but you have to work through a bunch of other issues now. For example, what happens when the customer's company name has a "," in it - you have to agree on an escaping mechanism. XML has those kind of things ALREADY defined. It may be a pain in the ass to get XML to do something but most things that you can get it to do have been defined so you don't have to discuss the subject with the other people you're working with.
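The comma problem above is easy to show. This is just an illustrative sketch using Python's standard library; the company name and order number are invented. The CSV side relies on one particular quoting convention that both parties still have to agree on, while the XML side uses escaping rules that come from the XML spec itself.

```python
import csv
import io
from xml.sax.saxutils import escape

company = "Smith, Jones & Co."  # hypothetical customer name

# CSV: Python's csv module wraps the field in double quotes by default,
# but the receiving system has to expect that same convention.
buf = io.StringIO()
csv.writer(buf).writerow(["1001", company])
csv_line = buf.getvalue()

# XML: '&', '<' and '>' are escaped the way the XML spec prescribes,
# and the comma needs no special treatment at all.
xml_line = "<company>" + escape(company) + "</company>"
```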

      And how are people able to write 500+ page books about it? Well, go try writing an XSL stylesheet that actually does something :-). You'll understand real quick.

    3. Re:The whole thing seems silly by dossen · · Score: 2, Insightful

      The great thing about XML is all the other stuff (XSL,XPath...) that comes with it. And all the implementations for various languages, leaving you free to code your application, instead of having to deal with data formats.

      As an example take programX versions 1.x and 2.x, and suppose they use different file-formats (perhaps a feature in 2.x revealed something profoundly broken in the 1.x format).
      Now the customers need to convert their data (perhaps even both ways, while they change their installations (and their customers/partners/suppliers/... do the same)).
      For that end you must write a conversion, and a program to execute the conversion.
      In XML the conversion would be an XSL-stylesheet and the program would be mostly off-the-shelf, saving you the trouble of writing your own conversion program (you would of course combine the conversion and the "driver" program, right? There will never be a 3.x, right? And validating that your conversion is correct? Naaahhh... Real men trust their code, right?).

      Stuff like that is the reason why XML is nice for a data-format.

    4. Re:The whole thing seems silly by kalidasa · · Score: 2

      Read the XSLT spec. That gives you some idea of the wonder that is XML.

    5. Re:The whole thing seems silly by Anonymous Coward · · Score: 0


      XML is very useful for exchanging data without knowing beforehand what kind of data that is. The only prearrangement should be for the two communicating apps to have common definitions, i.e. an 'apple' for app 1 should also be an 'apple' for app 2 and not something else.


      Claptrap! The level of shared understanding between the two systems is _exactly_ what it would need to be for any number of binary representations.

      The data in offsets 32-68 for app 1 should mean the same thing as the data in offsets 32-68 for app2, and not something else


      By expressing the format in pure text, there is no need to deal with problems coming from binary representations: an XML document can be taken from a little-endian machine and be transferred to a big-endian machine very easily.


      Ha! You missed the whole point of the article. XML is based on Unicode, not "text." 16-bit Unicode is perfectly valid XML, so yes, you do have to deal with byte order. Fortunately, XML requires that any 16-bit data start with a byte order mark, so that the parser can tell what byte order should be used, but then again, so would any well designed "binary" format.
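The byte order point can be seen with a small experiment. This is a sketch that assumes Python's bundled expat-based parser, which as far as I know picks the byte order from the BOM; the sample document is invented.

```python
import codecs
import xml.etree.ElementTree as ET

doc = '<?xml version="1.0"?><fruit>apple</fruit>'

# The same document serialized in both 16-bit byte orders, each led by its BOM.
little = codecs.BOM_UTF16_LE + doc.encode("utf-16-le")
big = codecs.BOM_UTF16_BE + doc.encode("utf-16-be")

# The byte streams differ, but the parser reads the BOM and decodes either one.
text_le = ET.fromstring(little).text
text_be = ET.fromstring(big).text
```

So byte order does exist as an issue in XML; it is just handled by a rule in the spec rather than by each application.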


      And writing new apps for the XML data is a breeze: there is no need to have documentation for the format of the data, since XML data are self-describing.


      Huh? OK here is a tag: <name />. What does it mean? (not very self-descriptive is it?) Oh, did you mean that it is _possible_ to use good descriptive names in XML and comment the definition to explain things? Ummmm... And this is different from other formats how?


      If XML data get corrupted, they can always be edited with a simple text editor. Binary data can't be edited so easily.


      Lessee... Where did I put that (little/big)-endian UTF-16 editor? Oh hell, might as well just use the Hex editor like I used for my "binary" data.

      I am not dissing XML. I use it every day and think that it is a positive good. While not perfect, it sets some standards instead of having every program reinvent the wheel. This allows for standardized tools that can be reused. I like not having to write a parser every time I want to persist or share data between applications. XML is good.

      This "XML is self-describing," or "XML is text," or "XML is easy in a way that binary data isn't" foolishness, however, is just ill-informed folks parroting something that they read in a book.

  29. W3C still meeting over IBM resolutions by spoonyfork · · Score: 5, Funny

    World Wide Web Consortium still meeting over IBM resolutions

    Posted: Sat, 19 Oct 2002 0:18 AEST

    The five permanent members of the World Wide Web Consortium are meeting overnight in an attempt to agree on a resolution on IBM. W3C diplomats say there are signs HP and Sun are now moving towards a compromise, after weeks of wrangling over the XML issue.

    HP wants clear instructions given to Microsoft that it return to the World Wide Web Consortium before taking legal action, while Microsoft wants more leeway for itself and its allies. Meanwhile, the HP CEO, Carly Fiorina, has given another strong warning against the use of force against IBM.

    Speaking at the opening of a summit of XML using companies in the Silicon Valley, Mrs Fiorina said legal force must only be used as a last resort. She called for all conflicts to be resolved in ways respecting international law, as this was the only guarantee against what she described as "adventurist" policies.

    --
    Speak truth to power.
  30. choke? by Ender+Ryan · · Score: 2
    I don't understand why XML 1.0 should choke on this character. Shouldn't it just see it as any other data?

    The article doesn't explain the technical problems in any depth at all.

    --
    Sticking feathers up your butt does not make you a chicken - Tyler Durden
  31. End of Line For XML? by Anonymous Coward · · Score: 1, Interesting

    Does this mean that XML has reached the end of the line and it is time to start working on the next big thing?

  32. Screw Unicode.. by grub · · Score: 2


    ... 8 bits should be enough for anybody.

    to paraphrase Bill G.

    --
    Trolling is a art,
  33. *Shrug* by Fweeky · · Score: 5, Insightful
    If you're using the XML prologue like you're supposed to, your XML 1.0 documents will have:
    <?xml version="1.0" ?>
    At the top. The parsers will then parse using the XML 1.0 specification and you won't notice a thing.

    If you don't use it, tough luck, you should have followed the original recommendation more closely. Lucky for you it's not exactly difficult to automatically process XML documents and add the prologue later.
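Adding the declaration after the fact could look something like the sketch below. The function name is made up, and it assumes the document is already decoded text that needs no encoding declaration.

```python
import re


def ensure_xml_declaration(text, version="1.0"):
    """Prepend an XML declaration unless the document already starts with one."""
    body = text.lstrip()  # nothing may precede the declaration, not even whitespace
    if re.match(r"<\?xml\s", body):
        return body
    return '<?xml version="%s"?>\n%s' % (version, body)
```

The check looks for `<?xml` followed by whitespace so that other processing instructions such as `<?xml-stylesheet ...?>` are not mistaken for the declaration.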
    1. Re:*Shrug* by greenhide · · Score: 2


      Believe it or not as you will, adding the declaration frequently will cause bad behavior -- if it's XHTML on a browser. Often I've had clients call in, complaining about a site displaying bizarrely, or not at all. The thing that has fixed it has been taking out the XML declaration which, according to all the specs that I've seen, is optional.

      The DOCTYPE declaration often causes problems too. Why, I don't know. I just wish all these vendors would get their acts together and get all of the browsers to respond the same for XHTML-compliant pages.
      </gripe>

      --
      Karma: Chevy Kavalierma.
    2. Re:*Shrug* by catfood · · Score: 2

      Non-compliant browser.

    3. Re:*Shrug* by greenhide · · Score: 2

      Widely used browser.

      The truth is, you can never just tell a client that a browser is non-compliant and leave it at that. They are using that browser, and the website must match what they should be getting.

      One of our clients had, for some god-forsaken reason, a font named "Georgia" which was some form of Cyrillic. He thought his whole website was gibberish, because in the style sheet we had defined that font as the default font for body text. Even though it was only the case on his computer which had this bizarre font, we had to change it to Times New Roman so that it would display correctly on his browser.

      Not to get on another rant but I have to say, I am a little sick of people who talk about standards and how it's more important to follow the standards than to create solutions that do what clients want, even if they don't comply. Have these people ever created websites for others for money? If so, they must have discovered by now that the mantra "The customer always thinks they're right" applies in their line of work, too. I don't have the luxury of telling some client to "shove off" if they don't want to comply with my standards-following ideals. I have to give them what they want.

      --
      Karma: Chevy Kavalierma.
    4. Re:*Shrug* by Anonymous Coward · · Score: 0

      Under 1.1 rules, anything without an XML declaration is assumed to be 1.0, unless it gets incorporated into a 1.1 document as an external entity.

    5. Re:*Shrug* by catfood · · Score: 2

      Which is all fine.

      But it doesn't make the browser any more compliant.

      Credit where credit's due, that's all I'm saying.

    6. Re:*Shrug* by Fastolfe · · Score: 1

      I have never had this problem and have been generating XHTML content for years on several browser types. Could you elaborate on what browser and version you saw this behavior? Was your web server delivering a proper MIME type? Could your browser have been IE, which second-guesses generic MIME types? It's possible you have a misconfiguration and just don't realize it because IE tries to play smart.

    7. Re:*Shrug* by Anonymous Coward · · Score: 0

      Then you probably shouldn't be using XHTML, because that browser barfs when you use the right MIME-type (application/xhtml+xml). Despite all the hype, the world still isn't ready for XHTML deployment; stick to HTML 4.01 and don't worry about it yet. (Not to mention that the sort of XHTML served as text/html is usually invalid and not well-formed, so it has *much* bigger problems in terms of XML interchange, which was the original question.)

    8. Re:*Shrug* by greenhide · · Score: 2

      It was delivering the mime type text/html and, yes, it was on an IE browser. One of the problems had been whitespace before the XML declaration, although even when I removed the whitespace, I still had problems.

      Also, as it was a dynamic site, I also did not have total control over the content put in by the client. Since they frequently used an & standing alone, that made it a badly formed XML which made the whole page unviewable. It would have been difficult to escape entities *everywhere* they appeared (links, page titles, page content, meta keywords and description and more were all editable by the client); just allowing Internet Explorer to think that it *wasn't* dealing with a stretch of XML even when it was seemed to fix that problem...

      --
      Karma: Chevy Kavalierma.
    9. Re:*Shrug* by Nathaniel · · Score: 2
      "I also did not have total control over the content put in by the client. Since they frequently used an & standing alone, that made it a badly formed XML...."

      Well, yeah. If it wasn't compliant XML, it is hardly surprising that the browsers refused to display the page when you claimed it was valid XML. That's what they are supposed to do.

      Removing the XML claim, as you found, was perfectly appropriate.
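The bare ampersand case is simple to reproduce. The sketch below uses Python's standard XML parser as a stand-in for a strict browser; the sample text and helper name are invented.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

client_text = "Fish & Chips"  # hypothetical client-entered content


def is_well_formed(markup):
    """Return True if the parser accepts the markup, False on a parse error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


raw = "<title>" + client_text + "</title>"
escaped = "<title>" + escape(client_text) + "</title>"
```

Here `raw` is rejected and `escaped` is accepted, so escaping at the single point where client content enters the page template would be one alternative to dropping the XML claim.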

  34. XML 1.1 - Problem? by Plud · · Score: 5, Informative

    A newline character should have no impact with respect to backwards compatibility. The only negative impact with regard to a newline character should be confined to poorly written DOM code that parses out all nodes instead of just relevant nodes. Similar issue with SAX. Even if there were a backwards compatibility issue with a new XML spec, most people define their version number in their documents so the parser knows which spec to follow while parsing it.

    1. Re:XML 1.1 - Problem? by J.+Random+Software · · Score: 2

      XML allows line breaks inside tags, but since EBCDIC platforms are likely to embed NEL characters when you do this, XML 1.0 parsers should reject such documents--even though what you typed would have been well-formed if you had been working in US-ASCII. Parsers are backward compatible, but documents are not.
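A quick way to see the asymmetry is to feed both forms to an XML 1.0 parser. This sketch assumes Python's bundled expat parser, which implements XML 1.0; the element and attribute names are invented.

```python
import xml.etree.ElementTree as ET


def parses(markup):
    """Return True if the XML 1.0 parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


with_lf = '<item\nid="1"/>'        # line feed between tag name and attribute
with_nel = '<item\u0085id="1"/>'   # U+0085 NEXT LINE in the same position
```

I would expect `parses(with_lf)` to be true and `parses(with_nel)` to be false, since U+0085 is neither whitespace nor a name character under the 1.0 rules.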

  35. Full Details by Matts · · Score: 5, Informative

    Full details of why this has the potential to break things are on the XML news site Cafe Con Leche.

    Please read that before making uninformed comments - news.com isn't where you'll find technical information about this problem.

    --

    Matt. Want XML + Apache + Stylesheets? Get AxKit.
    1. Re:Full Details by Anonymous Coward · · Score: 5, Interesting

      So I went off and read it (Or at least, what appears to be it. There is a rant someway down the page you link to. Is that it?)

      So anyway, I read it. Surprise, surprise, the guy doesn't actually offer any actual examples of where this change would actually cause a break in itself. All he basically does is cry that 0x85 is designated as a new line character, and how dare IBM do such a thing! Then he goes into a rant about IBM, monopolies and patents. Uh huh.

      The fact is that 0x0085 is designated as NEL (NEw Line) as part of the Unicode specification. XML 1.1 allows the use of Unicode, which XML 1.0 did not. Therefore, if you are using XML 1.1, and you are using 0x85 and expect to see a grave a, your document isn't a Unicode compliant document anyway, and you shouldn't be complaining that a non compliant document doesn't work with a compliant parser.

      If all these people want to use 0x85 in their XML 1.1 documents, then they'll have to properly convert them to Unicode as the specification allows. Surprising, that.

    2. Re:Full Details by nijhof · · Score: 2, Interesting

      There is only a rant on that page, no examples.

      And you know what? I think an XML v1.1 document would be incompatible with any non-updated program, no matter what the changes in v1.1 are -- for if the program wasn't upgraded, it can't know what XML v.1.1 means. And there must be some difference, otherwise it wouldn't have a different version number

      Jeroen

    3. Re:Full Details by Mike+Schiraldi · · Score: 1, Offtopic
      Full details of why this has the potential to break things are on the XML news site Cafe Con Leche ... Please read that before making uninformed comments.

      Okay, i read it. Time for some uninformed comments!
      • Microsoft is really overstepping its bounds here. The DOJ needs to hit it hard.
      • IBM invented XML, so they can do what they want.
      • XML isn't even supposed to contain newlines, so what's the difference?
      • If we don't make this change, none of the XML parsers will be able to run after January 1, 2000
      • If you don't like the new XML, take public transportation.
    4. Re:Full Details by MenTaLguY · · Score: 3, Informative

      If all these people want to use 0x85 in their XML 1.1 documents, then they'll have to properly convert them to Unicode as the specification allows. Surprising, that.

      Or specify iso-8859-1 as their encoding in the XML prologue, which they were supposed to do in XML 1.0 anyway.

      --

      DNA just wants to be free...
  36. The proposed change by Srin+Tuar · · Score: 4, Informative
    Rather than reading through the whole spec:
    here is a summary of just the proposed change.


    It seems to comply with unicode just fine, I dont see what the controversy is really.

  37. New Line and Characters by JasonSkywalker · · Score: 5, Funny

    Let us not forget that it was New Line that callously yanked Tolkien's loveable Tom Bombadil! It was New Line that turned Arwen into a heroic Nazgul-racing babe-elf! It was New Line that left out poor Glorfindel and his big moment at the river altogether!

    I don't know about anyone else, but I think it's only fitting that a New Line character be messed with.

    ---

    --
    I have Unix underpants.
  38. feature not flaw by UniverseIsADoughnut · · Score: 1

    IBM isn't trying to make up for a flaw on their part, they're trying to introduce a new feature to everyone.

    1. Re:feature not flaw by Trix · · Score: 1

      Maybe I'm just being thick. I don't know. It's been one of those days.

      It seems to me that the line ending character they are trying to stabilize on is #xA, right? That's the same as 0x0a, right? That's the same as '\n,' right? That's the same linefeed line ender that UNIX has been using since the '70s, right? What's the problem?

      I only wish this meant that DOS would stop putting all those extra carriage returns in files!

      Go ahead, mod me down. I just wanted to blow off steam anyway.

      --
      I want all of the power and none of the responsibility.
    2. Re:feature not flaw by UniverseIsADoughnut · · Score: 1

      it was a joke, from the classic line
      "If you can't fix it, feature it"

  39. It's been done already in DOS by x0n · · Score: 1

    The $ sign is used as end of line marker for function 09h of int 21h (print string), e.g.

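    org 100h                 ; .COM programs are loaded at offset 100h

    mov ah,09h               ; DOS function 09h: print '$'-terminated string
    mov dx, TXT              ; DS:DX points at the string
    int 21h
    int 20h                  ; terminate

    TXT db 'hello, world!$'  ; data goes after the code so it is never executed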

    My x86 ain't what it used to be so I'm awaiting endless corrections to this, but don't miss the point people ;)

    --

    PGP KeyId: 0x08D63965
    1. Re:It's been done already in DOS by Anonymous Coward · · Score: 0


      DOS??? That latecomer...It's actually a frippin CP/M function call which got embraced by DOS.

  40. How many lines? by PhxBlue · · Score: 1

    2) ???

    --
    !#@%*)anks for hanging up the phone, dear.
    1. Re:How many lines? by Shagg · · Score: 2

      Doh!

      --
      Unix is user friendly, it's just selective about who its friends are.
    2. Re:How many lines? by p3d0 · · Score: 2, Funny

      I wish someone would invent an HTML tag for ordered lists that would number the items automatically. ;-)

      --
      Patrick Doyle
      I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
    3. Re:How many lines? by mhesseltine · · Score: 2

      I'm sure that you're joking, but this may help out some HTML newbies. Use <ol> followed by <li> to enumerate list items.

      Example:

      1. Do something
      2. ????
      3. PROFIT!!!

      You can view the source to this (if you can stand reading past all of the slashcode)

      --
      Overrated / Underrated : Moderation :: Anonymous Coward : Posting
    4. Re:How many lines? by vsync64 · · Score: 1
      You can view the source to this (if you can stand reading past all of the slashcode)

      If you're using a Mozilla that's anything close to recent, you can select the body of mhesseltine's post, right-click, and select "View Selection Source".

      --
      TO BUY A NEW CAR WOULD MAKE YOU SEXUALLY ATTRACTIVE.
    5. Re:How many lines? by Anonymous Coward · · Score: 0

      2) Profit???

      Nah, that should be the last option.

  41. For those missing the problem, also … by Theatetus · · Score: 2, Informative

    First off, unicode 85H is NEXT LINE; ASCII 85H is ellipses. unicode 2028H is LINE SEPARATOR. 0AH and 0DH are the infamous LF and CR, which annoy MS/UNIX text converters.

    Anyways, it sounds like a problem comes up here:
    must be as if the XML processor normalized all line breaks in external parsed entities (including the document entity) on input, before parsing,

    That seems to be the sticking point there.

    I suppose the charset is specified in the document, but then again I'm not sure how literally they intend implementors to take the phrase "before parsing", since getting at the charset description involves some degree of parsing the document
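For reference, the normalization being quoted can be sketched in a few lines. This follows my reading of the end-of-line rules in the XML 1.1 draft and works on text that has already been decoded, which sidesteps (but does not answer) the "before parsing" question; the function name is made up.

```python
import re

# Sequences that XML 1.1 treats as a line end. The two-character forms come
# first so that CR LF and CR NEL each collapse to a single line feed.
_LINE_END = re.compile("\r\n|\r\u0085|\u0085|\u2028|\r")


def normalize_line_ends(text):
    """Replace every XML 1.1 line-end sequence in decoded text with LF."""
    return _LINE_END.sub("\n", text)
```

Under XML 1.0 only the CR LF and lone CR cases would be translated, which is the whole difference being argued about.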

    --
    All's true that is mistrusted
    1. Re:For those missing the problem, also … by Anonymous Coward · · Score: 5, Informative

      From the XML 1.1 spec

      The W3C's XML 1.0 Recommendation was first issued in 1998, and despite the issuance of many errata culminating in a Second Edition of 2000, has remained (by intention) unchanged with respect to what is well-formed XML and what is not. This stability has been extremely useful for interoperability. However, the Unicode Standard on which XML 1.0 relies for character specifications has not remained static, evolving from version 2.0 to version 3.1 and beyond. Characters not present in Unicode 2.0 may already be used in XML 1.0 character data. However, they are not allowed in XML names such as element type names, attribute names, enumerated attribute values, processing instruction targets, and so on. In addition, some characters that should have been permitted in XML names were not, due to oversights and inconsistencies in Unicode 2.0

      So XML 1.0 used Unicode 2.0, but not properly. XML 1.1 fixes that, and defines that all Unicode 3.2 byte pairs are now valid when used in an XML document. As part of this change, XML 1.1 also correctly allows the use of the Unicode 0x0085 NEL character as an EOL marker, which is totally compliant and consistent with the Unicode 3.2 specification.

      In other words, if you're using any character encoding other than Unicode 3.2, your XML document isn't compliant with XML 1.1 and you shouldn't ever expect ISO-Latin 0x85 to be displayed as an ellipses.

    2. Re:For those missing the problem, also … by Anonymous Coward · · Score: 0

      There is no 85H in ASCII. Maybe you mean "ANSI", the generic name that Microsoft OSes use when they mean "platform default", which will be something like Windows-125x (x varies by region).

    3. Re:For those missing the problem, also … by Theatetus · · Score: 1

      Good catch

      Maybe you mean "ANSI", the generic name that Microsoft OSes use when they mean "platform default"

      It's even worse: I was thinking of HTML escape sequences. &#133; (decimal 0x85) is ellipses. I humbly cower.

      --
      All's true that is mistrusted
    4. Re:For those missing the problem, also … by J.+Random+Software · · Score: 2

      Gah. Even if your document is served with charset=windows-1252 (so that a literal 0x85 byte in the document should be interpreted as U+2026 HORIZONTAL ELLIPSIS), character references like &#133; always refer to Unicode codepoints. &#x85; or &#133; can only mean U+0085 NEXT LINE no matter what charset the document is in. Do browsers still screw this up?
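The distinction between a raw byte and a character reference can be checked directly. A small sketch, assuming Python's cp1252 codec as the stand-in for windows-1252 and its XML 1.0 parser for the reference:

```python
import xml.etree.ElementTree as ET

# A raw 0x85 byte means whatever the declared charset says it means.
from_cp1252 = b"\x85".decode("cp1252")

# A numeric character reference always names a Unicode code point,
# regardless of how the document itself is encoded.
from_reference = ET.fromstring("<p>&#x85;</p>").text
```

The first value is U+2026 (the ellipsis) and the second is U+0085 (NEXT LINE). HTML-oriented tools may deliberately map the reference to the ellipsis for compatibility, which is the browser behavior being complained about.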

  42. This is just not that big a deal. by BobGregg · · Score: 4, Insightful

    As the quote from IBM points out in the article, this issue is just a subset of the larger problem with Unicode compatibility in XML 1.0. And as someone else pointed out, if document creators are using the XML headers appropriately to begin with, then parsers would handle documents correctly anyway. I'm also willing to bet that the percentage of existing XML documents which contains this particular character (0x85), and which are not already on IBM mainframes, is *extremely* small.

    Face it: this just isn't that big a deal. It's good for industry acceptance and propagation of the standard, at very low cost. Move along, there's nothing to see here.

  43. IBM means Infernal Blurry Mariachis by LittleBigScript · · Score: 0, Offtopic

    main()
    {
    bool misspellings = true;

    "IBM also dismissed the notion it was railroading the XML working group to serve its own ends."

    What happens when a railroad meets the end of the track? I think I've seen something like that in the movies...

    "The debate over XML 1.1, formerly known as XML Blueberry, has raged on Internet discussion forums"

    Yes, the debate has been tremendous. But, of course it has something to do with godzilla...I mean, mozilla.
    Also, there seems to be a lot of talk of unicorns...I mean, unicode.

    "...an increasingly global standard for representing characters in computerized text."

    First Earth, then the Milky Way, and eventually the Whole Universe(tm).

    Ok, I am done with the sarcasm.(obligatory Simpsons quote)

    "Do I know what rhetorical means?"
    -Homer

  44. In other news... by T.E.D. · · Score: 4, Funny

    EBCDIC has won yet another chess game against the Grim Reaper.

  45. since when is xml whitespace-dependent? by forevermore · · Score: 2, Interesting

    I'll admit that I don't know much about the technical side of xml (and I really can't see all of the great advantages to it, either), but since when does a parser care about whitespace? Wouldn't it make more sense to let the newline character match that of the overlying OS so people can actually TYPE those newline characters? Switching to unicode is fine and dandy, but what about all of those legacy systems that don't support it?

    --
    Do you really need reason for beer? Wingman Brewers
  46. Additional Newline Character, not a Replacement by Anonymous Coward · · Score: 2, Insightful

    Don't get too worked up about this. They aren't deleting any newline characters like \n. Just adding one to the list that XML considers whitespace.

  47. Embrace and extend, anyone? by dave-fu · · Score: 1, Flamebait

    If one were to replace "IBM" with "Microsoft", I wonder what sort of self-righteous fury would be rained down for this sort of legacy document-smashing behavior (even if at the root, it's just bringing things in line with Unicode).

    --
    Easy does it!
    This comment has been submitted already, 276865 hours , 59 minutes ago. No need to try again.
  48. XML sucks anyway by Anonymous Coward · · Score: 0

    Just use Lisp sexps.

    and if you want a markup language, instead of an inefficient tree description language, XML's no use anyway...

  49. Crying wolf by Salamander · · Score: 3, Insightful

    The real issue here is standards-committee wankery and the tendency of some people to accuse anyone who doesn't agree with them 100% of being proprietary, monopolistic, etc. This is exactly the sort of non-issue that doesn't deserve such rhetoric, and those who insist on crying wolf should be ejected from the process until they learn that "collaboration" doesn't mean "we rubber-stamp your ideas just because they're yours".

    --
    Slashdot - News for Herds. Stuff that Splatters.
  50. no. it's fine. by MenTaLguY · · Score: 4, Informative

    Doesn't make this XML files uneditable with most editors, like vi, pico and gedit? They all use \n (byte 10) as newline character.



    No. XML is defined in terms of Unicode, but XML files can be stored in any encoding with a known mapping to Unicode.



    Most XML files these days are using iso-8859-1 or UTF-8, both of which manage fine in vi/pico/gedit.



    chars 32-127 are identical in ASCII and Unicode, and iso-8859-1 is exactly identical to the bottom 8-bits worth of Unicode.



    Also, UTF-8 is an encoding of the full Unicode range that is backwards-compatible with 7-bit ASCII.



    in any case, note the entry for LF in your parent post -- 000A (hex) = 10 (decimal).
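Those claims about the encodings are easy to verify. A short sketch using Python's built-in codecs; the sample line is invented.

```python
# ISO-8859-1: each byte value maps to the Unicode code point of the same number.
latin1_matches = all(
    bytes([n]).decode("iso-8859-1") == chr(n) for n in range(256)
)

# UTF-8: pure ASCII text encodes to exactly the same bytes as ASCII.
line = "<doc>plain text</doc>\n"
same_bytes = line.encode("utf-8") == line.encode("ascii")

# LF is code point 0x000A, i.e. byte 10, in all three encodings.
lf_byte = "\n".encode("utf-8")[0]
```

So an editor that splits lines on byte 10 keeps working on XML stored in any of these encodings.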

    --

    DNA just wants to be free...
  51. Re:Sad news ... Stephen King dead at 55 by Stephen+King · · Score: 1, Funny

    No, I'm not.

    --
    Karma: Undead.
  52. ever wondered what alt-gr is for? by DrSkwid · · Score: 4, Funny

    damn that lameness filter

    Reason: Your comment looks too much like ascii art.

    ever wondered what alt-gr is for?

    raw a e i o u 4 `
    alt-gr á é í ó ú ¦

    --
    There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
    1. Re:ever wondered what alt-gr is for? by Anonymous Coward · · Score: 0

      American keyboards don't have that key.

    2. Re:ever wondered what alt-gr is for? by pacc · · Score: 2

      That key is dead.
      That's what dead-keys are for.

      Alt-Gr is used to give you straining-injuries.

    3. Re:ever wondered what alt-gr is for? by Anonymous Coward · · Score: 0

      What about compose?

      ae - æ
      o' - ó
      n- - ñ
      xo -
      c, - ç
      !! -
      || - ¦
      __ -
      .. -
      *0 -
      c/ -

      It's fun.

      It's telling me to use fewer junk characters. I'm thinking I'll just pad my comment with extra stuff to make it appear as if it has fewer of these so-called "junk characters." Hopefully a few sentences will be enough to satisfy the filter. If not, I may just have to resort to submitting it as code. I'd rather not do that, since it isn't code.

      That paragraph wasn't enough. Maybe it counts the number of characters, rather than the percentage of the comment. That would mean that I am writing all this for nothing. I'd hate to think I'm doing all this work for nothing. I want to evade the filter, I don't want to have to resort to working around it. If that happens, then the terrorists have won.

      Looks like the terrorists lose again.

  53. The other side of the argument by smallpaul · · Score: 4, Interesting

    The Slashdot commentary has been pretty one-sided so I'll try and address the other side. First, IBM has said that this fix is for their mainframe customers, not for themselves. But nobody in the XML world has heard from these customers. As far as I know, no user has submitted a request for this NEL feature. No user has sent a message to the many XML mailing lists. No user has posted to Slashdot. Updating all of the XML parsers in the world is really expensive and if the mainframers don't care enough about the problem to storm the gates then maybe it isn't hurting them that badly. So from a democratic point of view, we're going to make life harder for the people who care enough to scream out loud in order to make life easier for the small minority who perhaps are not even that badly impacted.

    Further discussion is on xml.com.

    1. Re:The other side of the argument by Anonymous Coward · · Score: 0

      The thing is, I don't think "mainframers" work that way. (I'm fairly sure it would never occur to my wife, queen of COBOL.) When they have a problem, they go to their vendor, not to the community, because they've given the vendor hundreds of thousands (millions, perhaps) of dollars for hardware, licenses, and support. To their mind, it's IBM's job to get this sorted out for them, and that's what IBM is trying to do.

    2. Re:The other side of the argument by Anonymous Coward · · Score: 0

      This is just the inevitable operation of Goldfarb's First Law--proof that not even languages designed for the Desperate Perl Hacker are immune.

    3. Re:The other side of the argument by Anonymous Coward · · Score: 1, Insightful


      Updating all of the XML parsers in the world is really expensive


      Ummmm.... Updating all the XML 1.1 parsers in the world would be fairly easy at this time. (seeing that the spec isn't final, there isn't any such thing as an XML 1.1 parser yet)

      This change wouldn't require any change to the existing XML 1.0 parsers since it is an XML 1.1 feature.

  54. Mod Article -1 Troll by kalidasa · · Score: 4, Insightful

    This is just a way to spark a holy war "my newline character is better than yours" debate. The proposal makes perfect sense - it brings XML into line with Unicode and ISO-10646.

  55. Upgrading OS vs Upgrading XML 1.1 by ebresie · · Score: 2, Interesting

    Okay...maybe I'm looking at this incorrectly, but...

    If IBM's problem is that they don't want to force everyone to update their mainframes and cause them a headache... won't they still have to upgrade their mainframes to support XML 1.1 with new XML 1.1 compatible parsers?

    --

    Eric B
    ebresie@gmail.com
  56. End of Line Character? by Lucas+Membrane · · Score: 2

    Who ever heard of an end-of-line character on a mainframe? Everything was always fixed length, with the length in the DCB, or variable length, with a length prefix, at least back when I used them. There was a "record mark" character, but that was used back on the 1401's, back in ancient times.

  57. Re:no. it's fine. by p3d0 · · Score: 2

    Just to clarify, what happens to a 0085 (NEL) character when a Unicode file is saved in an iso-8859-1 or UTF-8 encoding?

    --
    Patrick Doyle
    I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
  58. Doesn't seem to work here by Anonymous Coward · · Score: 0

    æe

  59. For all you W3 Standards Worshipers by hafidhahullah · · Score: 0, Flamebait

    Consider the brilliance of XHTML 2.0 abolishing the /> tag in favor of something that would read like this: <p> <line>public class HelloWorld {</line> <line>public static void main (String[] args){ </line> <line>System.out.println("Hello world!"); </line> <line>}</line> <line>}</line> </p> Why in the world should you expect the XML standard to be in compliance with the Unicode standard?

  60. I don't work for IBM, but I agree by cornicefire · · Score: 2, Interesting

    I'm dealing with some cross-platform XML these days. It's generally pretty wonderful, but the newline character is something that drives me a bit batty. If anyone can bring some unity to this disunity, I'm sure that all of the XML world and the Java world would be better off. It's an anachronism.

  61. Let's get rid of all newline characters by Fastolfe · · Score: 2

    There's a reason XML supports multiple types of newlines: because there are platforms out there that use those other types of newlines. XML 1.1 standardizes on a 0x0a newline, but recognizes that there are other types out there, and requires parsers to normalize this before parsing the XML. All the specification is doing is adding another type of newline that a fairly popular platform uses. This is no different than adding MacOS-specific and Windows-specific newline types to XML in the first place. The goal is to allow platforms to store XML data as a native text document instead of forcing newlines that cause XML documents to be treated as awkward binary data on those platforms that don't have XML-compatible newline conventions.
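
    A rough sketch of that normalization step, assuming the line-end rules as I understand them from the XML 1.1 text (CR LF, CR NEL, lone NEL, LINE SEPARATOR and lone CR all become LF). The helper name is hypothetical and this is not taken from any real parser.

```python
import re

# Line-end sequences that get translated to a single LF before parsing:
# CR LF, CR NEL, lone NEL, LINE SEPARATOR (U+2028), and any remaining lone CR.
# Two-character sequences are listed first so they are matched as a unit.
_XML11_EOL = re.compile("\r\n|\r\x85|\x85|\u2028|\r")

def normalize_newlines(text: str) -> str:
    """Hypothetical helper: collapse every recognized line end to U+000A."""
    return _XML11_EOL.sub("\n", text)

mixed = "dos\r\nmac\runix\nebcdic\x85unicode\u2028end"
print(normalize_newlines(mixed).split("\n"))
```

    After this step the rest of the parser only ever sees 0x0a, whichever platform wrote the file.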

    This whole thread is retarded. Few people posting all of this FUD seem to know the difference between a character encoding, a character set, and a pimple on their ass. This change in XML 1.1 changes nothing for anyone, except those that want to write XML 1.1 parsers and those on platforms that use this Unicode newline as their native newline character.

    If we're going to throw up such a fuss about this one addition (which in NO way breaks ANY existing XML 1.0 documents), why aren't we throwing up a fuss about including MacOS or Windows newlines into XML in the first place? GET RID OF THEM ALL! UNIX NEWLINES ARE THE ONLY TRUE NEWLINES!!#!!@#$

    Jesus, people...

  62. depends on the character set by MenTaLguY · · Score: 3, Informative

    It's the same thing that happens any time you take codepoints across character sets.

    iso-8859-1:

    NEL (from EBCDIC) doesn't exist in iso-8859-1; what character gets used instead is up to the discretion of the tool doing the conversion. Since Unicode defines 0x0085 as a whitespace character, the tool could know to substitute another if that was desired.

    UTF-8:

    It's a non-lossy encoding of Unicode. All characters above 0x007f get stored as multi-byte sequences. 0x0085 becomes 0xc2 0x85.
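
    To make the two cases concrete, here is a small sketch using Python's codecs. Note the assumption: Python's "iso-8859-1" codec maps all 256 byte values (C1 controls included), so it happens to keep the byte 0x85; a stricter conversion tool may behave differently, as described above.

```python
NEL = "\u0085"

# UTF-8: code points above 0x7f become multi-byte sequences.
utf8_bytes = NEL.encode("utf-8")
print(utf8_bytes.hex())  # c285

# Python's iso-8859-1 codec passes the C1 control through as one byte.
latin1_bytes = NEL.encode("iso-8859-1")
print(latin1_bytes.hex())

# A converter that prefers a plain line feed can substitute before encoding.
substituted = "a\u0085b".replace(NEL, "\n").encode("iso-8859-1")
print(substituted)
```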

    --

    DNA just wants to be free...
  63. UTF-8 by MenTaLguY · · Score: 3, Informative

    If you want to know more about UTF-8, see RFC 2279.

    --

    DNA just wants to be free...
  64. Some clarifcation by kune · · Score: 3, Informative

    The problem is caused by the EBCDIC code. It has separate codes for CARRIAGE RETURN (CR), LINE FEED (LF) and NEW LINE (NL). The question is how to convert NL into Unicode: you could map it to LF, but then it can't be mapped back. You could also map it to LINE SEPARATOR U+2028, but IBM seems to think that mapping it to the NEL (Next Line) U+0085 control character is appropriate. This is supported by Unicode Standard Annex UAX #13. However, in UAX #14 U+0085 does not have the line-breaking property, so there is still some inconsistency in the Unicode standard. But I don't think this is a major issue. It only means that you will have problems editing XML documents generated in EBCDIC in some of the worse editors. We have lived all these years with the DOS CRLF/UNIX LF problem, so we will survive this too.
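
    The round-trip argument can be sketched as follows. This assumes Python's "cp037" codec (one common EBCDIC code page) as a stand-in for the mainframe encoding, and the byte values 0x15 for NL and 0x25 for LF in that code page.

```python
# Assumption: cp037 represents the EBCDIC variant in question.
EBCDIC_NL = b"\x15"
EBCDIC_LF = b"\x25"

nl = EBCDIC_NL.decode("cp037")
lf = EBCDIC_LF.decode("cp037")
print(f"NL -> U+{ord(nl):04X}, LF -> U+{ord(lf):04X}")

# NL and LF land on different code points, so converting back is lossless.
print(nl.encode("cp037") == EBCDIC_NL, lf.encode("cp037") == EBCDIC_LF)

# Folding NL into LF instead loses the distinction on the way back.
collapsed = nl.replace("\u0085", "\n").encode("cp037")
print(collapsed == EBCDIC_NL)
```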

  65. Re:no. it's fine. by Anonymous Coward · · Score: 0

    wow, your knowledge of newlines is not just theoretical!

  66. Seriously, 0x85 is an ellipsis by spitzak · · Score: 2
    I don't think 0x85 (or any other character > 0x7f) should be used for any kind of control or newline purposes.

    The main reason is that it does not pass cleanly through UTF-8 encoding, so it is far too easy to write software that misses it, and writing correct software is inefficient. Anything that assigns a non-tokenizing meaning to characters > 0x7f will mean the UTF-8 must be decoded to parse the file. If instead all characters > 0x7f are considered parts of identifiers/words then the parser can easily be byte-based and use lookup tables to correctly match words. In fact this can be very safe because it will prevent illegal UTF-8 encodings: provided the parsers check the encoding when *adding* words to their hash tables, they have no need to check encodings when looking up words, as illegal long encodings will not match.

    In addition there is no good reason for the "hole" in 0x80 through 0x9F. It was done so that systems that stripped the high bit would not accidentally map a foreign letter to a control code. However all such systems are long obsolete and would barf on UTF-8 encoding anyway (since it does use these codes).

    My recommendation, strange as it may seem for something on Slashdot, is to use the MicroSoft assignments used in Word for these codes (usually seen as the "smart quotes" output). This is apparently called the "CP1252 superset" in MicroSoft-speak. In this encoding 0x85 is the ellipsis (...).

    Why use this arbitrary standard from the evil company? Mostly because it is by far the most-used standard for these codes. Also I think the assignments are pretty good, they were selected for actual use in the (admittedly euro-centric) modern world of office software, and are not tainted with the nasty political correctness that has so messed up and delayed Unicode assignments.

    I recommend two things: first that all characters >0x7f in Unicode be considered "part of an identifier" by all parsers. Second that Unicode have the MicroSoft CP1252 assignments added to the codes 0x80-0x9f.
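
    For reference, a short sketch of how the same byte reads under the two tables being compared, using Python's "cp1252" and "iso-8859-1" codecs as assumed stand-ins:

```python
byte = b"\x85"

as_cp1252 = byte.decode("cp1252")      # Windows "smart punctuation" table
as_latin1 = byte.decode("iso-8859-1")  # C1 control range

print(f"cp1252:     U+{ord(as_cp1252):04X}")
print(f"iso-8859-1: U+{ord(as_latin1):04X}")

# Unicode already gives the ellipsis its own code point, U+2026,
# which takes three bytes in UTF-8.
ellipsis_utf8 = "\u2026".encode("utf-8")
print(ellipsis_utf8.hex())
```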

    1. Re:Seriously, 0x85 is an ellipsis by Anonymous Coward · · Score: 0

      You can use 0x85 in an XML file to mean ellipses
      if you add the header

      and most mainstream XML parsers will accept it.

      An XML file can be written in almost any encoding. When it is parsed, it is converted to Unicode: most modern programming languages and operating systems use Unicode now, so this usually does not represent any additional work, since a system would be doing the encoding conversion anyway. Not only that, but parsing/transcoding cost is typically trivial compared to the cost of creating objects.

      XML 1.0 used Unicode 2.0. In Unicode 2.0, the character U+0085 is not defined but left vacant. So people who were sending the character in their XML 1.0 documents were doing so by private arrangement. Now that Unicode 3.n defines more clearly that U+0085 is NEL, this character is not available for private use anymore.

      So XML 1.1 is backwards-compatible in the sense that you can send the same data between two systems as you could with XML 1.0: the incompatibility is merely that some characters have to use numeric character references rather than direct characters. It is safer that way: when someone sends ellipses and forgets to set the correct encoding declaration, their document will not be well-formed, which prevents corrupt data being received at the other end.
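
      A small sketch of the numeric character reference approach, assuming Python's standard `xml.etree.ElementTree` parser and the `xmlcharrefreplace` error handler. The element name is made up for illustration.

```python
import xml.etree.ElementTree as ET

ELLIPSIS = "\u2026"

# Encoding to ASCII with "xmlcharrefreplace" turns every non-ASCII
# character into a numeric character reference.
body = f"<note>wait{ELLIPSIS}</note>".encode("ascii", "xmlcharrefreplace")
print(body)

# The parser resolves the reference back to the real character,
# independent of the encoding the file was stored in.
note = ET.fromstring(body)
print(note.text == "wait" + ELLIPSIS)
```

      Because the stored bytes are plain ASCII, a missing or wrong encoding declaration cannot silently turn the ellipsis into something else.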

      And if you were sending those control characters in XML 1.0, you should have been aware that you were stretching what XML was designed to do anyway. In particular, you shouldn't have been sending your files over HTTP as text/* (because controls are not allowed in text/* files).

      As for the question on why not make all characters > 0x7F allowed anywhere, please see the paper http://www.topologi.com/public/XML_Naming_Rules.html and also the article http://www.xml.com/pub/a/2002/09/18/euroxml.html which give background information.

      The arrival of Unicode 3.2, the advent of the Euro, the need to cope with IBM systems' NEL, the desire to make XML more robust, the desire to extend XML to send more kinds of data as text, the desire to conform to text/* more, the need to introduce support for normalization, and the desire to free up XML names for more characters: any one of these is probably not enough to justify XML 1.1 but together they are not bad.

      Rick

  67. XML, XML... LISP! by panserg · · Score: 3, Funny

    Compare the elegance of LISP brackets to overbloated XML tags. Compare the ability to share same syntax with code and data in LISP with oversimplified XML tags. Make a conclusion.

    --
    "I shall explain this by waving my hands about in an appropriate manner." -- Cambridge University Math Dept.
  68. Re:Mod Article -1 Troll by spitzak · · Score: 2
    No it doesn't. Unicode also defines paragraph and line separator characters, in the same document that describes this NEL character, and those are not treated as newlines.

    There is a good reason for this: those characters are hard to identify in UTF-8 and some other encodings without actually decoding to Unicode. Unfortunately this is also true of 0x85. Therefore I think this proposal is bad.
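
    To illustrate the difficulty, here is a hedged sketch of scanning raw UTF-8 bytes for NEL. The sample string and variable names are made up; the point is only that the single byte 0x85 is ambiguous while the full two-byte sequence is not.

```python
# UTF-8 forms of the three Unicode line-break characters mentioned above.
NEL = "\u0085".encode("utf-8")  # two bytes
LS = "\u2028".encode("utf-8")   # three bytes
PS = "\u2029".encode("utf-8")   # three bytes

# U+0145 is included because its UTF-8 form also ends in the byte 0x85.
data = "caf\u00e9 \u0145 one\u0085two".encode("utf-8")

# Naive scan: 0x85 also shows up as a continuation byte of other characters.
naive_hits = [i for i, b in enumerate(data) if b == 0x85]

# Sequence scan: match the complete encoded form instead.
real_hits = []
pos = data.find(NEL)
while pos != -1:
    real_hits.append(pos)
    pos = data.find(NEL, pos + 1)

print(naive_hits, real_hits)
```

    A byte-oriented tool can still find these characters, but it has to look for multi-byte sequences rather than test one byte at a time.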

    Also there are a *lot* of people (all users of MicroSoft Word) who think this character is an ellipsis.

  69. Re:Mod Article -1 Troll by Anonymous Coward · · Score: 0


    Also there are a *lot* of people (all users of MicroSoft Word) who think this character is an ellipsis.


    Ummmm... Anyone using Microsoft Word to edit XML is going to have quite a few problems...

  70. Re:Mod Article -1 Troll by kalidasa · · Score: 2

    He's got a point, though I don't know if he's right with everything he's saying. You don't use MS Word to edit XML, but you often use XML to encode text that started out (on someone else's desktop, perhaps) as Word text.

  71. Can someone explain to me... by JordanH · · Score: 2
    why whitespace is significant in ANY XML spec?

    I know it is, I just don't understand the rationale.

    I've always thought that this was just begging for problems. I don't understand the upside. Seems like whitespace is presentation, which should be handled by Formatting Objects or something.

  72. mod parent -5 mis-informative by Anonymous Coward · · Score: 0

    Full details my ass. What a load of crap.

  73. Re:Mod Article -1 Troll by Anonymous Coward · · Score: 0


    you often use XML to encode text that started out (on someone else's desktop, perhaps) as Word text.


    That's fine, but still shouldn't be using 0x85 in the XML to represent "..." in the document.

  74. Writing correct software is inefficient? by GunFodder · · Score: 2

    This is the kind of approach that many companies (especially Microsoft) use that makes life difficult for others. Speed at the cost of correctness sounds good in the short term but always ends up biting you in the ass later. Efficient yet incorrect code is hard to maintain because it usually only makes sense to the original author. It breaks more easily than correct code since it is less tolerant to variations in the input. And eventually the performance delta is erased by hacks added in later revisions to get around problems caused by the original incorrect behavior.

    1. Re:Writing correct software is inefficient? by spitzak · · Score: 2
      I believe a parser that does not have to decode UTF-8 will be both faster and more correct. This is because it is much simpler and can use well-established tools that work with bytes. Even with modern memory sizes, a 256-entry lookup table is far more practical than a 2^31-entry lookup table.

      For these reasons I would greatly prefer a scheme where all characters greater than 0x7f are treated identically.
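
      A minimal sketch of the kind of byte-based scheme described here, assuming ASCII letters, digits, '_' and every byte >= 0x80 count as word bytes and everything else is a delimiter. The table and function names are hypothetical, and this is not how any particular XML parser works.

```python
# 256-entry table: True where the byte may be part of a word.
WORD_BYTE = [False] * 256
for b in range(256):
    if b >= 0x80 or chr(b).isalnum() or b == ord("_"):
        WORD_BYTE[b] = True

def split_words(data: bytes) -> list[bytes]:
    """Split UTF-8 bytes into words without decoding them."""
    words, start = [], None
    for i, b in enumerate(data):
        if WORD_BYTE[b]:
            if start is None:
                start = i  # a new word begins here
        elif start is not None:
            words.append(data[start:i])
            start = None
    if start is not None:
        words.append(data[start:])
    return words

print(split_words("na\u00efve caf\u00e9\nx".encode("utf-8")))
print(split_words("one\u0085two".encode("utf-8")))
```

      The second call shows the trade-off: with every byte above 0x7f treated alike, a NEL between two words is simply absorbed into one token.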

  75. Or you can think about it a little more... by GCP · · Score: 3, Insightful

    ...and you can help support it. The complaints I keep reading are all about how tough it will be for those poor XML tools makers. As an XML tool maker, I assure you that this upgrade is no sweat. No one with the technical skill to create a serious XML tool is going to be challenged by this. The universality goals of Unicode and XML mesh very nicely and it's worth continuing to incorporate the lessons learned by each into the other.

    --
    "Those who have never entered upon scientific pursuits know not a tithe of the poetry by which they are surrounded."
  76. XML Blueberry by the+endless · · Score: 1

    These proposals have been around since at least June 2001, when the W3C published their Requirements document for what was then called XML Blueberry and has since become XML 1.1.

    And the complaints date from then as well... Elliotte Rusty Harold complained almost as soon as the Requirements document and the first Working Draft were published. He makes a number of good points that highlight just how unnecessary XML 1.1 actually is. This link is actually him quoting himself from the time - the original post is probably available on the W3C forums, but I'm far too lazy to look.

  77. Re:Mod Article -1 Troll by kalidasa · · Score: 2

    That's fine, but still shouldn't be using 0x85 in the XML to represent "..." in the document.

    No, but a lot of the time, you're encoding text that was created by a user who doesn't know what the heck they're doing, and can't handle anything more sophisticated than MS Word.

  78. Last Post! by alpg · · Score: 1

    vi is [[13~^[[15~^[[15~^[[19~^[[18~^ a
    muk[^[[29~^[[34~^[[26~^[[32~^ch better editor than this emacs. I know
    I^[[14~'ll get flamed for this but the truth has to be
    said. ^[[D^[[D^[[D^[[D ^[[D^[^[[D^[[D^[[B^
    exit ^X^C quit :x :wq dang it :w:w:w :x ^C^C^Z^D
    -- Jesper Lauridsen from alt.religion.emacs

    - this post brought to you by the Automated Last Post Generator...