XML 1.1 Spec Hits Some Snags

Read the Unicode spec.... by vidarh · 2002-10-18 02:22 · Score: 5, Informative

Unicode 3.2 defines 0x85 as a newline character. This change just makes XML follow the Unicode spec, which isn't unreasonable considering that the parser is expected to use Unicode internally (or to act as if it does).

Re:Read the Unicode spec.... by gorilla · 2002-10-18 02:59 · Score: 5, Informative

It's more complicated than that. Unicode has
2029 - paragraph separator
2028 - line separator
000D - CR
000A - LF
0085 - NEL (Next Line)
Any of these could be interpreted as the end of a logical line.
Re:Read the Unicode spec.... by vidarh · 2002-10-18 03:10 · Score: 5, Informative

And if you read the XML 1.1 spec you'll see that all of the characters you've listed above except for 0x2029 are interpreted by XML 1.1 as the end of a logical line.
Re:Read the Unicode spec.... by Anonymous Coward · 2002-10-18 03:42 · Score: 2, Informative

Ah, dude, the newline is the least of your problems. 2-byte Unicode characters are not exactly backwards compatible with 1-byte (or, pedantically, 7-bit) ASCII.
Re:Read the Unicode spec.... by julesh · 2002-10-18 03:57 · Score: 5, Informative

No, not really. It means that *some* XML files can't be edited with these editors. But then that was true already; some might have used \r or some other of the list of characters.

What it *does* mean is that editors on other systems than Unix are able to edit XML files. It means I can create an XML file in DOS 'edit' which uses \r\n, or on a mac with an editor that might use \r, or on (apparently) an IBM system where the standard text editors use \u85.

This is absolutely essential. It does however mean that in order to support *all* XML files, you need to recognise *all* of those line endings. As always, it's easier to support a subset, but harder to support everything. However the fact that existing software works at all is very important, so I think they're moving in the right direction.
Re:Read the Unicode spec.... by innate · 2002-10-18 04:07 · Score: 5, Informative

There are no 2-byte Unicode characters, only encodings (such as UTF-16) which use two or more bytes to represent each character. Some Unicode characters, those not in the Basic Multilingual Plane (BMP), require more than two bytes to represent in UTF-16.
And 7-bit ASCII is a strict subset of UTF-8 encoding. UTF-8 encodes each character to one or more bytes, with characters up to 127 defined the same as in ASCII. If your text is strict 7-bit ASCII, it is also a UTF-8 file.
You could also use UTF-32 (UCS-4), which represents each character as 4 bytes, but that is overkill for most applications.
The main problem with multibyte encodings such as Shift-JIS and Big 5 was lead-byte detection: you couldn't jump into the middle of a string and determine if you were looking at the only, first, or second byte of a character. You had to start parsing at the beginning of the string in order to synchronize your character detection. Unicode has done away with this by strictly defining the lead byte ranges in such a way that there is never any ambiguity.
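
To make the synchronization point concrete, here is a minimal Python sketch. The helper name `char_start` is made up for this example; it relies only on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so you can back up from any byte offset to the start of the character that covers it.

```python
def char_start(data: bytes, offset: int) -> int:
    """Return the index of the first byte of the UTF-8 character covering offset."""
    # Continuation bytes always match 0b10xxxxxx; ASCII and lead bytes never do.
    while offset > 0 and (data[offset] & 0xC0) == 0x80:
        offset -= 1
    return offset


text = "a\u00e0\u2028z"        # 1-byte, 2-byte, 3-byte and 1-byte characters
data = text.encode("utf-8")    # 7 bytes in total

# Jump into the middle of the 3-byte character and find where it begins.
print(char_start(data, 5))
```

With Shift-JIS or Big 5 the same loop would not work, because a trail byte can have the same value as a lead byte or a single-byte character.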

--
No, I don't want to explore the Recycle Bin.
Re:Read the Unicode spec.... by kalidasa · 2002-10-18 04:13 · Score: 3, Informative

Uh, dude, Unicode characters are not necessarily 2-byte. UTF-8 characters are 1 byte if ASCII (real ASCII, the "pedantic" 128-character set), 2 bytes or more (up to 6, iirc) if not. UTF-16 characters are either 2 bytes or 4 bytes. UTF-32 characters are all 4 bytes. Read the spec.
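
If you'd rather check the byte counts than take anyone's word for it, a quick Python sketch like the one below will do. The sample characters and the helper name `encoded_lengths` are just picked for illustration. Note that Python follows the current UTF-8 definition, which stops at four bytes per character; the longer forms only existed in earlier definitions.

```python
samples = {
    "A": "A",                     # U+0041, plain ASCII
    "a-grave": "\u00e0",          # U+00E0
    "line separator": "\u2028",   # U+2028
    "outside BMP": "\U00010400",  # U+10400, needs a surrogate pair in UTF-16
}


def encoded_lengths(ch: str) -> tuple:
    """Byte counts for one character in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))


for name, ch in samples.items():
    print(name, encoded_lengths(ch))
```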

So? by Anonymous Coward · 2002-10-18 02:25 · Score: 5, Informative

1.0 : To simplify the tasks of applications, the characters passed to an application by the XML processor must be as if the XML processor normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.

1.1 : To simplify the tasks of applications, the characters passed to an application by the XML processor must be as if the XML processor normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating all of the following to a single #xA character:

the two-character sequence #xD #xA
the two-character sequence #xD #x85
the single character #x85
the single character #x2028
any #xD character that is not immediately followed by #xA or #x85.
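
As a rough illustration of the two rules quoted above, here is a small Python sketch. It assumes the text has already been decoded to Unicode, and the names `normalize_eol`, `_XML10_EOL` and `_XML11_EOL` are invented for this example; a real XML processor does this inside its input layer rather than with regular expressions.

```python
import re

# Two-character sequences are listed before the lone #xD so they win the match.
_XML10_EOL = re.compile("\r\n|\r")
_XML11_EOL = re.compile("\r\n|\r\x85|\x85|\u2028|\r")


def normalize_eol(text: str, version: str = "1.1") -> str:
    """Translate every line break recognised by the given XML version to #xA."""
    pattern = _XML11_EOL if version == "1.1" else _XML10_EOL
    return pattern.sub("\n", text)


mixed = "a\r\nb\rc\r\x85d\x85e\u2028f"
print(repr(normalize_eol(mixed, "1.1")))
print(repr(normalize_eol(mixed, "1.0")))
```

Under the 1.0 rule the #x85 and #x2028 characters pass through untouched, which is the whole difference between the two paragraphs.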

I don't get it, what's the problem here? Surely the 1.1 spec simply extends the available EOL characters. It certainly doesn't remove any existing characters that are present in the 1.0 spec. How does it break backwards compatibility?

Re:So? by Trusty+Penfold · 2002-10-18 02:30 · Score: 0, Informative

0x85 is à (a grave). So everyone in France?
Re:So? by vidarh · 2002-10-18 02:36 · Score: 5, Informative

In which character set? Certainly not in Unicode, so if anyone used 0x85 as à in XML documents using any Unicode encoding they've messed up. à (latin small letter a with grave) is 0xE0, and À (latin capital letter a with grave) is 0xC0.
Re:So? by Anonymous Coward · 2002-10-18 02:38 · Score: 1, Informative

Thanks, I was just trying to eyeball the Unicode database files myself then. Indeed, as you say, à is not 0x85 in Unicode, and as the major point of XML 1.1 is to add Unicode support, I don't see what the problem is with using 0x85 as an EOL character for the 1.1 specification.
Re:So? by vidarh · 2002-10-18 02:40 · Score: 5, Informative

To expand on that, in ISO-Latin in general and certainly ISO-Latin-1, and thus by extension Unicode (which maps to ISO-Latin-1 for code points in the range 0x00 to 0xff), the area 0x80 to 0x9f was on purpose not used for displayable glyphs in order not to cause interoperability problems with 7bit systems if an 8bit text was moved between systems and the 8bit was stripped off.
So unless you are using a non-Unicode, non-ISO-Latin encoding there are no printable characters in that range, and if you're using another character set you will need to remap the characters before considering any of the rules in the XML spec anyway, since those rules refer to the Unicode codepoints.
Re:So? by Sir+Tristam · 2002-10-18 02:45 · Score: 5, Informative

0x85 is à (a grave). So everyone in France?
No, you're looking at the extended ASCII chart. What this is talking about is Unicode. A Unicode 0x0085 is the control character NEL (http://www.unicode.org/charts/PDF/U0080.pdf, page 3). NEL is NExt Line.
Chris Beckenbach
Re:So? by greenius · 2002-10-18 03:28 · Score: 4, Informative

Hex dump of this message:

30 78 38 35 20 69 73 20 e0 20 20 28 61 20 67 72 : 0x85 is à (a gr
61 76 65 29 2e 20 20 53 6f 20 65 76 65 72 79 6f : ave). So everyo
6e 65 20 69 6e 20 46 72 61 6e 63 65 3f 0a 09 09 : ne in France?...

--
I copied this sig from someone else (but where did they get it from?)
Re:So? by Anonymous Coward · 2002-10-18 04:38 · Score: 1, Informative

So unless you are using a non-Unicode, non-ISO-Latin encoding there are no printable characters in that range, and if you're using another character set you will need to remap the characters before considering any of the rules in the XML spec anyway, since those rules refer to the Unicode codepoints.

But, unfortunately, the very commonly used Windows codepage 1252 puts printable glyphs in that code point range. Rather, recent versions of CP 1252 (not all versions of CP1252 are the same) have put things like trademark signs in that character codepoint range. Keep in mind that Microsoft still dominates the computer software industry more so than IBM and their attempts to keep things open and standards body approved.
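
The whole "what is 0x85" argument in this thread comes down to which table you read the byte through. A short Python sketch shows it; the choice of codecs is my assumption about what each poster had in mind (`cp437` for the DOS "extended ASCII" chart, `cp1252` for Windows, `latin-1` for ISO-Latin-1 and the first 256 Unicode code points).

```python
import unicodedata

byte = b"\x85"
for codec in ("latin-1", "cp1252", "cp437"):
    ch = byte.decode(codec)
    # Control characters have no Unicode name, so fall back to a label.
    print(codec, f"U+{ord(ch):04X}", unicodedata.name(ch, "<control>"))
```

Same byte, three different characters, which is why the XML rules are stated in terms of Unicode code points and not raw byte values.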

What about poor old Acorn users? by PhilHibbs · 2002-10-18 02:27 · Score: 3, Informative

I don't know if it's still the same in RiscOS, but the BBC Micro used 0x0A+0x0D as its end-of-line marker. Why doesn't XML support this? If 1.1 is going to modify the end-of-line specification, then this is the perfect time to correct this glaring omission.

Complain by almeida · 2002-10-18 02:42 · Score: 3, Informative

Don't like it? Complain. Comments can be viewed here.

XML 1.1 - Problem? by Plud · 2002-10-18 02:58 · Score: 5, Informative

A newline character should have no impact with respect to backwards compatibility. The only negative impact with regard to a newline character should be confined to poorly written DOM code that parses out all nodes instead of just the relevant nodes. Similar issue with SAX. Even if there were a backwards compatibility issue with a new XML spec, most people define their version number in their documents so the parser knows which spec to follow while parsing it.

Full Details by Matts · 2002-10-18 02:58 · Score: 5, Informative

Full details of why this has the potential to break things are on the XML news site Cafe Con Leche.

Please read that before making uninformed comments - news.com isn't where you'll find technical information about this problem.

--

Matt. Want XML + Apache + Stylesheets? Get AxKit.

Re:Full Details by MenTaLguY · 2002-10-18 04:57 · Score: 3, Informative

If all these people want to use 0x85 in their XML 1.1 documents, then they'll have to properly convert them to Unicode as the specification allows. Surprising, that.

Or specify iso-8859-1 as their encoding in the XML prologue, which they were supposed to do in XML 1.0 anyway.

--

DNA just wants to be free...

The proposed change by Srin+Tuar · 2002-10-18 03:01 · Score: 4, Informative

Rather than reading through the whole spec:
here is a summary of just the proposed change.

It seems to comply with Unicode just fine, I don't see what the controversy is really.

For those missing the problem, also … by Theatetus · 2002-10-18 03:09 · Score: 2, Informative

First off, Unicode 85H is NEXT LINE; 85H in Windows codepage 1252 (often loosely called "extended ASCII") is an ellipsis. Unicode 2028H is LINE SEPARATOR. 0DH and 0AH are the infamous CR/LF which annoy MS/UNIX text converters.

Anyways, it sounds like a problem comes up here:
must be as if the XML processor normalized all line breaks in external parsed entities (including the document entity) on input, before parsing,

That seems to be the sticking point there.

I suppose the charset is specified in the document, but then again I'm not sure how literally they intend implementors to take the phrase "before parsing", since getting at the charset description involves some degree of parsing the document.

--
All's true that is mistrusted

Re:For those missing the problem, also … by Anonymous Coward · 2002-10-18 03:20 · Score: 5, Informative

From the XML 1.1 spec

The W3C's XML 1.0 Recommendation was first issued in 1998, and despite the issuance of many errata culminating in a Second Edition of 2000, has remained (by intention) unchanged with respect to what is well-formed XML and what is not. This stability has been extremely useful for interoperability. However, the Unicode Standard on which XML 1.0 relies for character specifications has not remained static, evolving from version 2.0 to version 3.1 and beyond. Characters not present in Unicode 2.0 may already be used in XML 1.0 character data. However, they are not allowed in XML names such as element type names, attribute names, enumerated attribute values, processing instruction targets, and so on. In addition, some characters that should have been permitted in XML names were not, due to oversights and inconsistencies in Unicode 2.0

So XML 1.0 used Unicode 2.0, but not properly. XML 1.1 fixes that, and defines that all Unicode 3.2 characters are now valid when used in an XML document. As part of this change, XML 1.1 also correctly allows the use of the Unicode 0x0085 NEL character as an EOL marker, which is totally compliant and consistent with the Unicode 3.2 specification.

In other words, if you're using any character encoding other than Unicode 3.2, your XML document isn't compliant with XML 1.1 and you shouldn't ever expect ISO-Latin 0x85 to be displayed as an ellipsis.

Re:how hard is it to do parsing? by vidarh · 2002-10-18 03:18 · Score: 3, Informative

This isn't about changing existing files, but about changing software that is in production use on mainframes. And it isn't IBM, but IBM's mainframe customers and anyone dealing with their mainframe-using customers that would get the work.

Re:What? No newline? by dossen · 2002-10-18 03:32 · Score: 5, Informative

Where did you get that awful idea of using />? The proper way is to enclose the paragraph in p-tags like so:

<p> Your text here</p>

Using the paragraph tags as large linebreaks is a very bad habit from the bad old days of the web. Please head to W3C and study the recent standards, and validate your documents before publishing (using a validator, not a browser).
Ohh... And this is actually an issue about XML 1.1 Unicode support, so worrying about HTML is quite premature (XHTML is still XML 1.0, and will remain so until XML 1.1 becomes a standard, or recommendation in W3C-speak).

Re:New Newline Character? by gorilla · 2002-10-18 03:41 · Score: 5, Informative

Each of them has a different function. 000A and 000D are for compatibility with ASCII. 0085 is for a unified character to replace the 000D 000A pair used on some OS's. However, some programs (e.g. notepad) use line breaks when they really mean paragraph separators, so Unicode defined two codes which mean REAL line separator, and REAL paragraph separator. This report explains it quite clearly.

no. it's fine. by MenTaLguY · 2002-10-18 03:57 · Score: 4, Informative

Doesn't this make XML files uneditable with most editors, like vi, pico and gedit? They all use \n (byte 10) as the newline character.

No. XML is defined in terms of Unicode, but XML files can be stored in any encoding with a known mapping to Unicode.

Most XML files these days are using iso-8859-1 or UTF-8, both of which manage fine in vi/pico/gedit.

chars 32-127 are identical in ASCII and Unicode, and iso-8859-1 is exactly identical to the bottom 8 bits' worth of Unicode.

Also, UTF-8 is an encoding of the full Unicode range that is backwards-compatible with 7-bit ASCII.
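
Both of those compatibility claims are easy to check with a few lines of Python. This is only a sanity check using Python's built-in codecs, with a made-up sample string, not anything from the XML spec itself.

```python
# Every byte value decodes under iso-8859-1 to the code point with the same number.
latin1_matches = all(bytes([i]).decode("iso-8859-1") == chr(i) for i in range(256))

# Pure 7-bit ASCII text produces identical bytes under ASCII, iso-8859-1 and UTF-8.
sample = "<doc>plain text\n</doc>"
same_bytes = (
    sample.encode("ascii") == sample.encode("iso-8859-1") == sample.encode("utf-8")
)

print(latin1_matches, same_bytes)
```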

in any case, note the entry for LF in your parent post -- 000A (hex) = 10 (decimal).

--

DNA just wants to be free...

Re:What do they mean, "XML 1.0 chokes"? by Dapnant · 2002-10-18 04:01 · Score: 2, Informative

Actually your understanding of whitespace in XML is almost completely incorrect.

Many XML applications ignore whitespace (after parsing). XML parsers are prohibited from deleting any whitespace that might be part of the data in a document. The xml:space attribute allows a document to indicate places where the author or encoder encourages normalization of space in some way.
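
You can see the split between parser and application with Python's standard `xml.etree.ElementTree`. The document string and the `collapse` helper are invented for this sketch; the point is only that the parser hands over every space, and any trimming happens afterwards in application code.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:space under the reserved XML namespace.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

doc = '<doc><p xml:space="preserve">  keep   these   spaces  </p><q>\n  indented\n</q></doc>'
root = ET.fromstring(doc)
p, q = root.find("p"), root.find("q")

print(repr(p.text))             # whitespace delivered untouched
print(repr(q.text))             # same here, with or without xml:space
print(p.get(XML_NS + "space"))  # the hint is just an attribute the application may honour


def collapse(text: str) -> str:
    """Application-level normalization, applied after parsing."""
    return " ".join(text.split())


print(repr(collapse(q.text)))
```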

This is all clearly explained in the standard itself (W3C XML pages).

Re:One tiny little update ??? by PainKilleR-CE · 2002-10-18 04:42 · Score: 5, Informative

The problem is that the data with the NEL character already exists, and is already generating these types of errors when it's translated into XML (or when XML is generated on these mainframes). From the original change proposal by IBM:

Problem areas include:

* Processing XML documents or DTDs generated on OS/390 systems, with XML 1.0 compliant parsers.
* Processing XML documents or DTDs, using native OS/390 system tools.
* Processing XML documents or DTDs retrieved from OS/390 database or file systems, in non-OS/390 environments.

XML documents that contain [NEL] characters are declared invalid or not well-formed by XML 1.0 compliant parsers.

Essentially 'just fix the software' involves operating system-level changes as well as possibly changes to most software that interprets NEL characters on that OS. As it stands, they're going to have a problem anyway, and it's probably best to simply add the change to the XML standard to fix what was essentially an omission in the 1.0 standard.

--
-PainKilleR-[CE]

Re:One tiny little update ??? by p3d0 · 2002-10-18 04:47 · Score: 3, Informative

Now there are two and it will take years to roll out the new parsers universally.

You can't blame this on the NEL issue. XML 1.1 will arrive sooner or later regardless of how the NEL issue ends up.

--
Patrick Doyle
I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....

depends on the character set by MenTaLguY · 2002-10-18 05:11 · Score: 3, Informative

It's the same thing that happens any time you take codepoints across character sets.

iso-8859-1:

NEL (from EBCDIC) doesn't exist in iso-8859-1; what character gets used instead is up to the discretion of the tool doing the conversion. Since Unicode defines 0x0085 as a whitespace character, the tool could know to substitute another if that was desired.

UTF-8:

It's a non-lossy encoding of Unicode. All characters above 0x007f get stored as multi-byte sequences. 0x0085 becomes 0xc2 0x85.
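
For the curious, the 0xc2 0x85 pair falls straight out of the two-byte UTF-8 bit layout. Here is a small Python sketch; the function name `utf8_two_byte` is made up for this example and it only handles the two-byte range.

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("two-byte form covers U+0080..U+07FF only")
    lead = 0xC0 | (code_point >> 6)      # 110xxxxx carries the top five bits
    trail = 0x80 | (code_point & 0x3F)   # 10xxxxxx carries the low six bits
    return bytes([lead, trail])


# Compare the hand-rolled result with Python's own codec.
print(utf8_two_byte(0x85).hex(), "\u0085".encode("utf-8").hex())

# A bare 0x85 byte is a continuation byte with no lead byte, so it is not valid UTF-8.
try:
    b"\x85".decode("utf-8")
    lone_byte_ok = True
except UnicodeDecodeError:
    lone_byte_ok = False
print(lone_byte_ok)
```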

--

DNA just wants to be free...

UTF-8 by MenTaLguY · 2002-10-18 05:14 · Score: 3, Informative

If you want to know more about UTF-8, see RFC 2279.

--

DNA just wants to be free...

Some clarification by kune · 2002-10-18 05:16 · Score: 3, Informative

The problem is caused by the EBCDIC code. It has special codes for CARRIAGE RETURN (CR), LINE FEED (LF) and NEW LINE (NL). The problem is now how you convert NL into Unicode. You could map it to LF, but that can't be mapped back. You could also map it to LINE SEPARATOR U+2028, but IBM seems to think that mapping it to the NEL (Next Line) U+0085 control character is appropriate. This is supported by Unicode standard annex UAX #13. However, in UAX #14 U+0085 does not have the line breaking property, so there is still some inconsistency in the Unicode standard. But I don't think this is a major issue. It only means that you will have problems editing XML documents generated in EBCDIC in some of the worse editors. We have lived all these years with the DOS CRLF/UNIX LF problem, so we will survive this too.
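
The round-trip argument can be tried out in Python, assuming its `cp037` codec is a fair stand-in for the EBCDIC code page on these mainframes (that is my assumption, not something stated in the proposal). The byte values 0x0D, 0x25 and 0x15 are the EBCDIC CR, LF and NL in that code page.

```python
ebcdic = bytes([0x0D, 0x25, 0x15])   # CR, LF, NL in EBCDIC code page 037
decoded = ebcdic.decode("cp037")

# Show which Unicode code point each EBCDIC control ends up as.
print([f"U+{ord(c):04X}" for c in decoded])

# Because NL gets its own code point, the conversion is reversible.
round_trip_ok = decoded.encode("cp037") == ebcdic

# Folding NL into LF instead loses the distinction on the way back.
folded = decoded.replace("\u0085", "\n")
lossy = folded.encode("cp037") != ebcdic
print(round_trip_ok, lossy)
```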

Re:What? No newline? by PainKilleR-CE · 2002-10-18 05:26 · Score: 4, Informative

Before XHTML, the p tag had no closing tag. The end of a paragraph was defined by the beginning of the next one. So that wasn't necessarily a "bad habit" back then.

This is not correct. The P tag ALWAYS had a closing tag, but it was not REQUIRED. Here's the P tag section for HTML 2.0:
http://www.w3.org/MarkUp/html-spec/L2index.html# P

was still allowed, but all it did was start a new paragraph.

Actually, it just ends the current paragraph. You can't use the closing tag without the beginning tag and still have well-formed HTML. Since it wasn't required, though, there are probably many different methods of handling it as far as HTML parsers/browsers are concerned. The only reason it would start a new paragraph is because it designates the end of the current paragraph, and another paragraph is what typically follows the end of a paragraph within a document.

--
-PainKilleR-[CE]
