XML 1.1 Spec Hits Some Snags

It's only a candidate specification. by tomhudson · 2002-10-18 02:21 · Score: 5, Insightful

This specification is being put forth as a W3C Candidate Recommendation of XML 1.1.

If you don't like it, keep in mind that you CAN bitch about it and help change this.

Read the Unicode spec.... by vidarh · 2002-10-18 02:22 · Score: 5, Informative

Unicode 3.2 defines 0x85 as a newline character. This change just makes XML follow the Unicode spec, which isn't unreasonable considering that the parser is expected to use Unicode internally (or to act as if it does).

Re:Read the Unicode spec.... by Anonymous Coward · 2002-10-18 02:52 · Score: 5, Insightful

Like the man says, read the Unicode specification! Unicode defines a far wider range of characters than simple 7- or 8-bit ASCII text can cover, and the à is simply mapped into another Unicode byte pair. You won't lose the ability to use à in your XML documents, you just use Unicode.
Re:Read the Unicode spec.... by gorilla · 2002-10-18 02:59 · Score: 5, Informative

It's more complicated than that. Unicode has
2029 - paragraph separator
2028 - line separator
000D - CR
000A - LF
0085 - NEL (Next Line)
Any of these could be interpreted as the end of a logical line.
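
One way to see all five of those code points acting as line terminators is Python's built-in str.splitlines(), which happens to treat each of them (and a few others, such as form feed) as a line boundary. This is only an illustration in Python; it says nothing about what an XML parser does:

```python
# The five code points from the list above, each used as the only
# separator between two pieces of text.
terminators = {
    "PARAGRAPH SEPARATOR": "\u2029",
    "LINE SEPARATOR": "\u2028",
    "CR": "\u000d",
    "LF": "\u000a",
    "NEL": "\u0085",
}


def split_on_each(seps):
    """Return {name: pieces} for 'first<sep>second' using str.splitlines()."""
    return {name: ("first" + sep + "second").splitlines()
            for name, sep in seps.items()}


results = split_on_each(terminators)
for name, pieces in results.items():
    print(f"{name:20} -> {pieces}")
```

Every entry comes back as two pieces, which is the "any of these could end a logical line" point in practice.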
Re:Read the Unicode spec.... by vidarh · 2002-10-18 03:10 · Score: 5, Informative

And if you read the XML 1.1 spec you'll see that all of the characters you've listed above except for 0x2029 are interpreted by XML 1.1 as the end of a logical line.
Re:Read the Unicode spec.... by julesh · 2002-10-18 03:57 · Score: 5, Informative

No, not really. It means that *some* XML files can't be edited with these editors. But then that was true already; some might have used \r or some other of the list of characters.

What it *does* mean is that editors on systems other than Unix are able to edit XML files. It means I can create an XML file in DOS 'edit' which uses \r\n, or on a Mac with an editor that might use \r, or on (apparently) an IBM system where the standard text editors use \u0085.

This is absolutely essential. It does however mean that in order to support *all* XML files, you need to recognise *all* of those line endings. As always, it's easier to support a subset, but harder to support everything. However the fact that existing software works at all is very important, so I think they're moving in the right direction.
Re:Read the Unicode spec.... by innate · 2002-10-18 04:07 · Score: 5, Informative

There are no 2-byte Unicode characters, only encodings (such as UTF-16) which use two or more bytes to represent each character. Some Unicode characters, those not in the Basic Multilingual Plane (BMP), require more than two bytes to represent in UTF-16.
And 7-bit ASCII is a strict subset of UTF-8 encoding. UTF-8 encodes each character to one or more bytes, with characters up to 127 defined the same as in ASCII. If your text is strict 7-bit ASCII, it is also a UTF-8 file.
You could also use UTF-32 (UCS-4), which represents each character as 4 bytes, but that is overkill for most applications.
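
A rough way to check those byte counts is to encode a few sample characters in Python. The sample characters are my own picks, and the "-be" codec variants are used only so that no byte-order mark gets counted:

```python
def encoded_sizes(ch):
    """Return the number of bytes one character takes in each encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),  # -be: no byte-order mark
        "utf-32": len(ch.encode("utf-32-be")),
    }


# A, a-grave, euro sign, and U+1D11E (musical G clef, outside the BMP).
samples = ["A", "\u00e0", "\u20ac", "\U0001D11E"]
sizes = {ch: encoded_sizes(ch) for ch in samples}
for ch, row in sizes.items():
    print(f"U+{ord(ch):04X}", row)

# Strict 7-bit ASCII bytes are already valid UTF-8, byte for byte.
ascii_bytes = "plain ASCII".encode("ascii")
same_text = ascii_bytes.decode("utf-8")
```

The non-BMP character is the one that needs four bytes (a surrogate pair) in UTF-16, while UTF-32 is a flat four bytes for everything.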
The main problem with multibyte encodings such as Shift-JIS and Big 5 was lead-byte detection: you couldn't jump into the middle of a string and determine if you were looking at the only, first, or second byte of a character. You had to start parsing at the beginning of the string in order to synchronize your character detection. The Unicode encodings have done away with this by strictly defining the lead byte ranges in such a way that there is never any ambiguity.
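
For UTF-8 specifically, that resynchronization can be sketched in a few lines of Python: continuation bytes always have the bit pattern 10xxxxxx, so from any offset you just back up until you hit a byte that isn't one. The helper name char_start is made up for this example:

```python
def char_start(data: bytes, pos: int) -> int:
    """Back up from any byte offset to the first byte of the UTF-8
    character that contains it."""
    # Continuation bytes look like 10xxxxxx; lead bytes never do.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


# 'a' (1 byte) + a-grave (2 bytes) + euro sign (3) + U+1D11E (4) = 10 bytes
data = "a\u00e0\u20ac\U0001D11E".encode("utf-8")
starts = [char_start(data, i) for i in range(len(data))]
print(starts)
```

No scan from the beginning of the string is needed, which is exactly what the older multibyte encodings couldn't offer.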

--
No, I don't want to explore the Recycle Bin.

So? by Anonymous Coward · 2002-10-18 02:25 · Score: 5, Informative

1.0 : To simplify the tasks of applications, the characters passed to an application by the XML processor must be as if the XML processor normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.

1.1 : To simplify the tasks of applications, the characters passed to an application by the XML processor must be as if the XML processor normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating all of the following to a single #xA character:

the two-character sequence #xD #xA
the two-character sequence #xD #x85
the single character #x85
the single character #x2028
any #xD character that is not immediately followed by #xA or #x85.
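
For anyone who prefers code to spec prose, the two rule sets quoted above could be sketched in Python roughly like this. The function name normalize_eol and its version argument are my own invention, and a real parser does this on input before parsing rather than on an already-loaded string:

```python
import re

# XML 1.0: the pair #xD #xA, or a lone #xD, becomes #xA.
_XML10_EOL = re.compile("\r\n|\r")

# XML 1.1 candidate rules quoted above. The two-character sequences
# are listed first so they win over the single characters.
_XML11_EOL = re.compile("\r\n|\r\x85|\x85|\u2028|\r")


def normalize_eol(text: str, version: str = "1.0") -> str:
    """Translate every line break recognised by the given XML version to #xA."""
    pattern = _XML11_EOL if version == "1.1" else _XML10_EOL
    return pattern.sub("\n", text)


sample = "a\r\nb\rc\x85d\r\x85e\u2028f\u2029g"
as_10 = normalize_eol(sample, "1.0")
as_11 = normalize_eol(sample, "1.1")
print(repr(as_10))
print(repr(as_11))
```

Note that #x2029 is left alone under both versions, and that everything 1.0 normalizes is still normalized the same way under 1.1.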

I don't get it, what's the problem here? Surely the 1.1 spec simply extends the available EOL characters. It certainly doesn't remove any existing characters that are present in the 1.0 spec. How does it break backwards compatibility?

Re:So? by vidarh · 2002-10-18 02:36 · Score: 5, Informative

In which character set? Certainly not in Unicode, so if anyone used 0x85 as à in XML documents using any Unicode encoding they've messed up. à (Latin small letter a with grave) is 0xE0, and À (Latin capital letter a with grave) is 0xC0.
Re:So? by vidarh · 2002-10-18 02:40 · Score: 5, Informative

To expand on that, in ISO-Latin in general and certainly ISO-Latin-1, and thus by extension Unicode (which maps to ISO-Latin-1 for code points in the range 0x00 to 0xff), the area 0x80 to 0x9f was on purpose not used for displayable glyphs in order not to cause interoperability problems with 7-bit systems if 8-bit text was moved between systems and the eighth bit was stripped off.
So unless you are using a non-Unicode, non-ISO-Latin encoding there are no printable characters in that range, and if you're using another character set you will need to remap the characters before considering any of the rules in the XML spec anyway, since those rules refer to the Unicode code points.
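
The remapping step is easy to try out in Python by decoding the single byte 0x85 under a few codecs. The three codecs here are just the ones I'd guess people are mixing up in this thread (the old IBM PC code page, the Windows one, and ISO-Latin-1):

```python
import unicodedata

BYTE = b"\x85"

# The same byte, decoded under three different character sets.
decoded = {
    "cp437": BYTE.decode("cp437"),      # IBM PC "extended ASCII" chart
    "cp1252": BYTE.decode("cp1252"),    # Windows Western European
    "latin-1": BYTE.decode("latin-1"),  # ISO-8859-1, same as U+0000..U+00FF
}

for codec, ch in decoded.items():
    print(f"{codec:8} U+{ord(ch):04X} category={unicodedata.category(ch)}")
```

Only the cp437 reading gives à (U+00E0); under ISO-Latin-1 the byte maps straight to the control character U+0085, which is the code point the XML rules are talking about.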
Re:So? by Sir+Tristam · 2002-10-18 02:45 · Score: 5, Informative

0x85 is à (a grave). So everyone in France?
No, you're looking at the extended ASCII chart. What this is talking about is Unicode. A Unicode 0x0085 is the control character NEL (http://www.unicode.org/charts/PDF/U0080.pdf, page 3). NEL is NExt Line.
Chris Beckenbach

One tiny little update ??? by Dave21212 · 2002-10-18 02:26 · Score: 5, Interesting

Considering what some other vendors have done to standards, one tiny addition (which is an improvement) proposed by IBM shouldn't be a big deal. Sure, it feeds the news hounds, but seriously, compare the scale of the impact of one desirable change to all the suffering caused by other such changes in emerging standards (Microsoft's in particular).

IBM has contributed so much, it's only natural that some changes might be characterized in the news as benefitting them more than other parties. Is anyone that worried about adding a new EOL character in 1.1 that XML 1.0 "chokes" on ?

--
"Whoever would overthrow the liberty of a nation must begin by subduing the freeness of speech."--Benjamin Franklin

Re:One tiny little update ??? by PainKilleR-CE · 2002-10-18 02:33 · Score: 5, Insightful

IBM has contributed so much, it's only natural that some changes might be characterized in the news as benefitting them more than other parties. Is anyone that worried about adding a new EOL character in 1.1 that XML 1.0 "chokes" on ?

and, as an IBM rep pointed out in the article, XML documents are supposed to specify what version they're using at the top of the document. Any proper XML parser should read that it's 1.0 and interpret the newline character as 1.0 would.

--
-PainKilleR-[CE]
Re:One tiny little update ??? by PainKilleR-CE · 2002-10-18 04:42 · Score: 5, Informative

The problem is that the data with the NEL character already exists, and is already generating these types of errors when it's translated into XML (or when XML is generated on these mainframes). From the original change proposal by IBM:

Problem areas include:

* Processing XML documents or DTDs generated on OS/390 systems, with XML 1.0 compliant parsers.
* Processing XML documents or DTDs, using native OS/390 system tools.
* Processing XML documents or DTDs retrieved from OS/390 database or file systems, in non-OS/390 environments.

XML documents that contain [NEL] characters are declared invalid or not well-formed by XML 1.0 compliant parsers.

Essentially 'just fix the software' involves operating system-level changes as well as possibly changes to most software that interprets NEL characters on that OS. As it stands, they're going to have a problem anyway, and it's probably best to simply add the change to the XML standard to fix what was essentially an omission in the 1.0 standard.

--
-PainKilleR-[CE]

Here's A Good Point by LISNews · 2002-10-18 02:28 · Score: 5, Interesting

From the article, which kind of put it into perspective for me:

"The truth is that there are a lot of IBM mainframe systems out there, and they're very important," said Ronald Schmelzer, an analyst with ZapThink. "The truth is that this is not really for IBM's benefit, it's for IBM's customers' benefit. And I think that's fair. An international standard shouldn't change for the benefit of a company's future project, but it's clear that end-of-line characters are not a strategic business strategy for IBM."

Considering ... by DigitalDreg · 2002-10-18 02:30 · Score: 5, Insightful

That IBM gave the world SGML and XML by derivative ....

That a lot of useful data exists on IBM mainframes ....

That EBCDIC doesn't "cleanly" map into Unicode by design like ASCII/UTF-8 does ...

That this benefits IBM users and customers, not IBM because there is no strategic market position related to new-line characters ...

That this was a recommendation reached by a group ...

Let it live and get a life.

2 line summary by Shagg · 2002-10-18 02:39 · Score: 5, Insightful

1) XML 1.0 does not follow the Unicode spec
2) XML 1.1 makes a change so that it does follow the spec

What's the complaint again?

--
Unix is user friendly, it's just selective about who its friends are.

Re:XML rant. by russellh · 2002-10-18 02:40 · Score: 5, Funny

parse error, line 1: no trailing '

--
must... stay... awake...

What do they mean, "XML 1.0 chokes"? by st.+augustine · 2002-10-18 02:42 · Score: 5, Insightful

Does anyone have a link to a page explaining what's really going on? Last I heard, XML doesn't even have a concept of newlines -- most of the time all white space gets normalized (collapsed). The only problem that I could see is if the character wasn't part of the spec for white space. Now, people may have written XML software that chokes, but I think that's a slightly different story. So is the problem that the new character shows up as bogus text content in elements? And is that true for all XML processing software, or does software that relies on a proper Unicode engine not have the problem? What's the deal?

--

-- Some things are to be believed, though not susceptible to rational proof.

W3C still meeting over IBM resolutions by spoonyfork · 2002-10-18 02:51 · Score: 5, Funny

World Wide Web Consortium still meeting over IBM resolutions

Posted: Sat, 19 Oct 2002 0:18 AEST

The five permanent members of the World Wide Web Consortium are meeting overnight in an attempt to agree on a resolution on IBM. W3C diplomats say there are signs HP and Sun are now moving towards a compromise, after weeks of wrangling over the XML issue.

HP wants clear instructions given to Microsoft that it return to the World Wide Web Consortium before taking legal action, while Microsoft wants more leeway for itself and its allies. Meanwhile, the HP CEO, Carly Fiorina, has given another strong warning against the use of force against IBM.

Speaking at the opening of a summit of XML using companies in the Silicon Valley, Mrs Fiorina said legal force must only be used as a last resort. She called for all conflicts to be resolved in ways respecting international law, as this was the only guarantee against what she described as "adventurist" policies.

--
Speak truth to power.

*Shrug* by Fweeky · 2002-10-18 02:53 · Score: 5, Insightful

If you're using the XML prologue like you're supposed to, your XML 1.0 documents will have:

<?xml version="1.0" ?>

At the top. The parsers will then parse using the XML 1.0 specification and you won't notice a thing.

If you don't use it, tough luck, you should have followed the original recommendation more closely. Lucky for you it's not exactly difficult to automatically process XML documents and add the prologue later.
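
As a rough idea of how little work that is, here is a Python sketch that reads the declared version and adds a declaration when there isn't one. The function names are made up, the regex is a deliberately loose match, and it ignores byte-order marks and encoding declarations, so treat it as a sketch rather than something to run on real data:

```python
import re

# Loose match for an XML declaration at the very start of a document.
_XML_DECL = re.compile(r'^<\?xml\s+version\s*=\s*(["\'])(1\.[0-9]+)\1')


def declared_version(doc: str):
    """Return the version named in the XML declaration, or None if absent."""
    match = _XML_DECL.match(doc)
    return match.group(2) if match else None


def ensure_prologue(doc: str, version: str = "1.0") -> str:
    """Prepend an XML declaration unless the document already has one."""
    if declared_version(doc) is not None:
        return doc
    return f'<?xml version="{version}"?>\n{doc}'


with_decl = '<?xml version="1.0" ?>\n<note>hi</note>'
without_decl = "<note>hi</note>"
print(declared_version(with_decl))
print(declared_version(without_decl))
print(ensure_prologue(without_decl))
```

A parser that dispatches on declared_version() can keep applying the 1.0 line-break rules to 1.0 documents and only use the new rules for documents that say 1.1.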

XML 1.1 - Problem? by Plud · 2002-10-18 02:58 · Score: 5, Informative

A newline character should have no impact with respect to backwards compatibility. The only negative impact of a new newline character should be confined to poorly written DOM code that parses out all nodes instead of just the relevant nodes. Similar issue with SAX. Even if there were a backwards compatibility issue with a new XML spec, most people define their version number in their documents so the parser knows which spec to follow while parsing it.

Full Details by Matts · 2002-10-18 02:58 · Score: 5, Informative

Full details of why this has the potential to break things are on the XML news site Cafe Con Leche.

Please read that before making uninformed comments - news.com isn't where you'll find technical information about this problem.

--

Matt. Want XML + Apache + Stylesheets? Get AxKit.

Re:Full Details by Anonymous Coward · 2002-10-18 03:14 · Score: 5, Interesting

So I went off and read it (Or at least, what appears to be it. There is a rant some way down the page you link to. Is that it?)

So anyway, I read it. Surprise, surprise, the guy doesn't offer any actual examples of where this change would cause a break in itself. All he basically does is cry that 0x85 is designated as a new line character, and how dare IBM do such a thing! Then he goes into a rant about IBM, monopolies and patents. Uh huh.

The fact is that 0x0085 is designated as NEL (NExt Line) as part of the Unicode specification. XML 1.1 follows the current Unicode specification, whereas XML 1.0 was tied to Unicode 2.0. Therefore, if you are using XML 1.1, and you are using 0x85 and expect to see a grave a, your document isn't a Unicode-compliant document anyway, and you shouldn't be complaining that a non-compliant document doesn't work with a compliant parser.

If all these people want to use 0x85 in their XML 1.1 documents, then they'll have to properly convert them to Unicode as the specification allows. Surprising, that.

New Line and Characters by JasonSkywalker · 2002-10-18 03:04 · Score: 5, Funny

Let us not forget that it was New Line that callously yanked Tolkien's loveable Tom Bombadil! It was New Line that turned Arwen into a heroic Nazgul-racing babe-elf! It was New Line that left out poor Glorfindel and his big moment at the river altogether!

I don't know about anyone else, but I think it's only fitting that a New Line character be messed with.

---

--
I have Unix underpants.

Re:For those missing the problem, also … by Anonymous Coward · 2002-10-18 03:20 · Score: 5, Informative

From the XML 1.1 spec

The W3C's XML 1.0 Recommendation was first issued in 1998, and despite the issuance of many errata culminating in a Second Edition of 2000, has remained (by intention) unchanged with respect to what is well-formed XML and what is not. This stability has been extremely useful for interoperability. However, the Unicode Standard on which XML 1.0 relies for character specifications has not remained static, evolving from version 2.0 to version 3.1 and beyond. Characters not present in Unicode 2.0 may already be used in XML 1.0 character data. However, they are not allowed in XML names such as element type names, attribute names, enumerated attribute values, processing instruction targets, and so on. In addition, some characters that should have been permitted in XML names were not, due to oversights and inconsistencies in Unicode 2.0

So XML 1.0 used Unicode 2.0, but not properly. XML 1.1 fixes that, and defines that all Unicode 3.2 characters are now valid when used in an XML document. As part of this change, XML 1.1 also correctly allows the use of the Unicode 0x0085 NEL character as an EOL marker, which is totally compliant and consistent with the Unicode 3.2 specification.

In other words, if you're using any character encoding other than Unicode 3.2, your XML document isn't compliant with XML 1.1 and you shouldn't ever expect ISO-Latin 0x85 to be displayed as an ellipsis.

Re:What? No newline? by dossen · 2002-10-18 03:32 · Score: 5, Informative

Where did you get that awful idea of using <p />? The proper way is to enclose the paragraph in p-tags like so:

<p> Your text here</p>

Using the paragraph tags as large linebreaks is a very bad habit from the bad old days of the web. Please head to W3C and study the recent standards, and validate your documents before publishing (using a validator, not a browser).
Ohh... And this is actually an issue about XML 1.1 Unicode support, so worrying about HTML is quite premature (XHTML is still XML 1.0, and will remain so until XML 1.1 becomes a standard (or recommendation in W3C-speak)).

Re:New Newline Character? by gorilla · 2002-10-18 03:41 · Score: 5, Informative

Each of them has a different function. 000A and 000D are for compatibility with ASCII. 0085 is for a unified character to replace the 000D 000A pair used on some OSes. However, some programs (e.g. Notepad) use line breaks when they really mean paragraph separators, so Unicode defined two codes which mean REAL line separator, and REAL paragraph separator. This report explains it quite clearly.
