XML 1.1 Spec Hits Some Snags
oever writes "News.com reports that the new XML 1.1 specification defines a new newline character, making it incompatible with the 1.0 specification. Apparently, IBM has been pushing the new character to avoid having to modify their software, thereby invalidating everybody else's XML software."
If you don't like it, keep in mind that you CAN bitch about it and help change this.
That explains why my website was one long paragraph.
Uttering logically derived and empirically supported truths to the disciples of the orthodox establishment.
Why don't they make new-lines overridable? Then IBM can put the override at the beginning of their files.
Unicode 3.2 defines 0x85 as a newline character. This change just makes XML follow the Unicode spec, which isn't unreasonable considering that the parser is expected to use Unicode internally (or to act as if it does).
1.1 : To simplify the tasks of applications, the characters passed to an application by the XML processor must be as if the XML processor normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating all of the following to a single #xA character:
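For reference, section 2.11 of the XML 1.1 Recommendation lists these forms: the pair #xD #xA, the pair #xD #x85, a lone #x85, a lone #x2028, and any #xD not followed by #xA or #x85. Here is a rough Python sketch of that rule. The function name is made up for illustration, and a real parser would do this on the decoded character stream before parsing:

```python
import re

# Line-break forms named in XML 1.1 section 2.11. Two-character forms come
# first so that CR+LF and CR+NEL each collapse to a single LF.
_XML11_EOL = re.compile("\r\n|\r\x85|\x85|\u2028|\r")

def normalize_xml11_line_ends(text: str) -> str:
    """Hypothetical helper: translate every XML 1.1 line break to #xA."""
    return _XML11_EOL.sub("\n", text)
```

XML 1.0 applies the same idea to only #xD #xA and lone #xD, which is why #x85 and #x2028 are the new part.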
I don't get it, what's the problem here? Surely the 1.1 spec simply extends the available EOL characters. It certainly doesn't remove any existing characters that are present in the 1.0 spec. How does it break backwards compatibility?
Typically, don't version-naming schemes imply something along the lines of: versions x.0, x.1, x.2, etc... are all compatible. And if the next version is NOT compatible, then it should be labeled as "(x+1).0"?
I guess there's no law stating that this must always be the case, but if these two specifications are NOT compatible, then it would make sense that they would name the new one XML2.0 no?
Karma: NaN
If we don't allow the IBM EOL in XML 1.1 ...
How will we ever communicate with Master Control Program?
End Of Line
www.bannination.com Two things float to the top he
Considering what some other vendors have done to standards, one tiny addition (which is an improvement) proposed by IBM shouldn't be a big deal. Sure, it feeds the news hounds, but seriously, compare the scale of the impact of one desirable change to all the suffering caused by other such changes in emerging standards (Microsoft's in particular).
IBM has contributed so much, it's only natural that some changes might be characterized in the news as benefitting them more than other parties. Is anyone that worried about adding a new EOL character in 1.1 that XML 1.0 "chokes" on ?
"Whoever would overthrow the liberty of a nation must begin by subduing the freeness of speech."--Benjamin Franklin
I don't know if it's still the same in RiscOS, but the BBC Micro used 0x0A+0x0D as its end-of-line marker. Why doesn't XML support this? If 1.1 is going to modify the end-of-line specification, then this is the perfect time to correct this glaring omission.
"The truth is that there are a lot of IBM mainframe systems out there, and they're very important," said Ronald Schmelzer, an analyst with ZapThink. "The truth is that this is not really for IBM's benefit, it's for IBM's customers' benefit. And I think that's fair. An international standard shouldn't change for the benefit of a company's future project, but it's clear that end-of-line characters are not a strategic business strategy for IBM."
That IBM gave the world SGML and XML by derivative ....
....
...
...
...
That a lot of useful data exists on IBM mainframes
That EBCDIC doesn't "cleanly" map into Unicode by design like ASCII/UTF-8 does
That this benefits IBM users and customers, not IBM because there is no strategic market position related to new-line characters
That this was a recommendation reached by a group
Let it live and get a life.
1) XML 1.0 does not follow the Unicode spec
3) XML 1.1 makes a change so that it does follow the spec
What's the complaint again?
Unix is user friendly, it's just selective about who its friends are.
parse error, line 1: no trailing '
must... stay... awake...
Does anyone have a link to a page explaining what's really going on? Last I heard, XML doesn't even have a concept of newlines -- most of the time all white space gets normalized (collapsed). The only problem that I could see is if the character wasn't part of the spec for white space. Now, people may have written XML software that chokes, but I think that's a slightly different story. So is the problem that the new character shows up as bogus text content in elements? And is that true for all XML processing software, or does software that relies on a proper Unicode engine not have the problem? What's the deal?
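One way to poke at that question yourself, assuming CPython's standard library (whose ElementTree sits on top of expat, an XML 1.0 parser). This is only a quick experiment, not a statement about every XML tool out there:

```python
import xml.etree.ElementTree as ET

# U+0085 (NEL) sitting between two words of character data.
doc = "<p>first\u0085second</p>"

# U+0085 is a legal character in XML 1.0 but not one of its line breaks,
# so an XML 1.0 parser hands it to the application unchanged instead of
# normalizing it to "\n".
text = ET.fromstring(doc).text
print(repr(text))
```

So with a 1.0 parser the character does come through as ordinary text content; whether that counts as "bogus" depends on what the application expected to see there.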
-- Some things are to be believed, though not susceptible to rational proof.
Don't like it? Complain. Comments can be viewed here.
is 0x156C in my programming area, 'nough said. EBCDIC is still alive. Did you know that about 90% of today's enterprise data is stored in EBCDIC chars? You better update the XML specs :)
Anybody care to explain to me _why_ we need so many different newline characters | sequences? I see a point in having a single \x0a character, because a newline is one character. I see a point in having \x0a\x0d and \x0d\x0a, because they represent more accurately how a typewriter does it (and conform better to the original ASCII standard, I think). However, one of these is kind of redundant, and history seems to have decided that this is \x0a\x0d. But why, for goodness's sake, do we need all those others??? Why is it that people always do things their own way instead of following standards that work fine???
Please correct me if I got my facts wrong.
As has been pointed out by many people, this whole issue is silly since 1.1 actually follows the Unicode spec more closely...
(so whose code is broken now? huh?)
Personally I don't see the big deal over XML itself. It's just a way of organizing data hierarchically and expressing it in a nice format (aka a TREE)
I still don't know how people manage to write 500+ page books on it.
Maybe I'm just completely stupid -- please, someone enlighten me to the great wonder that is XML.
World Wide Web Consortium still meeting over IBM resolutions
Posted: Sat, 19 Oct 2002 0:18 AEST
The five permanent members of the World Wide Web Consortium are meeting overnight in an attempt to agree on a resolution on IBM. W3C diplomats say there are signs HP and Sun are now moving towards a compromise, after weeks of wrangling over the XML issue.
HP wants clear instructions given to Microsoft that it return to the World Wide Web Consortium before taking legal action, while Microsoft wants more leeway for itself and its allies. Meanwhile, the HP CEO, Carly Fiorina, has given another strong warning against the use of force against IBM.
Speaking at the opening of a summit of XML using companies in the Silicon Valley, Mrs Fiorina said legal force must only be used as a last resort. She called for all conflicts to be resolved in ways respecting international law, as this was the only guarantee against what she described as "adventurist" policies.
Speak truth to power.
If you don't use it, tough luck, you should have followed the original recommendation more closely. Lucky for you it's not exactly difficult to automatically process XML documents and add the prolog later.
A newline character should have no impact with respect to backwards compatibility. The only negative impact of a new newline character should be confined to poorly written DOM code that parses out all nodes instead of just the relevant nodes. Similar issue with SAX. Even if there were a backwards compatibility issue with a new XML spec, most people define their version number in their documents so the parser knows which spec to follow while parsing it.
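To make the version point concrete, here is a rough sketch of how a tool might read the declared version before picking its rules. The helper name is hypothetical, and a real parser reads the XML declaration as part of encoding detection, not with a regex on a decoded string:

```python
import re

# Matches an XML declaration at the very start of the document and captures
# the value of its version pseudo-attribute.
_XML_DECL_VERSION = re.compile(r'^<\?xml\s+version\s*=\s*(["\'])([^"\']+)\1')

def declared_xml_version(document: str) -> str:
    """Hypothetical helper: return the declared version, defaulting to 1.0."""
    match = _XML_DECL_VERSION.match(document)
    # A document with no XML declaration is treated as XML 1.0.
    return match.group(2) if match else "1.0"
```

A 1.0-only parser that sees version="1.1" can then refuse the document up front.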
Full details of why this has the potential to break things are on the XML news site Cafe Con Leche.
Please read that before making uninformed comments - news.com isn't where you'll find technical information about this problem.
Matt. Want XML + Apache + Stylesheets? Get AxKit.
here is a summary of just the proposed change.
It seems to comply with Unicode just fine; I don't see what the controversy is, really.
Let us not forget that it was New Line that callously yanked Tolkien's loveable Tom Bombadil! It was New Line that turned Arwen into a heroic Nazgul-racing babe-elf! It was New Line that left out poor Glorfindel and his big moment at the river altogether!
I don't know about anyone else, but I think it's only fitting that a New Line character be messed with.
---
I have Unix underpants.
As the quote from IBM points out in the article, this issue is just a subset of the larger problem with Unicode compatibility in XML 1.0. And as someone else pointed out, if document creators are using the XML headers appropriately to begin with, then parsers would handle documents correctly anyway. I'm also willing to bet that the percentage of existing XML documents which contains this particular character (0x85), and which are not already on IBM mainframes, is *extremely* small.
Face it: this just isn't that big a deal. It's good for industry acceptance and propagation of the standard, at very low cost. Move along, there's nothing to see here.
EBCDIC has won yet another chess game against the Grim Reaper.
This isn't about changing existing files, but about changing software that is in production use on mainframes. And it isn't IBM, but IBM's mainframe customers and anyone dealing with their mainframe-using customers that would get the work.
From the XML 1.1 spec
The W3C's XML 1.0 Recommendation was first issued in 1998, and despite the issuance of many errata culminating in a Second Edition of 2000, has remained (by intention) unchanged with respect to what is well-formed XML and what is not. This stability has been extremely useful for interoperability. However, the Unicode Standard on which XML 1.0 relies for character specifications has not remained static, evolving from version 2.0 to version 3.1 and beyond. Characters not present in Unicode 2.0 may already be used in XML 1.0 character data. However, they are not allowed in XML names such as element type names, attribute names, enumerated attribute values, processing instruction targets, and so on. In addition, some characters that should have been permitted in XML names were not, due to oversights and inconsistencies in Unicode 2.0
So XML 1.0 used Unicode 2.0, but not properly. XML 1.1 fixes that, and defines that all Unicode 3.2 byte pairs are now valid when used in an XML document. As part of this change, XML 1.1 also correctly allows the use of the Unicode 0x0085 NEL character as an EOL marker, which is totally compliant and consistent with the Unicode 3.2 specification.
In other words, if you're using any character encoding other than Unicode 3.2, your XML document isn't compliant with XML 1.1 and you shouldn't ever expect ISO-Latin 0x85 to be displayed as an ellipsis.
The real issue here is standards-committee wankery and the tendency of some people to accuse anyone who doesn't agree with them 100% of being proprietary, monopolistic, etc. This is exactly the sort of non-issue that doesn't deserve such rhetoric, and those who insist on crying wolf should be ejected from the process until they learn that "collaboration" doesn't mean "we rubber-stamp your ideas just because they're yours".
Slashdot - News for Herds. Stuff that Splatters.
No. XML is defined in terms of Unicode, but XML files can be stored in any encoding with a known mapping to Unicode.
Most XML files these days are using iso-8859-1 or UTF-8, both of which manage fine in vi/pico/gedit.
chars 32-127 are identical in ASCII and Unicode, and iso-8859-1 is exactly identical to the bottom 8-bits worth of Unicode.
Also, UTF-8 is an encoding of the full Unicode range that is backwards-compatible with 7-bit ASCII.
in any case, note the entry for LF in your parent post -- 000A (hex) = 10 (decimal).
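If you want to check those claims rather than take my word for it, here is a quick sketch using Python's built-in codecs (assumption: Python's "iso-8859-1" codec, which covers all 256 byte values including the control ranges):

```python
# Every byte decodes under ISO-8859-1 to the code point with the same number.
latin1_is_identity = all(
    bytes([b]).decode("iso-8859-1") == chr(b) for b in range(256)
)

# Pure ASCII text has the same bytes in ASCII, ISO-8859-1 and UTF-8.
sample = "<doc>plain text</doc>\n"
same_in_all = (
    sample.encode("ascii") == sample.encode("iso-8859-1") == sample.encode("utf-8")
)

# LF is code point 0x000A, i.e. decimal 10.
lf_code = ord("\n")
```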
DNA just wants to be free...
damn that lameness filter
Reason: Your comment looks too much like ascii art.
ever wondered what alt-gr is for?
raw a e i o u 4 `
alt-gr á é í ó ú ¦
There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
The Slashdot commentary has been pretty one-sided so I'll try and address the other side. First, IBM has said that this fix is for their mainframe customers, not for themselves. But nobody in the XML world has heard from these customers. As far as I know, no user has submitted a request for this NEL feature. No user has sent a message to the many XML mailing lists. No user has posted to Slashdot. Updating all of the XML parsers in the world is really expensive and if the mainframers don't care enough about the problem to storm the gates then maybe it isn't hurting them that badly. So from a democratic point of view, we're going to make life harder for the people who care enough to scream out loud in order to make life easier for the small minority who perhaps are not even that badly impacted.
Further discussion is on xml.com.
This is just a way to spark a holy war "my newline character is better than yours" debate. The proposal makes perfect sense - it brings XML into line with Unicode and ISO-10646.
It's the same thing that happens any time you take codepoints across character sets.
iso-8859-1:
NEL (from EBCDIC) doesn't exist in iso-8859-1; what character gets used instead is up to the discretion of the tool doing the conversion. Since Unicode defines 0x0085 as a whitespace character, the tool could know to substitute another if that was desired.
UTF-8:
It's a non-lossy encoding of Unicode. All characters above 0x007f get stored as multi-byte sequences. 0x0085 becomes 0xc2 0x85.
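A quick check of the UTF-8 case with Python's built-in codec (the variable names are just for illustration):

```python
nel = "\u0085"

# Code points above 0x7F take more than one byte in UTF-8.
utf8 = nel.encode("utf-8")

# Non-lossy: decoding gives back exactly the original character.
round_trip = utf8.decode("utf-8") == nel

# ASCII characters still take a single, unchanged byte.
ascii_stays_single = "A".encode("utf-8") == b"A"

print(utf8.hex())
```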
DNA just wants to be free...
If you want to know more about UTF-8, see RFC 2279.
DNA just wants to be free...
The problem is caused by the EBCDIC code. It has special codes for CARRIAGE RETURN (CR), LINE FEED (LF) and NEW LINE (NL). The problem is how to convert NL into Unicode: you could map it to LF, but that can't be mapped back. You could also map it to LINE SEPARATOR U+2028, but IBM seems to think that mapping it to the NEL (Next Line) U+0085 control character is appropriate. This is supported by Unicode Standard Annex UAX #13. However, in UAX #14 U+0085 does not have the line-breaking property, so there is still some inconsistency in the Unicode standard. But I don't think this is a major issue. It only means that you will have problems editing XML documents generated in EBCDIC in some of the worse editors. We have lived all these years with the DOS CRLF/UNIX LF problem, so we will survive this too.
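To see the three codes side by side, here is a small sketch using Python's cp037 codec. Assumption: cp037 is only one EBCDIC code page (US/Canada), and other code pages or vendor conversion tables may map these bytes differently:

```python
# EBCDIC (code page 037) byte values for the three control codes.
EBCDIC_CR, EBCDIC_NL, EBCDIC_LF = b"\x0d", b"\x15", b"\x25"

cr = EBCDIC_CR.decode("cp037")
nl = EBCDIC_NL.decode("cp037")
lf = EBCDIC_LF.decode("cp037")

# NL and LF land on different Unicode code points, so the conversion
# can be reversed without losing which one was in the original data.
round_trip = (
    nl.encode("cp037") == EBCDIC_NL and lf.encode("cp037") == EBCDIC_LF
)

print([hex(ord(c)) for c in (cr, nl, lf)])
```

Mapping NL to LF instead would make both bytes decode to the same character, which is the "can't be mapped back" problem.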
Compare the elegance of LISP brackets to overbloated XML tags. Compare the ability to share same syntax with code and data in LISP with oversimplified XML tags. Make a conclusion.
"I shall explain this by waving my hands about in an appropriate manner." -- Cambridge University Math Dept.
...and you can help support it. The complaints I keep reading are all about how tough it will be for those poor XML tools makers. As an XML tool maker, I assure you that this upgrade is no sweat. No one with the technical skill to create a serious XML tool is going to be challenged by this. The universality goals of Unicode and XML mesh very nicely and it's worth continuing to incorporate the lessons learned by each into the other.
"Those who have never entered upon scientific pursuits know not a tithe of the poetry by which they are surrounded."