XML 1.1 Spec Hits Some Snags
oever writes "News.com reports that the new XML 1.1 specification defines a new newline character, making it incompatible with the 1.0 specification. Apparently, IBM has been pushing the new character to avoid having to modify their software, thereby invalidating everybody else's XML software."
If you don't like it, keep in mind that you CAN bitch about it and help change this.
I wonder if this will have any impact on MS plans for making the next generation of Office. AFAIK, they're planning to make all the applications work together through XML... Then again, it is "only" a newline character... :P
God does not play dice - Albert Einstein
Well, Microsoft did it with Java and C, why can't IBM do it with XML? Think about it!
That explains why my website was one long paragraph.
Uttering logically derived and empirically supported truths to the disciples of the orthodox establishment.
Why don't they make new-lines overridable? Then IBM can put the override at the beginning of their files.
Unicode 3.2 defines 0x85 as a newline character. This change just makes XML follow the Unicode spec, which isn't unreasonable considering that the parser is expected to use Unicode internally (or to act as if it does).
1.1 : To simplify the tasks of applications, the characters passed to an application by the XML processor must be as if the XML processor normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating all of the following to a single #xA character:
I don't get it, what's the problem here? Surely the 1.1 spec simply extends the available EOL characters. It certainly doesn't remove any existing characters that are present in the 1.0 spec. How does it break backwards compatibility?
Typically, don't version-naming schemes imply something along the lines of: versions x.0, x.1, x.2, etc... are all compatible. And if the next version is NOT compatible, then it should be labeled as "(x+1).0"?
I guess there's no law stating that this must always be the case, but if these two specifications are NOT compatible, then it would make sense that they would name the new one XML2.0 no?
Karma: NaN
If we don't allow the IBM EOL in XML 1.1 ...
How will we ever communicate with Master Control Program?
End Of Line
www.bannination.com Two things float to the top he
Considering what some other vendors have done to standards, one tiny addition (which is an improvement) proposed by IBM shouldn't be a big deal. Sure, it feeds the news hounds, but seriously, compare the scale of the impact of one desirable change to all the suffering caused by other such changes in emerging standards (Microsoft's in particular).
IBM has contributed so much, it's only natural that some changes might be characterized in the news as benefitting them more than other parties. Is anyone that worried about adding a new EOL character in 1.1 that XML 1.0 "chokes" on ?
"Whoever would overthrow the liberty of a nation must begin by subduing the freeness of speech."--Benjamin Franklin
<rant about="IBM" why="FOR BREAKING COMPATIBILITY">IBM SUCKS</rant>
I don't know if it's still the same in RiscOS, but the BBC Micro used 0x0A+0x0D as its end-of-line marker. Why doesn't XML support this? If 1.1 is going to modify the end-of-line specification, then this is the perfect time to correct this glaring omission.
IBM will make the new newline character a '$'
(Just kidding)
"The truth is that there are a lot of IBM mainframe systems out there, and they're very important," said Ronald Schmelzer, an analyst with ZapThink. "The truth is that this is not really for IBM's benefit, it's for IBM's customers' benefit. And I think that's fair. An international standard shouldn't change for the benefit of a company's future project, but it's clear that end-of-line characters are not a strategic business strategy for IBM."
That IBM gave the world SGML and XML by derivative ....
....
...
...
...
That a lot of useful data exists on IBM mainframes
That EBCDIC doesn't "cleanly" map into Unicode by design like ASCII/UTF-8 does
That this benefits IBM users and customers, not IBM because there is no strategic market position related to new-line characters
That this was a recommendation reached by a group
Let it live and get a life.
... if this goes through, to write a batch converter to fix the newline characters?
$ fixnewline *.xml
Shouldn't take too much Perl... hell, a shell script could probably do it. Or am I missing something?
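For what it's worth, a converter along those lines could look roughly like this Python sketch. The function names are made up for illustration, and it assumes the files are UTF-8 and that turning NEL (U+0085) and LINE SEPARATOR (U+2028) into a plain LF is what you actually want; files stored in EBCDIC would need a different codec.

```python
import sys

# Hypothetical helper: rewrite the line-end characters that XML 1.0 parsers
# do not treat as line breaks into a plain LF.
def fix_newlines(text: str) -> str:
    # Handle the two-character CR+NEL pair first so it becomes one LF, not two.
    text = text.replace("\r\u0085", "\n")
    return text.replace("\u0085", "\n").replace("\u2028", "\n")

def fix_file(path: str, encoding: str = "utf-8") -> None:
    # newline="" stops Python from translating line endings on its own.
    with open(path, "r", encoding=encoding, newline="") as f:
        original = f.read()
    fixed = fix_newlines(original)
    if fixed != original:
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(fixed)

if __name__ == "__main__":
    # The shell expands *.xml, so every argument is one file name.
    for name in sys.argv[1:]:
        fix_file(name)
```

Saved as fixnewline.py, it would be run as `python fixnewline.py *.xml`. Existing LF and CR LF line ends are left alone.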
Real Daleks don't climb stairs - they level the building.
1) XML 1.0 does not follow the Unicode spec
3) XML 1.1 makes a change so that it does follow the spec
What's the complaint again?
Unix is user friendly, it's just selective about who its friends are.
from the article:
Although not referring specifically to the Mallinson case, he added it may be necessary to "weed out" employees who did not live up to Microsoft's code of behaviour.
I'd like to see someone at Microsoft do another 'ape routine', personally.
In other news - Julius Caesar stabbed and died.
Anyway - how long has the 1.1 draft been out?
Does anyone have a link to a page explaining what's really going on? Last I heard, XML doesn't even have a concept of newlines -- most of the time all white space gets normalized (collapsed). The only problem that I could see is if the character wasn't part of the spec for white space. Now, people may have written XML software that chokes, but I think that's a slightly different story. So is the problem that the new character shows up as bogus text content in elements? And is that true for all XML processing software, or does software that relies on a proper Unicode engine not have the problem? What's the deal?
-- Some things are to be believed, though not susceptible to rational proof.
Don't like it? Complain. Comments can be viewed here.
is 0x156C in my programming area, 'nough said. EBCDIC is still alive. Did you know that about 90% of today's enterprise data is stored in EBCDIC chars? You better update the XML specs :)
Anybody care to explain to me _why_ we need so many different newline characters | sequences? I see a point in having a single \x0a character, because a newline is one character. I see a point in having \x0a\x0d and \x0d\x0a, because they represent more accurately how a typewriter does it (and conform better to the original ASCII standard, I think). However, one of these is kind of redundant, and history seems to have decided that this is \x0a\x0d. But why, for goodness's sake, do we need all those others??? Why is it that people always do things their own way instead of following standards that work fine???
Please correct me if I got my facts wrong.
This is why there is a process for International Standards (for the moment, let's ignore fast-tracking of standards, which isn't being done for XML 1.1). If one company wants something, they can feel free to propose it, but the other members of the committee can vote it down. If this really does cause a significant incompatibility with a previous version, then the committee will realize it and vote it down. So, no one company can push something down the others' throats.
Here's more info.
Patrick Doyle
I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
I am wondering how hard is it for IBM or anybody else out there to write a small parsing utility to change the end of line character to the corrected one. As a computer science student I can say that this is not hard. Now on the other hand IBM probably has many xml files that would need to change but then again, we are talking about IBM. They have the power to do that.
since they are not industry STANDARDS it doesn't matter
windows is not an industry standard either.
As has been pointed out by many people, this whole issue is silly since 1.1 actually follows the Unicode spec more closely...
(so whose code is broken now? huh?)
Personally I don't see the big deal over XML itself. It's just a way of organizing data hierarchically and expressing it in a nice format (aka a TREE)
I still don't know how people manage to write 500+ page books on it.
Maybe I'm just completely stupid -- please, someone enlighten me to the great wonder that is XML.
World Wide Web Consortium still meeting over IBM resolutions
Posted: Sat, 19 Oct 2002 0:18 AEST
The five permanent members of the World Wide Web Consortium are meeting overnight in an attempt to agree on a resolution on IBM. W3C diplomats say there are signs HP and Sun are now moving towards a compromise, after weeks of wrangling over the XML issue.
HP wants clear instructions given to Microsoft that it return to the World Wide Web Consortium before taking legal action, while Microsoft wants more leeway for itself and its allies. Meanwhile, the HP CEO, Carly Fiorina, has given another strong warning against the use of force against IBM.
Speaking at the opening of a summit of XML using companies in the Silicon Valley, Mrs Fiorina said legal force must only be used as a last resort. She called for all conflicts to be resolved in ways respecting international law, as this was the only guarantee against what she described as "adventurist" policies.
Speak truth to power.
The article doesn't explain the technical problems in any depth at all.
Sticking feathers up your butt does not make you a chicken - Tyler Durden
Does this mean that XML has reached the end of the line and it is time to start working on the next big thing?
... 8 bits should be enough for anybody.
to paraphrase Bill G.
Trolling is a art,
If you don't use it, tough luck, you should have followed the original recommendation more closely. Lucky for you it's not exactly difficult to automatically process XML documents and add the prolog later.
A newline character should have no impact with respect to backwards compatibility. The only negative impact of a new newline character should be confined to poorly written DOM code that parses out all nodes instead of just the relevant nodes. There is a similar issue with SAX. Even if there were a backwards compatibility issue with a new XML spec, most people define their version number in their documents so the parser knows which spec to follow while parsing it.
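To make the version-number point concrete, here is a rough Python sketch of how a tool might peek at the declaration before picking a rule set. The function name is invented, and it only handles documents in an ASCII-compatible encoding with the declaration at the very first byte; treating a missing declaration as 1.0 is my reading of the spec, so take it as an assumption.

```python
import re

# Matches the version pseudo-attribute of an XML declaration at the very start
# of a document stored in an ASCII-compatible encoding.
_XML_DECL = re.compile(rb'<\?xml\s+version\s*=\s*(["\'])(1\.[0-9]+)\1')

def declared_xml_version(data: bytes) -> str:
    """Return the declared version, or "1.0" when there is no declaration."""
    match = _XML_DECL.match(data)
    return match.group(2).decode("ascii") if match else "1.0"
```

A caller could then choose 1.0 or 1.1 line-end handling based on the returned string, and reject anything it does not recognise.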
Full details of why this has the potential to break things are on the XML news site Cafe Con Leche.
Please read that before making uninformed comments - news.com isn't where you'll find technical information about this problem.
Matt. Want XML + Apache + Stylesheets? Get AxKit.
here is a summary of just the proposed change.
It seems to comply with Unicode just fine, I don't see what the controversy is really.
Let us not forget that it was New Line that callously yanked Tolkien's loveable Tom Bombadil! It was New Line that turned Arwen into a heroic Nazgul-racing babe-elf! It was New Line that left out poor Glorfindel and his big moment at the river altogether!
I don't know about anyone else, but I think it's only fitting that a New Line character be messed with.
---
I have Unix underpants.
IBM isn't trying to make up for a flaw on their part, they're trying to introduce a new feature to everyone.
The $ sign is used as the end-of-string marker for function 09h of int 21h (print string), e.g.
;)
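org 100h                ; NASM syntax, DOS .COM program loaded at offset 100h
mov ah,09h              ; function 09h: print the '$'-terminated string at DS:DX
mov dx, TXT             ; DX = offset of the string
int 21h                 ; call DOS
int 20h                 ; terminate the program
TXT db 'hello, world!$' ; data goes after the code so it is never executed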
My x86 ain't what it used to be, so I'm awaiting endless corrections to this, but don't miss the point, people
PGP KeyId: 0x08D63965
2) ???
!#@%*)anks for hanging up the phone, dear.
First off, Unicode 85H is NEXT LINE; Windows-1252 85H is the ellipsis (ASCII itself is only 7 bits and stops at 7FH). Unicode 2028H is LINE SEPARATOR. 0AH and 0DH are the infamous LF and CR, which annoy MS/UNIX text converters.
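A quick way to check those code points, as a sketch using nothing but Python's standard library (the variable names are just for illustration):

```python
import unicodedata

# U+0085 is a C1 control character (general category Cc).
nel = chr(0x85)
assert unicodedata.category(nel) == "Cc"

# Read as Windows-1252, the byte 0x85 is U+2026 HORIZONTAL ELLIPSIS instead.
as_cp1252 = b"\x85".decode("cp1252")
assert unicodedata.name(as_cp1252) == "HORIZONTAL ELLIPSIS"

# U+2028 is the dedicated LINE SEPARATOR character (category Zl).
assert unicodedata.name("\u2028") == "LINE SEPARATOR"
assert unicodedata.category("\u2028") == "Zl"

# Python's own line splitting treats both U+0085 and U+2028 as line breaks.
assert "a\u0085b\u2028c".splitlines() == ["a", "b", "c"]
```

So the same byte value means a line break or three dots depending on which table you read it with, which is where a lot of the confusion comes from.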
Anyways, it sounds like a problem comes up here:
must be as if the XML processor normalized all line breaks in external parsed entities (including the document entity) on input, before parsing,
That seems to be the sticking point there.
I suppose the charset is specified in the document, but then again I'm not sure how literally they intend implementors to take the phrase "before parsing", since getting at the charset description involves some degree of parsing the document
All's true that is mistrusted
As the quote from IBM points out in the article, this issue is just a subset of the larger problem with Unicode compatibility in XML 1.0. And as someone else pointed out, if document creators are using the XML headers appropriately to begin with, then parsers would handle documents correctly anyway. I'm also willing to bet that the percentage of existing XML documents which contains this particular character (0x85), and which are not already on IBM mainframes, is *extremely* small.
Face it: this just isn't that big a deal. It's good for industry acceptance and propagation of the standard, at very low cost. Move along, there's nothing to see here.
main()
{
bool misspellings = true;
"IBM also dismissed the notion it was railroading the XML working group to serve its own ends."
What happens when a railroad meets the end of the track? I think I've seen something like that in the movies...
"The debate over XML 1.1, formerly known as XML Blueberry, has raged on Internet discussion forums"
Yes, the debate has been tremendous. But, of course it has something to do with godzilla...I mean, mozilla.
Also, there seems to be a lot of talk of unicorns...I mean, unicode.
"...an increasingly global standard for representing characters in computerized text."
First Earth, then the Milky Way, and eventually the Whole Universe(tm).
Ok, I am done with the sarcasm. (obligatory Simpsons quote)
"Do I know what rhetorical means?"
-Homer
EBCDIC has won yet another chess game against the Grim Reaper.
I'll admit that I don't know much about the technical side of xml (and I really can't see all of the great advantages to it, either), but since when does a parser care about whitespace? Wouldn't it make more sense to let the newline character match that of the overlying OS so people can actually TYPE those newline characters? Switching to unicode is fine and dandy, but what about all of those legacy systems that don't support it?
Do you really need reason for beer? Wingman Brewers
Don't get too worked up about this. They aren't deleting any newline characters like \n. Just adding one to the list that XML considers whitespace.
If one were to replace "IBM" with "Microsoft", I wonder what sort of self-righteous fury would be rained down for this sort of legacy document-smashing behavior (even if at the root, it's just bringing things in line with Unicode).
Just use Lisp sexps.
and if you want a markup language, instead of an inefficient tree description language, XML's no use anyway...
The real issue here is standards-committee wankery and the tendency of some people to accuse anyone who doesn't agree with them 100% of being proprietary, monopolistic, etc. This is exactly the sort of non-issue that doesn't deserve such rhetoric, and those who insist on crying wolf should be ejected from the process until they learn that "collaboration" doesn't mean "we rubber-stamp your ideas just because they're yours".
Slashdot - News for Herds. Stuff that Splatters.
No. XML is defined in terms of Unicode, but XML files can be stored in any encoding with a known mapping to Unicode.
Most XML files these days are using iso-8859-1 or UTF-8, both of which manage fine in vi/pico/gedit.
chars 32-127 are identical in ASCII and Unicode, and iso-8859-1 is exactly identical to the bottom 8-bits worth of Unicode.
Also, UTF-8 is an encoding of the full Unicode range that is backwards-compatible with 7-bit ASCII.
in any case, note the entry for LF in your parent post -- 000A (hex) = 10 (decimal).
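If anyone wants to verify those claims rather than take my word for it, this short Python check does it with the standard codecs (the sample string is made up):

```python
# Every iso-8859-1 byte decodes to the Unicode code point with the same number.
assert all(bytes([b]).decode("iso-8859-1") == chr(b) for b in range(256))

# Pure 7-bit ASCII text has the same bytes in ASCII, iso-8859-1 and UTF-8.
sample = "<doc>plain ASCII</doc>\n"
assert sample.encode("ascii") == sample.encode("iso-8859-1") == sample.encode("utf-8")

# LF is code point 0x000A, decimal 10.
assert ord("\n") == 0x0A == 10
```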
DNA just wants to be free...
No, I'm not.
Karma: Undead.
damn that lameness filter
Reason: Your comment looks too much like ascii art.
ever wondered what alt-gr is for?
raw a e i o u 4 `
alt-gr á é í ó ú ¦
There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
The Slashdot commentary has been pretty one-sided so I'll try and address the other side. First, IBM has said that this fix is for their mainframe customers, not for themselves. But nobody in the XML world has heard from these customers. As far as I know, no user has submitted a request for this NEL feature. No user has sent a message to the many XML mailing lists. No user has posted to Slashdot. Updating all of the XML parsers in the world is really expensive and if the mainframers don't care enough about the problem to storm the gates then maybe it isn't hurting them that badly. So from a democratic point of view, we're going to make life harder for the people who care enough to scream out loud in order to make life easier for the small minority who perhaps are not even that badly impacted.
Further discussion is on xml.com.
This is just a way to spark a holy war "my newline character is better than yours" debate. The proposal makes perfect sense - it brings XML into line with Unicode and ISO-10646.
Okay...maybe I'm looking at this incorrectly, but...
If IBM's problem is that they don't want to force everyone to update their mainframes and cause them a headache, won't they still have to upgrade their mainframes to support XML 1.1 with new XML 1.1 compatible parsers?
Eric B
ebresie@gmail.com
Who ever heard of an end-of-line character on a mainframe? Everything was always fixed length, with the length in the DCB, or variable length, with a length prefix, at least back when I used them. There was a "record mark" character, but that was used back on the 1401's, back in ancient times.
Just to clarify, what happens to a 0085 (NEL) character when a Unicode file is saved in an iso-8859-1 or UTF-8 encoding?
Patrick Doyle
I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
æe
Consider the brilliance of XHTML 2.0 abolishing the <br/> tag in favor of something that would read like this:
<p>
<line>public class HelloWorld {</line>
<line>public static void main (String[] args){ </line>
<line>System.out.println("Hello world!"); </line>
<line>}</line>
<line>}</line>
</p>
Why in the world should you expect the XML standard to be in compliance with the Unicode standard?
I'm dealing with some cross-platform XML these days. It's generally pretty wonderful, but the newline character is something that drives me a bit batty. If anyone can bring some unity to this disunity, I'm sure that all of the XML world and the Java world would be better off. It's an anachronism.
There's a reason XML supports multiple types of newlines: because there are platforms out there that use those other types of newlines. XML 1.1 standardizes on a 0x0a newline, but recognizes that there are other types out there, and requires parsers to normalize this before parsing the XML. All the specification is doing is adding another type of newline that a fairly popular platform uses. This is no different than adding MacOS-specific and Windows-specific newline types to XML in the first place. The goal is to allow platforms to store XML data as a native text document instead of forcing newlines that cause XML documents to be treated as awkward binary data on those platforms that don't have XML-compatible newline conventions.
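The normalization step is small enough to sketch in Python. This is an illustration and not code from any real parser: the function names are made up, and the lists of line-end forms are my reading of the end-of-line handling sections of the two specs.

```python
import re

# Line-end forms that XML 1.1 folds into a single #xA:
# CR LF, CR NEL, NEL, LINE SEPARATOR, and any remaining lone CR.
_XML11_LINE_END = re.compile("\r\n|\r\u0085|\u0085|\u2028|\r")

def normalize_line_ends_xml11(text: str) -> str:
    return _XML11_LINE_END.sub("\n", text)

# XML 1.0 only folds CR LF and lone CR.
_XML10_LINE_END = re.compile("\r\n|\r")

def normalize_line_ends_xml10(text: str) -> str:
    return _XML10_LINE_END.sub("\n", text)
```

Run the same text through both and the only difference is that the 1.0 version leaves U+0085 and U+2028 in the character data, which is the whole change in a nutshell.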
This whole thread is retarded. Few people posting all of this FUD seem to know the difference between a character encoding, a character set, and a pimple on their ass. This change in XML 1.1 changes nothing for anyone, except those that want to write XML 1.1 parsers and those on platforms that use this Unicode newline as their native newline character.
If we're going to throw up such a fuss about this one addition (which in NO way breaks ANY existing XML 1.0 documents), why aren't we throwing up a fuss about including MacOS or Windows newlines into XML in the first place? GET RID OF THEM ALL! UNIX NEWLINES ARE THE ONLY TRUE NEWLINES!!#!!@#$
Jesus, people...
It's the same thing that happens any time you take codepoints across character sets.
iso-8859-1:
NEL (from EBCDIC) doesn't exist in iso-8859-1; what character gets used instead is up to the discretion of the tool doing the conversion. Since Unicode defines 0x0085 as a whitespace character, the tool could know to substitute another if that was desired.
UTF-8:
It's a non-lossy encoding of Unicode. All characters above 0x007f get stored as multi-byte sequences. 0x0085 becomes 0xc2 0x85.
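Here is the UTF-8 case as a quick Python check, plus what a strict converter does when the target has no slot for the character. Plain ASCII is used as the stand-in target, and the "?" substitution is just Python's default replacement policy, not anything a particular tool is known to do.

```python
nel = "\u0085"

# UTF-8 stores U+0085 as the two bytes C2 85, and the round trip is lossless.
utf8 = nel.encode("utf-8")
assert utf8 == b"\xc2\x85"
assert utf8.decode("utf-8") == nel

# A target with no slot for the character either fails outright...
try:
    nel.encode("ascii")
    lossless = True
except UnicodeEncodeError:
    lossless = False
assert lossless is False

# ...or substitutes something, and what it substitutes is the tool's choice.
assert nel.encode("ascii", errors="replace") == b"?"
```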
DNA just wants to be free...
If you want to know more about UTF-8, see RFC 2279.
DNA just wants to be free...
The problem is caused by the EBCDIC code. It has separate codes for CARRIAGE RETURN (CR), LINE FEED (LF) and NEW LINE (NL). The question is how to convert NL into Unicode: you could map it to LF, but then it can't be mapped back. You could also map it to LINE SEPARATOR U+2028, but IBM seems to think that mapping it to the NEL (Next Line) U+0085 control character is appropriate. This is supported by Unicode Standard Annex UAX #13. However, in UAX #14, U+0085 does not have the line-breaking property, so there is still some inconsistency in the Unicode standard. But I don't think this is a major issue. It only means that you will have problems editing XML documents generated in EBCDIC in some of the worse editors. We have lived all these years with the DOS CRLF/UNIX LF problem, so we will survive this too.
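You can see the mapping with Python's cp037 codec, which is one common EBCDIC code page; other EBCDIC variants may map things differently, so treat the byte values below as specific to that codec. The sample record is made up.

```python
EBCDIC_NL = b"\x15"  # NEW LINE in code page 037
EBCDIC_LF = b"\x25"  # LINE FEED in code page 037

# The codec maps NL to U+0085 and LF to U+000A, and both map back again.
assert EBCDIC_NL.decode("cp037") == "\u0085"
assert EBCDIC_LF.decode("cp037") == "\n"
assert "\u0085".encode("cp037") == EBCDIC_NL
assert "\n".encode("cp037") == EBCDIC_LF

# Folding NL into LF on the Unicode side loses the distinction for good.
record = "Hi".encode("cp037") + EBCDIC_NL
collapsed = record.decode("cp037").replace("\u0085", "\n")
assert collapsed.encode("cp037") != record
assert collapsed.encode("cp037") == "Hi".encode("cp037") + EBCDIC_LF
```

That last pair of assertions is the "can't be mapped back" problem: once NL has become LF, the trip back to EBCDIC produces a different byte.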
wow, your knowledge of newlines is not just theoretical!
The main reason is that it does not pass cleanly through UTF-8 encoding, and thus it is far too easy to write software that misses it, and writing correct software is inefficient. Anything that assigns a non-tokenizing meaning to characters > 0x7f will mean the UTF-8 must be decoded to parse the file. If instead all characters > 0x7f are considered parts of identifiers/words then the parser can easily be byte-based and use lookup tables to correctly match words. In fact this can be very safe because it will prevent illegal UTF-8 encodings: provided the parsers check the encoding when *adding* words to their hash tables, they have no need to check encodings when looking up words, as illegal long encodings will not match.
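A small Python sketch of the pitfall, with invented function names: a byte-based scanner that looks for the single byte 0x85 gets false hits in UTF-8, because that byte also shows up as a continuation byte inside other characters.

```python
def naive_has_nel(data: bytes) -> bool:
    # Byte-level check: wrong for UTF-8, where 0x85 is also a continuation byte.
    return b"\x85" in data

def decoded_has_nel(data: bytes) -> bool:
    # Decode first, then look for the code point U+0085.
    return "\u0085" in data.decode("utf-8")

def sequence_has_nel(data: bytes) -> bool:
    # Byte-level search for the full two-byte sequence C2 85.
    return b"\xc2\x85" in data

a_ring = "\u00c5".encode("utf-8")  # U+00C5 is stored as C3 85
assert a_ring == b"\xc3\x85"
assert naive_has_nel(a_ring)
assert not decoded_has_nel(a_ring)
assert not sequence_has_nel(a_ring)

real_nel = "x\u0085y".encode("utf-8")
assert decoded_has_nel(real_nel) and sequence_has_nel(real_nel)
```

To be fair to the other side, matching the whole two-byte sequence works on valid UTF-8 without decoding, since C2 can only ever be a lead byte. A byte-based parser just has to know to look for two bytes instead of one.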
In addition there is no good reason for the "hole" in 0x80 through 0x9F. It was done so that systems that stripped the high bit would not accidentally map a foreign letter to a control code. However all such systems are long obsolete and would barf on UTF-8 encoding anyway (since it does use these codes).
My recommendation, strange as it may seem for something on Slashdot, is to use the MicroSoft assignments used in Word for these codes (usually seen as the "smart quotes" output). This is apparently called the "CP1252 superset" in MicroSoft-speak. In this encoding 0x85 is the ellipsis (...).
Why use this arbitrary standard from the evil company? Mostly because it is by far the most-used standard for these codes. Also I think the assignments are pretty good, they were selected for actual use in the (admittedly euro-centric) modern world of office software, and are not tainted with the nasty political correctness that has so messed up and delayed Unicode assignments.
I recommend two things: first that all characters >0x7f in Unicode be considered "part of an identifier" by all parsers. Second that Unicode have the MicroSoft CP1252 assignments added to the codes 0x80-0x9f.
Compare the elegance of LISP brackets to overbloated XML tags. Compare the ability to share same syntax with code and data in LISP with oversimplified XML tags. Make a conclusion.
"I shall explain this by waving my hands about in an appropriate manner." -- Cambridge University Math Dept.
There is a good reason for this: those characters are hard to identify in UTF-8 and some other encodings without actually decoding to Unicode. Unfortunately this is also true of 0x85. Therefore I think this proposal is bad.
Also there are a *lot* of people (all users of MicroSoft Word) who think this character is an ellipsis.
Also there are a *lot* of people (all users of MicroSoft Word) who think this character is an ellipsis.
Ummmm... Anyone using Microsoft Word to edit XML is going to have quite a few problems...
He's got a point, though I don't know if he's right with everything he's saying. You don't use MS Word to edit XML, but you often use XML to encode text that started out (on someone else's desktop, perhaps) as Word text.
I know it is, I just don't understand the rationale.
I've always thought that this was just begging for problems. I don't understand the upside. Seems like whitespace is presentation, which should be handled by Formatting Objects or something.
Full details my ass. What a load of crap.
you often use XML to encode text that started out (on someone else's desktop, perhaps) as Word text.
That's fine, but still shouldn't be using 0x85 in the XML to represent "..." in the document.
This is the kind of approach that many companies (especially Microsoft) use that makes life difficult for others. Speed at the cost of correctness sounds good in the short term but always ends up biting you in the ass later. Efficient yet incorrect code is hard to maintain because it usually only makes sense to the original author. It breaks more easily than correct code since it is less tolerant to variations in the input. And eventually the performance delta is erased by hacks added in later revisions to get around problems caused by the original incorrect behavior.
...and you can help support it. The complaints I keep reading are all about how tough it will be for those poor XML tools makers. As an XML tool maker, I assure you that this upgrade is no sweat. No one with the technical skill to create a serious XML tool is going to be challenged by this. The universality goals of Unicode and XML mesh very nicely and it's worth continuing to incorporate the lessons learned by each into the other.
"Those who have never entered upon scientific pursuits know not a tithe of the poetry by which they are surrounded."
These proposals have been around since at least June 2001, when the W3C published their Requirements document for what was then called XML Blueberry and has since become XML 1.1.
And the complaints date from then as well... Elliotte Rusty Harold complained almost as soon as the Requirements document and the first Working Draft were published. He makes a number of good points that highlight just how unnecessary XML 1.1 actually is. This link is actually him quoting himself from the time - the original post is probably available on the W3C forums, but I'm far too lazy to look.
That's fine, but still shouldn't be using 0x85 in the XML to represent "..." in the document.
No, but a lot of the time, you're encoding text that was created by a user who doesn't know what the heck they're doing, and can't handle anything more sophisticated than MS Word.
vi is [[13~^[[15~^[[15~^[[19~^[[18~^ a :x :wq dang it :w:w:w :x ^C^C^Z^D
muk[^[[29~^[[34~^[[26~^[[32~^ch better editor than this emacs. I know
I^[[14~'ll get flamed for this but the truth has to be
said. ^[[D^[[D^[[D^[[D ^[[D^[^[[D^[[D^[[B^
exit ^X^C quit
-- Jesper Lauridsen from alt.religion.emacs
- this post brought to you by the Automated Last Post Generator...