XML 1.1 Spec Hits Some Snags
oever writes "News.com reports that the new XML 1.1 specification defines a new newline character, making it incompatible with the 1.0 specifiation. Apparently, IBM has been pushing the new character to avoid having to modify their software, thereby invalidating everybody else's XML software."
Unicode 3.2 define 0x85 as a newline character. This change just make XML follow the Unicode spec, which isn't unreasonable considering that the parser is expected to use Unicode internally (or to act as if it does).
1.1 : To simplify the tasks of applications, the characters passed to an application by the XML processor must be as if the XML processor normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating all of the following to a single #xA character:
I don't get it, whats the problem here? Surely the 1.1 spec simply extends the available EOL characters. It certainly doesn't remove any existing characters that are present in the 1.0 spec. How does it break backwards compatability?
A newline character should have no impact on with respect to backwards compatibility. The only negative impact with regards to a newline character should be contained to poorly written DOM code that parses out all nodes instead of just relavent nodes. Similar issue with SAX. Even if there were a backwards compatibility issue with a new XML spec, most people define their version number in their documents so the parser knows which spec to follow while parsing it.
Full details of why this has the potential to break things are on the XML news site Cafe Con Leche.
Please read that before making uninformed comments - news.com isn't where you'll find technical information about this problem.
Matt. Want XML + Apache + Stylesheets? Get AxKit.
here is a summary of just the proposed change.
It seems to comply with unicode just fine, I dont see what the controversy is really.
From the XML 1.1 spec
The W3C's XML 1.0 Recommendation was first issued in 1998, and despite the issuance of many errata culminating in a Second Edition of 2000, has remained (by intention) unchanged with respect to what is well-formed XML and what is not. This stability has been extremely useful for interoperability. However, the Unicode Standard on which XML 1.0 relies for character specifications has not remained static, evolving from version 2.0 to version 3.1 and beyond. Characters not present in Unicode 2.0 may already be used in XML 1.0 character data. However, they are not allowed in XML names such as element type names, attribute names, enumerated attribute values, processing instruction targets, and so on. In addition, some characters that should have been permitted in XML names were not, due to oversights and inconsistencies in Unicode 2.0
So XML 1.0 used Unicode 2.0, but not properly. XML 1.1 fixes that, and defines that all Unicode 3.2 byte pairs are now valid when used in an XML document. As part of this change, XML 1.1 also correctly allows the use of the Unicode 0x0085 NEL character as an EOL marker, which is totally compliant and consistent with the Unicode 3.2 specification.
In other words, if you're using any character encoding other than Unicode 3.2, your XML document isn't compliant with XML 1.1 and you shouldn't ever expect ISO-Latin 0x85 to be displayed as an ellipses.
Using the paragraph tags as large linebreaks is a very bad habit from the bad old days of the web. Please head to W3C and study the recent standards, and validate your documents before publishing (using a validator, not a browser).
Ohh... And this is actually an issue about XML 1.1 unicode support, so worrying about HTML is quite premature (XHTML is still XML 1.0, and will remain so untill XML 1.1 becomes a standard (or recommendation in W3C-speak).
Each of them has a different function. 000A and 000D are for compatability with ASCII. 0085 is for a unified character to replace the 000D 000A pair used on some OS's. However, some programs (eg notepad) use line breaks when they really mean paragraph seperators, so Unicode defined two codes which mean REAL line seperator, and REAL paragraph seperator. This report explains it quite clearly.
No. XML is defined in terms of Unicode, but XML files can be stored in any encoding with a known mapping to Unicode.
Most XML files these days are using iso-8859-1 or UTF-8, both of which manage fine in vi/pico/gedit.
chars 32-127 are identical in ASCII and Unicode, and iso-8859-1 is exactly identical to the bottom 8-bits worth of Unicode.
Also, UTF-8 is an encoding of the full Unicode range that is backwards-compatible with 7-bit ASCII.
in any case, note the entry for LF in your parent post -- 000A (hex) = 10 (decimal).
DNA just wants to be free...
The problem is that the data with the NEL character already exists, and is already generating these types of errors when it's translated into XML (or when XML is generated on these mainframes). From the original change proposal by IBM:
Problem areas include:
* Processing XML documents or DTDs generated on OS/390 systems, with XML 1.0 compliant parsers.
* Processing XML documents or DTDs, using native OS/390 system tools.
* Processing XML documents or DTDs retrieved from OS/390 database or file systems, in non-OS/390 environments.
XML documents that contain [NEL] characters are declared invalid or not well-formed by XML 1.0 compliant parsers.
Essentially 'just fix the software' involves operating system-level changes as well as possibly changes to most software that interprets NEL characters on that OS. As it stands, they're going to have a problem anyway, and it's probably best to simply add the change to the XML standard to fix what was essentially an osmission in the 1.0 standard.
-PainKilleR-[CE]
Before XHTML, the p tag had no closing tag. The end of a paragraph was defined by the beginning of the next one. So that wasn't necessarily a "bad habit" back then.
# P
This is not correct. The P tag ALWAYS had a closing tag, but it was not REQUIRED. Here's the P tag section for HTML 2.0:
http://www.w3.org/MarkUp/html-spec/L2index.html
was still allowed, but all it did was start a new paragraph.
Actually, it just ends the current paragraph. You can't use the closing tag without the beginning tag and still have well-formed HTML. Since it wasn't required, though, there are probably many different methods of handling it as far as HTML parsers/browsers are concerned. The only reason it would start a new paragraph is because it designates the end of the current paragraph, and another paragraph is what typically follows the end of a paragraph within a document.
-PainKilleR-[CE]