Does the World Need Binary XML?
sebFlyte writes "One of XML's founders says 'If I were world dictator, I'd put a kibosh on binary XML' in this interesting look at what can be done to make XML better, faster and stronger."
Binary XML is nothing new, as I wager that many people here are already using it, albeit unknowingly.
One of the earliest projects that has tried to make a binary XML (as far as I'm aware) was the EBML (Extensible Binary Meta-Language) which is used in the Matroska media container.
Gzip decompression is built into Internet Explorer; it's used all the time to speed up the transfer of HTML to clients.
There's no reason why it couldn't be used for XML just as it is for HTML.
Ewan
While I do like Bzip2 and Gzip better, zip is open. There are numerous open source compression/decompression libraries for it.
http://news.com.com/5208-7345-0.html?forumID=1&threadID=4163&messageID=23888&start=-1
The design goals for XML are:
1. XML shall be straightforwardly usable over the Internet.
Grade: A
2. XML shall support a wide variety of applications.
Grade: B
3. XML shall be compatible with SGML.
Grade: don't know / don't care
4. It shall be easy to write programs which process XML documents.
Grade: F
5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
Grade: F
6. XML documents should be human-legible and reasonably clear.
Grade: F
7. The XML design should be prepared quickly.
Grade: F
8. The design of XML shall be formal and concise.
Grade: C
9. XML documents shall be easy to create.
Grade: C
10. Terseness in XML markup is of minimal importance.
Grade: A+
The problem is that XML is being used for web services which are unlike HTML: the requesting machine will not like waiting 2-3 seconds for the response to the method call. These are interoperating applications, not people downloading text to read, so the response time is much more critical.
I agree that gzip compression is a simple solution to the network problem. It does not address the parsing time problem, and in fact exacerbates it, but in my opinion the network issue is the big one. Time works in favor of faster parsing (faster processors), but works against network issues (more congestion). I would go with compression, test the results, and only then look into a binary solution.
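A minimal sketch of that trade-off, using only the Python standard library. The `<readings>` payload is made up for illustration; the point is that gzip shrinks the bytes on the wire, while the receiver still has to inflate and then parse the full text.

```python
import gzip
import xml.etree.ElementTree as ET

# Hypothetical verbose payload: 1,000 repeated <reading> elements.
rows = "".join(
    f"<reading><sensor>s{i % 10}</sensor><value>{i * 0.5}</value></reading>"
    for i in range(1000)
)
xml_bytes = f"<readings>{rows}</readings>".encode("utf-8")

# Network side: repetitive tags compress well.
compressed = gzip.compress(xml_bytes)

# Receiver side: inflate first, then parse every byte of the original text.
restored = gzip.decompress(compressed)
root = ET.fromstring(restored)
reading_count = len(root.findall("reading"))

print(len(xml_bytes), len(compressed), reading_count)
```

Compression helps the network cost only; the parse step afterwards does the same work as before, plus the inflate.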
On the surface that works, but it only solves a portion of the problem.
Data => XML.
XML == large (lots of verbose tags)
XML == slow (have to parse it all [DOM], or build big stacks [SAX] to get at data)
Solution:
XML => .xml.gz
You've solved (kind of) the large problem, but you still keep the slow problem.
What they're suggesting is nothing more than:
XML => .xml.gzxml
Basically, that means using a specialized compression scheme that understands the ordered structure of XML (tags, attributes, etc.) and probably has some indexes that say "here are the locations of all the [blah] tags", so you can just fseek() instead of having to do DOM walking or stack building. This is important for XML selectors (XQuery), and for "big iron" junk it makes a lot of sense and can save a lot of processing power. Consider that Zip/Tar already do something similar by providing a file-list header as part of their specifications (wouldn't it suck to have to completely unzip a zip file when all you wanted was to pull out a list of the filenames and sizes?)
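The zip half of that comparison can be sketched with Python's standard `zipfile` module. A zip archive keeps a central directory of its entries, so names and sizes can be read without inflating any member. The archive contents here are invented for the example.

```python
import io
import zipfile

# Hypothetical archive members, built in memory for the example.
config_xml = "<config><debug>false</debug></config>"
orders_xml = "<orders>" + "<order/>" * 500 + "</orders>"

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("config.xml", config_xml)
    zf.writestr("data/orders.xml", orders_xml)

# Listing reads only the archive's directory; no member is decompressed.
with zipfile.ZipFile(buf) as zf:
    listing = [(info.filename, info.file_size) for info in zf.infolist()]

print(listing)
```

An indexed binary XML format would presumably aim for the same property: jump to the part you want instead of walking everything before it.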
"Consumer"/desktop applications already compress XML (look at StarOffice as a great example; even JAR is just zipped-up stuff that can include XML configs, etc.). It's the stream-based data processors that really benefit from a standardized binary transmission format for XML with some convenient indexes built in.
That is all.
--Robert
The problem is that not everything in a typical XML message is text, so there can be a lot of translation going on between XML text and the binary format that an application needs (e.g., double). In our tests we've found XML to be 100x - 250x SLOWER than other approaches (e.g., JMS MapMessage). (FWIW, the 100x is using the MS parser, the 250x is with Xerces/Xalan). For high-volume, high-performance apps that's simply intolerable. Note that this has nothing to do with size on the wire, which is another consideration entirely.
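To make the translation step concrete, here is a small sketch in Python of the two routes a double can take. It does not reproduce the timings quoted above; it only shows that the text route needs a format and a parse on each side, while the binary route copies the 8 bytes of the IEEE 754 value.

```python
import struct

value = 3.141592653589793

# Text route, as in an XML element body: format to a string, parse it back.
as_text = repr(value).encode("ascii")
from_text = float(as_text.decode("ascii"))

# Binary route: the 8 bytes of an IEEE 754 double, big-endian.
as_binary = struct.pack(">d", value)
(from_binary,) = struct.unpack(">d", as_binary)

print(len(as_text), len(as_binary))
```

Both routes round-trip the value here; the difference is the conversion work, and that work is repeated for every numeric field in every message.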
Three ideas, in order of increasing significance and increasing difficulty:
Stop using bad DTDs. There seems to be a DTD style in which you avoid using attributes and instead add a whole lot of tags containing text. Any element whose content is only text (#PCDATA) should be an attribute on its parent, which improves the readability of documents and lets you use ID/IDREF to automatically check stuff. Once you get rid of the complete cruft, it's not nearly so bad.
Now that everything other than HTML is generally valid XML, it's possible to get rid of a lot of the verbosity of XML, too. A new XML could make all close tags "</", since the name of the element you're closing is predetermined and there's nothing permitted after a slash other than a >. The > could be dropped from empty tags, too. If you know that your DTD will be available and not change during the life of the document, you could use numeric references in open tags to refer to the indexed child element type of the type of the element you're in, and numeric references for the indexed attribute of the element it's on. If you then drop the spaces after close quotes, you've basically removed all of the superfluous size of XML without using a binary format, as well as making string comparisons unnecessary in the parser.
Of course, you could document it as if it were binary. An open tag is indicated with an 0x3C, followed by the index of the element type plus 0x30 (for indices under 0xA). A close tag is (big-endian) 0x3C2F. A non-close tag is an open tag if it ends with an 0x3E and an empty tag if it ends with an 0x2F. Attribute indices are followed with an 0x3D. And so forth.
What should we use instead of XML to encapsulate RPC calls? Something at least semi-human-readable, please. I don't need to be able to read a graphic image, but I'd like to see the name of the method I'm calling, and at least string and text parameters.
And when someone sends me a bunch of data they want importing into a database, in what format should they send it? I'd like to be able to ensure that their data is correct before giving it to my import routine, and when my validator says there's an error, I'd like to be able to see what's wrong by eye.
Suggestions?
to make inaccurate interpretations of the data and not using proper and accurate specifications.
:(.
Many people claim that XML is so great because you can "just read and understand it" without having to use cumbersome and hard-to-understand specifications. That is exactly what makes XML nice for typesetting purposes like HTML, and maybe as an alternative for simple configuration files, but NOT for RPC and databases, as you write. I couldn't agree more.
I have seen so much time and money lost due to intuitive but false interpretations of XML schemas. People think that because it's human-readable with "meaningful" tag names, they don't need a proper spec anymore. Well, I guess it fits in nicely with today's "cut and paste" programmers who don't really know what they're doing.
The "Fast Infoset Project" for creating Binary XML as mentioned in the article is using ASN.1. See this blog entry by Rick Jelliffe for details.
Fast Infoset is to ASN.1 what XML is to SGML. At least if it becomes the standard anyway.