Does the World Need Binary XML?
sebFlyte writes "One of XML's founders says 'If I were world dictator, I'd put a kibosh on binary XML' in this interesting look at what can be done to make XML better, faster and stronger."
On the face of it, compressing XML documents by using a different file format may seem like a reasonable way to address sluggish performance. But the very idea has many people -- including an XML pioneer within Sun -- worried that incompatible versions of XML will result.
I agree with his point.
What's wrong with just compressing the XML as it is with an open and easy-to-implement algorithm like gzip or bzip2?
I don't need no instructions to know how to rock!!!!
Somebody fill me in ...
... it's called zipping; most web servers have an option to zip the data up as it streams to the client browser
I fail to see the need for a "binary xml" file format when there are already facilities in place to compress text streams
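For anyone who wants to see what "just zip it" buys, here is a minimal sketch using Python's standard gzip module. The customer records are made up purely for illustration, and the ratio you get will depend entirely on your own data.

```python
import gzip

# Hypothetical, deliberately repetitive XML payload (made up for illustration).
records = "".join(
    f"<customer><id>{i}</id><name>Customer {i}</name></customer>"
    for i in range(1000)
)
xml_text = f"<customers>{records}</customers>"
raw = xml_text.encode("utf-8")

packed = gzip.compress(raw)         # roughly what a web server would put on the wire
unpacked = gzip.decompress(packed)  # what the client gets back before parsing

print(len(raw), "bytes as plain XML")
print(len(packed), "bytes after gzip")
```

Note that the client still has to parse the full-size text after decompressing, so this only helps with transmission, not with parsing cost.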
IBM has actually tried to introduce some goofy stuff into the XML standards, like line breaks, etc, that should not be in a pure node-based system like XML. Why aren't you picking on them in your comment?
As far as SOAP and XML Web Services (standardized protocols for XML RPC transactions) go, Microsoft was way ahead of the pack. And I rather enjoy using their rich set of .NET XML classes to talk to our Unix servers. It helps my company interop.
Great ideas often receive violent opposition from mediocre minds. - Albert Einstein
However, if anything, XML has shown us the power of well-structured information. XML has given the possibility of universal interoperability. Developments in XML-based technologies have led us to the point where we know enough now to create a standard for structured information that will last for several decades.
It's time that we had a new ASCII. That standard should be binary XML.
When I think of the time that has been wasted by every developer in the history of Computer Science, writing and rewriting basic parsing code, I shudder. Binary XML would produce a standard such that an efficient, universal data structure language would allow significant advances in what is technically possible with our data. For example: why is what we put on disk any different from what's in memory? Binary XML could erase this distinction.
A binary XML standard needs to become ubiquitous, so that just as Notepad can open any ASCII file today, SuperNotepad could open any file in existence, or look at any portion of your computer's memory, in an informative, structured manner. What's more, we have the technology to do this now.
In my hands, bzip compresses better, but is somewhere between somewhat slower and orders of magnitude slower on my system, depending on the options used to invoke the command and the size of the file being compressed. gzip is fast, works on streams instead of blocks, and is available on nearly every system.
I think that's where the true problem lies. HTTP.
We need to look towards http 2.0. What I would want:
- pipelining that works, so that it could be enabled for use on any server that supports http 2.0
- gzip and 7zip support.
- All data is compressed by default (with a few exclusions such as .gz files, .zip files etc., since compressing those would be pointless).
- Option to initiate a persistent connection (remove the stateless protocol concept), via an http header on connect. This would allow for a whole new level for web applications via SOAP/XML.
There are tons of other things that could be enhanced for today's uses.
HTTP is the problem. Not XML
From Re: Lisp syntax, what about resynchronization?
Attributes in XML are inherited from SGML, whose designers were thinking of markup for textual documents. When you want to represent data, whether something is an attribute or not is completely irrelevant.
Deep explanation: From: The horror that is XML
Dyslexics have more fnu.
I think that Binary XML misses the mark on bringing any real benefits other than transmission compression. XML can be a huge benefit from a human and coding perspective, but it also has drawbacks in transmission (due to size) and in processing (again due to size). A lot of XML data goes thru many different processing systems that never need descriptive tags and the overhead of document size can bog down some very large computers.
I know, try to do an XSLT on a 60 meg file.
One approach that could potentially benefit everyone is to have interchangeable namespaces. By that I mean have a human-readable namespace that also has a machine-friendly namespace.
In the human version you could have those wonderful long tags like [FirstNameOfMyGrandmothersThirdCousin] and have a transform that would make that [ID1001] for machine processing.
You can save a ton of space by swapping out all of the Element and Attribute names while holding structure, which lets machines process more efficiently, and then if a human or UI needs descriptive information, you could go grab the friendly namespace and be back to your large XML file.
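A rough sketch of that swap, using Python's standard xml.etree.ElementTree. The mapping table and the Person element are hypothetical, and this version only renames elements, not attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping between human-friendly and machine-friendly names.
LONG_TO_SHORT = {
    "FirstNameOfMyGrandmothersThirdCousin": "ID1001",
    "Person": "ID1002",
}
SHORT_TO_LONG = {short: long for long, short in LONG_TO_SHORT.items()}

def rename_tags(xml_text, mapping):
    """Return xml_text with every element name found in mapping swapped out."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = mapping.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")

human = (
    "<Person><FirstNameOfMyGrandmothersThirdCousin>Ada"
    "</FirstNameOfMyGrandmothersThirdCousin></Person>"
)
machine = rename_tags(human, LONG_TO_SHORT)      # compact form for processing
restored = rename_tags(machine, SHORT_TO_LONG)   # back to the friendly names

print(machine)
print(len(human), "->", len(machine), "characters")
```

Both sides need the same mapping table, so in practice it would have to be published alongside the schema.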
Had data to be delivered to a client, dumped from a database. As flat files they were ~20MB in size. That bloated to ~120MB after conversion to XML.
The client attempted to open it in a DOM-based application which I suspect used recursion to parse the data (easy to code, recursion). Needless to say it brought their server to its knees.
We switched to flat files shortly thereafter.
In my problem domain, where 20MB is a small data set, XML is useless. XML does not seem to scale well at all (though using a SAX parser helps at times).
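For comparison with the DOM approach, here is a minimal streaming sketch. It uses ElementTree's iterparse rather than a true SAX handler, and the dump/row element names are assumptions since the actual format isn't given.

```python
import io
import xml.etree.ElementTree as ET

def count_rows(stream, row_tag="row"):
    """Count row elements without holding every row's contents in memory."""
    count = 0
    for _event, element in ET.iterparse(stream, events=("end",)):
        if element.tag == row_tag:
            count += 1
            # Drop this row's children and text once handled. The emptied
            # element itself still hangs off the root, so memory is reduced,
            # not constant.
            element.clear()
    return count

# Hypothetical two-row dump standing in for the real 120MB file.
sample = io.BytesIO(b"<dump><row><id>1</id></row><row><id>2</id></row></dump>")
print(count_rows(sample))
```

With a real file you would pass the open file object (or a path) instead of the BytesIO.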
YMMV.
putting the 'B' in LGBTQ+
Wbxml is very compact, easy to parse and it's standardized too. Have a look at http://www.w3.org/TR/wbxml/ .
<SomeTagName>some character data</SomeTagName>
According to the XML spec, the closing tag must close the nearest opening tag. So why does it have to include the opening tag's name? This is 100% redundant information, and is included in every XML tag with children or cdata. An obvious compression would be to replace this with:
<SomeTagName>some character data</>
I really don't know why this wasn't done from the outset (backwards compatibility with HTML, where tags often overlap - although they're not meant to - I suppose). Either allow tags to overlap (which allows some more interesting data structures to be easily encoded in XML) or make the name optional in the closing tag.
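To show that the closing name really is recoverable, here is a small sketch that strips the names out and puts them back with a stack. Note that `</>` is not legal XML, so this is only a private encoding. The regex is an assumption-laden shortcut: it ignores comments, CDATA sections, processing instructions, and attribute values containing ">".

```python
import re

# Matches any tag: optional "/", the element name, then whatever is left.
TAG = re.compile(r"<(/?)([^\s<>/]*)([^<>]*)>")

def shorten(xml_text):
    """Replace every named closing tag with the anonymous form </>."""
    return re.sub(r"</[^<>]+>", "</>", xml_text)

def restore(short_text):
    """Rebuild the closing-tag names by tracking the open elements on a stack."""
    stack = []

    def fix(match):
        closing, name, rest = match.groups()
        if closing:
            return f"</{stack.pop()}>"   # nearest open element closes first
        if not rest.endswith("/"):       # self-closing tags never get pushed
            stack.append(name)
        return match.group(0)

    return TAG.sub(fix, short_text)

original = '<a x="1"><b>hi</b><br/><c>yo</c></a>'
short = shorten(original)
print(short)
print(restore(short) == original)
```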
I am TheRaven on Soylent News
Take Microsoft's XML formats as an example: Word, or the MSN messages format... they're _NOT_ xml. They're proprietary formats DISGUISED as XML.
If Microsoft doesn't respect text-only XML, what do you think will happen when^H^H^H^Hif binary XML is out?
DNS is binary; does that make it proprietary? Not at all. It is a published open standard in RFC 883 and later documents. Other examples include ASN.1/BER as used in SNMP. It's not whether it is binary or text that matters; it's whether it is openly documented and unencumbered by intellectual property claims (a separate issue some of XML has).
The decision of binary vs. text for a format should be the result of specific needs. XML is verbose. XML can be compressed for transmission purposes, but it still has to be uncompressed to its verbose form for parsing. If speed in parsing is necessary (it might be, as I have noticed quite a few XML-based programs are rather slow), a binary format can have things like length prefixes and continuation tags, instead of having to detect and verify collections of characters whose position is unknown. A parser that does not recognize a given tag, or does not need to process it, can in a binary format simply skip it by jumping the specified number of bytes. A binary format is well suited to machine processing.
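A minimal sketch of the length-prefix idea, using Python's struct module. The record layout (2-byte tag id, 4-byte length) and the tag numbers are invented for illustration and do not correspond to any proposed binary XML format.

```python
import struct

# Hypothetical header: 2-byte tag id, 4-byte payload length, big-endian.
HEADER = struct.Struct(">HI")

def encode(records):
    """Pack (tag_id, payload_bytes) pairs into one length-prefixed buffer."""
    return b"".join(
        HEADER.pack(tag, len(payload)) + payload for tag, payload in records
    )

def read_known(buffer, known_tags):
    """Yield (tag_id, payload) for known tags, skipping the rest unread."""
    offset = 0
    while offset < len(buffer):
        tag, length = HEADER.unpack_from(buffer, offset)
        offset += HEADER.size
        if tag in known_tags:
            yield tag, buffer[offset:offset + length]
        offset += length  # jump over the payload, whether or not it was used

blob = encode([(1, b"alice"), (99, b"x" * 10_000), (2, b"bob")])
print(list(read_known(blob, known_tags={1, 2})))
```

The 10,000-byte payload of the unknown tag 99 is never inspected; the reader just adds its length to the offset, which is the skip the text format cannot do.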
The usual argument for a text format ranges from permitting humans to create the content for most things directly in an editor like vi or emacs (no wars here, I listed my favorite last), to reading that content directly, such as to diagnose the real cause of misunderstood errors. XML is too utterly complex for human creation or interpretation to be effective on a direct basis. There may be some argument that it can still be effective for diagnostic purposes (I have in fact needed to do so many times). Given that it is the powerful tools of XML that are used as the basis for the benefit of XML and promoting it, then what does it really matter what format is underneath as long as it is open and unencumbered?
A binary format for XML will absolutely not kill XML. DNS is obviously not dead (and you'll love it even more when IPv6 rolls into your network). What a binary format might do is weed out some of the weaker programmers who are sticking their fingers a bit too deep into the inner workings of some applications and tools.
now we need to go OSS in diesel cars
I don't know that I care about or for "binary XML". I don't terribly worry about the efficiency that might be gained by converting a textual integer like 3,000,000,000 into a 32 bit binary integer.
However, I might be interested in a "Pointer XML" - in an XML that allows me to use lseek-like operations to efficiently move around a document.
XPaths conceptually require parsing lots of the document. It's hard to skip over pieces - you have to process all of the bytes from the start of the document to the first place where the XPath matches.
Most of the "optimized XML" formats create a hash table from XPath to file location or binary. But this is still at least O(length of XPath string).
If there was a way of providing the link as a textual integer, and then lseeking to this, it's O(lg NbytesInXmlDoc). That might be a saving.
(Adage: don't worry about constants like 2X or 4X. Do worry about changing the O() efficiency.)
There would be no reason that such a "Pointer XML" could not remain entirely textual. It might simply be an extra syntax or modifier to an XPath:
Instead of linking to xpath
/a/b/c
Link to
/a/b/c || byte_position=5454786
The lseek positions would have to be in bytes, not characters, and would get confused if the coding system were changed. But they would at least be useful and usable if the coding system were not changed.
The hard part would be ensuring consistency. E.g. in the example above, you would want to ensure that the element at byte_position=5454786 really was the xpath /a/b/c. It would be bad if it was not. But I think that sort of thing could be checked in the same way that we check DTDs.
Also, some minor annotations, such as placing anchors at the lseek-ed to byte position, might help in maintaining such consistency.
Moreover, I would never advocate abandoning XPaths - I would just be suggesting including the lseekable byte positions as a performance hint. It should also be correct to ignore the byte positions and just use the XPath links.
By adding padding (blanks, whatever) you could avoid the need to change all of the lseekable byte position hints whenever you changed an element value.
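Here is a rough sketch of the "byte position as a hint" idea. The resolve function and the tiny document are hypothetical, it only handles leaf elements without attributes, and the sanity check only confirms the tag name at the hinted offset, not the full path, so it is weaker than the consistency check described above.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document; the real one would be a large file on disk.
DOC = b"<a><b><x>padding padding</x><c>target</c></b></a>"

def resolve(stream, xpath, byte_position=None):
    """Return (text, used_hint) for a leaf element addressed by xpath."""
    name = xpath.rsplit("/", 1)[-1].encode("utf-8")
    open_tag, close_tag = b"<" + name + b">", b"</" + name + b">"
    if byte_position is not None:
        stream.seek(byte_position)  # the lseek step
        rest = stream.read()        # a real reader would read bounded chunks
        end = rest.find(close_tag)
        if rest.startswith(open_tag) and end != -1:
            return rest[len(open_tag):end].decode("utf-8"), True
    # Hint missing or stale: ignore it and walk the document the slow way.
    stream.seek(0)
    root = ET.parse(stream).getroot()
    # ElementTree paths are relative to the root, so drop the leading "/a".
    relative = "./" + "/".join(xpath.strip("/").split("/")[1:])
    return root.find(relative).text, False

hint = DOC.index(b"<c>")
print(resolve(io.BytesIO(DOC), "/a/b/c", byte_position=hint))  # good hint
print(resolve(io.BytesIO(DOC), "/a/b/c", byte_position=3))     # stale hint
```

A stale hint costs one wasted seek and then falls back to the plain XPath, which matches the point that ignoring the byte positions must always remain correct.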
I cannot believe that your naive post was modded up to a 5. FWIW the answer to all of your above questions is a resounding "Yes!", although some deserve a stronger "Yes!" than others. Let me state for the record that, from your newbie questions, you are XML-ignorant. And you apparently did not take compiler theory, where you would have learned how computationally expensive parsing is. But you are hardly alone; the industry is full of dumbasses who don't understand what's happening. I, on the other hand, predicted these problems four years ago and have yet to receive my Nobel Prize.
XML is a cluster fuck for the following reasons. Any message must be:
Note that at every step XML requires more CPU, more memory and more bandwidth. This is true for every component of the network! There is no way around these problems other than sheer computing power and throughput. So, one might say, the problem will disappear if we merely wait a few years. Unfortunately other factors are loading the Internet even more than XML, sapping Moore's Law.
And that's without considering the problems of the W3C's various XML committees! But don't get me started.