Does the World Need Binary XML?
sebFlyte writes "One of XML's founders says 'If I were world dictator, I'd put a kibosh on binary XML' in this interesting look at what can be done to make XML better, faster and stronger."
For starters, keep Microsoft out of it.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
Binary XML = zip file.xml > file.xml.zip
That's all you need. XML compresses great.
Binary XML would destroy what makes XML powerful: being able to use vi or emacs to understand its content, no fuss, no Adobe Reader-like software, no nothing.
Programs written in assembly can run faster than programs written in C, but it's easier for someone to open a .c file and figure out what's going on.
I'm sure when C came out, the argument was similar: that the readability and cross-compatibility didn't make up for the performance hit. But as computers and network connections became faster, C became a more viable alternative.
Text compresses quite well, especially redundant text like the tags. So why not just leave XML alone and compress it at the transport level: send it as a zip, let V.92 modems do it automatically, or whatever. No need to touch XML itself at all.
But secondly, no, you don't need Binary XML, all you need to do is Gzip it on the wire. It gets as small as Binary XML.
One of the easiest ways to shrink your XML by about 90% is to use short tag names instead of long ones. You can use a transformation to switch between the short names and the long names on the wire.
XML, as implemented today, is often little more than a thin wrapper for huge gobs of proprietary-format data. Thus, any given XML parser can identify the contents as "a huge gob of proprietary data", but can't do a damned thing with it.
Too many developers have "embraced" XML by simply dumping their data into a handful of CDATA blocks. Other programmers don't want to reveal their data structure, and abuse CDATA in the same way. Thus, a perfectly good data format has been bastardized by legions of lazy/overprotective coders.
The slew of publications that exist for the sole purpose of "clarifying" XML serves as testament to the abuse of XML.
Obliteracy: Words with explosions
Of course binary doesn't equal proprietary. Those are two completely different concepts.
PNG is a binary format. It isn't proprietary, though. And although I can't immediately find a text-based proprietary format, such formats are not impossible (although arguably easier to reverse-engineer than binary proprietary formats).
But if the XML is really such a problem, I suggest the simple solution. Compressing XML with a simple and open algorithm like gzip or bzip2, is the way to go. XML usually compresses very easily.
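To put a rough number on the "just gzip it" argument, here is a minimal Python sketch. The sample document and its tag names are invented for illustration, and the ratio you get on real data depends entirely on how repetitive that data is.

```python
import gzip

# Hypothetical sample: 1000 repetitive records; tag names are made up.
records = "".join(
    f"<vegetable><name>item{i}</name><count>{i % 7}</count></vegetable>"
    for i in range(1000)
)
xml_bytes = f"<vegetables>{records}</vegetables>".encode("utf-8")

# gzip.compress / gzip.decompress are lossless and need no XML awareness.
compressed = gzip.compress(xml_bytes)
ratio = len(compressed) / len(xml_bytes)

print(f"raw: {len(xml_bytes)} bytes, gzipped: {len(compressed)} bytes, ratio: {ratio:.2f}")
```

This only measures size on the wire. It says nothing about the parsing cost on either end, which is the other half of the argument in this thread.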
A huff transform will give you entropy +1 compression. Not suitable for larger data sets (dictionary based compression is even better for this). 7z compression (or is it z7?) will give you a neat storage format.
Let's talk about where this verbose talk of verbosity is stemming from:
apple
orange
pineapple
this is a data set. No one knows what it is.
Here it is again with some pseudo xml style tags
I am listing vegetables here
this is a list of vegetables
vegetables are listed on their own without any child or parent tags, there can be one or more of them, this is version 1 of the document
here now follows a vegetable
tomato
that was a vegetable
here now follows a vegetable
leek
that was a vegetable
here now follows a vegetable
potato
that was a vegetable
here now follows a vegetable
haddock
that was a vegetable
as you can see, this is an (albeit slightly weird looking) list of items called 'vegetables'.
The beauty of XML is twofold: the description of the document format (DTDs and schemas) and the ability to verify that a document is valid, for any specified format.
XML is a human readable file specification language, and file format, all in one, written in itself!
A binary format of XML would be nice, you can make it yourself though.
veg:http://slashdot.org/veg.xml
v:tomato
v:fr
v:lemongrass
v:cat
this is a minimal way to represent the same xml like structure, in a less verbose way.
This is undeniable complexity. A binary format is just a way of introducing a standard lossless compression format for XML, without changing what XML is.
I say anything that gets the W3C stamp of 'this is official' gets my vote. After all, 1 bad standard is better than 11 good proprietary solutions in a world of millions of interconnected systems.
#hostfile 0.0.0.0 primidi.com 0.0.0.0 www.primidi.com 0.0.0.0 radio.weblogs.com
Any programmer worth his salt can put together a really good/efficient binary representation of XML in a few days. That's not the issue. The issue here is standardization.
The XML guys are funny. First make a text version of binary protocols to make it easy to sell XML to the mass of "31137 HTML programmers" who feel comfortable "programming" in Dreamweaver; and then make a binary version to make it work.
When the XML is in text you still need to parse it. Sounds like an easy job if you're just doing it on your home computer. But a server handling thousands of simultaneous transactions can get bogged down parsing text down to binary when it can just get sent in binary to begin with.
MUCH faster. And you don't have the overhead of compression. Sure, gzip/bzip2 will cut down on network overhead, but what about processor overhead?
FTFA "The goal of the Fast Infoset project is to generate interest among developers and eventually create a standardized binary format." I'm not sure why they think that one has to come before the other.
Because standards written in a vacuum tend to suck. Why wouldn't you want input from developers with different backgrounds and needs, then cherry pick the best ideas (many of which you didn't think of), toss out universally reviled ones, and implement a broad, useable standard?
Here come da fudge!
I'll say it again: it's not the size of the document, it's the overhead in parsing.
Yes but every time I try to see it your way, I get a headache.
That's the dumbest statement I've ever heard.
As long as it's standardized, the standard is freely available to anyone who wants it, it does not depend on an external library, and it is unencumbered by any sort of patent, it isn't proprietary.
I hate XML right now because of all the string processing and parsing. Text is a sloppy way of defining something, and it begets lots of big processing libraries. It's OK for big PC memory hog apps, but I can't build a small enough one that is still robust enough to want to integrate it into the work I do (small, compact stuff). I find myself doing other, backwards things, or worse, fracturing XML into useable subsets. It somewhat defeats its utility.
Binary XML sounds like a great idea to me, as long as we're clear on a few things. One, it has to be totally documented in a standard (see above for my definition). Two, the standard must define a tool that can read an XML file and say "Yes this is XML" or "No, this is some [microsoft] non-compliant crap". Three, keep it simple: no compression, no outside library dependencies, no cruft.
If those things cannot be achieved then it will not reach maximum utility and something proprietary will swoop down and take over (*cough* microsoft *cough*).
This is all about different companies trying to get THEIR binary format to be the "standard" with XML.
From the article Images are already binary data. They really don't compress much more (if you've chosen the right format). That means that they will take the same amount of time to download, binary XML format or not.
of "I told you so!" coming over. Between all the people who jumped on the web services bandwagon without any clue how to handle distributed systems efficiently and the "OMG! It's human readable!" crowd, the architecture du jour has become a bloated PITA. Why this wasn't built into the spec in the first place eludes me. If we can use tools like ethereal to read those binary IP datagrams, why wouldn't the same concept be used for this standard? A standardized, compressed data format with a standardized API for outputting plaintext (XML) would have allowed this system to be much more efficient.
Didn't anyone remember that text processing was bulky and expensive? Sometimes the tech community seems to share the same uncritical mind as people who order get-rich-quick schemes off late night infomercials. I doubt XML would have gotten out of the gate as is, had the community demanded these kinds of features from the get-go.
Arrogance is Confidence which lacks integrity. -- me
Clearly these fundamental tenets have escaped the People In Charge in many places, who are now discovering that their brilliant ideas to represent images, large databases, etc, as XML were, in fact, fucking stupid ideas.
Enter binary XML. This lets the People In Charge save face by saying that the system still uses XML, so they must have been right when they designed/required it to use XML in the first place. And now it's 50% faster!
The article has an excellent point, there will be compatibility problems and we'll "degrade" to different binary XML formats - each best suited to a particular niche. That's exactly where the world was before XML came along - data formats designed for (and reasonably appropriate for) particular applications. Those formats are invariably more efficient than XML, and are often simpler and easier to parse than XML. Binary XML attempts to combine those old-fashioned file formats with XML, resulting in a system that's more bloated (and slower) than the old way but not quite as bad as XML in its current form. So now we've come full circle, except that we've added an extra layer of bloat to something that worked well enough to begin with. Congratulations, Mr. Binary XML person! You fail it!
While I'm at it: If network bandwidth is really the bottleneck, use zlib. XML's best feature is that it compresses really well.
Then don't use string operations! If you write a dumb parser that uses the language's string functions, breaking off and allocating little chunks of memory, of course it's going to be slow as molasses. The way to do it is to hold the entire XML file in a single string in memory (up to memory capacity), then tightly code a C-language loop to scan it. If done properly the overhead can be barely more than the time it takes to iterate through that many characters, and that overhead may be swamped by the internal table building, binary lookups, and consistency checking that you have to do anyway, no matter what the format.
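A minimal sketch of that single-pass idea, written in Python for brevity rather than the C the poster has in mind. It walks one in-memory buffer by index and records (start, end) offsets instead of breaking off substrings. The function name is made up, and the scanner is deliberately naive: it ignores comments, CDATA sections, and ">" characters inside attribute values.

```python
def scan_tags(doc):
    """Single pass over the buffer; returns (start, end) offsets of each tag."""
    spans = []
    i, n = 0, len(doc)
    while i < n:
        if doc[i] == "<":
            # Advance to the matching ">" without copying any text.
            j = i + 1
            while j < n and doc[j] != ">":
                j += 1
            if j == n:
                raise ValueError("unterminated tag")
            spans.append((i, j + 1))
            i = j + 1
        else:
            i += 1
    return spans


doc = "<a><b>x</b></a>"
spans = scan_tags(doc)
print(spans)
```

The table building and consistency checks the poster mentions would hang off those offsets; only then would any text actually be copied out of the buffer.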
The real problem with XML is that it adds the extra verbosity of the metadata text tag for EACH INSTANCE of a piece of data even in cases where that metadata is identical for row after row of data. In the case of table data, that is really stupid. There should be some sort of XML means to handle a table of values better. A way to say "Column 1 has the following XML properties: name, etc", then "Column 2 has the following XML properties: name, etc".... and then after that section, a way to syntactically list just the values up until the end of the loop.
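Here is a rough Python sketch of the difference between tagging every value and declaring the columns once. The column names, the values, and the "declare once" layout itself are all invented for illustration; no XML standard defines this table syntax.

```python
# Hypothetical table: column names and values are made up.
columns = ["shift", "intensity", "width"]
rows = [[i, i * 3 % 100, i % 5] for i in range(1000)]

# Per-instance tagging: every value carries its own open and close tag.
per_row = "".join(
    "<row>" + "".join(f"<{c}>{v}</{c}>" for c, v in zip(columns, r)) + "</row>"
    for r in rows
)

# Declare-once layout: column metadata stated a single time, then bare values.
header = "<columns>" + "".join(f'<column name="{c}"/>' for c in columns) + "</columns>"
body = "\n".join(" ".join(str(v) for v in r) for r in rows)
declared_once = f"<table>{header}<values>\n{body}\n</values></table>"

# The bare values can still be read back without loss.
parsed = [[int(v) for v in line.split()] for line in body.splitlines()]

print(len(per_row), len(declared_once))
```

The trade-off is that the value block is no longer self-describing line by line: a reader has to have seen the column section first.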
This is what made us balk at using XML for storing NMR spectroscopy data, even though it is already in a textual form to begin with. The current textual form is whitespace-separated, little short numbers less than 5 digits long, for hundreds of thousands of rows. That isn't really that big in ASCII form. But turn it into XML, and a 1 meg ASCII file turns into a 150 meg XML file because of the extra repetitive tag stuff.
In another bit of irony, we can't find an in-memory representation of the data as a table which is more compact than the ASCII file is. The original ASCII file is even more compact than a 2-D array in RAM (because it takes 4 bytes to store an int even when that int is typically just one digit and is only larger on rare occasions).
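The arithmetic behind that last point is easy to check. The sketch below assumes a made-up distribution (about 95% single-digit values, the rest up to five digits) standing in for the real NMR data, which is not shown in this thread.

```python
import random
import struct

random.seed(0)

# Invented distribution: mostly one-digit values, occasionally larger ones.
values = [
    random.randint(0, 9) if random.random() < 0.95 else random.randint(10, 99999)
    for _ in range(100_000)
]

# Whitespace-separated ASCII: one byte per digit plus one separator.
ascii_text = " ".join(str(v) for v in values).encode("ascii")

# Fixed-width binary: exactly 4 bytes per value ("<i" is a 32-bit int).
packed = struct.pack(f"<{len(values)}i", *values)

print(f"ASCII: {len(ascii_text)} bytes, 4-byte ints: {len(packed)} bytes")
```

A variable-length integer encoding would close most of that gap, but that is a different format from the plain 2-D int array the poster is comparing against.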
Don't label something "offtopic" unless you know the topic well enough to tell what's on topic.
It doesn't tell us what the specific performance problems are with XML. Does it take too long to transmit? Does it take too long to validate? Does it take too long to parse? Does it take too long to format? What's the real problem here?
From experience, I can state that using XML in any high performance situation is easy to screw up. But once you get past the basic mistakes at that level, what other inherent problems are there?
Oh, and just stating "well, the format is obviously wasteful" just because it's human readable (one of its primary, most useful, features) is NOT an answer.
I get the feeling that this perception of XML is being perpetuated by vendors who do not really want to open up their data formats. Allowing them to successfully propagate this impression would be a very real step backwards for all IT professionals.
Please mod this post only if you think others should/n't read this. I have enough ego^H^H^Hkarma. Thanks!
Dude! Wake up! How often do you open an XML-RPC packet trace with your morning coffee and think 'Gosh, how cool it's in a readable bloated text format and I don't need to parse it with Ethereal!'
:-/
Seriously, the only time readability is needed is when you edit an XML web page with a notepad. Otherwise it's a brain-dead technology that first got popular among scripting developers, which are notoriously afraid of anything binary, and then it got pushed into the areas where it didn't belong.
Unfortunately, the majority of XML zealots are plain ignorant. Had they taken the time to learn what byte ordering and TLV encoding mean, we probably would not have this XML craze now.
Don't get me wrong, XML has its place. But it is next to HTML, and not next to RPC or databases!
3.243F6A8885A308D313
The more compression that is done, the greater the CPU usage. Eventually it reaches a point of diminishing returns where there is no point in trying to compress a network stream any further because you are merely turning it from an I/O bound task to a CPU bound task. Also, to get really good compression, you need to look ahead and see a lot of the bytes of the file to look for similarities. But in a stream application, you don't have the luxury of holding giant buffers for each stream of bytes, so you have to make do with finding what compression you can in the smaller buffered chunks of the data that you pass through a little at a time. Therefore, although compression is used, it's not going to be the really good kind of compression we're used to seeing with something like "gzip -9".
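A small Python sketch of the buffer-size effect, using zlib on an invented XML-like payload. Compressing every 512-byte chunk independently is a worst-case assumption (no history carried between buffers); zlib's own streaming mode keeps a window of up to 32 KB between flushes, so it lands somewhere in between.

```python
import zlib

# Hypothetical payload: repetitive XML-like records, made up for illustration.
data = "".join(f"<row><id>{i}</id><v>{i % 10}</v></row>" for i in range(5000)).encode()
chunk_size = 512
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# 1. Whole payload at once, as with "gzip -9" on a file.
whole = zlib.compress(data, 9)

# 2. Worst case: each small buffer compressed with no memory of earlier ones.
independent = sum(len(zlib.compress(c, 9)) for c in chunks)

# 3. One streaming compressor, flushed after every chunk so each can be sent.
comp = zlib.compressobj(9)
parts = []
for c in chunks:
    parts.append(comp.compress(c))
    parts.append(comp.flush(zlib.Z_SYNC_FLUSH))
parts.append(comp.flush())
streamed = b"".join(parts)

print(len(data), len(whole), len(streamed), independent)
```

None of this measures CPU time, which is the other cost raised above; timing the three variants on the target hardware would be the next step.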
Binary formats contain pointers all over the place... pointers that say "this many bytes to the next record", or if the binary format is designed to be very fast to read, will even contain pointers that say "record 22031 is at offset XXX, record 22032 is at offset YYY". It's very quick to get to record 22032 for these formats, you just jump there and don't even have to wait eons for a physical disk to read in every single byte in between.
Now, compare to XML. EVEN IF every record was a single xml tag, the parser would have to look for "<", followed by "</", and would have to repeat that 22030 more times.
That may seem like an extreme example, but 1) most XML "records" are much more complex to parse, and 2) this demonstrates THE MOST MAJOR DOWNSIDE that human-writable formats have... they can't have these "jump to byte XXXX" markers in them, because humans don't want to constantly be updating these references every time they add or subtract a byte.
Machine-writable file formats realize that inserting or deleting bytes in the middle of a file is a big no-no, so they use several tricks to make sure they don't have to do that. All of these tricks annoy the heck out of humans (they either require updating a lot of bytes, or require writing/reading the file in "pages", which bugs humans because you can't "see" a whole section at the same time, or other tricks).
Therefore, human-writable formats should NOT be used as the most basic storage/access format. Agreeing to put an extremely minimal storage layer below XML is simply accepting that machines are more optimized to read/write a different kind of format than humans are.
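The "jump to record 22032" point can be sketched in a few lines of Python. The binary layout here (a count, a table of 4-byte offsets, then length-prefixed records) is invented for this example and is not any existing format; the XML side uses a made-up one-tag-per-record document.

```python
import struct

records = [f"record-{i}".encode() for i in range(30000)]

# Invented layout: count, one 4-byte offset per record,
# then each record as a 2-byte length followed by its bytes.
header_size = 4 + 4 * len(records)
offsets, body, pos = [], bytearray(), header_size
for r in records:
    offsets.append(pos)
    body += struct.pack("<H", len(r)) + r
    pos += 2 + len(r)
blob = (
    struct.pack("<I", len(records))
    + struct.pack(f"<{len(records)}I", *offsets)
    + bytes(body)
)


def read_record(blob, n):
    """Jump straight to record n via the offset table: two small reads."""
    (offset,) = struct.unpack_from("<I", blob, 4 + 4 * n)
    (length,) = struct.unpack_from("<H", blob, offset)
    return blob[offset + 2 : offset + 2 + length]


# The same data as text: reaching record n means scanning past n close tags.
xml = "".join(f"<r>record-{i}</r>" for i in range(30000))


def scan_record(xml, n):
    pos = 0
    for _ in range(n):
        pos = xml.index("</r>", pos) + 4
    start = xml.index("<r>", pos) + 3
    return xml[start : xml.index("</r>", start)]


print(read_record(blob, 22032), scan_record(xml, 22032))
```

Note what the offset table costs: inserting one byte into an early record invalidates every later offset, which is exactly why nobody wants to maintain such a file by hand.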
However (as I tried to emphasize), ASCII is binary too. It's not that binary is inherently more difficult to debug. It's that we need a binary standard as universal as ASCII has become.
Imagine debugging in the 1960s, before ASCII was standardized. We forget about those times now, because ASCII has been there for nearly 50 years. But go ahead, take a look.
Believe it or not, there were over 60 binary text standards in use before ASCII. I think we should be thanking Bob Bemer (the father of ASCII) a whole lot more often.
As others have pointed out, most of those features are here today.
Please remember that not all XML data is transmitted by HTTP however (thank god).
- Michael T. Babcock (Yes, I blog)
If you ask me, the transparency of a text stream far outweighs any cost in performance.
It far outweighs it, huh? I guess you have never heard of a large segment of the computing world referred to as embedded systems.
If you can develop a good parser (not that hard), the cost difference is negligible, if any.
This is simply untrue. Developing a good parser is easy, but it's added bloat that isn't negligible for many computing devices outside of the PC/server realm. Not to mention the added network traffic that uncompressed text yields (embedded devices don't always have the fastest I/O). Some say that the solution to reducing the network overhead of XML is compression. Compression takes CPU power, another thing lacking in many embedded devices.
My point is that there are actually a lot of applications where XML is just not well suited.
It seems to me that the problem isn't with XML, it's with what people are using it for. I read some complaints here from people saying "I tried to use XML for BLAH and it was too slow." However, if they'd thought about it, BLAH would have been better served by some binary format in the first place. The article also discusses the fact that mobile devices need something less cumbersome for transferring pictures/media. Why are they using XML for that at all? One of the benefits of XML is that it's human readable, but in those applications you don't need that benefit, so don't use XML. Instead of coming up with a binary XML standard, come up with a generic binary standard that does exactly what you want. Too many people have been given the hammer of XML and now everything looks like a nail.
You had me until then; no self-respecting engineer would ever use those terms.
Dewey, what part of this looks like authorities should be involved?
Stop using bad DTDs. There seems to be a DTD style in which you avoid using attributes and instead add a whole lot of tags containing text. Any element with a content type of CDATA should be an attribute on its parent, which improves the readability of documents and lets you use ID/IDREF to automatically check stuff. Once you get rid of the complete cruft, it's not nearly so bad.
Not to nitpick, but attributes != elements. (Hint: one of them is ordered and repeatable.) As far as ID/IDREF goes, key/keyref in XML Schema replicates this for arbitrary markup. Use of attributes in some instances is rather crufty, precisely because they need to be handled anampohically to elements.
A new XML could make all close tags "</>". The > could be dropped from empty tags, too.
Your design decision, not mine. Some might think that if you're going to have a verbose format like XML, you might as well throw in a few sanity checks as well, since they're almost free by comparison.
Look, I've said it before, and I'll say it again. Like Hello Kitty, XML has one thing going for it -- ubiquity. If, like me, you're a proponent, you need to understand this, embrace it. Once you've repeated the words enough, you will come to a blissful realization. It doesn't matter how "bad" XML performance is, the only thing that matters is that it be useful for everyone. This means that it should be useful for config files for simple programs/scripts. It should be useful for people who want to build (by hand) a little web service to serve up their mp3 collection, or to multi-billion dollar companies that want to run online auctions. If XML can be this broad-based, then so can the tooling that is used to manipulate it. That's good news for big companies who want to save money, and script hackers who just want to save time, good for us all. Anything that fractures XML's ubiquity undermines the technology itself, and should be avoided. Binary XML falls into this category.
Now as for performance, my personal opinion is that it's way too early to start running around creating binary standards. XML itself has been around a while, but the higher-level standards are still evolving (web services, XML Schema, etc). Most of the current tooling around XML is written to demonstrate standards compliance. When we really start to see performance-oriented solutions, and they still suck, then everyone can start rioting.
A good binary XML specification could be an extremely good fit for us.
And don't suggest that we just compress XML and send that. Here's why: first we have to expand all that digitized data into some sort of ASCII encoding, which is then compressed. End result: no gain and a possible loss of precision in the data.
A real, live, useful binary XML spec could help us immensely. I say BRING IT ON!!!!
BTW, wasn't DIME supposed to address these problems? What happened to DIME, anyway?
In the course of every project, it will become necessary to shoot the scientists and begin production.
What is text? It's a binary code that a computer translates into graphical glyphs. Is it proprietary? Not any more. Your computer is what turns that binary code into something that means something to you. It doesn't even mean something to everyone (in fact iirc the first line in an xml file identifies a code page for using the intended symbols). So firstly, opaqueness even on "text" is not quite black and white. Second, what is transparent to YOU may be completely opaque to software, I'll elaborate on this later.
So what could binary XML be? It's a binary code that translates into XML syntax. Except it's easier for software to deal with: there's no processing. Let me present this example, which I will endeavor to use over and over: <mytag is="simple"/>. I could write this in binary as 0001010203. Obviously, to do that I'd have to store the strings "mytag", "is", "simple" in a string table elsewhere, but this is just a simple example. I made "0x00" mean "a tag", the first "0x01" mean 1 attribute, and the rest are string references. Reading this tag would be very simple: fread(buffer,10,1,file) (I picked the two middle numbers out of the air since we have not really defined this format).
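A minimal Python sketch of the 0001010203 encoding described above, assuming the example tag was something like <mytag is="simple"/> and that string references are one-byte indexes into a table whose slot 0 is unused. Both assumptions, and the function names, are mine; the post explicitly says the format is not really defined.

```python
TAG = 0x00  # marker byte meaning "a tag", as in the post

# String table stored "elsewhere"; index 0 is left unused so the
# references come out as 01, 02, 03 like the example.
strings = [None, "mytag", "is", "simple"]


def encode_tag(name_ref, attrs):
    """attrs is a list of (name_ref, value_ref) pairs of string-table indexes."""
    out = bytearray([TAG, len(attrs), name_ref])
    for key_ref, value_ref in attrs:
        out += bytes([key_ref, value_ref])
    return bytes(out)


def decode_tag(buf, strings):
    """Turn the binary form back into XML text (the "textify" direction)."""
    if buf[0] != TAG:
        raise ValueError("not a tag record")
    count, name_ref = buf[1], buf[2]
    attrs = [(strings[buf[3 + 2 * i]], strings[buf[4 + 2 * i]]) for i in range(count)]
    attr_text = "".join(f' {k}="{v}"' for k, v in attrs)
    return f"<{strings[name_ref]}{attr_text}/>"


encoded = encode_tag(1, [(2, 3)])
print(encoded.hex(), decode_tag(encoded, strings))
```

Decoding is a handful of array lookups with no tokenizing, which is the whole appeal; the catch is that the bytes mean nothing without the string table.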
Saying that binary is proprietary makes absolutely no sense. Proprietary means property of an owner (usually a business). A file can't be proprietary. Its contents, the format of its contents, certainly. But a binary file is atomic, it's like the sky. It just is what it is. Binary XML COULD become proprietary, but it will not NECESSARILY happen. Nothing is inherently proprietary about a binary file. If the binary XML format satisfies the constraints of a standard in my first post, it will absolutely not be proprietary by construction, or so I think. I don't work in standards groups; people more experienced with their goings-on may point out additional refinements.
Your next point, recovering data. What do you use to read XML files? A text editor usually. What does that do? It reads a binary file (uh a text file!), applies some understanding of what the ASCII code (as an example) means, and displays it to you. Most of it is usually character data, but not always, there's a bunch of special characters that text editors often respond to for formatting or other things. Unix and PCs can't agree even on how to terminate a line. The point being even right now you can't totally say plain text XML is transparent, magic happens for you to just see it. Nothing about its presentation is defined (nor should be, imho). So what could binary XML be viewed with? Only slightly more overhead. You could "textify" as a preprocessing step to be viewed in a text editor. bin2txt myfile.bin.xml > myfile.txt.xml as an example. Or you could write your xml in plain text, and do the opposite. It's one to one, no loss. XML is just a syntax.
Now as for processing, I'll admit to waving my hands and skipping a few pieces. The XML syntax is defined clearly, there's no ambiguity (that I know of). However, the step of choosing text-like strings to declare the syntax elements is where it gets hairy. Your first step in writing a parser is to grab the syntax elements out of their native text string. This is disgusting, as compiler writers, language developers, etc. understand. You have to make lex/yacc scripts or workalikes to generate code, or worse, write your own (no one should do this but that's purely my opinion and not defensible). There's a complicated state machine, some funny thing called LRM, and some other gotchas. All this just to take the example tag and break it into its constituent elements. Usually then you have a tree structure or some hierarchy that a computer can understand.
Take a look at some common XML libraries: xerces, libxml, a few others I can't remember. They're pretty damn big. Mostly, I argue, due to the text nature of their data. A lot of work goes into making text files useable by a program. A lot (but not all) of cruft can be cut by adopting a format that is simpler for software to understand.
Sure, people who write MS Word (i.e.
The question should instead be "How can we best standardize binary XML?"
My main fear is the typical "design by committee" style of standards bodies will lead to a super-bloated binary standard containing every pet feature of each participant. This could make it just as slow and painful as working with any textual encoding. I think Mike Conner's "CBXML" is probably the right mix of simplicity, compactness, and efficiency. Sun's Fast Infoset is a horrendous concoction that we can only hope never achieves any prominence. Leave it to the company who made Java the bloated mess it is today to come up with something like that!
Hey guys, here's a clue...before including an ever so nifty new compression / performance feature into your proposals, how about actually quantifying the expected benefits? This includes both performance of parsing as well as generation. Yes we need a binary XML standard, but keep it simple PLEASE.