Slashdot Mirror


Does the World Need Binary XML?

sebFlyte writes "One of XML's founders says 'If I were world dictator, I'd put a kibosh on binary XML' in this interesting look at what can be done to make XML better, faster and stronger."

18 of 481 comments

  1. The solution is clear... by LordOfYourPants · · Score: 3, Funny

    Use the Z-modem protocol between Information Superhighway routers to compress the plaintext.

  2. KISS by stratjakt · · Score: 5, Interesting

    On the face of it, compressing XML documents by using a different file format may seem like a reasonable way to address sluggish performance. But the very idea has many people -- including an XML pioneer within Sun -- worried that incompatible versions of XML will result.

    I agree with his point.

    What's wrong with just compressing the XML as it is with an open and easy-to-implement algorithm like gzip or bzip2?
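A minimal sketch of the "just gzip it" approach suggested above, using Python's standard `gzip` module. The sample document and its tag names are made up for illustration; the point is only that the text survives the round trip byte for byte and that repeated tags compress well.

```python
import gzip

# Hypothetical sample document; the repeated tags are what make XML compress well.
xml_text = "<orders>" + "".join(
    f'<order id="{i}"><status>shipped</status></order>' for i in range(1000)
) + "</orders>"

raw = xml_text.encode("utf-8")
packed = gzip.compress(raw)          # standard gzip container around DEFLATE
restored = gzip.decompress(packed)   # the receiver gets the identical bytes back

print(len(raw), len(packed))
```

Note that the receiver still has to run a full XML parser over `restored`, so this only addresses size, not parsing cost.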

    --
    I don't need no instructions to know how to rock!!!!
    1. Re:KISS by Ramses0 · · Score: 5, Informative

      On the surface that works, but it only solves a portion of the problem.

      Data => XML.

      XML == large (lots of verbose tags)

      XML == slow (have to parse it all [dom], or
      build big stacks [sax] to get at data)

      Solution:

      XML => .xml.gz

      You've solved (kind of) the large problem, but you still keep the slow problem.

      What they're suggesting is nothing more than:

      XML => .xml.gzxml

      Basically using a specialized compression scheme that understands the ordered structure of XML (tags, attributes, etc.) and probably has some indexes that say "here are the locations of all the [blah] tags", so you can just fseek() instead of having to do DOM walking or stack building. This is important for XML selectors (XQuery), and for "big iron" junk it makes a lot of sense and can save a lot of processing power. Consider that Zip already does something similar by keeping a central directory of its entries as part of its specification (wouldn't it suck to have to completely unzip a zip file when all you wanted was to pull out a list of the filenames / sizes?)
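The index-then-seek idea can be sketched roughly as follows. This is a toy container invented for illustration, not any proposed binary XML format: the layout (a count, an index of key/offset/length entries, then the payloads) and the names `pack_records`, `read_index`, and `fetch` are all assumptions.

```python
import io
import struct

def pack_records(records):
    """Write [count, index size][index entries][payloads]."""
    index = b""
    body = b""
    for key, payload in records.items():
        k = key.encode("utf-8")
        # Each index entry: key length, key, offset into the body, payload length.
        index += struct.pack(">H", len(k)) + k + struct.pack(">II", len(body), len(payload))
        body += payload
    header = struct.pack(">II", len(records), len(index))
    return header + index + body

def read_index(stream):
    """Read only the header and index, returning {key: (absolute_offset, length)}."""
    count, index_size = struct.unpack(">II", stream.read(8))
    body_start = 8 + index_size
    entries = {}
    for _ in range(count):
        (klen,) = struct.unpack(">H", stream.read(2))
        key = stream.read(klen).decode("utf-8")
        offset, length = struct.unpack(">II", stream.read(8))
        entries[key] = (body_start + offset, length)
    return entries

def fetch(stream, entries, key):
    """Jump straight to one record without touching the others."""
    offset, length = entries[key]
    stream.seek(offset)
    return stream.read(length)

blob = pack_records({
    "title": b"<title>Does the World Need Binary XML?</title>",
    "score": b"<score>5</score>",
})
stream = io.BytesIO(blob)
entries = read_index(stream)
score = fetch(stream, entries, "score")
```

The reader touches the header, the index, and one payload; everything else in the file is skipped, which is the saving being described.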

      "Consumer"/Desktop applications already do compress XML (look at star-office as a great example, even JAR is just zipped up stuff which can include XML configs, etc). It's the stream-based data processors that really benefit from a standardized binary-transmission format for XML with some convenient indexes built in.

      That is all.

      --Robert

  3. Binary XML has been around a while... by PipianJ · · Score: 4, Informative

    Binary XML is nothing new, as I wager that many people here are already using it, albeit unknowingly.

    One of the earliest projects that tried to make a binary XML (as far as I'm aware) was EBML (Extensible Binary Meta-Language), which is used in the Matroska media container.

  4. Maybe this is like comparing assembly to C by Stevyn · · Score: 5, Insightful

    Programs written in assembly can run faster than programs written in C, but it's easier for someone to open a .c file and figure out what's going on.

    I'm sure when C came out, the argument was similar: that the readability and cross compatibility didn't make up for the performance hit. But as computers and network connections became faster, C became a more viable alternative.

  5. Re:For Starters by Omega1045 · · Score: 4, Interesting
    Why? Microsoft has done a fairly good job promoting XML and SOAP XML Web Services. As long as they stick to the standards (yes, I know) I see no reason to keep them out.

    IBM has actually tried to introduce some goofy stuff into the XML standards, like line breaks, etc, that should not be in a pure node-based system like XML. Why aren't you picking on them in your comment?

    As far as SOAP and XML Web Services (standardized protocols for XML RPC transactions) Microsoft was way ahead of the pack. And I rather enjoy using their rich set of .NET XML classes to talk to our Unix servers. It helps my company interop.

    --

    Great ideas often receive violent opposition from mediocre minds. - Albert Einstein

  6. Re:For Starters by Soko · · Score: 3, Insightful

    Agreed.

    However, let me re-phrase the grandparent:

    "For starters, make sure Microsoft can't extend it to lock out competitors in some way."

    Better?

    Soko

    --
    "Depression is merely anger without enthusiasm." - Anonymous
  7. Amen To That by American+AC+in+Paris · · Score: 5, Insightful
    XML, as originally designed, is deliciously straightforward. Data is encoded into discrete, easy-to-process chunks that any given XML parser can make sense of.

    XML, as implemented today, is often little more than a thin wrapper for huge gobs of proprietary-format data. Thus, any given XML parser can identify the contents as "a huge gob of proprietary data", but can't do a damned thing with it.

    Too many developers have "embraced" XML by simply dumping their data into a handful of CDATA blocks. Other programmers don't want to reveal their data structure, and abuse CDATA in the same way. Thus, a perfectly good data format has been bastardized by legions of lazy/overprotective coders.

    The slew of publications that exist for the sole purpose of "clarifying" XML serves as testament to the abuse of XML.

    --

    Obliteracy: Words with explosions

    1. Re:Amen To That by Kingpin · · Score: 4, Insightful

      An XML document is an abstraction. The file with tags is a serialization of that document. A binary file would also just be a serialization. Then you deserialize it in your parser - and get the DOM. It's the job of the parser to give you the object representation, no matter whether the input was human-readable text or a binary format.

      The data is interchangeable either way - the only difference is that a binary XML file is not immediately human readable.

      --
      Unable to read configuration file '/bigassraid/htdig//conf/14229.conf'
      Geocrawler error message.
  8. Re:Binary = Proprietary by Adhemar · · Score: 3, Insightful

    Of course binary doesn't equal proprietary. Those are two completely different concepts.

    PNG is a binary format. It isn't proprietary, though. And although I can't immediately find a text-based proprietary format, such formats are not impossible (although arguably easier to reverse-engineer than binary proprietary formats).

    But if the size of XML is really such a problem, I suggest the simple solution. Compressing XML with a simple and open algorithm like gzip or bzip2 is the way to go. XML usually compresses very easily.

  9. Then we wrap it again, that's what! by Tackhead · · Score: 4, Funny
    > Then what happens, do you base64 the binary xml and wrap it in an ascii xml document?

    Of course not! That's not XML!

    <file=xmlbinary> <baseencoding=64> <byte bits=8> <bit1>0 </bit><bit2>1 </bit><bit3>1 </bit><bit4>0 </bit><bit5>1 </bit><bit6>0 </bit><bit7>0 </bit><bit8>1 </bit> </byte>
    <boredcomment>(Umm, I'm gonna skip a bit if y'all don't mind)</boredcomment>
    </baseencoding> </file>

    Now it's XML!

  10. Re:there are already standards for this... by rootmonkey · · Score: 5, Insightful

    I'll say it again: it's not the size of the document, it's the overhead in parsing.

    --

    Yes but every time I try to see it your way, I get a headache.
  11. Re:Step 1 to getting binary XML by Dasein · · Score: 3, Insightful

    The problem is that many systems that produce XML have a more compact internal storage (rows from a DB or whatever), then they go through an "expansion" to produce XML.

    So, to propose simply compressing it means that there's an expansion (which is expensive) followed by a compression (which is really expensive). That seems pretty silly. However, given an upfront knowledge of which tags are going to be generated, it's pretty easy to implement a binary XML format that's fast and easy to decode.

    This is what I did for a company that I worked for. We did it because performance was a problem. Now, if we don't get something like this through the standards bodies, more companies are going to do what mine did and invent their own format. That's a problem -- back to the bad old days before we had XML for interoperability.

    Now, if we get something good through the standards body then, even though it won't be human readable, it should be simple to provide converters. To have something fast that is convertible to human readable and back seems like a really good idea.
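A rough sketch of what a format built on up-front knowledge of the tags, plus a converter to readable XML, might look like. This is not the format described in the comment, which is not shown; the tag table `TAGS`, the field layout, and the function names are all hypothetical.

```python
import struct
import xml.etree.ElementTree as ET

# Tag table agreed on by both sides ahead of time (hypothetical names).
TAGS = ["id", "name", "email"]

def encode_row(row):
    """Encode one row as repeated (tag index: 1 byte, value length: 2 bytes, UTF-8 value)."""
    out = bytearray()
    for tag, value in row.items():
        data = str(value).encode("utf-8")
        out += struct.pack(">BH", TAGS.index(tag), len(data)) + data
    return bytes(out)

def decode_row(blob):
    """Walk the fields back into a dict without any text parsing."""
    row, pos = {}, 0
    while pos < len(blob):
        tag_index, size = struct.unpack_from(">BH", blob, pos)
        pos += 3  # one byte of tag index plus two bytes of length
        row[TAGS[tag_index]] = blob[pos:pos + size].decode("utf-8")
        pos += size
    return row

def row_to_xml(row, element="user"):
    """Converter to the human-readable form, for debugging or interchange."""
    root = ET.Element(element)
    for tag, value in row.items():
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")

blob = encode_row({"id": 7, "name": "Ada", "email": "ada@example.org"})
decoded = decode_row(blob)
xml_text = row_to_xml(decoded)
```

The producer goes from row to bytes without ever building the verbose text, and the readable XML is generated only when someone asks for it.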

    --
    You are not a beautiful or unique snowflake -- but you could be if you got off your ass.
  12. Overwhelming feeling... by GOD_ALMIGHTY · · Score: 4, Insightful

    of "I told you so!" coming over. Between all the people who jumped on the web services bandwagon without any clue how to handle distributed systems efficiently and the "OMG! It's human readable!" crowd, the architecture du jour has become a bloated PITA. Why this wasn't built into the spec in the first place eludes me. If we can use tools like ethereal to read those binary IP datagrams, why wouldn't the same concept be used for this standard? A standardized, compressed data format with a standardized API for outputting plaintext (XML) would have allowed this system to be much more efficient.

    Didn't anyone remember that text processing was bulky and expensive? Sometimes the tech community seems to share the same uncritical mind as people who order get-rich-quick schemes off late night infomercials. I doubt XML would have gotten out of the gate as is, had the community demanded these kinds of features from the get-go.

    --
    Arrogance is Confidence which lacks integrity. -- me
  13. Vast omissions! by kahei · · Score: 4, Funny


    Aside from the mistakes pointed out by others, you also forgot to reference the xmlbinary namespace, the xmlbyte namespace, and the xmlboredcommentinparentheses namespace, and to qualify all attributes accordingly. You also didn't include anything in or any magic words like CDATA, and you didn't define any entities. You also failed to supply a DTD and an XSL schema.

    This is therefore still not _true_ XML. It simply doesn't have enough inefficiency. Please add crap to it :)

    --
    Whence? Hence. Whither? Thither.
  14. The article doesn't go far enough... by Da+VinMan · · Score: 5, Insightful

    It doesn't tell us what the specific performance problems are with XML. Does it take too long to transmit? Does it take too long to validate? Does it take too long to parse? Does it take too long to format? What's the real problem here?

    From experience, I can state that using XML in any high performance situation is easy to screw up. But once you get past the basic mistakes at that level, what other inherent problems are there?

    Oh, and just stating "well, the format is obviously wasteful" just because it's human readable (one of its primary, most useful, features) is NOT an answer.

    I get the feeling that this perception of XML is being perpetuated by vendors who do not really want to open up their data formats. Allowing them to successfully propagate this impression would be a very real step backwards for all IT professionals.

    --
    Please mod this post only if you think others should/n't read this. I have enough ego^H^H^Hkarma. Thanks!
  15. XML doesn't need to be non-ascii to be small by iabervon · · Score: 3, Informative

    Three ideas, in order of increasing significance and increasing difficulty:

    Stop using bad DTDs. There seems to be a DTD style in which you avoid using attributes and instead add a whole lot of tags containing text. Any element with a content type of CDATA should be an attribute on its parent, which improves the readability of documents and lets you use ID/IDREF to automatically check stuff. Once you get rid of the complete cruft, it's not nearly so bad.

    Now that everything other than HTML is generally valid XML, it's possible to get rid of a lot of the verbosity of XML, too. A new XML could make all close tags "</", since the name of the element you're closing is predetermined and there's nothing permitted after a slash other than a >. The > could be dropped from empty tags, too. If you know that your DTD will be available and not change during the life of the document, you could use numeric references in open tags to refer to the indexed child element type of the type of the element you're in, and numeric references for the indexed attribute of the element it's on. If you then drop the spaces after close quotes, you've basically removed all of the superfluous size of XML without using a binary format, as well as making string comparisons unnecessary in the parser.

    Of course, you could document it as if it were binary. An open tag is indicated with an 0x3C, followed by the index of the element type plus 0x30 (for indices under 0xA). A close tag is (big-endian) 0x3C2F. A non-close tag is an open tag if it ends with an 0x3E and an empty tag if it ends with an 0x2F. Attribute indices are followed with an 0x3D. And so forth.
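A small sketch of an encoder for the compact syntax described above, assuming indices stay under 10. One simplification is labeled loudly: instead of indexing child element types relative to the parent's DTD declaration, it uses two flat hypothetical tables, `ELEMENTS` and `ATTRIBUTES`.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical tables standing in for the DTD: position = numeric reference.
ELEMENTS = ["list", "item"]
ATTRIBUTES = ["id", "lang"]

def compact(node):
    """Emit '<' + element digit, then attr digit '=' "value" pairs, then '>' or '/'."""
    out = "<" + chr(0x30 + ELEMENTS.index(node.tag))
    for name, value in node.attrib.items():
        quoted = escape(value, {'"': "&quot;"})
        out += f'{ATTRIBUTES.index(name)}="{quoted}"'  # no space after the close quote
    if len(node) == 0 and not node.text:
        return out + "/"                # empty tag: ends with 0x2F, the '>' is dropped
    out += ">" + escape(node.text or "")
    for child in node:
        out += compact(child) + escape(child.tail or "")
    return out + "</"                   # close tag is always the two bytes 0x3C 0x2F

source = '<list lang="en"><item id="a">first</item><item id="b"/></list>'
compact_text = compact(ET.fromstring(source))
print(compact_text)
```

A decoder would need the same two tables, which is the "DTD will be available and not change" condition stated above.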

  16. Re:WHO NEEDS FREAKING READABILITY ?! by dbacher · · Score: 3, Insightful

    I agree with your point, however there's one additional case where it is nice.

    The best use for XML is at system or domain boundaries, where you cannot control the software on both sides.

    For example, a support system might use file exchange to open support tickets in a vendor's system for hardware failures. In this case, the vendor probably needs to deal with multiple different customers, and each of their customers might be dealing with several vendors.

    Being able to encapsulate to XML, in this case, is valuable so that all partners can understand the data.

    You could do this with a binary format, etc. but there is no binary format with the universal library support, and C doesn't guarantee byte orders and structure layout between platforms, so in that case XML is useful.

    That's the only time it's useful.

    I strongly dislike using it for comms protocols, because the extensibility and transformation capabilities are lost, and it cripples throughput in the best of situations.
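To illustrate the byte-order point above: the same 32-bit integer has two different byte layouts, and a reader that assumes the wrong one gets a different number without any error. This sketch uses Python's `struct` module with explicit order markers; the value is arbitrary.

```python
import struct

value = 0x12345678

big = struct.pack(">I", value)      # most significant byte first (network order)
little = struct.pack("<I", value)   # least significant byte first

# A reader that assumes the wrong order silently gets a different number.
(misread,) = struct.unpack("<I", big)

print(big.hex(), little.hex(), hex(misread))
```

A binary interchange format has to pin this down in its specification; a text format sidesteps it by writing the digits out.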

    --
    If your code is acting bloated, and is running rather slow, it's likely and predicted that some loops you will unroll.