Slashdot Mirror


Does the World Need Binary XML?

sebFlyte writes "One of XML's founders says 'If I were world dictator, I'd put a kibosh on binary XML' in this interesting look at what can be done to make XML better, faster and stronger."

20 of 481 comments

  1. KISS by stratjakt · · Score: 5, Interesting

    On the face of it, compressing XML documents by using a different file format may seem like a reasonable way to address sluggish performance. But the very idea has many people -- including an XML pioneer within Sun -- worried that incompatible versions of XML will result.

    I agree with his point.

    What's wrong with just compressing the XML as it is with an open and easy-to-implement algorithm like gzip or bzip2?
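
    As a rough illustration of the suggestion above, here is a minimal sketch that compresses an XML document with Python's standard gzip and bz2 modules. The sample document is made up for the example; real ratios depend entirely on the data.

```python
import bz2
import gzip

# Hypothetical sample document: repetitive markup, as XML usually is.
rows = "".join(
    f"<customer><id>{i}</id><name>Customer {i}</name></customer>"
    for i in range(1000)
)
xml_bytes = f"<customers>{rows}</customers>".encode("utf-8")

gz = gzip.compress(xml_bytes)
bz = bz2.compress(xml_bytes)

# Sizes in bytes: original, gzip, bzip2.
print(len(xml_bytes), len(gz), len(bz))

# Decompression restores the original bytes exactly.
assert gzip.decompress(gz) == xml_bytes
assert bz2.decompress(bz) == xml_bytes
```

    Note that this only addresses size on the wire or on disk; the receiver still has to decompress and parse the full text.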

    --
    I don't need no instructions to know how to rock!!!!
    1. Re:KISS by ZakMcCracken · · Score: 2, Interesting

      What's wrong with just compressing the XML as it is with an open and easy-to-implement algorithm like gzip or bzip2?
      I'll tell you one thing that's wrong: these compression algorithms might run fine on your desktop or server, but on an embedded system with restricted memory and CPU power, that's another matter...

    2. Re:KISS by e2d2 · · Score: 2, Interesting

      What you said is right on target. I've worked with XML in a few applications (specifically web services) and every time we saw a performance drop it was not because of a network bandwidth issue; it was because the documents were so large that the parser became the bottleneck. And then when you throw in style sheets for manipulation... well, you get the point.

      So if the need is for compression over networks, well, that's only half of XML's performance problems. And if the end result becomes a binary format, then how is it related to XML, and why would it need to be, in the first place? Data compression over networks is not a valid reason for another standard IMO.

  2. gzip ? by JonyEpsilon · · Score: 2, Interesting
    Am I missing something, or would just gzipping XML when it goes over the network not solve the problem? And isn't this sort of solution already widely implemented for web content?

    Somebody fill me in ...

  3. there are already standards for this... by ophix · · Score: 2, Interesting

    ... it's called zipping, most webservers have it as an option to zip the data up as it streams to the client browser

    i fail to see the need to have a "binary xml" file format when there are already facilities in place to compress text streams

  4. Re:For Starters by Omega1045 · · Score: 4, Interesting
    Why? Microsoft has done a fairly good job promoting XML and SOAP XML Web Services. As long as they stick to the standards (yes, I know) I see no reason to keep them out.

    IBM has actually tried to introduce some goofy stuff into the XML standards, like line breaks, etc, that should not be in a pure node-based system like XML. Why aren't you picking on them in your comment?

    As far as SOAP and XML Web Services (standardized protocols for XML RPC transactions) Microsoft was way ahead of the pack. And I rather enjoy using their rich set of .NET XML classes to talk to our Unix servers. It helps my company interop.

    --

    Great ideas often receive violent opposition from mediocre minds. - Albert Einstein

  5. But ASCII is binary after all... by MarkWPiper · · Score: 2, Interesting
    The fact is, ASCII is a binary format. It just happens to be a format that has become universally accepted. As the article says, there are certainly benefits to having ASCII-based XML: "The fact that XML is ordinary plain text that you can pull into Notepad... has turned out to be a boon, in practice," he said. "Any time you depart from that straight-and-narrow path, you risk loss of interoperability."

    However, if anything, XML has shown us the power of well-structured information. XML has given the possibility of universal interoperability. Developments in XML-based technologies have led us to the point where we know enough now to create a standard for structured information that will last for several decades.

    It's time that we had a new ASCII. That standard should be binary XML.

    When I think of the time that has been wasted by every developer in the history of Computer Science, writing and rewriting basic parsing code, I shudder. Binary XML would produce a standard such that an efficient, universal data structure language would allow significant advances in what is technically possible with our data. For example: why is what we put on disk any different from what's in memory? Binary XML could erase this distinction.

    A binary XML standard needs to become ubiquitous, so that just as Notepad can open any ASCII file today, SuperNotepad could open any file in existence, or look at any portion of your computer's memory, in an informative, structured manner. What's more, we have the technology to do this now.

    1. Re:But ASCII is binary after all... by MattRog · · Score: 2, Interesting

      Jesus Christ, no. The solution is simple:
      (1) Have every PC OS contain a DBMS (this is not as difficult as you would think)
      (2) Always keep your data in a DBMS
      (3) Have said DBMS transfer the data via whatever method it would like. Chances are this would be some sort of compact, efficient binary method.

      --

      Thanks,
      --
      Matt
    2. Re:But ASCII is binary after all... by FangVT · · Score: 2, Interesting
      The fact is, ASCII is a binary format. It just happens to be a format that has become universally accepted. As the article says, there are certainly benefits to having ASCII-based XML: "The fact that XML is ordinary plain text that you can pull into Notepad... has turned out to be a boon, in practice," he said. "Any time you depart from that straight-and-narrow path, you risk loss of interoperability."
      Not that anybody will care but...

      XML is not ASCII. XML is Unicode. That's why Tim Bray said "plain text" not ASCII.

      Because it was such a long, hard road for ASCII to become the universal data format that it is for English text, the creators of Unicode wisely made sure there was backwards compatibility, such that any valid ASCII text (one that does not include OS-specific, proprietary extensions in the range above 0x7F) is also valid Unicode text when the encoding is UTF-8.
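
      The compatibility claim is easy to check. A minimal sketch in Python (the sample strings are made up for the example):

```python
text = "<note>plain ASCII text</note>"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For code points up to 0x7F the two encodings produce identical bytes.
assert ascii_bytes == utf8_bytes

# A lone byte above 0x7F (here 0xE9, as in a Latin-1 "e acute")
# is not valid UTF-8, so decoding it fails.
try:
    b"caf\xe9".decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

assert decoded_ok is False
```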

  6. Re:Step 1 to getting binary XML by Tsiangkun · · Score: 2, Interesting

    In my hands, bzip compresses better, but is somewhere between somewhat slower and orders of magnitude slower on my system, depending on the options used to invoke the command and the size of the file being compressed. gzip is fast, works on streams instead of blocks, and is available on nearly every system.

  7. Why not re-examine http? by digitalgimpus · · Score: 2, Interesting

    I think that's where the true problem lies. HTTP.

    We need to look towards http 2.0. What I would want:

    - pipelining that works, so that it could be enabled for use on any server that supports http 2.0
    - gzip and 7zip support.
    - All data is compressed by default (a few excludes such as .gz files, .zip files etc. since that would be pointless).
    - Option to initiate a persistent connection (remove the stateless protocol concept), via an HTTP header on connect. This would allow for a whole new level for web applications via SOAP/XML.

    There are tons of other things that could be enhanced for today's uses.

    HTTP is the problem. Not XML

  8. Ask Erik Naggum! by notany · · Score: 2, Interesting
    Erik Naggum (SGML/XML-guru) who first proposed empty elements

    <foo/>

    from Re: Lisp syntax, what about resynchronization?

    ... so it had to come up, and one of the least
    productive solutions, XML, won the day. I was there, at the conference
    table where the first thoughts that became XML surfaced. a few months
    earlier, I had proposed the need for a special marker for empty elements
    -- and then retracted that proposal because it led to new problems -- but
    guess what survived in XML!...

    Attributes in XML are inherited from SGML, whose designers were thinking of markup for textual documents. When you want to represent data, whether something is an attribute or not is completely irrelevant.

    Whether something is an attribute or element is _completely_ arbitrary.
    It is based on some arbitrary choices in the design process that reveal
    absolutely no inherent qualities. For purely pragmatic reasons, SGML
    folks will use attributes for some things and elements for others because
    their tools can deal with some things in attributes and some things in
    elements. The faulty idea that attributes say something "about" the
    element and sub-elements somehow constitute their contents is the same
    premature structuring that premature optimization of code suffers from.
    The whole language is incredibly misdesigned in making that distinction.
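
    To make the point concrete, here is a small sketch (element and attribute names are hypothetical) showing the same datum carried once as an attribute and once as a child element. An application reading it gets the same value either way; only the access call differs.

```python
import xml.etree.ElementTree as ET

# The same fact, encoded two ways.
as_attribute = ET.fromstring('<person name="Ada"/>')
as_element = ET.fromstring("<person><name>Ada</name></person>")

name_from_attribute = as_attribute.get("name")
name_from_element = as_element.findtext("name")

# The application-level value is identical.
assert name_from_attribute == name_from_element == "Ada"
```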

    Deep explanation: From: The horror that is XML

    ... XML, being the single suckiest syntactic invention in the history of
    mankind, offers you several layers at which you can do exactly the same
    thing very differently, in fact so differently that it takes effort to
    see that they are even related.

    <foo type="bar">zot</foo> actually defines three different views on the
    same thing: Whether what you are really after is foo, bar, or zot,
    depends on your application. XML is only an overly complex and otherwise
    meaningless exercise in syntactic noise around the message you want to
    send. Its notion of "structure" must be regarded as the same kind of
    useless baggage that comes with languages that have been designed by people
    who have completely failed to understand what syntax is all about. It is
    therefore a mistake to try to shoe-horn things into the "structure" that
    XML allows you to define.

    In the above example, foo can be the application-level element, or it
    can be the syntax-level element and bar the application-level element.
    It is important to realize that SGML and XML offer a means to control
    only the generic identifier (foo) and their nesting, but that it is often
    important to use another attribute for the application. This was part of
    the reason for #FIXED in the attribute default specification and the
    purpose of omitting attributes from the actual tags. In my view, this is
    probably the only actually useful role that attributes can play, but
    there are other, much more elegant, ways to accomplish the same goal, but
    not within the SGML framework. Now, whether you use one of the parts of
    the markup, or use the contents of an element for your application is
    another design choice. The markup may only be useful for validation
    purposes, anyway.

    Let me illustrate:

    <if><condition>...</condition>
    <then>...</then>
    <else>...</else>
    </if>

    The XML now contains all the syntax information of the "host" language.
    Many people think this is the _only_ granularity at which XML should be
    used, and they try to enforce as much structure as possible, which

    --
    Dyslexics have more fnu.
  9. An Option to Binary XML by Anonymous Coward · · Score: 1, Interesting

    I think that Binary XML misses the mark on bringing any real benefits other than transmission compression. XML can be a huge benefit from a human and coding perspective, but it also has drawbacks in transmission (due to size) and in processing (again due to size). A lot of XML data goes thru many different processing systems that never need descriptive tags and the overhead of document size can bog down some very large computers.

    I know, try to do an XSLT on a 60 meg file.

    One approach that could potentially benefit everyone is to have interchangeable namespaces. By that I mean have a human-readable namespace that also has a machine-friendly namespace.

    In the human version you could have those wonderful long tags like [FirstNameOfMyGrandmothersThirdCousin] and have a transform that would make that [ID1001] for machine processing.

    You can save a ton of space by swapping out all of the Element and Attribute names while holding the structure, allow machines to process more efficiently, and then if a human or UI needs descriptive information, you could go grab the friendly namespace and be back to your large XML file.
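
    A minimal sketch of that idea in Python, assuming a fixed lookup table shared by both sides. The mapping, the sample document and the rename_tags helper are all hypothetical, and this toy only renames elements, not attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping between human-readable and machine-friendly names.
LONG_TO_SHORT = {
    "Family": "ID1000",
    "FirstNameOfMyGrandmothersThirdCousin": "ID1001",
}
SHORT_TO_LONG = {short: long for long, short in LONG_TO_SHORT.items()}


def rename_tags(xml_text, mapping):
    """Return xml_text with every element name found in mapping replaced."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = mapping.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


human = (
    "<Family><FirstNameOfMyGrandmothersThirdCousin>Edna"
    "</FirstNameOfMyGrandmothersThirdCousin></Family>"
)
machine = rename_tags(human, LONG_TO_SHORT)   # short names for processing
restored = rename_tags(machine, SHORT_TO_LONG)  # friendly names again

assert restored == human
assert len(machine) < len(human)
```

    The structure is untouched, so anything that walks the tree works on either version; only the names change.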

  10. Anecdotal example by plopez · · Score: 2, Interesting

    Had data to be delivered to a client, dumped from a database. As flat files they were ~20 MB in size. That bloated to ~120 MB after conversion to XML.

    The client attempted to open it in a DOM-based application which I suspect used recursion to parse the data (easy to code, recursion). Needless to say it brought their server to its knees.

    We switched to flat files shortly thereafter.

    In my problem domain, where 20 MB is a small data set, XML is useless. XML does not seem to scale well at all (though using a SAX parser helps at times).
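
    For comparison, a sketch of the streaming style of parsing. It uses ElementTree's iterparse rather than SAX proper, but the idea is the same: handle each record as it arrives instead of building the whole tree first. The row/value layout is made up for the example.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical data: many small <row> records under one root.
xml_bytes = (
    b"<rows>"
    + b"".join(b"<row><value>%d</value></row>" % i for i in range(10000))
    + b"</rows>"
)


def sum_values(stream):
    """Sum every <value>, releasing each row's content once it is processed."""
    total = 0
    for _event, element in ET.iterparse(stream, events=("end",)):
        if element.tag == "row":
            total += int(element.findtext("value"))
            # Drop the row's children and text. The emptied <row> element
            # itself stays attached to the root, so memory still grows a little.
            element.clear()
    return total


total = sum_values(io.BytesIO(xml_bytes))
assert total == sum(range(10000))
```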

    YMMV.

    --
    putting the 'B' in LGBTQ+
  11. How about WBXML? by Anonymous Coward · · Score: 1, Interesting

    WBXML is very compact, easy to parse and it's standardized too. Have a look at http://www.w3.org/TR/wbxml/ .

  12. Re:You don't need to change XML itself by TheRaven64 · · Score: 2, Interesting
    Actually, you could compress XML by a significant amount by making one simple change to the language. Picture the following piece of XML:

    <SomeTagName>some character data</SomeTagName>

    According to the XML spec, the closing tag must close the nearest opening tag. So why does it have to include the opening tag's name? This is 100% redundant information, and is included in every XML tag with children or cdata. An obvious compression would be to replace this with:

    <SomeTagName>some character data</>

    I really don't know why this wasn't done from the outset (backwards compatibility with HTML, where tags often overlap - although they're not meant to - I suppose). Either allow tags to overlap (which allows some more interesting data structures to be easily encoded in XML) or make the name optional in the closing tag.
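
    A toy sketch of the idea, to show the closing names really can be recovered from nesting alone. The shorten and restore helpers are hypothetical, and the regular expressions ignore comments, CDATA sections, processing instructions and '>' inside attribute values, so this is not a real XML tool.

```python
import re

# Matches a tag: optional leading slash, element name, then the rest.
TAG = re.compile(r"<(/?)([^\s<>/]*)([^<>]*)>")


def shorten(xml_text):
    """Replace every named closing tag with the bare form </>."""
    return re.sub(r"</[^<>]+>", "</>", xml_text)


def restore(short_text):
    """Rebuild the named closing tags by tracking open elements on a stack."""
    open_names = []

    def fix(match):
        closing, name, rest = match.groups()
        if closing:
            # A closing tag always belongs to the most recently opened element.
            return f"</{open_names.pop()}>"
        if not rest.endswith("/"):  # skip self-closing tags such as <br/>
            open_names.append(name)
        return match.group(0)

    return TAG.sub(fix, short_text)


original = "<doc><SomeTagName>some character data</SomeTagName><br/></doc>"
short = shorten(original)

assert short == "<doc><SomeTagName>some character data</><br/></>"
assert restore(short) == original
```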

    --
    I am TheRaven on Soylent News
  13. Microsoft XML by Spy+der+Mann · · Score: 2, Interesting

    Take Microsoft's XML formats as an example. Word, or the MSN messages format... they're _NOT_ XML. They're proprietary formats DISGUISED as XML.

    If Microsoft doesn't respect text-only XML, what do you think will happen when^H^H^H^Hif binary XML is out?

  14. DNS is binary; does that make it proprietary? by Skapare · · Score: 2, Interesting

    DNS is binary; does that make it proprietary? Not at all. It is a published open standard in RFC 883 and later documents. Other examples include ASN.1/BER as used in SNMP. It's not whether it is binary or text that matters; it's whether it is openly documented and unencumbered by intellectual property claims (a separate issue some of XML has).

    The decision of binary vs. text for a format should be the result of specific needs. XML is verbose. XML can be compressed for transmission purposes, but it still has to be uncompressed to its verbose form for parsing. If speed in parsing is necessary (it might be, as I have noticed quite a few XML-based programs are rather slow), a binary format can have things like length prefixes and continuation tags, instead of having to detect and verify collections of characters whose position is unknown. A parser that does not recognize a given tag, or does not need to process it, can in a binary format simply skip it by jumping the specified number of bytes. A binary format is very well suited to machine processing.
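
    A minimal sketch of the length-prefix idea. The record layout here (2-byte tag id, 4-byte length, then the payload) is invented for the example and is not the DNS or ASN.1/BER wire format.

```python
import struct

# Hypothetical header: big-endian 2-byte tag id and 4-byte payload length.
HEADER = struct.Struct(">HI")


def encode(records):
    """Pack (tag_id, payload) pairs as tag + length + payload."""
    return b"".join(HEADER.pack(tag, len(data)) + data for tag, data in records)


def find(buffer, wanted_tag):
    """Return the first payload with wanted_tag, skipping others by length."""
    offset = 0
    while offset < len(buffer):
        tag, length = HEADER.unpack_from(buffer, offset)
        offset += HEADER.size
        if tag == wanted_tag:
            return buffer[offset:offset + length]
        offset += length  # jump over a payload we do not need to read
    return None


message = encode([(1, b"x" * 1000), (2, b"hello"), (3, b"tail")])

assert find(message, 2) == b"hello"
assert find(message, 9) is None
```

    The reader never looks inside the 1000-byte payload of tag 1; it reads six header bytes and jumps. A text parser has to scan every character to find where that element ends.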

    The usual argument for a text format spans the range of permitting humans to create the content for most things directly in an editor like vi or emacs (no wars here, I listed my favorite last), or reading that content directly, such as to diagnose the real cause of misunderstood errors. XML is too utterly complex for human creation or interpretation to be effective on a direct basis. There may be some argument that it can still be effective for diagnostic purposes (I have in fact needed to do so many times). Given that it is the powerful tools of XML that are used as the basis for the benefit of XML and promoting it, then what does it really matter what format is underneath as long as it is open and unencumbered?

    A binary format for XML will absolutely not kill XML. DNS is obviously not dead (and you'll love it even more when IPv6 rolls into your network). What a binary format might do is weed out some of the weaker programmers who are sticking their fingers a bit too deep into the inner workings of some applications and tools.

    --
    now we need to go OSS in diesel cars
  15. Binary XML no / Pointer XML maybe? by Anonymous Coward · · Score: 1, Interesting

    I don't know that I care about or for "binary XML". I don't terribly worry about the efficiency that might be gained by converting a textual integer like 3,000,000,000 into a 32 bit binary integer.

    However, I might be interested in a "Pointer XML" - in an XML that allows me to use lseek like operations to efficiently move around a document.

    XPaths conceptually require parsing lots of the document. It's hard to skip over pieces - you have to process all of the bytes from the start of the document to the first place where the XPath matches.

    Most of the "optimized XML" formats create a hash table from XPath to file location or binary. But this is still at least O(length of XPath string).

    If there was a way of providing the link as a textual integer, and then lseeking to this, it's O(lg NbytesInXmlDoc). That might be a saving.

    (Adage: don't worry about constants like 2X or 4X. Do worry about changing the O() efficiency.)

    There would be no reason that such a "Pointer XML" could not remain entirely textual. It might simply be an extra syntax or modifier to an Xpath:

    Instead of linking to xpath /a/b/c
    Link to /a/b/c || byte_position=5454786

    The lseek positions would have to be in bytes, not characters, and would get confused if the coding system were changed. But they would at least be useful and usable if the coding system were not changed.

    The hard part would be ensuring consistency. E.g. in the example above, you would want to ensure that the element at byte_position=5454786 really was the xpath /a/b/c. It would be bad if it was not. But I think that sort of thing could be checked in the same way that we check DTDs.

    Also, some minor annotations, such as placing anchors at the lseek-ed to byte position, might help in maintaining such consistency.

    Moreover, I would never advocate abandoning XPaths - I would just be suggesting including the lseekable byte positions as a performance hint. It should also be correct to ignore the byte positions and just use the XPath links.

    By adding padding (blanks, whatever) you could avoid the need to change all of the lseekable byte position hints whenever you changed an element value.
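
    A rough sketch of how such a hint might be used, assuming the document bytes are unchanged since the hint was recorded. The document, the read_element_at helper and the way the hint is obtained are all hypothetical, and the helper assumes the target element does not contain a nested element of the same name.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document; the filler stands in for a large amount of data.
document = (
    b"<a><filler>" + b"x" * 5000 + b"</filler>"
    b"<b><c>target</c></b></a>"
)

# The byte-position hint that would accompany the link /a/b/c.
byte_position = document.index(b"<c>")


def read_element_at(stream, position, tag):
    """Seek to position and parse just the element that starts there."""
    stream.seek(position)
    end_marker = b"</" + tag + b">"
    chunk = b""
    while end_marker not in chunk:
        block = stream.read(64)
        if not block:
            raise ValueError("hint does not point at a complete element")
        chunk += block
    fragment = chunk[: chunk.index(end_marker) + len(end_marker)]
    return ET.fromstring(fragment)


element = read_element_at(io.BytesIO(document), byte_position, b"c")
hinted_text = element.text

# Consistency check: the hint must agree with the ordinary path lookup.
path_text = ET.fromstring(document).findtext("b/c")
assert element.tag == "c"
assert hinted_text == path_text == "target"
```

    The hinted read touches only the bytes from the seek position onward, while the path lookup parses the whole document. The final check is the kind of validation that would be needed to keep hints and paths in agreement.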

  16. Here are your answers by Anonymous Coward · · Score: 1, Interesting
    It doesn't tell us what the specific performance problems are with XML. Does it take too long to transmit? Does it take too long to validate? Does it take too long to parse? Does it take too long to format? What's the real problem here?

    I cannot believe that your naive post was modded up to a 5. FWIW the answer to all of your above questions is a resounding "Yes!", although some deserve a stronger "Yes!" than others. Let me state for the record that, from your newbie questions, you are XML-ignorant. And you apparently did not take compiler theory, where you would have learned how computationally expensive parsing was. But you are hardly alone; the industry is full of dumbasses who don't understand what's happening. I, on the other hand, predicted these problems four years ago and have yet to receive my Nobel Prize.

    XML is a cluster fuck for the following reasons. Any message must be:

    1. encoded to XML on the server,
    2. transmitted over the network (but the XML message is longer and requires greater bandwidth),
    3. received by the client,
    4. parsed by the client into some structure (which may require fetching the DTD over the network),
    5. If an error occurs, the message must be retransmitted, otherwise
    6. the relevant fields must be selected from the parsed structure.

    Note that at every step XML requires more CPU, more memory and more bandwidth. This is true for every component of the network! There is no way around these problems other than sheer computing power and throughput. So, one might say, the problem will disappear if we merely wait a few years. Unfortunately other factors are loading the Internet even more than XML, sapping Moore's Law.
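
    The steps above can be sketched in a few lines to see where the extra bytes and work come from. The record and its field names are made up for the example, and the network hop is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the field names are illustrative only.
record = {"id": "42", "name": "Ada", "balance": "100.50"}

# Step 1: encode to XML on the sender.
root = ET.Element("account")
for field, value in record.items():
    ET.SubElement(root, field).text = value
wire_bytes = ET.tostring(root, encoding="utf-8")

# Steps 2-3: wire_bytes is what would cross the network.
raw_size = sum(len(value) for value in record.values())
xml_size = len(wire_bytes)

# Step 4: parse on the receiver.
parsed = ET.fromstring(wire_bytes)

# Step 6: select the one field the receiver wants.
balance = parsed.findtext("balance")

assert balance == "100.50"
assert xml_size > raw_size
```

    Comparing raw_size with xml_size gives a feel for the markup overhead on a small message; how much that matters in practice depends on the data and the link.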

    And that's without considering the problems of the W3C's various XML committees! But don't get me started.