Slashdot Mirror


XML Compression Options?

ergo98 asks: "About a year ago I had the need to evaluate XML compression technologies (for a project where two machines had to communicate via XML document, and there was an excess of CPU power and a dearth of bandwidth): At the time the best option seemed to be a research project called XMill, however it seemed even then to be an abandoned project with no more updates and little market presence, and was available only as source for a command-line utility that would need reworking into library form. I'm curious if there's been any progress in the XML compression arena in the past year: If you have more CPU power than bandwidth, what is the best option for XML document compression? Have any XML-specific compression algorithms been made into a module for Apache?"

18 of 51 comments

  1. GZIP by shemnon · · Score: 4, Informative

    How about simply using a text compression on the XML? Since gzip has a backwards token index it compresses XML quite nicely. It is available in Java as well as C-based implementations. If Windows is your platform of choice you can get at it via Cygwin.
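A minimal sketch of the suggestion above, using Python's standard `gzip` module rather than the Java or C implementations mentioned. The record layout and the `make_sample_xml` helper are made up for illustration; real documents will compress differently.

```python
import gzip


def make_sample_xml(n_records: int) -> bytes:
    """Build a repetitive XML document (hypothetical record layout)."""
    rows = "".join(
        f"<reading><sensor>S{i % 7}</sensor><value>{i * 3 % 101}</value></reading>"
        for i in range(n_records)
    )
    return f"<?xml version='1.0'?><readings>{rows}</readings>".encode("utf-8")


xml_bytes = make_sample_xml(500)

# gzip (DEFLATE) replaces repeated strings, such as tag names, with
# back-references into a sliding window of recent input.
packed = gzip.compress(xml_bytes, compresslevel=9)
restored = gzip.decompress(packed)

print(len(xml_bytes), "bytes raw ->", len(packed), "bytes gzipped")
```

The round trip is lossless, so the receiving side needs nothing beyond a gzip decoder.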

    --
    --Shemnon
    1. Re:GZIP by ergo98 · · Score: 5, Informative

      The problem with GZIP is quite simply that it doesn't take advantage of XML specifics (i.e. there are domain attributes of XML that make certain algorithms much more efficient than others): XMill, as an example, manages to achieve almost twice the compression of GZip with approximately the same CPU usage.

      In this case adding in a module that would do XMill on the server side (of course obeying the Accept-Encoding that the HTTP client passes in, so a client that didn't handle the custom compression would not be thwarted) would allow us to use a custom HTTP client that could do the compression and achieve twice the throughput on the limited pipe. As I mentioned in the submission: We have more CPU power than bandwidth, so that 2x compression improvement is very significant (i.e. there is a world of low bandwidth vertical uses out there: satellite, frame relay, CDPD, etc). For those who cringe at the idea of using XML to begin with, please realize that with the proper compression XML encoded data is smaller than any proprietary packaging because of its high degree of predictability.

    2. Re:GZIP by Zeinfeld · · Score: 2
      How about simply using a text compression on the XML? Since gzip has a backwards token index it compresses XML quite nicely. It is available in Java as well as C-based implementations.

      The problem with the gzip approach is that it is best on long files rather than short ones. The problem with XML is that SOAP messages that should take one or two packets end up taking five. Gzip tends to make five packets four, but it *will* make 50 packets 25.
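The short-versus-long effect is easy to measure. The sketch below uses `zlib` (the same DEFLATE algorithm gzip uses) on a made-up SOAP-like message; the element names and the `soap_like_message` helper are assumptions for illustration, and the exact ratios depend entirely on the data.

```python
import zlib


def soap_like_message(n_items: int) -> bytes:
    """Hypothetical envelope with n_items repeated child elements."""
    items = "".join(
        f"<item><id>{i}</id><qty>{i % 5 + 1}</qty></item>" for i in range(n_items)
    )
    return ("<Envelope><Body><order>" + items + "</order></Body></Envelope>").encode("utf-8")


def ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, 9)) / len(data)


short_msg = soap_like_message(2)
long_msg = soap_like_message(2000)

print(f"short: {len(short_msg)} bytes, ratio {ratio(short_msg):.2f}")
print(f"long:  {len(long_msg)} bytes, ratio {ratio(long_msg):.2f}")
```

A short message gives the compressor little history to refer back to, and the fixed header overhead weighs more, so its ratio comes out noticeably worse.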

      The other problem with GZIP is that the implementation is fairly computation intensive. It is not something that you can implement as a simple filter with almost no memory overhead.

      GZIP is a great general purpose compression scheme for largeish quantities of data, but a scheme that is highly domain specific can usually do better.

      The main mistake we made with the Web was to not include a simple compression scheme that was optimized for HTML at an early stage. The Content encoding tag was a good idea, but we should have shipped a lightweight compressor as well.

      The approach in XMill is a good one, but I am not too happy with the table switching codes, and I think that an extra 25% compression can be wrung out at low cost. The big problem with XMill is that it was not progressed as a standard anywhere. I would give more details on the scheme, but there has been a recent rash of Patent Trolls filing patents on ideas they find on mailing lists. I call that type of behavior fraud, the courts call it a cause of action that will cost me $2 million to defend against. Congress calls this situation an opportunity to collect campaign bribes.

      XML Compression is something that I am working on, but I am already the point person in two major XML working groups and this is not a high priority. I really need an intern to crank through measuring efficiency on various document sets.

      --
      Looking for an Information Security student project suggestion?
      Try http://dotcrimeManifesto.com/
  2. don't compress the XML by disappear · · Score: 3, Interesting

    Why not reduce the information to transmit, using rsync or the equivalent? Or batch the data and use gzip. Or use ssh's -C option for compression, which won't do as good a job. But mucking with the XML is likely to be the least-understood way of doing it when it comes time to admin the system later, or update things.

  3. XML Compression... by bje2 · · Score: 3, Informative

    these guys claim to be able to compress XML at a 34-1 ratio...

    --

    "Facts are meaningless. You could use facts to prove anything that's even remotely true." - Homer Simpson
    1. Re:XML Compression... by ergo98 · · Score: 3, Interesting

      This is exactly the sort of information I'm looking for: Thank you. I've downloaded their demo and am going to give it a try on some sample data to see how it performs (of course comparing it against the ubiquitous infamous GZip).

  4. gzip? by Howie · · Score: 4, Insightful

    There is a content-encoding plugin for Apache called mod_gzip that will do the server end, for any output including dynamic. I've not tried it, but on face value it's a standards-based way of getting what you want.

    I think, although I can't find it for sure, that LWP supports gzip content-encoding too, which would mean that things like SOAP::Lite and XML-RPC would benefit too.

    more about the content-encoding thing

    --
    "don't fall into the fallacy of believing that Perl can solve social problems. Maybe Perl 6 can, but that's a ways off"
  5. Do you control both sides? by dmorin · · Score: 4, Insightful
    If you are both the producer and consumer of the XML messages, then you could get away with whatever compression scheme you liked (I always envisioned something like mapping the DTD to numeric values, since the DTD describes all tags and attributes you will use), and then converting down that way. Figure for every tag that takes up like 14 characters, you replace it with an integer (or even byte) representing what tag it is, and perhaps something for the length of the field. My point is that you could roll whatever you like if you know who is on the other end.
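A rough sketch of the tag-to-number idea, under several assumptions: the tag list is hard-coded here rather than derived from a DTD, only attribute-free tags are tokenized (anything else passes through untouched), and the marker bytes and helper names are invented for this example. It is not a real wire format.

```python
import re

# Hypothetical vocabulary; in practice it would be generated from the DTD
# and shared by producer and consumer.
TAGS = ["order", "customer", "item", "quantity"]
CODE = {name: i + 1 for i, name in enumerate(TAGS)}
NAME = {code: name for name, code in CODE.items()}

# Marker bytes. XML 1.0 text cannot contain these control characters,
# so they never collide with UTF-8 encoded content.
OPEN, CLOSE = 0x01, 0x02

TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w.-]*)>")


def encode(xml: str) -> bytes:
    """Replace each <name> or </name> with a marker byte plus a one-byte code."""
    out = bytearray()
    pos = 0
    for m in TAG_RE.finditer(xml):
        out += xml[pos:m.start()].encode("utf-8")
        marker = CLOSE if m.group(1) else OPEN
        out += bytes([marker, CODE[m.group(2)]])  # KeyError on an unknown tag
        pos = m.end()
    out += xml[pos:].encode("utf-8")
    return bytes(out)


def decode(data: bytes) -> str:
    """Inverse of encode(): expand marker/code pairs back into tags."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] in (OPEN, CLOSE):
            slash = "/" if data[i] == CLOSE else ""
            out += f"<{slash}{NAME[data[i + 1]]}>".encode("utf-8")
            i += 2
        else:
            out.append(data[i])
            i += 1
    return out.decode("utf-8")


doc = "<order><customer>Ann</customer><item>widget</item><quantity>3</quantity></order>"
encoded = encode(doc)
print(len(doc.encode("utf-8")), "->", len(encoded), "bytes")
```

Every tag shrinks to two bytes, but both ends must agree on the same table, which is exactly the coupling the rest of this comment warns about.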

    The problem of course is that if you control both the producer and consumer you're greatly limiting the applicability of XML in the first place. Just yesterday I explained to my boss that one of the advantages to XML is for cases when you have 10 people who want your data, but you can't dictate what software they use...AND, 6 months from now, 10 other people who you haven't even met yet are going to want your data too. If you're in that boat, and you create any sort of compression scheme, then you're in trouble. If you're not, then you may not need XML at all (at least, not for moving your data around).

    Perhaps you're hoping that there will be some compression module that becomes a standard part of XML, so that you can safely say "Anybody who is able to parse my XML message would also be able to decompress it"? Good luck. Even if that did happen, it would take ages for all of the parsers out there to get up to date.

    What you'll probably find is that something like SOAP or WSDL will have a compression component. But in that case it's ok, because both the client and server sides of WSDL that do the marshalling/unmarshalling will be provided for you by your tools (such as BEA WebLogic). Think about what CORBA IDL was like -- you just write the interface, and then both client and server stubs are automatically generated for you. In that case, it's perfectly reasonable to expect that some compression/decompression code could be written into the code automatically.

    1. Re:Do you control both sides? by ergo98 · · Score: 2

      I do indeed control both sides: Basically imagine it as a blackhole on both sides and I want to stuff XML in one side and have it pop out the other, but unfortunately the connection in between is a limited pipe. While GZip does a nice job, other XML specific algorithms offer dramatically more compression (and 2x the throughput, for example, is pretty worthwhile).

      Even having said that: Thankfully when they made HTTP they realized that GZip isn't the be-all and end-all, so if I implemented ErgoCompression my custom clients that sent the ErgoCompression Accept-Encoding header value could take advantage of it, while others could use GZip or compress or deflate, or whatever their client supported. To be honest I'm very surprised to see some of the responses that I've seen here on Slashdot, which is "GZip is the monopoly: Use it!", a sort of "Gzip is good enough". Well when it comes down to it, getting 2x or greater XML pushed through that limited pipe seems pretty worthwhile to me, especially because, as you mentioned, I control both ends of the pipe.
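The negotiation described here could look roughly like the sketch below on the server side. The `x-ergo` token is a made-up name standing in for the custom codec, the helper names are invented, and the parser is simplified (it ignores the `*` wildcard, for one), so treat it as an outline rather than a complete implementation of the HTTP rules.

```python
def parse_accept_encoding(header: str) -> dict:
    """Map each listed content coding to its q-value (default 1.0)."""
    prefs = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, params = part.partition(";")
        q = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat an unparsable weight as "not acceptable"
        prefs[name.strip().lower()] = q
    return prefs


# Server-side preference order: the custom codec first, then the standard ones.
SERVER_PREFERENCE = ["x-ergo", "gzip", "deflate"]


def choose_encoding(header: str) -> str:
    """Pick the first server-preferred coding the client accepts."""
    prefs = parse_accept_encoding(header)
    for name in SERVER_PREFERENCE:
        if prefs.get(name, 0.0) > 0:
            return name
    return "identity"  # send the body uncompressed


print(choose_encoding("x-ergo, gzip"))   # custom client
print(choose_encoding("gzip, deflate"))  # ordinary client
```

A client that never heard of the custom codec simply never lists it, and falls through to gzip or to an uncompressed response.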

  6. "best" option? by coyote-san · · Score: 2

    Define "best," as in "best" option.

    Are you looking for the tightest possible compression, or are you looking for "good enough" compression that's fairly standard? Or do you want "good enough" compression that can be quickly implemented?

    If you're looking for tightest possible compression, that requires a good statistical model of your data and is far beyond the scope of any answer here. Depending on your data, a good encoder could require far less bandwidth than any generic compressor, but it's highly nonstandard.

    If you're looking for something that's "good enough" and standard, there's absolutely no doubt that you should use zlib (gzip) and call it a day.

    If you're willing to stray from the standards for better compression, then bzlib generally offers better compression than zlib.

    Bottom line: figure out what you really need, then pick the tool. Don't just grab the first thing that comes to mind, or a tool that others swear by but which doesn't meet your needs.

    --
    For every complex problem there is an answer that is clear, simple, and wrong. -- H L Mencken
  7. Specifics, Algorithms by fm6 · · Score: 5, Interesting
    This question really needs more specifics. There's a good reason why you hear so little about XML compression: it's usually not worth the trouble. People assume such a verbose format is fundamentally inefficient, but when they get down to cases they find that the XML "pipe" is just not a bottleneck.

    Of course, there are exceptions. Obviously it doesn't take much to saturate a modem connection. But modern modems have data compression built in, so there's not a lot to be gained by compressing the data beforehand.

    Another example is one I had to deal with recently. Mat Ballard started distributing Kylix help files in HTML format. This is 100 meg of data, so Mat was concerned to minimize his bandwidth costs. He found he got the best result with bzip2, which reduced his file size to a mere 7.6 meg, or twice the compression of gzip.

    This caught my interest. I'd like to distribute a similar collection of documentation. But my app needs to be able to read individual files on the fly. Would bzip2 compression work equally well applied to small blocks of data?

    Unless I did something wrong, the answer is no. A bzip-in-tar file doesn't seem to come out any smaller than the equivalent zip file, and that does make sense. Bzip2 gains its superior compression by combining the Burrows-Wheeler transform with old-fashioned Huffman encoding. And BW transforms are drastically more effective on big data sets. So I might as well stick with Zip format. Oh well.
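The small-block penalty can be seen with a quick experiment using Python's `bz2` and `zlib` modules. The pages below are synthetic and far more repetitive than real help files (the `make_page` helper is invented for this sketch), so the gap is exaggerated; the point is only the direction of the effect.

```python
import bz2
import zlib


def make_page(i: int) -> bytes:
    """Hypothetical small HTML help page."""
    body = "".join(
        f"<p>Method {i}.{j} returns the value of property {j}.</p>" for j in range(20)
    )
    return f"<html><head><title>Topic {i}</title></head><body>{body}</body></html>".encode("utf-8")


pages = [make_page(i) for i in range(200)]
whole = b"".join(pages)

# Compress each page on its own (random access stays possible) ...
per_file_bz2 = sum(len(bz2.compress(p)) for p in pages)
per_file_zlib = sum(len(zlib.compress(p, 9)) for p in pages)

# ... versus compressing everything as one stream.
whole_bz2 = len(bz2.compress(whole))
whole_zlib = len(zlib.compress(whole, 9))

print("raw total:         ", len(whole))
print("per-file bz2/zlib: ", per_file_bz2, per_file_zlib)
print("one stream bz2/zlib:", whole_bz2, whole_zlib)
```

Per-file compression pays a fixed header cost for every page and cannot exploit redundancy between pages, which is the price of being able to read individual files on the fly.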

    Bottom line: no magic bullet for minimizing bandwidth costs through compression. You need to analyze your specific application and find out what's most effective -- and whether it's worth the trouble.

    1. Re:Specifics, Algorithms by fm6 · · Score: 2

      Did you read my post? Go back and try again.

  8. take out the spaces by SuiteSisterMary · · Score: 3, Insightful

    Replace any set of spaces greater than 1 with a single space. You'll cut, by a fair bit, your average XML document. :-) In other words, you can have it concise and efficient, or you can have it human-readable and pleasant to look at. :-)
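Taken literally, that is a one-line regular expression. The sketch below (the `collapse_spaces` name is made up) is deliberately naive: it also touches spaces inside text nodes and CDATA sections, so it is only safe where whitespace is not significant to the consumer.

```python
import re


def collapse_spaces(xml_text: str) -> str:
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", xml_text)


pretty = "<a>    <b>x  y</b>\n    </a>"
compact = collapse_spaces(pretty)
print(len(pretty), "->", len(compact), "characters")
```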

    --
    Vintage computer games and RPG books available. Email me if you're interested.
    1. Re:take out the spaces by SuiteSisterMary · · Score: 2

      There's also stuff like this which are hardware XML accelerators.

      --
      Vintage computer games and RPG books available. Email me if you're interested.
  9. Just use the built in compression methods. by lkaos · · Score: 2

    libXML supports a built in compression method using libz. It may not be the most efficient method of compression, but it will get the job done and it will work seamlessly with any app that uses libXML.

    I've compressed a 1.5MB XML document down to the tens of KB range. That was definitely good enough for me :)

    --
    int func(int a);
    func((b += 3, b));
  10. Search google by Hard_Code · · Score: 5, Insightful

    really, how hard is it:

    http://www.google.com/search?q=xml+compression

    The very first thing that comes up is a project on SourceForge with an in-depth explanation of the algorithm.

    --

    It's 10 PM. Do you know if you're un-American?
  11. mod_gzip by dubl-u · · Score: 2

    I've used mod_gzip for the last year or so on a number of sites, one of which gets millions of hits a day. Trading idle CPU time in for drastically reduced bandwidth is a sweet bargain indeed. As a side benefit, those on slow links get a much faster experience.

  12. Re:XML/ASN.1 Translation by Zeinfeld · · Score: 2
    There is some interesting work going on with XML/ASN.1 translation

    No there is not. ASN.1 is an utter crock. The original idea was good, the implementation sucketh, especially the DER encoding.

    ASN.1 is not particularly efficient, unless you use something like PER, which, if it is supported, requires a support library that comes on 2 CD-ROMs.

    The XML in ASN.1 proposal is simply an attempt by the die hard OSI/EDI types to try to preserve the work they have done.

    The problem with XML is the over complex schema notation, the problem with ASN.1 is the over complex encoding. So put them together and you have a perfect platform for information engineering consultants to suck an IT budget dry. If that is your objective then go for XML in ASN.1.

    --
    Looking for an Information Security student project suggestion?
    Try http://dotcrimeManifesto.com/