Slashdot Mirror


XML Compression Options?

ergo98 asks: "About a year ago I needed to evaluate XML compression technologies (for a project where two machines had to communicate via XML documents, and there was an excess of CPU power and a dearth of bandwidth). At the time the best option seemed to be a research project called XMill. However, even then it seemed to be an abandoned project with no more updates and little market presence, and it was available only as source for a command-line utility, which would require reworking into library form. I'm curious whether there's been any progress in the XML compression arena in the past year: if you have more CPU power than bandwidth, what is the best option for XML document compression? Have any XML-specific compression algorithms been made into a module for Apache?"

3 of 51 comments

  1. Specifics, Algorithms by fm6 · Score: 5, Interesting
    This question really needs more specifics. There's a good reason why you hear so little about XML compression: it's usually not worth the trouble. People assume such a verbose format is fundamentally inefficient, but when they get down to cases they find that the XML "pipe" is just not a bottleneck.

    Of course, there are exceptions. Obviously it doesn't take much to saturate a modem connection. But modern modems have data compression built in, so there's not a lot to be gained by compressing the data beforehand.

    Another example is one I had to deal with recently. Mat Ballard started distributing Kylix help files in HTML format. This is 100 meg of data, so Mat was concerned to minimize his bandwidth costs. He found he got the best result with bzip2, which reduced his file size to a mere 7.6 meg, or twice the compression of gzip.

    This caught my interest. I'd like to distribute a similar collection of documentation. But my app needs to be able to read individual files on the fly. Would bzip2 compression work equally well applied to small blocks of data?

    Unless I did something wrong, the answer is no. A bzip-in-tar file doesn't seem to come out any smaller than the equivalent zip file, and that does make sense. Bzip2 gains its superior compression by combining the Burrows-Wheeler transform with old-fashioned Huffman encoding, and BW transforms are drastically more effective on big data sets. So I might as well stick with Zip format. Oh well.
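
The per-file versus whole-archive effect described above can be checked with a rough sketch like the following. It is not fm6's actual test: the 500 synthetic HTML pages and the `compare` helper are assumptions made up for illustration, and real documentation sets will give different numbers.

```python
import bz2
import zlib

def compare(files):
    """Return total compressed sizes for per-file vs. whole-collection compression."""
    whole = b"".join(files)
    return {
        # Each file compressed on its own, roughly what a zip archive does.
        "zlib_per_file": sum(len(zlib.compress(f, 9)) for f in files),
        # Each file bzipped on its own, then stored side by side.
        "bz2_per_file": sum(len(bz2.compress(f, 9)) for f in files),
        # One bzip2 stream over everything, like a .tar.bz2.
        "bz2_whole": len(bz2.compress(whole, 9)),
    }

# Hypothetical stand-in for a documentation set: many small, similar pages.
files = [
    (f"<html><head><title>Topic {i}</title></head>"
     f"<body><h1>Topic {i}</h1><p>See also topic {i + 1}.</p></body></html>").encode()
    for i in range(500)
]

sizes = compare(files)
for name, size in sizes.items():
    print(name, size)
```

Running this on your own files, rather than the synthetic pages, is the only way to see whether the random-access layout costs enough compression to matter.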

    Bottom line: no magic bullet for minimizing bandwidth costs through compression. You need to analyze your specific application and find out what's most effective -- and whether it's worth the trouble.

  2. Re:GZIP by ergo98 · Score: 5, Informative

    The problem with GZIP is quite simply that it doesn't take advantage of XML specifics (i.e. there are domain attributes of XML that make certain algorithms much more efficient than others). XMill, as an example, manages to achieve almost twice the compression of GZIP with approximately the same CPU usage.
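
To give a feel for what "taking advantage of XML specifics" can mean, here is a toy sketch of one such idea: separate the tag structure from the text values and group values by element name, so that similar data sits together before a general-purpose compressor sees it. This is a simplified illustration, not XMill's code or file format. The function names are made up, attributes and tail text are ignored, and the output is not reversible as written.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict

def split_containers(xml_text):
    """Separate tag structure from text values, grouping values by tag name."""
    structure = []                  # open/close markers only, no data
    containers = defaultdict(list)  # tag name -> list of text values

    def walk(elem):
        structure.append("<" + elem.tag)
        text = (elem.text or "").strip()
        if text:
            containers[elem.tag].append(text)
        for child in elem:
            walk(child)
        structure.append(">")

    walk(ET.fromstring(xml_text))
    return structure, containers

def grouped_size(xml_text):
    """Total zlib-compressed size of the structure plus each container."""
    structure, containers = split_containers(xml_text)
    total = len(zlib.compress("\n".join(structure).encode(), 9))
    for values in containers.values():
        total += len(zlib.compress("\n".join(values).encode(), 9))
    return total

# Hypothetical sample document, for illustration only.
doc = "<log>" + "".join(
    f"<entry><ip>10.0.0.{i % 250}</ip><status>200</status></entry>"
    for i in range(1000)
) + "</log>"

print("plain zlib:", len(zlib.compress(doc.encode(), 9)))
print("grouped:   ", grouped_size(doc))
```

Whether grouping helps depends heavily on the data, so compare both numbers on representative documents before drawing conclusions.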

    In this case, adding a server-side module that applies XMill (obeying the Accept-Encoding header that the HTTP client sends, of course, so that a client that doesn't handle the custom compression is not thwarted) would let us use a custom HTTP client that handles the compression and achieves twice the throughput on the limited pipe. As I mentioned in the submission, we have more CPU power than bandwidth, so that 2x compression improvement is very significant (there is a world of low-bandwidth vertical uses out there: satellite, frame relay, CDPD, etc.). For those who cringe at the idea of using XML to begin with, please realize that with the proper compression XML-encoded data is smaller than any proprietary packaging because of its high degree of predictability.
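
The Accept-Encoding negotiation mentioned above could look roughly like this on the server side. It is a minimal sketch, not an Apache module: the `xmill` coding token is hypothetical (it is not a registered HTTP content coding), the function names are made up, and wildcard (`*`) entries are deliberately ignored so that a custom coding is only used when the client names it explicitly.

```python
def parse_accept_encoding(header):
    """Parse an Accept-Encoding header value into {coding: qvalue}."""
    prefs = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        coding, _, params = part.partition(";")
        q = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed qvalue as "not acceptable"
        prefs[coding.strip().lower()] = q
    return prefs

def choose_encoding(header, supported=("xmill", "gzip")):
    """Pick the first server-preferred coding the client explicitly accepts."""
    prefs = parse_accept_encoding(header)
    for coding in supported:
        if prefs.get(coding, 0.0) > 0:
            return coding
    return "identity"  # fall back to sending the XML uncompressed

# A custom client would advertise the hypothetical coding; a browser would not.
print(choose_encoding("xmill, gzip;q=0.5"))
print(choose_encoding("gzip, deflate"))
```

Ordering `supported` by server preference keeps the decision simple; a fuller implementation would also weigh the client's q-values against each other.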

  3. Search google by Hard_Code · Score: 5, Insightful

    Really, how hard is it?

    http://www.google.com/search?q=xml+compression

    The very first thing that comes up is a project on SourceForge with an in-depth explanation of the algorithm.

    --

    It's 10 PM. Do you know if you're un-American?