Slashdot Mirror


XML Compression Options?

ergo98 asks: "About a year ago I needed to evaluate XML compression technologies (for a project where two machines had to communicate via XML documents, with an excess of CPU power and a dearth of bandwidth). At the time the best option seemed to be a research project called XMill; however, even then it seemed to be abandoned, with no more updates and little market presence, and it was available only as source for a command-line utility that would need reworking into library form. I'm curious whether there's been any progress in the XML compression arena in the past year: if you have more CPU power than bandwidth, what is the best option for XML document compression? Have any XML-specific compression algorithms been made into a module for Apache?"

3 of 51 comments

  1. don't compress the XML by disappear · · Score: 3, Interesting

    Why not reduce the information to transmit, using rsync or the equivalent? Or batch the data and use gzip. Or use ssh's -C option for compression, which won't do as good a job. But mucking with the XML is likely to be the least-understood way of doing it when it comes time to admin the system later, or update things.
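
    A minimal sketch of the "batch the data and use gzip" idea, assuming the XML documents are already available as byte strings. The helper names `pack_batch` and `unpack_batch` and the 4-byte length-prefix framing are illustrative assumptions, not anything from the original setup:

```python
import gzip
import struct


def pack_batch(documents):
    """Length-prefix each XML document, then gzip the whole batch in one pass."""
    # Hypothetical framing: a 4-byte big-endian length before each document.
    framed = b"".join(struct.pack(">I", len(doc)) + doc for doc in documents)
    return gzip.compress(framed)


def unpack_batch(payload):
    """Reverse pack_batch: gunzip, then split on the length prefixes."""
    framed = gzip.decompress(payload)
    documents = []
    offset = 0
    while offset < len(framed):
        (length,) = struct.unpack_from(">I", framed, offset)
        offset += 4
        documents.append(framed[offset:offset + length])
        offset += length
    return documents
```

    Compressing the batch as one stream lets gzip reuse tag names repeated across documents, which it cannot do when each small document is compressed separately.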

  2. Specifics, Algorithms by fm6 · · Score: 5, Interesting
    This question really needs more specifics. There's a good reason why you hear so little about XML compression: it's usually not worth the trouble. People assume such a verbose format is fundamentally inefficient, but when they get down to cases they find that the XML "pipe" is just not a bottleneck.

    Of course, there are exceptions. Obviously it doesn't take much to saturate a modem connection. But modern modems have data compression built in, so there's not a lot to be gained by compressing the data beforehand.

    Another example is one I had to deal with recently. Mat Ballard started distributing Kylix help files in HTML format. This is 100 meg of data, so Mat was concerned to minimize his bandwidth costs. He found he got the best result with bzip2, which reduced his file size to a mere 7.6 meg, or twice the compression of gzip.

    This caught my interest. I'd like to distribute a similar collection of documentation. But my app needs to be able to read individual files on the fly. Would bzip2 compression work equally well applied to small blocks of data?

    Unless I did something wrong, the answer is no. A bzip-in-tar file doesn't seem to come out any smaller than the equivalent zip file. That does make sense, though: bzip2 gains its superior compression by combining the Burrows-Wheeler transform with old-fashioned Huffman encoding, and BW transforms are drastically more effective on big data sets. So I might as well stick with Zip format. Oh well.
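
    One rough way to check this kind of claim is to compress the same pages once as a single stream and once file by file. The sketch below uses Python's standard `bz2` and `zlib` modules; the synthetic pages from `make_pages` are a hypothetical stand-in for real help files, so the numbers only illustrate the effect and say nothing about the Kylix data:

```python
import bz2
import zlib


def make_pages(count=200):
    """Build synthetic HTML-like pages (hypothetical stand-in for real help files)."""
    pages = []
    for i in range(count):
        body = "".join(
            f"<p>Topic {i}, paragraph {j}: see the reference for details.</p>\n"
            for j in range(20)
        )
        page = f"<html><head><title>Page {i}</title></head><body>\n{body}</body></html>\n"
        pages.append(page.encode("utf-8"))
    return pages


def compare(pages):
    """Return total compressed sizes for whole-stream vs. per-page compression."""
    whole = b"".join(pages)
    return {
        "raw": len(whole),
        "bz2_whole": len(bz2.compress(whole)),
        "bz2_per_page": sum(len(bz2.compress(p)) for p in pages),
        "zlib_whole": len(zlib.compress(whole, 9)),
        "zlib_per_page": sum(len(zlib.compress(p, 9)) for p in pages),
    }


if __name__ == "__main__":
    for name, size in compare(make_pages()).items():
        print(f"{name:>14}: {size} bytes")
```

    Per-page compression keeps individual files readable on the fly, but every page pays its own header cost and cannot share redundancy with its neighbours, so the totals come out larger than the whole-stream figures.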

    Bottom line: no magic bullet for minimizing bandwidth costs through compression. You need to analyze your specific application and find out what's most effective -- and whether it's worth the trouble.

  3. Re:XML Compression... by ergo98 · · Score: 3, Interesting

    This is exactly the sort of information I'm looking for. Thank you. I've downloaded their demo and am going to try it on some sample data to see how it performs (comparing it, of course, against the ubiquitous, infamous GZip).