XML Compression Options?
ergo98 asks: "About a year ago I needed to evaluate XML compression technologies (for a project where two machines had to communicate via XML documents, and there was an excess of CPU power and a dearth of bandwidth). At the time the best option seemed to be a research project called XMill; however, even then it seemed to be an abandoned project with no more updates and little market presence, and it was available only as source for a command-line utility that would have required reworking into library form. I'm curious whether there's been any progress in the XML compression arena in the past year: if you have more CPU power than bandwidth, what is the best option for XML document compression? Have any XML-specific compression algorithms been made into a module for Apache?"
How about simply using text compression on the XML? Since gzip has a backwards token index it compresses XML quite nicely. It is available in Java as well as C-based implementations. If Windows is your platform of choice you can get at it via Cygwin.
--Shemnon
...gzip? sure it's not state of the art, but everything supports it, and it's a decent compressor....
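As a rough illustration of the "just gzip it" suggestion, here is a minimal Python sketch. The sample document and the helper name make_sample_xml are made up for the example; the actual ratio depends entirely on your own data.

```python
import gzip

def make_sample_xml(n=200):
    """Build a small, repetitive XML document (hypothetical sample data)."""
    rows = "".join(f"<book><name>Title {i}</name></book>" for i in range(n))
    return f"<library>{rows}</library>".encode("utf-8")

xml_bytes = make_sample_xml()

# Compress on the sending side...
packed = gzip.compress(xml_bytes)

# ...and decompress on the receiving side; the round trip is lossless.
unpacked = gzip.decompress(packed)

print(len(xml_bytes), "bytes raw,", len(packed), "bytes gzipped")
```

The repeated tag names are exactly the kind of redundancy a general-purpose text compressor picks up without knowing anything about XML.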
Why not reduce the information to transmit, using rsync or the equivalent? Or batch the data and use gzip. Or use ssh's -C option for compression, which won't do as good a job. But mucking with the XML is likely to be the least-understood way of doing it when it comes time to admin the system later, or update things.
these guys claim to be able to compress XML at a 34-1 ratio...
"Facts are meaningless. You could use facts to prove anything that's even remotely true." - Homer Simpson
There is a content-encoding plugin for Apache called mod_gzip that will do the server end, for any output including dynamic. I've not tried it, but on face value it's a standards-based way of getting what you want.
I think, although I can't find it for sure, that LWP supports gzip content-encoding too, which would mean that things like SOAP::Lite and XML-RPC would benefit too.
more about the content-encoding thing
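For what the client side of gzip content-encoding might look like, here is a small Python sketch. It assumes the server (mod_gzip or anything else) sets the standard Content-Encoding: gzip header when it compresses a response; decode_body and the simulated response are hypothetical and no real network call is made.

```python
import gzip

# Header a client would send to say it can accept a gzipped response.
REQUEST_HEADERS = {"Accept-Encoding": "gzip"}

def decode_body(response_headers, body):
    """Return the response text, undoing gzip only if the server applied it."""
    if response_headers.get("Content-Encoding", "").lower() == "gzip":
        body = gzip.decompress(body)
    return body.decode("utf-8")

# Simulated server response: the XML payload, gzipped, with the matching header.
payload = "<methodResponse><params/></methodResponse>"
simulated_headers = {"Content-Encoding": "gzip"}
simulated_body = gzip.compress(payload.encode("utf-8"))

print(decode_body(simulated_headers, simulated_body))
```

Because the negotiation happens in HTTP headers, a client that never sends Accept-Encoding should simply get the uncompressed XML.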
"don't fall into the fallacy of believing that Perl can solve social problems. Maybe Perl 6 can, but that's a ways off"
It's text. Gzip/Bzip2/Compress/(pk)zip will all do a decent job. If you are using C++, the following library may be useful. http://www.cs.unc.edu/Research/compgeom/gzstream/. Basically, it implements an igzstream and ogzstream which look and act like normal C++ IO streams, but automagically gzip (or gunzip) the data.
--
The problem of course is that if you control both the producer and consumer you're greatly limiting the applicability of XML in the first place. Just yesterday I explained to my boss that one of the advantages of XML is for cases when you have 10 people who want your data, but you can't dictate what software they use...AND, 6 months from now, 10 other people who you haven't even met yet are going to want your data too. If you're in that boat, and you create any sort of compression scheme, then you're in trouble. If you're not, then you may not need XML at all (at least, not for moving your data around).
Perhaps you're hoping that there will be some compression module that becomes a standard part of XML, so that you can safely say "Anybody who is able to parse my XML message would also be able to decompress it"? Good luck. Even if that did happen, it would take ages for all of the parsers out there to get up to date.
What you'll probably find is that something like SOAP or WSDL will have a compression component. But in that case it's ok, because both the client and server sides of WSDL that do the marshalling/unmarshalling will be provided for you by your tools (such as BEA WebLogic). Think about what CORBA IDL was like -- you just write the interface, and then both client and server stubs are automatically generated for you. In that case, it's perfectly reasonable to expect that some compression/decompression code could be written in to the code automatically.
Define "best," as in "best" option.
Are you looking for the tightest possible compression, or are you looking for "good enough" compression that's fairly standard? Or do you want "good enough" compression that can be quickly implemented?
If you're looking for tightest possible compression, that requires a good statistical model of your data and is far beyond the scope of any answer here. Depending on your data, a good encoder could require far less bandwidth than any generic compressor, but it's highly nonstandard.
If you're looking for something that's "good enough" and standard, there's absolutely no doubt that you should use zlib (gzip) and call it a day.
If you're willing to stray from the standards for better compression, then bzlib generally offers better compression than zlib.
Bottom line: figure out what you really need, then pick the tool. Don't just grab the first thing that comes to mind, or a tool that others swear by but which doesn't meet your needs.
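One quick way to "figure out what you really need" is to measure both compressors on your own documents. The Python sketch below does that with the standard zlib and bz2 modules; the sample data and the compare helper are made up for the example, and which compressor wins will depend on the size and shape of your real XML.

```python
import bz2
import zlib

def compare(data):
    """Return the raw size and the zlib/bz2 compressed sizes of data, in bytes."""
    return {
        "raw": len(data),
        "zlib": len(zlib.compress(data, 9)),
        "bz2": len(bz2.compress(data, 9)),
    }

# Hypothetical sample: substitute a representative document of your own.
sample = (
    "<library>"
    + "".join(f"<book><name>Title {i}</name></book>" for i in range(2000))
    + "</library>"
).encode("utf-8")

sizes = compare(sample)
for name, size in sizes.items():
    print(f"{name:>4}: {size} bytes")
```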
For every complex problem there is an answer that is clear, simple, and wrong. -- H L Mencken
Of course, there are exceptions. Obviously it doesn't take much to saturate a modem connection. But modern modems have data compression built in, so there's not a lot to be gained by compressing the data beforehand.
Another example is one I had to deal with recently. Mat Ballard started distributing Kylix help files in HTML format. This is 100 meg of data, so Mat was concerned to minimize his bandwidth costs. He found he got the best result with bzip2, which reduced his file size to a mere 7.6 meg, or twice the compression of gzip.
This caught my interest. I'd like to distribute a similar collection of documentation. But my app needs to be able to read individual files on the fly. Would bzip2 compression work equally well applied to small blocks of data?
Unless I did something wrong, the answer is no. A bzip-in-tar file doesn't seem to come out any smaller than the equivalent zip file. Perhaps I did something wrong, but it does make sense. Bzip2 gains its superior compression by combining the Burrows-Wheeler transform with old-fashioned Huffman encoding. And BW transforms are drastically more effective on big data sets. So I might as well stick with Zip format. Oh well.
Bottom line: no magic bullet for minimizing bandwidth costs through compression. You need to analyze your specific application and find out what's most effective -- and whether it's worth the trouble.
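The small-blocks effect is easy to reproduce. This Python sketch compresses a set of small, made-up pages one at a time and then as a single blob, using the standard bz2 module; the page contents and make_pages are hypothetical, so treat the numbers as illustrative only.

```python
import bz2

def make_pages(count=100):
    """Build many small documents (hypothetical stand-ins for help pages)."""
    return [
        (
            f"<page id='{i}'><title>Topic {i}</title>"
            f"<body>Help text for topic number {i}.</body></page>"
        ).encode("utf-8")
        for i in range(count)
    ]

pages = make_pages()

# Each page compressed on its own, so it can be read individually on the fly.
per_page_total = sum(len(bz2.compress(page)) for page in pages)

# All pages compressed together as one big block.
whole_archive = len(bz2.compress(b"".join(pages)))

print("per-page total:", per_page_total, "bytes")
print("single archive:", whole_archive, "bytes")
```

Compressing each page separately pays the stream overhead every time and cannot share redundancy between pages, which is the trade-off for random access.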
I think that if you use something like WBXML you will keep some of the structure in the file (i.e. your parser will be simpler/faster). http://www.devx.com/xmlzone/articles/bs120101/Part 1/bs120101p1-1.asp
(Well, I never tried it, so maybe I should just shut up.)
There is some interesting work going on with XML/ASN.1 translation. Essentially, one can create a 1-1 mapping between an ASN.1 module and an XML schema, and use whatever encoding you're comfortable with (i.e., XML encoding, or ASN.1 [DBP]ER). The advantage here is that the encoding rules for ASN.1 (especially the PER (Packed Encoding Rules)) are very space efficient. Thus, by re-encoding the XML data as ASN.1, one can achieve remarkable levels of compression.
This site has some useful information on the proposed standardization of this effort. There was also a /. post about a recent EE Times article on the subject.
Replace any set of spaces greater than 1 with a single space. You'll cut, by a fair bit, your average XML document. :-)
In other words, you can have it concise and efficient, or you can have it human-readable and pleasant to look at. :-)
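Taken literally, the space-squeezing idea is a one-line regular expression. The Python sketch below uses a hypothetical helper named squeeze_spaces; note that it also collapses runs of spaces inside text content, so it only suits documents where that whitespace is not significant.

```python
import re

def squeeze_spaces(xml_text):
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", xml_text)

indented = "<a>    <b>x  y</b>  </a>"
print(squeeze_spaces(indented))
```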
...
Before a compression:
<library>
<book>
<name>
The Bible
</name>
</book>
<book>
<name>
War and Peace
</name>
</book>
<book>
<name>
Ulysses
</name>
</book>
<book>
<name>
1984
</name>
</book>
</library>
...
After a compression:
1 library 2 book 3 name
1 2 3 The Bible /3 /2 2 3 War and Peace /3 /2 2 3 Ulysses /3 /2 2 3 1984 /3 /2 /1
....
That cut the size down to 40%, and 1/4 of that is the lookup table. It'll depend on how verbose you are with your tagnames, of course, and attributes would get tricky, as would any value that collided with a tag index ref, but you get the idea.
But I wasn't even trying and I got it down to 40% of what it was, and that's considering that 1/4 of that is the lookup table, which stays constant even with a massive increase in records. After that, a zip can cut it down to 25% of what THAT is. I went from 2400b to 200b, and I wasn't even trying.
Of course, a straight zip on the 2400b one brings me down to 220b, so maybe I shouldn't be TOO proud of myself.
The point is, you know your own XML code and you probably know what shortcuts you can take on a compression algorithm.
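The tag-index scheme in the example can be sketched in a few lines of Python. This is only one reading of it: the function name encode is made up, attributes and mixed content are ignored, and nothing here deals with a value colliding with a tag index, which the post already flags as a problem.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<library>
<book><name>The Bible</name></book>
<book><name>War and Peace</name></book>
<book><name>Ulysses</name></book>
<book><name>1984</name></book>
</library>"""

def encode(xml_text):
    """Replace tag names with small integers and emit a lookup table first."""
    root = ET.fromstring(xml_text)
    table = {}   # tag name -> index, numbered in order of first appearance
    tokens = []

    def walk(elem):
        index = table.setdefault(elem.tag, len(table) + 1)
        tokens.append(str(index))            # opening tag becomes "N"
        text = (elem.text or "").strip()
        if text:
            tokens.append(text)              # element value is kept as-is
        for child in elem:
            walk(child)
        tokens.append("/" + str(index))      # closing tag becomes "/N"

    walk(root)
    header = " ".join(f"{index} {tag}" for tag, index in table.items())
    return header + "\n" + " ".join(tokens)

print(encode(SAMPLE))
```

Run on the library document, this prints the same lookup table and token stream shown in the "after" listing. A decoder would need a rule for telling values from index tokens (a book titled "2" would be ambiguous), so a real version would probably escape or length-prefix the values.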
There's a nice article at http://www-106.ibm.com/developerworks/xml/library/x-matters13.html?dwzone=xml on the subject.
There is a list of resources at http://wbxml4j.sourceforge.net/
XMill by AT&T is free (not quite GPL, but the source is there too), and it takes advantage of the redundancy in XML data so that it's super efficient. Here are some comparisons to plain old gzip compression (it blows it away). It'd be horrible on random data, but it squishes XML like you wouldn't believe.
libXML supports a built-in compression method using libz. It may not be the most efficient method of compression, but it will get the job done and it will work seamlessly with any app that uses libXML.
:)
I've compressed a 1.5MB XML document down to the tens of KB range. That was definitely good enough for me
int func(int a);
func((b += 3, b));
My old company had a freeware product specifically designed to do this. It's called 'XMLZip' and can still be downloaded
from XMLSolutions (Windows and Solaris/Linux distros are available). It's written in Java - I think the source
might be available too.
This distribution contains two Java programs for manipulating XMLZip files (gzip compatible).
XMLZip reads an XML file and writes an XMLZip file.
XMLUnzip reads an XMLZip file and writes an XML file.
A large XML file containing the UNSPSC product classification hierarchy is included. It can be used to see the structure
of the XMLZip file, the reduction in size for various level parameters, and so on.
See the readme.txt file for details on how it works.
really, how hard is it:
http://www.google.com/search?q=xml+compression
The very first thing that comes up is a project on SourceForge with in depth explanation of algorithm.
It's 10 PM. Do you know if you're un-American?
I've used mod_gzip for the last year or so on a number of sites, one of which gets millions of hits a day. Trading idle CPU time in for drastically reduced bandwidth is a sweet bargain indeed. As a side benefit, those on slow links get a much faster experience.
So, you control both ends of the communication, so you can implement any compression scheme you want? Why not choose IIOP (the CORBA protocol)? I bet IIOP would give you the compression that you want because it is a very efficient protocol. It is also widely supported (many ORB vendors for various programming languages, and Java 2 Standard Edition even comes with a free one).
The binary representation of ASN.1 is supposed to be quite good at representing XML.
If you have a schema for your data, then it can simplify the data by making assumptions about the order and format of the data. At least that is my understanding. And of course it's no longer verbose and human-readable.
Hi, have you looked into WBXML? It is a standard used for (the quite dead, I think) WAP protocol, where the XML tags are translated into binary equivalents (I believe there is also some compression done), and I would suggest it also compresses the elements. I do not know how well it fares or even if it is better than gzip, but it might be worth a look for what you are doing.
Basically, it's a serialisation of XML using a tokenisation system - tags, attributes and even values become tokens, and has extensions for unknown items.
It also has extensions for string tables, where commonly used words or phrases can be given, with a lookup into the string table index used.
Although it doesn't actually compress XML per se, it does do a fairly good job (unfortunately, I don't have any figures to hand).