XML Compression Options?
ergo98 asks: "About a year ago I needed to evaluate XML compression technologies (for a project where two machines had to communicate via XML documents, and there was an excess of CPU power and a dearth of bandwidth). At the time the best option seemed to be a research project called XMill. Even then, however, it seemed to be an abandoned project with no more updates and little market presence, and it was available only as source for a command-line utility that would need reworking into library form. I'm curious whether there's been any progress in the XML compression arena in the past year: if you have more CPU power than bandwidth, what is the best option for XML document compression? Have any XML-specific compression algorithms been made into a module for Apache?"
How about simply using text compression on the XML? Since gzip replaces repeated strings with back-references to earlier occurrences, it compresses XML's repetitive tags quite nicely. It is available in Java as well as in C-based implementations. If Windows is your platform of choice, you can get at it via Cygwin.
--Shemnon
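For what it's worth, here is a minimal sketch of the "just gzip it" approach using Python's standard `gzip` module. The record layout and the names `make_sample_xml`, `gzip_xml`, and `gunzip_xml` are made up for illustration; real documents will compress more or less well depending on how repetitive they are.

```python
import gzip


def make_sample_xml(n_records: int = 500) -> bytes:
    """Build a repetitive XML document (hypothetical schema, illustration only)."""
    rows = [
        f'  <record id="{i}"><name>item{i}</name><price>{i * 1.5:.2f}</price></record>'
        for i in range(n_records)
    ]
    text = '<?xml version="1.0"?>\n<records>\n' + "\n".join(rows) + "\n</records>\n"
    return text.encode("utf-8")


def gzip_xml(xml_bytes: bytes, level: int = 9) -> bytes:
    """Compress the serialized XML as plain text; no XML-specific knowledge is used."""
    return gzip.compress(xml_bytes, compresslevel=level)


def gunzip_xml(blob: bytes) -> bytes:
    """Recover the original bytes on the receiving side."""
    return gzip.decompress(blob)


if __name__ == "__main__":
    xml = make_sample_xml()
    packed = gzip_xml(xml)
    print(f"{len(xml)} bytes -> {len(packed)} bytes")
```

Since you have CPU to spare, level 9 is the obvious setting to try first; measure on your own documents before deciding.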
Why not reduce the information to transmit, using rsync or the equivalent? Or batch the data and use gzip. Or use ssh's -C option for compression, which won't do as good a job. But mucking with the XML is likely to be the least-understood way of doing it when it comes time to admin the system later, or update things.
these guys claim to be able to compress XML at a 34-1 ratio...
"Facts are meaningless. You could use facts to prove anything that's even remotely true." - Homer Simpson
There is a content-encoding plugin for Apache called mod_gzip that will do the server end for any output, including dynamic content. I've not tried it, but at face value it's a standards-based way of getting what you want.
I think, although I can't find it for sure, that LWP supports gzip content-encoding too, which would mean that things like SOAP::Lite and XML-RPC would benefit too.
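To make the content-encoding idea concrete, here is a rough sketch of the negotiation in Python, with no network involved. It is not how mod_gzip or LWP are implemented; `encode_response` and `decode_response` are hypothetical helper names, and the header parsing is deliberately simplified (it ignores q-values, so `gzip;q=0` would wrongly count as accepted).

```python
import gzip
from typing import Dict, Tuple


def encode_response(body: bytes, accept_encoding: str) -> Tuple[Dict[str, str], bytes]:
    """Server side: gzip the body only if the client advertised gzip support."""
    offered = [token.split(";")[0].strip().lower() for token in accept_encoding.split(",")]
    if "gzip" in offered:
        return {"Content-Encoding": "gzip"}, gzip.compress(body)
    # Client did not ask for gzip: send the body untouched.
    return {}, body


def decode_response(headers: Dict[str, str], body: bytes) -> bytes:
    """Client side: undo the encoding named in the Content-Encoding header, if any."""
    if headers.get("Content-Encoding", "").lower() == "gzip":
        return gzip.decompress(body)
    return body


if __name__ == "__main__":
    payload = b"<methodCall><methodName>ping</methodName></methodCall>" * 50
    headers, wire = encode_response(payload, "gzip, deflate")
    print(headers, len(payload), "->", len(wire))
    assert decode_response(headers, wire) == payload
```

The nice part of doing it this way is that clients which never send `Accept-Encoding: gzip` keep getting plain XML, so nobody is locked out.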
more about the content-encoding thing
"don't fall into the fallacy of believing that Perl can solve social problems. Maybe Perl 6 can, but that's a ways off"
The problem, of course, is that if you control both the producer and consumer, you're greatly limiting the applicability of XML in the first place. Just yesterday I explained to my boss that one of the advantages of XML is for cases when you have 10 people who want your data, but you can't dictate what software they use...AND, 6 months from now, 10 other people who you haven't even met yet are going to want your data too. If you're in that boat, and you create any sort of compression scheme, then you're in trouble. If you're not, then you may not need XML at all (at least, not for moving your data around).
Perhaps you're hoping that there will be some compression module that becomes a standard part of XML, so that you can safely say "Anybody who is able to parse my XML message would also be able to decompress it"? Good luck. Even if that did happen, it would take ages for all of the parsers out there to get up to date.
What you'll probably find is that something like SOAP or WSDL will have a compression component. But in that case it's ok, because both the client and server sides of WSDL that do the marshalling/unmarshalling will be provided for you by your tools (such as BEA WebLogic). Think about what CORBA IDL was like -- you just write the interface, and then both client and server stubs are automatically generated for you. In that case, it's perfectly reasonable to expect that some compression/decompression code could be written into the generated code automatically.
www.HearMySoulSpeak.com
Define "best," as in "best" option.
Are you looking for the tightest possible compression, or are you looking for "good enough" compression that's fairly standard? Or do you want "good enough" compression that can be quickly implemented?
If you're looking for tightest possible compression, that requires a good statistical model of your data and is far beyond the scope of any answer here. Depending on your data, a good encoder could require far less bandwidth than any generic compressor, but it's highly nonstandard.
If you're looking for something that's "good enough" and standard, there's absolutely no doubt that you should use zlib (gzip) and call it a day.
If you're willing to stray from the standards for better compression, then bzip2 (via libbz2) generally offers better compression than zlib.
Bottom line: figure out what you really need, then pick the tool. Don't just grab the first thing that comes to mind, or a tool that others swear by but which doesn't meet your needs.
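One way to "figure out what you really need" is simply to measure. The sketch below uses Python's standard `zlib` and `bz2` modules to report sizes for the same input; `sample_xml` and `compressed_sizes` are hypothetical names and the sample data is invented, so run it on your own documents rather than trusting any ratio it prints.

```python
import bz2
import zlib


def sample_xml(n_records: int = 2000) -> bytes:
    """Invented, repetitive XML used only as stand-in data."""
    rows = [f"<entry><key>k{i}</key><value>{i % 97}</value></entry>" for i in range(n_records)]
    return ("<entries>" + "".join(rows) + "</entries>").encode("utf-8")


def compressed_sizes(data: bytes) -> dict:
    """Return the byte counts for raw, zlib-compressed, and bz2-compressed data."""
    return {
        "raw": len(data),
        "zlib": len(zlib.compress(data, 9)),
        "bz2": len(bz2.compress(data, 9)),
    }


if __name__ == "__main__":
    for name, size in compressed_sizes(sample_xml()).items():
        print(f"{name:>4}: {size} bytes")
```

Which one wins, and by how much, depends on document size and content, so treat the output as a measurement of your data and not as a general ranking.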
For every complex problem there is an answer that is clear, simple, and wrong. -- H L Mencken
Of course, there are exceptions. Obviously it doesn't take much to saturate a modem connection. But modern modems have data compression built in, so there's not a lot to be gained by compressing the data beforehand.
Another example is one I had to deal with recently. Mat Ballard started distributing Kylix help files in HTML format. This is 100 meg of data, so Mat was concerned to minimize his bandwidth costs. He found he got the best result with bzip2, which reduced his file size to a mere 7.6 meg, or twice the compression of gzip.
This caught my interest. I'd like to distribute a similar collection of documentation. But my app needs to be able to read individual files on the fly. Would bzip2 compression work equally well applied to small blocks of data?
Unless I did something wrong, the answer is no. A bzip-in-tar file doesn't seem to come out any smaller than the equivalent zip file. Perhaps I did something wrong, but it does make sense. Bzip2 gains its superior compression by combining the Burrows-Wheeler transform with old-fashioned Huffman encoding. And BW transforms are drastically more effective on big data sets. So I might as well stick with Zip format. Oh well.
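Here is a small Python sketch of the experiment described above: compress many small pages one at a time, then compress them as one big stream, and compare totals. The pages are invented stand-ins for help files, and `make_pages`, `per_file_total`, and `solid_total` are hypothetical names.

```python
import bz2
import zlib


def make_pages(count: int = 200) -> list:
    """Invented small HTML help pages, for illustration only."""
    return [
        (f"<html><head><title>Topic {i}</title></head>"
         f"<body><h1>Topic {i}</h1><p>Help text for function number {i}.</p></body></html>"
         ).encode("utf-8")
        for i in range(count)
    ]


def per_file_total(pages, compress) -> int:
    """Total size when every page is compressed on its own (random access stays possible)."""
    return sum(len(compress(page)) for page in pages)


def solid_total(pages, compress) -> int:
    """Size when all pages are concatenated and compressed as a single stream."""
    return len(compress(b"".join(pages)))


if __name__ == "__main__":
    pages = make_pages()
    for name, compress in (("zlib", zlib.compress), ("bz2", bz2.compress)):
        print(name, "per-file:", per_file_total(pages, compress),
              "solid:", solid_total(pages, compress))
```

Compressing each page separately throws away everything the pages have in common, which is the price of being able to read any one of them on the fly.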
Bottom line: no magic bullet for minimizing bandwidth costs through compression. You need to analyze your specific application and find out what's most effective -- and whether it's worth the trouble.
Replace any run of more than one space with a single space. You'll cut the size of your average XML document by a fair bit. :-)
In other words, you can have it concise and efficient, or you can have it human-readable and pleasant to look at. :-)
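A minimal sketch of that trick in Python, assuming whitespace carries no meaning anywhere in your documents. It is a blunt regex, not an XML-aware pass, so it will also squeeze spaces inside text content and attribute values; `collapse_spaces` is a hypothetical name.

```python
import re

# Two or more consecutive space characters (tabs and newlines are left alone).
_RUN_OF_SPACES = re.compile(r" {2,}")


def collapse_spaces(xml_text: str) -> str:
    """Replace every run of two or more spaces with a single space."""
    return _RUN_OF_SPACES.sub(" ", xml_text)


if __name__ == "__main__":
    pretty = "<a>    <b>x</b>  </a>"
    print(collapse_spaces(pretty))  # prints "<a> <b>x</b> </a>"
```

If the XML is going through gzip anyway, the saving from this step is likely to be small, since long runs of spaces already compress very well.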
Vintage computer games and RPG books available. Email me if you're interested.
libXML supports a built-in compression method using libz. It may not be the most efficient method of compression, but it will get the job done and it will work seamlessly with any app that uses libXML.
:)
I've compressed a 1.5MB XML document down to the tens-of-KB range. That was definitely good enough for me.
int func(int a);
func((b += 3, b));
really, how hard is it:
http://www.google.com/search?q=xml+compression
The very first thing that comes up is a project on SourceForge with an in-depth explanation of the algorithm.
It's 10 PM. Do you know if you're un-American?
I've used mod_gzip for the last year or so on a number of sites, one of which gets millions of hits a day. Trading idle CPU time in for drastically reduced bandwidth is a sweet bargain indeed. As a side benefit, those on slow links get a much faster experience.
No, there is not. ASN.1 is an utter crock. The original idea was good, the implementation sucketh, especially the DER encoding.
ASN.1 is not particularly efficient unless you use something like PER, which, where it is supported, requires a support library that comes on 2 CD-ROMs.
The XML in ASN.1 proposal is simply an attempt by the die hard OSI/EDI types to try to preserve the work they have done.
The problem with XML is the overcomplex schema notation; the problem with ASN.1 is the overcomplex encoding. So put them together and you have a perfect platform for information engineering consultants to suck an IT budget dry. If that is your objective then go for XML in ASN.1.
Looking for an Information Security student project suggestion?
Try http://dotcrimeManifesto.com/