Exhaustive Data Compressor Comparison
crazyeyes writes "This is easily the best article I've seen comparing data compression software. The author tests 11 compressors: 7-zip, ARJ32, bzip2, gzip, SBC Archiver, Squeez, StuffIt, WinAce, WinRAR, WinRK, and WinZip. All are tested using 8 filesets: audio (WAV and MP3), documents, e-books, movies (DivX and MPEG), and pictures (PSD and JPEG). He tests them at different settings and includes the aggregated results. Spoilers: WinRK gives the best compression but operates slowest; AJR32 is fastest but compresses least."
as its slashdotted
this site
http://www.maximumcompression.com/
has been up for years and performs tests on all the compressors with various input sources, much more comprehensive
I read this earlier today through the firehose. It was interesting, but the graphs are what struck me. It seems to me all the graphs should have been XY plots instead of pairs of histograms. That way you could easily see the relationship between compression ratio and time taken. Their "metric" for showing this, basically multiplying the two numbers, is pretty bogus and isn't nearly as easy to compare. With the XY plot the four corners are all very meaningful. One is slow with no compression, one each good compression/time, and the sweet spot of good compression and good time. It's easy to tell those on two opposing corners apart (good compression vs good time), where as with the article's metric they could look very similar.
Still, interesting to see. The popular formats are VERY well established at this point (ZIP in Windows and Mac (stuffit seems to be fading fast), and GZIP and BZIP2 on Linux). They are so common (especially with ZIP support built into Windows since XP and also built into OS X) I don't think we'll see them replaced any time soon. Of course, with CPU power getting cheaper and cheaper we are seeing formats that are more and compressed (MP3, H264, Divx, JPEG, etc) so these utilities are becoming less and less necessary. I no longer need to stuff files on floppies (I've got the net, DVD-Rs, and flash drives). Heck, if you look at some of the formats they "compressed" (at like 4% max) you almost might as well use TAR.
Comment forecast: Bits of genius surrounded by a sea of mediocrity.
Well, since the dawn of ages I saw ZIP v ARJ, bzip2 vs gzip.
What's the point? Same programs compressing same data on a different computer.
I use gzip for big files (takes less time)
I use bzip2 for small files (compresses better)
I use zip to send data to Windows people
I really, really miss ARJ32. It was my favorite on DOS Days.
These days, file compression is pretty much only used for large downloads. In those instances, you really have to use either gzip, pkzip, or bzip2 format, so that your users can extract the file.
Yes, having a good compression algorithm is nice, but unless you can get it to partially supplant zip, you'll never make much money off it. Also, most things these days don't need to be compressed. Video and audio are already encoded with lossy compression, web pages are so full of crap that compressing them is pointless, and hard drives are big enough. Although, I haven't seen any research lately about whether compression is useful for entire filesystems to reduce the bottleneck from hard drives. Still, I suspect that it is not worth the effort.
TAR is not a compressor.
It seems odd that they didn't include executables/dlls in the comparison (where maxmumcompression.com does). I also find it odd that they are compressing items that normally don't compress very well with most data compression programs (divx/mpegs/jpegs/etc). I'm guessing this is why 7-zip ranked a bit lower than most.
I did some comparison last year, and found 7-zip to do the best job for what I needed (great compression ratio without requiring days to complete). It also doesn't take into account the network speed at which the file is going to be transmitted. I use 7-zipfor pushing application updates and such to remote offices (most over 384k/768k WAN links). Compressing w/ 7-zip has saved users quite a bit of time compared to winrar or winzip.
I would definitely recommend checking out maximumcompression.com (As others have, as well) over this article. It goes into a lot greater detail.
> UM, yeah, the dataset includes WAV files. Try flac [sourceforge.net].
...
> Then you will have exhausted a little more of the compression programs available.
You are aware that all the tools tested are general purpose compressors, and FLAC is not, aren't you?
Otherwise, you would also have to talk about Wavepack, Monkey Audio, Shorten and others.
And those are only the loseless audio codecs. What about lossy codecs?
What about all those different formats for pictures? They compress data as well.
And what about the different video codecs?
It's closed sourced and proprietary though. Someone needs to make an open-source RAR compressor - the problem is you can't use the official code to do that (as it's specifically in the licence), but you could use unrarlib as a basis...
I agree with you on the importance of this article but
Yes, I know it is better than gzip, and it is also supported everywhere. But it is much worst than the "modern" compression algorithms.
I have been using LZMA for some time now for things I need to store longer, and getting good results. It is not on the list, but should give results a little bit better than RAR. Too bad it is only fast when you have a lot of memory.
For short/medium time storage, I use bzip2. Online compression, gzip (zlib), of course.
morcego
By default, Stuffit won't even bother to compress MP3 files. That's what it shows an increase in file size (for the archive headers) and why it is the fastest throughput (it's not trying to compress). If you change the option, the results will be different.
I imagine some other codecs also have similar options for specific file types.
Another problem is that gzip has compression levels ranging from -1 (fast, minimal) to -9 (slow, maximal), and I suspect he only tested the default, which is either -6 or -7.
I wouldn't be surprised if many of the other compression tools have similar options.
--Pat
Here's a scatterplot of resulting file sizes and compression times from the text compression data (lower is better), and as my luck would have it, bzip2 is really the only one that's out of line - i.e. the furthest from the pareto frontier. But then, looking at the same data with file sizes plotted in the range of [0.0, 1.0], it seems like there's a major case of diminishing returns for the expensive algorithms anyways. If you care at all about compression time, good ol' gzip is still a pretty decent choice!
Use ISO 8601 dates [YYYY-MM-DD]
7zip's default compression is LZMA, FYI.
"Build a man a fire warm him for a day, set a man on fire and warm him for the rest of his life."
for general purpose lossless compression. Most modern compression utilities out there mix and match the same algorithms which do the same thing.
With the exception of compressors that use arithmetic coding (which has patents out the wazoo covering just about every form of it), virtually all compressors use some form of Huffman compression. In addition, many use some form of LZW compression before executing the Huffman compression. That is pretty much it for general purpose compression.
Of course, if you know the nature of the data you are compressing you can come up with a much better compression scheme.
For instance, with XML, if you have a schema handy, you can do some really heavy optimization since the receiving side of the data probably already has the schema handy which means you don't need to bother sending some sort of compression table for the tags, attributes, element names, etc.
Likewise, with FAX machines, run length encoding is used heavily because of all the sequential white space that is indicative of most fax documents. Run length encoding of white space can also be useful in XML documents that are pretty printed.
Most compression algorithms that are very expensive to compress are usually pretty cheap to decompress. If you are providing a file for millions of people to download, it doesn't matter if it takes 5 days to compress the file if it still only takes 30 seconds for a user to decompress it. However, when doing peer to peer communication with rapidly generated data, you need the compression to be fast if you use any at all.
Nevertheless, most generaly purpose lossless compression formats are more or less clones of each other once you get down to analyzing what algorithms they use and how they are used.
RAR has recovery records (settable percentage of each archive dedicated to ECC, default off) and recovery volumes (dedicated files with PAR-like recovery capabilities). "Keep broken files" can be used to extract from broken or truncated archives.
With Arithmic encoding however you'd encode each character according to the exact probability it has of occuring and write it as a fractional number between 0 and 1. For example, if you want to encode an "A", you'd pick a number between 0.0 and 0.2 (the lower 20% of our number); if you want to encode a "B", you'd use a number between 0.2 and 1.0 (the upper 80% of our number).
What you keep track of during encoding is the upper and lower bound of this number. So, when I want to encode the first "A", my lower bound is 0.0 and the upper bound is 0.2. The next character to encode is "B". We already now the range we can pick from is 0.0 - 0.2, but to encode a "B" we need to pick a number in the upper 80% of this range, so 0.04 to 0.20 (picking a number between 0.0 and 0.04 would encode another "A").
The next letter, another "B", would use a range 0.072 - 0.200. The 3rd "B" would narrow the range to 0.0976-0.2000. The 4th "B" narrows it to 0.11808 - 0.2000.
At some point, the upper and lower bound will have a few most significant digits in common that cannot change anymore. When this occurs, you can start writing these out as part of your compressed stream. For example, when we encode the 6th character (the 2nd "A"), the range becomes 0.118080 - 0.134464. The first two digits (0.1) can't change anymore now, so we can write them out, and just continue narrowing the range further for subsequent data to be compressed.
At some point, there'll be no more data to be compressed, and you then just pick a number (as convenient as possible) between the upper and lower bound you have established, write it out and end the stream. The process is the same when doing this with binary floating point numbers.