A Look at Data Compression
With the new year fast approaching, many of us look to the unenviable task of backing up last year's data to make room for more of the same. That being said, rojakpot has taken a look at some of the data compression programs available and has a few insights that may help when looking for the best fit. From the article: "The best compressor of the aggregated fileset was, unsurprisingly, WinRK. It saved over 54MB more than its nearest competitor - Squeez. But both Squeez and SBC Archiver did very well, compared to the other compressors. The worst compressors were gzip and WinZip. Both compressors failed to save even 200MB of space in the aggregated results."
No talk of the speed of compression/decompression?
Bradley Holt
For the most part, the summary of the article seems to be that the more time a compression application takes to compress your files, the smaller your files will be afterwards.
The one surprising thing I found in the article was that two virtually unknown contenders - WinRK and Squeez - did so well. One obvious follow-up question, disappointingly left unanswered, is how more well-known applications such as WinZip or WinRAR (which have a more mass-appeal audience) stack up against them with their configurable higher-compression options.
I'm a big tall mofo.
I always wanted to know how Compressia ( http://www.compressia.com/ ) works. It uses some form of distance coding, but information about it is quite rare.
This sig does not contain any SCO code.
Just downloaded it and I find that it compresses significantly better than winrar when both are set to maximum. Decompress is quite slow. I use it to compress a small collection of utilities.
Humor from a Genetically Molested Mind
but I was surprised to see that the reviewer was using XP Professional Service Pack 1. I actually had to double check the review date to make sure that I wasn't reading an old article.
I personally use 7-Zip. It doesn't perform the best but it is free software and it includes a command line component that is nice for shell scripts.
Is it just me or is that site really difficult to navigate amongst all those ads? Speed of compression would have been nice too.
It's a real shame that 1) the guy only did Windows archivers, and 2) SBC Archiver is no longer in active development, closed source, and Windows-only.
Disinfect the GNU General Public Virus!
WinRK may have won only because he used the fast compression setting on all the compressors he tested. Results for default setting and best compression settings are TBA.
There are some amazing compression programs out there; the trouble is they tend to take a while and consume lots of memory. PAQ gives some impressive results, but the latest benchmark figures are regularly improving. Let's not forget that compression is no good unless it is integrated into a usable tool. 7-zip seems to be the new archiver on the block at the moment. A closely related, but different, set of tools are the archivers, of which there are lots, with many older formats still not supported by open source tools.
A key benefit of the PKZIP and tarball formats is that they will be accessible for decades or hundreds of years. These formats are open (non-proprietary), widely implemented, and supported by free (as in freedom) software.
The same can't be said for WinRK. Therefore, if you want access to your data for a long period of time, you should carefully consider whether the format will still be accessible.
I did a short review and benchmarking of unix compressors people might be interested in.
Mouse powered Chips, Open source Processors and Lego
I'd like to see an article about exe compressors done like this.
There are some interesting beasts out there like UPX, which as far as I remember does quite respectable packing on the win32 platform.
the WinRK archive compressor tested here seems to achieve quite amazing results at the cost of speed... a lot of speed.
Venlig Hilsen / Regards
John Hinge - shayera /
"Buffy I love you... Please God No!" S
Why mess around with compressing individual files? DiskDoubler is definitely the way to go. Hell, I even have it set up to automagically compress files I haven't used in a week.
It's running perfectly fine on my Mac IIci.
Know what I like about atheists? I've yet to meet one that believes God is on their side.
Speed aside [and speed would be a huge concern if you insisted on compression], I just don't understand the desire for compression in the first place.
As the administrator, your fundamental obligation is data integrity. If you compress, and if the compressed file store is damaged [especially if the header information on a compressed file - or files - is damaged], then you will tend to lose ALL of your data.
On the other hand, if your file store is ASCII/ANSI text, then even if file headers are damaged, you can still read the raw disk sectors and recover most of your data [might take a while, but at least it's theoretically do-able].
In this day and age, when magnetic storage is like $0.50 to $0.75 per GIGABYTE, I just can't fathom why a responsible admin would risk the possible data corruption that could come with compression.
Looks like the site got slashdotted while I was in the middle of reading it. What file types were used as input? Clearly compression algorithms differ on the file types that they work best on. Also, a better metric would probably have been space/time, rather than just using time. Also, I know that zlib, for example, allows you to choose the compression level - was this explored at all?
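On the compression-level question, here is a minimal sketch of what "choosing the level" looks like with zlib, using Python's standard `zlib` module. The sample input is a made-up, highly repetitive log and is not the article's fileset, so the sizes and timings it prints only illustrate the trade-off; they say nothing about how the reviewed archivers would rank.

```python
import time
import zlib


def compare_levels(data: bytes, levels=(1, 6, 9)):
    """Return {level: (compressed_size, seconds)} for each zlib level."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        packed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        # Lossless compression must restore the input exactly.
        assert zlib.decompress(packed) == data
        results[level] = (len(packed), elapsed)
    return results


# Hypothetical sample input: repetitive log-like text (an assumption,
# chosen only because it is easy to generate deterministically).
sample = b"".join(
    b"2005-12-30 12:00:%02d GET /index.html 200\n" % (i % 60)
    for i in range(50000)
)

for level, (size, seconds) in compare_levels(sample).items():
    print(f"level {level}: {size} bytes in {seconds:.4f}s")
```

Dividing bytes saved by seconds spent gives one possible space/time figure of the kind asked about above.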
Also, do any of you know any lossless algorithms for media (movies, images, music, etc)? Most algorithms perform poorly in this area, but I thought that perhaps there were some specifically designed for this.
Anyone else having trouble viewing the site? It comes up utterly blank in IE with all patches on fully updated XP. View sources shows everything you'd expect to see but it's rendering blank. ?? (useless "don't use IE" type comments will be modded flamebait)
I generally prefer gzip/7-Zip.
The reasoning is simple, I can use the results cross platform without special costly software. A few extra bytes of space is secondary.
For many files, I also find buying a larger disk a cheaper option than spending hours compressing/uncompressing files. So I generally only compress files I don't think I will need that are very compressible.
Why mess around with compressing individual files? DiskDoubler is definitely the way to go.
And NTFS of Windows 2000 or later includes technology similar to DiskDoubler.
how does it perform against the rest?
http://rzip.samba.org/
Not everyone cares about how great their data integrity is with compressed files. They just care about compressing a few files to send to someone over IM faster than if they were sending them uncompressed. When you have to tell someone to wait 40 minutes for your file to finish sending because it's uncompressed, speed/compression becomes the deciding factor.
>backing up last years data to make room for more of the same. If it's really more of the same, using delta compression on new data using last-years data would work nicely.
I did a small test of the common linux compression commands back in 2000. Here are the results: (note that some of the command options have changed since then, for example tar now uses -j for bzip2)
THE COMPRESSION UTILITY TEST
Compression utilities tested: zip, rar, gzip, bzip2, tgz (tar with the z flag invoked). Each test was run three times. For each completed test the system was rebooted. Hardware used: Pentium II 350MHz, 256MB RAM. OS: Linux Mandrake 7.1. The system load was minimal. The "time" command was used to measure the elapsed time, the "ls -l" command was used to determine the size, and a script was used to determine the total size of the gzip files.
Note: gzip packs individual files recursively. For bzip2, the command invoked was tar -cvIf file.bz2 dir (in GNU tar, the I flag invokes bzip2). For tgz, tar with the z flag invokes gzip.
TEST 1 - compressing multiple files
total size of the dir: 91.621.857 bytes, total files: 3540 (most of these files are ascii and html, but there are a few gifs and jpgs too.)
default compression settings:
tool    time elapsed   MB/s   compressed to (bytes)   time elapsed uncompressing
gzip    1m.44s         0.88   24.884.124              37s
zip     1m.10s         1.3    25.813.958              41s
rar     3m.25s         0.44   20.784.489              48s
bzip2   3m.54s         0.39   17.399.561              1m.17s
tgz     1m.09s         1.32   23.821.446              36s
maximum compression settings:
tool    time elapsed   MB/s   compressed to (bytes)   time elapsed uncompressing
gzip    2m.00s         0.76   24.670.516              36s
zip     1m.42s         0.89   25.593.448              39s
rar     10m.12s        0.14   18.698.710              1m.02s
bzip2   n/a (the compression rate can not be specified through tar, is the maximum default?)
tgz     n/a (the compression rate can not be specified through tar, is the maximum default?)
CONCLUSION: use tgz (tar with the z flag) if time is an issue, otherwise use bzip2 (tar with the I flag)
TEST 2 - compressing 1 ascii file
size of the ascii file: 53.819.786 bytes (the file was taken out of my mailbox)
default compression settings:
tool    time elapsed   MB/s   compressed to (bytes)   time elapsed uncompressing
gzip    42s            1.28   15.560.144              15s
zip     41s            1.31   15.560.261              17s
rar     1m.57s         0.45   11.507.387              17s
bzip2   1m.58s         0.45   10.788.502              39s
tgz     54s            0.99   15.560.907              8s
maximum compression settings:
tool    time elapsed   MB/s   compressed to (bytes)   time elapsed uncompressing
gzip    44s            1.22   15.486.842              15s
zip     45s            1.19   15.486.959              16s
rar     6m.40s         0.08   9.582.810
Since WinZip does not handle .7z, .ace or .rar files, it has lost much of its appeal for me. With my old serial no longer working, I now have absolutely no reason to use it. Now when I need a compressor for Windows I choose WinAce & 7-Zip. Between those two programs, I can de-/compress just about any format you're likely to encounter online.
/dev/random
It is not only the space, but also the speed. Once the data is compressed, backing up the compressed data takes less time. If you compress and then back up, you have to compare the compression time to the transfer time. Now, if you compress once, then back up, then copy the backup, you compare the compression time to 2X the transfer time.
Outside of the pure speed issue, what about media swapping? Once you exceed the media capacity (I'm talking removable media), the media needs to be swapped, which not only takes time but most likely requires human interaction. If you have a 30GB tape but 40GB to back up, the tape needs to be swapped. This eliminates the "start the backup, go home" backup process.
Fight Spammers!
I always compress my compressed files over and over until I achieve absolute 0Kb.
I carry all data of my entire serverfarm like that on a 128Mb USB-stick.
I can't believe TFA made /. The only thing more defective than the benchmark data set (Hint: who cares how much a generic compressor can save on JPEGs?) is the absolutely hilarious part where the author just took "fastest" for each compressor and then tried to compare the compression. Indeed, StuffIt did what I consider the only sensible thing for "fastest" in an archiver, which is to just not even try to compress content that is unlikely to get significant savings. Oddly, the list for fastest compression is almost exactly the reverse of the list for best compression on every test. The "efficiency" is a metric that illuminates nothing. An ROC plot of rate vs compression for each test would have been a good idea; better would be to build ROC curves for each compressor, but I don't see that happening anytime soon.
I wouldn't try to draw any conclusions from this "study". Given the methodology, I wouldn't wait with bated breath for parts two and three of the study, where the author actually promises to try to set up the compressors for reasonable compression, either.
Ouch.
Since the original site seems to be really slow and split into a billion pages, those who aren't aware of it might want to look at MaximumCompression since it has tests for several file formats and also has a multiple file compression test that is sorted by efficiency. A program called SBC does the best, but the much more common WinRAR comes in a respectable third.
http://www.popularculturegaming.com -- my blog about the culture of videogame players
The "related links" box for this story is horribly broken. Instead of being links related to the story, it's a bunch of advertising. I'm sure this was a mistake or a bug in slashcode itself.
I've searched the FAQ, but I can't figure out how to contact slashdot admins. Does anyone know an email address or telephone number I can use to contact them about this serious problem? I'm sure they'll want to fix it as quickly as possible.
http://rzip.samba.org/ is a phenomenal compressor. It does much better than bzip2 or rar on large files and is open source.
Interesting that the article talks about compression ratio and compression speed. When considering compression, decompression time is extremely relevant. I don't mind waiting longer to compress the fileset, as long as decompression is fast. I normally compress once, and then decompress many times (media files and games for example).
rzip wasn't reviewed but it uses hashing to quickly look for previously seen data. I think it's great. A tutorial with it and other linux compression tools is here. The tutorial also has graphs that make it easy to see the trade offs between speed and compression ratio, as well as advice on which compressors increase effective bandwidth the most for your CPU and network speed.
If you're familiar with Usenet, you've probably encountered PAR files from time to time. A PAR file is a parity file which can be used to reconstruct lost data. It works sort of like a RAID, but with files as the units instead of disks.
Let's say you have a 200MB file to send. You could just send the 200MB file, with no guarantees that it will reach the destination uncorrupted. Or, you could use a compression program and bring it down to 100MB. In this case, even if you lost the first transfer, you could transfer it a second time. Then we look at PAR. You compress the 200MB file into ten 10MB files. Then, you could include 10% parity - if any of your files is bad, you'd be able to reconstruct it with the parity file. With only 110MB of transfer. PAR2 goes even further by breaking down each file into smaller units.
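The RAID-like idea can be sketched with a single XOR parity block. This is a deliberate simplification: real PAR/PAR2 tools use Reed-Solomon codes and can rebuild more than one missing part, whereas the sketch below can rebuild exactly one. The function names and the ten tiny "parts" are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(parts):
    """Build one parity block covering all parts (like a single parity disk)."""
    return xor_blocks(parts)


def rebuild_missing(surviving_parts, parity):
    """Recover the one missing part from the survivors plus the parity block."""
    return xor_blocks(surviving_parts + [parity])


# Ten equal-size parts standing in for the ten 10MB files in the example.
parts = [bytes([i]) * 8 for i in range(1, 11)]
parity = make_parity(parts)

lost = 3  # pretend part number 3 arrived corrupted
survivors = parts[:lost] + parts[lost + 1:]
assert rebuild_missing(survivors, parity) == parts[lost]
```

With ten parts and one parity block of the same size, the overhead is the 10% mentioned above.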
Besides transfer times and correction for network transfers, compression can also increase speeds of transfer to mediums. If you have an LTO tape drive that can only write to tape at 20MB/sec, you'll only ever get 20MB/sec. Add compression to the drive, and you could theoretically get 40MB/sec to tape with 2:1 compression. That means faster backups, and faster restores. On-board compression in the drives takes all the load off the CPU - but even if you use the CPU for it, they're fast enough to handle it.
Not to mention, it takes a lot less tape to make compressed backups. I don't know what world you live in, but in mine, I don't have unlimited slots in the library and I don't want to swap tapes twice a day. Handling tapes is detrimental to their lives; you really want to touch them as little as possible.
Data corruption isn't caused by compression. If it's going to happen, it'll happen regardless. While your point is true that it MAY be more difficult to recover from a corrupt file, that's not the right methodology. If your backups are that valuable, you'd make multiple copies - plain and simple.
I can't fathom why a responsible and well informed admin would avoid compression.
- It's not the Macs I hate. It's Digg users. -
Those are some pretty impressive compression ratios, but how does rzip do speed-wise? Is it faster, slower, or about the same as bzip2?
Regardless of how fast it is, it looks like it's worth considering if you have large files to compress. Thanks for pointing it out--I'll give it a try next time I make backups.
Is there any mention made about unicode support? I know that WinZip is out of the question for me because I can't compress anything with Chinese filenames with it. They'll either not work at all, or become compressed but the filenames will turn into garbage. Even though the data stays intact, it doesn't help much if it's a binary and has no intelligible filename.
I've been using 7-Zip for this reason, and also because it compresses well while also working on Windows and Linux.
Why use JPEG or PNG when you can just use .BMP files?
Patrick Doyle
I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
A suitable level of paranoia would suggest that it would be good to decompress the compressed files and verify that they produce the identical dataset. I did not see this step in the overview.
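A minimal sketch of that verification step, assuming gzip as the compressor and SHA-256 as the comparison (both are stand-ins; the article's tools would need their own decompress call). The helper names are hypothetical.

```python
import gzip
import hashlib


def sha256(data: bytes) -> str:
    """Hex digest used to compare the original with the restored copy."""
    return hashlib.sha256(data).hexdigest()


def compress_and_verify(original: bytes) -> bytes:
    """Compress, decompress again, and refuse to return an archive that
    does not reproduce the original byte for byte."""
    packed = gzip.compress(original)
    restored = gzip.decompress(packed)
    if sha256(restored) != sha256(original):
        raise ValueError("round trip changed the data")
    return packed


archive = compress_and_verify(b"backup me " * 1000)
print(len(archive), "bytes, verified")
```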
-- Don't believe everything you read, hear or think
All I see is ads. I think I found a paragraph that looked like it may have been the article, but every other word was underlined with an ad-link so I didn't think that was it..
- It's not the Macs I hate. It's Digg users. -
It's interesting to note that Stuffit produces worthwhile compression of JPG images, something long thought to be impossible.
I'd heard the makers of Stuffit were claiming this, but I was sceptical; it's good to see independent confirmation.
Quidquid Latine dictum sit, altum videtur (anything said in Latin sounds important)
What about lzip? I've heard good things about this archiver but its homepage seems to have gone down. Here's the archive.org link:
http://web.archive.org/web/20041010014034/http://lzip.sourceforge.net/index.html
It's a crime that the submitter didn't mention this was with the fastest compression settings.
Proprietary, costs money...
I use ZipGenius - handles 20 compression formats including RAR, ACE, JAR, TAR, GZ, BZ, ARJ, CAB, LHA, LZH, RPM, 7-Zip, OpenOffice/StarOffice Zip files, UPX, etc.
You can encrypt files with one of four algorithms (CZIP, Blowfish, Twofish, Rijndael AES).
If you set an antivirus path in ZipGenius options, the program will prompt you to perform an AV scan before running the selected file.
It has an FTP client, TWAIN device image importing, file splitting, convert RAR into SFX, converts any Zip archive into an ISO image file, etc.
And it's totally free.
Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
They are testing 7-zip at the FAST setting, which does a poor job compared to the BEST setting.
Phillip W. Katz, better known as Phil Katz (November 3, 1962-April 14, 2000), was a computer programmer best-known as the author of PKZIP, a program for compressing files which ran under the PC operating system DOS.
http://en.wikipedia.org/wiki/Phil_Katz
The firewall is a service.
Disable it.
Simple as that.
I wouldn't say PKZIP is like opensource tarballs using gzip or bzip2.
Infozip comes to mind.
Also check out textfiles.com creator Jason Scott's BBS documentary if you haven't yet!
Compression tells the story of the PKWARE/SEA legal battle of the late 1980s and how a fight that broke out over something as simple as data compression resulted in waylaid lives and lost opportunity.
http://bbsdocumentary.com/
The extremely controversial debacle mentioned on Wikipedia is also discussed in Jason Scott's BBS documentary.
http://www.bbsdocumentary.com/
A while ago, Linux Journal had a great comparison of a lot of programs, with a lot of options, comparing speed and resulting size. If you want to know something about compression on Unix, go and look. Everything! It even convinced me to buy the magazine! (Yep, I start to sound like an ad). Anyway, check this link
I used to like this one: Archive Comparison Test, but unfortunately it hasn't seen updates since 2002 for general data compression. However, that's still in the post-WinRAR 3.00 era, and the Windows archiver summary explains a bit why WinRK may win here, but still not be too well-known. Good compression isn't everything -- one often has to keep the speed aspect in mind too. And when you've then picked an archiver with nice compression for the speed, you may start looking at the feature set. Again WinRK isn't state-of-the-art there. It's mostly a pure no frills compressor where you can ignore durations, especially for large archives. Not nearly "an archiver for everyone".
Personally, after a couple of years of testing things out (OK, make that a decade -- time flies), I believe RAR by far exceeds most archivers' features nowadays, and also hits the sweet spot of good compression at reasonably good speeds. I think RAR trumps WinZIP 10, 7-zip, bzip2, and all other common archivers you throw at it as far as features go, and does really well in the compression field for being such an all-rounder. It can decompress most common archive formats too. For a lower cost than WinZIP, while to me looking just as easy to use.
WinACE was once an archiver preferred by some over RAR, but it sort of died out due to a lack of updates, or at least a lagging behind by RAR's improvements. What once looked promising there now looks more like a rarely used RAR-wannabe to me.
7-zip is the one other archiver that has recently caught my attention because it's open source and generally compresses better than RAR, still at pretty good speeds. However, it's nowhere near RAR's feature set and still lacks pretty large chunks of features that are important to me, but I keep an eye on it, I don't dislike it at all, and I can clearly understand why some prefer it. 7-zip has now overtaken bzip2 (which in turn overtook gzip) as my favorite open source archiver, and its cross-platform support is looking better these days with OS X, Debian, Fedora, and Gentoo support, although unofficial, directly from its home page.
Beware: In C++, your friends can see your privates!
If your data is that valuable, compressing makes it more likely to lose it.
Thanks - I was getting a little lonely there.
I think part of the problem is that most /.-ers believe that
But, of course, pr0n has no inherent integrity, therefore it seems to me that maybe the concept of data integrity is essentially meaningless to the average
It's rather pointless to compare compressing jpegs between gzip and anything else, because jpeg already entropy-codes (with Huffman coding) the blocks that make up the image.
Also for a lot of applications, compression speed is not important, decompression speed is. If you're distributing software, it's not that much of a problem if it takes a lot of time to compress, but if the install takes ages because the decompression is too slow it does matter.
This is sad. Over and over slashdot is posting stories with nothing more than some lame tech review and dozens of ads. I really believe people are generating sites with crap technical content, packing them with ads, and submitting to slashdot hoping to win the impression/click lottery.
Please, editors, check the sites out first. If it's 90% ads and impossible to navigate without clicking ads accidentally, it's just some loser's cash-grab site.
Please try l33tZip, the *BEST* compression software available. We have taken the best settings of WinRAR and changed its name to "FAST". OMFGWTFBBQ best invention ever!!!111
If I can do it, its probably not worth doing... probably
Yeah, in that same vein, how many (if any) of these compressors will take advantage of my shiny new Athlon 64 X2? It's amazing to see the difference in compression times with XVID or the new DiVX - but I have yet to see a compression program use two processors. That said, I usually use 7-zip as my main compression program. Flexible, compatible, free...
"...Well, there's egg and bacon; egg sausage and bacon; egg and spam; egg bacon and spam; egg bacon sausage and spam..."
Back in the days of Doublespace, I used to not compress because bad sectors were common and it was easier to recover parts of a file when you are dealing with an uncompressed file and the compression mechanism wasn't good at dealing with keeping the rest of the compressed disk image valid when parts of it got corrupted. Now I always compress NTFS volumes.
Ignoring that this article is just one big advertisement:
900 MB of text data. Precisely 944,156,137 bytes of text files. AMD 1800 w/1.5 gig of RAM. Cable connection. My objective is often getting the data to someone else.
Comparisons:
7Zip = 5:24 to compress, :55 to decompress, file size 188,380,358 bytes.
WinRK = 18:35 to compress, 3:48 to decompress, file size 132,097,001 bytes.
Note that this is one of the fastest settings on 7Zip; I didn't have time to see if 7zip could beat it in size.
That's a difference of about 50 meg, which may seem like a lot, but imagine if you just wanted to send these 900 megs of text files to someone in the quickest amount of time. With WinRK, immediately add the 18:35 and 3:48 to get 22:23. 7Zip is 5:24 plus the :55 to get 6:19. That's a difference of 22:23 minus 6:19 to get 16:04 that the 50 meg needs to be sent in. Or 964 seconds.
Actual amount to transfer: 188,380,358 - 132,097,001 = 56,283,357 bytes.
I have a 256k upload, which is about 32K per second. 32KB/s for 964 seconds is about 30.8 megs. So while 7zip isn't quite as good in a one peer to one peer transfer with say, a cable modem, it could be argued that the excessive processor time needed to compress and (especially!) decompress the file is ridiculous when compared to the saved space.
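The arithmetic in this comparison can be checked with a short sketch. It uses only the times and sizes quoted above; the one assumption is that "32K per second" means 32 KiB/s (32 × 1024 bytes), and the function names are made up.

```python
def to_seconds(mmss: str) -> int:
    """Convert 'M:SS' or ':SS' into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes or 0) * 60 + int(seconds)


def total_seconds(compress, decompress, size_bytes, link_bytes_per_s):
    """Compress + transfer + decompress, end to end."""
    return (to_seconds(compress) + to_seconds(decompress)
            + size_bytes / link_bytes_per_s)


LINK = 32 * 1024  # assumption: "32K per second" read as 32 KiB/s

seven_zip = total_seconds("5:24", ":55", 188_380_358, LINK)
winrk = total_seconds("18:35", "3:48", 132_097_001, LINK)

# Extra CPU time WinRK costs, and the bytes it saves in exchange.
cpu_penalty = (to_seconds("18:35") + to_seconds("3:48")) \
    - (to_seconds("5:24") + to_seconds(":55"))
bytes_saved = 188_380_358 - 132_097_001

# Link speed at which the two choices take the same total time.
break_even = bytes_saved / cpu_penalty

print(f"7Zip total:  {seven_zip:.0f}s")
print(f"WinRK total: {winrk:.0f}s")
print(f"break-even link speed: {break_even / 1024:.1f} KiB/s")
```

Under that assumption the slower compressor still comes out ahead end to end on this link, and the faster one wins once the connection is quicker than the break-even rate. That fits the point about the processor time being hard to justify on anything faster than a slow uplink.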
Why is it bad?
Because data backups corrupt, but often they do not corrupt all the way.
Which leaves the possibility open for partial recovery. Especially if only part of the data is needed, this can be "good enough".
However if the entire data set is compressed and a part of it corrupts it can make it very difficult to recover the data that is still uncorrupted. In this case think of data compression as a low-grade form of data encryption.
That's not to say that you CAN'T compress data in a safe way, it's just that you have to be very smart about it.
Case in point.
Let's say you're a Unix/Linux user. You have nice choices between tar, dd, dump, cpio, and other data copying utilities, each with their own strengths and weaknesses.
Then you have different compression techniques to choose from: bzip2, zip, gzip, rar, etc. etc.
So let's say you choose to use gzip and tar, which are good old standbys that do a good job and are almost universally recognized and supported.
However, the directory tree you want to back up is bigger than the medium you're backing up to. Say you have an 8 gig directory tree and you're backing up to 650 meg cdroms.
So the kneejerk response is to:
tar czf - source | split -b 650m - backup-
So that will create a bunch of backup-aa, backup-ab, backup-ac, etc etc files that are 650 megs each which is a nice size for backing up to cdroms.
However, if one of the cdroms is burned incorrectly or gets lost, then when you go:
cat backup-* | tar xzf -
Then you have hosed all your data.
So instead of compressing along with the tarball and THEN splitting, you make the tarball, split it, and then compress the pieces. Use smaller pieces to make them easier to handle; since the data now compresses at different rates, you can group the compressed pieces so that they fit neatly into cdrom-sized nuggets.
tar cf - source | split -b 50m - backup-
then do the gzip and copy with a simple script or whatnot.
Now if you have a missing cdrom or part of the cdrom is toast...
you gunzip all the remaining files...
cat backup-* | tar xf -
it will bitch when it comes across a partially-there file and exit.
Then with some manual work you remove the backup files that worked out so far, then finish up with the restore. Tar will complain that it's missing the starting point for some data and ignore that, but when it comes to a file header it will happily finish up with what it has left.
You still lose some data, but the rest of the data is easily recoverable.
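The difference between the two orderings can be sketched in a few lines. This is an analogy on raw bytes using Python's `gzip` module, not a simulation of tar itself: tar resynchronises at file headers, while the sketch simply counts bytes. The payload and the five "discs" are hypothetical.

```python
import gzip
import zlib


def split_into(data: bytes, count: int):
    """Cut data into `count` roughly equal pieces (stand-ins for discs)."""
    size = -(-len(data) // count)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def whole_stream_survives(payload: bytes, lost: int, count: int = 5) -> bool:
    """Compress first, split second, lose one piece, then try to restore."""
    pieces = split_into(gzip.compress(payload), count)
    damaged = b"".join(p for i, p in enumerate(pieces) if i != lost)
    try:
        return gzip.decompress(damaged) == payload
    except (OSError, EOFError, zlib.error):
        return False


def per_piece_recovered(payload: bytes, lost: int, count: int = 5) -> int:
    """Split first, compress each piece, lose one, count recoverable bytes."""
    packed = [gzip.compress(p) for p in split_into(payload, count)]
    return sum(len(gzip.decompress(p))
               for i, p in enumerate(packed) if i != lost)


# Hypothetical payload standing in for the tarball.
payload = b"".join(b"record %d: %d\n" % (i, i * i) for i in range(20000))

print("compress-then-split restores everything:",
      whole_stream_survives(payload, lost=1))
print("split-then-compress bytes recovered:",
      per_piece_recovered(payload, lost=1), "of", len(payload))
```

In the second ordering each piece is an independent gzip stream, so a lost disc costs only the bytes that were on it.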
Also keep in mind that different data compresses differently.
If you have uncompressed audio data it makes more sense to compress it to ogg, mp3, or flac before backing it up. Also with images it makes sense to aggressively compress them to png (lossless) or jpeg (lossy) before backing them up. You'll get much more efficient compression in sizes. (However, seeing that most of us get our information of this type off the net, it's probably already compressed.)
Please. Explain.
Look - I don't have an "explanation".
And I'm even receptive to two of the pro-compression arguments:
Nevertheless, I have had to deal with corrupted files, and have had to write my own file-recovery software to examine and alter bad files [at the byte-level], and I can tell you that RECONSTRUCTING A CORRUPTED FILE BY HAND IS AN UNMITIGATED DISASTER - EVEN IF YOU HAVE ACCESS TO THE SOURCE CODE THAT CREATED THE FILE IN THE FIRST PLACE. Trust me - you do not ever EVER EVER want to be handed the task of re-creating a corrupted file - even if you have access to source code. 'Cause if you are given that assignment, you can just about kiss goodbye the next several weeks of your life.
And if the corrupted file was compressed with some weird-ass compression scheme [for which you may or may not have the source code], then hell - it might take you YEARS to figure out what happened. Maybe even forever.
I know not many of you actually RTFA, but that article was so damned annoying. There's a table in there - think it's to compare compression schemes? nope - it's for processors. There are red links.. article related? Nope - ad links. Blue underlined links - yup, for more ads.
What a steaming pile of shit. Happy new year.
Lossless data compression is a pretty well studied subject. Shannon started it back in the 40's and plenty of research has gone into it since.
There are basically three ways to do lossless compression: Huffman, Arithmetic, and LZW. Technically Huffman can achieve the best of three, however it's generally the worst because of implementation issues (it would take a lot of processing to do rigorous Huffman encoding).
Arithmetic coding is generally better but is difficult to implement. I think IBM is the company who actually sells an arithmetic coder (I could be wrong though).
LZW is by far the best of the three (you can read online how it works), but alas it is patented and anyone who gives away free copies of it will get sued.
I know for a fact that gzip uses Huffman, which would explain its lackluster performance. I haven't researched it further, but I would not be surprised if the three proprietary compression programs which "won" this review use LZW. I also wouldn't be surprised if they pay a good amount to LZW's patent holders (Unisys I think).
I'd be interested to see how gzip performs on its "maximum compression" setting. Like I said earlier, Huffman can achieve the theoretical limit on compression where LZW cannot.
Fast Federal Court and I.T.C. updates
Using gzip to back up terabytes of files sounds like a very dumb idea, since gzip has no error recovery mechanisms.
Software piracy is victimless theft.
I knew I had seen this story before but it wasn't here. This article was up on Digg three days ago--with only three Diggs to its name (at the time of this writing), but it's front page news here? Interesting to say the least...
I predict that this Digg will become frontpage Slashdot news shortly. It was quite popular (914 diggs so far) and it's hit the three-day mark...
I know, this is all so OT, but it's no worse than whining about duplicate postings here...
Oh the irony here is just too much to take without laughing! My comment gets hammered with the REDUNDANT pummel when I point out that /. is being REDUNDANT in posting old Diggs? Man, it just doesn't get any better than this to make a point.
Moderators: did you catch the not-so-subtle play I made here by quoting ALL of my original message? In case you didn't, I'm being REDUNDANTLY sarcastic...
Enjoy!
There are plenty of good compression algorithms out there but most of them are covered by patents. There have been many cases where a small company comes out with some cool new way of compressing stuff and is later told to pay royalties. It can be a real pain trying to decompress data in a few years when the company that made the decompressor is no longer in business.
Seriously, there are many freebie compression tools which weren't mentioned but which are in common enough use that they can be regarded as highly significant in the market, or which are simply SO good that they are likely to become significant. Zzip and SZip are big ones that didn't get mentioned.
Further, since speed is considered, it is unfair to list bzip2 without mentioning pbzip2 or bzip2smp (two parallel versions), as you'd obviously get a speed boost from non-sequential compression. Not sure what it does to the compression ratio, though.
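On the ratio question, a rough way to get a feel for it is to compress fixed-size chunks independently and concatenate the resulting streams, which as far as I understand is broadly the approach the parallel tools take. The Python sketch below is only an illustration of that idea using the standard bz2 module, not the actual pbzip2 or bzip2smp code; `compress_chunked`, the chunk size and the worker count are all assumptions of mine.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def compress_chunked(data: bytes, chunk_size: int = 900_000, workers: int = 4) -> bytes:
    """Compress each chunk as its own bzip2 stream and concatenate the streams."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Threads are used here on the assumption that the bz2 module releases
    # the interpreter lock while compressing.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        streams = list(pool.map(bz2.compress, chunks))
    return b"".join(streams)


if __name__ == "__main__":
    data = bytes(range(256)) * 20_000
    whole = bz2.compress(data)
    chunked = compress_chunked(data)
    # bz2.decompress accepts several concatenated streams.
    assert bz2.decompress(chunked) == data
    print("single stream:", len(whole), "bytes; chunked:", len(chunked), "bytes")
```

Since bzip2 already works block by block, I would expect the size difference to be small for chunks near the normal block size and to grow as chunks get smaller, but measure it on your own data.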
Finally, some forms of compression - notably Huffman compression - rely on the size of the compression table to determine how well you'll compress data. On a modern computer, where multiple gigs of RAM is no longer unusual, you could reasonably look at frequencies for 24-bit strings.
Most Huffman compression will use 8-bit frequency tables, a few will use 16-bit, because the memory requirements get big. Fast. Not only do you need to record the total frequency of the wordsize you're using, you also need to have space to build the encoding tree. Even then, the level of compression you'll get will only improve to a point. After that, longer words will produce worse compression or even inflation.
In the examples used - audio and video data - you will most likely want a 16-bit word for the audio and a 24-bit word for the video, because that reflects the nature of the data. 8-bit words on the frequency tables are going to be crap, because you're compressing random fragments of words, so artificially worsening the encoding tree you're going to build.
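Here is a small Python sketch of the word-size effect. It builds a plain Huffman code over 8-bit or 16-bit symbols and counts the encoded payload bits. It is a toy under stated assumptions: the function names are mine, the "audio" is synthetic, and it ignores the cost of storing the frequency table, which is exactly the thing that eventually makes longer words a losing proposition.

```python
import heapq
from collections import Counter


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap entries: (weight, unique tiebreak, symbols in this subtree).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        # Every symbol under the merged node gets one bit deeper.
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths


def huffman_payload_bits(data: bytes, word_bytes: int) -> int:
    """Encoded size in bits (table not included) using word_bytes-wide symbols."""
    usable = len(data) - len(data) % word_bytes
    words = [data[i:i + word_bytes] for i in range(0, usable, word_bytes)]
    freqs = Counter(words)
    lengths = huffman_code_lengths(freqs)
    return sum(freqs[s] * lengths[s] for s in freqs)


if __name__ == "__main__":
    # Synthetic "16-bit samples": two distinct values, alternating.
    samples = b"\x00\x01\x00\x02" * 1000
    print("8-bit words: ", huffman_payload_bits(samples, 1), "bits")
    print("16-bit words:", huffman_payload_bits(samples, 2), "bits")
```

With 8-bit words the coder sees three byte values and cannot tell that they pair up; with 16-bit words it sees two symbols and spends one bit on each.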
I have two points here. First is that by picking the right (or wrong!) parameters for the data, you can always rig a benchmark. My second point is that you can often tailor an algorithm that would normally be worse than some other algorithm such that the worse of the two will outperform the default behavior of the better one.
Ideally, you'd use a form of arithmetic encoding, but that is so riddled with patents that although you could (in theory) develop a system which was numerically identical but did not infringe on the wording of the patents, most Open Source and low-cost vendors don't bother trying.
The secret, in compression, is not to use default algorithms if you can avoid it. (Ideally, this would be when the compression header stores enough information for you to tweak table sizes, etc, so that off-the-shelf decoders will work with your custom encoders.)
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Huffman coding and arithmetic coding are both entropy encoding algorithms. While perfectly fine compression algorithms in their own right, they're also commonly used to squeeze the last bits of entropy out of a data stream produced by another compression or transformation algorithm. Arithmetic coding suffers from chilling effects caused by IBM patents, and so isn't as commonly used as it might be. An unencumbered alternative is range encoding, which gives performance not too far off that of arithmetic coding. Range encoding and arithmetic coding are both variants of the same basic technique of entropy encoding. That said, the compression difference between Huffman coding and arithmetic coding is minimal. I think (though I'm not entirely sure) that entropy encoding might be a subset of a larger family of algorithms called Markov modelling.
LZW is a refinement of LZ78, which has other variants such as LZSS. It is a dictionary coding algorithm. Similarly, the DEFLATE algorithm is based on LZ77, another variant of dictionary coding. gzip uses DEFLATE, as do xZip and PNG. DEFLATE first compresses the stream with an LZ77 variant, and then compresses the resulting stream with Huffman coding to squeeze out some redundancy. LZW is no longer covered by patents, at least not here in Europe.
So what you wrote about Huffman coding, arithmetic coding and LZW was largely misinformed. There are two lossless methods: entropy encoding and dictionary coding, Huffman coding and arithmetic coding representing the former and LZW representing the latter. Some compression algorithms combine the two, DEFLATE being an example.
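If you want to see that gzip really is just DEFLATE in a wrapper, Python's zlib module can emit the same compressed stream raw, in a zlib wrapper, or in a gzip wrapper. This is only a sketch with names of my own choosing (`deflate_as`, the sample text), and it assumes the compressor writes the minimal 10-byte gzip header with no filename, which is what zlib does when left alone.

```python
import zlib


def deflate_as(data: bytes, wbits: int) -> bytes:
    """Compress with DEFLATE; wbits selects the container (-15 raw, 15 zlib, 31 gzip)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, wbits)
    return comp.compress(data) + comp.flush()


if __name__ == "__main__":
    text = b"the quick brown fox jumps over the lazy dog " * 500
    raw = deflate_as(text, -15)
    zl = deflate_as(text, 15)
    gz = deflate_as(text, 31)
    print("raw DEFLATE:", len(raw), "zlib:", len(zl), "gzip:", len(gz))

    # Strip the assumed 10-byte gzip header and the 8-byte CRC/size trailer,
    # then inflate what is left as a bare DEFLATE stream.
    inner = gz[10:-8]
    assert zlib.decompressobj(-15).decompress(inner) == text
```

The container only adds a header and a checksum; the LZ77 matching and Huffman coding happen inside the DEFLATE stream either way.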
I don't like trolls and mod against me if you like, but I'd prefer if you'd reply.
Here's a Coral cache link: http://www.rojakpot.com.nyud.net:8090/showarticle.aspx?artno=4&pgno=0 for those who do not bother to edit the URL manually :)
Pulsed Media Seedboxes
If you're paying $3 per megabyte for cellular data, you're getting screwed.
Mine is billed by the minute and comes out to between $0.17 and $0.54 per megabyte (assuming 100 kbps on average), depending on whether I'm using my plan minutes or overage minutes. And that's just during the day - it's free between 9:00 PM and 6:00 AM every day, all day Saturday and Sunday, and on holidays.
Visual IRC: Fast. Powerful. Free.
On current x86 hardware I get on average ~30MB/sec with lzop and ~50% compression when imaging HDD images[1].
;).
USD100 for LTO3? Sure looks like tapes are pretty expensive. I'd use tapes for legacy backups or where physical shock is an issue, or when you have tons of tapes and need automated loaders.
But removable hard drives seem a more attractive option for most cases nowadays (small to medium businesses). Large corporations can probably afford to be locked in to a particular tape technology, for the convenience of automated tape libraries.
LTO3=400GB storage at 10MB/sec (native) @ about USD100 per tape and USD2K for the cheapest drive.
SATA= 250GB storage at 40MB/sec (native+ average sequential transfer, 60MB peak) @ ~USD100 per drive.
SATA hotswap cage = USD100-USD200.
PATA+USB= 250GB storage at 20MB/sec (native+ average sequential transfer) @~USD100 per drive. PATA to USB enclosure USD30 to USD50 (for a decent one).
Plus PATA/SATA is less of a "locked-in" technology compared to LTO3 or other tape technologies.
With tape drives you have to deal with two main standards that could become obsolete. First = the tape standard (e.g. LTO3, DLT, DDS etc) , second = the tape drive interface standard (e.g. SCSI).
If you got an expensive LTO2 drive in 2003 you are stuck with 200GB native capacity media. Same goes for DLT, DDS etc. You'd have to pay for an expensive LTO3 drive, and then when LTO4 comes out, you're still stuck with LTO3 capacity unless you pay for a probably expensive LTO4 drive.
In contrast, with hard drives you just deal with the drive interface standard (e.g. SCSI, SATA, PATA). With HDDs each "tape" comes with its own drive.
If 800GB SATA drives become cheaply available, you can start using them with your existing backup systems.
In desperate situations you are more likely to be able to find servers/PCs that you can plug the "media" into and start restoring stuff. Whereas with tape you need a mucho expensive tape drive for each backup/restore point.
A decent backup/restore server with a decent drive cage can hold multiple drives and you can backup multiple machines to different drives, and on decent hardware you can get 40MB/sec for each backup (multiple gigabit interfaces, SMP/multi-core CPUs). If you have a server with a 4 drive cage and 2 x 1 gigabit NICs, you can easily get 4 x 40MB/sec backup/restore streams going from/to different targets.
For tape you'd need four expensive tape drives to do that.
You get sub-second random access. No need to wait 10 seconds to seek.
Last but not least: if I have to restore backups with The Boss/Customer breathing down my neck, I'd pick 40MB/sec over 10MB/sec. Perspective: 11 hours to read 400GB from an LTO3 tape vs 2.8 hours to sequentially read 400GB from SATA drives.
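For anyone who wants to plug in their own numbers, the restore-time arithmetic is just capacity divided by sustained rate. A throwaway Python sketch (the function name is mine; it assumes decimal units, 1 GB = 1000 MB, and a constant transfer rate with no seek or load time):

```python
def transfer_hours(capacity_gb: float, rate_mb_per_s: float) -> float:
    """Hours needed to read capacity_gb at a constant rate_mb_per_s."""
    seconds = capacity_gb * 1000 / rate_mb_per_s
    return seconds / 3600


if __name__ == "__main__":
    print(f"LTO3 at 10MB/sec: {transfer_hours(400, 10):.1f} hours")
    print(f"SATA at 40MB/sec: {transfer_hours(400, 40):.1f} hours")
```

That reproduces the roughly 11 hours versus 2.8 hours quoted above for 400GB.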
[1] first 131MB of a disk image
time dd if=drive.img bs=131072 count=1000 | lzop -c |wc -c
1000+0 records in 1000+0 records out 66784879
real 0m3.307s user 0m2.442s sys 0m0.842s
39MB/sec 1.96:1 compression
For the first 131MB of an uncompressed Linux kernel tarball (cached in RAM):
time dd if=linux-2.6.14.4.tar bs=131072 count=1000 | lzop -c |wc -c
1000+0 records in 1000+0 records out 46473494
real 0m2.483s user 0m1.660s sys 0m0.821s
52MB/sec 2.82:1 compression
time dd if=linux-2.6.14.4.tar bs=131072 count=1000 | gzip -c --fast |wc -c
1000+0 records in 1000+0 records out 36786087
real 0m5.965s user 0m5.297s sys 0m0.659s
22MB/sec 3.56:1 compression
time dd if=linux-2.6.14.4.tar bs=131072 count=1000 | gzip -c |wc -c
1000+0 records in 1000+0 records out 29724624
real 0m11.283s user 0m10.640s sys 0m0.615s
11.6MB/sec 4.41:1 compression
First 131MB of wave file (cached in RAM)
time dd if=somewave.wav bs=131072 count=1000 | lzop -c |wc -c
1000+0 records in 1000+0 records out 128273334
real 0m5.520s user 0m4.597s sys
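The MB/sec and ratio figures in the footnote come straight from the dd block count, the wc byte count and the "real" time. A small Python sketch for rechecking them (the function name is mine; it assumes 1 MB = 1,000,000 bytes and that the input is bs x count = 131072 x 1000 bytes):

```python
def summarize(bytes_in: int, bytes_out: int, seconds: float):
    """Return (throughput in MB/sec of input, compression ratio)."""
    return bytes_in / seconds / 1e6, bytes_in / bytes_out


if __name__ == "__main__":
    bytes_in = 131072 * 1000
    runs = {
        "lzop, disk image": (66784879, 3.307),
        "lzop, kernel tar": (46473494, 2.483),
        "gzip --fast, kernel tar": (36786087, 5.965),
        "gzip, kernel tar": (29724624, 11.283),
    }
    for name, (bytes_out, secs) in runs.items():
        mbps, ratio = summarize(bytes_in, bytes_out, secs)
        print(f"{name}: {mbps:.1f}MB/sec {ratio:.2f}:1")
```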
Jeff Gilchrist's Archive Comparison Test has been around for years, and covers many more archivers and uses several different data sets, on several different platforms. It has even been cited in compression literature:
http://www.compression.ca/
Music speeds up when you yawn, but does not change pitch.
PS: see also the external links under Comparison_of_file_archivers.