A Look at Data Compression
With the new year fast approaching many of us look to the unenviable task of backing up last years data to make room for more of the same. That being said, rojakpot has taken a look at some of the data compression programs available and has a few insights that may help when looking for the best fit. From the article: "The best compressor of the aggregated fileset was, unsurprisingly, WinRK. It saved over 54MB more than its nearest competitor - Squeez. But both Squeez and SBC Archiver did very well, compared to the other compressors. The worst compressors were gzip and WinZip. Both compressors failed to save even 200MB of space in the aggregated results."
For the most part, the summary of the article seems to be the more time that a compressing application takes to compress your files, the smaller your files will be after compressing.
The one surprising thing I found in the article was that two virtually unknown contenders - WinRK and Squeez did so well. One disappointing obvious follow-up question would be how more well-known applications such as WinZip or WinRAR (which have a more mass-appeal audience) stack up against them with their configurable higher-compression options.
I'm a big tall mofo.
Just downloaded it and I find that it compresses significantly better than winrar when both are set to maximum. Decompress is quite slow. I use it to compress a small collection of utilities.
Humor from a Genetically Molested Mind
Seems to be a compression speed section on page 12 - Aggregated Results. Ranging from gzip really fast, to winrk really slow.
but I was surprised to see that the reviewer was using XP Professional Service Pack 1. I actually had to double check the review date to make sure that I wasn't reading an old article.
I personally use 7-Zip. It doesn't perform the best but it is free software and it includes a command line component that it nice for shell scripts.
WinRK may have won only because he used the fast compression setting on all the compressors he tested. Results for default setting and best compression settings are TBA.
There are some amazing compression programs out there, trouble is they tend to take a while and consume lots of memory. PAQ gives some impressive results, but the latest benchmark figures are regularly improving. Let's not forget that compression is not good unless it is integrated into a usable tool. 7-zip seems to be the new archiver on the block at the moment. A closely related, but different, set of tools are the archivers, of which there are lots with many older formats still not supported by open source tools
A key benefit to PKZIP and tarballs formats is that they will be accessible for decades or hundreds of years. These formats are open (non-proprietary), widely implemented, and free (as in freedom) software.
The same can't be said for WinRK. Therefore, if you plan to want access to your data for a long period of time, you should carefully consider whether the format will be accessible.
I did a short review and benchmarking of unix compressors people might be interested in.
Mouse powered Chips, Open source Processors and Lego
Why mess around with compressing individual files? DiskDoubler is definitely the way to go. Hell, I even have it set up to automagically compress files I haven't used in a week.
Its running perfectly fine on my Mac IIci.
Know what I like about atheists? I've yet to meet one that believes God is on their side.
I generally prefer gzip/7-Zip.
The reasoning is simple, I can use the results cross platform without special costly software. A few extra bytes of space is secondary.
For many files, I also find buying a larger disk a cheaper option than spending hours compressing/uncompressing files. So I generally only compress files I don't think I will need that are very compressable.
"I just don't understand the desire for compression in the first place."
Sometimes, people have to download things.
I rarely criticize things I don't care about.
In this day and age, when magnetic storage is like $0.50 to $0.75 per GIGABYTE, I just can't fathom why a responsible admin would risk the possible data corruption that could come with compression.
Because when you are storing Petabytes of information it makes a difference in cost.
Besides, all the problems you mention with data coruption can be solved by backing up the information more than once. Anyplace that places a high value on there info is going to have multiple backups in multiple places anyways. The most usefull application of compression is in archiving old customer records. Being mostly text, you can easily get above 50% compression ratios. Also, these are going to be backed up to tape (not disk). Being able to reduce the volume of tapes being stored by 50% can save a lot of money for a large organization.
Fly me to the moon Let me sing among those stars Let me see what spring is like On jupiter and mars
"My concern with all the 'new' compression programs is that they, unlike Zip, haven't survived the test of time. I've recovered damaged zip archives in the past and they have come through mostly intact. I've used archive/compression like ARJ with options to be able to recover data even if there are multiple bad sectors on a harddrive or floppy disk. How many of the new compression programs have the tools available to adequately recover every possible byte of data?"
The solution to this issue is popular on usenet, since it's common for large files to be damaged. There's a utility called par2 that allows recovery information to be sent, and it's extremely effective. It's format-neutral, but most large binaries are sent as multi-part RAR archives. par2 can handle just about any damage that occurs, up to and including missing files.
Most of the time however, when it's simply someone downloading something it is only necessary to detect damage so they can download it again. All the formats I have experience with can detect damage, and it's common for MD5 and SHA1 sums to be sent separately anyway for security reasons.
I rarely criticize things I don't care about.
3 hours 47 minutes with WinRK versus gzipping in 3 minutes 16 seconds. Is it really worth watching the progress bar for 200 megs smaller file?
Since WinZip does not handle .7z, .ace or .rar files, it has lost much of its appeal for me. With my old serial no longer working, I now have absolutely no reason to use it. Now when I need a compressor for Windows I choose WinAce & 7-Zip. Between those two programs, I can de-/compress just about any format you're likely to encounter online.
/dev/random
They do it to sell more ad impressions. Each time you go to the next page you load a new ad.
I can't believe TFA made /. The only thing more defective than the benchmark data set (Hint: who cares how much a generic compressor can save on JPEGs?) is the absolutely hilarious part where the author just took "fastest" for each compressor and then tried to compare the compression. Indeed, StuffIt did what I consider the only sensible thing for "fastest" in an archiver, which is to just not even try to compress content that is unlikely to get significant savings. Oddly, the list for fastest compression is almost exactly the reverse of the list for best compression on every test. The "efficiency" is a metric that illuminates nothing. An ROC plot of rate vs compression for each test would have been a good idea; better would be to build ROC curves for each compressor, but I don't see that happening anytime soon.
I wouldn't try to draw any conclusions from this "study". Given the methodology, I wouldn't wait with bated breath for parts two and three of the study, where the author actually promises to try to set up the compressors for reasonable compression, either.
Ouch.
Since the original site seems to be really slow and split into a billion pages, those who aren't aware of it might want to look at MaximumCompression since it has tests for several file formats and also has a multiple file compression test that is sorted by efficiency. A program called SBC does the best, but the much more common WinRAR comes in a respectable third.
http://www.popularculturegaming.com -- my blog about the culture of videogame players
If you're familiar with Usenet, you've probably encountered PAR files from time to time. A PAR file is a parity file which can be used to reconstruct lost data. It works sort of like a RAID, but with files as the units instead of disks.
Let's say you have a 200MB file to send. You could just send the 200MB file, with no guarantees that it will reach the destination uncorrupted. Or, you could use a compression program and bring it down to 100MB. In this case, even if you lost the first transfer, you could transfer it a second time. Then we look at PAR. You compress the 200MB file into ten 10MB files. Then, you could include 10% parity - if any of your files is bad, you'd be able to reconstruct it with the parity file. With only 110MB of transfer. PAR2 goes even further by breaking down each file into smaller units.
Besides transfer times and correction for network transfers, compression can also increase speeds of transfer to mediums. If you have an LTO tape drive that can only write to tape at 20MB/sec, you'll only ever get 20MB/sec. Add compression to the drive, and you could theoretically get 40MB/sec to tape with 2:1 compression. That means faster backups, and faster restores. On-board compression in the drives takes all the load off the CPU - but even if you use the CPU for it, they're fast enough to handle it.
Not to mention, it takes a lot less tape to make compressed backups. I don't know what world you live in, but in mine, I don't have unlimited slots in the library and I don't want to swap tapes twice a day. Handling tapes is detremental to their lives; you really want to touch them as least as possible.
Data corruption isn't caused by compression. If it's going to happen, it'll happen regardless. While your point is true that it MAY be more difficult to recover from a corrupt file, that's not the right methodology. If your backups are that valuable, you'd make multiple copies - plain and simple.
I can't fathom why a responsible and well informed admin would avoid compression.
- It's not the Macs I hate. It's Digg users. -
All I see is ads. I think I found a paragraph that looked like it may have been the article, but every other word was underlined with an ad-link so I didn't think that was it..
- It's not the Macs I hate. It's Digg users. -
It's interesting to note that Stuffit produces worthwhile compression of JPG images, something long thought to be impossible.
I'd heard the makers of Stuffit were claiming this, but I was sceptical, it's good to see independant confirmation.
Quidquid Latine dictum sit, altum videtur (anything said in Latin sounds important)
It's a crime that the submitter didn't mention this was with the fastest compression settings.
if you download a file over gprs and each megabyte costs you 3$, then saving 200 megabytes means saving 600$, which is a price for a low-end pc or almost a laptop.
another case is if you only have 100 megabytes you can use and only a zzzxxxyyy archiver can compress it into the 100mb while gzip -9 leaves you with 102mb.
so it really depends if you need it or not. sometimes you need it, mostly you don't.
but bashing on the issue "like nobody ever needs it" is certainly wrong.
I'd tell you the chances of this story being a dupe, but you wouldn't like it.
Phillip W. Katz, better known as Phil Katz (November 3, 1962-April 14, 2000), was a computer programmer best-known as the author of PKZIP, a program for compressing files which ran under the PC operating system DOS.
http://en.wikipedia.org/wiki/Phil_Katz
Don't you mean ads?
The pages are shamefully loaded with ads! I could barely find the next-page links at the bottom of the window! At first, I thought a "Google Ad" link labeled "compression" might be the next page, and clicked on it! And the true link is oddly hidden in small print, in a corner beneath a large table of PriceGrabber comparison results.
The article is basically unreadable, I'd say, due to the ads.