A Look at Data Compression

Speed by mysqlrocks · 2005-12-26 06:37 · Score: 3, Insightful

No talk of the speed of compression/decompression?

Re:Speed by sedmonds · 2005-12-26 06:42 · Score: 4, Informative

Seems to be a compression speed section on page 12 - Aggregated Results. Ranging from gzip really fast, to winrk really slow.
Re:Speed by Anonymous Coward · 2005-12-26 07:08 · Score: 3, Insightful

No talk of the speed of compression/decompression?

Exactly! We compress -terabytes- here at wr0k, and we use gzip for -nearly- everything (some of the older scripts use "compress", .Z, etc.)

Why? 'cause it's fast. 20% of space just isn't worth the time needed to compress/uncompress the data. I tried to be modern (and cool) by using bzip2, yes, it's great, saves lots of space, etc., but the time required to compress/uncompress is just not worth it. ie: if you need to compress/decompress 15-20gigs per day, bzip2 just isn't there yet.

Also, look at what google is using---they probably store more data than most other corps, but they still use gzip (I think, from some description, somewhere).
Re:Speed by Arainach · 2005-12-26 07:20 · Score: 3, Insightful

The Article Summary quoted is completely misleading. The most important graph is the final one on page 12, Compression Efficiency, where gzip is once again the obvious king. Sure, WinRK may be able to compress decently, but it takes an eternity to do so and is impractical for every-day use, which is where routines like gzip and ARJ32 come in - incredible compression for the speed in which they can operate. Besides - who really needs that last 54MB in these days of 4.7GB DVDs and 160GB Hard Drives?
Re:Speed by Luuvitonen · 2005-12-26 07:52 · Score: 5, Insightful

3 hours 47 minutes with WinRK versus gzipping in 3 minutes 16 seconds. Is it really worth watching the progress bar for 200 megs smaller file?
Re:Speed by Karma+Farmer · 2005-12-26 08:01 · Score: 2, Interesting

3 hours 47 minutes with WinRK versus gzipping in 3 minutes 16 seconds. Is it really worth watching the progress bar for 200 megs smaller file?

If your file starts out as 250 mb, it might be worth it. However, if you start with a 2.5 gb file, then it's almost certainly not -- especially once you take the closed-source and undocumented nature of the compression algorithm into account.

/not surprisingly, the article is about 2.5 gb files
Re:Speed by sshore · 2005-12-26 08:09 · Score: 5, Informative

They do it to sell more ad impressions. Each time you go to the next page you load a new ad.
Re:Speed by Wolfrider · 2005-12-26 09:22 · Score: 2, Informative

Yah, when I'm running backups and it has to Get Done in a reasonable amount of time with decent space savings, I use
gzip -9. (My fastest computer is 900MHz AMD Duron.)

For quick backups, gzip; or gzip -6.

For REALLY quick stuff, gzip -1.

When I want the most space saved, I (rarely) use bzip2 because rar, while useful for splitting files and retaining recovery metadata, is far too slow for my taste 99% of the time.

Really, disk space is so cheap these days that Getting the Backup Done is more important than saving (on average) a few megs here and there.

But if you Really Need that last few megs of free space, this is an OK guide to which compressor does that the best -- even if it takes *days.*
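For anyone who wants to see the -1 / -6 / -9 trade-off on their own data, here is a minimal sketch. It assumes Python's zlib module as a stand-in for the gzip command line (both use DEFLATE and the same 1 to 9 level scale); the sample text and the helper name sizes_by_level are made up for illustration.

```python
import zlib

def sizes_by_level(data: bytes, levels=(1, 6, 9)) -> dict:
    """Return {level: compressed size in bytes} for each DEFLATE level."""
    return {level: len(zlib.compress(data, level)) for level in levels}

# Hypothetical sample input: repetitive text, which compresses well.
sample = b"Getting the Backup Done is more important than saving a few megs. " * 2000

for level, size in sizes_by_level(sample).items():
    print(f"level {level}: {size} bytes (from {len(sample)})")
```

Wrapping the call in a timer on real backup data would show the other half of the trade-off; the numbers depend heavily on the input, so no figures are quoted here.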

--
.
== WolfriderV6 == I'm willing to admit that *I just might* be wrong... Are you??
Re:Speed by moro_666 · 2005-12-26 09:39 · Score: 5, Interesting

if you download a file over gprs and each megabyte costs you 3$, then saving 200 megabytes means saving 600$, which is a price for a low-end pc or almost a laptop.

another case is if you only have 100 megabytes you can use and only a zzzxxxyyy archiver can compress it into the 100mb while gzip -9 leaves you with 102mb.

so it really depends if you need it or not. sometimes you need it, mostly you don't.

but bashing on the issue "like nobody ever needs it" is certainly wrong.

--

I'd tell you the chances of this story being a dupe, but you wouldn't like it.
Re:Speed by Hangeron · 2005-12-26 09:57 · Score: 3, Funny

Oh, I wondered what the big empty blocks in the middle of the text were. I have ad blocking with this http://everythingisnt.com/hosts.html
Re:Speed by MilenCent · 2005-12-26 11:16 · Score: 4, Insightful

Don't you mean ads?

The pages are shamefully loaded with ads! I could barely find the next-page links at the bottom of the window! At first, I thought a "Google Ad" link labeled "compression" might be the next page, and clicked on it! And the true link is oddly hidden in small print, in a corner beneath a large table of PriceGrabber comparison results.

The article is basically unreadable, I'd say, due to the ads.
Re:Speed by Karma+Farmer · 2005-12-26 12:18 · Score: 2, Funny

But, if you were using mobile phones to transfer a 2.5 GB file between two separate windows-only PCs, and you were willing to initiate a $10,000, 10 day file transfer using a proprietary windows-only compression scheme without any type of error correction or partial restart, then I agree that WinRK would be the best choice.
Re:Speed by Killall+-9+Bash · 2005-12-26 12:40 · Score: 2, Insightful

If I didn't click on any ads on pages 1 through 14, will I click on one on that 15th page?

--
"Prediction: within 10 years, Windows will be a Linux distribution." Me, 7-6-2016
Re:Speed by Nutria · 2005-12-26 17:04 · Score: 2, Funny

When you're talking about GPRS, even transatlantic sneakernet would be faster (and cheaper).

"Never underestimate the bandwidth of a stationwagon full of tapes."

or the updated "Never underestimate the bandwidth of a 747 filled with DVDs".

Or the even more updated "Never underestimate the bandwidth of a 747 filled with 500GB HDDs".

--
"I don't know, therefore Aliens" Wafflebox1

More time = More compression by bigtallmofo · 2005-12-26 06:39 · Score: 4, Insightful

For the most part, the summary of the article seems to be the more time that a compressing application takes to compress your files, the smaller your files will be after compressing.

The one surprising thing I found in the article was that two virtually unknown contenders - WinRK and Squeez - did so well. One obvious follow-up question, disappointingly left unanswered, is how more well-known applications such as WinZip or WinRAR (which have a more mass-appeal audience) stack up against them with their configurable higher-compression options.

--
I'm a big tall mofo.

Re:More time = More compression by Orgasmatron · 2005-12-26 07:15 · Score: 3, Funny

Speaking of unknown compression programs, does anyone remember OWS?

I had a good laugh at that one when I figured out how it worked, way back in the BBS days.

--
See that "Preview" button?
Re:More time = More compression by undeadly · 2005-12-26 07:17 · Score: 2, Interesting

For the most part, the summary of the article seems to be the more time that a compressing application takes to compress your files, the smaller your files will be after compressing.
Not only time, but also how much memory the algorithm uses, though the author did not mention how much space each algorithm uses. gzip, for instance, does not use much, but others, like rzip (http://rzip.samba.org/) use a lot. rzip may use up to 900MB during compression.
I did a test compressing a 4GB tar archive with rzip, which resulted in a compressed file of 2.1 GB. gzip at max compression gave about 2.7 GB.
So one should choose an algorithm based upon need, and of course, availability of source code. Using a proprietary, closed source compression algorithm with no open source alternative implementation is begging for trouble down the road.
Re:More time = More compression by Rich0 · 2005-12-26 07:22 · Score: 5, Interesting

If you look at the methodology - all the results were obtained using the software set to the fastest mode - not the best compression mode.

So, I would consider gzip the best performer by this criterion. After all, if I cared most about space savings I'd have picked the best-mode - not the fast-mode. All this article suggests is that a few archivers are REALLY lousy for doing FAST compression.

If my requirements were realtime compression (maybe for streaming multimedia) then I wouldn't be bothered with some mega-compression algorithm that takes 2 minutes per MB to pack the data.

Might I suggest a better test? If interested in best compression, then run each program in a mode which optimizes purely for compression ratio. On the other hand, if interested in realtime compression then take each algorithm and tweak the parameters so that they all run in the same time (which is a relatively fast time), and then compare compression ratios.

With the huge compression of multimedia files I'd also want the reviewers to state explicitly that the compression was verified to be lossless. I've never heard of some of these proprietary apps, but if they're getting significant ratios out of .wav and .mp3 files I'd want to do a binary compare of the restored files to ensure they weren't just run through a lossy codec...
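The binary compare described above can be sketched in a few lines. This is only an illustration: the standard-library codecs stand in for the archivers in the review (none of which are scripted here), and the sample bytes are a hypothetical placeholder for a real .wav file.

```python
import bz2
import lzma
import zlib

# Standard-library codecs standing in for the archivers under review.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

def is_lossless(original: bytes, compress, decompress) -> bool:
    """True only if decompress(compress(x)) gives back x byte for byte."""
    return decompress(compress(original)) == original

def verify_all(original: bytes) -> dict:
    """Run the round-trip check for every codec in CODECS."""
    return {name: is_lossless(original, c, d) for name, (c, d) in CODECS.items()}

# Hypothetical stand-in for a .wav payload.
sample = bytes(range(256)) * 64
print(verify_all(sample))
```

For a closed-source archiver the same idea applies from the outside: compress, extract to a fresh directory, and compare the extracted file against the original.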

WinRK is excellent by drsmack1 · 2005-12-26 06:40 · Score: 4, Interesting

Just downloaded it and I find that it compresses significantly better than winrar when both are set to maximum. Decompress is quite slow. I use it to compress a small collection of utilities.

--

Humor from a Genetically Molested Mind

Nice Comparison... by Goo.cc · 2005-12-26 06:43 · Score: 4, Insightful

but I was surprised to see that the reviewer was using XP Professional Service Pack 1. I actually had to double check the review date to make sure that I wasn't reading an old article.

I personally use 7-Zip. It doesn't perform the best but it is free software and it includes a command line component that is nice for shell scripts.

Re:Nice Comparison... by Anonymous Coward · 2005-12-26 10:47 · Score: 2, Interesting

I have ported ppmd to a nice pzip style utility and a pzlib style library. Find it at http://pzip.sf.net/

Speed is better than bzip2 and compression is top class, beaten only by 7zip and LZMA compressors (which require much more time and memory). The problem is that decompression is the same speed as the compression, unlike bzip2/gzip/zip where the decompression is much faster.

The review quoted above is totally useless because 7zip for example uses a 32Kb dictionary. Given a 200Mb dictionary it really starts to perform quite well! I would not be surprised if 7zip came out the winner there given a better compression parameter.
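To show why dictionary size matters so much for LZMA-style compressors, here is a rough sketch using Python's lzma module (not 7-Zip itself). The data is artificial: one pseudo-random 256 KiB block stored twice, so the only redundancy sits 256 KiB back. The helper name xz_size and the two dictionary sizes are assumptions picked for the demonstration, not the settings used in the review.

```python
import lzma
import random

def xz_size(data: bytes, dict_size: int) -> int:
    """Compressed size of data using LZMA2 with the given dictionary size."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))

# Artificial input: an incompressible block followed by an exact copy of itself.
rng = random.Random(0)
block = bytes(rng.getrandbits(8) for _ in range(256 * 1024))
data = block + block

small = xz_size(data, 64 * 1024)     # dictionary cannot reach back to the first copy
large = xz_size(data, 1024 * 1024)   # dictionary covers the whole input

print(f"64 KiB dictionary: {small} bytes")
print(f"1 MiB dictionary:  {large} bytes")
```

With the small dictionary the second copy looks like fresh random data; with the large one the compressor can reference the first copy. Real files are less extreme, but the effect is the same kind.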

Windows only by Jay+Maynard · 2005-12-26 06:45 · Score: 2, Interesting

It's a real shame that 1) the guy only did Windows archivers, and 2) SBC Archiver is no longer in active development, closed source, and Windows-only.

--
Disinfect the GNU General Public Virus!

Actually by Sterling+Christensen · 2005-12-26 06:45 · Score: 5, Interesting

WinRK may have won only because he used the fast compression setting on all the compressors he tested. Results for default setting and best compression settings are TBA.

This is a surprisingly big subject by derek_farn · 2005-12-26 06:46 · Score: 4, Informative

There are some amazing compression programs out there, trouble is they tend to take a while and consume lots of memory. PAQ gives some impressive results, but the latest benchmark figures are regularly improving. Let's not forget that compression is not good unless it is integrated into a usable tool. 7-zip seems to be the new archiver on the block at the moment. A closely related, but different, set of tools are the archivers, of which there are lots, with many older formats still not supported by open source tools.

Open formats and long-term accessibility by ahziem · 2005-12-26 06:48 · Score: 5, Insightful

A key benefit to PKZIP and tarballs formats is that they will be accessible for decades or hundreds of years. These formats are open (non-proprietary), widely implemented, and free (as in freedom) software.

The same can't be said for WinRK. Therefore, if you plan to want access to your data for a long period of time, you should carefully consider whether the format will be accessible.

Unix compressors by brejc8 · 2005-12-26 06:48 · Score: 5, Interesting

I did a short review and benchmarking of unix compressors people might be interested in.

--
Mouse powered Chips, Open source Processors and Lego

Just use DiskDoubler by mattkime · 2005-12-26 06:50 · Score: 5, Funny

Why mess around with compressing individual files? DiskDoubler is definitely the way to go. Hell, I even have it set up to automagically compress files I haven't used in a week.

It's running perfectly fine on my Mac IIci.

--
Know what I like about atheists? I've yet to meet one that believes God is on their side.

Re:Just use DiskDoubler by SleepyHappyDoc · 2005-12-26 06:57 · Score: 3, Funny

Mac IIci? Has it finished compressing files since you bought it?

--
Stasis is death. Embrace change.
Re:Just use DiskDoubler by fbjon · 2005-12-26 09:19 · Score: 2, Insightful

I prefer DoubleSpace for maximum file-destroying activity.

--
True confidence comes not from realising you are as good as your peers, but that your peers are as bad as you are.

Input type? by reset_button · 2005-12-26 06:53 · Score: 3, Interesting

Looks like the site got slashdotted while I was in the middle of reading it. What file types were used as input? Clearly compression algorithms differ on the file types that they work best on. Also, a better metric would probably have been space/time, rather than just using time. Also, I know that zlib, for example, allows you to choose the compression level - was this explored at all?

Also, do any of you know any lossless algorithms for media (movies, images, music, etc)? Most algorithms perform poorly in this area, but I thought that perhaps there were some specifically designed for this.

Re:Input type? by bigbigbison · 2005-12-26 08:04 · Score: 2, Interesting

According to Maximum Compression, which is basically the best site for compression testing, Stuffit's new version is the best for lossless jpeg compression. I've got it and I can confirm that it does a much better job on jpegs than anything else I've tried. Unfortunately, it is only effective on jpegs not gifs, pngs, or even pdfs which seem to use jpeg compression. And, outside of the mac world, it is kind of rare.

--
http://www.popularculturegaming.com -- my blog about the culture of videogame players

Why compress in weird formats? by canuck57 · 2005-12-26 06:54 · Score: 4, Insightful

I generally prefer gzip/7-Zip.

The reasoning is simple, I can use the results cross platform without special costly software. A few extra bytes of space is secondary.

For many files, I also find buying a larger disk a cheaper option than spending hours compressing/uncompressing files. So I generally only compress files I don't think I will need that are very compressible.

Re:Why compress in weird formats? by _Shorty-dammit · 2005-12-26 07:41 · Score: 2, Insightful

haha, yeah, 7-zip isn't 'weird' at all. I like how you try to make it sound like it's just as pervasive as something like gzip, even though 7-zip's a pretty much unknown format.
Re:Why compress in weird formats? by hobuddy · 2005-12-26 13:14 · Score: 2, Interesting

7-zip is the 16th most popular download on SourceForge (8544268 downloads so far), and it gets downloaded about 18000 times per day, so it must be going somewhere in terms of popularity.

--
Erlang.org: wow

Re:Why compress in the first place? by ArbitraryConstant · 2005-12-26 07:00 · Score: 5, Insightful

"I just don't understand the desire for compression in the first place."

Sometimes, people have to download things.

--
I rarely criticize things I don't care about.

Re:Why compress in the first place? by topham · 2005-12-26 07:03 · Score: 2, Insightful

I'd call you a troll, but I think you were being honest.

Compressing files with a good compression program does not increase the chance of it being corrupted.

And, the majority of files people send to each other, etc, aren't simply ascii files. (even if yours are).

The other advantage of using a compression program is the majority of them create archives and allow you to consolidate all the related files.

A good archive/compression program will add a couple of percent of redundancy data which can substantially increase the data integrity. Above and beyond that which you have by simply storing an ascii file uncompressed.

My concern with all the 'new' compression programs is that they, unlike Zip, haven't survived the test of time. I've recovered damaged zip archives in the past and they have come through mostly intact. I've used archive/compression like ARJ with options to be able to recover data even if there are multiple bad sectors on a harddrive or floppy disk. How many of the new compression programs have the tools available to adequately recover every possible byte of data?

Re:Why compress in the first place? by Ironsides · 2005-12-26 07:03 · Score: 4, Interesting

In this day and age, when magnetic storage is like $0.50 to $0.75 per GIGABYTE, I just can't fathom why a responsible admin would risk the possible data corruption that could come with compression.

Because when you are storing Petabytes of information it makes a difference in cost.

Besides, all the problems you mention with data corruption can be solved by backing up the information more than once. Anyplace that places a high value on their info is going to have multiple backups in multiple places anyways. The most useful application of compression is in archiving old customer records. Being mostly text, you can easily get above 50% compression ratios. Also, these are going to be backed up to tape (not disk). Being able to reduce the volume of tapes being stored by 50% can save a lot of money for a large organization.

--
Fly me to the moon Let me sing among those stars Let me see what spring is like On jupiter and mars

Re:Why compress in the first place? by ArbitraryConstant · 2005-12-26 07:21 · Score: 4, Interesting

"My concern with all the 'new' compression programs is that they, unlike Zip, haven't survived the test of time. I've recovered damaged zip archives in the past and they have come through mostly intact. I've used archive/compression like ARJ with options to be able to recover data even if there are multiple bad sectors on a harddrive or floppy disk. How many of the new compression programs have the tools available to adequately recover every possible byte of data?"

The solution to this issue is popular on usenet, since it's common for large files to be damaged. There's a utility called par2 that allows recovery information to be sent, and it's extremely effective. It's format-neutral, but most large binaries are sent as multi-part RAR archives. par2 can handle just about any damage that occurs, up to and including missing files.

Most of the time however, when it's simply someone downloading something it is only necessary to detect damage so they can download it again. All the formats I have experience with can detect damage, and it's common for MD5 and SHA1 sums to be sent separately anyway for security reasons.
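A minimal sketch of the "detect damage, then download again" check, using Python's hashlib. The function names and the payload are hypothetical; the published sums would normally come from the download page rather than being computed locally as they are here.

```python
import hashlib

def digests(data: bytes) -> dict:
    """MD5 and SHA1 hex digests of a download, as commonly published alongside it."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }

def is_intact(data: bytes, published: dict) -> bool:
    """True if the received bytes match the published sums."""
    return digests(data) == published

# Hypothetical payload and a copy with a single flipped bit.
payload = b"archive contents " * 1000
published = digests(payload)
damaged = bytes([payload[0] ^ 0x01]) + payload[1:]

print(is_intact(payload, published))   # matches the published sums
print(is_intact(damaged, published))   # one flipped bit changes both digests
```

This only detects damage; it cannot repair anything, which is where par2 comes in.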

--
I rarely criticize things I don't care about.

Re:Compressia by Insurgent2 · 2005-12-26 07:58 · Score: 2, Informative

Burrows-Wheeler transform
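For anyone who hasn't met it, the Burrows-Wheeler transform is the block-sorting step that bzip2 is built on. Below is a naive sketch (sort every rotation, keep the last column) that is fine for short strings and far too slow for real files; the "$" end marker and the function names are choices made for this example.

```python
def bwt(text: str, end: str = "$") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if end in text:
        raise ValueError("end marker must not occur in the input")
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(last: str, end: str = "$") -> str:
    """Rebuild the input by repeatedly prepending the last column and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    row = next(r for r in table if r.endswith(end))
    return row[:-1]

print(bwt("banana"))             # annb$aa
print(inverse_bwt("annb$aa"))    # banana
```

The transform does not shrink anything by itself; it groups similar characters together so that a later stage (move-to-front plus an entropy coder, in bzip2's case) has an easier job.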

Re:Why compress in the first place? by LWATCDR · 2005-12-26 08:00 · Score: 3, Interesting

"As the administrator, your fundamental obligation is data integrity. If you compress, and if the compressed file store is damaged [especially if the header information on a compressed file - or files - is damaged], then you will tend to lose ALL of your data."
Not all data is stored in ASCII and/or ANSI. Compressing the data can make it more secure, not less.
1. It takes up fewer sectors of a drive so it is less likely to get corrupted.
2. Can contain extra data to recover from bad bits.
3. Allows you to make redundant copies without using any more storage space.
Let's say that you have some files that are in ASCII you want to store. Using any compression method you can probably store 3 copies of the file using the same amount of disk space.
You are far more likely to recover a full data set from three copies of compressed file than from one copy of an uncompressed file.

Also we do not have unlimited bandwidth and unlimited storage EVERYWHERE. Lossless video, image, and audio files take up a lot of space. For some applications MP3, Ogg, MPG, and JPEG just don't cut it.
So yes compression still is important.

--
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.

small mistake by ltwally · 2005-12-26 08:01 · Score: 4, Interesting

There is a small mistake on page 3 of the article, in the first table: WinZip no longer offers free upgrades. If you have a serial for an older version (1-9), that serial will only work on the older versions. You need a new serial for v10.0, and that serial will not work when v11.0 comes out.

Since WinZip does not handle .7z, .ace or .rar files, it has lost much of its appeal for me. With my old serial no longer working, I now have absolutely no reason to use it. Now when I need a compressor for Windows I choose WinAce & 7-Zip. Between those two programs, I can de-/compress just about any format you're likely to encounter online.

--

/dev/random

Compress to 0K by Anonymous Coward · 2005-12-26 08:11 · Score: 2, Funny

I always compress my compressed files over and over until I achieve absolute 0Kb.
I carry all data of my entire serverfarm like that on a 128Mb USB-stick.

Nothing to see here by Anonymous Coward · 2005-12-26 08:15 · Score: 5, Informative

I can't believe TFA made /. The only thing more defective than the benchmark data set (Hint: who cares how much a generic compressor can save on JPEGs?) is the absolutely hilarious part where the author just took "fastest" for each compressor and then tried to compare the compression. Indeed, StuffIt did what I consider the only sensible thing for "fastest" in an archiver, which is to just not even try to compress content that is unlikely to get significant savings. Oddly, the list for fastest compression is almost exactly the reverse of the list for best compression on every test. The "efficiency" is a metric that illuminates nothing. An ROC plot of rate vs compression for each test would have been a good idea; better would be to build ROC curves for each compressor, but I don't see that happening anytime soon.

I wouldn't try to draw any conclusions from this "study". Given the methodology, I wouldn't wait with bated breath for parts two and three of the study, where the author actually promises to try to set up the compressors for reasonable compression, either.

Ouch.

Re:Nothing to see here by Meostro · 2005-12-26 13:53 · Score: 2, Informative

If you even take the most basic/well studied Lempel-Ziv and Huffman algorithms you'll quickly find cases where each would be preferred over another.
That's sort of the point of this test though, to see which of the general-purpose compressors (GPC) is going to give you the best overall results. Yes, you should use FLAC for WAVs, and probably StuffIt for JPEGs, but what is your best choice if you're going to have just one, or just a few? I don't want 200 different compressors for 200 different content types, I want one.

As a matter of practicality, right now you need zip or gzip, and bzip2 is gaining ground. If you're going to create new content, you should offer both bz2 and zip. In the future, maybe you should use 7z or sit instead, it depends on the rate of adoption. Personally, I don't think zip will ever die.
And since different algm's identify different patterns in the file they're compressing, certain files will be compressed better by different algorithms and do much worse on the next file. Besides, we're not even getting into any discussion of lossy/lossless algm's here. (Think jpeg vs bmp).
Generally, you will pick a special-purpose compressor for lossy compression, and a GPC for lossless compression. Your audio compressor will probably be MP3 or OGG, your images will probably be JPG, videos will be MPG. It's not efficient to use MP3 compression on your images, it's designed with different constraints. Either for the same bitrate the image is much worse quality, or for the same quality the file will be much larger than necessary. The same goes for lossless compressors too, FLAC works much better than ZIP on audio data, but I would bet if you used a BMP file as the source for compression FLAC would probably be bad and ZIP would probably be average.

If you want to compress 300 files of various types, you need a GPC. That doesn't mean that the GPC doesn't have special-purpose algorithms built into it, it just means that on-average it will perform better than a special-purpose compressor.

Kolmogorov complexity, or at least an estimate thereof, is what you're talking about. For any specific dataset, the Kolmogorov complexity is the minimum size of compressed data + decompressor. It can't be calculated, but it is a measure of performance for any combination of compressor and dataset. For WAVs, you will probably see this:
K(FLAC, WAVs) < K(GPC, WAVs)

However, for an evenly-distributed general dataset of generic binary files, TXT, JPG, PDF, TIF, PNG, MP3, WAV, and MPG, you will probably find that for any SPC (special-purpose compressor for any of the individual data types):
K(GPC, dataset) < K(SPC, dataset)

Maximum Compression has efficiency comparisons by bigbigbison · 2005-12-26 08:18 · Score: 5, Informative

Since the original site seems to be really slow and split into a billion pages, those who aren't aware of it might want to look at MaximumCompression since it has tests for several file formats and also has a multiple file compression test that is sorted by efficiency. A program called SBC does the best, but the much more common WinRAR comes in a respectable third.

--
http://www.popularculturegaming.com -- my blog about the culture of videogame players

Related Links Broken by Karma+Farmer · 2005-12-26 08:21 · Score: 2, Funny

The "related links" box for this story is horribly broken. Instead of being links related to the story, it's a bunch of advertising. I'm sure this was a mistake or a bug in slashcode itself.

I've searched the FAQ, but I can't figure out how to contact slashdot admins. Does anyone know an email address or telephone number I can use to contact them about this serious problem? I'm sure they'll want to fix it as quickly as possible.

No one ever looks at rzip by Mr.Ned · 2005-12-26 08:24 · Score: 3, Interesting

http://rzip.samba.org/ is a phenomenal compressor. It does much better than bzip2 or rar on large files and is open source.

Re:Why compress in the first place? by DeadboltX · 2005-12-26 08:42 · Score: 3, Informative

Sounds like you need to introduce yourself to the world of par2 ( http://www.quickpar.org.uk/ )

Parity reconstruction

Think of it like the year 2805 where scientists can regrow someone's arm if they happen to lose it

Decompression Speed by Hamfist · 2005-12-26 08:42 · Score: 3, Interesting

Interesting that the article talks about compression ratio and compression speed. When considering compression, decompression time is extremely relevant. I don't mind waiting more to compress the fileset, as long as decompression is fast. I normally compress once, and then decompress various times (media files and games for example).

Because it makes a hell of a lot of sense. by cbreaker · 2005-12-26 08:51 · Score: 4, Insightful

If you're familiar with Usenet, you've probably encountered PAR files from time to time. A PAR file is a parity file which can be used to reconstruct lost data. It works sort of like a RAID, but with files as the units instead of disks.

Let's say you have a 200MB file to send. You could just send the 200MB file, with no guarantees that it will reach the destination uncorrupted. Or, you could use a compression program and bring it down to 100MB. In this case, even if you lost the first transfer, you could transfer it a second time. Then we look at PAR. You compress the 200MB file into ten 10MB files. Then, you could include 10% parity - if any of your files is bad, you'd be able to reconstruct it with the parity file. With only 110MB of transfer. PAR2 goes even further by breaking down each file into smaller units.
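The RAID comparison can be made concrete with a toy example. Note the assumption: this uses a single XOR parity block, which can rebuild exactly one lost file, whereas real PAR/PAR2 tools use Reed-Solomon codes and can repair more than one. The ten 10-byte blocks stand in for the ten 10MB files, and all names are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks) -> bytes:
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_missing(blocks, parity: bytes) -> bytes:
    """Recover the single block marked None from the survivors plus parity."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can rebuild exactly one missing block")
    survivors = [block for block in blocks if block is not None]
    return xor_blocks(survivors + [parity])

# Ten small blocks standing in for the ten 10MB files.
files = [bytes([i]) * 10 for i in range(1, 11)]
parity = xor_blocks(files)           # one extra block, i.e. 10% overhead

received = list(files)
received[3] = None                   # one file lost or corrupted in transit
print(rebuild_missing(received, parity) == files[3])
```

The overhead matches the post: ten data blocks plus one parity block is 110% of the compressed size, and any one bad block can be regenerated without a second transfer.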

Besides transfer times and correction for network transfers, compression can also increase speeds of transfer to mediums. If you have an LTO tape drive that can only write to tape at 20MB/sec, you'll only ever get 20MB/sec. Add compression to the drive, and you could theoretically get 40MB/sec to tape with 2:1 compression. That means faster backups, and faster restores. On-board compression in the drives takes all the load off the CPU - but even if you use the CPU for it, they're fast enough to handle it.

Not to mention, it takes a lot less tape to make compressed backups. I don't know what world you live in, but in mine, I don't have unlimited slots in the library and I don't want to swap tapes twice a day. Handling tapes is detrimental to their lives; you really want to touch them as little as possible.

Data corruption isn't caused by compression. If it's going to happen, it'll happen regardless. While your point is true that it MAY be more difficult to recover from a corrupt file, that's not the right methodology. If your backups are that valuable, you'd make multiple copies - plain and simple.

I can't fathom why a responsible and well informed admin would avoid compression.

--
- It's not the Macs I hate. It's Digg users. -

Re:Because it makes a hell of a lot of sense. by cbreaker · 2005-12-26 10:12 · Score: 2, Insightful

I don't understand why you think compression automatically destroys the chance of recovery? And how encoding in ASCII is better? What's the thing about "sectors"? I never said using a compressed volume on a hard disk was a good idea. Compressed files can be recovered too, you know. If you have the forensic expertise to recover a corrupted non-compressed file, chances are you'd also be able to recover the data from a compressed one.

The cost of media is not the only argument for compression - in fact I didn't mention media price at all. I did mention the library capacity, however - and getting an even bigger library is a lot more expensive of a prospect than the $.75 you quoted per GB. Did you read the whole part of my post about speeds? If I can restore that database in half the time because of compression, that means less down time and less money lost. (Although, the money-lost factor doesn't really apply at a government institution; we're not selling anything.)

"If you backup more than once UNCOMPRESSED, you can recover almost anything because it is VERY unlikely that a bad sector will occur in the exact same spot or even in the same file (assuming the one file does not take up most of the specific media.)"

Wouldn't this apply to a compressed backup, too? You're assuming here that the file was unchanged in between the two backups - thus it would apply to any data, compressed or not.

"Alternatively, use PAR files to recover - as long as you're willing to add the extra space and time - which sort of obviates the advantage of compression, doesn't it?"

No - it simply lowers the compression ratio a bit. If you're getting 2:1 compression and add 10% pars, you're still looking at a 1.8:1 compression ratio, but with recoverability.
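The arithmetic behind that 1.8:1 figure, as a small sketch (the function name is made up for the example): the compressed data is 1/ratio of the original, and the parity adds a fraction on top of the compressed size.

```python
def effective_ratio(compression_ratio: float, parity_fraction: float) -> float:
    """Overall ratio once parity overhead is added on top of the compressed data."""
    compressed = 1.0 / compression_ratio            # e.g. 0.5 of the original at 2:1
    with_parity = compressed * (1.0 + parity_fraction)
    return 1.0 / with_parity

# 2:1 compression plus 10% parity: 1 / (0.5 * 1.1) = 1.818...
print(f"{effective_ratio(2.0, 0.10):.2f}:1")
```

So 200MB becomes 100MB compressed plus 10MB of parity, 110MB in total, or roughly 1.8:1.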

----

Within every IT budget, you must balance out the speed, recoverability, and cost of your backup solution.

In your solution of never using compression (since no admin should do that, you mentioned) you lose a lot of speed in backups and restores. Speed of recovery is a key factor in many environments. It's often the top question asked when in discussion of new backup solutions. You talk about this as an important point yet excluding compression could double your restore times, or more. Not to mention backup speeds - if you can take your backups in half the time, you effectively double the number of servers you could backup in the same amount of time. Or, you reduce the amount of time servers are busy with backups.

Recoverability is big - you want your backups to be reliable. Most of the time, any corruption is unacceptable, be it in a compressed file or not. It's either good or you throw it out and go back to the previous backup. Many IT shops are doing multiple backups these days - backup to disk first, then to tape. Then take snapshots of those tapes and bring them off-site. Compressed or not, testing your backups and ensuring you have no problems with hardware is much more effective than using uncompressed backups and performing forensics on them if they're bad. Speaking of which, I don't see why compressed data would be less recoverable.

Finally, you have cost. Yes, even when data recoverability is a key factor, you still have to consider cost. So, what makes more sense? Using uncompressed backups that will back up and restore slower, cost a lot more for media and library capacity, and cause more personnel overhead for swapping tapes - or using compression and cutting all that in half? You'd rather lose all that on the off chance that MAYBE you could recover more of your data, on the off chance that NONE of your other backups are good? I don't know any responsible IT manager who could agree with you.

A proper backup and recovery plan with periodic testing and multiple copies held on-site and off is a much more effective solution than betting on forensic recovery of uncompressed data.

Hey, I'm not claiming that compression is always right in every situation. That's far fro

--
- It's not the Macs I hate. It's Digg users. -

Unicode support? by icydog · 2005-12-26 09:01 · Score: 3, Informative

Is there any mention made of Unicode support? I know that WinZip is out of the question for me because I can't compress anything with Chinese filenames with it. The files either won't work at all, or they get compressed but the filenames turn into garbage. Even though the data stays intact, it doesn't help much if it's a binary and has no intelligible filename.

I've been using 7-Zip for this reason, and also because it compresses well while also working on Windows and Linux.

accuracy test missing by Grimwiz · 2005-12-26 09:05 · Score: 2, Insightful

A suitable level of paranoia would suggest that it would be good to decompress the compressed files and verify that they produce the identical dataset. I did not see this step in the overview.
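A minimal sketch of that kind of check, using Python's standard `gzip` and `hashlib` modules. The helper names and the sample data are made up for illustration, and it says nothing about how the reviewed tools were actually tested.

```python
import gzip
import hashlib


def sha256(data: bytes) -> str:
    """Hex digest used to compare the original and restored data."""
    return hashlib.sha256(data).hexdigest()


def verify_round_trip(original: bytes) -> bool:
    """Compress, decompress, and confirm the result matches the input."""
    compressed = gzip.compress(original)
    restored = gzip.decompress(compressed)
    return sha256(restored) == sha256(original)


sample = b"paranoia pays off " * 1000  # stand-in for a real test file
ok = verify_round_trip(sample)
print("round trip ok:", ok)
```

For a real benchmark the same idea applies per archiver: hash the input set, extract the archive to a scratch directory, and hash again.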

--
-- Don't believe everything you read, hear or think

There's an article in there somewhere? by cbreaker · 2005-12-26 09:10 · Score: 4, Insightful

All I see is ads. I think I found a paragraph that looked like it may have been the article, but every other word was underlined with an ad-link so I didn't think that was it..

--
- It's not the Macs I hate. It's Digg users. -

JPG compression by The+Famous+Druid · 2005-12-26 09:15 · Score: 5, Interesting

It's interesting to note that Stuffit produces worthwhile compression of JPG images, something long thought to be impossible.
I'd heard the makers of Stuffit were claiming this, but I was sceptical. It's good to see independent confirmation.

--
Quidquid Latine dictum sit, altum videtur (anything said in Latin sounds important)

Completely out of context by EdMcMan · 2005-12-26 09:36 · Score: 4, Informative

It's a crime that the submitter didn't mention this was with the fastest compression settings.

Why does ANYBODY Bother with WinZip? by Master+of+Transhuman · 2005-12-26 09:39 · Score: 3, Interesting

Proprietary, costs money...

I use ZipGenius - handles 20 compression formats including RAR, ACE, JAR, TAR, GZ, BZ, ARJ, CAB, LHA, LZH, RPM, 7-Zip, OpenOffice/StarOffice Zip files, UPX, etc.

You can encrypt files with one of four algorithms (CZIP, Blowfish, Twofish, Rijndael AES).

If you set an antivirus path in ZipGenius options, the program will prompt you to perform an AV scan before running the selected file.

It has an FTP client, TWAIN device image importing, file splitting, convert RAR into SFX, converts any Zip archive into an ISO image file, etc.

And it's totally free.

--
Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!

This test is worthless by Dwedit · 2005-12-26 10:04 · Score: 3, Informative

They are testing 7-zip at the FAST setting, which does a poor job compared to the BEST setting.
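To get a feel for how much the setting matters, here is a rough sketch using Python's standard `lzma` module. Assumptions: the xz presets 0 and 9 are only stand-ins for 7-Zip's own "fast" and "best" settings (they are not the same switches), and the sample data is synthetic, so the sizes it prints say nothing about the article's test files.

```python
import lzma

# Synthetic, mildly repetitive sample data (an assumption, not the article's data).
data = b"".join(
    f"record {i}: value={i * i % 977} status=ok\n".encode() for i in range(20000)
)

fast = lzma.compress(data, preset=0)  # fastest preset, weakest compression
best = lzma.compress(data, preset=9)  # slowest preset, strongest compression

print("original:", len(data))
print("preset 0:", len(fast))
print("preset 9:", len(best))
```

On most inputs the higher preset produces the smaller file at the cost of more time and memory, which is why a comparison run only at the fastest setting undersells an LZMA-based tool.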

Lest We Forget - Phillip W. Katz by BigFoot48 · 2005-12-26 10:29 · Score: 4, Interesting

While we're discussing compression and PKZip, I thought a little reminder of who started it all, and who died before his time, may be in order.

Phillip W. Katz, better known as Phil Katz (November 3, 1962 - April 14, 2000), was a computer programmer best known as the author of PKZIP, a program for compressing files which ran under the PC operating system DOS.

http://en.wikipedia.org/wiki/Phil_Katz

Embarrassing ads - This is an ad cash-grab by dr_skipper · 2005-12-26 12:38 · Score: 3, Insightful

This is sad. Over and over Slashdot is posting stories with nothing more than some lame tech review and dozens of ads. I really believe people are generating sites with crap technical content, packing them with ads, and submitting to Slashdot hoping to win the impression/click lottery.

Please editors, check the sites out first. If it's 90% ads and impossible to navigate without clicking ads accidentally, it's just some loser's cash-grab site.

Pile of ad-laden shit article by hazem · 2005-12-26 14:10 · Score: 2, Funny

I know not many of you actually RTFA, but that article was so damned annoying. There's a table in there - think it's to compare compression schemes? Nope - it's for processors. There are red links.. article related? Nope - ad links. Blue underlined links - yup, for more ads.

What a steaming pile of shit. Happy new year.

The analysis is kind of silly by speedplane · 2005-12-26 15:28 · Score: 2, Informative

Lossless data compression is a pretty well studied subject. Shannon started it back in the 40's and plenty of research has gone into it since.

There are basically three ways to do lossless compression: Huffman, Arithmetic, and LZW. Technically Huffman can achieve the best of the three, however it's generally the worst because of implementation issues (it would take a lot of processing to do rigorous Huffman encoding).

Arithmetic coding is generally better but is difficult to implement. I think IBM is the company who actually sells an arithmetic coder (I could be wrong though).

LZW is by far the best of the three (you can read online how it works), but alas it is patented and anyone who gives away free copies of it will get sued.

I know for a fact that gzip uses Huffman, which would explain its lackluster performance. I haven't researched it further, but I would not be surprised if the three proprietary compression programs which "won" this review use LZW. I also wouldn't be surprised if they pay a good amount to LZW's patent holders (Unisys I think).

I'd be interested to see how gzip performs on its "maximum compression" setting. Like I said earlier, Huffman can achieve the theoretical limit on compression where LZW cannot.

--
Fast Federal Court and I.T.C. updates

Re:Speaking of Comparisons by chronicon · 2005-12-26 15:47 · Score: 2, Interesting

Speaking of Comparisons (Score:-1, Redundant)

I knew I had seen this story before but it wasn't here. This article was up on Digg three days ago--with only three Diggs to its name (at the time of this writing), but it's front page news here? Interesting to say the least...

I predict that this Digg will become frontpage Slashdot news shortly. It was quite popular (914 diggs so far) and it's hit the three-day mark...

I know, this is all so OT, but it's no worse than whining about duplicate postings here...

Oh the irony here is just too much to take without laughing! My comment gets hammered with the REDUNDANT pummel when I point out that /. is being REDUNDANT in posting old Diggs? Man, it just doesn't get any better than this to make a point.

Moderators: did you catch the not-so-subtle play I made here by quoting ALL of my original message? In case you didn't, I'm being REDUNDANTLY sarcastic...

Enjoy!

Wrong, wrong, wrong, wrong, wrong. by hereticmessiah · 2005-12-26 19:30 · Score: 3, Informative

Huffman coding and arithmetic coding are both entropy encoding algorithms. While perfectly fine compression algorithms in their own right, they're also commonly used to squeeze the last bits of entropy out of a data stream produced by another compression or transformation algorithm. Arithmetic coding suffers from chilling effects caused by IBM patents, and so isn't as commonly used as it might be. An unencumbered alternative is range encoding, which gives performance not too far off that of arithmetic coding. Range encoding and arithmetic coding are both variants of the same basic technique of entropy encoding. That said, the compression difference between Huffman coding and arithmetic coding is minimal. I think (though I'm not entirely sure) entropy encoding might be a subset of a larger family of algorithms called Markov modelling.
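A small sketch of why the gap is small: for a fixed symbol distribution, the average Huffman code length always lands within one bit per symbol of the Shannon entropy, which is the figure an ideal arithmetic coder approaches. The function names and the sample string below are made up for illustration, and this is a toy order-0 model, not what any real archiver does.

```python
import heapq
import math
from collections import Counter


def shannon_entropy(counts):
    """Entropy in bits per symbol of the empirical distribution."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def huffman_code_lengths(counts):
    """Return {symbol: code length in bits} for a Huffman code."""
    # Heap entries are (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A lone symbol still needs one bit per occurrence.
        return {sym: 1 for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


text = b"abracadabra alakazam"
counts = Counter(text)
lengths = huffman_code_lengths(counts)
total = sum(counts.values())
avg_bits = sum(counts[s] * lengths[s] for s in counts) / total
entropy = shannon_entropy(counts)
print(f"entropy:  {entropy:.3f} bits/symbol")
print(f"huffman:  {avg_bits:.3f} bits/symbol")
```

Huffman loses a little because it must spend a whole number of bits on each symbol; arithmetic and range coders avoid that rounding.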

LZW is a refinement of LZ78, which has other variants such as LZSS. It is a dictionary coding algorithm. Similarly, the DEFLATE algorithm is based on LZ77, another variant of dictionary coding. gzip uses DEFLATE, as do xZip and PNG. DEFLATE first compresses the stream with an LZ77 variant, and then compresses the resulting stream with Huffman coding to squeeze out some redundancy. LZW is no longer covered by patents, at least not here in Europe.
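That gzip is just a thin container around a DEFLATE stream can be seen from Python's standard library, where `zlib` will happily decode what `gzip` wrote. A minimal sketch, with a made-up sample payload:

```python
import gzip
import zlib

payload = b"the quick brown fox jumps over the lazy dog. " * 200

# A gzip file is a DEFLATE stream plus a small header and CRC trailer.
gz = gzip.compress(payload)
# wbits = 16 + MAX_WBITS tells zlib to expect the gzip header and trailer.
via_zlib = zlib.decompress(gz, wbits=16 + zlib.MAX_WBITS)

# Negative wbits produces a raw DEFLATE stream with no container at all.
compressor = zlib.compressobj(level=9, wbits=-zlib.MAX_WBITS)
raw = compressor.compress(payload) + compressor.flush()
from_raw = zlib.decompress(raw, wbits=-zlib.MAX_WBITS)

print("gzip container:", len(gz), "bytes")
print("raw DEFLATE:   ", len(raw), "bytes")
```

Both streams carry the same LZ77-plus-Huffman encoding; only the wrapping differs.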

So what you wrote about Huffman coding, arithmetic coding and LZW was largely misinformed. There are two lossless methods: entropy encoding and dictionary coding, with Huffman coding and arithmetic coding representing the former and LZW representing the latter. Some compression algorithms combine the two, DEFLATE being an example.

--
I don't like trolls and mod against me if you like, but I'd prefer if you'd reply.
