Slashdot Mirror


A Look at Data Compression

With the new year fast approaching, many of us look to the unenviable task of backing up last year's data to make room for more of the same. That being said, rojakpot has taken a look at some of the data compression programs available and has a few insights that may help when looking for the best fit. From the article: "The best compressor of the aggregated fileset was, unsurprisingly, WinRK. It saved over 54MB more than its nearest competitor - Squeez. But both Squeez and SBC Archiver did very well, compared to the other compressors. The worst compressors were gzip and WinZip. Both compressors failed to save even 200MB of space in the aggregated results."

252 comments

  1. Speed by mysqlrocks · · Score: 3, Insightful

    No talk of the speed of compression/decompression?

    1. Re:Speed by Anonymous Coward · · Score: 0

      Dude, RTFA.

    2. Re:Speed by sedmonds · · Score: 4, Informative

      Seems to be a compression speed section on page 12 - Aggregated Results. Ranging from gzip really fast, to winrk really slow.

    3. Re:Speed by kailoran · · Score: 1

      Too bad they don't write a thing about DEcompression speeds. I'd say it would in many cases be more important than the compression speed.

    4. Re:Speed by Coneasfast · · Score: 1

      the site is almost slashdotted, very slow
      secondly, why do they have to put everything on 15 different pages, does it make it more organized? i think not. easier to read the article when everything is together.

      --
      Marge, get me your address book, 4 beers, and my conversation hat.
    5. Re:Speed by Anonymous Coward · · Score: 3, Insightful

      No talk of the speed of compression/decompression?

      Exactly! We compress -terabytes- here at wr0k, and we use gzip for -nearly- everything (some of the older scripts use "compress", .Z, etc.)

      Why? 'cause it's fast. 20% of space just isn't worth the time needed to compress/uncompress the data. I tried to be modern (and cool) by using bzip2, yes, it's great, saves lots of space, etc., but the time required to compress/uncompress is just not worth it. ie: if you need to compress/decompress 15-20gigs per day, bzip2 just isn't there yet.

      Also, look at what google is using---they probably store more data than most other corps, but they still use gzip (I think, from some description, somewhere).

    6. Re:Speed by Arainach · · Score: 3, Insightful

      The Article Summary quoted is completely misleading. The most important graph is the final one on page 12, Compression Efficiency, where gzip is once again the obvious king. Sure, WinRK may be able to compress decently, but it takes an eternity to do so and is impractical for every-day use, which is where routines like gzip and ARJ32 come in - incredible compression for the speed in which they can operate. Besides - who really needs that last 54MB in these days of 4.9GB DVDs and 160GB Hard Drives?

    7. Re:Speed by Luuvitonen · · Score: 5, Insightful

      3 hours 47 minutes with WinRK versus gzipping in 3 minutes 16 seconds. Is it really worth watching the progress bar for 200 megs smaller file?
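
To put the trade-off in numbers, here is a rough sketch using only the figures quoted in the comment above. The 200 MB saving is the commenter's round number, not a measured value.

```python
# Figures quoted in the comment; the 200 MB saving is a round number.
winrk_seconds = 3 * 3600 + 47 * 60   # 3 hours 47 minutes
gzip_seconds = 3 * 60 + 16           # 3 minutes 16 seconds
extra_saved_mb = 200

slowdown = winrk_seconds / gzip_seconds
extra_seconds_per_mb = (winrk_seconds - gzip_seconds) / extra_saved_mb

print(f"WinRK takes about {slowdown:.0f}x as long as gzip")
print(f"Each extra MB saved costs about {extra_seconds_per_mb:.0f} s of waiting")
```

On these figures, every additional megabyte saved costs roughly a minute of extra compression time.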

    8. Re:Speed by Karma+Farmer · · Score: 2, Interesting

      3 hours 47 minutes with WinRK versus gzipping in 3 minutes 16 seconds. Is it really worth watching the progress bar for 200 megs smaller file?

      If your file starts out as 250 mb, it might be worth it. However, if you start with a 2.5 gb file, then it's almost certainly not -- especially once you take the closed-source and undocumented nature of the compression algorithm into account.
       
      /not surprisingly, the article is about 2.5 gb files

    9. Re:Speed by sshore · · Score: 5, Informative

      They do it to sell more ad impressions. Each time you go to the next page you load a new ad.

    10. Re:Speed by emmetropia · · Score: 1

      Before anyone comments, I didn't read the article. However, most likely, the reason for "15 pages" instead of one, is because they would be displaying "15 pages with ads" instead of one, which would be potentially more ad revenue.

      That said, I hate it when they're broken up like that too.

    11. Re:Speed by eggnet · · Score: 1

      I'd imagine the decompression speeds are all fast.

    12. Re:Speed by Wolfrider · · Score: 2, Informative

      Yah, when I'm running backups and it has to Get Done in a reasonable amount of time with decent space savings, I use
      gzip -9. (My fastest computer is 900MHz AMD Duron.)

      For quick backups, gzip; or gzip -6.

      For REALLY quick stuff, gzip -1.

      When I want the most space saved, I (rarely) use bzip2 because rar, while useful for splitting files and retaining recovery metadata, is far too slow for my taste 99% of the time.

      Really, disk space is so cheap these days that Getting the Backup Done is more important than saving (on average) a few megs here and there.

      But if you Really Need that last few megs of free space, this is an OK guide to which compressor does that the best -- even if it takes *days.*
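
The -1 / -6 / -9 trade-off described above can be tried out with a small sketch like the following. It uses Python's standard gzip module rather than the gzip command line, and the sample data is made up purely for illustration, so the sizes and timings say nothing about real backups.

```python
import gzip
import time

def try_levels(data: bytes, levels=(1, 6, 9)):
    """Return {level: (compressed_size, seconds)} for each gzip level."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        packed = gzip.compress(data, compresslevel=level)
        elapsed = time.perf_counter() - start
        # Sanity check: the data must survive the round trip unchanged.
        assert gzip.decompress(packed) == data
        results[level] = (len(packed), elapsed)
    return results

# Hypothetical, highly repetitive sample data (log-like lines).
sample = b"".join(
    b"log line %d: status=OK user=%d\n" % (i, i % 97) for i in range(50_000)
)

for level, (size, seconds) in try_levels(sample).items():
    print(f"gzip -{level}: {size} bytes in {seconds:.3f} s")
```

Running it on your own data is the only way to see whether the higher levels are worth the extra time.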

      --
      .
      == WolfriderV6 == I'm willing to admit that *I just might* be wrong... Are you??
    13. Re:Speed by nlfmop · · Score: 1

      What's wrong with jar as a compression utility? In the tests I ran, I got almost equal compression vs. gzip with a -9 compression but jar was at least twice as fast. I used to tar then gzip large files but moved to jar about a year and a half ago. Its major flaw is that it doesn't understand symbolic links but there are ways around that. Am I missing something? I am a Solaris/UNIX guy. (first post)

    14. Re:Speed by megalomaniacs4u · · Score: 1
      What's wrong with jar as a compression utility? In the tests I ran, I got almost equal compression vs. gzip with a -9 compression but jar was at least twice as fast. I used to tar then gzip large files but moved to jar about a year and a half ago. Its major flaw is the that it doesn't understand symbolic links but there are ways around that.

      Jar (IIRC) uses the same algorithm as Gzip & Zip. If you looked at infozip (Zip for unix) which is available as a sun package (as zip & unzip respectively) you'll get symlink support as well.

      Am I missing something? I am a Solaris/UNIX guy.

      Yep, the whole tar-then-gzip thing suggests you're still using Sun's brain-dead tar. You should be using GNU tar (gtar?) to do the gzip at the same time, on the fly, using the z switch.

    15. Re:Speed by swv3752 · · Score: 1

      From what I have seen, jar is not as fast. Regardless, gzip is built right into tar with the -z flag so why bother using a utility that does not work with links.
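
For anyone scripting this rather than calling tar -z directly, here is a minimal sketch of the same idea using Python's standard tarfile module: tar and gzip happen in one pass, and symbolic links are stored as links. The directory and file names are hypothetical, and creating the symlink assumes a Unix-like system.

```python
import os
import tarfile
import tempfile

def make_tgz(source_dir: str, archive_path: str) -> None:
    """Write source_dir into a gzip-compressed tar in one pass."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny example tree: one file plus a relative symlink to it.
    src = os.path.join(tmp, "data")
    os.mkdir(src)
    with open(os.path.join(src, "report.txt"), "w") as f:
        f.write("hello\n" * 1000)
    os.symlink("report.txt", os.path.join(src, "latest"))

    archive = os.path.join(tmp, "data.tar.gz")
    make_tgz(src, archive)

    # Read the archive back and check how the link was stored.
    with tarfile.open(archive, "r:gz") as tar:
        members = {m.name: m for m in tar.getmembers()}
    link_kept = members["data/latest"].issym()
    print("symlink stored as a symlink:", link_kept)
```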

      --
      Just a Tuna in the Sea of Life
    16. Re:Speed by moro_666 · · Score: 5, Interesting

      if you download a file over gprs and each megabyte costs you 3$, then saving 200 megabytes means saving 600$, which is a price for a low-end pc or almost a laptop.

      another case is if you only have 100 megabytes you can use and only a zzzxxxyyy archiver can compress it into the 100mb while gzip -9 leaves you with 102mb.

      so it really depends if you need it or not. sometimes you need it, mostly you don't.

      but bashing on the issue "like nobody ever needs it" is certainly wrong.

      --

      I'd tell you the chances of this story being a dupe, but you wouldn't like it.
    17. Re:Speed by Hangeron · · Score: 3, Funny

      Oh, I wondered what the big empty blocks in the middle of the text were. I have ad blocking with this http://everythingisnt.com/hosts.html

    18. Re:Speed by Andy+Dodd · · Score: 1

      "if you download a file over gprs and each megabyte costs you 3$, then saving 200 megabytes means saving 600$, which is a price for a low-end pc or almost a laptop."

      When you're talking about data files on the order of 2.5 GB, someone is going to find ANY solution other than GPRS. When you're talking about GPRS, even transatlantic sneakernet would be faster (and cheaper).

      Plus many providers offer unlimited plans at higher monthly costs. (I know every US-based provider has unlimited data plans for under $100/month, and the U.S. is generally known for having significantly higher prices for mobile phone usage than anywhere else.)

      --
      retrorocket.o not found, launch anyway?
    19. Re:Speed by pyce · · Score: 1

      Best compression program for you -- cat (great compression for the speed!)

      --
      Hellenologophobia, n. -- a fear of Greek terms or complex terminology
    20. Re:Speed by Andy+Dodd · · Score: 1

      Well, if you're right at the barrier of the capacity of a DVD disc, 54MB may matter.

      That said, chances are that in such situations you're just going to be better off figuring a way to span multiple DVDs, especially given that while increasing compression might be enough for you today, chances are that you're going to exceed the capacity of that single DVD soon no matter what compression technique you'll use.

      --
      retrorocket.o not found, launch anyway?
    21. Re:Speed by MilenCent · · Score: 4, Insightful

      Don't you mean ads?

      The pages are shamefully loaded with ads! I could barely find the next-page links at the bottom of the window! At first, I thought a "Google Ad" link labeled "compression" might be the next page, and clicked on it! And the true link is oddly hidden in small print, in a corner beneath a large table of PriceGrabber comparison results.

      The article is basically unreadable, I'd say, due to the ads.

    22. Re:Speed by Crayon+Kid · · Score: 1

      If you need backup, go with bzip2. It also supports the -1 to -9 flags. AND it has error recovery, while gzip does not. One byte gone wrong and your gzip backup is toast.

      --
      i ate crayons when i was a kid and now i have two braincells and the blue ones taste nicer
    23. Re:Speed by Anonymous Coward · · Score: 0

      lzop is faster than gzip -1 for both compression and decompression. Compression ratio is typically a bit worse.

    24. Re:Speed by Dun+Malg · · Score: 1
      if you download a file over gprs and each megabyte costs you 3$, then saving 200 megabytes means saving 600$

      In order to save 200MB with WinRK over gzip, you'd need a 600MB file. What kind of idiot would send a file that big (400MB after compression) using $3/MB GPRS? Yeah, you're saving $600, but you're still spending $1200! Given the several hours longer WinRK needs over gzip, I could hire a boy to run a CDR down to the nearest internet cafe in less time, for less money, and he could bring me back a coffee to boot! As others have noted, there are usually no-limit data transmission plans for far less than you'd even save, much less spend, paying per-megabyte.

      another case is if you only have 100 megabytes you can use and only a zzzxxxyyy archiver can compress it into the 100mb while gzip -9 leaves you with 102mb.

      Can't imagine a single scenario in which a) I am up against a hard intermediate storage limit; b) I have plenty of computational power, storage, and time at both ends to allow such intensive compression/decompression; and c) I'm running Windows at each end!

      so it really depends if you need it or not. sometimes you need it, mostly you don't.

      I don't think you've successfully shown that. Do you have any specific examples where one might actually need it, or just these abstract thought experiments?
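
The arithmetic in this exchange can be sketched as follows. It reads the reply's figures as "400 MB after WinRK, with gzip producing 200 MB more", which is an interpretation of the comment rather than anything measured.

```python
PRICE_PER_MB = 3          # dollars per MB over GPRS, as quoted in the thread
winrk_mb = 400            # size after WinRK, per the reply
extra_with_gzip_mb = 200  # assumed: how much larger the gzip output would be

gzip_mb = winrk_mb + extra_with_gzip_mb
winrk_cost = winrk_mb * PRICE_PER_MB
gzip_cost = gzip_mb * PRICE_PER_MB

print(f"Sending the WinRK archive: ${winrk_cost}")
print(f"Sending the gzip archive:  ${gzip_cost}")
print(f"Saved by WinRK:            ${gzip_cost - winrk_cost}")
```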

      --
      If a job's not worth doing, it's not worth doing right.
    25. Re:Speed by Karma+Farmer · · Score: 2, Funny

      But, if you were using mobile phones to transfer a 2.5 GB file between two separate windows-only PCs, and you were willing to initiate a $10,000, 10 day file transfer using a proprietary windows-only compression scheme without any type of error correction or partial restart, then I agree that WinRK would be the best choice.

    26. Re:Speed by Killall+-9+Bash · · Score: 2, Insightful

      If I didn't click on any ads on pages 1 through 14, will I click on one on that 15th page?

      --
      "Prediction: within 10 years, Windows will be a Linux distribution." Me, 7-6-2016
    27. Re:Speed by Nutria · · Score: 2, Funny

      When you're talking about GPRS, even transatlantic sneakernet would be faster (and cheaper).

      "Never underestimate the bandwidth of a stationwagon full of tapes."

      or the updated "Never underestimate the bandwidth of a 747 filled with DVDs".

      Or the even more updated "Never underestimate the bandwidth of a 747 filled with 500GB HDDs".

      --
      "I don't know, therefore Aliens" Wafflebox1
    28. Re:Speed by moro_666 · · Score: 1

      look at these examples:

      1)the gprs case, you are in your switzerland country house in the bloody mountains, there ARE NO other ways to get to the network and you really NEED to send out your budget specifications to your partners, or otherwise you'll be bashed out of the business along with your company. this isn't as "james bond scenario" as it may sound, this is quite real. there are cases where gprs is the only bloody way to exchange data, you're not on a broadband connection 24/7 (or you're just having one damn boring life).

      2)data limit case. yo sherlock, have you ever found yourself in a situation where you have no network connection and no other way to store the data you need than a 128mb memory stick ? let's suppose you're supposed to be at your office after 12 hours and you have no cd's or portable harddrives you could use. let's say you're stuck in the same house in switzerland with your french lover and you're departing from there at different moments. you can't obviously take her laptop with you and there are no cd-stores next door (there is snow and cliffs "next door"), the only thing you can use is your bloody only memory stick and you just HAVE to fit your data on that.

      climb out of the box and you will see that there are cases where you are limited to really restricted resources and you need some extraordinary packagers ...

      i'm not saying that you need these packagers 24/7, i'm just saying that there are moments where they could help you out of real shit.

      --

      I'd tell you the chances of this story being a dupe, but you wouldn't like it.
    29. Re:Speed by ysegalov · · Score: 1

      And don't forget - never underestimate the bandwidth of snails with DVD wheels:
      http://science.slashdot.org/article.pl?sid=05/04/26/2251234

    30. Re:Speed by Tet · · Score: 1
      you are in your switzerland country house in the bloody mountains, there ARE NO other ways to get to the network and you really NEED to send out your budget specifications to your partners, or otherways you'll be bashed out of the business along with your company.

      Then you ensure your budget specifications are in a plain text file, rather than $BLOATED_PRESENTATION_FORMAT, so they actually have a chance of getting through. Or at worst, a PDF or similar, which will give you good presentation and a relatively small file size. There simply isn't a valid case I can think of for sending 400+ MB of content over GPRS. Sure, you could send a file that big, but it simply means it's in the wrong format. If you're having to resort to GPRS, it's usually because you really need to exchange information with someone. The amount of information you can fit in 400MB is quite astounding, and there is no way you'll need to transmit that to someone urgently.

      --
      "The invisible and the non-existent look very much alike." -- Delos B. McKown
    31. Re:Speed by ratatask · · Score: 1

      I second this. We use gzip to compress searchable telecom CDRs. gzip provides a very nice
      speed vs compression ratio. Searching through the CDRs, decompressing the files on the fly is still IO bound on recent hardware. bzip2'ing the files placed too much load on the CPU with our current scheme.

    32. Re:Speed by Anonymous Coward · · Score: 0

      The article is basically unreadable, I'd say, due to the ads.

      I would agree, but we still downloaded the ads, and I suspect that's all they care about at the end of the day.

    33. Re:Speed by TheLink · · Score: 1

      For speed I use lzop. In most cases it's a drop-in replacement for gzip. lzop is about 3x to 4x faster for my cases, and just a bit worse in compression (about 10%).

      With lzop I typically get 30MB/sec (default settings).

      Whereas with gzip I typically get about 8 to 10MB/sec, which often isn't close to network or disk transfer limits, and in those cases it will mean that things aren't getting done as fast as possible.

      gzip with --fast is still significantly slower than lzop and at those settings it compresses only about as well as lzop at default.

      I'd be interested to know if there's a drop-in replacement for gzip that's faster than lzop. But so far lzop works pretty well.
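
As a back-of-the-envelope sketch of what those throughput figures mean for a large job: the rates below are the ones quoted in this comment (30 MB/sec for lzop, 9 MB/sec as the midpoint of the gzip range), while the 20 GB job size is a hypothetical borrowed from the "15-20gigs per day" mentioned earlier in the thread.

```python
def compress_minutes(total_gb: float, mb_per_second: float) -> float:
    """Minutes needed to push total_gb through a compressor at a given rate."""
    return total_gb * 1024 / mb_per_second / 60

JOB_GB = 20  # hypothetical daily volume

lzop_minutes = compress_minutes(JOB_GB, 30)   # lzop at default settings
gzip_minutes = compress_minutes(JOB_GB, 9)    # gzip, midpoint of 8 to 10 MB/sec

print(f"lzop: about {lzop_minutes:.1f} minutes")
print(f"gzip: about {gzip_minutes:.1f} minutes")
```

This ignores disk and network limits entirely; it only shows how the compressor's own rate scales up over a large job.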

    34. Re:Speed by Anonymous Coward · · Score: 0

      "The article is basically unreadable, I'd say, due to the ads."

      ??? What ads? I saw no ads.

      OK, I looked at the source for the page, and it seems that there is a lot of JavaScript there. Since I have JavaScript disabled, I don't have a problem with the ads.

      Besides allowing crap like ads and other things, JavaScript provides a huge security hole in most browsers (including Mozilla, Firefox, and Oprah). Whenever a new security advisory is released, the temporary workaround is, more often than not, "disable scripting until a patch/new version is released". Well, why not just have all scripting disabled all of the time? Then you would rarely have a problem.

      People who have scripting enabled are asking for trouble, and frequently find it.

    35. Re:Speed by Lerc · · Score: 1

      Say, you're right. The links are down there. I totally failed to find them despite spending a while hunting for them. I had to resort to changing the page number in the url.

      It's gotta hurt them in the long run. I go to a fair few sites with ads regularly, but I'm not inclined to go back to theirs specifically because of the ads.

      --
      -- That which does not kill us has made its last mistake.
    36. Re:Speed by Anonymous Coward · · Score: 0

      I did not see any ads, why don't you use something like Adblock ?

    37. Re:Speed by MilenCent · · Score: 1

      I do use Flashblock. I'm considering it.

  2. More time = More compression by bigtallmofo · · Score: 4, Insightful

    For the most part, the summary of the article seems to be the more time that a compressing application takes to compress your files, the smaller your files will be after compressing.

    The one surprising thing I found in the article was that two virtually unknown contenders - WinRK and Squeez did so well. One disappointing obvious follow-up question would be how more well-known applications such as WinZip or WinRAR (which have a more mass-appeal audience) stack up against them with their configurable higher-compression options.

    --
    I'm a big tall mofo.
    1. Re:More time = More compression by Orgasmatron · · Score: 3, Funny

      Speaking of unknown compression programs, does anyone remember OWS?

      I had a good laugh at that one when I figured out how it worked, way back in the BBS days.

      --
      See that "Preview" button?
    2. Re:More time = More compression by undeadly · · Score: 2, Interesting
      For the most part, the summary of the article seems to be the more time that a compressing application takes to compress your files, the smaller your files will be after compressing.

      Not only time, but also how much memory the algorithm uses, though the author did not mention how much space each algorithm uses. gzip, for instance, does not use much, but others, like rzip (http://rzip.samba.org/), use a lot. rzip may use up to 900MB during compression.

      I did a test with compressing a 4GB tar archive with rzip, which resulted in a compressed file of 2.1 GB. gzip at max compression gave about 2.7 GB.

      So one should choose an algorithm based upon need, and of course, availability of source code. Using a proprietary, closed source compression algorithm with no open source alternative implementation is begging for trouble down the road.

    3. Re:More time = More compression by Rich0 · · Score: 5, Interesting

      If you look at the methodology - all the results were obtained using the software set to the fastest mode - not the best compression mode.

      So, I would consider gzip the best performer by this criterion. After all, if I cared most about space savings I'd have picked the best-mode - not the fast-mode. All this article suggests is that a few archivers are REALLY lousy for doing FAST compression.

      If my requirements were realtime compression (maybe for streaming multimedia) then I wouldn't be bothered with some mega-compression algorithm that takes 2 minutes per MB to pack the data.

      Might I suggest a better test? If interested in best compression, then run each program in a mode which optimizes purely for compression ratio. On the other hand, if interested in realtime compression then take each algorithm and tweak the parameters so that they all run in the same time (which is a relatively fast time), and then compare compression ratios.

      With the huge compression of multimedia files I'd also want the reviewers to state explicitly that the compression was verified to be lossless. I've never heard of some of these proprietary apps, but if they're getting significant ratios out of .wav and .mp3 files I'd want to do a binary compare of the restored files to ensure they weren't just run through a lossy codec...
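
The binary compare suggested above is easy to automate. Here is a minimal sketch that round-trips data through a compressor and compares SHA-256 digests. It uses the codecs in Python's standard library as stand-ins, since the proprietary archivers from the review are not scriptable this way; the sample data is made up.

```python
import bz2
import hashlib
import lzma
import zlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify_lossless(data: bytes, compress, decompress) -> bool:
    """True if decompress(compress(data)) is byte-for-byte identical to data."""
    restored = decompress(compress(data))
    return sha256(restored) == sha256(data)

# Stand-in codecs from the standard library.
codecs = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

sample = bytes(range(256)) * 4096  # hypothetical 1 MiB test input

for name, (compress, decompress) in codecs.items():
    print(name, "lossless:", verify_lossless(sample, compress, decompress))
```

For an external archiver the same check applies: hash the original files, extract the archive somewhere else, and hash the restored copies.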

    4. Re:More time = More compression by Anonymous Coward · · Score: 1, Informative

      looks like you can still grab a copy of it here.

    5. Re:More time = More compression by Anonymous Coward · · Score: 0
      I've never heard of some of these proprietary apps, but if they're getting significant ratios out of .wav and .mp3 files I'd want to do a binary compare of the restored files to ensure they weren't just run through a lossy codec...

      That should be done for all contenders and all files, or dropped as a consideration altogether.

    6. Re:More time = More compression by jZnat · · Score: 1

      As far as I understand, memory usage becomes an issue usually with block-sorting algorithms. The more data you analyze at a time, the larger memory usage you will have.
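
To illustrate why block-sorting needs the whole block in memory, here is a deliberately naive sketch of the Burrows-Wheeler transform, the block-sorting step used by bzip2. Real implementations use far more memory-efficient sorting; this version materialises every rotation and is only meant to show that the sort has to see the entire block at once.

```python
from typing import Tuple

def bwt(block: bytes) -> Tuple[bytes, int]:
    """Naive Burrows-Wheeler transform: sort all rotations of the block.

    Returns the last column of the sorted rotations and the row index
    at which the original block appears.
    """
    n = len(block)
    rotations = sorted(range(n), key=lambda i: block[i:] + block[:i])
    last_column = bytes(block[(i - 1) % n] for i in rotations)
    return last_column, rotations.index(0)

def inverse_bwt(last_column: bytes, primary: int) -> bytes:
    """Rebuild the original block from the last column and primary index."""
    n = len(last_column)
    # A stable sort of positions by byte value links each row to the row
    # holding the same rotation shifted left by one byte.
    order = sorted(range(n), key=lambda i: last_column[i])
    out = bytearray()
    row = primary
    for _ in range(n):
        row = order[row]
        out.append(last_column[row])
    return bytes(out)

transformed, primary = bwt(b"banana")
print(transformed, primary)
print(inverse_bwt(transformed, primary))
```

The larger the block, the more context the sort can group together, and the more memory it needs to do so.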

      --
      'Yes, firefox is indeed greater than women. Can women block pops up for you? No. Can Firefox show you naked women? Yes.'
    7. Re:More time = More compression by tzanger · · Score: 1

      Hahahaha Yes i remember OWS. I wonder how many people actually lost data to it...

    8. Re:More time = More compression by saranagati · · Score: 1

      Compression has a lot to do w/ what type of data you're compressing. I did a test a couple months ago and compressed 2557 megs of rom's w/ bzip2 and 7zip. bzip2 only managed to compress it down to 1.2 gigs while 7zip got it down to 218megs.

      --
      Give a man a match and he'll be warm for a minute, set him on fire and he'll be warm for the rest of his life.
  3. Compressia by ardor · · Score: 1

    I always wanted to know how Compressia ( http://www.compressia.com/ ) works. It uses some form of distance coding, but information about it is quite rare.

    --
    This sig does not contain any SCO code.
    1. Re:Compressia by Insurgent2 · · Score: 2, Informative
    2. Re:Compressia by ardor · · Score: 1

      I know this part. The real deal is how to encode the results. Usually, MTF & entropy coding is used. AFAIK Compressia uses distance coding instead of MTF.
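
For readers who have not met MTF, here is a minimal sketch of move-to-front coding, the usual stage between block sorting and entropy coding. This is the textbook technique only; it says nothing about how Compressia's distance coding actually works.

```python
def mtf_encode(data: bytes) -> list:
    """Replace each byte by its current position in a recency list."""
    table = list(range(256))
    out = []
    for byte in data:
        index = table.index(byte)
        out.append(index)
        # Move the byte just seen to the front of the list.
        table.insert(0, table.pop(index))
    return out

def mtf_decode(indices) -> bytes:
    """Invert mtf_encode by replaying the same list updates."""
    table = list(range(256))
    out = bytearray()
    for index in indices:
        byte = table.pop(index)
        out.append(byte)
        table.insert(0, byte)
    return bytes(out)

encoded = mtf_encode(b"nnbaaa")
print(encoded)
print(mtf_decode(encoded))
```

Runs of the same byte turn into runs of zeros, which is what makes the output easy for the entropy coder that follows.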

      --
      This sig does not contain any SCO code.
  4. WinRK is excellent by drsmack1 · · Score: 4, Interesting

    Just downloaded it and I find that it compresses significantly better than winrar when both are set to maximum. Decompress is quite slow. I use it to compress a small collection of utilities.

  5. Nice Comparison... by Goo.cc · · Score: 4, Insightful

    but I was surprised to see that the reviewer was using XP Professional Service Pack 1. I actually had to double check the review date to make sure that I wasn't reading an old article.

    I personally use 7-Zip. It doesn't perform the best but it is free software and it includes a command line component that is nice for shell scripts.

    1. Re:Nice Comparison... by lowid+(24)+_________ · · Score: 1

      Article is slow, so I can't speak to it specifically, but I personally still use XPSP1 for audio work because the sp2 firewall creates a lot of instability. (This was my opinion, and I later discovered it to be a general consensus in the audio community.) For people who need their windows box to be as stable as possible, it's probably best to stick with sp1 for a while.

    2. Re:Nice Comparison... by Johnno74 · · Score: 1

      Why not install SP2 and use a different firewall then? I hadn't heard about any SP2 firewall problems before, but I don't use it (MS's firewall) anyway - I use Kerio 2.1.4 (the last good ver before they became bloated)

      My box has always been fairly stable, but even more so under SP2.

    3. Re:Nice Comparison... by fbjon · · Score: 1

      What kind of instability, specifically? I haven't noticed anything.

      --
      True confidence comes not from realising you are as good as your peers, but that your peers are as bad as you are.
    4. Re:Nice Comparison... by IamTheRealMike · · Score: 1
      On UNIX systems at least the LZMA codec is excellent - it regularly achieves better ratios than bzip2, and is very fast to decompress. For many applications, decompression speed is more important than compression speed and the LZMA dictionary appears to fit inside the CPU cache, as it beats out bzip2 handily even though it's doing more work.

      There are better compressors out there, in particular PPM codecs can achieve spectacular ratios, but as they're very slow to both compress and decompress they're useful mostly for archiving.

      I've also seen great results from codecs tuned specifically to certain types of data over others, for instance, a PPM codec designed specifically for Intel x86 executable code can work wonders.
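
A quick way to compare these codec families on your own data is sketched below, using the zlib (gzip's algorithm), bz2 and lzma modules from Python's standard library. The sample text is made up, and results vary a lot with the input, so no particular ranking should be read into it.

```python
import bz2
import lzma
import zlib

def compare_codecs(data: bytes) -> dict:
    """Return the compressed size in bytes for each standard-library codec."""
    return {
        "zlib -9": len(zlib.compress(data, 9)),
        "bz2 -9": len(bz2.compress(data, 9)),
        "lzma": len(lzma.compress(data)),
    }

# Hypothetical sample: repetitive English-like text.
sample = b"".join(
    b"The quick brown fox jumps over lazy dog number %d.\n" % i
    for i in range(20_000)
)

sizes = compare_codecs(sample)
print(f"original: {len(sample)} bytes")
for name, size in sizes.items():
    print(f"{name}: {size} bytes ({size / len(sample):.1%} of original)")
```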

    5. Re:Nice Comparison... by Anonymous Coward · · Score: 2, Interesting

      I have ported ppmd to a nice pzip style utility and a pzlib style library. Find it at http://pzip.sf.net/

      Speed is better than bzip2 and compression is top class, beaten only by 7zip and LZMA compressors (which require much more speed and memory). Problem is that decompression is the same speed as the compression, unlike bzip2/gzip/zip where the decompression is much faster.

      The review quoted above is totally useless because 7zip for example uses a 32Kb dictionary. Given a 200Mb dictionary it really starts to perform quite well! I would not be surprised if 7zip didn't come out the winner there given a better compression parameter.
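
The effect of dictionary size can be demonstrated with a small sketch using the LZMA2 filter in Python's standard lzma module. The input is artificial: a block of pseudo-random bytes followed by an exact copy of itself, so the only redundancy is 256 KiB away. The 32 KiB and 1 MiB dictionary sizes are chosen to keep the example fast and are not the settings used in the review.

```python
import lzma
import random

def xz_size(data: bytes, dict_size: int) -> int:
    """Compressed size of data using LZMA2 with the given dictionary size."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))

# Artificial input: incompressible block, then the same block again.
rng = random.Random(0)
block = bytes(rng.getrandbits(8) for _ in range(256 * 1024))
data = block + block

small = xz_size(data, 32 * 1024)     # dictionary too small to reach the copy
large = xz_size(data, 1024 * 1024)   # dictionary spans the whole input

print(f"input:            {len(data)} bytes")
print(f"32 KiB dictionary: {small} bytes")
print(f"1 MiB dictionary:  {large} bytes")
```

With the small dictionary the second copy is out of reach and nothing is saved; with the larger one the repeat is found and the output shrinks to roughly half.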

    6. Re:Nice Comparison... by phorm · · Score: 1

      Another nice free one is IZarc, which can handle some of the non-windows formats (tar, gzip, etc) in addition to most of the windows ones (zip, ace, rar, etc).

  6. horrible site interface by the_humeister · · Score: 0, Redundant

    Is it just me or is that site really difficult to navigate amongst all those ads? Speed of compression would have been nice too.

    1. Re:horrible site interface by the_humeister · · Score: 0, Redundant

      Looks like I posted too fast. There's a speed comparison somewhere around there...

    2. Re:horrible site interface by reset_button · · Score: 1

      It's actually quite easy - just keep incrementing the "pgno" variable in the URL :)

    3. Re:horrible site interface by rolandog · · Score: 1

      A bit difficult, yes... but I advise you to use Firefox and the NoScript extension. You can temporarily allow that site to execute its scripts, but you won't allow the ad companies to run whatever they run on you.

  7. Windows only by Jay+Maynard · · Score: 2, Interesting

    It's a real shame that 1) the guy only did Windows archivers, and 2) SBC Archiver is no longer in active development, closed source, and Windows-only.

    --
    Disinfect the GNU General Public Virus!
  8. Actually by Sterling+Christensen · · Score: 5, Interesting

    WinRK may have won only because he used the fast compression setting on all the compressors he tested. Results for default setting and best compression settings are TBA.

  9. This is a surprisingly big subject by derek_farn · · Score: 4, Informative

    There are some amazing compression programs out there, trouble is they tend to take a while and consume lots of memory. PAQ gives some impressive results, but the latest benchmark figures are regularly improving. Let's not forget that compression is not good unless it is integrated into a usable tool. 7-zip seems to be the new archiver on the block at the moment. A closely related, but different, set of tools are the archivers, of which there are lots, with many older formats still not supported by open source tools.

    1. Re:This is a surprisingly big subject by IamTheRealMike · · Score: 1
      I've tried PAQ before and it can achieve good results, especially for text, but given the extremely slow nature of the algorithm I judged it not a good enough improvement over LZMA for the autopackage installers.

      Still, worth remembering, especially as these algorithms are being improved all the time.

  10. Open formats and long-term accessibility by ahziem · · Score: 5, Insightful

    A key benefit to PKZIP and tarballs formats is that they will be accessible for decades or hundreds of years. These formats are open (non-proprietary), widely implemented, and free (as in freedom) software.

    The same can't be said for WinRK. Therefore, if you plan to want access to your data for a long period of time, you should carefully consider whether the format will be accessible.

  11. Unix compressors by brejc8 · · Score: 5, Interesting

    I did a short review and benchmarking of unix compressors people might be interested in.

    1. Re:Unix compressors by Queuetue · · Score: 1

      Thanks for this - you helped me take the plunge and update my remote backup scripts ... They now take about 1/10th the time to transfer and space to store, all by changing gzip to 7z in 4 or 5 places!

    2. Re:Unix compressors by TypoNAM · · Score: 1

      What is up with this? "Also do note the lack of ace results as there are no Unix ace compressors." on your Compression times page, yet somehow you were able to do it on your Size page. Kind of funny isn't it?

      --
      This space is not for rent.
    3. Re:Unix compressors by Anonymous Coward · · Score: 0

      Which options did you use in your tests for compressors that let you to choose between "fast" and "maximum" compression? (such as the -1 to -9 flags in gzip)

    4. Re:Unix compressors by brejc8 · · Score: 1

      For these I chose the highest compression for each test. I chose to do that because most of the compressors assume the max compression option (e.g. bzip2 assumes -9) and I was more interested in the size rather than the speed.

    5. Re:Unix compressors by brejc8 · · Score: 1

      I used winace to do the compression. The idea is to determine which format to distribute the files in and ace is still possible due to the linux decompressor.

    6. Re:Unix compressors by Justin205 · · Score: 1

      No, not funny at all. Perfectly sensible.

      Size doesn't depend on the OS it was compressed on (generally - perhaps a small bit, at most). So he compressed it for size on Windows (or an OS with an ACE compressor).

      Speed, however, does depend on the OS it was compressed on. Much more than size, at any rate. So the results would have been skewed in one direction or the other, due to the OS.

      --
      "Your effort to remain what you are is what limits you."
    7. Re:Unix compressors by molo · · Score: 1

      Thanks for that benchmark. It might be interesting to see a plot of size vs. compression time.

      -molo

      --
      Using your sig line to advertise for friends is lame.
    8. Re:Unix compressors by TypoNAM · · Score: 1

      So, it is perfectly sensible to include a non-UNIX compression utility in a UNIX compression utilities review? WinAce does have a decompressor for UNIX, but no compressor, therefore shouldn't it have been dropped completely for this review because of that? Because it is irrelevant to this review if there is no UNIX compressor for it and this is a UNIX compression review.

      Sorry for nitpicking, but come on, how can you use WinACE on Windows to do the size compressions and then use all the other compressors on UNIX? That really skews the results. This review doesn't seem so trustworthy now, does it, given the complete lack of details as to what environments he really did use. It seems just like those half-ass'd hardware reviews where they provide you no real information and just show off what they want you to see, and not the whole picture.

      --
      This space is not for rent.
    9. Re:Unix compressors by GigsVT · · Score: 1

      About stuffit linux... Unless something's changed, they talk a lot about it being a time limited trial version, but it never expires. At least the copies I used a few years ago didn't.

      --
      I've had enough abrasive sigs. Kittens are cute and fuzzy.
    10. Re:Unix compressors by twistedcubic · · Score: 1

      Why don't you add to the list the compressor (bzz) that ships with the djvu tools.

    11. Re:Unix compressors by Anonymous Coward · · Score: 0

      I don't understand how this skews the results. He is looking at making a comparison of archivers which can decompress in UNIX-like OSes. Yes, ACE compresses very nicely, but 7-zip, which is LGPL and available for Linux (I know; I just compressed with 7-zip on Linux) compresses even more nicely. BTW, 7-zip is a bit slower but compresses better than rzip.

      Yes, maybe there should be a comparison of open-source compressors out there, and yes maybe he should have made it clear there isn't a UNIX ace compressor, but this comparison was useful enough for me.

    12. Re:Unix compressors by TheLink · · Score: 1

      gzip and lzop do not assume the max compression option.

      lzop in default is MUCH faster than gzip.

      I would not recommend using lzop in anything other than the default setting - it gets a lot slower when you set it to max, for not very much gain. If you want more compression and less speed, use gzip --fast instead of lzop.

      In fact sometimes lzop in minimal compression mode is slower than lzop in default!

    13. Re:Unix compressors by TheLink · · Score: 1

      Not exactly what you asked for, but this site might be helpful.

  12. Quite interesting by shayera · · Score: 1

    I'd like to see an article about exe compressors done like this.
    There are some interesting beasts out there like UPX, which as far as I remember does quite respectable packing on the win32 platform.

    the WinRK archive compressor tested here seems to achieve quite amazing results at the cost of speed.. a lot of speed..

    --
    Venlig Hilsen / Regards
    John Hinge - shayera / .sPOOn.
    "Buffy I love you... Please God No!" S
    1. Re:Quite interesting by _Shorty-dammit · · Score: 1

      I never understood the point of exe compressors, once HDs made it past the megabyte stage, well, there wasn't much point. And it's worse for distribution, since your archiving program will compress it better anyways if you hadn't UPXed it. Whenever I get something that's been UPXed, the first thing I do is decompress it.

    2. Re:Quite interesting by StillAnonymous · · Score: 1

      I find that exe compressors are generally used more for their ability to be exe encrypters.

      So long as you don't use some known and easily decompressable packer (like UPX), it adds a layer of protection to the program that prevents people from just hex editing the contents and patching out protection routines. They have to go through the trouble of decompressing the file first. That or write a loader that performs the patch in memory after the program has unpacked.

  13. Just use DiskDoubler by mattkime · · Score: 5, Funny

    Why mess around with compressing individual files? DiskDoubler is definitely the way to go. Hell, I even have it set up to automagically compress files I haven't used in a week.

    It's running perfectly fine on my Mac IIci.

    --
    Know what I like about atheists? I've yet to meet one that believes God is on their side.
    1. Re:Just use DiskDoubler by SleepyHappyDoc · · Score: 3, Funny

      Mac IIci? Has it finished compressing files since you bought it?

      --
      Stasis is death. Embrace change.
    2. Re:Just use DiskDoubler by fbjon · · Score: 2, Insightful

      I prefer DoubleSpace for maximum file-destroying activity.

      --
      True confidence comes not from realising you are as good as your peers, but that your peers are as bad as you are.
    3. Re:Just use DiskDoubler by toddestan · · Score: 1

      In all seriousness, for most people with a lot of files, drive compression programs aren't going to help them very much as most of the files people tend to accumulate (movies, mp3, jpegs) are already pretty well compressed. For everyone else, a 40GB drive is probably all the space they will ever need.

    4. Re:Just use DiskDoubler by hawaiian717 · · Score: 1

      Sure, since the IIci came with either a 40MB or 80MB hard drive, it should have finished long ago.

      However, AutoDoubler was probably what he wanted. DiskDoubler was your basic compression/decompression program basically like WinZip, whereas AutoDoubler would go through and automatically compress most of the files on your disk (I think it would normally be configured so that it didn't compress the System Folder, which would be a Bad Thing). Though personally, I liked Stacker, which installed itself into your disk driver.

      --
      End of Line.
  14. Why compress in the first place? by mosel-saar-ruwer · · Score: 1, Interesting
    No talk of the speed of compression/decompression?

    Speed aside [and speed would be a huge concern if you insisted on compression], I just don't understand the desire for compression in the first place.

    As the administrator, your fundamental obligation is data integrity. If you compress, and if the compressed file store is damaged [especially if the header information on a compressed file - or files - is damaged], then you will tend to lose ALL of your data.

    On the other hand, if your file store is ASCII/ANSI text, then even if file headers are damaged, you can still read the raw disk sectors and recover most of your data [might take a while, but at least it's theoretically do-able].

    In this day and age, when magnetic storage is like $0.50 to $0.75 per GIGABYTE, I just can't fathom why a responsible admin would risk the possible data corruption that could come with compression.

    1. Re:Why compress in the first place? by ArbitraryConstant · · Score: 5, Insightful

      "I just don't understand the desire for compression in the first place."

      Sometimes, people have to download things.

      --
      I rarely criticize things I don't care about.
    2. Re:Why compress in the first place? by topham · · Score: 2, Insightful

      I'd call you a troll, but I think you were being honest.

      Compressing files with a good compression program does not increase the chance of it being corrupted.

      And, the majority of files people send to each other, etc, aren't simply ascii files. (even if yours are).

      The other advantage of using a compression program is the majority of them create archives and allow you to consolidate all the related files.

      A good archive/compression program will add a couple of percent of redundancy data which can substantially increase the data integrity. Above and beyond that which you have by simply storing an ASCII file uncompressed.

      My concern with all the 'new' compression programs is that they, unlike Zip, haven't survived the test of time. I've recovered damaged zip archives in the past and they have come through mostly intact. I've used archive/compression like ARJ with options to be able to recover data even if there are multiple bad sectors on a harddrive or floppy disk. How many of the new compression programs have the tools available to adequately recover every possible byte of data?

    3. Re:Why compress in the first place? by Jeff85 · · Score: 1

      Well what if you wish to transfer this data in a timely fashion? Sending less (read: compressed) data would take less time in this regard, though it's worth noting that the time required to decompress the data may make the total time to retrieve the original data longer than it would to just send the uncompressed data. So I think compression and decompression speed are also important factors.
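
      A rough way to put numbers on that trade-off is to compare the raw transfer time with compress + send + decompress. This is only a sketch: the file size, 2:1 ratio, link speeds and codec speeds below are made-up assumptions, and the three steps are assumed to run one after another rather than streamed.

```python
def transfer_seconds(size_bytes, link_bytes_per_s):
    """Time to push size_bytes over a link of the given speed."""
    return size_bytes / link_bytes_per_s


def compressed_total_seconds(size_bytes, ratio, link_bytes_per_s,
                             compress_bytes_per_s, decompress_bytes_per_s):
    """Compress, send the smaller file, then decompress, one step after another."""
    compress = size_bytes / compress_bytes_per_s
    send = transfer_seconds(size_bytes / ratio, link_bytes_per_s)
    decompress = size_bytes / decompress_bytes_per_s
    return compress + send + decompress


# Hypothetical figures, chosen only to illustrate the trade-off.
SIZE = 1_000_000_000   # 1 GB of data
RATIO = 2.0            # compressed file is half the original size

slow_link = 500_000      # 500 kB/s
fast_link = 50_000_000   # 50 MB/s
codec = dict(compress_bytes_per_s=10_000_000, decompress_bytes_per_s=40_000_000)

for name, link in (("slow link", slow_link), ("fast link", fast_link)):
    raw = transfer_seconds(SIZE, link)
    packed = compressed_total_seconds(SIZE, RATIO, link, **codec)
    print(f"{name}: raw {raw:.0f} s, compressed {packed:.0f} s")
```

      With these assumed numbers compression wins on the slow link and loses on the fast one, which is the point being made: the answer depends on the link and on how fast the compressor is.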

      --
      Fetch Text URL - Firefox Extension
    4. Re:Why compress in the first place? by Ironsides · · Score: 4, Interesting

      In this day and age, when magnetic storage is like $0.50 to $0.75 per GIGABYTE, I just can't fathom why a responsible admin would risk the possible data corruption that could come with compression.

      Because when you are storing Petabytes of information it makes a difference in cost.

      Besides, all the problems you mention with data corruption can be solved by backing up the information more than once. Anyplace that places a high value on there info is going to have multiple backups in multiple places anyways. The most useful application of compression is in archiving old customer records. Being mostly text, you can easily get above 50% compression ratios. Also, these are going to be backed up to tape (not disk). Being able to reduce the volume of tapes being stored by 50% can save a lot of money for a large organization.

      --
      Fly me to the moon Let me sing among those stars Let me see what spring is like On jupiter and mars
    5. Re:Why compress in the first place? by ArbitraryConstant · · Score: 4, Interesting

      "My concern with all the 'new' compression programs is that they, unlike Zip, haven't survived the test of time. I've recovered damaged zip archives in the past and they have come through mostly intact. I've used archive/compression like ARJ with options to be able to recover data even if there are multiple bad sectors on a harddrive or floppy disk. How many of the new compression programs have the tools available to adequately recover every possible byte of data?"

      The solution to this issue is popular on usenet, since it's common for large files to be damaged. There's a utility called par2 that allows recovery information to be sent, and it's extremely effective. It's format-neutral, but most large binaries are sent as multi-part RAR archives. par2 can handle just about any damage that occurs, up to and including missing files.

      Most of the time however, when it's simply someone downloading something it is only necessary to detect damage so they can download it again. All the formats I have experience with can detect damage, and it's common for MD5 and SHA1 sums to be sent separately anyway for security reasons.
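
      For the "just detect damage and download again" case, checking a separately published sum is a few lines with Python's standard hashlib. A minimal sketch; the function names `file_digest` and `is_intact` are mine, and SHA-1 is used only because it is one of the sums mentioned above.

```python
import hashlib


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in chunks so large downloads need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def is_intact(path, published_hex, algorithm="sha1"):
    """True when the local file hashes to the published value."""
    return file_digest(path, algorithm) == published_hex.strip().lower()
```

      Note that this only tells you the file is damaged. Repairing it is what the par2 recovery data is for.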

      --
      I rarely criticize things I don't care about.
    6. Re:Why compress in the first place? by Anonymous Coward · · Score: 0

      Plain text compresses beyond 10:1, making storage costs even cheaper (still 0.50 per GB, but that GB really stores 10GB of data, so more like 5 cents/GB in the end).

      And yes, storage is cheaper than ever, but it's still somewhat expensive. I wish I could afford all the petabytes I want, but it still cost me ~500$CDN for 4 250GB drives (and bigger drives only cost more per GB). And no, it's not for pr0n, it's for music, movies (in H.264), ebooks, training videos, GBs of travel/family photos from my DSLR, etc. I wish I could afford to have mirrorring on some of my stuff, but that means even more HDs. Now, if I didn't use compression (rar/zip - not in the sense of mp3/mpeg4), I'd perhaps need twice that space. I'd need a 2nd job to buy HDs or something! Storage is STILL expensive!

    7. Re:Why compress in the first place? by Anonymous Coward · · Score: 0

      I always compress everything, even if I don't need the space savings, often just so that the compression checksum will provide assurance of the content's integrity (or alert me to subtle damage).
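
      To illustrate the checksum point: gzip stores a CRC-32 of the original data, so a damaged archive normally fails loudly on decompression instead of handing back quietly wrong bytes. A small sketch using Python's standard gzip module; the sample data and the position of the flipped bit are arbitrary choices.

```python
import gzip
import zlib


def survives_round_trip(blob):
    """Return True if the gzip blob decompresses and passes its CRC-32 check."""
    try:
        gzip.decompress(blob)
        return True
    except (OSError, EOFError, zlib.error):
        # BadGzipFile (an OSError) covers CRC and length mismatches;
        # zlib.error covers a deflate stream that no longer parses.
        return False


original = b"travel photo index, 2005\n" * 2000
archive = gzip.compress(original)

damaged = bytearray(archive)
damaged[len(damaged) // 2] ^= 0x01  # flip a single bit mid-archive

print(survives_round_trip(archive))         # intact copy
print(survives_round_trip(bytes(damaged)))  # damaged copy
```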

      Also by compressing I can often span an archive across fewer CDs than would otherwise be needed, which reduces the risk of damage.

      Then I make a redundant copy. Altogether, for roughly the same amount of storage media and just a tad extra effort, I get redundancy and integrity verification, instead of neither.

    8. Re:Why compress in the first place? by LWATCDR · · Score: 3, Interesting

      "As the administrator, your fundamental obligation is data integrity. If you compress, and if the compressed file store is damaged [especially if the header information on a compressed file - or files - is damaged], then you will tend to lose ALL of your data."
      Not all data is stored in ASCII and/or ANSI. Compressing the data can make it more secure, not less.
      1. It takes up fewer sectors of a drive so it is less likely to get corrupted.
      2. Can contain extra data to recover from bad bits.
      3. Allows you to make redundant copies without using any more storage space.
      Let's say that you have some files that are in ASCII you want to store. Using any compression method you can probably store 3 copies of the file using the same amount of disk space.
      You are far more likely to recover a full data set from three copies of compressed file than from one copy of an uncompressed file.

      Also we do not have unlimited bandwidth and unlimited storage EVERYWHERE. Lossless video, image, and audio files take up a lot of space. For some applications MP3, Ogg, MPG, and JPEG just don't cut it.
      So yes compression still is important.

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    9. Re:Why compress in the first place? by Anonymous Coward · · Score: 0, Interesting

      Bah, speak for yourself. In this day when everyone has a 100 Mbit connection (at least around pretty much in this country; hint, not USA), to be honest, compressed content is actually a hassle. For instance, when I'm downloading your latest movie on DVD-R, it's usually packed in RARs, saving a few 100 Mb if that at best. But who cares? When my download speed is pretty much limited by my harddrive, I'd rather spend the extra 10 seconds to get everything uncompressed instead of having to wait 10 minutes to unpack the damn thing.

    10. Re:Why compress in the first place? by Potor · · Score: 1
      well, if i want to ftp my nightly backup to a remote server, it's easier to combine these files into one file and then ftp that file - and what a better way than simply to compress a folder? it's either that, or ftp'ing each file independently.

      compression can have more uses than simply saving space.

    11. Re:Why compress in the first place? by DeadboltX · · Score: 3, Informative

      Sounds like you need to introduce yourself to the world of par2 ( http://www.quickpar.org.uk/ )

      Parity reconstruction

      Think of it like the year 2805 where scientists can regrow someones arm if they happen to lose it

    12. Re:Why compress in the first place? by ysegalov · · Score: 1

      You sound a bit like Bill Gates who said nobody will need more than 640Kb of RAM..

      Also, data corruption has nothing to do with compression. Take an uncompressed EXE file, mess up a couple of bytes - and the whole package is useless.

    13. Re:Why compress in the first place? by Master+of+Transhuman · · Score: 0, Troll


      Compressing files intended for BACKUP, as opposed to DOWNLOAD, DOES increase the chance of losing the entire file. That was the poster's point and it is entirely correct.

      NEVER use compression on a backup unless you have PAR files you can use to recover the lost data if a bad sector on a CD, DVD, or bad block on a tape is discovered on restoration.

      The Disk Archive (DAR) program is one of the few backup programs that can generate PAR files during the backup.

      --
      Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
    14. Re:Why compress in the first place? by 10101001+10101001 · · Score: 1

      And the answer to this is simple. Compression removes redundant data. In that space where redundant data would have gone, you can include some par2 files. Together the par2 files and compressed data will more than likely take up fewer total sectors. Now, assuming that the odds of having a bad sector are fixed, using fewer total sectors while still having a means of correcting for bad sectors greatly increases the odds not only that you won't have a problem, but that if you do have a problem there won't be any data loss. Tack onto this the fact that by compressing it you can make two copies on two separate DVDs instead of making one copy spread out on two DVDs, and you've greatly increased your odds that you won't suffer partial data loss.
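
      The sector argument can be put into numbers with a very simplified model. Assumptions, all mine: every sector fails independently with the same probability, compression is 2:1, and with two copies the data counts as lost only if both copies contain a bad sector. The failure rate and sector counts are invented for illustration.

```python
def p_any_bad(sectors, p_bad):
    """Chance that at least one of `sectors` independent sectors is bad."""
    return 1.0 - (1.0 - p_bad) ** sectors


P_BAD = 1e-6                # assumed per-sector failure rate (made up)
RAW_SECTORS = 2_000_000     # hypothetical uncompressed backup
PACKED_SECTORS = 1_000_000  # same data at an assumed 2:1 compression

one_raw_copy = p_any_bad(RAW_SECTORS, P_BAD)
one_packed_copy = p_any_bad(PACKED_SECTORS, P_BAD)
# Two compressed copies on separate discs fill the same space as one raw copy.
# In this simplified model the data is lost only if both copies are damaged.
both_packed_copies = one_packed_copy ** 2

print(f"{one_raw_copy:.3f} {one_packed_copy:.3f} {both_packed_copies:.3f}")
```

      Real media fail in clusters rather than independently, so treat this as a direction, not a prediction.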

      Now, if you were talking about an analog source, I'd have a much better understanding of why you'd be against digitizing and compressing it. But then analog is innately more fault tolerant.

      --
      Eurohacker European paranoia, gun rights, and h
    15. Re:Why compress in the first place? by dotgain · · Score: 1

      NcFTP has a cool 'tar' mode. When you connect with a similarly capable FTP server, and recursively put or get a folder, each end pipes it through the local 'tar' program, and only one transaction takes place. It's really cool. Anyway, what self-respecting nerd would need compression. Everybody knows pr0n won't compress one little bit, unless it's ASCII.

    16. Re:Why compress in the first place? by stud9920 · · Score: 1
      Anyplace that places a high value on there info is going to have multiple backup
      On where info ?
    17. Re:Why compress in the first place? by dotgain · · Score: 1

      Or you end up fixing a bug...

    18. Re:Why compress in the first place? by khallow · · Score: 1

      Sounds like you have a problem with your hardware. Your harddisk shouldn't be that kind of a bottleneck.

    19. Re:Why compress in the first place? by Anonymous Coward · · Score: 0

      Tapes are slow. If you use the spare cpu time to compress the data, you actually speed things up.

    20. Re:Why compress in the first place? by Anonymous Coward · · Score: 0

      Don't most tape technologies compress data?

    21. Re:Why compress in the first place? by Bloater · · Score: 1

      > all the problems you mention with data coruption can be solved by backing up the information more than once.

      You are better off not compressing and then storing ECCs. Although you are better off still by compressing and *still* storing ECCs.

    22. Re:Why compress in the first place? by StikyPad · · Score: 1

      Let's see.. 1PB / 117.2Kb/s = 868,657 CPU days to compress.

      That sounds reasonable. A server farm of 1,000 could knock that out in just under two and a half years.

      You'd probably use gzip to save time though.. Of course you'll still need 1,000 CPUs to get it done in under two weeks, or triple that if you're only working in off-peak hours because you don't have a thousand servers sitting around doing nothing.

      But I'm guessing anyone who has PBs of data to store is not working on a shoestring budget, and not particularly worried about saving a few grand on storage. Money is irrelevant without considering time. (The old accountant joke, "I'll give you $1,000,000 dollars, one dollar a year.") The computing (not to mention electrical) power isn't free, in either dollars or time, and you can't compare costs until you take that into account.
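
      For anyone checking the arithmetic above: the 868,657 figure comes out if 1PB is read as 2^50 bytes and "Kb" as 1024 bits. Both readings are assumptions on my part. Reading the same rate as kilobytes per second gives one eighth of that.

```python
PETABYTE = 2 ** 50        # bytes, binary petabyte (assumption)
RATE_KBIT_PER_S = 117.2   # figure from the post, read here as kilobits/s
bytes_per_s = RATE_KBIT_PER_S * 1024 / 8

cpu_days = PETABYTE / bytes_per_s / 86_400
farm_years = cpu_days / 1000 / 365

print(f"{cpu_days:,.0f} CPU days, {farm_years:.2f} years on 1,000 CPUs")

# Same rate read as kilobytes per second instead:
cpu_days_if_kbytes = PETABYTE / (RATE_KBIT_PER_S * 1024) / 86_400
print(f"{cpu_days_if_kbytes:,.0f} CPU days if the rate is in kilobytes/s")
```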

    23. Re:Why compress in the first place? by StikyPad · · Score: 1

      I forgot to mention the obvious.. anyone who's working with that volume of data is probably doing incremental backups (PBs of data don't just appear out of nowhere), and if compression is involved, it's probably incorporated into the FS and/or the nightly backup. That didn't seem to be the theme of the article and summary though.

    24. Re:Why compress in the first place? by ergo98 · · Score: 1

      Bah, speak for yourself. In this day when everyone has a 100 Mbit connection (at least around pretty much in this country; hint, not USA), to be honest, compressed content is actually a hassle.

      Which country are you talking about? If by "everyone" you mean "a small number of people in a small urban area", then I suppose it's all perception. I get 6Mbps, and I know that I'm among a very small minority (and I wouldn't go around spouting nonsense about it being "everyone"). And secondly, where are you sourcing data from that it can actually fill a 100Mbps pipe? And thirdly, even the cheapest hard-drive can write far faster than 10MB/second.

      Of course, ignoring the nonsensical fabrications of your post, it is a major pain in the ass when people compress already compressed content (e.g. a RAR of an MPEG or a JPG, etc. It's a waste of time for no compression).

    25. Re:Why compress in the first place? by toddestan · · Score: 1

      For instance, when I'm downloading your latest movie on DVD-R, it's usually packed in RARs, saving a few 100 Mb if that at best.

      The reason why they usually come in a bunch of rar files is because that's the best way to distribute files over usenet (where one big file usually doesn't work as well as a whole pile of smaller ones for a bunch of reasons). The RAR format is just a convenient way to split the files up. Then when people make the torrents or whatever out of the files, the RAR compression sometimes just comes along for the ride.

    26. Re:Why compress in the first place? by ChicoLance · · Score: 1

      DAR sounds like a cool utility. Do you know of something similar for my Windows boxes?

      --Lance

    27. Re:Why compress in the first place? by Anonymous Coward · · Score: 0

      Bah, speak for yourself. In this day when everyone has a 100 Mbit connection (at least around pretty much in this country; hint, not USA)

      lol, how's that for a closed mind? ;-)

    28. Re:Why compress in the first place? by Bryson · · Score: 1

      The right tools for data integrity are things like reliable transports, error correction codes, RAID, off-site backups, and digital signatures. Modern systems are too complex to tolerate wrong data. Refusing to compress so that we might live with errors in some formats via manual partial data recovery is naive. Engineer the system so it works: compress then add ECC.

      Or -- let me see if I can get the order right:

              compress
              encypher
              sign
              error-control code
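
      One reason compression has to come first in that order: good cipher output looks random, and random-looking bytes do not compress. The sketch below uses pseudo-random bytes as a stand-in for ciphertext (an assumption, no real cipher is involved) and compares how well zlib does on each input.

```python
import random
import zlib

text = b"2005-12-26 backup ok: /home/archive/customer-records.txt\n" * 1000

# Stand-in for cipher output (assumption): pseudo-random bytes of equal length.
rng = random.Random(0)
cipher_like = bytes(rng.getrandbits(8) for _ in range(len(text)))

packed_text = zlib.compress(text, 9)
packed_cipher_like = zlib.compress(cipher_like, 9)

print(len(text), len(packed_text), len(packed_cipher_like))
```

      The repetitive log text shrinks to a small fraction of its size, while the random-looking input stays at roughly its original length.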

      --Bryan

    29. Re:Why compress in the first place? by ArbitraryConstant · · Score: 1

      "Then when people make the torrents or whatever out of the files, the RAR compression sometimes just comes along for the ride."

      That's kinda irritating, since bittorrent doesn't really care how big files are, and file integrity is handled by the client and the .torrent file. Also, you can't prioritize individual files in the torrent when it's all RARs, and torrent content is typically compressed fairly well already (MP3, XVID, etc) so there won't be much extra compression (it may even be bigger with the overhead).

      --
      I rarely criticize things I don't care about.
    30. Re:Why compress in the first place? by cortana · · Score: 1

      That's because DVD video and audio is already compressed with MPEG... you idiot.

      I'd like to see you fit a two hour movie, say 640x480, 24 bit colour, 25 frames per second, with CD quality (44.1 kHz, 16 bits per sample) audio onto a DVD without using compression...
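
      Working that out, as a rough sketch: the video and audio parameters are the ones quoted above, while the stereo channel count and the 4.7 GB nominal single-layer DVD capacity are assumptions I've added.

```python
WIDTH, HEIGHT = 640, 480
BYTES_PER_PIXEL = 3        # 24-bit colour
FPS = 25
SECONDS = 2 * 60 * 60      # two hours

AUDIO_RATE = 44_100        # samples per second
AUDIO_BYTES = 2            # 16 bits per sample
CHANNELS = 2               # assumption: stereo, as on an audio CD

video_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL * FPS * SECONDS
audio_bytes = AUDIO_RATE * AUDIO_BYTES * CHANNELS * SECONDS
total_bytes = video_bytes + audio_bytes

DVD_BYTES = 4_700_000_000  # nominal single-layer DVD capacity (assumption)

print(f"{total_bytes / 1e9:.1f} GB raw, "
      f"{total_bytes / DVD_BYTES:.1f} single-layer DVDs")
```

      Almost all of the total is video. The uncompressed audio is under 1% of it.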

      Compression is an engineering problem. You just have to pick the right kind of algorithm to compress your data! With MPEG, if you lose a few frames, the worst thing that happens is that the stream is corrupted until the next keyframe.

    31. Re:Why compress in the first place? by Ironsides · · Score: 1

      LTO tape has built in ECC. The data retains integrity the longest by keeping the tapes in temperature/humidity controlled rooms and, most importantly, by not reading the data. Tapes are very good for long term storage this way.

      --
      Fly me to the moon Let me sing among those stars Let me see what spring is like On jupiter and mars
    32. Re:Why compress in the first place? by John+Bokma · · Score: 1

      You can compress the data, and create some error correction info on the compressed data so in case something breaks in your data, it can be restored. Or burn another (set) of CDs, which might be much smaller (in number) compared to uncompressed data.

    33. Re:Why compress in the first place? by Bloater · · Score: 1

      Actually you're probably best off doing a RAID 5 like scheme, so you end up using more storage, but having twice as many bits damaged in a localised region of tape is not twice as likely to prevent recovery of a given bit.

      Of course, having said what I have said, while the chance of damage causing unrecoverable losses can be reduced to almost nothing, if you are using compression then (depending on the scheme), those losses in that unlikely scenario could be totally devastating - while intellect can be applied to recover most of the data when you're not using compression. Overall, I think the risk/cost ratio is improved the most by using compression, ECC, and a RAID5-like scheme, but I'd like to see if somebody has done a thorough mathematical analysis of this. It would be interesting to see, and good to know.

    34. Re:Why compress in the first place? by Ironsides · · Score: 1

      The RAID-5 would be redundant in most cases. A business that really wants to keep the data will have duplicate tapes (simulating RAID-1?) in at least two different storage warehouses. Given that, I don't think the RAID-5 would be necessary.

      Also, given the ECC already in the LTO format, and that you can put some in the compressed file, RAID-5 would probably be a bit too exotic on tape to include.

      --
      Fly me to the moon Let me sing among those stars Let me see what spring is like On jupiter and mars
    35. Re:Why compress in the first place? by six · · Score: 1

      In this day and age, when magnetic storage is like $0.50 to $0.75 per GIGABYTE, I just can't fathom why a responsible admin would risk the possible data corruption that could come with compression.

      yes you can, just paste this to your nearest xterm :

      echo "without data compression, i'd have to pay \$$((`ls -1 /root/.private/movies/|wc -l`*640*352*3*25*60*90/1024/1024/1024/2)) for my pr0n storage"

    36. Re:Why compress in the first place? by dougmc · · Score: 1
      NEVER use compression on a backup unless you have PAR files you can use to recover the lost data if a bad sector on a CD, DVD, or bad block on a tape is discovered on restoration.
      You're far too free with that NEVER statement. There are many, many variables involved -- `backup' is an incredibly large brush, and in many, many cases, using compression (another incredibly broad brush) makes perfect sense.

      There's lots of variables that we're not touching here, but be assured that using compression along with backups often makes lots of sense --
      -- it sometimes speeds up the backup process
      -- it usually allows you to fit more onto a single piece of media
      -- it usually lets you know, upon restoring, if anything was corrupted or not. (Yes, keeping md5 or crc32 hashes will also do this, but most archivers/compressors add this by default.) Without compression or hashes, you often don't know if there was some data corruption or not (depends on the backup system -- remember, there's lots of variables.)
      -- sometimes the compressor/archiver will preserve metadata that the backup system will not. For example, tar will save *nix metadata where your backup may ignore it. (To be fair, tar is an archiver, but not a compressor. So consider tar.bz2 files instead.)
      -- it's true that a single bit error in a gzip file will make everything after the error unreadable, and this is one reason that gzip is often a poor choice for compressing a very large file. But gzip is only one compressor -- bzip2 doesn't have this limitation, for example. (And gzip's stream could be periodically restarted to get past that.)
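
      The "periodically restarted stream" idea in the last point can be sketched like this: compress fixed-size chunks as independent streams, so damage to one stream costs only that chunk. This is an illustration with Python's zlib rather than the gzip tool itself, and the 64 KB restart interval and sample records are arbitrary assumptions.

```python
import zlib

CHUNK = 64 * 1024  # hypothetical restart interval


def pack(data, chunk_size=CHUNK):
    """Compress each chunk as its own zlib stream."""
    return [zlib.compress(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def unpack_surviving(streams):
    """Decompress what still decodes; None marks an unrecoverable chunk."""
    out = []
    for s in streams:
        try:
            out.append(zlib.decompress(s))
        except zlib.error:
            out.append(None)
    return out


data = b"".join(b"record %06d\n" % i for i in range(40_000))
streams = pack(data)

# Simulate media damage: flip one byte in the middle of the second stream.
hit = bytearray(streams[1])
hit[len(hit) // 2] ^= 0xFF
streams[1] = bytes(hit)

chunks = unpack_surviving(streams)
lost = [i for i, c in enumerate(chunks) if c is None]
print(f"{len(lost)} of {len(chunks)} chunks lost")
```

      The price is a slightly worse compression ratio, since each chunk starts with an empty dictionary.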

      Also consider that in many cases, even a single bit error will make something totally unusable, compression or not. It depends on the circumstances.

      But I do agree that par2 files (or the equivalent) are nice. I personally do my backups at home to DVD-Rs, with a program that creates tar.bz2 files of a certain range of sizes and then 5-10% .par2 files, and I also include md5sums and crc32s onto the DVD-R of all files on the disk so I can easily tell if anything is corrupt without even decompressing anything.

    37. Re:Why compress in the first place? by Anonymous Coward · · Score: 0

      Speed aside [and speed would be a huge concern if you insisted on compression], I just don't understand the desire for compression in the first place.

      1) The bottleneck for getting a backup done is typically how fast you can write to tape. Compressed data = more net data per hour written to the tape.

      2) The space savings from compression can then be spent on providing additional redundancy / error correction (for the paranoid admins). After all, even if your file is ANSI/ASCII, a partially corrupted file is extremely suspect. Whereas if you had compressed it and sprinkled in some recovery data, you could both verify the data and restore any corrupted blocks.

      3) This is also why admins invented multi-generational backups. That way, in case one of your backup tapes goes bad, you can restore from the previous tape. Sure, you lose a bit of data, but that's the trade-off between backing up hourly vs daily vs weekly.

      Frankly, the more experienced admins in the office don't understand why you *wouldn't* use compression in backups. Unless your organization has more money than sense. (I don't know of very many admins who complain that they don't have enough data to backup. Most of us are struggling to keep everything from spilling over to additional backup devices.)

    38. Re:Why compress in the first place? by enrgeeman · · Score: 1

      4*250GB=1TB. Petabytes are sorta bigger.

      --
      sent from my slashdot browser.
    39. Re:Why compress in the first place? by DJerman · · Score: 1

      Compression decreases the chance that a file will be corrupted by a random 1-byte or 1-block event (because it becomes a smaller target) but it increases the chance that you won't be able to guess the right way to fix it, and (with most algorithms) it ensures that the file beyond the point of the error (or sometimes the whole file) is useless.

      If you're worried about loss, compress your data then use a separate parity system (like PAR2) to store redundant information about the compressed file, so you can reconstruct a certain number of bad bytes or blocks. Most of the parity programs have a feature that helps you determine the right degree of redundancy if you can define the degree of error you want to be able to tolerate. I like that better than hoping the compression algorithm makes the right assumptions.
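
      To show the idea of separate parity data in its simplest form, here is a sketch with a single XOR parity block, which can rebuild any one lost block. To be clear, this is not what PAR2 does internally: PAR2 uses Reed-Solomon codes so that several blocks can be lost. The block contents and function names are made up for the example.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)


def rebuild(blocks, parity):
    """Fill in at most one missing block (marked None) from the parity block."""
    missing = [i for i, b in enumerate(blocks) if b is None]
    if len(missing) > 1:
        raise ValueError("single parity can rebuild only one lost block")
    if missing:
        present = [b for b in blocks if b is not None]
        blocks[missing[0]] = xor_blocks(present + [parity])
    return blocks


blocks = [b"AAAA", b"BBBB", b"CCCC"]   # stand-ins for equal-sized archive blocks
parity = xor_blocks(blocks)

damaged = [blocks[0], None, blocks[2]]  # the middle block was unreadable
restored = rebuild(damaged, parity)
print(restored == blocks)
```

      One extra block of parity per group is the redundancy knob here: smaller groups cost more space but tolerate more scattered damage.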

      Yes, this can increase the size of the data up to or beyond the original size, but you're in control of the size difference and by compressing and computing parity you can have better error tolerance than by not compressing at all (that is, you're more likely to be able to guess the missing information).
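The idea of storing redundant data next to a compressed file can be sketched in a few lines of Python. Note the swap: PAR2 itself uses Reed-Solomon coding and can repair several blocks, while this sketch uses plain XOR parity, which can rebuild only one missing block out of a set of equal-length blocks. The function names are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(blocks):
    """One extra block that is the XOR of all data blocks."""
    return xor_blocks(blocks)


def rebuild_missing(blocks_with_gap, parity):
    """Rebuild the single block marked None from the survivors plus parity."""
    if sum(b is None for b in blocks_with_gap) != 1:
        raise ValueError("XOR parity can rebuild exactly one missing block")
    survivors = [b for b in blocks_with_gap if b is not None]
    return xor_blocks(survivors + [parity])


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]
    parity = make_parity(data)
    print(rebuild_missing([data[0], None, data[2]], parity))
```

The cost is one extra block per group, so the degree of redundancy is controlled by how many data blocks share a parity block.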

      I too am concerned about these high-performance compression programs. I prefer open source so that I can do the math and be sure the algorithm is going to be 100% reversible. There's always some new miracle compression tool (anyone remember WEB?) that can compress anything but only uncompresses some things -- not the compression program I'd like to use :).

    40. Re:Why compress in the first place? by Master+of+Transhuman · · Score: 1


      There's a version of DAR for Windows, which uses the Cygwin DLL to allow it to run under Windows.

      The main DAR home page has a link to the Sourceforge page where you can get a Zip file package for Windows.

      The only complexity is that even under Windows, you have to give DAR filenames with forward slashes instead of the backslashes Windows normally uses.

      --
      Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
    41. Re:Why compress in the first place? by Master+of+Transhuman · · Score: 1

      "it usually lets you know, upon restoring, if anything was corrupted or not"

      Gee, recovery time is really when I want to know if I have a corrupted backup... That's why one has to use checksum files or PARs - or back up twice (then you don't care, as the odds of losing the same file on two backups are very small - unless the drive is failing in such a way as to record errors in the same spot on multiple media - very unlikely.)

      And I'm not talking about bit errors in files - I'm talking about whole (sometimes multiple) sector read errors on the MEDIA. That's the show stopper for compressed archives. Bit errors are so rare it's not worth discussing.

      I back up twice if the data is hard to recover from the Internet (i.e., I have to hunt for it to get it back - such as video files I've downloaded from somewhere), but only once if the data is easily retrieved from elsewhere.

      I've never lost a file doing two backups - even when my previous CD drive was screwing up regularly (my current DVD drives seem to be more reliable). I HAVE had bad sector reads on CDs that proved the value of backing up twice. If I had archived and compressed those files and then backed them up only once, I would have lost them. That's when I learned.

      --
      Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
    42. Re:Why compress in the first place? by dougmc · · Score: 1
      Gee, recovery time is really when I want to know if I have a corrupted backup...
      Well, sure. That's why I've instituted a policy where all backup tapes will give six months notice before developing any errors not correctable by the drive itself.

      That's why one has to use checksum files or PARs - or back up twice (then you don't care as the odds of losing the same file on two backups is very small - unless the drive is failing in such a way as to record errors in the same spot on multiple media - very unlikely.)
      Backing up twice is nice, but if you don't know that the bits you read from the tape are incorrect, why would you even bother to look at the other tape? Being able to recover from an error is a good thing -- but it's also important to actually know that there was an error, and yes, par2 or checksums of some sort will help you determine that. Most of the time the drive itself will also report an error, but it's best not to rely on that.
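A minimal sketch of the "checksums of some sort" approach, assuming the backup is a directory of ordinary files: record a SHA-256 digest per file when the backup is made, then compare later to learn which files need to come from the second copy. The function names and the in-memory manifest are illustrative choices, not any particular tool's format.

```python
import hashlib
from pathlib import Path


def sha256_of(path, bufsize=1 << 20):
    """Hash a file in 1 MB reads so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(bufsize), b""):
            digest.update(block)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's relative path to its digest at backup time."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_damaged(root, manifest):
    """Return the names whose current contents are missing or differ."""
    root = Path(root)
    damaged = []
    for name, expected in manifest.items():
        candidate = root / name
        if not candidate.is_file() or sha256_of(candidate) != expected:
            damaged.append(name)
    return damaged
```

This only detects damage; repairing it still needs a second copy or parity data.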

      As for par2 files, they require so much cpu to calculate that you'll not find many people actually using them. I do, you do, but you probably won't find a company with 100 TB to back up weekly using them. (And really, if I had even 1 TB to back up in one shot (250 DVD-Rs ... ouch! I'd probably go buy some big IDE drives instead), I probably wouldn't be using them either.) But as long as I'm doing a few DVD-Rs at a time, no problem.

      Bit errors are so rare it's not worth discussing.
      Actually, they're quite common. However, your drives (tape, CD, DVD, hard, etc.) typically have some error correction built in, and can recover transparently from a few bit errors per sector. Which is why you almost never see single bit errors -- the drives typically take care of them and you never see them until they become so numerous that firmware can't recover, and an error is reported, and the entire block is usually garbage. However, obscure bugs in drivers and firmware do sometimes cause the error to be swallowed up and no error is reported. It's very rare, but it does happen.
    43. Re:Why compress in the first place? by Master+of+Transhuman · · Score: 1

      "Actually, they're quite common. However, your drives (tape, CD, DVD, hard, etc.) typically have some error correction built in."

      That's why I said they aren't worth talking about. It's rare for a bit error to not be corrected automatically and it's rare for a bit error to actually matter given the size and organization of most files today. Unless it causes a program crash, most of the time it's just a glitch in a video or audio stream that is ignored.

      "Being able to recover from an error is a good thing -- but it's also important to actually know that there was an error..."

      I still don't know what the hell you're talking about here. If you're talking about errors in data that DON'T cause a failure of the system but merely corrupt the data, then your statement makes sense - but is irrelevant to the discussion. I'm talking about errors that prevent restoration of the data - which, again, is the point of backup - to get the data back if it is lost. I'm not concerned about simple data corruption that doesn't cause a restore failure because, as you say, most of the time that is corrected by other mechanisms in the system. I'm concerned about data corruption that causes loss of data by being unable to restore it to replace data already lost elsewhere.

      Today I'm backing up my system again (haven't done it for months since it IS a painful process with just DVDs) - I expect to have to use at least twenty or thirty and maybe even forty or fifty DVDs to do it since I have 40GB of media to back up and another 40GB of images to back up, plus several tens more gig of ebooks, documents, programs, etc. I'd use tape if I could afford it, but DVD is the only method I can afford.

      I tend to back up only once since much of the stuff is retrievable from the Net. Also, since I back up individual files unarchived and uncompressed, the odds are I'll only lose a few images or something if a bad sector occurs. Thus I avoid having to use PARs - but that's only because little of the data is critical or hard to replace. I'm a bit more protective of my downloaded Corrs videos as they can be harder to replace depending on who's got them uploaded to a Web site at any given time.

      But I'd be NUTS to archive and compress this stuff on DVD - one bad sector and I'd lose it all. Especially when DVD drives and media are so flakey - nobody can be sure any given media will work in any given drive. It's a major problem. I had to replace my last LiteOn DVD burner because it absolutely sucked at using even top of the line Taiyo Yuden media. Now I have a much better NEC drive. But who knows - it could go out at any time and the next drive I get might not read anything the NEC burned - more likely, I'd get bad sector reads from the NEC-burned media because of differences in the drives. I learned from the LiteOn experience - do NOT archive and compress.

      On an enterprise scale, there may be times when you have to archive and compress. But I stick to the view that for critical data that MUST be restorable (as opposed to merely archived for legal reasons), keep it simple.

      --
      Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
    44. Re:Why compress in the first place? by dougmc · · Score: 1
      That's why I said they aren't worth talking about. It's rare for a bit error to not be corrected automatically and it's rare for a bit error to actually matter given the size and organization of most files today. Unless it causes a program crash, most of the time it's just a glitch in a video or audio stream that is ignored.
      Depends on your data. And as for your video or audio stream -- you'll notice that it's 1) probably compressed and 2) a single bit (or sector worth of bits) error will generally not ruin the entire stream.
      But I'd be NUTS to archive and compress this stuff on DVD - one bad sector and I'd lose it all.
      It's not that bad. Even if you do the worst possible thing -- put one 4.3 GB tar.gz file onto a DVD -- a single bad sector will only make 50% of the data unusable on the average. (Note that I said `on the average' -- it could be that you lose only a tiny bit, or it could be that you lose 100%.)

      But as I suggested, this is the wrong way to safely use compression. You could simply use bzip2 instead of gzip, for example. bzip2 compresses streams into blocks, usually 900 KB long, and each block is handled independently. A single sector error will corrupt only one or two blocks, and so that's all that will be corrupted -- the rest can be recovered. And that's just one of many possible ways of making sure that a single bad sector will only corrupt a small part of your backup.
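One of those "many possible ways" can be sketched as follows, assuming Python's zlib and a made-up framing (a 4-byte length in front of each independently compressed chunk). This is not bzip2's on-disk format; it only illustrates why independent blocks limit the damage. The chunk size is an arbitrary choice for the example.

```python
import struct
import zlib

CHUNK_SIZE = 64 * 1024  # illustrative block size, not a recommendation


def pack_chunks(data, chunk_size=CHUNK_SIZE):
    """Compress each chunk on its own and prefix it with its length."""
    out = bytearray()
    for start in range(0, len(data), chunk_size):
        payload = zlib.compress(data[start:start + chunk_size])
        out += struct.pack(">I", len(payload)) + payload
    return bytes(out)


def unpack_chunks(blob):
    """Return one entry per chunk: the original bytes, or None if damaged."""
    chunks = []
    pos = 0
    while pos + 4 <= len(blob):
        (length,) = struct.unpack_from(">I", blob, pos)
        pos += 4
        payload = blob[pos:pos + length]
        pos += length
        try:
            chunks.append(zlib.decompress(payload))
        except zlib.error:
            # zlib's built-in checksum or a decode failure flags the chunk.
            chunks.append(None)
    return chunks
```

A limitation of this sketch: damage to a length prefix breaks the framing for everything after it, so a real format would also need sync markers or a separate index.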

      On an enterprise scale, there may be times when you have to archive and compress.
      On an enterprise scale, you'll probably find that most tape drives implement compression in the drive firmware itself, and that most enterprises use it. It 1) speeds backups up and 2) allows you to put more `typical' data onto the tape. (Of course, it can't further compress already compressed data, but much (most?) of what's backed up is not already compressed.) And a single bad block on the tape typically only corrupts a small part of the backup (a few MB tops), not the entire media.

      In any event, quit saying that `compression is bad, mmmkay?' It's not. It's a useful tool, and done properly it does not significantly endanger your backups or your data.

    45. Re:Why compress in the first place? by Master+of+Transhuman · · Score: 1

      "Even if you do the worst possible thing -- put one 4.3 GB tar.gz file onto a DVD -- a single bad sector will only make 50% of the data unusable on the average."

      That depends on the software you use to try to recover the data - my experience in this regard is not good. I DO have tools for that sort of thing, but most people wouldn't.

      As for tape drives, I've just been through an extensive discussion about them elsewhere in the threads - which ended when I cited an industry study that showed 30% of tape backups fail due to media corruption or drive failure or operator error. I also cited a study showing that the industry is moving to disk-to-disk and disk-to-disk-to-tape for local recovery and backup-to-archive operations. I have no objection to using tape for archival backup - meaning stuff that is merely being saved with no expectation of requiring it to be restored in order to continue to operate. But if you're backing up mission-critical data and you need to be sure of restoring it to get back in operation, disk-to-disk is the only way to go - it's faster and far more reliable and the expense is irrelevant compared to the costs of not being in operation. Further backup to tape for "last-ditch" restoration (if your building blows up) is also okay, but for that data, I'd recommend not archiving and compressing at the same time - the risk is too high.

      Compression and archiving at the same time should only be used for data that is strictly archival and not a mission-critical backup.

      You people need to realize that the major problem with the IT industry is that, as Woody Allen once put it, "Nothing works and nobody cares". Much of what the industry produces is complicated and not reliable - that includes hardware and software. You simply cannot rely on somebody's promises that all this crap is going to work and work well together. Therefore it is imperative to keep it as simple as possible - disk to disk backups with no compression or archiving removes extra steps that can compromise reliability. It's that simple.

      --
      Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
  15. Input type? by reset_button · · Score: 3, Interesting

    Looks like the site got slashdotted while I was in the middle of reading it. What file types were used as input? Clearly compression algorithms differ on the file types that they work best on. Also, a better metric would probably have been space/time, rather than just using time. Also, I know that zlib, for example, allows you to choose the compression level - was this explored at all?
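On the compression-level question, a rough sketch of how one could measure the size/time trade-off with Python's zlib module. The sample data and the set of levels are arbitrary choices for illustration, and timings will differ from machine to machine.

```python
import time
import zlib


def compare_levels(data, levels=(1, 6, 9)):
    """Return (level, compressed size, size ratio, seconds) per level."""
    results = []
    for level in levels:
        start = time.perf_counter()
        packed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        results.append((level, len(packed), len(packed) / len(data), elapsed))
    return results


if __name__ == "__main__":
    # Made-up, highly repetitive input; real files will behave differently.
    sample = b"".join(
        b"line %d of a fairly repetitive log file\n" % i for i in range(50000)
    )
    for level, size, ratio, seconds in compare_levels(sample):
        print(f"level {level}: {size} bytes ({ratio:.1%}) in {seconds:.3f}s")
```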

    Also, do any of you know any lossless algorithms for media (movies, images, music, etc)? Most algorithms perform poorly in this area, but I thought that perhaps there were some specifically designed for this.

    1. Re:Input type? by reset_button · · Score: 1

      Looks like the load on the site just went down and I was able to read the remainder of the article. Looks like they do use different input types, as well as a space vs. time metric. I'm not crazy about using a stopwatch, but that's probably the best you can do if you're working with a GUI. If any of you know the answer to my question at the end though, it would be appreciated.

    2. Re:Input type? by !equal · · Score: 1
      Also, do any of you know any lossless algorithms for media (movies, images, music, etc)? Most algorithms perform poorly in this area, but I thought that perhaps there were some specifically designed for this.

      I know of one for music called FLAC (Free Lossless Audio Codec).

    3. Re:Input type? by bigbigbison · · Score: 2, Interesting

      According to Maximum Compression, which is basically the best site for compression testing, Stuffit's new version is the best for lossless jpeg compression. I've got it and I can confirm that it does a much better job on jpegs than anything else I've tried. Unfortunately, it is only effective on jpegs not gifs, pngs, or even pdfs which seem to use jpeg compression. And, outside of the mac world, it is kind of rare.

      --
      http://www.popularculturegaming.com -- my blog about the culture of videogame players
  16. Blank page by Anonymous Coward · · Score: 0

    Anyone else having trouble viewing the site? It comes up utterly blank in IE with all patches on fully updated XP. View source shows everything you'd expect to see but it's rendering blank. ?? (useless "don't use IE" type comments will be modded flamebait)

    1. Re:Blank page by Anonymous Coward · · Score: 0

      1) The site is Slashdotted.
      2) Don't use IE.

    2. Re:Blank page by level_headed_midwest · · Score: 1

      It's Slashdotted. Same result with Konqueror 3.5 on Linux.

      --
      Just "gittin-r-done," day after day.
  17. Why compress in weird formats? by canuck57 · · Score: 4, Insightful

    I generally prefer gzip/7-Zip.

    The reasoning is simple, I can use the results cross platform without special costly software. A few extra bytes of space is secondary.

    For many files, I also find buying a larger disk a cheaper option than spending hours compressing/uncompressing files. So I generally only compress files I don't think I will need that are very compressible.

    1. Re:Why compress in weird formats? by _Shorty-dammit · · Score: 2, Insightful

      haha, yeah, 7-zip isn't 'weird' at all. I like how you try to make it sound like it's just as pervasive as something like gzip, even though 7-zip's a pretty much unknown format.

    2. Re:Why compress in weird formats? by jp10558 · · Score: 1

      Yeah, when compressing files, I'm basically limited to .zip for most people, cause WinXP will handle that. For the savvy, I might get to use .rar for a little better compression.

      Has anyone heard of WinUHA yet? That is supposed to be pretty good, and I'd not mind testing out other archivers, as long as the time savings on transferring smaller files aren't overtaken by the compression/decompression time. Though, again, all these things are useless if no one can uncompress them.

      --
      Opera, Proxomitron-Grypen,GPG 0x0A1C6EE3
    3. Re:Why compress in weird formats? by hobuddy · · Score: 2, Interesting

      7-zip is the 16th most popular download on SourceForge (8544268 downloads so far), and it gets downloaded about 18000 times per day, so it must be going somewhere in terms of popularity.

      --
      Erlang.org: wow
  18. ...or NTFS by tepples · · Score: 1

    Why mess around with compressing individual files? DiskDoubler is definitely the way to go.

    And NTFS of Windows 2000 or later includes technology similar to DiskDoubler.

  19. rzip? by Anonymous Coward · · Score: 0

    how does it perform against the rest?

    http://rzip.samba.org/

    1. Re:rzip? by brejc8 · · Score: 1

      rzip and szip are the two compressors which I didn't know about before I started doing the review. They are both about a couple of percent better than bzip2 and I will include them if I do an update.

    2. Re:rzip? by undeadly · · Score: 1

      According to this post there are cases where you often get much better results with rzip than bzip2. So testing also depends on the type of data one expect to compress.

  20. Why compress in the first place? To save time. by SineOtter · · Score: 1

    Not everyone cares about how great their data integrity is with compressed files. They just care about compressing a few files to send to someone over IM faster than if they were sending them uncompressed. When you have to tell someone to wait 40 minutes for your file to finish sending because it's uncompressed, speed/compression becomes the deciding factor.

    1. Re:Why compress in the first place? To save time. by tzanger · · Score: 1

      Sure, but we're talking 200M on a 2.5GB file. that's 8%. Frankly, if an 8% difference in speed is going to change your download time to your friend on IM by forty minutes, it's time to upgrade your connection.
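The arithmetic here is easy to check. The sketch below takes 2.5GB as 2500MB (an assumption) and asks how long the whole uncompressed transfer would have to take for an 8% saving to be worth forty minutes.

```python
def fraction_saved(saved_mb, total_mb):
    """Share of the transfer that the better compressor removes."""
    return saved_mb / total_mb


def full_transfer_minutes(minutes_saved, saved_mb, total_mb):
    """Length of the full transfer if the saved part takes minutes_saved."""
    return minutes_saved * total_mb / saved_mb


if __name__ == "__main__":
    print(fraction_saved(200, 2500))             # 0.08
    print(full_transfer_minutes(40, 200, 2500))  # 500.0
```

In other words, the saving only reaches forty minutes on a link where the full file takes 500 minutes, a bit over eight hours.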

  21. 'more of the same' - delta compression by erwincoumans · · Score: 1

    >backing up last years data to make room for more of the same.

    If it's really more of the same, using delta compression on new data using last-years data would work nicely.
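A crude way to try this idea with nothing but Python's zlib is to hand last year's data to the compressor as a preset dictionary. This is only a sketch of the principle, not a real delta tool: deflate looks back at most 32 KB, so only the tail of the reference helps, and both sides need the identical reference to decompress. The function names are made up.

```python
import zlib


def compress_against(reference, new_data):
    """Compress new_data, letting it refer back into the old reference data."""
    packer = zlib.compressobj(level=9, zdict=reference)
    return packer.compress(new_data) + packer.flush()


def decompress_against(reference, packed):
    """Reverse compress_against; needs exactly the same reference bytes."""
    unpacker = zlib.decompressobj(zdict=reference)
    return unpacker.decompress(packed) + unpacker.flush()
```

When this year's data is mostly a copy of last year's, the output shrinks to little more than the changed parts; when it is unrelated, the dictionary buys nothing.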

  22. common compression utilities benchmarks by qazwsx789 · · Score: 0, Redundant

    I did a small test of the common linux compression commands back in 2000. Here are the results: (note that some of the command options have changed since then, for example tar now uses -j for bzip2)

    THE COMPRESSION UTILITY TEST

    Compression utilities tested: zip, rar, gzip, bzip2, tgz (tar with the z flag invoked). Each test was run three times. For each completed test the system was rebooted. Hardware used: Pentium2 350MHz, 256MB RAM. OS: Linux Mandrake 7.1. The system load was minimal. The "time" command was used to measure the elapsed time, the "ls -l" command was used to determine the size, and a script was used to determine the total size of the gzip files.

    Note: gzip packs individual files recursively. For bzip2, the command invoked was tar -cvIf file.bz2 dir (in GNU tar, the I flag invokes bzip2). For tgz, tar with the z flag invokes gzip.

    TEST 1 - compressing multiple files

    total size of the dir: 91.621.857 bytes, total files: 3540 (most of these files are ascii and html, but there are a few gifs and jpgs too.)

    default compression settings:

    tool   compress time  MB/s  compressed to (bytes)  uncompress time
    gzip   1m.44s         0.88  24.884.124             37s
    zip    1m.10s         1.3   25.813.958             41s
    rar    3m.25s         0.44  20.784.489             48s
    bzip2  3m.54s         0.39  17.399.561             1m.17s
    tgz    1m.09s         1.32  23.821.446             36s

    maximum compression settings:

    tool   compress time  MB/s  compressed to (bytes)  uncompress time
    gzip   2m.00s         0.76  24.670.516             36s
    zip    1m.42s         0.89  25.593.448             39s
    rar    10m.12s        0.14  18.698.710             1m.02s
    bzip2  n/a (the compression rate cannot be specified through tar; is the maximum the default?)
    tgz    n/a (the compression rate cannot be specified through tar; is the maximum the default?)

    CONCLUSION: use tgz (tar with the z flag) if time is an issue, otherwise use bzip2(tar with the I flag)

    TEST 2 - compressing 1 ascii file

    size of the ascii file: 53.819.786 bytes (the file was taken out of my mailbox)

    default compression settings:

    tool   compress time  MB/s  compressed to (bytes)  uncompress time
    gzip   42s            1.28  15.560.144             15s
    zip    41s            1.31  15.560.261             17s
    rar    1m.57s         0.45  11.507.387             17s
    bzip2  1m.58s         0.45  10.788.502             39s
    tgz    54s            0.99  15.560.907             8s

    maximum compression settings:

    tool   compress time  MB/s  compressed to (bytes)  uncompress time
    gzip   44s            1.22  15.486.842             15s
    zip    45s            1.19  15.486.959             16s
    rar    6m.40s         0.08  09.582.810

  23. small mistake by ltwally · · Score: 4, Interesting
    There is a small mistake on page 3 of the article, in the first table: WinZip no longer offers free upgrades. If you have a serial for an older version (1-9), that serial will only work on the older versions. You need a new serial for v10.0, and that serial will not work when v11.0 comes out.

    Since WinZip does not handle .7z, .ace or .rar files, it has lost much of its appeal for me. With my old serial no longer working, I now have absolutely no reason to use it. Now when I need a compressor for Windows I choose WinAce & 7-Zip. Between those two programs, I can de-/compress just about any format you're likely to encounter online.

    --
    /dev/random
    1. Re:small mistake by honor,+not+armor · · Score: 1

      Save yourself some trouble and use ZipGenius. It does 7-zip and other formats, integrates into the Explorer context menu if you choose, and it's freeware (even for corporate use). Downside is that it's not open-source, and the GUI could be improved a little.

  24. What about speed? by www.sorehands.com · · Score: 1

    It is not only the space, but also the speed. Once the data is compressed, backing up the compressed data takes less time. If you compress and then back up, you have to compare the compression time to the transfer time. Now, if you compress once, back up, and then copy the backup, you compare the compression time to 2X the transfer time.
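That comparison can be written down as a small model. It is a sketch under simplifying assumptions that are not in the post: compression runs once, up front, at a fixed rate; every copy then moves over a link with a fixed rate; and `ratio` means compressed size divided by original size. All parameter names are made up.

```python
def backup_seconds(size_mb, link_mb_s, copies=1, compress_mb_s=None, ratio=1.0):
    """Total time to (optionally) compress once and then transfer `copies` times."""
    compress_time = 0.0 if compress_mb_s is None else size_mb / compress_mb_s
    transfer_time = copies * (size_mb * ratio) / link_mb_s
    return compress_time + transfer_time


if __name__ == "__main__":
    # 1000 MB over a 10 MB/s link, compressor running at 50 MB/s, 2:1 ratio.
    print(backup_seconds(1000, 10))                                # uncompressed
    print(backup_seconds(1000, 10, compress_mb_s=50, ratio=0.5))   # one copy
    print(backup_seconds(1000, 10, copies=2, compress_mb_s=50, ratio=0.5))
```

The compression cost is paid once while the transfer saving is collected per copy, which is why the second copy tilts the balance further toward compressing.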

    Outside of the pure speed issue, what about media swapping? Once you exceed the media capacity (I'm talking removable media), the media needs to be swapped, which not only takes time but most likely requires human interaction. If you have a 30GB tape but 40GB to back up, the tape needs to be swapped. This eliminates the "start the backup, go home" backup process.

    1. Re:What about speed? by Ironsides · · Score: 1

      On tape, this is not an issue. Serious tape libraries are automated. An arm manually loads in and extracts tapes used in backup. Mind you, I'm also assuming that any one really worrying about this is going to be "serious". LTO tapes (great for long term backup) hold 400GB (LTO3). Transfer speed is about 20MB/s (yes, megabytes). Tapes cost ~$100 each. Also, from my experience with compression, no compression algorithms (or computer hardware) can compress raw data fast enough to keep that rate up. Compression is going to stay a way to save money on tape/storage costs for the foreseeable future, as far as I can tell.

      --
      Fly me to the moon Let me sing among those stars Let me see what spring is like On jupiter and mars
    2. Re:What about speed? by Wonko · · Score: 1

      On tape, this is not an issue. Serious tape libraries are automated. An arm manually loads in and extracts tapes used in backup.

      This is all fine, until you need one more tape than your library holds :).

      Mind you, I'm also assuming that any one really worrying about this is going to be "serious". LTO tapes (great for long term backup) hold 400GB (LTO3). Transfer speed is about 20MB/s (yes, megabytes).

      So, to improve your overall throughput your compression needs to get anything slightly better than a 1:1 ratio and it needs to run faster than 20 MB/sec.

      Tapes cost ~$100 each.

      Assuming you fill each tape and your compression only buys you 20%, you already get pretty good savings in money.

      Also, from my experience with compression, no compression algorithms (or computer hardware) can compress raw data fast enough to keep that rate up. Compression is going to stay a way to save money on tape/storage costs for the foreseeable future, as far as I can tell.

      I haven't been in the backup game for quite a few years (when I left, 100 GB LTO was pretty new). Depending on how compressible the data was, we used to get a big gain in throughput by compressing the data. I don't have any solid numbers, only my memory. However, I can do a quick test here on my desktop (Athlon MP 1700).

      I happen to have a bzip2ed cpio of my laptop's home directory sitting here. It should contain a pretty good mix of file types. I just uncompressed it, and it is about 750 MB (370 MB bzip2ed). "gzip -1" can compress the file down to 406 MB (54%) at a rate of 9.8 MB/sec. Redirecting to /dev/null improves that to 11.2 MB/sec. My machine is already over 3 years old, I would hope a modern server could at least double my throughput.

      There are faster compressors available that don't compress nearly as well. I was able to get 21 MB/sec to /dev/null using lzop (filesize 432 MB, 57%). There is probably something faster, but if you could pipe your backups through lzop you might be able to increase your backup speeds by 50% or more (or less, depending on your data of course).
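The numbers quoted above can be plugged into a simple pipeline model: when the compressor feeds the tape drive directly, the backup moves at whichever stage is slower. This is a sketch that assumes a steady 20 MB/s tape, ignores seek and buffering effects, and treats the percentages as compressed size over original size; the function name is made up.

```python
def source_rate(tape_mb_s, compress_mb_s=None, ratio=1.0):
    """MB/s of original data consumed when compression feeds the tape directly."""
    if compress_mb_s is None:
        return tape_mb_s
    # The tape absorbs tape_mb_s of compressed data, i.e. tape_mb_s / ratio
    # of original data; the compressor may be the slower stage.
    return min(compress_mb_s, tape_mb_s / ratio)


if __name__ == "__main__":
    print(source_rate(20))                # no compression
    print(source_rate(20, 11.2, 0.54))    # gzip -1 on the Athlon MP
    print(source_rate(20, 21, 0.57))      # lzop on the Athlon MP
    print(source_rate(20, 42, 0.57))      # lzop if a server is twice as fast
```

Under these assumptions gzip -1 on that desktop would slow the backup down, lzop would roughly break even, and only the doubled-throughput case gets well past a 50% gain.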

      I am also cheating a bit. I am working with a single large file. I am assuming tape backup speeds have increased about linearly with hard disk speeds and seek times. If they have, your backups will slow to a crawl if you are backing up small files. I want to say that the 7200 RPM SCSI drives that were on most of our file servers might have had an average seek time of 9ms (does that sound right?). That pretty much means that you lose 9 ms every time the backup process seeks to open a new file. It is amazing how much this screws up your throughput, even if the tape is 1/10th the speed of the disks...

    3. Re:What about speed? by ikea5 · · Score: 1
      If you have a 30GB tape but 40GB to back up, the tape needs to be swapped. This eliminates the "start the backup, go home" backup process.

      This is slashdot. You probably meant "start the backup, still home".

    4. Re:What about speed? by jabuzz · · Score: 1

      Just make sure you have a library that can be expanded to meet your capacity requirements. I am sure any storage company would be more than happy to sell you a system that will scale from one drive and maybe 20~30 slots, all the way through to a dozen or so drives and hundreds of tapes.

      It is something of a headache to keep a modern multiple drive tape library streaming. Say you have just four LTO3 drives, with a native speed each of at least 40MB/s, then you need to stream data at a speed of at least 160MB/s, and if you can achieve 2:1 compression then you need to get it off the hard disks at 320MB/s. However the tape drives can do 80MB/s, which pushes the transfer rate of the hard drives to 640MB/s.
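The streaming arithmetic above reduces to one multiplication. A sketch, with a made-up function name, where the compression ratio is expressed as original size over compressed size (2 for 2:1):

```python
def disk_rate_needed(drives, tape_mb_s, compression_ratio):
    """MB/s the disks must deliver to keep every tape drive streaming."""
    return drives * tape_mb_s * compression_ratio


if __name__ == "__main__":
    print(disk_rate_needed(4, 40, 2))  # four drives at 40 MB/s native, 2:1
    print(disk_rate_needed(4, 80, 2))  # the same drives at 80 MB/s
```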

      However as you suggest it is the seek time from the filesystem that really kills if you want to do a file level backup. The restore times suck even worse than the backup.

    5. Re:What about speed? by Anonymous Coward · · Score: 0

      Of course how do those LTO tapes manage to fit 400GB on them? That's right, compression (in the hardware)!

    6. Re:What about speed? by Ironsides · · Score: 1

      LTO tapes aren't really made for online storage. Nearline at best and usually archival storage is how they are used. For most archives, you don't need a tape library to hold all the tapes. This is especially true if you are storing old customer records. Generally you'd take tapes like that out of the tape library and store them someplace else.

      As to the restoration of data you talk about, it is interesting. One thing is that the max transfer speed I know of for any single drive hookup is 2Gbit/s. The limit of 250MB/s would hamper large restores.

      --
      Fly me to the moon Let me sing among those stars Let me see what spring is like On jupiter and mars
  25. Compress to 0K by Anonymous Coward · · Score: 2, Funny

    I always compress my compressed files over and over until I achieve absolute 0Kb.
    I carry all data of my entire serverfarm like that on a 128Mb USB-stick.

    1. Re:Compress to 0K by hawaiian717 · · Score: 1

      I found this really neat compression program rm that compresses files to 0Kb really quick, in one pass. OS X doesn't seem to come with the uncompressor though, so I don't use it much.

      --
      End of Line.
  26. Nothing to see here by Anonymous Coward · · Score: 5, Informative

    I can't believe TFA made /. The only thing more defective than the benchmark data set (Hint: who cares how much a generic compressor can save on JPEGs?) is the absolutely hilarious part where the author just took "fastest" for each compressor and then tried to compare the compression. Indeed, StuffIt did what I consider the only sensible thing for "fastest" in an archiver, which is to just not even try to compress content that is unlikely to get significant savings. Oddly, the list for fastest compression is almost exactly the reverse of the list for best compression on every test. The "efficiency" is a metric that illuminates nothing. An ROC plot of rate vs compression for each test would have been a good idea; better would be to build ROC curves for each compressor, but I don't see that happening anytime soon.

    I wouldn't try to draw any conclusions from this "study". Given the methodology, I wouldn't wait with bated breath for parts two and three of the study, where the author actually promises to try to set up the compressors for reasonable compression, either.

    Ouch.

    1. Re:Nothing to see here by igrigorik · · Score: 1

      I agree with you but I also think that the whole pursuit for the 'best' compressor is misguided, even a set of ROC curves won't tell us much. From a practical standpoint a single compressor as 'jack of all trades' is obviously the best solution but due to the differences in the compression algorithms every data-set that you're going to push through the compressor will yield different results. If you even take the most basic/well studied Lempel-Ziv and Huffman algorithms you'll quickly find cases where each would be preferred over another.

      From a programmer's point of view:
      - Sometimes I don't want to send my dictionary with my encoded file, sometimes I can even assume that we have the dictionaries on both end points of communication.
      - Sometimes I can wait 5 minutes to zip a file and 20 minutes to unzip it. When I'm trying to stream a file, I probably can't.
      - Sometimes I want everyone to be able to read my file (zip it!). Sometimes I don't.

      And since different algorithms identify different patterns in the file they're compressing, an algorithm that does well on one file may do much worse on the next. Besides, we're not even getting into any discussion of lossy/lossless algorithms here. (Think jpeg vs bmp).

    2. Re:Nothing to see here by Meostro · · Score: 2, Informative
      If you even take the most basic/well studied Lempel-Ziv and Huffman algorithms you'll quickly find cases where each would be preferred over another.
      That's sort of the point of this test though, to see which of the general-purpose compressors (GPC) is going to give you the best overall results. Yes, you should use FLAC for WAVs, and probably StuffIt for JPEGs, but what is your best choice if you're going to have just one, or just a few? I don't want 200 different compressors for 200 different content types, I want one.

      As a matter of practicality, right now you need zip or gzip, and bzip2 is gaining ground. If you're going to create new content, you should offer both bz2 and zip. In the future, maybe you should use 7z or sit instead, it depends on the rate of adoption. Personally, I don't think zip will ever die.
      And since different algorithms identify different patterns in the file they're compressing, certain files will be compressed better by different algorithms and do much worse on the next file. Besides, we're not even getting into any discussion of lossy/lossless algorithms here. (Think jpeg vs bmp).
      Generally, you will pick a special-purpose compressor for lossy compression, and a GPC for lossless compression. Your audio compressor will probably be MP3 or OGG, your images will probably be JPG, videos will be MPG. It's not efficient to use MP3 compression on your images, it's designed with different constraints. Either for the same bitrate the image is much worse quality, or for the same quality the file will be much larger than necessary. The same goes for lossless compressors too, FLAC works much better than ZIP on audio data, but I would bet if you used a BMP file as the source for compression FLAC would probably be bad and ZIP would probably be average.

      If you want to compress 300 files of various types, you need a GPC. That doesn't mean that the GPC doesn't have special-purpose algorithms built into it, it just means that on average, across mixed content, it will perform better than any single special-purpose compressor.

      Kolmogorov complexity, or at least an estimate thereof, is what you're talking about. For any specific dataset, the Kolmogorov complexity is the minimum size of compressed data + decompressor. It can't be calculated, but it is a measure of performance for any combination of compressor and dataset. For WAVs, you will probably see this:
      K(FLAC, WAVs) < K(GPC, WAVs)

      However, for an evenly-distributed general dataset of generic binary files, TXT, JPG, PDF, TIF, PNG, MP3, WAV, and MPG, you will probably find that for any SPC (special-purpose compressor for any of the individual data types):
      K(GPC, dataset) < K(SPC, dataset)
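
      True Kolmogorov complexity can't be computed, but the output size of any real compressor gives an upper bound on it (ignoring the fixed size of the decompressor). A rough Python sketch of that idea, using only the standard-library zlib, bz2 and lzma modules; the helper names and the sample inputs are made up for illustration and are not from the article:

```python
import bz2
import lzma
import os
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Return the output size in bytes for each general-purpose compressor."""
    return {
        "zlib": len(zlib.compress(data, 9)),
        "bz2": len(bz2.compress(data, 9)),
        "lzma": len(lzma.compress(data)),
    }


def best_bound(data: bytes) -> int:
    """Smallest compressed size seen: a crude upper bound on the complexity."""
    return min(compressed_sizes(data).values())


if __name__ == "__main__":
    # Hypothetical inputs: repetitive text versus random bytes, which behave
    # like already-compressed content such as JPEG or MP3 payloads.
    text = b"the quick brown fox jumps over the lazy dog\n" * 2000
    noise = os.urandom(len(text))
    for name, blob in (("text", text), ("noise", noise)):
        print(name, len(blob), compressed_sizes(blob))
```

      Which compressor gives the tightest bound depends on the input, which is the whole GPC vs SPC argument in miniature.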
  27. Maximum Compression has efficiency comparisons by bigbigbison · · Score: 5, Informative

    Since the original site seems to be really slow and split into a billion pages, those who aren't aware of it might want to look at MaximumCompression since it has tests for several file formats and also has a multiple file compression test that is sorted by efficiency. A program called SBC does the best, but the much more common WinRAR comes in a respectable third.

    --
    http://www.popularculturegaming.com -- my blog about the culture of videogame players
  28. Related Links Broken by Karma+Farmer · · Score: 2, Funny

    The "related links" box for this story is horribly broken. Instead of being links related to the story, it's a bunch of advertising. I'm sure this was a mistake or a bug in slashcode itself.

    I've searched the FAQ, but I can't figure out how to contact slashdot admins. Does anyone know an email address or telephone number I can use to contact them about this serious problem? I'm sure they'll want to fix it as quickly as possible.

  29. No one ever looks at rzip by Mr.Ned · · Score: 3, Interesting

    http://rzip.samba.org/ is a phenomenal compressor. It does much better than bzip2 or rar on large files and is open source.

    1. Re:No one ever looks at rzip by Anonymous Coward · · Score: 0

      rzip has an outstanding combination of speed and compression ratio. See this review.

  30. Decompression Speed by Hamfist · · Score: 3, Interesting

    Interesting that the article talks about compression ratio and compression speed. When considering compression, decompression time is extremely relevant. I don't mind waiting longer to compress the fileset, as long as decompression is fast. I normally compress once, and then decompress many times (media files and games for example).
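
    For anyone who wants to measure both directions themselves, here is a minimal timing sketch in Python. It assumes the standard-library zlib and lzma modules as stand-ins for the tools in the article; the function name and the sample log data are hypothetical.

```python
import lzma
import time
import zlib


def time_roundtrip(compress, decompress, data: bytes):
    """Return (compress_seconds, decompress_seconds, compressed_size)."""
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    unpacked = decompress(packed)
    t2 = time.perf_counter()
    if unpacked != data:
        raise ValueError("round trip did not reproduce the input")
    return t1 - t0, t2 - t1, len(packed)


if __name__ == "__main__":
    # Hypothetical sample: synthetic log lines, which compress well.
    sample = b"".join(b"log line %d: GET /index.html 200\n" % i for i in range(200000))
    codecs = {
        "zlib -9": (lambda d: zlib.compress(d, 9), zlib.decompress),
        "lzma": (lzma.compress, lzma.decompress),
    }
    for name, (c, d) in codecs.items():
        ctime, dtime, size = time_roundtrip(c, d, sample)
        print(f"{name}: compress {ctime:.3f}s, decompress {dtime:.3f}s, {size} bytes")
```

    The round-trip check doubles as a basic verification that the decompressed output matches the input.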

    1. Re:Decompression Speed by droleary · · Score: 1

      When considering compression, decompression time is extremely relevant. I don't mind waiting longer to compress the fileset, as long as decompression is fast.

      Another important consideration is how the format allows you to access the contained files. Starting in 2005 I finally switched from .tar.gz to .tar.bz2 for some backups to save quite a bit of space, but just last week I had to pull a couple files from a 2.15GB (compressed to 1.03GB) backup and it took ages to decompress and get at what I wanted. In 2006 I'm planning on switching to a compressed disk image (.dmg on my Mac; at 1.13GB just 100MB larger than the .tar.bz2 in question) because saving a lot more time in access has shown to be better than saving a bit more space in storage.
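
      The access difference can be sketched in Python. This assumes a .zip as a stand-in for any container with a per-file index (it is not a .dmg), and the helper names are made up for illustration:

```python
import io
import tarfile
import zipfile


def build_archives(files: dict):
    """Pack the same files into an in-memory .tar.bz2 and an in-memory .zip."""
    tar_buf, zip_buf = io.BytesIO(), io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w:bz2") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, payload in files.items():
            zf.writestr(name, payload)
    return tar_buf.getvalue(), zip_buf.getvalue()


def read_from_tar(blob: bytes, wanted: str) -> bytes:
    # One bz2 stream wraps the whole tar, so everything stored before
    # `wanted` has to be decompressed and skipped first.
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:bz2") as tar:
        return tar.extractfile(wanted).read()


def read_from_zip(blob: bytes, wanted: str) -> bytes:
    # The zip central directory records each member's offset, and members
    # are compressed independently, so only `wanted` is decompressed.
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return zf.read(wanted)
```

      Both calls return the same bytes; the cost of getting there is what differs, and it grows with archive size for the .tar.bz2 case.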

  31. Tutorial with rzip, graphs and bandwidth by Anonymous Coward · · Score: 0

    rzip wasn't reviewed but it uses hashing to quickly look for previously seen data. I think it's great. A tutorial with it and other linux compression tools is here. The tutorial also has graphs that make it easy to see the trade offs between speed and compression ratio, as well as advice on which compressors increase effective bandwidth the most for your CPU and network speed.
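
    To give a feel for "hashing to look for previously seen data", here is a deliberately simplified Python sketch. It is not rzip's actual algorithm: it assumes fixed-size, aligned blocks and a plain dictionary, where a real long-range matcher would use a rolling hash to catch unaligned repeats. The function name is hypothetical.

```python
import hashlib


def find_repeated_blocks(data: bytes, block_size: int = 64):
    """Return (later_offset, earlier_offset) pairs for block-aligned repeats."""
    seen = {}      # block digest -> offset of first occurrence
    repeats = []
    for offset in range(0, len(data) - block_size + 1, block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha1(block).digest()
        first = seen.get(digest)
        # Confirm the bytes really match before recording a repeat.
        if first is not None and data[first:first + block_size] == block:
            repeats.append((offset, first))
        else:
            seen.setdefault(digest, offset)
    return repeats
```

    Because the lookup is a hash table hit rather than a scan of a small sliding window, a repeat can be found no matter how far back the first copy sits, which is why this kind of approach pays off on large files.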

  32. Because it makes a hell of a lot of sense. by cbreaker · · Score: 4, Insightful

    If you're familiar with Usenet, you've probably encountered PAR files from time to time. A PAR file is a parity file which can be used to reconstruct lost data. It works sort of like a RAID, but with files as the units instead of disks.

    Let's say you have a 200MB file to send. You could just send the 200MB file, with no guarantees that it will reach the destination uncorrupted. Or, you could use a compression program and bring it down to 100MB. In this case, even if you lost the first transfer, you could transfer it a second time. Then we look at PAR. You compress the 200MB file into ten 10MB files. Then, you could include 10% parity - if any of your files is bad, you'd be able to reconstruct it with the parity file. With only 110MB of transfer. PAR2 goes even further by breaking down each file into smaller units.
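
    The parity idea can be sketched with plain XOR, as in simple RAID. This is an assumption-heavy simplification: real PAR/PAR2 files use Reed-Solomon coding and can repair more than one lost block, and the function names here are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(blocks):
    """Build one parity block covering every data block."""
    return xor_blocks(blocks)


def rebuild_missing(blocks, parity):
    """Recover the single block marked None from the survivors plus parity."""
    missing = [i for i, b in enumerate(blocks) if b is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can repair exactly one missing block")
    survivors = [b for b in blocks if b is not None]
    repaired = list(blocks)
    repaired[missing[0]] = xor_blocks(survivors + [parity])
    return repaired
```

    With ten equal data blocks and one parity block, the overhead is the same 10% as in the 110MB example above, and any one damaged block can be rebuilt.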

    Besides transfer times and correction for network transfers, compression can also increase speeds of transfer to mediums. If you have an LTO tape drive that can only write to tape at 20MB/sec, you'll only ever get 20MB/sec. Add compression to the drive, and you could theoretically get 40MB/sec to tape with 2:1 compression. That means faster backups, and faster restores. On-board compression in the drives takes all the load off the CPU - but even if you use the CPU for it, they're fast enough to handle it.

    Not to mention, it takes a lot less tape to make compressed backups. I don't know what world you live in, but in mine, I don't have unlimited slots in the library and I don't want to swap tapes twice a day. Handling tapes is detrimental to their lives; you really want to touch them as little as possible.

    Data corruption isn't caused by compression. If it's going to happen, it'll happen regardless. While your point is true that it MAY be more difficult to recover from a corrupt file, that's not the right methodology. If your backups are that valuable, you'd make multiple copies - plain and simple.

    I can't fathom why a responsible and well informed admin would avoid compression.

    --
    - It's not the Macs I hate. It's Digg users. -
    1. Re:Because it makes a hell of a lot of sense. by Master+of+Transhuman · · Score: 0, Troll

      "While your point is true that it MAY be more difficult to recover from a corrupt file, that's not the right methodology. If your backups are that valuable, you'd make multiple copies - plain and simple."

      Two problems with your response:

      1) If your data is that valuable, compressing makes it more likely to lose it.

      2) If your data is that valuable, making two copies takes twice the time and space - even with compression - and if you use compression and get a bad sector, fifty percent of your backup is now useless. Sure, the odds are good that you can recover from the second backup - but if IT has a bad sector - even in a different place - possibly because your device is going bad - then you've lost the second backup as well.

      If you backup more than once UNCOMPRESSED, you can recover almost anything because it is VERY unlikely that a bad sector will occur in the exact same spot or even in the same file (assuming the one file does not take up most of the specific media.)

      If your data is valuable, back up twice uncompressed. If your data is only so-so valuable, back up twice compressed. If your data is easily replaced, back up once uncompressed. NEVER back up once compressed - you might as well not back up at all then.

      Alternatively, use PAR files to recover - as long as you're willing to add the extra space and time - which sort of obviates the advantage of compression, doesn't it?

      And if the only valid argument for compression is saving the cost of media, then obviously your data is less valuable than you think it is - in which case why bother backing it up at all (other than legal requirements)? The cost of media simply is not a factor in comparison to the cost of the time required to back it up, the cost of the time to restore if needed, and the value of the data itself. That is being "penny-wise and pound-foolish" - a typical attitude among geeks who are obsessed with efficiency over effectiveness. Save a few gigabytes of space and lose the data - yeah, that's real smart...

      If you want to back up quickly and securely, have two devices backing up simultaneously uncompressed - or two devices backing up simultaneously compressed with PARs. You can't lose - it's that simple.

      --
      Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
    2. Re:Because it makes a hell of a lot of sense. by cbreaker · · Score: 2, Insightful

      I don't understand why you think compression automatically destroys the chance of recovery? And how encoding in ASCII is better? What's the thing about "sectors"? I never said using a compressed volume on a hard disk was a good idea. Compressed files can be recovered too, you know. If you have the forensic expertise to recover a corrupted non-compressed file, chances are you'd also be able to recover the data from a compressed one.

      The cost of media is not the only argument for compression - in fact I didn't mention media price at all. I did mention the library capacity, however - and getting an even bigger library is a much more expensive prospect than the $.75 you quoted per GB. Did you read the whole part of my post about speeds? If I can restore that database in half the time because of compression, that means less down time and less money lost. (Although, the money-lost factor doesn't really apply at a government institution; we're not selling anything.)

      "If you backup more than once UNCOMPRESSED, you can recover almost anything because it is VERY unlikely that a bad sector will occur in the exact same spot or even in the same file (assuming the one file does not take up most of the specific media.)"

      Wouldn't this apply to a compressed backup, too? You're assuming here that the file was unchanged in between the two backups - thus it would apply to any data, compressed or not.

      "Alternatively, use PAR files to recover - as long as you're willing to add the extra space and time - which sort of obviates the advantage of compression, doesn't it?"

      No - it simply lowers the compression ratio a bit. If you're getting 2:1 compression and add 10% pars, you're still looking at a 1.8:1 compression ratio, but with recoverability.
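
      The arithmetic behind that figure, as a small Python sketch. It assumes the parity volume is expressed as a fraction of the compressed size; the function name is made up for illustration.

```python
def effective_ratio(compression_ratio: float, parity_fraction: float) -> float:
    """Original size divided by (compressed size + parity data).

    parity_fraction is the parity volume as a fraction of the compressed size.
    """
    compressed = 1.0 / compression_ratio
    return 1.0 / (compressed * (1.0 + parity_fraction))


if __name__ == "__main__":
    for parity in (0.10, 0.20, 0.50, 1.00):
        ratio = effective_ratio(2.0, parity)
        print(f"2:1 compression + {parity:.0%} parity -> {ratio:.2f}:1")
```

      At 10% parity, a 2:1 ratio works out to 2 / 1.1, or roughly 1.8:1; the ratio only falls back to 1:1 when the parity data is as large as the compressed archive itself.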

      ----

      Within every IT budget, you must balance out the speed, recoverability, and cost of your backup solution.

      In your solution of never using compression (since no admin should do that, you mentioned) you lose a lot of speed in backups and restores. Speed of recovery is a key factor in many environments. It's often the top question asked when discussing new backup solutions. You talk about this as an important point, yet excluding compression could double your restore times, or more. Not to mention backup speeds - if you can take your backups in half the time, you effectively double the number of servers you could back up in the same amount of time. Or, you reduce the amount of time servers are busy with backups.

      Recoverability is big - you want your backups to be reliable. Most of the time, any corruption is unacceptable, be it in a compressed file or not. It's either good or you throw it out and go back to the previous backup. Many IT shops are doing multiple backups these days - backup to disk first, then to tape. Then take snapshots of those tapes and bring them off-site. Compressed or not, testing your backups and ensuring you have no problems with hardware is much more effective than using uncompressed backups and performing forensics on them if they're bad. Speaking of which, I don't see why compressed data would be less recoverable.

      Finally, you have cost. Yes, even when data recoverability is a key factor, you still have to consider cost. So, what makes more sense? Using uncompressed backups that will back up and restore slower, cost a lot more for media and library capacity, and cause more personnel overhead for swapping tapes - or using compression and cutting all that in half? You'd rather lose all that on the off chance that MAYBE you could recover more of your data, on the off chance that NONE of your other backups are good? I don't know any responsible IT manager that could agree with you.

      A proper backup and recovery plan with periodic testing and multiple copies held on-site and off is a much more effective solution than betting on forensic recovery of uncompressed data.

      Hey, I'm not claiming that compression is always right in every situation. That's far fro

      --
      - It's not the Macs I hate. It's Digg users. -
    3. Re:Because it makes a hell of a lot of sense. by Master+of+Transhuman · · Score: 1

      "If you have the forensic expertise to recover a corrupted non-compressed file, changes are you'd also be able to recover the data from a compressed one."

      This makes no sense at all. Most people don't have ANY "forensic expertise" - and it's far easier to recover data from an uncompressed backup than a compressed one.

      Look, this is really very simple. You backup one file on two CDs. Both of them end up on the same sectors because the backup is identical. One of the CDs gets a bad sector so you can't retrieve that file (without external assistance either from the archive program or PAR files.) The other CD is not likely to have a bad sector ON THE SAME FILE (unless as I said, the file is big enough to take up at least fifty percent of the CD - which an archive of many files, no matter how compressed, is obviously more likely to do than any single file).

      Therefore it is easier to prevent backup corruption by not compressing the backup (unless you want to go to the trouble of using PAR files - which ALSO have to be backed up and which are very large, thus significantly impacting the benefit of compression.)

      PARs aren't ten percent of the file - they can be as big as the file itself if you're paranoid and need to recover multiple errors (not usually necessary, especially on CD media, although it can be when using them to handle Internet-transmitted files.) If your PARs are 20-50% of your file, compression's value is reduced significantly. If you get 2:1 compression, you could end up with 1.5:1 or less. Which, added to the low cost of disk space and the dangers of corruption, makes it much less useful.

      Speed of recovery is also impacted by compression - you have to uncompress those files before you can use ANY of them. Uncompressed files can be used immediately off the backup medium if necessary and can be restored individually more easily.

      I'm aware that compression helps backup time. The proper solution for a corporate environment is enough backup servers to handle the job in the required time frame, or sufficiently high performance backup methods such as hot backups disk-to-disk followed by slower offline backups to tape (if the volume warrants.)

      The reality is that many organizations have discovered that defects in media or drives result in totally lost archives that could have been prevented by backing up those files individually. And the cost of lost data is almost always higher than the time spent manipulating media or doing the backups in the first place.

      I don't have a small shop mindset. I'm aware that some organizations are battling backing up data warehouses with several terabytes of storage using backup tools that at most handle a couple hundred gigabytes and which require almost the entire night to back up. In such situations, compression may be necessary. But I think the issue should be decided on the basis of the actual value of the backup vs. the nickel and diming of media cost and the time to back up. As I said, if the data is that valuable that it needs to be backed up nightly, then the cost of media and servers is irrelevant (c'mon, what does another server cost if you're a company with a petabyte of data?), and the time to back up obviously calls for more servers and a more effective backup method.

      As for sending uncompressed files over the Net, obviously compression is useful there - I have no problem with that. The issue of whether it is necessary when doing a BACKUP over the Net is another matter. I can see compressing a backup in that situation because the cost of bandwidth can be significant, as well as the fact that the data transfer rate is usually much slower than disk to disk. However, in some cases, it might still be better to decompress on the final storage side - again, it depends entirely on the value of the data and the cost of losing it.

      The bottom line: the point of backup is RECOVERY. If you can't recover, none of your backup policies are relevant. And archiving and compressing backups increases the chance of losing the data. It's that simple. And for what? To save some time? To save some media which costs pennies per gigabyte?

      --
      Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
  33. Okay, I'll look at rzip then... by Anonymous Coward · · Score: 0

    Those are some pretty impressive compression ratios, but how does rzip do speed-wise? Is it faster, slower, or about the same as bzip2?

    Regardless of how fast it is, it looks like it's worth considering if you have large files to compress. Thanks for pointing it out--I'll give it a try next time I make backups.

  34. Unicode support? by icydog · · Score: 3, Informative

    Is there any mention made about unicode support? I know that WinZip is out of the question for me because I can't compress anything with Chinese filenames with it. They'll either not work at all, or become compressed but the filenames will turn into garbage. Even though the data stays intact, it doesn't help much if it's a binary and has no intelligible filename.

    I've been using 7-Zip for this reason, and also because it compresses well while also working on Windows and Linux.

  35. Agreed! by p3d0 · · Score: 1

    Why use JPEG or PNG when you can just use .BMP files?

    --
    Patrick Doyle
    I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
  36. accuracy test missing by Grimwiz · · Score: 2, Insightful

    A suitable level of paranoia would suggest that it would be good to decompress the compressed files and verify that they produce the identical dataset. I did not see this step in the overview.

    --
    -- Don't believe everything you read, hear or think
  37. There's an article in there somewhere? by cbreaker · · Score: 4, Insightful

    All I see is ads. I think I found a paragraph that looked like it may have been the article, but every other word was underlined with an ad-link so I didn't think that was it..

    --
    - It's not the Macs I hate. It's Digg users. -
    1. Re:There's an article in there somewhere? by theArtificial · · Score: 0

      This stumped me for a second too, there is a link in the lower right, and a drop down menu with an index of the different pages of the article. Glad to know I wasn't alone.

      --
      Man blir trött av att gå och göra ingenting.
  38. JPG compression by The+Famous+Druid · · Score: 5, Interesting

    It's interesting to note that Stuffit produces worthwhile compression of JPG images, something long thought to be impossible.
    I'd heard the makers of Stuffit were claiming this, but I was sceptical, it's good to see independent confirmation.

    --
    Quidquid Latine dictum sit, altum videtur (anything said in Latin sounds important)
    1. Re:JPG compression by Kris_J · · Score: 1

      It's just a shame they've sat on this technology for almost a year now without releasing anything new.

  39. lzip? by Anonymous Coward · · Score: 0

    What about lzip? I've heard good things about this archiver but its homepage seems to have gone down. Here's the archive.org link:

    http://web.archive.org/web/20041010014034/http://lzip.sourceforge.net/index.html

  40. Completely out of context by EdMcMan · · Score: 4, Informative

    It's a crime that the submitter didn't mention this was with the fastest compression settings.

  41. Why does ANYBODY Bother with WinZip? by Master+of+Transhuman · · Score: 3, Interesting


    Proprietary, costs money...

    I use ZipGenius - handles 20 compression formats including RAR, ACE, JAR, TAR, GZ, BZ, ARJ, CAB, LHA, LZH, RPM, 7-Zip, OpenOffice/StarOffice Zip files, UPX, etc.

    You can encrypt files with one of four algorithms (CZIP, Blowfish, Twofish, Rijndael AES).

    If you set an antivirus path in ZipGenius options, the program will prompt you to perform an AV scan before running the selected file.

    It has an FTP client, TWAIN device image importing, file splitting, convert RAR into SFX, converts any Zip archive into an ISO image file, etc.

    And it's totally free.

    --
    Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
    1. Re:Why does ANYBODY Bother with WinZip? by smoker2 · · Score: 1

      Yeah, but does it run linux ?

    2. Re:Why does ANYBODY Bother with WinZip? by Master+of+Transhuman · · Score: 1


      Don't know if anybody has tried running ZipGenius under WINE. I've heard 7-Zip can be run with WINE.

      The nice thing about ZipGenius is I can download a tar.gz file while operating from my Windows side and unpack it and examine it without having to boot up Linux.

      --
      Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
  42. This test is worthless by Dwedit · · Score: 3, Informative

    They are testing 7-zip at the FAST setting, which does a poor job compared to the BEST setting.

    1. Re:This test is worthless by imsabbel · · Score: 1

      Same for ANY other program involved.

      Not to mention that some programs differ a LOT between fastest and slowest and some don't...

      It's just bullshit.
      Same for his example data: nearly EVERYTHING there was already compressed inside the file container... who the fuck wants to save space by compressing video or jpgs?

      A realistic test would be stuff where compression actually saves something, like log files. A look at maximumcompression tells me that there are programs that can compress apache logs to less than half the size bzip2 manages...

      --
      HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
    2. Re:This test is worthless by Urusai · · Score: 1

      Took this long for people to notice? They don't include "default" and "best" compressions; I only use "best" myself. If I'm bothering to compress, it's to save space, not time, otherwise I'd save even more time and not bother.

      I hope they wait for the dupe until after they actually finish the article.

  43. Multiply Packing by gagge · · Score: 0, Redundant
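    // Packs five base-3 values (each 0, 1 or 2) into one byte: 3^5 = 243 <= 255.
    void SendVariables(int *var)
    {
        unsigned char c = 0;
        for (int x = 0; x < 5; x++)
            c = c*3 + var[x];              // var[0] ends up as the most significant digit
        SendChar(c);
    }

    void RetrieveVariables(int *var)
    {
        unsigned char c = ReceiveChar();   // assumed counterpart to SendChar(); not in the original post
        for (int x = 4; x >= 0; x--)
        {
            var[x] = c%3;                  // peel off the least significant base-3 digit
            c = c/3;
        }
    }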
  44. Lest We Forget - Philip W. Katz by BigFoot48 · · Score: 4, Interesting
    While we're discussing compression and PKZip, I thought a little reminder of who started it all, and who died before his time, may be in order.

    Phillip W. Katz, better known as Phil Katz (November 3, 1962-April 14, 2000), was a computer programmer best-known as the author of PKZIP, a program for compressing files which ran under the PC operating system DOS.

    http://en.wikipedia.org/wiki/Phil_Katz

  45. mnb Re:Nice Comparison... by Anonymous Coward · · Score: 0

    The firewall is a service.
    Disable it.
    Simple as that.

  46. Opensource PKZIP? by Anonymous Coward · · Score: 0

    I wouldn't say PKZIP is like opensource tarballs using gzip or bzip2.

    Infozip comes to mind.

    Also check out textfiles.com creator Jason Scott's BBS documentary if you haven't yet!

    # Compression tells the story of the PKWARE/SEA legal battle of the late 1980s and how a fight that broke out over something as simple as data compression resulted in waylaid lives and lost opportunity.

    http://bbsdocumentary.com/

  47. Philip W. Katz: BBS documentary by Anonymous Coward · · Score: 0

    The extremely controversial debacle mentioned on wikipedia is also discussed in Jason Scott's BBS documentary.

    http://www.bbsdocumentary.com/

    Compression tells the story of the PKWARE/SEA legal battle of the late 1980s and how a fight that broke out over something as simple as data compression resulted in waylaid lives and lost opportunity.

  48. Linux journal by Anonymous Coward · · Score: 0

    A while ago, linux journal had a great comparison of a lot of programs, with a lot of options, comparing speed and resulting size. If you want to know something about compression on unix, go and look. Everything! It even convinced me to buy the magazine! (Yep, I start to sound like an ad). Anyway, check this link

  49. Another compression test by Jugalator · · Score: 1

    I used to like this one: Archive Comparison Test, but unfortunately it hasn't seen updates since 2002 for general data compression. However, that's still in the post-WinRAR 3.00 era, and the Windows archiver summary explains a bit why WinRK may win here, but still not be too well-known. Good compression isn't everything -- one often has to keep the speed aspect in mind too. And when you've then picked an archiver with nice compression for the speed, you may start looking at the feature set. Again WinRK isn't state-of-the-art there. It's mostly a pure no frills compressor where you can ignore durations, especially for large archives. Not nearly "an archiver for everyone".

    Personally, after a couple of years of testing things out (OK, make that a decade -- time flies), I believe RAR far exceeds most archivers' feature sets nowadays, and also hits the sweet spot of good compression at reasonably good speeds. I think RAR trumps WinZIP 10, 7-zip, bzip2, and all the other common archivers you throw at it on features, and does really well in the compression field for being such an all-rounder. It can decompress most common archive formats too. All that for a lower cost than WinZIP, while to me looking just as easy to use.

    WinACE was once an archiver preferred by some over RAR, but it sort of died out due to a lack of updates, or at least by falling behind RAR's improvements. What once looked promising there now looks more like a rarely used RAR-wannabe to me.

    7-zip is the one other archiver that has recently caught my attention because it's open source and generally compresses better than RAR, still at pretty good speeds. However, it's nowhere near RAR's feature set and still lacks too many features that are important to me for me to use it, but I keep an eye on it, I don't dislike it at all, and I can clearly understand why some prefer it. 7-zip has now overtaken bzip2 (which in turn overtook gzip) as my favorite open source archiver, and its cross-platform support is looking better these days with OS X, Debian, Fedora, and Gentoo support, although unofficial, directly from its home page.

    --
    Beware: In C++, your friends can see your privates!
    1. Re:Another compression test by Anonymous Coward · · Score: 0

      Again WinRK isn't state-of-the-art there. It's mostly a pure no frills compressor where you can ignore durations, especially for large archives. Not nearly "an archiver for everyone".

      Please take a look at the recently released WinRK v3.0. This version goes a long way towards making WinRK an 'archiver for everyone' with many new features, and more reasonable speed modes.

      Malcolm

  50. Thanks. And data != pr0n. by mosel-saar-ruwer · · Score: 1

    If your data is that valuable, compressing makes it more likely to lose it.

    Thanks - I was getting a little lonely there.

    I think part of the problem is that most /.-ers believe that

    data == pr0n
    But, of course, pr0n has no inherent integrity, therefore it seems to me that maybe the concept of data integrity is essentially meaningless to the average /.-er.

    1. Re:Thanks. And data != pr0n. by cbreaker · · Score: 1


      Just because you might have a huge library of porn that you need to back up, doesn't mean all of us do too.

      Explain to me why and/or how compression reduces data integrity? In fact, I'd argue that it's the other way around in many cases where compression is appropriate since you'll know if corruption occurred due to errors in decompression. Otherwise, you might not have as much fair warning. You do realize that we're talking lossless compression, not lossy (like JPEG) compression?

      Please. Explain.
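
      A minimal Python sketch of the detection point, assuming gzip as the compressor (its stream carries a CRC-32 of the original data); the helper names and the sample records are made up for illustration:

```python
import gzip
import zlib


def flip_byte(blob: bytes, offset: int) -> bytes:
    """Return a copy of blob with every bit of one byte inverted."""
    damaged = bytearray(blob)
    damaged[offset] ^= 0xFF
    return bytes(damaged)


def gzip_detects_damage(packed: bytes) -> bool:
    """True if decompressing raises, i.e. the damage is noticed."""
    try:
        gzip.decompress(packed)
    except (OSError, EOFError, zlib.error):
        return True
    return False


if __name__ == "__main__":
    original = b"".join(b"record %d: value %d\n" % (i, i * i) for i in range(5000))
    packed = gzip.compress(original)

    raw_copy = flip_byte(original, len(original) // 2)
    gz_copy = flip_byte(packed, len(packed) // 2)

    # The raw copy is silently different; nothing flags it without a checksum.
    print("raw copy differs:", raw_copy != original)
    # The gzip copy normally fails its stream or CRC-32 check on decompression.
    print("gzip noticed damage:", gzip_detects_damage(gz_copy))
```

      This only demonstrates detection. It says nothing about how much of the damaged archive can still be recovered, which is a separate question.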

      --
      - It's not the Macs I hate. It's Digg users. -
    2. Re:Thanks. And data != pr0n. by Master+of+Transhuman · · Score: 1

      "since you'll know if corruption occured due to errors in decompression. Otherwise, you might not have as much fair warning."

      Oh, brilliant. Gee, I really wanna backup so I can find out the backup is no good...

      The point of backup is RECOVERY. If you can't recover, it does you absolutely no good to realize you have a corrupted backup. Get a clue.

      "You do realize that we're talking lossless compression, not lossy (like JPEG) compression?"

      You do realize we're talking about media corruption, not backup software corruption?

      --
      Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
  51. Compressing jpegs by Anonymous Coward · · Score: 1, Insightful

    It's rather pointless to compare gzip against anything else on jpegs, because jpeg already applies its own entropy coding (Huffman, in baseline JPEG) to the blocks that make up the image, so a general-purpose compressor has very little redundancy left to remove.

    Also for a lot of applications, compression speed is not important, decompression speed is. If you're distributing software, it's not that much of a problem if it takes a lot of time to compress, but if the install takes ages because the decompression is too slow it does matter.

  52. Embarrassing ads - This is an ad cash-grab by dr_skipper · · Score: 3, Insightful

    This is sad. Over and over slashdot is posting stories with nothing more than some lame tech review and dozens of ads. I really believe people are generating sites with crap technical content, packing them with ads, and submitting to slashdot hoping to win the impression/click lottery.

    Please editors, check the sites out first. If it's 90% ads and impossible to navigate without clicking ads accidentally, it's just some loser's cash-grab site.

  53. Requesting Entry into Comparison by l33tlamer · · Score: 1

    Please try l33tZip, the *BEST* compression software available. We have taken the best settings of WinRAR and changed its name to "FAST". OMFGWTFBBQ best invention ever!!!111

    --
    If I can do it, its probably not worth doing... probably
  54. Dual Proc Support? by Chordonblue · · Score: 1

    Yeah, in that same vein, how many (if any) of these compressors will take advantage of my shiny new Athlon 64 X2? It's amazing to see the difference in compression times with XVID or the new DiVX - but I have yet to see a compression program use two processors. That said, I usually use 7-zip as my main compression program. Flexible, compatible, free...

    --
    "...Well, there's egg and bacon; egg sausage and bacon; egg and spam; egg bacon and spam; egg bacon sausage and spam..."
  55. Compression and corruption by hackwrench · · Score: 1

    Back in the days of DoubleSpace, I used to not compress: bad sectors were common, it was easier to recover parts of a file when it was uncompressed, and the compression mechanism wasn't good at keeping the rest of the compressed disk image valid when parts of it got corrupted. Now I always compress NTFS volumes.

  56. Speed comparisons by Sheepdot · · Score: 1

    Ignoring that this article is just one big advertisement:

    900 MB of text data. Precisely 944,156,137 bytes of text files. AMD 1800 w/1.5 gig of RAM. Cable connection. My objective is often getting the data to someone else.

    Comparisons:
    7Zip = 5:24 to compress, :55 to decompress, file size 188,380,358 bytes.
    WinRK = 18:35 to compress, 3:48 to decompress, file size 132,097,001 bytes.

    Note that this is one of the fastest settings on 7Zip; I didn't have time to see whether 7Zip could beat WinRK in size.

    That's a difference of about 56 meg, which may seem like a lot, but imagine you just wanted to send these 900 megs of text files to someone in the least amount of time. With WinRK, add the 18:35 and 3:48 to get 22:23. With 7Zip, it's 5:24 plus :55 to get 6:19. That's a difference of 22:23 minus 6:19, or 16:04 (964 seconds), in which the extra 56 meg would have to be sent for 7Zip to break even.

    Actual amount to transfer: 188,380,358 - 132,097,001 = 56,283,357 bytes.

    I have a 256k upload, which is about 32 KB per second. 32 KB/s for 964 seconds is about 30.8 megs, which is less than the 56.3 megs WinRK saves. So 7Zip isn't quite as good in a one-peer-to-one-peer transfer over, say, a cable modem, but it could be argued that the excessive processor time needed to compress and (especially!) decompress the file is ridiculous when compared to the saved space.
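
    The arithmetic above can be checked with a rough sketch like the one below. It assumes a flat 32,000 bytes per second for the 256k upload (decimal units, no protocol overhead) and that compression, upload and decompression happen one after another; the function and variable names are made up for illustration.

```python
# End-to-end time = compress + upload + decompress, using the figures quoted above.

def total_seconds(compress_s, decompress_s, size_bytes, link_bytes_per_s):
    """Seconds needed to compress, send over the link, and decompress."""
    return compress_s + size_bytes / link_bytes_per_s + decompress_s


LINK = 32_000  # bytes per second (assumed: 256 kbit/s upload, decimal units)

seven_zip = total_seconds(5 * 60 + 24, 55, 188_380_358, LINK)
winrk = total_seconds(18 * 60 + 35, 3 * 60 + 48, 132_097_001, LINK)

# Link speed at which the two tools tie: the extra CPU time WinRK needs
# equals the upload time saved by its smaller file.
extra_cpu_s = (18 * 60 + 35 + 3 * 60 + 48) - (5 * 60 + 24 + 55)
saved_bytes = 188_380_358 - 132_097_001
break_even = saved_bytes / extra_cpu_s  # bytes per second

print(f"7-Zip total: {seven_zip / 60:.1f} min")
print(f"WinRK total: {winrk / 60:.1f} min")
print(f"Break-even link speed: {break_even / 1000:.1f} KB/s")
```

    Under these assumptions WinRK comes out ahead on any link slower than roughly 58 KB/s, and 7Zip on anything faster.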

    1. Re:Speed comparisons by Lehk228 · · Score: 1

      7zip can be cranked up to recockulous levels. IIRC Half Life 2 compressed to 1.5 gigs with 7zip in ultra mode, which requires 384 megs of RAM to compress.

      --
      Snowden and Manning are heroes.
  57. Compression bad for data's health. by Anonymous Coward · · Score: 1, Informative

    Why is it bad?

    Because data backups corrupt, but often they do not corrupt all the way.

    Which leaves the possibility open for partial recovery. Especially if only part of the data is needed, this can be "good enough".

    However if the entire data set is compressed and a part of it corrupts it can make it very difficult to recover the data that is still uncorrupted. In this case think of data compression as a low-grade form of data encryption.

    That's not to say that you CAN'T compress data in a safe way, it's just that you have to be very smart about it.

    Case in point.

    Let's say you're a Unix/Linux user. You have nice choices between tar, dd, dump, cpio, and other data copying utilities, each with their own strengths and weaknesses.

    Then you have different compression techniques to choose from: bzip2, zip, gzip, rar, etc.

    So let's say you choose to use gzip and tar, which are good old standbys that do a good job and are almost universally recognized and supported.

    However, the directory tree you want to back up is bigger than the medium you're backing up to. Say you have an 8 gig directory tree and you're backing up to 650 meg cdroms.

    So the kneejerk response is to:
    tar czf - source | split -b 650m backup-
    So that will create a bunch of backup-aa, backup-ab, backup-ac, etc etc files that are 650 megs each which is a nice size for backing up to cdroms.

    However, if one of the cdroms is burned incorrectly or gets lost, then when you go:
    cat backup-* | tar zxf -

    Then you have hosed all your data.

    So instead of compressing along with the tarball and THEN splitting, you make the tarball, split it, and then compress each piece. Use smaller pieces to make them easier to handle: different data compresses at different rates, so the compressed pieces vary in size, and small ones are easier to pack neatly into cdrom-sized nuggets.
    tar cf - source | split -b 50m backup-
    then do the gzip and copy with a simple script or whatnot.

    Now if you have a missing cdrom or part of the cdrom is toast...
    you gunzip all the remaining files...
    cat backup-* | tar xf -
    it will bitch when it comes across a partially-there file and exit.

    Then with some manual work you remove the backup files that worked out so far, then finish up with the backup. Tar will complain that it's missing the starting point for some data and ignore that, but when it comes to a file header then it will happily finish up with what it has left.

    You still lose some data, but the rest of the data is easily recoverable.
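
    The difference between the two orderings can be sketched in a few lines of Python. This is a toy model, not the tar workflow itself: an in-memory byte string stands in for the tar stream, four pieces stand in for the discs, and the helper name `split` is made up for the example.

```python
import gzip
import zlib


def split(blob, parts):
    """Cut blob into `parts` pieces of (nearly) equal size."""
    size = -(-len(blob) // parts)  # ceiling division
    return [blob[i:i + size] for i in range(0, len(blob), size)]


# Stand-in for the tar stream: 2000 short text records.
data = b"".join(b"record %06d: some payload text\n" % i for i in range(2000))
PARTS, LOST = 4, 1  # four "discs"; the second one is unreadable

# Scheme A: compress the whole stream, then split it.
# Only the pieces in front of the gap can be fed to the decompressor.
discs_a = split(gzip.compress(data), PARTS)
decoder = zlib.decompressobj(wbits=31)  # 31 selects the gzip container
recovered_a = b"".join(decoder.decompress(d) for d in discs_a[:LOST])

# Scheme B: split first, then compress every piece on its own.
# Every surviving piece decompresses independently.
discs_b = [gzip.compress(piece) for piece in split(data, PARTS)]
recovered_b = b"".join(
    gzip.decompress(d) for i, d in enumerate(discs_b) if i != LOST
)

print(f"compress then split: {len(recovered_a)} of {len(data)} bytes back")
print(f"split then compress: {len(recovered_b)} of {len(data)} bytes back")
```

    In this sketch the first scheme gives back at best the data in front of the missing piece, while the second gives back everything except the missing piece.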

    Also keep in mind that different data compresses differently.

    If you have uncompressed audio data it makes more sense to compress it to ogg, mp3, or flac before backing it up. Also with images it makes sense to aggressively compress them to png (lossless) or jpeg (lossy) before backing them up. You'll get much more efficient compression in sizes. (However seeing that most of us get our information, of this type, off the net then it's probably already compressed.)

  58. No "explanation". Just experience. by mosel-saar-ruwer · · Score: 1

    Please. Explain.

    Look - I don't have an "explanation".

    And I'm even receptive to two of the pro-compression arguments:

    1) The greater the compression of a particular file, the fewer sectors that particular file touches, hence the lower the probability that a single bad sector will kill that particular file. [Note that this argument holds only in the case of a single, isolated file; in particular, THIS ARGUMENT DOES NOT NECESSARILY HOLD FOR ALL FILES IN AGGREGATE.]

    2) The compression "algorithm" may include some extra error-correcting features above and beyond the error-correcting features of the file system and the underlying hardware, hence it is at least theoretically possible that the compression "algorithm" might make it easier to correct the error in situ.
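
    The first argument and its caveat can be put into rough numbers. The sketch below is only a model: it assumes independent sector failures with one fixed probability, invents every figure (failure rate, file count, sizes, 2:1 ratio), and counts a damaged solid archive as a total loss, which is the worst case.

```python
# All constants are hypothetical and chosen only for illustration.
P_BAD = 1e-6            # per-sector failure probability
FILES = 1000            # number of files in the backup
SECTORS_PER_FILE = 100  # size of each uncompressed file
RATIO = 0.5             # 2:1 compression


def p_damaged(sectors, p_bad=P_BAD):
    """Probability that at least one of `sectors` sectors is bad."""
    return 1 - (1 - p_bad) ** sectors


# Point 1: a single file touches fewer sectors once it is compressed.
p_raw_file = p_damaged(SECTORS_PER_FILE)
p_packed_file = p_damaged(int(SECTORS_PER_FILE * RATIO))

# The caveat: files packed into one compressed archive share one fate.
expected_lost_raw = FILES * p_raw_file
archive_sectors = int(FILES * SECTORS_PER_FILE * RATIO)
expected_lost_solid = FILES * p_damaged(archive_sectors)

print(f"one file, raw:        {p_raw_file:.2e}")
print(f"one file, compressed: {p_packed_file:.2e}")
print(f"expected files lost, raw files:     {expected_lost_raw:.2f}")
print(f"expected files lost, solid archive: {expected_lost_solid:.2f}")
```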

    Nevertheless, I have had to deal with corrupted files, and have had to write my own file-recovery software to examine and alter bad files [at the byte-level], and I can tell you that RECONSTRUCTING A CORRUPTED FILE BY HAND IS AN UNMITIGATED DISASTER - EVEN IF YOU HAVE ACCESS TO THE SOURCE CODE THAT CREATED THE FILE IN THE FIRST PLACE.

    Trust me - you do not ever EVER EVER want to be handed the task of re-creating a corrupted file - even if you have access to source code. 'Cause if you are given that assignment, you can just about kiss goodbye the next several weeks of your life.

    And if the corrupted file was compressed with some weird-ass compression scheme [for which you may or may not have the source code], then hell - it might take you YEARS to figure out what happened. Maybe even forever.

    1. Re:No "explanation". Just experience. by cbreaker · · Score: 1

      I appreciate the fact that it's more difficult to reconstruct a compressed file over an uncompressed one, although not impossible in the hands of the right people.

      What I'm saying is that to forgo compression - with its numerous advantages and only one questionable drawback - is a real silly thing to do. You've alluded to the idea that compression in fact causes corruption - which is inherently false (although you've backed off this argument a bit.) And you seem to ignore the point that speed and capacity are huge pros for compression, as if they had no bearing on the argument.

      I continue to believe that with a proper backup system and procedures, you'll never encounter a time when you'll have to reconstruct a corrupt file. I'll qualify that statement with this: Unless the file was corrupted from the source media. If you back up a hosed database, it'll be hosed on tape, and compression didn't play any part of it.

      You began this discussion with "I just can't fathom why a responsible admin would risk the possible data corruption that could come with compression" and I felt compelled to respond to this insulting statement. Compression isn't corruption - it's a staple of large data storage and it has been for decades. I just hope I never end up on the job with someone of the same opinion. I take data integrity very seriously, and your argument against compression has little credibility.

      Really: if you truly need that level of data paranoia (I can only think of maybe one or two institutions that might) a standard tape backup system just won't do for you anyway. There are other ways to ensure data integrity besides full-on backups, and they're designed with high availability and high integrity in mind. EMC's SRDF (in conjunction with a couple of Symms) and some of the Veritas replication tools help with these types of requirements. Of course, all of these systems employ some sort of compression to send the data over the wire.

      --
      - It's not the Macs I hate. It's Digg users. -
    2. Re:No "explanation". Just experience. by Minna+Kirai · · Score: 1

      You began this discussion with "I just can't fathom why a responsible admin would risk the possible data corruption that could come with compression" and I felt compelled to respond to this insulting statement.

      Take 10,000 files. Randomly kill 1 byte. You now have 9,999 good files.

      Compress 10,000 files into an archive. Randomly kill 1 byte. You now have ZERO good files.

      What is hard to understand about that? You act like you don't even recognize that category of problem.
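
      A scaled-down version of that experiment (100 blobs instead of 10,000 files) can be run with Python's zlib module. It is a sketch under stated assumptions: the "files" are made-up repetitive byte strings, the extractor is assumed to reject a stream as soon as zlib reports an error, and the helper names are invented for the example.

```python
import zlib

N = 100
# Hypothetical stand-ins for real files: 100 small, repetitive blobs.
files = [(b"file %03d contents " % i) * 50 for i in range(N)]
everything = b"".join(files)


def kill_one_byte(blob, pos):
    """Return a copy of blob with the byte at pos inverted."""
    damaged = bytearray(blob)
    damaged[pos] ^= 0xFF
    return bytes(damaged)


def unpack(blob):
    """Decompress, or return None when zlib rejects the stream."""
    try:
        return zlib.decompress(blob)
    except zlib.error:
        return None


# Layout 1: every file compressed on its own; the damage lands in one of them.
packed = [zlib.compress(f) for f in files]
victim = N // 2
packed[victim] = kill_one_byte(packed[victim], len(packed[victim]) // 2)
good_separate = sum(unpack(z) == f for z, f in zip(packed, files))

# Layout 2: all files concatenated and compressed as one solid stream.
solid = zlib.compress(everything)
restored = unpack(kill_one_byte(solid, len(solid) // 2))
good_solid = N if restored == everything else 0

print(f"compressed separately: {good_separate} of {N} files intact")
print(f"one solid stream:      {good_solid} of {N} files intact")
```

      A more forgiving extractor could still salvage whatever precedes the damaged byte in the solid stream, so zero is the worst case rather than a law of nature.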

      Compression isn't corruption

      Are you blind to the words "risk" and "possible"? Do you just skip over them like they weren't even there? You've just wasted a huge amount of typing, because no one will take you seriously since you evidently didn't accurately read the starting post.

    3. Re:No "explanation". Just experience. by cbreaker · · Score: 1

      Are you that ignorant of how backups work? Each file is compressed individually to tape or disk backup files.

      Are you also ignorant that the internet isn't all 100Mbit? You need compression or you'll be wasting a lot of time. And if you need to ensure integrity, you can use a recovery record in the archive and parity files on the outside.

      BILLIONS of compressed files are transferred over the internet every day without issue.

      If you take PROPER PRECAUTIONS instead of relying on forensic recovery you won't put yourself in a position to require it.

      Are you ignorant of risk mitigation? Are you ignorant of the fact that some risks are so insignificant that you shouldn't nerf your whole system because of them? How hard is that to understand? Disabling compression on backup or file transfer systems is a ridiculous idea.

      --
      - It's not the Macs I hate. It's Digg users. -
    4. Re:No "explanation". Just experience. by Master+of+Transhuman · · Score: 1


      Nobody is saying that compression BY ITSELF results in data corruption.

      What we are saying is that compression aggravates the problem of MEDIA corruption.

      That should have been obvious from the start. If it wasn't, my apologies - or I would apologize if this wasn't /.

      And I agree - a standard tape backup isn't reliable and never has been. It's also too damn slow - which is why disk to disk backup is taking hold, especially given the per-byte cost of disks vs tape these days. When a tape holding 200GB costs $100, and an entire hard drive holding 160-200GB costs less than $100, given the relative speed, the use of tape is not very logical. The only advantage tape has is ease of mobility - which is also its greatest disadvantage since that's usually where it gets damaged (other than in poorly maintained drives.)

      By the way, I have no objection to using compression to send data over the Net or a network - the protocols take care of reliability there (although I suppose there is a small chance of corruption as well which should be taken into consideration.)

      "with a proper backup system and procedures, you'll never encounter a time when you'll have to reconstruct a corrupt file."

      And this begs the question, which is - what IS a proper backup system and procedures? My point is that compression isn't. And if you have a proper system that does use compression, then most likely you're using such redundancy (multiple backups, PARs, etc.) that the actual value of the compression is much less than it seems to be.

      I actually have no problem with compression if the redundancy is sufficient. In many cases, a simple double backup will be enough (although if your drives are failing, there is a small chance of getting a bad sector on both backups - if both backups have archived data, you've lost them both.) It all depends on the value of the data.

      As I keep reiterating, the point of backup is RECOVERY. If you can't recover, because of compression or archiving, none of the justifications for using them are relevant.

      --
      Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
    5. Re:No "explanation". Just experience. by Master+of+Transhuman · · Score: 1

      "Are you that ignorant of how backups work? Each file is compressed individually to tape or disk backup files."

      That's how it works IF that's how it works.

      Most people here are arguing for ARCHIVED backups that are THEN compressed.

      I have no problem with individually compressing files - as long as the compression method used is standard and easily handled by decompression programs external to the backup software (because that can fail as well.) I have no problem with using standard zip or gzip compression tools to compress files individually for backup. The issue is the use of archives AND compression. That is just asking for trouble - one bad sector, the entire backup is hosed - unless other means of redundancy are provided.

      --
      Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
    6. Re:No "explanation". Just experience. by cbreaker · · Score: 1

      Your post pretty much proves your inexperience in enterprise backup systems, or in IT systems in general.

      "a standard tape backup isn't reliable and never has been. It's also too damn slow - which is why disk to disk backup is taking hold"

      Tape backups are pretty damned reliable these days, and very fast. You have to take care to pay attention to tape lifespan and any drive problems that might arise from time to time. Small computer tape systems starting with AIT/DLT have set the bar for reliability - LTO and SDLT have taken it a step further with leaders that hardly ever break. LTO2 tape drives can perform at 40MB/sec sustained with no compression. With 2:1, you'll get 80MB/sec sustained transfer rates. LTO3 doubles these figures, and LTO5 doubles them again. There's even an LTO6 on the horizon that will be able to sustain 540MB/sec with 2:1 compression. Do you know how many SATA hard disk spindles you'd need to even come close to these figures?

      Quite often, backups to tape are faster than disk based backups. Disk backup systems *DO NOT* replace tape systems. They supplement them. They provide faster access to backups - no moving tapes around a library and you get random access. There are definite advantages to disk backups - especially when you introduce a snapshot technology like EMC's BCV. I use disk backups where I work, and make copies to tape for longer-term storage and off-site storage.

      You can't take hard drives off-site. With single tapes holding up to 800GB of data in a smaller and lighter form factor, it's a no-brainer. You can drop a tape and it'll be okay - and you can even repair some tape cartridge damage. Drop a hard drive and you're screwed. Plus, who wants to carry 20 hard drives around to bring them off-site? They're heavy! Not to mention I've never seen a hot-plug bar-coded hard drive library..

      "I have no objection to using compression to send data over the Net or a network - the protocols take care of reliability there (although I suppose there is a small chance of corruption as well which should be taken into consideration.)"

      So you're willing to accept that corruption could occur over the network, but compression is okay here. Sounds contradictory to your argument.

      "actually have no problem with compression if the redundancy is sufficient. In many cases, a simple double backup will be enough"

      In fact, in most cases a single backup is enough in itself - you're not overwriting your backups every day are you? Especially if you backup to disk first, which would be your first copy.

      "And this begs the question, which is - what IS a proper backup system and procedures?"

      Perhaps a little more experience with enterprise level backup and recovery would serve you better than a response from me here. I covered the very basics to get you started in my other posts on this thread. At the very least, your backup system should be well documented and periodically tested using stand-by systems for restoring data.

      "As I keep reiterating, the point of backup is RECOVERY. If you can't recover, because of compression or archiving, none of the justifications for using them are relevant."

      And if lightning strikes, and blows up the building, it's all for nothing too. Better move underground. And if terrorists set off a nuke, you better have a bunker under there. Actually, turn the computers off, unplug them, and submerge them in concrete. That way, you won't have to worry about corruption. You better not drive to work anymore either, because if your life is of any value to you, you'd avoid the risk of a backup issue that you would be too dead to fix. Better not trust Iron Mountain with your tapes either, because the truck could get into an accident. It's all stupid, I know, just like avoiding compression because forensic recovery of a file MIGHT be more difficult. The idea is to mitigate corruption problems to begin with and have contingency plans, not deal with forensic recovery of corruption when it hits you.

      Hey, do your uncompressed backups if you need that level of paranoia - I've made my point and I don't see that it needs to continue being made.

      --
      - It's not the Macs I hate. It's Digg users. -
    7. Re:No "explanation". Just experience. by cbreaker · · Score: 1

      No, I'm not saying that files should be archived first, and then compressed onto tapes. There were two points of discussion - one was compressed archives, and it moved on to tape backup compression.

      Most tape drives these days employ hardware-based compression to ease the burden on the backup servers. That makes more sense as drive speeds increase faster than the CPU speeds of the backup servers. This rings true even more when you stream multiple jobs onto multiple tapes simultaneously to maximize throughput. Each file or unit that's backed up to tape is compressed on the fly, individually.

      I don't agree that the compression has to occur outside of the backup software. While it may be true that there could be a bug, the same is true wherever you position it. Rest assured, IBM, Legato and Veritas have pretty much worked through compression bugs a decade ago or more..

      It's not inherently true that one error in an archive means the archive is blown. A simple recovery record in the archive, available in most archivers, takes care of this handily. Even if you didn't do that, however, a little corruption doesn't mean the entire archive is hosed. It does depend on the archiver to a degree, of course. And if you must do this before backing up to tape, you can include parity files to mitigate risk of corruption.
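
      To make the parity idea concrete, here is a minimal sketch of the simplest possible scheme: one XOR parity block over equal-sized data blocks, which can rebuild any single lost block. It is not how any particular archiver's recovery record works (tools such as PAR2 use Reed-Solomon codes and can survive several lost blocks); the block contents and the function name are invented for the example.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-sized blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical archive cut into equal-sized blocks (padded if necessary).
blocks = [b"backup-part-0000", b"backup-part-0001", b"backup-part-0002",
          b"backup-part-0003"]
parity = xor_parity(blocks)

# Block 2 is lost to a bad sector. XORing the survivors with the parity
# block reproduces it.
lost = 2
survivors = [b for i, b in enumerate(blocks) if i != lost]
rebuilt = xor_parity(survivors + [parity])

# Cost of the protection: one extra block for every four data blocks.
overhead = len(parity) / sum(len(b) for b in blocks)
print(rebuilt == blocks[lost], f"overhead: {overhead:.0%}")
```

      The overhead figure is the trade-off in miniature: each additional failure you want to survive costs more parity, which eats into the space compression saved.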

      You're not asking for trouble as long as you take steps to mitigate risk. That's what good BURP (Back-Up and Recovery Practices) is all about.

      --
      - It's not the Macs I hate. It's Digg users. -
    8. Re:No "explanation". Just experience. by Master+of+Transhuman · · Score: 1

      "Tape backups are pretty damned reliable these days, and very fast. You have to take care to pay attention to tape lifespan and any drive problems that might arise from time to time."

      What's wrong with this picture? Do I need to say more? Well, I will anyway.

      The two items you mention have been true since tape drives were introduced. And they are exactly why tape is not reliable. Granted, today's tape devices are much more sophisticated than the old nine-track devices. But the fundamental problem of tape - iron oxide being pulled past a reader - hasn't changed. Tape kinks, tape gets erased, tape gets bent, drives get out of spec, the list goes on and on.

      Drop a hard drive? Irrelevant. Drop a tape? Irrelevant. Accidents happen and aren't relevant to the discussion.

      And yes, you can take hard drives offsite. In fact, that's a very smart move. And one hard drive with 200GB will fit in a safe deposit box (we're talking small business here, not General Motors, for this concept) more easily than a bunch of tapes (granted some of the tape cartridges are damn small these days.)

      Comparing the probability of bad media to lightning and nuclear weapons is just stupid. Bad media is very common - the latter aren't. Ignoring the point of backup being recovery is also not a very sophisticated debating technique - not to mention ignoring the point of enterprise IT backup which is precisely recovery (and archiving for legal reasons.)

      "There's even an LTO6 on the horizon that will be able to sustain 540MB/sec with 2:1 compression. Do you know how many SATA hard disk spindles you'd need to even come close to these figures?"

      And do you know, Mr. Enterprise Expert, how fast iSCSI, Ultra640, and Fibre Channel SANs go? Try 200-640MBps.

      Don't waste my time proving how little you know about enterprise backup. And I damn sure don't want you backing up MY enterprise with that kind of attitude about backup: save time, money and media - and lose the data.

      --
      Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
    9. Re:No "explanation". Just experience. by Master+of+Transhuman · · Score: 1

      I'm aware that current backup systems try to make compression and archiving more reliable.

      Now ask yourself: WHY do they do this? Answer: Because they KNOW it's a problem and they are trying to provide a solution.

      Whether they are successful or not depends on whether you can afford their particular solution. For enterprise users, this may be fine. For smaller companies, relying on devices that do not provide this sort of safety net is not a good idea. For critical data, relying on the company to be RIGHT about whether their solution works is also risky. Why take the chance?

      "a little corruption doesn't mean the entire archive is hosed. It does depend on the archiver to a degree, of course. And if you must do this before backing up to tape, you can include parity files to mitigate risk of corruption."

      Right - show me the archiver that can recover from corruption WITHOUT using some sort of parity system. I've NEVER had a damaged zip file be recovered by an archiver. They always fail. Your mileage may vary. I can believe that some zip archives can be recovered. My experience is not that promising.

      And parity files are indeed a good idea. But they reduce the value of compression significantly if you set the parity high enough to ensure recovery from even multiple errors. They also can make backup and recovery a more complicated process - that may be acceptable for the enhanced reliability of recovery, but again it adds to the cost of archiving and compression.

      A double backup without compression is easier (at least for those situations where the required backup time can be met by a double backup.) You need to do two backups anyway - one for onsite, one for offsite - so why not take advantage of that need to enhance your reliability?

      --
      Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
    10. Re:No "explanation". Just experience. by cbreaker · · Score: 1

      No, you tell ME, Mr. Drive Speed Expert, of one *single* drive that can perform at anywhere near what tape systems are capable of these days? You mentioned that a SAN could perform well, in a RAID CONFIGURATION. And only maybe. You mentioned pulling a single drive and moving it off site. How do you propose doing this in an enterprise environment?

      You mention small business for a hard drive going off site - how much have I talked about small business? None!

      "Drop a hard drive? Irrelevant. Drop a tape? Irrelevant. Accidents happen and aren't relevant to the discussion."

      Suddenly, it's irrelevant. Then so is compression! Right? What about the weight? Do *YOU* want to carry 20 or 40 hard drives around?

      "And I damn sure don't want you backing up MY enterprise with that kind of attitude about backup: save time, money and media - and lose the data."

      Ohh, snap. Ouch. That hurts. Here's a NEWSFLASH: time, money, and media are hugely important to a company, no? So why not get all the benefits of a modern tape backup system? Because you're a fool. Somehow, running compression on a backup kills backups. I prefer a sound backup methodology over irrational fear.

      "Comparing the probability of bad media to lightning and nuclear weapons is just stupid."

      Yes, and I said it was if you actually bothered to fucking read it. Your argument against compression is equally baseless and bullshit.

      You ignore points that you agree with, and you harp on the little things that you ignorantly don't. Over the past posts on this topic I've discussed the basics of any good backup practice. You can't see it because you haven't worked in the enterprise. Hey, fuck it. I'm arguing with dumb and dumber here.

      Points you've made:

      - Take entire SANs off site for safe keeping.
      - Don't use compression, it's evil. No better reason.
      - Tapes are the devil. Use USB drives instead.
      - Hard drives never go bad.
      - Forensic bit-recovery is better than a sound backup and recovery practice.

      Have fun with that.

      --
      - It's not the Macs I hate. It's Digg users. -
    11. Re:No "explanation". Just experience. by Master+of+Transhuman · · Score: 1

      Enterprises send backups offsite all the time. You can just as easily pull a bunch of removable drives and send them offsite as you can a bunch of tape. That's just a matter of organization. This is irrelevant, however, because in most cases you back up to SANs for LOCAL restoration, then back up to tape from the SAN for offsite archival storage (or last-ditch restoration if your building blows up.) (That, or you send the data over a commo link to another site for storage as tape or disk.)

      Neither of which is relevant to the archive/compression discussion. The issue is should the backup be archived and compressed rather than stored as individual (compressed or not) files.

      As for single drives running at tape speeds, what did I say? The new SCSI standards do just that.

      Also, exactly how expensive are these wonderful tape drives you refer to? Compare the cost of them to the cost of even an expensive SCSI drive. You want to save money by replacing an eight hundred dollar 300GB SCSI enterprise class hard drive with a FOUR THOUSAND DOLLAR 400GB (WITH COMPRESSION) Quantum LTO-3 tape drive? Good luck with that!

      And that doesn't count the cost of the media at $100 a pop! I can buy an entire (ATA) drive for that!

      Also, why do you think Quantum is making hard disk backup systems now? Because the market is demanding it, that's why.

      Once again, you compare accidents to compression. Accidents are just that - accidents. Archiving/compressing is deliberately INVITING data loss. That's a difference.

      I said nothing about taking SANs offsite. I'm talking about backing up to SANs for local restoration and in specific response to your ignorant statement that tape is faster than hard drives - which is bullshit. And if tape IS faster than hard drives, then why the hell is time to backup being better with compression your primary argument? If tape is so fast, then you don't need compression, so why not gain the extra reliability?

      "Hard drives never go bad." I never said that. However, hard drives ARE far more reliable than tape - especially if they are only being used for backup. While modern tape systems are far better than the old nine-track - as I SAID - they are still far from completely reliable.

      "Forensic bit-recovery is better than a sound backup and recovery practice."

      Never said that for an instant. You might try rereading my posts. What I said was that sound backup and recovery practice precludes using archiving and compression together for critical data - and if you do it, you'd better have redundancy built in which means PAR files which means increased complexity and cost for the dubious benefits of archiving and compression.

      I also said that the value of data to be restored outweighs the minor cost savings of media and time achievable with archiving/compression - points you have done nothing to dispel and everything to ignore.

      --
      Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
    12. Re:No "explanation". Just experience. by cbreaker · · Score: 1

      "You want to save money by replacing an eight hundred dollar 300GB SCSI enterprise class hard drive with a FOUR THOUSAND DOLLAR 400GB (WITH COMPRESSION) Quantum LTO-3 tape drive?"

      First of all, LTO-3 is 400GB Native/800GB with 2:1. They can do 80MB/sec sustained for uncompressed streams.

      Second of all, you can get new LTO-3 tapes for less than $100 each. They're still new, so they're still a bit expensive. Of course, that compares very favorably to a used, low-end 300GB, OEM SCA SCSI disk at $500. Chances are you'll need to buy an approved disk with special housing or risk voiding your warranty - which easily doubles the price, or more. And you can't seriously buy used disks for production backups. And - I'd love to see a single hard disk stream 80MB/sec for the entire 300GB, especially at bargain basement prices. SATA drives will yield less performance, albeit for less money. But again, you'll be hard pressed to insert a SATA drive you get at CompUSA into most storage cabinets.

      It's not the cost of the drive, but rather how much you can back up with them when they're utilized in libraries and longer term storage, the ease of swapping media, and the feasibility of off-site storage. You can't swap out hard drive spindles from a hard drive. And what? You get a free drive shelf with every 10 disks you buy or something? Those things are expensive, even for a lower cost, older tech SCSI enclosure.

      Let's do a little simple math here:

      Drive enclosure: $5000 (PV220S) per 14 slots, filled with 14 100GB drives @$800: $16,200
      Total space: 4200MB (uncompressed)

      LTO-3 10 slot library with one drive: Adic 10-slot library, $7000. 10 tapes: $900. Total: $7900.
      Total space: 4000MB (uncompressed)

      You could buy 90 more tapes for the extra cost of the disk array, which is another 36TB (uncompressed.)
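
      The arithmetic above can be rechecked with a few lines of Python. This is only a sketch: the prices are the ones quoted in the post, the 300GB drive size is an assumption inferred from the stated 4200 total (14 x 300), and the variable names are made up for illustration.

```python
# Figures quoted in the post. DRIVE_GB is an assumption inferred from the
# stated total of 4200 (14 drives x 300 GB).
ENCLOSURE_COST = 5000
DRIVE_COST = 800
DRIVE_COUNT = 14
DRIVE_GB = 300

LIBRARY_COST = 7000
TAPE_COST = 90        # 10 tapes for $900
TAPE_COUNT = 10
TAPE_GB = 400         # LTO-3 native capacity

disk_total = ENCLOSURE_COST + DRIVE_COUNT * DRIVE_COST
tape_total = LIBRARY_COST + TAPE_COUNT * TAPE_COST

# How many extra tapes the price difference would buy, and what they hold.
extra_tapes = (disk_total - tape_total) // TAPE_COST
extra_gb = extra_tapes * TAPE_GB

print("disk:", disk_total, "USD for", DRIVE_COUNT * DRIVE_GB, "GB")
print("tape:", tape_total, "USD for", TAPE_COUNT * TAPE_GB, "GB")
print("difference buys", extra_tapes, "tapes, about", extra_gb / 1000, "TB")
```

      With these inputs the difference buys slightly more than the 90 tapes claimed, so the post's rounding is on the conservative side.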

      Disk backups are not cheaper than tapes, so don't even try to make that argument.

      "Also, why do you think Quantum is making hard disk backup systems now? Because the market is demanding it, that's why."

      You've already said that. And they don't replace tape backup systems, they complement them. They don't make sense for long-term storage. Unless, such as in your case perhaps, you're backing up a single SBS server or something. For very small shops, you might get away with a Maxtor "single button" backup USB drive. But as you say, if you need full redundancy, you'll have to buy a few of them and the costs add up fast. You'd be better off with a lower end tape drive.

      "hard drives ARE far more reliable than tape - especially if they are only being used for backup"

      Are you sure about that? Do you know of a SCSI enclosure that spins down disks when they're not being actively used? Otherwise, it doesn't really matter. Hard drive failures are about equal in idle (running) drives versus actively used ones.

      I agree that hard drives are less prone to failure than tapes as a general rule, but the risk is great enough in both mediums that the difference doesn't even matter. Usually, disk backup systems are in a RAID configuration - which of course makes it all but stupid to try and take RAID sets off site on a regular basis for off-site storage. Plus, they're more fragile in transit, heavier, and don't hold as much as tapes.

      "What I said was that sound backup and recovery practice precludes using archiving and compression together for critical data - and if you do it, you'd better have redundancy"

      Wait - so if you have critical data, but you don't use compression, you don't need redundancy? I know that's not what you mean, but I feel as though the risks are equal regardless of whether or not you compress; meaning, you need to take the same precautions either way.

      "which means PAR files which means increased complexity and cost for the dubious benefits of archiving"

      I thought we got past this. The conversation evolved into tape backup/disk backup systems. Neither of which would ZIP your files in

      --
      - It's not the Macs I hate. It's Digg users. -
    13. Re:No "explanation". Just experience. by Master+of+Transhuman · · Score: 1

      You can't compare media to disk just on capacity. When I say $100 per media is bad compared to disk, that means the media is worthless without the drive. The $100 disk IS the drive. You have to ADD the cost of media to the cost of the drive. How many of those 400GB tape cartridges does somebody need? Add that to the cost of the drives.

      14 100GB drives at $800 apiece - nice, ignore the 300GB drives out now. Cut the cost by a factor of three, doesn't it? That's simple math, too.

      I never said disk made sense for long-term storage. I said you CAN take it offsite for offsite storage. I didn't say how long. Disk drives have stiction problems when not used for a long time; tape has tensioning problems and heat and humidity problems if not stored properly. Long-term storage is another issue entirely than storage for restoration purposes. Again, I'm not talking about archival storage, as I said repeatedly; tape is adequate for that and that's all it should be used for in my opinion - and in the opinion of a lot of other people these days who advise on enterprise IT.

      When I said redundancy, I refer to either PAR files or multiple backups of the same data. If the data is critical, you'd better have one or the other - and no matter whether you use disk or tape. That's the original point of discussion. If you use archiving and compression together, you damn sure better have redundancy if the data is critical, i.e., data that absolutely has to be there if it needs to be restored.
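
      To make the redundancy idea concrete, here is a minimal sketch of parity-based recovery. It is not how PAR files work internally (they use Reed-Solomon codes and can survive several lost blocks); plain XOR parity is swapped in here because it is the simplest scheme that can rebuild one missing block. The function names are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(blocks):
    """Parity block: the XOR of all data blocks."""
    return xor_blocks(blocks)


def recover_missing(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(blocks)

# Pretend the middle block was unreadable on restore.
restored = recover_missing([blocks[0], blocks[2]], parity)
print(restored)
```

      The cost is one extra block per set and the ability to lose only one block; multiple full copies, the other option mentioned above, cost more space but tolerate more damage.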

      Yes, I really do mean backing up individual files (compressed or not depends on the criticality of the individual files) to disk without archiving and compression. You have to read the freakin' files to compress them, it's far faster and more reliable to just back up the entire file. Yes, it eats up space - which is only important to a home user like me who has to pay for DVDs. For corporate data, the media cost is irrelevant to the value of the data (again, distinguishing between critical data and purely archival data.)

      Your entire argument is about cost, about time savings, about this and that - and has nothing to do with the core issue - making sure that data can be recovered when it is needed.

      Here's the bottom line I found on the Net which sums it all up pretty well:

      "Diogenes Analytical Laboratories, an IT advisory company that performs independent product lab evaluations and advises IT buyers, estimated that, on average, between 5 and 20 percent of nightly tape-based backup/recovery jobs fail. According to ESG, roughly one-fourth of SMB users reported that 20 percent or more of their tape-based backup/recovery jobs fail. The number one reason cited: media failure (e.g., lost, damaged or corrupted tapes).

      In addition to media failure, there is the high percentage of operational errors, including operator (human) errors such as storing the wrong tape, and procedural errors such as backing up wrong or empty files. Any time there is human intervention, the opportunity for error increases dramatically.

      Analysts generally agree that IT management costs represent five to seven times the cost of capital expenditure. It is therefore essential to consider operation and administrative costs (including media management and tape swapping) in addition to acquisition costs when doing a financial analysis of tape versus alternatives. For large organizations, the tape media costs (which generally come out of the expense budget versus capital budget) mount up as well...

      For a simple cost comparison, W. Curtis Preston offers this scenario: a midrange tape library costs roughly $4 to $11 per gigabyte (GB) while disk prices are hovering around $3 to $11 per GB. This puts disk and tape about even (excluding reliability and recovery time factors).

      Capacity Optimization

      Perhaps the most significant change in cost comparison comes from the introduction of a new technology which some have named capacity optimization (CO). Also called data reduction, disk deduplication, or commonality factoring,

      --
      Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
    14. Re:No "explanation". Just experience. by cbreaker · · Score: 1

      Sorry, but I need to point something out:

      "14 100GB drives at $800 apiece - nice, ignore the 300GB drives out now. Cut the cost by a factor of three, doesn't it? That's simple math, too."

      Obviously it was a typo - if you read the post instead of trying to think of what to say next, you'd have seen that the total was 4200MB, which meant I quoted the 300GB drives you said were about $800 each. Jackass.

      I thought about posting a small "I meant 300GB" but I didn't think you were that dim.

      Hey if you look in the paragraph that you quoted, from that very satisfied user, you'd see the comment "we tape out during the day."

      Meaning, they STILL USE TAPES. Like I said, backup to disk is a great thing that complements backup to tape. I believe I did mention that I do backup to disk in my data center. It's not appropriate in every environment, in every situation, however. Disks are very expensive to run 24 hours a day, every day, and they have a high cost per MB over tapes. And realistically, you can't take them off site.

      The only way to eliminate tapes is to transfer your data out once it's been backed up via some other method, such as replication. This is NOT feasible in many environments. First of all, you need an off-site. Second, you need enough bandwidth to that site to handle the load, which will likely be extremely expensive. Third, it might not fit into your company's IT systems. It's potentially great for large companies with enough capital and huge IT budgets. I'm willing to bet that the survey you quoted was of medium-large companies.

      Perhaps they ship fewer tapes out to Iron Mountain than they used to, which is a benefit. If they can afford a big enough disk system to hold data online for three months, good for them. It makes data retrieval easier, definitely. But it's expensive spinning all those disks 24 hours a day, and if they have a large volume of data, it's extremely expensive. And it doesn't replace tapes - they still need to ship tapes off site in the event that there's a flood or something that destroys the array or building. And it's NOT a reliability thing, it's an ease of use thing.

      Notice that they aren't shipping hard disks off-site? And where does it say they don't use compression? I'll bet a week's salary that they DO do compression to those disks. And that the backup sets sit in large files on that array, much like a big archive that you're so afraid of.

      The whole quote simply rehashes what I said, not what you've been saying.

      You speak of cost like it doesn't matter. There's a difference between spending your money on a solid solution and spending far too much money with no benefit. Shit, it would be great if we could have fifty redundant data centers with geospan clusters. I mean, cost is moot, right? What's a few million?

      At least I didn't cut'n'paste some random marketing materials to hold my argument. You did that for me.

      Good day.

      --
      - It's not the Macs I hate. It's Digg users. -
    15. Re:No "explanation". Just experience. by Master+of+Transhuman · · Score: 1

      So now that you're hit with the facts of the industry move to disk-to-disk, you're upset.

      I didn't bother to add up GB from your post because I couldn't care less - and you were the one in such a hurry to drop my argument you mistyped your figures.

      "they still use tapes." - Never said tapes weren't useful as archival. I also said that while offsite backup via network was useful, it was costly and suitable only for those who can afford it relative to other means of offsite transport. So you're being redundant again.

      As for compression, they probably use Content Optimization, which was another focus of the article. While this is a form of compression, it's a more reliable form (although I'm sure they use regular compression as well - that wasn't my point in this case - the point was disk-to-disk is replacing tape for the purpose I cited - local emergency restoration. And it is.)

      Expensive spinning disks? Jesus, you are cheap, aren't you? Complaining about the power cost to spin drives - many of which probably aren't spinning at all until they're backed up to. Compared to the value of the data and the obvious value of quick efficient restoration in the event of primary system failure, power costs are not relevant to these people, obviously. And disks spin whether they're used for backup or not. They outlast tape drives in longevity, too. I don't know how many times I've heard people say that, at least the cheaper tape drives, if not the really expensive ones, barely last a year.

      The quote does not rehash anything you said, it clearly explains my points about disk-to-disk. The only thing it says about archiving and compression, however, is that TAPES FAIL - which is why I say don't use archiving and compression together.

      Last night I received a confirmation of my policy: I was changing my partition structure on my home system, and had to wipe and restore a partition from a backup made yesterday as well. That went okay, even though it was some 11 DVDs with 250,000 files. I then decided to restore some data I'd taken off to make room last year, about 3GB worth that would be hard to replace should it be lost. That data was backed up on the old LiteOn DVD drive that was flaky.

      Restoring it on the newer NEC drive, sure enough, a bad sector read caused one of the two DVDs to fail to restore one or more files. Since I'd had the foresight to back up the data twice, I used the second DVD to restore the missing files. And this was even with individual uncompressed unarchived files. Had I backed up with archiving and compression (and no PARs or other redundancy), I would have lost 3GB of data.

      That says it all. There isn't anything more to be said.

      --
      Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
    16. Re:No "explanation". Just experience. by cbreaker · · Score: 1

      Yes that does say it all. Don't back up to DVDs using an old LiteOn DVD.

      Disks do cost a lot of money to spin. Unless you're talking USB backup drives, which I'm not, spinning fifty 10K RPM disks in SCSI enclosures isn't cheap. Disks can fail quite often when you have a bunch of them, and yes, power isn't free. The several thousand dollars a year it could cost you on power should be on the list of expenses here. It's not something that would tip the scale usually, but it's a cost.

      Backup to Disk is a great thing that's only been recently a big deal because the costs have come down enough to make it feasible. It's still expensive, but it's not as unbearable as it has been in the past. All along I've said that I agree that backup to disk is a great thing. I do believe we'll see more and more of it. It can and does replace tape systems as a complete solution - but you still need tapes to store more than a few weeks or a couple months of backups, unless you were able to procure a huge amount of backup disks. And you still need tapes to take off site. My stance on this hasn't changed at all on these points.

      The thing is, you just don't "get it." So I can't really continue a discussion with someone that absolutely will not see my point of view at all. On my end, I understand your concerns about compression, and "Archiving" as you call it - (what do you mean by that anyways? Wait, I really don't care) but I feel as though there's not going to be any difference in the way you perform your backup whether you compress or not. You either do it the right way to ensure that any corruption won't kill you, compressed or not, or you don't. It doesn't matter what you're backing up - if you get corruption it's a junk backup. To disk, tape, whatever. It can and will occur and you shouldn't be more relaxed with your backups just because you believe you should be able to reconstruct corrupted uncompressed files.

      You put up these hypothetical situations where you didn't perform the right steps to ensure your backup was good, and somehow not using compression saved you. These situations don't matter in (any) business. If you'd have tested your backup, you wouldn't have been in the position of having to muddle through a corrupt restore. If you'd have had two copies of your backup (disk and tape, or tape and tape, or whatever) you'd have not had to deal with it. If you had a previous backup from the night before, you'd have not had to deal with it. It's not about saving money, it's about doing the right thing. If you have a good backup plan, it doesn't even matter if compression caused the entire backup to be no good (although I don't agree that it would.) Don't you see?
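
      "Testing your backup" can be as simple as comparing checksums of the source and the copy. A minimal sketch, assuming a file-to-file backup; the file names and the helper names are hypothetical, and a real job would walk a whole directory tree and keep the digests somewhere safe.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large backups do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(source, backup):
    """True when the backup is byte-for-byte identical to the source."""
    return sha256_of(source) == sha256_of(backup)


with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "data.bin"
    backup = Path(workdir) / "data.bak"
    source.write_bytes(b"critical data" * 1000)
    shutil.copyfile(source, backup)
    good = backup_matches(source, backup)

    # Simulate a bad sector by flipping one byte in the copy.
    damaged = bytearray(backup.read_bytes())
    damaged[100] ^= 0xFF
    backup.write_bytes(bytes(damaged))
    bad = backup_matches(source, backup)

print(good, bad)
```

      The check works the same way whether the backup is compressed or not, as long as the hash is taken over the restored bytes.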

      Maybe it's a difference in IT experience. When I talk about cost savings, I'm talking for a medium to large enterprise system. The difference between two cost factors could be $350,000. In your place, if you're dealing with small data sets, the ratio of cost might be the same but the values wouldn't be. $400 is only $400 no matter the size of the company.

      Oh well, thanks for the discussions. It's been fun. Happy new year.

      --
      - It's not the Macs I hate. It's Digg users. -
  59. Pile of ad-laden shit article by hazem · · Score: 2, Funny

    I know not many of you actually RTFA, but that article was so damned annoying. There's a table in there - think it's to compare compression schemes? nope - it's for processors. There are red links.. article related? Nope - ad links. Blue underlined links - yup, for more ads.

    What a steaming pile of shit. Happy new year.

  60. The analysis is kind of silly by speedplane · · Score: 2, Informative

    Lossless data compression is a pretty well studied subject. Shannon started it back in the 40's and plenty of research has gone into it since.

    There are basically three ways to do lossless compression: Huffman, Arithmetic, and LZW. Technically Huffman can achieve the best of three, however it's generally the worst because of implementation issues (it would take a lot of processing to do rigorous Huffman encoding).

    Arithmetic coding is generally better but is difficult to implement. I think IBM is the company who actually sells an arithmetic coder (I could be wrong though).

    LZW is by far the best of the three (you can read online how it works), but alas it is patented and anyone who gives away free copies of it will get sued.

    I know for a fact that gzip uses Huffman, which would explain its lackluster performance. I haven't researched it further, but I would not be surprised if the three proprietary compression programs which "won" this review use LZW. I also wouldn't be surprised if they pay a good amount to LZW's patent holders (Unisys I think).

    I'd be interested to see how gzip performs on its "maximum compression" setting. Like I said earlier, Huffman can achieve the theoretical limit on compression where LZW cannot.

    --
    Fast Federal Court and I.T.C. updates
    1. Re:The analysis is kind of silly by yeremein · · Score: 1

      The LZW patent expired in 2003 (in the US) or 2004 (in Europe and Japan). In any event, LZW is just an optimization of LZ, which was never patented and is widely used (for example, by NTFS).

    2. Re:The analysis is kind of silly by speedplane · · Score: 1

      LZW is just an optimization of LZ
      No it isn't. LZ has a serious flaw in it which occurs pretty frequently. The 'W' was a guy working for Sperry (they made submarines) who found the flaw and fixed it. It was Sperry which originally had the patent on the 'W' part which made it proprietary.
      The next part gets a bit fuzzy.... I think Sperry sold it to another company which eventually made its way to Unisys. A previous commenter said that the patents expired, which would be great. But I think there may be more legal trouble ahead (otherwise why wouldn't gzip use it?).

      --
      Fast Federal Court and I.T.C. updates
  61. Are you serious? by Noose+For+A+Neck · · Score: 1

    Using gzip to back up terabytes of files sounds like a very dumb idea, since gzip has no error recovery mechanisms.
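
    The distinction here is between detecting damage and repairing it. The gzip format stores a CRC-32 of the uncompressed data in its trailer, so corruption is noticed on decompression, but nothing in the format can rebuild the lost bytes. A small sketch using Python's standard gzip module; the helper name is made up for illustration.

```python
import gzip
import zlib


def decompresses_cleanly(blob):
    """Return True if the gzip blob decompresses without any error."""
    try:
        gzip.decompress(blob)
        return True
    except (OSError, EOFError, zlib.error):
        # A CRC mismatch surfaces as an OSError subclass; other kinds of
        # damage may surface as EOFError or zlib.error.
        return False


original = gzip.compress(b"backup payload " * 1000)

# The last 8 bytes of a gzip member are the CRC-32 followed by the length.
# Flip a single bit in the stored CRC to simulate media damage.
damaged = bytearray(original)
damaged[-8] ^= 0x01

print(decompresses_cleanly(original), decompresses_cleanly(bytes(damaged)))
```

    Recovery therefore has to come from outside the file: a second copy, or separate parity data.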

    --

    Software piracy is victimless theft.

  62. Re:Speaking of Comparisons by chronicon · · Score: 2, Interesting
    Speaking of Comparisons (Score:-1, Redundant)

    I knew I had seen this story before but it wasn't here. This article was up on Digg three days ago--with only three Diggs to its name (at the time of this writing), but it's front page news here? Interesting to say the least...

    I predict that this Digg will become frontpage Slashdot news shortly. It was quite popular (914 diggs so far) and it's hit the three-day mark...

    I know, this is all so OT, but it's no worse than whining about duplicate postings here...

    Oh the irony here is just too much to take without laughing! My comment gets hammered with the REDUNDANT pummel when I point out that /. is being REDUNDANT in posting old Diggs? Man, it just doesn't get any better than this to make a point.

    Moderators: did you catch the not-so-subtle play I made here by quoting ALL of my original message? In case you didn't, I'm being REDUNDANTLY sarcastic...

    Enjoy!

  63. Patents? by thogard · · Score: 1

    There are plenty of good compression algorithms out there but most of them are covered by patents. There have been many cases where a small company comes out with some cool new way of compressing stuff and is then later told to pay royalties. It can be a real pain trying to decompress data in a few years when the company that made the decompressor is no longer in business.

  64. Heh. Wonder if... by jd · · Score: 1
    ...that's because something in his test suite breaks with SP2. Be interesting to know! :)


    Seriously, there are many freebie compression tools which weren't mentioned but which are in common enough use that they can be regarded as highly significant in the market, or which are simply SO good that they are likely to become significant. Zzip and SZip are big ones that didn't get mentioned.


    Further, since speed is considered, it is unfair to list bzip2 without mentioning pbzip2 or bzip2smp (two parallel versions), as you'd obviously get a speed boost from non-sequential compression. Not sure what it does to the compression ratio, though.
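
    The general idea behind parallel bzip2 tools can be sketched as follows: split the input into chunks, compress each chunk as an independent bzip2 stream on its own worker, and concatenate the results. This is an illustration of the approach, not pbzip2's actual code; the chunk size and worker count are arbitrary assumptions, and the speed-up relies on the bz2 module releasing the interpreter lock while it compresses.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def parallel_bz2(data, chunk_size=900_000, workers=4):
    """Compress chunks independently and concatenate the bzip2 streams."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        chunks = [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        streams = list(pool.map(bz2.compress, chunks))
    return b"".join(streams)


data = b"".join(b"record %d\n" % i for i in range(200_000))

packed = parallel_bz2(data)
single = bz2.compress(data)

# bz2.decompress accepts several concatenated streams in one input.
restored = bz2.decompress(packed)

print("single stream:", len(single), "bytes")
print("chunked:", len(packed), "bytes")
print("round trip ok:", restored == data)
```

    Because each chunk starts with no knowledge of the others, the chunked output is usually somewhat larger than a single stream; printing both sizes shows how much on a given input.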


    Finally, some forms of compression - notably Huffman Compression - rely on the size of the compression table to determine how well you'll compress data. On a modern computer, where multiple gigs of RAM is no longer unusual, you could reasonably look at frequencies for 24-bit strings.


    Most Huffman compression will use 8-bit frequency tables, a few will use 16-bit, because the memory requirements get big. Fast. Not only do you need to record the total frequency of the wordsize you're using, you also need to have space to build the encoding tree. Even then, the level of compression you'll get will only improve to a point. After that, longer words will produce worse compression or even inflation.


    In the examples used - audio and video data - you will most likely want a 16-bit word for the audio and a 24-bit word for the video, because that reflects the nature of the data. 8-bit words on the frequency tables are going to be crap, because you're compressing random fragments of words, so artificially worsening the encoding tree you're going to build.
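
    The effect of word size on a Huffman coder can be tried out with a short script. This is a sketch under simplifying assumptions: it counts only the coded payload and ignores the cost of storing the code table (which grows with word size, as noted above), and the 16-bit test signal is synthetic. The function name is made up for illustration.

```python
import heapq
from collections import Counter


def huffman_payload_bits(data, word_bytes):
    """Bits needed for the Huffman-coded payload (code table not counted)."""
    words = [data[i:i + word_bytes] for i in range(0, len(data), word_bytes)]
    weights = list(Counter(words).values())
    if len(weights) == 1:
        return weights[0]  # a lone symbol still costs one bit per occurrence
    heapq.heapify(weights)
    total = 0
    while len(weights) > 1:
        # Merge the two rarest subtrees; every symbol beneath the merge
        # gets one bit longer, which adds exactly `merged` bits overall.
        merged = heapq.heappop(weights) + heapq.heappop(weights)
        total += merged
        heapq.heappush(weights, merged)
    return total


# Synthetic 16-bit little-endian "audio": a value that wobbles around 1000.
samples = [1000 + (i % 7) for i in range(5000)]
data = b"".join(s.to_bytes(2, "little") for s in samples)

print("8-bit words: ", huffman_payload_bits(data, 1), "bits")
print("16-bit words:", huffman_payload_bits(data, 2), "bits")
```

    On real data the table overhead has to be added back in before comparing, which is where the "only improves to a point" caveat comes from.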


    I have two points here. First is that by picking the right (or wrong!) parameters for the data, you can always rig a benchmark. My second point is that you can often tailor an algorithm that would normally be worse than some other algorithm such that the worse of the two will outperform the default behavior of the better one.


    Ideally, you'd use a form of arithmetic encoding, but that is so riddled with patents that although you could (in theory) develop a system which was numerically identical but did not infringe on the wording of the patents, most Open Source and low-cost vendors don't bother trying.


    The secret, in compression, is not to use default algorithms if you can avoid it. (Ideally, this would be when the compression header stores enough information for you to tweak table sizes, etc, so that off-the-shelf decoders will work with your custom encoders.)

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  65. Wrong, wrong, wrong, wrong, wrong. by hereticmessiah · · Score: 3, Informative

    Huffman coding and arithmetic coding are both entropy encoding algorithms. While perfectly fine compression algorithms in their own right, they're also commonly used to squeeze the last bits of entropy out of a data stream produced by another compression or transformation algorithm. Arithmetic coding suffers from chilling effects caused by IBM patents, and so isn't as commonly used as it might be. An unencumbered alternative is range encoding, which gives performance not too far off that of arithmetic coding. Range encoding and arithmetic coding are both variants of the same basic technique of entropy encoding. That said, the compression difference between Huffman coding and arithmetic coding is minimal. I think (though I'm not entirely sure), entropy encoding might be a subset of a larger family of algorithms called Markov modelling.

    LZW is a refinement on LZ78, which has other variants such as LZSS. It is a dictionary coding algorithm. Similarly, the DEFLATE algorithm is based on LZ77, another variant of dictionary coding. gzip uses DEFLATE, as does xZip and PNG. DEFLATE first compresses the stream with an LZ77 variant, and then compresses the resulting stream with Huffman coding to squeeze out some redundancy. LZW is no longer covered by patents, at least not here in Europe.

    So what you wrote about Huffman coding, arithmetic coding and LZW was largely misinformed. There are two lossless methods: entropy encoding and dictionary coding, Huffman coding and arithmetic coding representing the former and LZW representing the latter. Some compression algorithms combine the two, DEFLATE being an example.
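
    The two stages of DEFLATE can be observed with Python's standard zlib module, which exposes a strategy setting that turns off the LZ77 string matching and leaves only the Huffman stage. A minimal sketch; the sample text is arbitrary and the helper name is made up for illustration.

```python
import zlib


def deflate(data, strategy):
    """Compress with zlib at level 9 using the given strategy."""
    compressor = zlib.compressobj(level=9, strategy=strategy)
    return compressor.compress(data) + compressor.flush()


text = b"the quick brown fox jumps over the lazy dog\n" * 500

both_stages = deflate(text, zlib.Z_DEFAULT_STRATEGY)  # LZ77 matching + Huffman
huffman_only = deflate(text, zlib.Z_HUFFMAN_ONLY)     # Huffman stage alone

print("original:     ", len(text))
print("Huffman only: ", len(huffman_only))
print("LZ77 + Huffman:", len(both_stages))

# Both outputs are valid zlib streams and decode with the same decoder.
assert zlib.decompress(huffman_only) == text
assert zlib.decompress(both_stages) == text
```

    On repetitive input like this the dictionary stage does most of the work, which is why gzip cannot fairly be described as a Huffman-only compressor.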

    --
    I don't like trolls and mod against me if you like, but I'd prefer if you'd reply.
    1. Re:Wrong, wrong, wrong, wrong, wrong. by speedplane · · Score: 1

      entropy encoding might be a subset of a larger family of algorithms called markov modelling
      Markov modeling is used in linear prediction which is normally (although not always) a lossy form of compression. It is typically used in speech.

      There are two lossless methods: entropy encoding and dictionary coding
      Dictionary coding is a type of entropy coding. Arithmetic coding is another type of entropy coding, and so is Huffman. LZW was the first algorithm to use a dictionary approach and it is commonly synonymous with the approach.

      the DEFLATE algorithm is based on LZ77, another variant of dictionary coding. gzip uses DEFLATE,
      I would be surprised to hear that gzip uses both dictionary and Huffman coding. A proper dictionary based algorithm normally outperforms gzip by about 10 to 30%.

      --
      Fast Federal Court and I.T.C. updates
    2. Re:Wrong, wrong, wrong, wrong, wrong. by hereticmessiah · · Score: 1

      Markov chains are a Markov model, and I know of at least one lossless compression algorithm that uses them (LZMA).

      And for an explanation of how DEFLATE uses both LZ77 and Huffman coding, read this: http://www.zlib.org/feldspar.html, or you can read the RFC or the zlib source.

      --
      I don't like trolls and mod against me if you like, but I'd prefer if you'd reply.
  66. Coral Cache by Skal+Tura · · Score: 1

    Here's coral cache link: http://www.rojakpot.com.nyud.net:8090/showarticle.aspx?artno=4&pgno=0 for those who do not bother to edit the URL manually :)

  67. $3 a megabyte? by Mr2001 · · Score: 1

    If you're paying $3 per megabyte for cellular data, you're getting screwed.

    Mine is billed by the minute and comes out to between $0.17 and $0.54 per megabyte (assuming 100 kbps on average), depending on whether I'm using my plan minutes or overage minutes. And that's just during the day - it's free between 9:00 PM and 6:00 AM every day, all day Saturday and Sunday, and on holidays.
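
    For anyone wanting to redo the conversion between per-minute billing and per-megabyte cost, a rough sketch follows. The 100 kbps average is the assumption stated above; decimal megabytes are assumed, and the per-minute rates it prints are implied by the quoted per-megabyte figures, not taken from any rate card.

```python
LINK_KBPS = 100  # average throughput assumed in the post

# Megabytes moved in one minute at that rate (decimal megabytes).
mb_per_minute = LINK_KBPS * 1000 / 8 * 60 / 1_000_000


def implied_per_minute(cost_per_mb):
    """Per-minute price that corresponds to a given per-megabyte cost."""
    return cost_per_mb * mb_per_minute


print("MB per minute:", mb_per_minute)
print("plan minutes:    about $%.2f per minute" % implied_per_minute(0.17))
print("overage minutes: about $%.2f per minute" % implied_per_minute(0.54))
```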

    --
    Visual IRC: Fast. Powerful. Free.
  68. Tapes aren't so attractive nowadays by TheLink · · Score: 1

    On current x86 hardware I get on average ~30MB/sec with lzop and ~50% compression when imaging HDD images[1].

    USD100 for LTO3? Sure looks like tapes are pretty expensive. I'd use tapes for legacy backups or where physical shock is an issue, or when you have tons of tapes and need automated loaders.

    But removable hard drives seem a more attractive option for most cases nowadays (small to medium businesses). Large corporations can probably afford to be locked in to a particular tape technology, for the convenience of automated tape libraries.

    LTO3=400GB storage at 10MB/sec (native) @ about USD100 per tape and USD2K for the cheapest drive.

    SATA= 250GB storage at 40MB/sec (native+ average sequential transfer, 60MB peak) @ ~USD100 per drive.
    SATA hotswap cage = USD100-USD200.

    PATA+USB= 250GB storage at 20MB/sec (native+ average sequential transfer) @~USD100 per drive. PATA to USB enclosure USD30 to USD50 (for a decent one).

    Plus PATA/SATA is less of a "locked-in" technology compared to LTO3 or other tape technologies.

    With tape drives you have to deal with two main standards that could become obsolete. First = the tape standard (e.g. LTO3, DLT, DDS etc) , second = the tape drive interface standard (e.g. SCSI).

    If you got an expensive LTO2 drive in 2003 you are stuck with 200GB native capacity media. Same goes for DLT, DDS etc. You'd have to pay for an expensive LTO3 drive, and then when LTO4 comes out, you're still stuck with LTO3 capacity unless you pay for a probably expensive LTO4 drive.

    In contrast with hard drives you just deal with the drive interface standard (e.g. SCSI, SATA, PATA). With HDDs each "tape" comes with its own drive ;).

    If 800GB SATA drives become cheaply available, you can start using them with your existing backup systems.

    In desperate situations you are more likely to be able to find servers/PCs where you can plug the "media" to and start restoring stuff. Whereas with tape you need a mucho expensive tape drive for each backup/restore point.

    A decent backup/restore server with a decent drive cage can hold multiple drives and you can backup multiple machines to different drives, and on decent hardware you can get 40MB/sec for each backup (multiple gigabit interfaces, SMP/multi-core CPUs). If you have a server with a 4 drive cage and 2 x 1 gigabit NICs, you can easily get 4 x 40MB/sec backup/restore streams going from/to different targets.

    For tape you'd need four expensive tape drives to do that.

    You get sub-second random access. No need to wait 10 seconds to seek.

    Last but not least: if I have to restore backups with The Boss/Customer breathing down my neck, I'd pick 40MB/sec over 10MB/sec. Perspective: 11 hours to read 400GB from an LTO3 tape vs 2.8 hours to sequentially read 400GB from SATA drives.
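
    The restore-time figures follow directly from size divided by throughput. A quick sketch, assuming decimal units (1GB = 1000MB) and the sustained rates quoted above; the function name is made up for illustration.

```python
def restore_hours(size_gb, mb_per_sec):
    """Hours needed to read size_gb sequentially at mb_per_sec."""
    seconds = size_gb * 1000 / mb_per_sec
    return seconds / 3600


print("LTO3 at 10MB/sec: %.1f hours" % restore_hours(400, 10))
print("SATA at 40MB/sec: %.1f hours" % restore_hours(400, 40))
```

    Plugging in a different tape speed changes the comparison accordingly, so the conclusion depends heavily on which native rate is assumed for the drive.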

    [1] first 131MB of a disk image
    time dd if=drive.img bs=131072 count=1000 | lzop -c |wc -c
    1000+0 records in 1000+0 records out 66784879

    real 0m3.307s user 0m2.442s sys 0m0.842s

    39MB/sec 1.96:1 compression

    For first 131MB of linux kernel tar uncompressed ball (cached in RAM):

    time dd if=linux-2.6.14.4.tar bs=131072 count=1000 | lzop -c |wc -c
    1000+0 records in 1000+0 records out 46473494

    real 0m2.483s user 0m1.660s sys 0m0.821s

    52MB/sec 2.82:1 compression

    time dd if=linux-2.6.14.4.tar bs=131072 count=1000 | gzip -c --fast |wc -c
    1000+0 records in 1000+0 records out 36786087

    real 0m5.965s user 0m5.297s sys 0m0.659s

    22MB/sec 3.56:1 compression

    time dd if=linux-2.6.14.4.tar bs=131072 count=1000 | gzip -c |wc -c
    1000+0 records in 1000+0 records out 29724624

    real 0m11.283s user 0m10.640s sys 0m0.615s

    11.6MB/sec 4.41:1 compression
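
    The throughput and ratio figures above can be rederived from the raw numbers. A small sketch: the input size comes from the dd parameters (bs=131072, count=1000), the output sizes and wall-clock times are the ones printed by wc and time, and decimal megabytes are assumed.

```python
INPUT_BYTES = 131072 * 1000  # dd bs=131072 count=1000

# (compressed bytes reported by wc -c, "real" seconds reported by time)
runs = {
    "lzop, disk image": (66784879, 3.307),
    "lzop, kernel tar": (46473494, 2.483),
    "gzip --fast, kernel tar": (36786087, 5.965),
    "gzip, kernel tar": (29724624, 11.283),
}


def summarize(output_bytes, seconds):
    """Return (compression ratio, input MB processed per second)."""
    ratio = INPUT_BYTES / output_bytes
    mb_per_sec = INPUT_BYTES / seconds / 1_000_000
    return ratio, mb_per_sec


for name, (output_bytes, seconds) in runs.items():
    ratio, speed = summarize(output_bytes, seconds)
    print("%-24s %.2f:1 at %.1fMB/sec" % (name, ratio, speed))
```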

    First 131MB of wave file (cached in RAM)

    time dd if=somewave.wav bs=131072 count=1000 | lzop -c |wc -c
    1000+0 records in 1000+0 records out 128273334

    real 0m5.520s user 0m4.597s sys

    --
  69. An even more thorough comparison site by glyph42 · · Score: 1

    Jeff Gilchrist's Archive Comparison Test has been around for years, and covers many more archivers and uses several different data sets, on several different platforms. It has even been cited in compression literature:

    http://www.compression.ca/

    --
    Music speeds up when you yawn, but does not change pitch.
  70. 85 benchmarks by Morse in Sep'05 LinuxJournal by Anonymous Coward · · Score: 0
    Sept'05 LJ #137 Compression Tools Compared by Kingsley G. Morse Jr
    Choosing a compression utility is a delicate trade-off between CPU time and compression achieved. Get a perfect match for your available processing time and bandwidth.

    Use top-performing but little-known lossless data compression tools to increase your storage and bandwidth by up to 400%

    ... benchmarked 87 combinations of tools and levels

    PS, Comparison_of_file_archivers external links