Slashdot Mirror


Best Format for Archive Distribution?

Meostro asks: "I'm looking for the best format to use to distribute arbitrary datasets. Tarballs compressed with gzip seem to be the most common thing out there, with zip coming in a close second. What advanced compression packages are the most widely recognized or available on the widest array of systems? Cross-platform compatibility is my most important goal, followed by compression ratio, decompression time, compression time and extra features (solid archives, support for multiple files, etc.). I'm starting up a free data site to provide test data for anything you can imagine: images for compression and format interpretation, text and audio for language processing, programming language examples to test parsing, and more. I hope this will grow to be a significant (read: multi-gigabyte) archive, so I want to start off right with my distribution format. Right now the plan is data.tar.bz2, but i'm open to anything that will give me better compression as long as it's available for Linux, Windows and Mac."

18 of 109 comments

  1. Look into specific compress utilities, not generic by gus+goose · · Score: 2, Informative

    I have found that some formats are far better at some data types than others.

    e.g. for:
    text, ASCII, documents: use any of bzip2, gzip, zip.
    audio: leave MP3/AAC/etc. alone (already compressed); use FLAC for other "raw" formats.
    video: use the most appropriate encoding (MPEG-4/DivX, etc.) and then don't try to compress it further.

    bzip2 encodes and decodes more slowly than gzip, but typically achieves better compression ratios.
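
A quick way to check this kind of claim against your own data is to compress the same bytes with several codecs and compare sizes. The sketch below is only illustrative: the sample text and the helper name `compressed_sizes` are assumptions, and Python's `lzma` module stands in for the LZMA family that 7-Zip uses by default (it does not produce .7z containers). Results on real datasets will differ.

```python
import bz2
import gzip
import lzma


def compressed_sizes(data: bytes) -> dict:
    """Return the size in bytes of `data` under several standard codecs."""
    return {
        "raw": len(data),
        "gzip": len(gzip.compress(data, compresslevel=9)),
        "bzip2": len(bz2.compress(data, compresslevel=9)),
        "lzma": len(lzma.compress(data)),
    }


# Hypothetical sample: highly repetitive ASCII text.
sample = ("The quick brown fox jumps over the lazy dog. " * 2000).encode("ascii")
sizes = compressed_sizes(sample)

for name, size in sizes.items():
    print(f"{name:>6}: {size} bytes")
```

Running the same function over a representative file from each data type (text, images, audio) gives a rough per-type comparison before committing to one format.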

    So, use whatever people commonly use for the data type you are compressing.

    gus

    --
    .. if only.
  2. Re:Zip by EvilIdler · · Score: 2, Informative

    Zip and gzip use the same compression.

    Zip compresses each file in an archive individually.

    Tar+gzip compresses the entire contents as a whole - meaning better
    compression than zip archives (unless you add uncompressed files to
    an archive, THEN compress the entire archive..)
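
The per-file versus whole-stream difference can be sketched with Python's standard `zipfile` and `tarfile` modules. This is a minimal illustration, not a benchmark: the file names, the payload text, and the helper `build_archives` are made up for the example, and the gap depends heavily on how similar the files are.

```python
import io
import tarfile
import zipfile


def build_archives(files: dict) -> tuple:
    """Pack the same files as .zip and .tar.gz in memory; return both sizes."""
    # ZIP: every member is deflated on its own, so redundancy shared
    # between different files cannot be exploited.
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, payload in files.items():
            zf.writestr(name, payload)

    # tar.gz: the files are concatenated into one tar stream, and gzip
    # then compresses that stream as a whole.
    tar_buf = io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w:gz") as tf:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))

    return len(zip_buf.getvalue()), len(tar_buf.getvalue())


# Hypothetical dataset: many small files that are nearly identical.
body = b"Sphinx of black quartz, judge my vow. Pack my box with five dozen liquor jugs."
files = {f"sample_{i:03d}.txt": b"record %d: " % i + body for i in range(200)}

zip_size, tar_gz_size = build_archives(files)
print(f"zip: {zip_size} bytes, tar.gz: {tar_gz_size} bytes")
```

With many small, similar files the tar.gz comes out smaller; with a few large, dissimilar files the two formats end up much closer.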

    WinZip supports tar+gzip archives, from what I remember, but WinRAR
    supports .gz, tar.gz, .bz2 and .tar.bz2 files, so why use anything else
    on Windows?

    Then again, you could use solid RAR archives. Generally the best
    size+performance ratio I've tried of these (all compressed as a whole,
    some error recovery).

  3. My summary.. by Chris_Jefferson · · Score: 3, Informative

    I find that for my data (your data may be different) I tend to get the ordering "zip,bzip2,rar,7zip" (from best to worst), with rar and 7zip often being much smaller than bzip2 (my data tends to contain lots of similar large files, which tends to lead to unusually large differences between compressors).

    Everyone has an unzipping program. I find on Windows more people have (and get) rar than bzip2, particularly if they are afraid of command lines. 7zip gives the best compression of all of them, so it is particularly useful for big datasets (I often send around 9-20GB data sets), but it eats memory like nobody's business.

    It really comes down to how much you want to make people download compared to how much trouble you want them to go to.

    If you want to be "minimal effort", I'd advise providing a .zip along with other things, perhaps listing the size next to the files so people can see it's much bigger (like most sourceforge projects) for those windows users who can't be bothered to get anything else.

    --
    Combination - fun iPhone puzzling
    1. Re:My summary.. by Chris_Jefferson · · Score: 2, Informative

      damn, damn, damn, damn.. I meant of course that zip is the worst, 7zip the best... wish I could edit comments!

      --
      Combination - fun iPhone puzzling
  4. Re:CPIO by Meostro · · Score: 2, Informative

    Probably a good idea in general, but:
    1. No obvious support on Windows for cpio
    2. These are going to be test files, so I shouldn't have to worry about special devices / links / sparse files.

    I know our admin here at work switched the backups from tar to cpio for exactly that reason, but it's just not universal enough to justify the departure from "normal".

  5. Re:It really depends... by OAB_X · · Score: 2, Informative

    WinRAR can open tar.gz files.

  6. Re:One other choice by MindStalker · · Score: 4, Informative

    Similar to rar, I've found that ACE (www.winace.com) at maximum compression compresses most things better than RAR and is similar in functionality (it supports rar as well).
    Its Linux version is called unace and there is a macunace as well. Sadly these programs are a bit harder to find on the website, but they are there.

    Luckily gentoo knows it, so you can simply emerge unace.

  7. Re:One other choice by Meostro · · Score: 2, Informative

    I've used it on Windows forever, and I know I obtained unrar for Linux and AIX, but how cross-platform is RAR, really? Does it come standard in most distributions? I think if it does then it's probably an excellent choice; I've compressed some stuff almost 2:1 over bzip2 using RAR...

  8. RAR's license is garbage by Noksagt · · Score: 2, Informative

    While there are Free rar unpackers, the primary packer/unpacker has a proprietary license. The submitter is catering to open source developers, so it is a poor choice.

    md5 probably provides enough integrity checking for test data & split/cat make splitting/reassembling ANYTHING easy.
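
As a rough sketch of that approach, the checksum-plus-split workflow looks like this in Python. The helper names (`md5_hex`, `split_bytes`, `join_chunks`), the 64 KiB chunk size, and the in-memory sample data are assumptions for illustration; on real files the same thing is usually done with the md5sum, split and cat tools.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest, used here only to detect accidental corruption."""
    return hashlib.md5(data).hexdigest()


def split_bytes(data: bytes, chunk_size: int) -> list:
    """Cut data into consecutive chunks of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def join_chunks(chunks: list) -> bytes:
    """Reassemble chunks in order, the equivalent of cat part1 part2 ..."""
    return b"".join(chunks)


# Hypothetical archive contents standing in for a real file.
archive = bytes(range(256)) * 1000
expected = md5_hex(archive)

parts = split_bytes(archive, 64 * 1024)
restored = join_chunks(parts)

print(f"{len(parts)} parts, checksum ok: {md5_hex(restored) == expected}")
```

Publishing the digest next to each download lets anyone confirm that the reassembled file matches the original, whatever archive format is used.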

    RAR doesn't have a monopoly on integrating these, though. 7zip certainly has many of the service features available in rar.

  9. Technical format comparison chart by uler · · Score: 2, Informative
    I've got a rather technical format comparison chart started up [1]. It's still a draft, but it's pretty complete.

    It doesn't directly address relative compression ratios or benchmarks, and it's mostly about the formats themselves, not the libraries that implement them. But it's still good for a bird's-eye view, I think.

    [1] http://darbinding.sourceforge.net/about_dar.php (The chart is at the bottom of the page.)

  10. don't use rar, arj, 7zip, etc by mqx · · Score: 3, Informative


    These formats (rar, arj, 7zip, whatever) are not as widely supported as (a) tar/gz, tar/bz2, (b) gzip/bzip2, (c) zip. You can get (a)-(c) on just about every platform. Numerous times I've downloaded "unzip" to make (c) work. It's so simple.

    Another point not mentioned elsewhere is virus scanners: at least with the popular tar/zip formats, you know that virus scanners understand them and can look for problems.

    Sure, you may get a few extra features or a little more room out of non-standard archivers, but that's largely not an issue.

  11. Re:One other choice by Deagol · · Score: 3, Informative
    Isn't PKZip pushing the 20 year mark? And I think that Unix tar'ed and/or compress(1)'ed files are well over 20 years old.

    Give me a foo.tar.Z file from the early 80s and I can still uncompress it. Give me a foo.zip from a mid-80s BBS archive and I can still see what's inside.

    Also, see graphics formats.

  12. Re:Zip by JimDabell · · Score: 2, Informative

    Zip and gzip use the same compression.

    According to the ZIP file format specification, ZIP can use a dynamic LZW algorithm.

    The whole reason gzip exists is because the standard UNIX compress uses LZW - which, until recently, was protected by a patent (that was the problem with GIFs).

    Instead of using LZW, gzip uses the unprotected LZ algorithm, which doesn't contain the improvements that Welch (the 'W' in LZW) made.

    So not only do they not use the same algorithm, but that's the whole point of gzip in the first place!

  13. My Preference by vbrtrmn · · Score: 3, Informative

    For my own personal archives, I have taken the methods from the masters in USENET.

    OS X & UNIX: I'm lazy, so just tar.gz.

    Win32: I back up a lot more files under Win32 than *nix, so I use:

    Compression: WinRAR
      Compression Method: Best
      Split to Volumes: 20MB
    Parity: QuickPar
      With general settings.

    I back up to decent quality DVD media, as I have had a lot of problems with CD media rotting after about a year.

    --
    it's a sig, wtf?
  14. not what you asked, but... by leehwtsohg · · Score: 2, Informative

    gzip is really horrible at error recovery. Trying to recover data from a damaged tar.gz file is hard, because gzip does not keep byte boundaries. bzip2 is much better in this respect, and makes it much easier to recover from problems.
    Never back up using tar.gz - use tar.bz2 instead.
    Not on topic, but I had to say it, 'cause I've been hurt (nothing like finding that your .tar.gz backup was damaged, just when you need it)

  15. uharc by biryokumaru · · Score: 2, Informative

    no one seems to remember uharc these days. uharc still gives tighter compression than rar or bz2. it's hard to come by, though. personally i still use tar bz2, cuz no one can decompress uharcs, and it's usually a little better than rars, at least for source files. i think it was open source at some point, which solves the long-term dilemma faced with flash in the pan formats... if you could find the source.

    --
    When you're afraid to download music illegally in your own home, then the terrorists have won!
  16. 7-zip: No Drag-n-drop by denis-The-menace · · Score: 2, Informative

    7-Zip is a one-man project that needs help to add features
    or at least manage its SourceForge support and RFE forums.

    Drag-n-drop has been requested for almost 2 years and now,
    some of its users are defecting to TUGZIP because of it.
    http://sourceforge.net/tracker/index.php?func=detail&aid=663095&group_id=14481&atid=364481

    Either the guy is too busy, doesn't care or just doesn't want to share control.

    Maybe it's time to fork 7-zip?

    --
    Obama's legacy: (N)othing (S)ecure (A)nywhere and (T)error (S)imulation (A)dministration
  17. Re:stuffit by Hes+Nikke · · Score: 2, Informative

    While we are on the subject of JPEG compression, i recently launched http://jpgcrunch.com/, which reduces JPEG file sizes losslessly. that can help things too :D

    *Battens down the hatches for an incoming barrage of slashdot traffic*

    --
    Don't call me back. Give me a call back. Bye. So yeah. But bye our, well, but alright we are on a shirt this chill.