Slashdot Mirror


Best Format for Archive Distribution?

Meostro asks: "I'm looking for the best format to use to distribute arbitrary datasets. Tarballs compressed with gzip seem to be the most common thing out there, with zip coming in a close second. What advanced compression packages are the most widely recognized or available on the widest array of systems? Cross-platform compatibility is my most important goal, followed by compression ratio, decompression time, compression time and extra features (solid archives, support for multiple files, etc.). I'm starting up a free data site to provide test data for anything you can imagine: images for compression and format interpretation, text and audio for language processing, programming language examples to test parsing, and more. I hope this will grow to be a significant (read: multi-gigabyte) archive, so I want to start off right with my distribution format. Right now the plan is data.tar.bz2, but I'm open to anything that will give me better compression as long as it's available for Linux, Windows and Mac."

12 of 109 comments

  1. One other choice by gowen · · Score: 4, Insightful

    tar.gz and tar.bz2 are ok for small archives (20MB or so), but if you're dealing with large archives there's only one solution.

    RAR -- cross-platform, built-in integrity checking, and when used with parity files, makes splitting and reassembling archives an absolute doddle.

    --
    Athletic Scholarships to universities make as much sense as academic scholarships to sports teams.
    1. Re:One other choice by MindStalker · · Score: 4, Informative

      Similar to RAR, I've found that ACE (www.winace.com) in maximum compression compresses most things better than RAR and is similar in functionality (it supports RAR as well).
      Its Linux version is called unace and there is a macunace as well. Sadly these programs are a bit harder to find on the website, but they are there.

      Luckily Gentoo knows about it, so you can simply emerge unace.

    2. Re:One other choice by harrkev · · Score: 5, Insightful

      One problem with this is that it is not a common format. For limited use (one-time distribution, short-term backup), this is OK. But what about long-term archives?

      If you want to decompress this stuff in 10 or 20 years, will you be able to find software then that can handle it? Especially if the new Cell processors somehow become popular, will Windows BOHICA 2025 edition be able to run 20-year-old binaries in order to read this thing?

      If the source is available, the job is easier in Linux, but if the format is not actively maintained, it may take a lot of work to modify the program to run on whatever Linux looks like in 20 years.

      --
      "-1 Troll" is the apparently the same as "-1 I disagree with you."
    3. Re:One other choice by Deagol · · Score: 3, Informative
      Isn't PKZip pushing the 20 year mark? And I think that Unix tar'ed and/or compress(1)'ed files are well over 20 years old.

      Give me a foo.tar.Z file from the early 80s and I can still uncompress it. Give me a foo.zip from a mid-80s BBS archive and I can still see what's inside.

      Also, see graphics formats.

  2. CPIO by DarkDust · · Score: 3, Interesting

    I prefer .cpio.bz2 because, unlike tar, cpio can handle special devices just fine (or am I missing some switch for tar that makes it able to handle devices and links?). Since it's also in the POSIX standard, it should be pretty portable as well.
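
    On the parenthetical question: the tar header format does define entry types for symbolic links and device nodes. A minimal sketch using Python's standard tarfile module, which writes such entries into an in-memory archive and lists them back. The member names and the 1,3 device numbers are illustrative assumptions, and only headers are written and read here; actually creating device nodes on extraction normally needs root.

```python
import io
import tarfile


def build_demo_tar() -> bytes:
    """Build an in-memory tar holding a regular file, a symlink and a device entry."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        # Regular file with a small payload.
        payload = b"hello\n"
        reg = tarfile.TarInfo("data/hello.txt")
        reg.size = len(payload)
        tf.addfile(reg, io.BytesIO(payload))

        # Symbolic link: only a header is written, there is no payload.
        link = tarfile.TarInfo("data/latest")
        link.type = tarfile.SYMTYPE
        link.linkname = "hello.txt"
        tf.addfile(link)

        # Character device entry (illustrative major/minor numbers).
        dev = tarfile.TarInfo("dev/null")
        dev.type = tarfile.CHRTYPE
        dev.devmajor, dev.devminor = 1, 3
        tf.addfile(dev)
    return buf.getvalue()


def describe(archive: bytes) -> dict:
    """Map each member name to a short description of its entry type."""
    kinds = {}
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tf:
        for m in tf.getmembers():
            if m.issym():
                kinds[m.name] = "symlink -> " + m.linkname
            elif m.ischr():
                kinds[m.name] = "char device %d,%d" % (m.devmajor, m.devminor)
            elif m.isfile():
                kinds[m.name] = "file, %d bytes" % m.size
    return kinds


if __name__ == "__main__":
    for name, kind in describe(build_demo_tar()).items():
        print(name, "=>", kind)
```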

  3. Multi-format by sporktoast · · Score: 4, Insightful

    Have you considered going multi-format?

    Either increase the size of your storage to handle 2 or 3 of the more popular and widely available formats (zip, rar, tar.(gz|bz2)), or use compression-on-the-fly libraries (behind a cache to reduce server load). This would allow the recipient to decide, and end up supporting perhaps a larger population.
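
    The "store 2 or 3 formats" option can be sketched in a few lines. This is only an illustration, assuming Python's standard shutil.make_archive; the function names, the "data" base name and the sample file are hypothetical, and RAR is left out because the standard library cannot write it.

```python
import shutil
import tempfile
from pathlib import Path

# shutil format names for .zip, .tar.gz and .tar.bz2 respectively.
FORMATS = ("zip", "gztar", "bztar")


def publish_all(dataset_dir: Path, out_dir: Path, name: str) -> dict:
    """Pack dataset_dir once per format; return {archive filename: size in bytes}."""
    out_dir.mkdir(parents=True, exist_ok=True)
    sizes = {}
    for fmt in FORMATS:
        archive = shutil.make_archive(str(out_dir / name), fmt, root_dir=str(dataset_dir))
        sizes[Path(archive).name] = Path(archive).stat().st_size
    return sizes


def demo_publish() -> dict:
    """Package a tiny made-up dataset in every format inside a temp directory."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "dataset"
        src.mkdir()
        (src / "sample.txt").write_text("the quick brown fox\n" * 1000)
        return publish_all(src, Path(tmp) / "out", "data")


if __name__ == "__main__":
    for filename, size in demo_publish().items():
        print(filename, size, "bytes")
```

    The returned sizes could be shown next to each download link so the recipient can pick a format knowingly.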

    --
    In a related story, the IRS has recently ruled that the cost of Windows upgrades can NOT be deducted as a gambling loss.
  4. My summary.. by Chris_Jefferson · · Score: 3, Informative

    I find that for my data (your data may be different) I tend to get the ordering "zip, bzip2, rar, 7zip" (from worst to best), with rar and 7zip often being much smaller than bzip2 (my data tends to contain lots of similar large files, which tends to lead to unusually large differences between compressors).
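
    Since the ordering depends on the data, it is worth measuring on a sample of your own files. A rough sketch using only Python's standard library: zlib stands in for the Deflate method used by zip and gzip, and the lzma module for the LZMA family that 7-Zip uses, so the numbers only approximate what the real archivers produce. RAR has no standard-library codec and is omitted, and sample_data() is a made-up placeholder.

```python
import bz2
import lzma
import zlib

# label -> (compress function, decompress function)
CODECS = {
    "deflate (zip/gzip)": (lambda d: zlib.compress(d, 9), zlib.decompress),
    "bzip2": (lambda d: bz2.compress(d, 9), bz2.decompress),
    "lzma (xz container)": (lzma.compress, lzma.decompress),
}


def compare(data: bytes) -> dict:
    """Return {codec label: compressed size}, checking each round trip."""
    results = {}
    for label, (pack, unpack) in CODECS.items():
        packed = pack(data)
        if unpack(packed) != data:
            raise ValueError(label + " failed to round-trip")
        results[label] = len(packed)
    return results


def sample_data() -> bytes:
    """Hypothetical stand-in for a real dataset: a repeated CSV-like table."""
    rows = b"".join(b"%d,item%d,%d\n" % (i, i, i * i) for i in range(200))
    return (b"id,name,value\n" + rows) * 50


if __name__ == "__main__":
    data = sample_data()
    print("original", len(data))
    for label, size in sorted(compare(data).items(), key=lambda kv: kv[1]):
        print(label, size)
```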

    Everyone has an unzipping program. I find on Windows more people have (and get) rar than bzip2, particularly if they are afraid of command lines. 7zip gives the best compression of them all, so it is particularly useful for big datasets (I often send around 9-20GB data sets), but it eats memory like nobody's business.

    It really comes down to how much you want to make people download compared to how much trouble you want them to go to.

    If you want to be "minimal effort", I'd advise providing a .zip along with other things, perhaps listing the size next to the files so people can see it's much bigger (like most SourceForge projects), for those Windows users who can't be bothered to get anything else.

    --
    Combination - fun iPhone puzzling
  5. Whatever you choose... by BinLadenMyHero · · Score: 3, Insightful

    ...avoid closed formats.
    Using Free software will help you achieve your number one goal: that everyone can access the data, now and forever.

  6. Re:RAR's license is garbage by gowen · · Score: 3, Interesting
    md5 probably provides enough integrity checking for test data & split/cat make splitting/reassembling ANYTHING easy.
    But RAR/PAR encoding uses a Reed-Solomon error correction scheme, so you can send a few additional files and the data blocks within them can be used to replace *any* lost data blocks from the original set. It's really crafty.

    You're right about the license, though.
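
    To make the idea concrete, here is a deliberately simplified sketch. It is not Reed-Solomon and not the PAR file format: it uses a single XOR parity block, which can rebuild only one lost block, whereas Reed-Solomon generalizes this to several. All names and sizes are hypothetical; md5 is used only to confirm that the rebuilt data matches the original.

```python
import hashlib


def split_blocks(data: bytes, block_size: int) -> list:
    """Cut data into equal-size blocks, zero-padding the last one."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    if blocks and len(blocks[-1]) < block_size:
        blocks[-1] = blocks[-1].ljust(block_size, b"\0")
    return blocks


def xor_parity(blocks: list) -> bytes:
    """XOR all blocks together, byte by byte."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def recover(blocks: list, parity: bytes) -> list:
    """Rebuild at most one missing block (marked None) from the parity block."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) > 1:
        raise ValueError("XOR parity can rebuild only one lost block")
    repaired = list(blocks)
    if missing:
        survivors = [block for block in blocks if block is not None]
        repaired[missing[0]] = xor_parity(survivors + [parity])
    return repaired


def demo_recovery() -> bool:
    """Lose one block, rebuild it, and compare md5 digests with the original."""
    original = b"test data for the archive " * 40
    digest = hashlib.md5(original).hexdigest()
    blocks = split_blocks(original, 256)
    parity = xor_parity(blocks)
    blocks[2] = None  # simulate one volume lost in transit
    rebuilt = b"".join(recover(blocks, parity))[:len(original)]
    return hashlib.md5(rebuilt).hexdigest() == digest


if __name__ == "__main__":
    print("recovered intact:", demo_recovery())
```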
    --
    Athletic Scholarships to universities make as much sense as academic scholarships to sports teams.
  7. don't use rar, arj, 7zip, etc by mqx · · Score: 3, Informative


    These formats (rar, arj, 7zip, whatever) are not as widely supported as (a) tar/gz, tar/bz2, (b) gzip/bzip2, (c) zip. You can get (a)-(c) on just about every platform. Numerous times I've downloaded "unzip" to make (c) work. It's so simple.

    Another point not mentioned elsewhere is virus scanners: at least with the popular tar/zip formats, you know that virus scanners understand them and can look for problems.

    Sure, you may get a few extra features or a little more room out of non-standard archivers, but that's largely not an issue.

  8. For Longevity by 4of12 · · Score: 4, Insightful

    Pick any system for which the source code is available, e.g. .tar.bz2

    Anything else is gambling.

    I still gamble, but only that a C compiler will exist in the future.

    --
    "Provided by the management for your protection."
  9. My Preference by vbrtrmn · · Score: 3, Informative

    For my own personal archives, I have taken the methods from the masters in USENET.

    OS X & UNIX: I'm lazy, so just tar.gz

    Win32: I back up a lot more files under Win32 than *nix.

    Compression: WinRAR
      Compression Method: Best
      Split to Volumes: 20MB
    Parity: QuickPar
      With general settings.

    I back up to decent quality DVD media, as I have had a lot of problems with CD media rotting after about a year.

    --
    it's a sig, wtf?