Slashdot Mirror


Best Format for Archive Distribution?

Meostro asks: "I'm looking for the best format to use to distribute arbitrary datasets. Tarballs compressed with gzip seem to be the most common thing out there, with zip coming in a close second. What advanced compression packages are the most widely recognized or available on the widest array of systems? Cross-platform compatibility is my most important goal, followed by compression ratio, decompression time, compression time and extra features (solid archives, support for multiple files, etc.). I'm starting up a free data site to provide test data for anything you can imagine: images for compression and format interpretation, text and audio for language processing, programming language examples to test parsing, and more. I hope this will grow to be a significant (read: multi-gigabyte) archive, so I want to start off right with my distribution format. Right now the plan is data.tar.bz2, but I'm open to anything that will give me better compression as long as it's available for Linux, Windows and Mac."

31 of 109 comments

  1. One other choice by gowen · · Score: 4, Insightful

    tar.gz and tar.bz2 are ok for small archives (20MB or so), but if you're dealing with large archives there's only one solution.

    RAR -- cross platform, built in integrity checking, and when used with Parity files, makes splitting and reassembling archives an absolute doddle.

    --
    Athletic Scholarships to universities make as much sense as academic scholarships to sports teams.
    1. Re:One other choice by MindStalker · · Score: 4, Informative

      Similar to RAR, I've found that ACE (www.winace.com) at maximum compression compresses most things better than RAR and is similar in functionality (it supports RAR as well).
      Its Linux version is called unace, and there is a macunace as well. Sadly, these programs are a bit harder to find on the website, but they are there.

      Luckily, Gentoo knows about it, so you can simply emerge unace.

    2. Re:One other choice by Meostro · · Score: 2, Informative

      I've used it on Windows forever, and I know I obtained unrar for Linux and AIX, but how cross-platform is RAR, really? Does it come standard in most distributions? If it does, then it's probably an excellent choice; I've compressed some stuff almost 2:1 over bzip2 using RAR...

    3. Re:One other choice by harrkev · · Score: 5, Insightful

      One problem with this is that it is not a common format. For limited use (one-time distribution, short-term backup), this is OK. But what about long-term archives?

      If you want to decompress this stuff in 10 or 20 years, will you be able to find software then that can handle it? Especially if the new Cell processors somehow become popular, will Windows BOHICA 2025 edition be able to run 20-year-old binaries in order to read this thing?

      If the source is available, the job is easier in Linux, but if the format is not actively maintained, it may take a lot of work to modify the program to run whatever Linux looks like in 20 years.

      --
      "-1 Troll" is the apparently the same as "-1 I disagree with you."
    4. Re:One other choice by Deagol · · Score: 3, Informative
      Isn't PKZip pushing the 20 year mark? And I think that Unix tar'ed and/or compress(1)'ed files are well over 20 years old.

      Give me a foo.tar.Z file from the early 80s and I can still uncompress it. Give me a foo.zip from a mid-80s BBS archive and I can still see what's inside.

      Also, see graphics formats.

  2. Zip by isaac · · Score: 2, Insightful

    Zip.

    Zip is miles more common than anything else and compresses better (generally) than gzip. It's supported out of the box on almost every OS either natively or with bundled software. Even Solaris comes with unzip.

    Forget .tar.bz2 unless your audience is the type of people you'd expect to have cygwin or 3rd-party compression tools installed on their windows peecees.

    -Isaac

    --
    I am not a lawyer, and this is not legal advice. For Entertainment Purposes Only.
    1. Re:Zip by EvilIdler · · Score: 2, Informative

      Zip and gzip use the same compression.

      Zip compresses each file in an archive individually.

      Tar+gzip compresses the entire contents as a whole - meaning better
      compression than zip archives (unless you add uncompressed files to
      an archive, THEN compress the entire archive..)
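
      The per-file versus whole-archive difference described above can be checked with a small Python sketch. The helper names and the sample data (200 small, near-identical files) are made up for illustration; real datasets will show a smaller or larger gap.

```python
import io
import tarfile
import zipfile


def zip_size(files):
    """Size in bytes of a ZIP archive, where each member is deflated on its own."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return len(buf.getvalue())


def tar_gz_size(files):
    """Size in bytes of a tar archive that is gzipped as one continuous stream."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())


# Hypothetical sample: many small files that share most of their content.
sample = {
    f"record_{i:03d}.txt": (f"id={i}\n" + "common header line\n" * 20).encode()
    for i in range(200)
}

if __name__ == "__main__":
    print("zip   :", zip_size(sample), "bytes")
    print("tar.gz:", tar_gz_size(sample), "bytes")
```

      With lots of similar small files the single gzip stream can reuse matches across file boundaries, which ZIP's per-member compression cannot do. With a few large, unrelated files the two tend to end up much closer.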

      WinZip supports tar+gzip archives, from what I remember, but WinRAR
      supports .gz, tar.gz, .bz2 and .tar.bz2 files, so why use anything else
      on Windows?

      Then again, you could use solid RAR archives. Generally the best
      size+performance ratio I've tried of these (all compressed as a whole,
      some error recovery).

    2. Re:Zip by JimDabell · · Score: 2, Informative

      Zip and gzip use the same compression.

      According to the ZIP file format specification, ZIP can use a dynamic LZW algorithm.

      The whole reason gzip exists is because the standard UNIX compress uses LZW - which, until recently, was protected by a patent (that was the problem with GIFs).

      Instead of using LZW, gzip uses the unprotected LZ algorithm, which doesn't contain the improvements that Welch (the 'W' in LZW) made.

      So not only do they not use the same algorithm, but that's the whole point of gzip in the first place!

  3. CPIO by DarkDust · · Score: 3, Interesting

    I prefer .cpio.bz2 because, unlike tar, cpio can handle special devices just fine (or am I missing some switch for tar that makes it able to handle devices and links?). Since it's also in the POSIX standard, it should be pretty portable as well.

    1. Re:CPIO by Meostro · · Score: 2, Informative

      Probably a good idea in general, but:
      1. No obvious support on Windows for cpio
      2. These are going to be test files, so I shouldn't have to worry about special devices / links / sparse files.

      I know our admin here at work switched the backups from tar to cpio for exactly that reason, but it's just not universal enough to justify the departure from "normal".

  4. It really depends... by node+3 · · Score: 2, Insightful

    Zip is probably the most commonly installed archiver across all systems.

    tar/tar.gz/tar.bz2 is supported out-of-the-box on Linux and Mac OS X, but can throw Windows users for a loop (easily remedied, but they aren't likely to have untar installed, and will find the file extension at least a bit odd). For some data tar.bz2 will result in noticeably smaller files, but at a greater cost in compression/decompression time.

    After that, you're not really going to find an archival format that's really common.

    In the end, it depends on what type of data you are archiving, and your target audience, but unless you have a specific reason otherwise, zip with an md5 checksum file is probably the solution of least effort (just make sure you back-up the archive--don't want to have a problem with the only copy you have!).
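
    A "zip plus md5 checksum file" setup could look roughly like the Python sketch below. The function names are hypothetical; the checksum file uses the "HASH  NAME" layout that the md5sum tool writes, so it should also be checkable with `md5sum -c`.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so multi-gigabyte archives do not need to fit in RAM."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum_file(archive_path):
    """Write '<archive>.md5' next to the archive and return its path."""
    checksum_path = archive_path + ".md5"
    line = f"{md5_of_file(archive_path)}  {os.path.basename(archive_path)}\n"
    with open(checksum_path, "w") as fh:
        fh.write(line)
    return checksum_path


def verify(archive_path, checksum_path):
    """Return True if the archive still matches the recorded hash."""
    with open(checksum_path) as fh:
        expected = fh.read().split()[0]
    return md5_of_file(archive_path) == expected
```

    This only detects accidental corruption; it does not repair anything and is not a defence against deliberate tampering.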

    1. Re:It really depends... by OAB_X · · Score: 2, Informative

      Winrar can open tar.gz files

  5. Multi-format by sporktoast · · Score: 4, Insightful

    Have you considered going multi-format?

    Either increase the size of your storage to handle 2 or 3 of the more popular and widely available formats (zip, rar, tar.(gz|bz2)), or use compression-on-the-fly libraries (behind a cache to reduce server load). This would allow the recipient to decide, and end up supporting perhaps a larger population.
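
    The "store two or three formats" option is cheap to script. Below is a minimal sketch using Python's standard `shutil.make_archive`; the function name `build_all_formats` and the chosen format list are assumptions for illustration, and rar is left out because the standard library cannot create it.

```python
import os
import shutil


def build_all_formats(source_dir, output_base, formats=("zip", "gztar", "bztar")):
    """Create one archive per format and return {archive_path: size_in_bytes}.

    output_base is the target path without extension; keep it outside
    source_dir so an archive never ends up packing its siblings.
    """
    results = {}
    for fmt in formats:
        # make_archive appends the right extension (.zip, .tar.gz, .tar.bz2).
        path = shutil.make_archive(output_base, fmt, root_dir=source_dir)
        results[path] = os.path.getsize(path)
    return results
```

    Listing the returned sizes next to each download link lets the recipient weigh convenience against download time.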

    --
    In a related story, the IRS has recently ruled that the cost of Windows upgrades can NOT be deducted as a gambling loss.
  6. Look into specific compress utilities, not generic by gus+goose · · Score: 2, Informative

    I have found that some formats are far better at some data types than others.

    e.g. for:
    text,ascii,documents: use any of bzip2, gzip, zip.
    audio: use nothing if MP3/AAC/etc. flac for other "raw" formats
    video: use the most appropriate encoding (mpeg4/divx,etc) and then don't try to compress.

    bzip encodes/decodes slower, but has typically better compression ratios.

    So, use whatever people commonly use for the data type you are compressing.
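
    A quick way to see how much the data type matters is to run a few general-purpose codecs over different inputs. This Python sketch uses the standard zlib, bz2 and lzma modules; the sample inputs are made up, and random bytes are only a stand-in for data that is already compressed (MP3, JPEG, MPEG-4).

```python
import bz2
import lzma
import os
import zlib

# Raw codec calls; real .gz/.zip/.7z containers add their own headers on top.
COMPRESSORS = {
    "DEFLATE (gzip, zip)": zlib.compress,
    "bzip2": bz2.compress,
    "LZMA (xz, 7z family)": lzma.compress,
}


def ratios(data):
    """Return {codec: compressed_size / original_size}; lower is better."""
    return {name: len(fn(data)) / len(data) for name, fn in COMPRESSORS.items()}


# Highly repetitive text compresses extremely well.
text_like = b"The quick brown fox jumps over the lazy dog.\n" * 5000
# Random bytes behave like already-compressed media: nothing left to squeeze.
already_compressed = os.urandom(len(text_like))

if __name__ == "__main__":
    for label, data in (("text", text_like), ("random", already_compressed)):
        for name, ratio in ratios(data).items():
            print(f"{label:7s} {name:22s} {ratio:.4f}")
```

    On the random input every codec comes out at or slightly above the original size, which is the reason not to re-compress MP3 or MPEG-4 files.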

    gus

    --
    .. if only.
  7. My summary.. by Chris_Jefferson · · Score: 3, Informative

    I find that for my data (your data may be different) I tend to get the ordering "zip, bzip2, rar, 7zip" (from best to worst), with rar and 7zip often being much smaller than bzip2 (my data tends to contain lots of similar large files, which tends to lead to unusually large differences between compressors).

    Everyone has an unzipping program. I find that on Windows more people have (and get) rar than bzip2, particularly if they are afraid of command lines. 7zip gives the best compression of them all, so it is particularly useful for big datasets (I often send around 9-20GB data sets), but it eats memory like no one's business.

    It really comes down to how much you want to make people download compared to how much trouble you want them to go to.

    If you want to be "minimal effort", I'd advise providing a .zip along with other things, perhaps listing the size next to the files so people can see it's much bigger (like most sourceforge projects) for those windows users who can't be bothered to get anything else.

    --
    Combination - fun iPhone puzzling
    1. Re:My summary.. by Chris_Jefferson · · Score: 2, Informative

      damn, damn, damn, damn.. I meant of course that zip is the worst, 7zip the best... wish I could edit comments!

      --
      Combination - fun iPhone puzzling
  8. .zip file ease-of-use beats out saving 4 bytes by Blakey+Rat · · Score: 2, Interesting

    Sure, maybe your 17 MB file is 13.5 with .zip and 13.24 with .7z or whatever, but what it all comes down to is that every current operating system supports .zip files out-of-the-box. Do you think that extra 3 seconds of download is worth making your customer hunt down and install an entirely different program just to see your file? It's not. Why make things harder? .zip is the standard, use it.

  9. Whatever you choose... by BinLadenMyHero · · Score: 3, Insightful

    ...avoid closed formats.
    Using Free software will help you achieve your number one goal: that everyone can access the data, now and forever.

  10. RAR's license is garbage by Noksagt · · Score: 2, Informative

    While there are Free rar unpackers, the primary packer/unpacker has a proprietary license. He is catering to open source developers. It is a poor choice.

    md5 probably provides enough integrity checking for test data & split/cat make splitting/reassembling ANYTHING easy.

    RAR doesn't have a monopoly on integrating these, though. 7zip certainly has many of the service features available in rar.
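
    For anyone without the split and cat tools at hand, the split/reassemble step mentioned above is only a few lines of Python. This is a sketch with hypothetical function names; the ".000, .001, ..." part naming is an arbitrary choice, not what split(1) produces by default.

```python
import shutil


def split_file(path, chunk_size):
    """Write path.000, path.001, ... and return the part names in order."""
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            part = f"{path}.{index:03d}"
            with open(part, "wb") as dst:
                dst.write(chunk)
            parts.append(part)
            index += 1
    return parts


def join_files(parts, out_path):
    """Concatenate the parts in the given order, like `cat part.* > out`."""
    with open(out_path, "wb") as dst:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, dst)
```

    Because the parts are plain byte slices, they work with any archive format; an md5 of the rejoined file tells you whether every part arrived intact.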

    1. Re:RAR's license is garbage by gowen · · Score: 3, Interesting
      md5 probably provides enough integrity checking for test data & split/cat make splitting/reassembling ANYTHING easy.
      But RAR/PAR encoding use a Reed-Solomon error correction scheme, so you can send a few additional files and the data blocks within them can be used to replace *any* lost data blocks from the original set. It's really crafty.

      You're right about the license, though.
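
      To give a feel for how recovery blocks work, here is a deliberately simplified Python sketch. It is NOT Reed-Solomon: it uses a single XOR parity block, which can rebuild exactly one lost block of equal size. Reed-Solomon generalizes the idea so that several recovery blocks can replace the same number of lost blocks. All names here are made up for illustration.

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        if len(block) != len(result):
            raise ValueError("all blocks must have the same length")
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def make_parity(data_blocks):
    """The parity block is simply the XOR of every data block."""
    return xor_blocks(data_blocks)


def recover_missing(blocks_with_gap, parity):
    """Rebuild the single block marked None from the survivors plus parity."""
    missing = [i for i, block in enumerate(blocks_with_gap) if block is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can rebuild exactly one missing block")
    survivors = [block for block in blocks_with_gap if block is not None]
    repaired = list(blocks_with_gap)
    # XOR-ing the parity with all survivors cancels them out, leaving the lost block.
    repaired[missing[0]] = xor_blocks(survivors + [parity])
    return repaired
```

      Whichever block goes missing, the same parity block rebuilds it, which is the property that makes this kind of scheme attractive for split archives.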
      --
      Athletic Scholarships to universities make as much sense as academic scholarships to sports teams.
  11. Technical format comparison chart by uler · · Score: 2, Informative
    I've got a rather technical format comparison chart started up [1]. It's still a draft, but it's pretty complete.

    It doesn't directly address relative compression ratios nor benchmarks. And it's mostly about the formats themselves, not the libraries that implement them. But it's still good for a birds-eye view, I think.

    [1] http://darbinding.sourceforge.net/about_dar.php (The chart is at the bottom of the page.)

  12. don't use rar, arj, 7zip, etc by mqx · · Score: 3, Informative


    These formats (rar, arj, 7zip, whatever) are not as widely supported as (a) tar/gz, tar/bz2, (b) gzip/bzip2, (c) zip. You can get (a)-(c) on just about every platform. Numerous times I've downloaded "unzip" to make (c) work. It's so simple.

    Another point not mentioned elsewhere: virus scanners: at least with the popular tar/zip formats, you know that virus scanners understand them and can look for problems.

    Sure, you may get a few extra features or a little more room out of non-standard archivers, but that's largely not an issue.

  13. For Longevity by 4of12 · · Score: 4, Insightful

    Pick any system for which the source code is available, eg .tar.bz2

    Anything else is gambling.

    I still gamble, but only that a C compiler will exist in the future.

    --
    "Provided by the management for your protection."
  14. My Preference by vbrtrmn · · Score: 3, Informative

    For my own personal archives, I have taken the methods from the masters in USENET.

    OS X & UNIX: I'm lazy, so just tar.gz

    For Win32, I back-up a lot more files under win32 than *nix.

    Compression: WinRAR
      Compression Method: Best
      Split to Volumes: 20MB
    Parity: QuickPar
      With general settings.

    I back-up to decent quality DVD media, as I have had a lot of problems with CD media rotting after about a year.

    --
    it's a sig, wtf?
  15. Being one who also generates multi-GB to of data.. by Trelane · · Score: 2, Interesting

    I've used both gzip and bzip2. I rather like bzip2 for plain text data files, but there is a rather large cost--compressing and uncompressing can take a much longer time than with gzip! These are important considerations to make, especially if you're gonna need to pull this data back off the shelf anytime soon. For this reason (time), I currently use gzip for intermediate-range storage.
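
    The time cost is easy to measure on your own data before committing to a format. A rough Python sketch follows; the function names are hypothetical, it times the raw zlib (DEFLATE) and bz2 codecs rather than the gzip and bzip2 command-line tools, and the numbers will vary by machine and input.

```python
import bz2
import time
import zlib


def time_codec(compress, decompress, data):
    """Return (compressed_size, compress_seconds, decompress_seconds)."""
    start = time.perf_counter()
    packed = compress(data)
    middle = time.perf_counter()
    unpacked = decompress(packed)
    end = time.perf_counter()
    if unpacked != data:
        raise RuntimeError("round trip was not lossless")
    return len(packed), middle - start, end - middle


def compare(data):
    """Time both codecs on the same input."""
    return {
        "gzip (zlib DEFLATE)": time_codec(zlib.compress, zlib.decompress, data),
        "bzip2": time_codec(bz2.compress, bz2.decompress, data),
    }


if __name__ == "__main__":
    # Made-up plain-text sample resembling a numeric data file.
    sample = b"".join(b"%d,%d\n" % (i, i * i % 9973) for i in range(200000))
    for name, (size, c_time, d_time) in compare(sample).items():
        print(f"{name:20s} {size:9d} bytes  pack {c_time:.3f}s  unpack {d_time:.3f}s")
```

    If the data will be read back often, the unpack column is usually the one that matters most.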

    --
    Given enough personal experience, all stereotypes are shallow.
  16. not what you asked, but... by leehwtsohg · · Score: 2, Informative

    gzip is really horrible at error recovery. Trying to recover data from a damaged tar.gz file is hard, because gzip does not keep byte boundaries. bzip2 is much better in this respect, and recovering from problems is much easier.
    Never back up using tar.gz - use tar.bz2 instead.
    Not on topic, but I had to say it, 'cause I've been hurt (nothing like finding that your .tar.gz backup was damaged, just when you need it)

  17. uharc by biryokumaru · · Score: 2, Informative

    no one seems to remember uharc these days. uharc is still a tighter compression than rar or bz2. it's hard to come by, though. personally i still use tar bz2, cuz no one can decompress uharcs, and its usually a little better than rars, atleast for source files. i think it was open source at some point, which solves the long-term dilemma faced with flash in the pan formats... if you could find the source.

    --
    When you're afraid to download music illegally in your own home, then the terrorists have won!
  18. 7-zip: No Drag-n-drop by denis-The-menace · · Score: 2, Informative

    7-Zip is a one-man project that needs help to add features
    or at least manage its SourceForge support and RFE forums.

    Drag-n-drop has been requested for almost 2 years and now,
    some of its users are defecting to TUGZIP because of it.
    http://sourceforge.net/tracker/index.php?func=detail&aid=663095&group_id=14481&atid=364481

    Either the guy is too busy, doesn't care or just doesn't want to share control.

    Maybe it's time to fork 7-zip?

    --
    Obama's legacy: (N)othing (S)ecure (A)nywhere and (T)error (S)imulation (A)dministration
  19. you need real time web compression by mozkill · · Score: 2, Insightful

    if you are making a site which is for people to download stuff from then why not use real-time web compression?

    1. your web server can display the un-compressed version of the file on the server , 2. then the user starts the download from the browser, 3. the webserver compresses it on the fly and delivers it to the browser which unzips it when its done.

    this saves you time from having to zip and unzip all the time and it SERVES your original purpose.
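
    The server-side half of this is normally a web server setting, but the logic is small enough to sketch in Python. The function names are hypothetical and the Accept-Encoding parsing is simplified (it only honours an explicit "q=0" refusal and otherwise ignores quality values).

```python
import gzip


def accepts_gzip(accept_encoding):
    """Return True if the Accept-Encoding header value lists gzip without refusing it."""
    for token in accept_encoding.split(","):
        name, _, params = token.partition(";")
        if name.strip().lower() == "gzip":
            # "gzip;q=0" means the client explicitly refuses gzip.
            quality = params.replace(" ", "").lower()
            return quality not in ("q=0", "q=0.0", "q=0.00", "q=0.000")
    return False


def encode_response(body, accept_encoding):
    """Return (payload, headers), compressing only when the client allows it."""
    if accepts_gzip(accept_encoding):
        payload = gzip.compress(body)
        headers = {
            "Content-Encoding": "gzip",
            "Content-Length": str(len(payload)),
            "Vary": "Accept-Encoding",
        }
        return payload, headers
    return body, {"Content-Length": str(len(body))}
```

    The browser undoes the Content-Encoding transparently, so the user ends up with the original file. Compressing on every request costs CPU, which is why putting a cache in front of it is worth considering for large files.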

    --

    -- Betting on the survival of the media industry is a serious risk. I advise investing elsewhere.
  20. Re:stuffit by Hes+Nikke · · Score: 2, Informative

    While we are on the subject of JPEG compression, i recently launched http://jpgcrunch.com/, which reduces JPEG file sizes losslessly. that can help things too :D

    *Battens down the hatches for an incoming barrage of slashdot traffic*

    --
    Don't call me back. Give me a call back. Bye. So yeah. But bye our, well, but alright we are on a shirt this chill.
  21. Large File/Archive Support by Detritus · · Score: 2, Interesting

    When choosing a standard, don't forget to check how the format deals with large files and large archives. I've run into numerous problems with software that can't deal with anything larger than 2 GB, either for individual files or total archive size.
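
    These limits come from 32-bit fields: software that keeps sizes or offsets in a signed 32-bit integer tops out around 2 GiB, and the classic ZIP layout stores sizes in 32-bit fields (around 4 GiB) unless the ZIP64 extension is used. A small Python sketch of a pre-flight check follows; the function names and messages are made up, while `allowZip64` is the real flag on the standard `zipfile.ZipFile`.

```python
import zipfile

TWO_GIB = 2**31   # approximate ceiling for tools using signed 32-bit offsets
FOUR_GIB = 2**32  # approximate ceiling of the classic 32-bit ZIP size fields


def size_risk(size_bytes):
    """Classify a file or archive size against the common 32-bit limits."""
    if size_bytes >= FOUR_GIB:
        return "needs ZIP64 or another format with 64-bit sizes"
    if size_bytes >= TWO_GIB:
        return "may break tools that use signed 32-bit offsets"
    return "fits in 32-bit size fields"


def open_large_zip(path):
    """Open a ZIP for writing with the ZIP64 extension permitted."""
    return zipfile.ZipFile(
        path, "w", compression=zipfile.ZIP_DEFLATED, allowZip64=True
    )
```

    Even when the archiver on your side handles large files, it is worth testing extraction with the oldest unpacker your audience is likely to have.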

    --
    Mea navis aericumbens anguillis abundat