Best Format for Archive Distribution?
Meostro asks: "I'm looking for the best format to use to distribute arbitrary datasets. Tarballs compressed with gzip seem to be the most common thing out there, with zip coming in a close second. What advanced compression packages are the most widely recognized or available on the widest array of systems? Cross-platform compatibility is my most important goal, followed by compression ratio, decompression time, compression time and extra features (solid archives, support for multiple files, etc.). I'm starting up a free data site to provide test data for anything you can imagine: images for compression and format interpretation, text and audio for language processing, programming language examples to test parsing, and more. I hope this will grow to be a significant (read: multi-gigabyte) archive, so I want to start off right with my distribution format. Right now the plan is data.tar.bz2, but I'm open to anything that will give me better compression as long as it's available for Linux, Windows and Mac."
tar.gz and tar.bz2 are ok for small archives (20MB or so), but if you're dealing with large archives there's only one solution.
RAR -- cross platform, built in integrity checking, and when used with Parity files, makes splitting and reassembling archives an absolute doddle.
Athletic Scholarships to universities make as much sense as academic scholarships to sports teams.
Zip.
.tar.bz2 unless your audience is the type of people you'd expect to have cygwin or 3rd-party compression tools installed on their windows peecees.
Zip is miles more common than anything else and compresses better (generally) than gzip. It's supported out of the box on almost every OS either natively or with bundled software. Even Solaris comes with unzip.
Forget
-Isaac
I am not a lawyer, and this is not legal advice. For Entertainment Purposes Only.
I prefer .cpio.bz2 because, unlike tar, cpio can handle special devices just fine (or am I missing some switch for tar that lets it handle devices and links?). Since it's also in the POSIX standard, this should be pretty portable as well.
Zip is probably the most commonly installed archiver across all systems.
tar/tar.gz/tar.bz2 is supported out-of-the-box on Linux and Mac OS X, but can throw Windows users for a loop (easily remedied, but they aren't likely to have untar installed, and will find the file extension at least a bit odd). For some data tar.bz2 will result in noticeably smaller files, but at a greater cost in compression/decompression time.
After that, you're not really going to find an archival format that's really common.
In the end, it depends on what type of data you are archiving, and your target audience, but unless you have a specific reason otherwise, zip with an md5 checksum file is probably the solution of least effort (just make sure you back up the archive--don't want to have a problem with the only copy you have!).
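A minimal sketch of the "zip plus md5 file" idea, assuming Python's standard library is acceptable for building the archive. The function name make_zip_with_md5 and the ".md5" sidecar naming are made up for illustration; the sidecar uses the same "<hex>  <name>" layout that md5sum prints.

```python
import hashlib
import zipfile
from pathlib import Path


def make_zip_with_md5(src_dir, zip_path):
    """Zip every file under src_dir and write an md5sum-style sidecar file.

    Hypothetical helper for illustration; zip_path should live outside src_dir.
    """
    src_dir, zip_path = Path(src_dir), Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to src_dir so the archive unpacks cleanly.
                zf.write(path, path.relative_to(src_dir).as_posix())

    # Hash the finished archive in 1 MiB blocks to keep memory use flat.
    digest = hashlib.md5()
    with open(zip_path, "rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)

    md5_path = zip_path.with_name(zip_path.name + ".md5")
    md5_path.write_text(f"{digest.hexdigest()}  {zip_path.name}\n")
    return md5_path
```

Anyone with md5sum can then check the download with something like "md5sum -c data.zip.md5"; the checksum only catches corruption, it does not repair it.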
Have you considered going multi-format?
Either increase the size of your storage to handle 2 or 3 of the more popular and widely available formats (zip, rar, tar.(gz|bz2)), or use compression-on-the-fly libraries (behind a cache to reduce server load). This would allow the recipient to decide, and end up supporting perhaps a larger population.
In a related story, the IRS has recently ruled that the cost of Windows upgrades can NOT be deducted as a gambling loss.
I have found that some formats are far better at some data types than others.
e.g. for:
text,ascii,documents: use any of bzip2, gzip, zip.
audio: use nothing if MP3/AAC/etc. flac for other "raw" formats
video: use the most appropriate encoding (mpeg4/divx,etc) and then don't try to compress.
bzip2 encodes/decodes slower, but typically has better compression ratios.
So, use whatever people commonly use for the data type you are compressing.
gus
.. if only.
I find that for my data (your data may be different) I tend to get the ordering "zip,bzip2,rar,7zip" (from worst to best), with rar and 7zip often being much smaller than bzip2 (my data tends to contain lots of similar large files, which tends to lead to unusually large differences between compressors)
Everyone has an unzipping program. I find on windows more people have (and get) rar than bzip2, particularly if they are afraid of command lines. 7zip gives the best compression of everyone, so is particularly useful for big datasets (I often send around 9-20GB data sets) but eats memory like nobody's business.
It really comes down to how much you want to make people download compared to how much trouble you want them to go to.
If you want to be "minimal effort", I'd advise providing a .zip along with other things, perhaps listing the size next to the files so people can see it's much bigger (like most sourceforge projects) for those windows users who can't be bothered to get anything else.
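Since the ranking depends so much on the data, it may be worth measuring on a sample of your own files before committing. A rough sketch using only codecs that ship with Python: zlib is the deflate method behind gzip and zip, bz2 is bzip2, and lzma is the algorithm family used by xz and by 7-Zip's default method. RAR is not covered because the standard library has no RAR codec, and real archivers add their own container overhead, so treat the numbers as indicative only. The function name compressed_sizes is made up for this example.

```python
import bz2
import lzma
import zlib


def compressed_sizes(data):
    """Return {codec name: compressed size in bytes} for three stdlib codecs."""
    return {
        "zlib": len(zlib.compress(data, 9)),  # deflate, as in gzip/zip
        "bz2": len(bz2.compress(data, 9)),    # bzip2
        "lzma": len(lzma.compress(data)),     # xz-style LZMA2 stream
    }


if __name__ == "__main__":
    # Placeholder sample; replace with bytes read from your own files.
    sample = b"some representative data, swap in your own\n" * 2000
    ranked = sorted(compressed_sizes(sample).items(), key=lambda kv: kv[1])
    for name, size in ranked:
        print(f"{name:5s} {size:8d} bytes")
```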
Combination - fun iPhone puzzling
Sure, maybe your 17 MB file is 13.5 with .zip and 13.24 with .7z or whatever, but what it all comes down to is that every current operating system supports .zip files out-of-the-box. Do you think that extra 3 seconds of download is worth making your customer hunt down and install an entirely different program just to see your file? It's not. Why make things harder? .zip is the standard, use it.
Comment of the year
...avoid closed formats.
Using Free software will help you achieve your number one goal: that everyone can access the data, now and forever.
While there are Free rar unpackers, the primary packer/unpacker has a proprietary license. He is catering to open source developers. It is a poor choice.
md5 probably provides enough integrity checking for test data & split/cat make splitting/reassembling ANYTHING easy.
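For anyone without split/cat to hand (stock Windows, say), the same idea can be sketched in a few lines of Python. The helper names split_file, join_files and md5_of, and the ".000, .001, ..." part naming, are illustrative choices, not any tool's convention.

```python
import hashlib
from pathlib import Path


def split_file(path, chunk_size):
    """Write path.000, path.001, ... of at most chunk_size bytes; return the parts."""
    path = Path(path)
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            part = path.with_name(f"{path.name}.{index:03d}")
            part.write_bytes(chunk)
            parts.append(part)
            index += 1
    return parts


def join_files(parts, out_path):
    """Concatenate parts in the given order, like `cat part.* > out`."""
    with open(out_path, "wb") as dst:
        for part in parts:
            dst.write(Path(part).read_bytes())


def md5_of(path):
    """Hex md5 of a file, read in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()
```

Comparing md5_of on the original and on the reassembled file tells you whether the pieces came back intact. Unlike parity files, this detects damage but cannot repair it.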
RAR doesn't have a monopoly on integrating these, though. 7zip certainly has many of the service features available in rar.
It doesn't directly address relative compression ratios nor benchmarks. And it's mostly about the formats themselves, not the libraries that implement them. But it's still good for a birds-eye view, I think.
[1] http://darbinding.sourceforge.net/about_dar.php (The chart is at the bottom of the page.)
These formats (rar, arj, 7zip, whatever) are not as widely supported as (a) tar/gz, tar/bz2, (b) gzip/bzip2, (c) zip. You can get (a)-(c) on just about every platform. Numerous times I've downloaded "unzip" to make (c) work. It's so simple.
Another point not mentioned elsewhere: virus scanners: at least with the popular tar/zip formats, you know that virus scanners understand them and can look for problems.
Sure, you may get a few extra features or a little more room out of non-standard archivers, but that's largely not an issue.
Pick any system for which the source code is available, eg .tar.bz2
Anything else is gambling.
I still gamble, but only that a C compiler will exist in the future.
"Provided by the management for your protection."
For my own personal archives, I have taken the methods from the masters in USENET.
OS X & UNIX: I'm lazy, so just tar.gz
For Win32, I back up a lot more files under win32 than *nix.
Compression
WinRAR
Compression Method: Best
Split to Volumes: 20MB
Parity
QuickPar
With general settings.
I back up to decent quality DVD media, as I have had a lot of problems with CD media rotting after about a year.
it's a sig, wtf?
I've used both gzip and bzip2. I rather like bzip2 for plain text data files, but there is a rather large cost--compressing and uncompressing can take a much longer time than with gzip! These are important considerations to make, especially if you're gonna need to pull this data back off the shelf anytime soon. For this reason (time), I currently use gzip for intermediate-range storage.
--
Given enough personal experience, all stereotypes are shallow.
gzip is really horrible at error recovery. Trying to recover data from a damaged tar.gz file is hard, because gzip does not keep byte boundaries. bzip2 is much better in this respect, and it is much easier to recover from problems.
Never back up using tar.gz - use tar.bz2 instead.
Not on topic, but I had to say it, 'cause I've been hurt (nothing like finding that your .tar.gz backup was damaged, just when you need it).
no one seems to remember uharc these days. uharc still gives tighter compression than rar or bz2. it's hard to come by, though. personally i still use tar bz2, cuz no one can decompress uharcs, and it's usually a little better than rars, at least for source files. i think it was open source at some point, which solves the long-term dilemma faced with flash in the pan formats... if you could find the source.
When you're afraid to download music illegally in your own home, then the terrorists have won!
7-Zip is a one-man project that needs help to add features
or at least manage its SourceForge support and RFE forums.
Drag-n-drop has been requested for almost 2 years, and now
some of its users are defecting to TUGZIP because of it.
http://sourceforge.net/tracker/index.php?func=detail&aid=663095&group_id=14481&atid=364481
Either the guy is too busy, doesn't care or just doesn't want to share control.
Maybe it's time to fork 7-zip?
Obama's legacy: (N)othing (S)ecure (A)nywhere and (T)error (S)imulation (A)dministration
if you are making a site which is for people to download stuff from then why not use real-time web compression?
1. your web server can display the un-compressed version of the file on the server, 2. then the user starts the download from the browser, 3. the webserver compresses it on the fly and delivers it to the browser, which unzips it when it's done.
this saves you time from having to zip and unzip all the time and it SERVES your original purpose.
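A rough sketch of what the server side of that could look like, with a small cache so the same file is not recompressed on every request. In practice most web servers have this built in as a config option; the code below is only an illustration, and the names gzip_body and _gzip_cached are made up. It reads whole files into memory, so it would need streaming before it could serve multi-gigabyte files.

```python
import gzip
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=64)
def _gzip_cached(path_str, mtime_ns):
    # mtime_ns is part of the cache key, so an edited file gets recompressed.
    return gzip.compress(Path(path_str).read_bytes())


def gzip_body(path, accept_encoding=""):
    """Return (body, extra_headers) for a response.

    The file is gzipped only when the client's Accept-Encoding header mentions
    gzip. This is a naive substring check (an assumption for the sketch); a
    real server would parse q-values such as "gzip;q=0".
    """
    path = Path(path)
    if "gzip" in accept_encoding.lower():
        body = _gzip_cached(str(path), path.stat().st_mtime_ns)
        return body, {"Content-Encoding": "gzip"}
    return path.read_bytes(), {}
```

The browser sees Content-Encoding: gzip and decompresses transparently, so the user ends up with the original file without needing any archiver at all. The catch is that deflate-level compression is all you get, and already-compressed data (JPEG, MP3) gains nothing.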
-- Betting on the survival of the media industry is a serious risk. I advise investing elsewhere.
While we are on the subject of JPEG compression, i recently launched http://jpgcrunch.com/, which reduces JPEG file sizes losslessly. that can help things too :D
*Battens down the hatches for an incoming barrage of slashdot traffic*
Don't call me back. Give me a call back. Bye. So yeah. But bye our, well, but alright we are on a shirt this chill.
When choosing a standard, don't forget to check how the format deals with large files and large archives. I've run into numerous problems with software that can't deal with anything larger than 2 GB, either for individual files or total archive size.
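A quick way to find out whether a dataset would even hit that limit is to scan it for files bigger than a signed 32-bit size can represent. A minimal sketch; oversized_files and LIMIT_2GIB are names invented for this example, and the 2 GiB threshold is just the classic signed 32-bit boundary, so a particular tool may break at a different size.

```python
from pathlib import Path

LIMIT_2GIB = 2**31 - 1  # largest value a signed 32-bit size/offset can hold


def oversized_files(root, limit=LIMIT_2GIB):
    """Return (path, size) pairs for files under root larger than limit bytes."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            size = path.stat().st_size
            if size > limit:
                hits.append((path, size))
    return hits
```

Summing the sizes of everything under root gives the matching check for total archive size. If either number is over the limit, test the round trip with the oldest unpacker you expect your users to have.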
Mea navis aericumbens anguillis abundat