Best Format for Archive Distribution?
Meostro asks: "I'm looking for the best format to use to distribute arbitrary datasets. Tarballs compressed with gzip seem to be the most common thing out there, with zip coming in a close second. What advanced compression packages are the most widely recognized or available on the widest array of systems? Cross-platform compatibility is my most important goal, followed by compression ratio, decompression time, compression time and extra features (solid archives, support for multiple files, etc.). I'm starting up a free data site to provide test data for anything you can imagine: images for compression and format interpretation, text and audio for language processing, programming language examples to test parsing, and more. I hope this will grow to be a significant (read: multi-gigabyte) archive, so I want to start off right with my distribution format. Right now the plan is data.tar.bz2, but I'm open to anything that will give me better compression as long as it's available for Linux, Windows and Mac."
I have found that some formats are far better at some data types than others.
e.g. for:
text, ascii, documents: use any of bzip2, gzip, zip.
audio: use nothing if it is already MP3/AAC/etc.; use flac for "raw" formats.
video: use the most appropriate encoding (mpeg4/divx, etc.) and then don't try to compress.
bzip2 encodes/decodes slower, but typically has better compression ratios.
So, use whatever people commonly use for the data type you are compressing.
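A quick way to check the "it depends on the data type" point on your own files is to run the same bytes through a few codecs and compare sizes. This is only a rough sketch using Python's standard library; the sample data and the `compressed_sizes` helper are made up for illustration, with random bytes standing in for already-compressed media.

```python
import bz2
import lzma
import os
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Return the compressed size in bytes for each stdlib codec."""
    return {
        "raw": len(data),
        "zlib": len(zlib.compress(data, 9)),  # DEFLATE, as used by gzip
        "bz2": len(bz2.compress(data, 9)),    # bzip2
        "lzma": len(lzma.compress(data)),     # xz container, LZMA family
    }


if __name__ == "__main__":
    # Hypothetical samples: repetitive text versus random bytes. The random
    # bytes stand in for data that is already compressed (MP3, MPEG-4, ...).
    samples = {
        "text": b"the quick brown fox jumps over the lazy dog\n" * 2000,
        "random": os.urandom(88000),
    }
    for name, data in samples.items():
        print(name, compressed_sizes(data))
```

On real datasets the ranking can differ from file type to file type, so it is worth running this on a representative sample before settling on a format.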
gus
.. if only.
Zip and gzip use the same compression.
Zip compresses each file in an archive individually.
Tar+gzip compresses the entire contents as a whole - meaning better
compression than zip archives (unless you add uncompressed files to
an archive, THEN compress the entire archive..)
WinZip supports tar+gzip archives, from what I remember, but WinRAR
supports .gz, tar.gz, .bz2 and .tar.bz2 files, so why use anything else
on Windows?
Then again, you could use solid RAR archives. Generally the best
size+performance ratio I've tried of these (all compressed as a whole,
some error recovery).
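The per-file versus whole-archive difference is easy to see in a small experiment. The sketch below builds both kinds of archive in memory with Python's `zipfile` and `tarfile` modules; the helper names and the sample files (many copies of one incompressible block) are hypothetical and chosen to exaggerate the effect.

```python
import io
import os
import tarfile
import zipfile


def zip_size(files: dict) -> int:
    """Size of a ZIP archive; each member is deflated on its own."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return len(buf.getvalue())


def tar_gz_size(files: dict) -> int:
    """Size of a tar archive that is gzipped as one continuous stream."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())


if __name__ == "__main__":
    # Hypothetical data set: 50 files sharing one 4 KiB random block.
    # Each file alone is incompressible, but the files repeat each other.
    block = os.urandom(4096)
    files = {f"sample{i}.bin": block for i in range(50)}
    print("zip   :", zip_size(files))
    print("tar.gz:", tar_gz_size(files))
```

Keep in mind that gzip only looks back 32 KB, so this gain mostly shows up with small, similar files; solid RAR or 7zip archives can exploit redundancy across much larger distances.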
I find that for my data (your data may be different) I tend to get the ordering "zip, bzip2, rar, 7zip" (from worst to best), with rar and 7zip often being much smaller than bzip2 (my data tends to contain lots of similar large files, which tends to lead to unusually large differences between compressors).
Everyone has an unzipping program. I find on Windows more people have (and get) rar than bzip2, particularly if they are afraid of command lines. 7zip gives the best compression of them all, so it is particularly useful for big datasets (I often send around 9-20GB data sets), but it eats memory like nobody's business.
It really comes down to how much you want to make people download compared to how much trouble you want them to go to.
If you want to be "minimal effort", I'd advise providing a .zip along with other things, perhaps listing the size next to the files so people can see it's much bigger (like most SourceForge projects), for those Windows users who can't be bothered to get anything else.
Probably a good idea in general, but:
1. No obvious support on Windows for cpio
2. These are going to be test files, so I shouldn't have to worry about special devices / links / sparse files.
I know our admin here at work switched the backups from tar to cpio for exactly that reason, but it's just not universal enough to justify the departure from "normal".
Winrar can open tar.gz files
Similar to rar, I've found that ACE (www.winace.com) in maximum compression compresses most things better than RAR and is similar in functionality (it supports rar as well).
Its Linux version is called unace and there is a macunace as well. Sadly these programs are a bit harder to find on the website, but they are there.
Luckily Gentoo knows it, so you can simply emerge unace.
I've used it on Windows forever, and I know I obtained unrar for Linux and AIX, but how cross-platform is RAR, really? Does it come standard in most distributions? I think if it does then it's probably an excellent choice, I've compressed some stuff almost 2:1 over bzip2 using RAR...
While there are Free rar unpackers, the primary packer/unpacker has a proprietary license. He is catering to open source developers. It is a poor choice.
md5 probably provides enough integrity checking for test data & split/cat make splitting/reassembling ANYTHING easy.
RAR doesn't have a monopoly on integrating these, though. 7zip certainly has many of the service features available in rar.
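For what it's worth, the md5 plus split/cat approach needs very little machinery. Here is a minimal sketch of the idea in Python, working on in-memory bytes as a stand-in for real files; `split_bytes`, `join_chunks` and `md5_hex` are illustrative names, not part of any existing tool.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest, comparable to the output of md5sum."""
    return hashlib.md5(data).hexdigest()


def split_bytes(data: bytes, chunk_size: int) -> list:
    """Cut data into consecutive pieces of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def join_chunks(chunks: list) -> bytes:
    """Reassemble the pieces in order, like `cat part1 part2 ...`."""
    return b"".join(chunks)


if __name__ == "__main__":
    # Hypothetical payload standing in for a large archive file.
    payload = bytes(range(256)) * 100
    published_digest = md5_hex(payload)

    parts = split_bytes(payload, 1000)
    rebuilt = join_chunks(parts)

    # The downloader compares the digest of the rebuilt file with the
    # published one to detect a missing or corrupted piece.
    print(len(parts), "parts, intact:", md5_hex(rebuilt) == published_digest)
```

Note that a checksum only detects damage; unlike RAR recovery records or parity files, it cannot repair anything.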
It doesn't directly address relative compression ratios nor benchmarks. And it's mostly about the formats themselves, not the libraries that implement them. But it's still good for a birds-eye view, I think.
[1] http://darbinding.sourceforge.net/about_dar.php (The chart is at the bottom of the page.)
These formats (rar, arj, 7zip, whatever) are not as widely supported as (a) tar/gz, tar/bz2, (b) gzip/bzip2, (c) zip. You can get (a)-(c) on just about every platform. Numerous times I've downloaded "unzip" to make (c) work. It's so simple.
Another point not mentioned elsewhere: virus scanners: at least with the popular tar/zip formats, you know that virus scanners understand them and can look for problems.
Sure, you may get a few extra features or a little more room out of non-standard archivers, but that's largely not an issue.
Give me a foo.tar.Z file from the early 80s and I can still uncompress it. Give me a foo.zip from a mid-80s BBS archive and I can still see what's inside.
Also, see graphics formats.
According to the ZIP file format specification, ZIP can use a dynamic LZW algorithm.
The whole reason gzip exists is that the standard UNIX compress uses LZW, which, until recently, was protected by a patent (that was the problem with GIFs).
Instead of using LZW, gzip uses DEFLATE (the unpatented LZ77 algorithm plus Huffman coding), which doesn't rely on the improvements that Welch (the 'W' in LZW) made.
So not only do they not use the same algorithm, but that's the whole point of gzip in the first place!
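For anyone who has not seen LZW itself, here is a textbook sketch in Python. It is only meant to show the dictionary-building idea behind the algorithm; it is not the exact variant used by ZIP or by UNIX compress (no variable-width code packing, no table reset), and the function names are my own.

```python
def lzw_compress(data: bytes) -> list:
    """Encode bytes as a list of integer codes using textbook LZW."""
    table = {bytes([i]): i for i in range(256)}  # start with all single bytes
    next_code = 256
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate              # keep extending the match
        else:
            codes.append(table[current])     # emit the longest known match
            table[candidate] = next_code     # learn the new sequence
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: list) -> bytes:
    """Rebuild the original bytes; the table is reconstructed on the fly."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # The code refers to the entry that is about to be created.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(out)


if __name__ == "__main__":
    message = b"TOBEORNOTTOBEORTOBEORNOT"
    codes = lzw_compress(message)
    print(len(message), "bytes ->", len(codes), "codes")
    print(lzw_decompress(codes) == message)
```

The decoder never receives the table; it rebuilds the same entries from the codes alone, which is the part that makes LZW compact to implement.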
For my own personal archives, I have taken the methods from the masters in USENET.
OS X & UNIX: I'm lazy, so just tar.gz
For Win32, I back-up a lot more files under win32 than *nix.
Compression: WinRAR
  Compression Method: Best
  Split to Volumes: 20MB
Parity: QuickPar, with general settings.
I back-up to decent quality DVD media, as I have had a lot of problems with CD media rotting after about a year.
gzip is really horrible at error recovery. Trying to recover data from a damaged tar.gz file is hard, because gzip does not keep byte boundaries. bzip2 is much better in this respect, and it is much easier to recover from problems.
Never back up using tar.gz - use tar.bz2 instead.
Not on topic, but I had to say it, 'cause I've been hurt (nothing like finding that your .tar.gz backup was damaged, just when you need it).
no one seems to remember uharc these days. uharc still compresses tighter than rar or bz2. it's hard to come by, though. personally i still use tar bz2, cuz no one can decompress uharcs, and it's usually a little better than rars, at least for source files. i think uharc was open source at some point, which solves the long-term dilemma faced with flash-in-the-pan formats... if you could find the source.
7-Zip is a one-man project that needs help to add features
or at least manage its SourceForge support and RFE forums.
Drag-n-drop has been requested for almost 2 years and now
some of its users are defecting to TUGZIP because of it.
http://sourceforge.net/tracker/index.php?func=detail&aid=663095&group_id=14481&atid=364481
Either the guy is too busy, doesn't care or just doesn't want to share control.
Maybe it's time to fork 7-zip?
While we are on the subject of JPEG compression, i recently launched http://jpgcrunch.com/, which reduces JPEG file sizes losslessly. that can help things too :D
*Battens down the hatches for an incoming barrage of slashdot traffic*