Best Format for Archive Distribution?
Meostro asks: "I'm looking for the best format to use to distribute arbitrary datasets. Tarballs compressed with gzip seem to be the most common thing out there, with zip coming in a close second. What advanced compression packages are the most widely recognized or available on the widest array of systems? Cross-platform compatibility is my most important goal, followed by compression ratio, decompression time, compression time and extra features (solid archives, support for multiple files, etc.). I'm starting up a free data site to provide test data for anything you can imagine: images for compression and format interpretation, text and audio for language processing, programming language examples to test parsing, and more. I hope this will grow to be a significant (read: multi-gigabyte) archive, so I want to start off right with my distribution format. Right now the plan is data.tar.bz2, but i'm open to anything that will give me better compression as long as it's available for Linux, Windows and Mac."
tar.gz and tar.bz2 are ok for small archives (20MB or so), but if you're dealing with large archives there's only one solution.
RAR -- cross platform, built in integrity checking, and when used with Parity files, makes splitting and reassembling archives an absolute doddle.
Athletic Scholarships to universities make as much sense as academic scholarships to sports teams.
Zip.
.tar.bz2 unless your audience is the type of people you'd expect to have cygwin or 3rd-party compression tools installed on their windows peecees.
Zip is miles more common than anything else and compresses better (generally) than gzip. It's supported out of the box on almost every OS either natively or with bundled software. Even Solaris comes with unzip.
Forget
-Isaac
I am not a lawyer, and this is not legal advice. For Entertainment Purposes Only.
I prefer .cpio.bz2 because unlike tar cpio can handle special devices just fine (or do I miss some switch for tar which makes it able to handle devices and links ?). Since it's also in the POSIX standard this should be pretty portable as well.
best compression ratios with various data
Zip is probably the most commonly installed archiver across all systems.
tar/tar.gz/tar.bz2 is supported out-of-the-box on Linux and Mac OS X, but can throw Windows users for a loop (easily remedied, but they aren't likely to have an untar tool installed, and will find the file extension at least a bit odd). For some data tar.bz2 will result in noticeably smaller files, but at a greater cost in compression/decompression time.
After that, you're not really going to find an archival format that's really common.
In the end, it depends on what type of data you are archiving, and your target audience, but unless you have a specific reason otherwise, zip with an md5 checksum file is probably the solution of least effort (just make sure you back-up the archive--don't want to have a problem with the only copy you have!).
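For what it's worth, the "zip with an md5 checksum file" suggestion can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `make_zip_with_md5` and the `.md5` sidecar naming are made up here, and the sidecar uses the "digest, two spaces, file name" layout that `md5sum -c` reads.

```python
import hashlib
import zipfile
from pathlib import Path


def make_zip_with_md5(files, archive_path):
    """Pack `files` into a zip and write an md5sum-style sidecar file.

    Hypothetical helper for illustration; returns the sidecar path.
    """
    archive_path = Path(archive_path)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            f = Path(f)
            # Stored under the bare file name, so same-named files from
            # different directories would collide in this simple sketch.
            zf.write(f, arcname=f.name)

    # Hash the finished archive in 1 MiB chunks to keep memory use flat.
    digest = hashlib.md5()
    with open(archive_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)

    checksum_path = archive_path.with_name(archive_path.name + ".md5")
    checksum_path.write_text(f"{digest.hexdigest()}  {archive_path.name}\n")
    return checksum_path
```

A downloader would then run something like `md5sum -c data.zip.md5` (or recompute the digest by hand) before unpacking.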
Have you considered going multi-format?
Either increase the size of your storage to handle 2 or 3 of the more popular and widely available formats (zip, rar, tar.(gz|bz2)), or use compression-on-the-fly libraries (behind a cache to reduce server load). This would allow the recipient to decide, and end up supporting perhaps a larger population.
In a related story, the IRS has recently ruled that the cost of Windows upgrades can NOT be deducted as a gambling loss.
I have found that some formats are far better at some data types than others.
e.g. for:
text,ascii,documents: use any of bzip2, gzip, zip.
audio: use nothing if MP3/AAC/etc. flac for other "raw" formats
video: use the most appropriate encoding (mpeg4/divx,etc) and then don't try to compress.
bzip2 encodes/decodes slower, but typically has better compression ratios.
So, use whatever people commonly use for the data type you are compressing.
gus
.. if only.
What, no Stuffit!?!
I find that for my data (your data may be different) I tend to get the ordering "zip, bzip2, rar, 7zip" (from worst to best), with rar and 7zip often being much smaller than bzip2 (my data tends to contain lots of similar large files, which tends to lead to unusually large differences between compressors).
.zip along with other things, perhaps listing the size next to the files so people can see it's much bigger (like most sourceforge projects) for those windows users who can't be bothered to get anything else.
Everyone has an unzipping program. I find on Windows more people have (and get) rar than bzip2, particularly if they are afraid of command lines. 7zip gives the best compression of them all, so it is particularly useful for big datasets (I often send around 9-20GB data sets), but it eats memory like nobody's business.
It really comes down to how much you want to make people download compared to how much trouble you want them to go to.
If you want to be "minimal effort", I'd advise providing a
While 7-zip is a nice open format & archivers come with open licenses, it isn't mature enough yet. The *nix ports are still fairly young. And few third-party apps would unzip them. Finally, RAR does not use an open license. I would discourage you from using it for this project.
Sure, maybe your 17 MB file is 13.5 with .zip and 13.24 with .7z or whatever, but what it all comes down to is that every current operating system supports .zip files out-of-the-box. Do you think that extra 3 seconds of download is worth making your customer hunt down and install an entirely different program just to see your file? It's not. Why make things harder? .zip is the standard, use it.
Comment of the year
My experience is that Zip doesn't handle archiving multiple files with the same name. Zip fails if you have a directory structure like.. /images/foo.txt
foo.txt
I've also seen zip fail completely trying to compress a directory structure containing very large numbers of small files (more than 10,000).
I always use RAR unless I know the recipient can't handle a RAR file.
...avoid closed formats.
Using Free software will help you achieve your number one goal: that everyone can access the data, now and forever.
While there are Free rar unpackers, the primary packer/unpacker has a proprietary license. He is catering to open source developers. It is a poor choice.
md5 probably provides enough integrity checking for test data & split/cat make splitting/reassembling ANYTHING easy.
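The split/cat-plus-md5 idea is simple enough to show in portable form. A minimal sketch, assuming numbered part files (`name.000`, `name.001`, ...); the function names are hypothetical and the part naming is just one convention:

```python
import hashlib
from pathlib import Path


def split_file(path, chunk_size):
    """Write path.000, path.001, ... and return the list of part paths."""
    path = Path(path)
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            part = path.with_name(f"{path.name}.{index:03d}")
            part.write_bytes(chunk)
            parts.append(part)
            index += 1
    return parts


def join_files(parts, out_path):
    """Concatenate the parts in order (what `cat` does) and return the md5."""
    digest = hashlib.md5()
    with open(out_path, "wb") as dst:
        for part in parts:
            data = Path(part).read_bytes()
            digest.update(data)
            dst.write(data)
    return digest.hexdigest()
```

Comparing the digest returned by `join_files` with the published md5 tells the recipient whether any part was damaged. Unlike parity files, it cannot repair anything.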
RAR doesn't have a monopoly on integrating these, though. 7zip certainly has many of the service features available in rar.
Make two versions of your file available. Use "zip" for its universality, so anyone on any platform can get your file, if they want.
Then, make a more efficiently compressed one for those who know how to download and use it. Bzip2 seems to be the current favorite, especially for text.
Don't forget about security issues. If you intend to mail these files as attachments, ZIP and RAR may be blocked by mail servers because both can be executable under Windows. tar.bz2 may be more difficult for a Windows user to figure out, but at least it's not going to infect their computer without a lot of work on their part.
My other comment is to do some experiments with *your* data -- which format actually yields the best compression rate, and how much more time do you spend doing the compression / uncompression. Is the extra time you spend worth the 5% you get? Is using a less popular format (like tar.bz2) worth the time you'll spend leading a Windows user through the uncompression process over and over again?
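Running that experiment takes very little code. Here is a rough sketch using only Python's standard library; `compare_compressors` is a made-up name, and `lzma` is used here as a stand-in for the 7-Zip family (it writes .xz streams, not .7z archives), so treat the numbers as indicative only.

```python
import bz2
import lzma
import time
import zlib


def compare_compressors(data):
    """Return {name: (compressed_size, compress_seconds, decompress_seconds)}."""
    codecs = {
        "deflate (zlib level 9)": (lambda d: zlib.compress(d, 9), zlib.decompress),
        "bzip2": (lambda d: bz2.compress(d, 9), bz2.decompress),
        "lzma/xz": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        t0 = time.perf_counter()
        packed = compress(data)
        t1 = time.perf_counter()
        unpacked = decompress(packed)
        t2 = time.perf_counter()
        assert unpacked == data  # round trip must be lossless
        results[name] = (len(packed), t1 - t0, t2 - t1)
    return results


if __name__ == "__main__":
    # Replace this synthetic sample with a representative slice of *your* data.
    sample = b"the quick brown fox jumps over the lazy dog\n" * 20000
    for name, (size, c_time, d_time) in compare_compressors(sample).items():
        print(f"{name:24s} {size:8d} bytes  {c_time:.3f}s pack  {d_time:.3f}s unpack")
```

The synthetic sample is wildly repetitive, so the interesting comparison only shows up once real files are fed in.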
For the data hosted by our institution, we offer tar.gz and zip formats. Since most of the people using the data (climatologists) are running a Unix variant, most people are grabbing the tar.gz files. Just goes to show that the best format really depends on your situation and your expected users.
cswingle Fairbanks AK
The Best one I've used for years now is RAR.
I run an archive backup on an NT4 & Novell 3 server (old but robust; the Novell box has come through every virus attack unaffected), backing up databases from 20MB to 8 gigs. You can easily open the files in WinRAR under Windows and access them as easily as with Explorer.
Plus it works on all OSes, from good old DOS / Novell up to Linux, Winblows, Mac OS, etc.
I think the point is this:
(lotsa files) -> compress -> (one archive) -> de-compress -> (lotsa files)
Audio and video codecs do not create an archive, and I think that the point is to have a general process, without having a bunch of exceptions based on file type.
BTW: most common audio and video codecs (FLAC being one exception) are lossy. Data out != data in.
"-1 Troll" is apparently the same as "-1 I disagree with you."
It doesn't directly address relative compression ratios nor benchmarks. And it's mostly about the formats themselves, not the libraries that implement them. But it's still good for a birds-eye view, I think.
[1] http://darbinding.sourceforge.net/about_dar.php (The chart is at the bottom of the page.)
is the best compression mechanism I've seen. Getting your data back is a bitch, though.
These formats (rar, arj, 7zip, whatever) are not as widely supported as (a) tar/gz, tar/bz2, (b) gzip/bzip2, (c) zip. You can get (a)-(c) on just about every platform. Numerous times I've downloaded "unzip" to make (c) work. It's so simple.
Another point not mentioned elsewhere: virus scanners: at least with the popular tar/zip formats, you know that virus scanners understand them and can look for problems.
Sure, you may get a few extra features or a little more room out of non-standard archivers, but that's largely not an issue.
even though it's not free, i'm quite fond of Stuffit's sitx format. the expander is available as a free (not as in beer) download from http://www.stuffit.com/, as well as being included on the mac platform.
Don't call me back. Give me a call back. Bye. So yeah. But bye our, well, but alright we are on a shirt this chill.
Pick any system for which the source code is available, eg .tar.bz2
Anything else is gambling.
I still gamble, but only that a C compiler will exist in the future.
"Provided by the management for your protection."
Lossy = very bad, i'm looking for a few general-purpose archivers. I'm sure there will be some MPEG stuff up there if for nothing other than a file format sample, but most stuff isn't going to be lossy-able and still make sense.
.zip or .tar.gz since they're the most universal, but i'll consider anything that's cross-platform or available as source.
I'd be perfectly happy to have separate archivers for different formats, but my main concern is universality, even above compression ratio. FLAC or Monkey's for audio sounds great, but I need to be sure that everyone will be able to handle it. That's why I might be stuck with
Making sure your data doesn't get corrupted should be more of an issue than how compressed you can get it. Sadly, Stuffit is the only thing I can find that has error correction. I'm surprised because error correction is old as the hills. The Bose-Chaudhuri algorithm comes to mind.
I just went on a search for some ten year old data and the first place I found it, it was corrupted. Thank goodness for redundancy. I finally found an uncorrupted version but it took me a couple of days which could have been put to a better use.
Maybe someone could code an error correcting version of tar.
lzip, of course http://sourceforge.net/projects/lzip/
Don't worry about cross-platform stuff. Choose something open and it can be ported, even to new, currently nonexistent platforms.
Open algorithms, open source, no BS, makes your choice easy.
Good judgement comes from experience, and experience comes from bad judgement.
- W. Wriston, former Citibank CEO
For my own personal archives, I have taken the methods from the masters in USENET.
OS X & UNIX: I'm lazy just: tar.gz
For Win32, I back-up a lot more files under win32 than *nix.
Compression
WinRAR
Compression Method: Best
Split to Volumes: 20MB
Parity
QuickPar
With general settings.
I back-up to decent quality DVD media, as I have had a lot of problems with CD media rotting after about a year.
it's a sig, wtf?
Dictionary-based compression schemes work well on data which might be described as "linguistic," i.e., data which has some kind of grammar describing it. English text, machine code (binaries), source code, HTML, etc. It won't work very well at all on audio or image data, at least not without some kind of preprocessing (usually called filtering). Bzip2 does somewhat better on audio because it is not a dictionary-based scheme.
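To make the "filtering" point concrete, here is a small sketch of the simplest possible filter, a byte-wise delta, applied before Deflate. Everything in it is an assumption for illustration: the "audio" is a synthetic random walk, and real audio coders use much better predictors than "previous sample".

```python
import random
import zlib


def delta_encode(samples):
    """Replace each byte with its difference from the previous byte (mod 256)."""
    out = bytearray(len(samples))
    prev = 0
    for i, value in enumerate(samples):
        out[i] = (value - prev) % 256
        prev = value
    return bytes(out)


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences (mod 256)."""
    out = bytearray(len(deltas))
    prev = 0
    for i, d in enumerate(deltas):
        prev = (prev + d) % 256
        out[i] = prev
    return bytes(out)


def make_smooth_signal(n, seed=0):
    """Synthetic 8-bit 'audio': a random walk that moves by -1, 0 or +1."""
    rng = random.Random(seed)
    level = 128
    samples = bytearray(n)
    for i in range(n):
        level = (level + rng.choice((-1, 0, 1))) % 256
        samples[i] = level
    return bytes(samples)


if __name__ == "__main__":
    signal = make_smooth_signal(50000)
    print("raw + deflate:     ", len(zlib.compress(signal, 9)))
    print("filtered + deflate:", len(zlib.compress(delta_encode(signal), 9)))
```

The filtered stream contains only three distinct byte values, which a dictionary coder handles far better than the wandering raw samples. The filter itself is lossless, since `delta_decode` restores the input exactly.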
There's no really simple way to get everything you are asking for. Do you want to distribute archives of multiple, compressed audio files? Then the files should be compressed individually, then archived. OTOH, if you want to distribute a bunch of .txt files, you should do the opposite -- archive first, and THEN compress.
If you are really serious about saving disk space, there's simply no way you're going to have a single, universal compressed archival format. OTOH, if the main goal is to have a format everybody can deal with, and perhaps as a side effect get a little compression out of it, I'd just use plain ZIP.
If the point of this is to make data available, then cross-platform availability needs to be my primary concern.
I can spec and write and offer to the public the SuperXtreme Archive format, and make my data available only in that format. Unless there is a compelling reason to switch to SXA for other purposes (general adoption), then it's essentially proprietary to my site and won't really be of any use to anyone.
OSS is not the be-all and end-all of utility or availability, only of portability.
I've used both gzip and bzip2. I rather like bzip2 for plain text data files, but there is a rather large cost--compressing and uncompressing can take a much longer time than with gzip! These are important considerations to make, especially if you're gonna need to pull this data back off the shelf anytime soon. For this reason (time), I currently use gzip for intermediate-range storage.
--
Given enough personal experience, all stereotypes are shallow.
gzip is really horrible in error recovery. Trying to recover data from a damaged tar.gz file is hard, because gzip does not keep byte boundaries. bzip2 is much better in this respect, and it is much easier to recover from problems.
Never back up using tar.gz - use tar.bz2 instead.
Not on topic, but I had to say it, 'cause I've been hurt (nothing like finding that your .tar.gz backup was damaged, just when you need it).
no one seems to remember uharc these days. uharc is still a tighter compression than rar or bz2. it's hard to come by, though. personally i still use tar bz2, cuz no one can decompress uharcs, and it's usually a little better than rars, at least for source files. i think it was open source at some point, which solves the long-term dilemma faced with flash in the pan formats... if you could find the source.
When you're afraid to download music illegally in your own home, then the terrorists have won!
The various types of data you mention are not uniformly compressible by any single algorithm. Therefore your "compression ratio" criterion is dubious. Compression ratio on what sort of data?
Compression on any sort of data. Give me a good general-purpose compressor. Give me a good one just for audio, just for video, just for text, whatever. I have no idea what kinds of data I'm going to get, or how much of each kind there might be. A custom compressor for audio (FLAC) will almost always outperform a general-purpose compressor (gzip) because it uses specialized knowledge of the data to make better guesses as to how to compress it, but if only eight people in the world use FLAC versus 8 million for gzip, I had better use gzip.
What's a good dictionary-based compressor? What filters are generally available? Bzip2 is a block-sorting compressor (Burrows-Wheeler transform followed by Huffman coding), what do you recommend I use that on versus using gzip, compress, etc.? I know what I can find on Google, and I know what I use, I just don't know what "everyone" has. There are hundreds of algorithms and formats out there, and I need to pick the right two or three.
There's no really simple way to get everything you are asking for. Do you want to distribute archives of multiple, compressed audio files? Then the files should be compressed individually, then archived. OTOH, if you want to distribute a bunch of .txt files, you should do the opposite -- archive first, and THEN compress.
This is exactly the kind of thing i'm looking for. I've nearly always used the arc-then-comp method to get the best compression, I wouldn't have thought that individual files compress better than shared datasets.
I found tar to be a dated format that has no checksumming of individual files. I ran into a situation where a large tarball was made, and tar tf foo.tar done to verify it. A later attempt at extract failed due to corruption.
There are horrors that arise with tar. First, there are multiple tar record formats. The original tar only supported 14-character file names (original unix file system limitation). Along came a second tar format, but even that ended up with variants. Most people are using the GNU tar format these days, but the old stuff still crops up. Heck, early versions of Mac OS X had two separate commands (tar, gnutar) because of this.
The worst, however, is the streaming nature of tar. A tar file header contains the length of the file data. It is assumed the data immediately following that are another tar file header. If you have corruption you get "lost" in the data stream and very few tar implementations will find a resync point so you can recover files later in the stream.
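The header-to-header hopping described above looks roughly like this in code. A minimal sketch that assumes plain ustar-style headers (512-byte blocks, name in bytes 0-99, size as octal ASCII in bytes 124-135); GNU long-name entries, pax extended headers and base-256 sizes are deliberately ignored, and `walk_tar_headers` is a made-up name.

```python
BLOCK = 512


def walk_tar_headers(raw):
    """Yield (name, size) by hopping from header to header, as a reader must."""
    offset = 0
    while offset + BLOCK <= len(raw):
        header = raw[offset:offset + BLOCK]
        if header == bytes(BLOCK):  # an all-zero block marks the end of the archive
            break
        name = header[0:100].split(b"\0", 1)[0].decode("utf-8", "replace")
        size_field = header[124:136].strip(b"\0 ").decode("ascii")
        # A corrupted size field either raises ValueError here or, worse,
        # parses to a wrong number and sends us to a bogus offset.
        size = int(size_field or "0", 8)
        yield name, size
        # File data is padded up to the next 512-byte boundary.
        data_blocks = (size + BLOCK - 1) // BLOCK
        offset += BLOCK + data_blocks * BLOCK
```

Nothing in the loop can tell a real header from file data, which is why a single bad size field makes everything after it unreachable unless the reader scans for a resync point (for instance the "ustar" magic at offset 257 of each header).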
I had the joy of needing to do just this to recover as much as I could from a corrupt tarball. It was kind of a fun java exercise, but after I was done I swore I'd never use tar again for archiving data. Zip is my choice now. It's supported everywhere (even Java's jar files are zip format) and is much more reliable than tar.
7-Zip is a one-man project that needs help to add features or at least manage its SourceForge support and RFE forums.
Drag-n-drop has been requested for almost 2 years and now, some of its users are defecting to TUGZIP because of it.
http://sourceforge.net/tracker/index.php?func=detail&aid=663095&group_id=14481&atid=364481
Either the guy is too busy, doesn't care or just doesn't want to share control.
Maybe it's time to fork 7-zip?
Obama's legacy: (N)othing (S)ecure (A)nywhere and (T)error (S)imulation (A)dministration
if you are making a site which is for people to download stuff from then why not use real-time web compression?
1. your web server can display the uncompressed version of the file on the server, 2. then the user starts the download from the browser, 3. the webserver compresses it on the fly and delivers it to the browser, which decompresses it when it's done.
this saves you time from having to zip and unzip all the time and it SERVES your original purpose.
-- Betting on the survival of the media industry is a serious risk. I advise investing elsewhere.
There isn't any such thing, unless you use a very narrow definition of "general purpose." To me, general purpose would imply that I could throw any sort of data at it I please, and it would do well, provided the data is not just random data. Since we're working with very vague definitions I can only give very vague suggestions.
To answer your specific questions, a good dictionary compressor is the Flate algorithm used by gzip. Very fast decompression, moderately fast compression, which can be tuned by a single parameter ("compression level" ranging from 0 to 9).
The filters depend again on the sort of data you are compressing. For non-photographic image data, the standard PNG linear filters, as well as the Paeth filter also supported by PNG, usually do quite well. For photographic data, just use high quality JPEG and consider it lossless (it could be argued that the amount of noise injected by JPEG is on the same order of magnitude as the noise in the digital imaging process, so who cares where the noise comes from, it's there whether you want it or not). For audio data, channel decorrelation followed by linear prediction is pretty much standard.
Bzip2 does well on any sort of data with repeating patterns, or nearly repeating patterns. In general, it works well on the sorts of data that a dictionary approach would work on, but it also works fairly well on audio data (much better than a dictionary approach, but still worse than if you had applied a filter specifically designed for audio data and then compressed with something simple like Flate or even just Huffman or Rice codes).
When bzip compressing many small files it is imperative to archive first and then compress, so that the 900k block sizes are used to their fullest advantage.
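The archive-first advice is easy to check on a toy data set. The sketch below is only an illustration: the 200 near-identical text files are invented, and the two helper names are hypothetical.

```python
import bz2
import io
import tarfile


def size_compressed_individually(files):
    """Total bytes if every file is bzip2-compressed on its own."""
    return sum(len(bz2.compress(data)) for data in files.values())


def size_archived_then_compressed(files):
    """Bytes of a single in-memory .tar.bz2 that holds all the files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:bz2") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())


# Hypothetical data set: 200 small, similar text files.
files = {
    f"record_{i:03d}.txt": (f"id={i}\n" + "name,value,comment\n" * 20).encode()
    for i in range(200)
}

if __name__ == "__main__":
    print("each file bzip2'd separately:", size_compressed_individually(files))
    print("one tar, then bzip2:         ", size_archived_then_compressed(files))
```

Each separately compressed file pays the fixed bzip2 stream overhead and cannot share redundancy with its neighbours, while the single tarball lets one block see all of them. For files that are already compressed (MP3, JPEG and the like) there is no such redundancy left to share, which is why the earlier advice for those is the reverse.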
What does "everyone" have? It seems to me that if you're starting a mass data archival site, you can boost the number of algorithms by making them available on your site. At the bare minimum I would expect people to support zip, flate, bzip2, jpg, flac.
Is lossy compression an option, or not an option?
My "general purpose" is exactly what you said, without the "would do well" part, or substituting "would not do horribly" in its place. I know that all algorithms have strengths and weaknesses, i'm just looking for what has the fewest weaknesses and the most strengths. My definition is vague because I have vague data to compress; I don't have all audio or all images or all text, i've got a mix. As I said (and I believe you implied), I don't expect to get optimal compression using one or two algorithms, but I would like something that's pretty good all-around.
The image filters you mention are, AFAIK, only available in conjunction with PNG. Nothing exists today that works like that on its own: there is no common set of transform filters that can be used as a front-end for generic coders (gzip/bzip2). I might end up creating them, with something like a Java interface to make it instantly cross-platform, with the hopes that the filters would be included in some major distros down the line.
What does "everyone" have? It seems to me that if you're starting a mass data archival site, you can boost the number of algorithms by making them available on your site. At the bare minimum I would expect people to support zip, flate, bzip2, jpg, flac.
As with your previous post, this is exactly what I'm looking for: what does everyone have, what is commonly supported. If all I needed was an OSS implementation of my compression, I'd bastardize gz, bz2, 7z and ppm together with some filters, some Rice, Huffman, Shannon, Fano, Lempel, Ziv and Welch and a bit of Burrows and Wheeler to make "Best... Compressor... Ever" and unleash it upon the world. =)
Is lossy compression an option, or not an option?
Both, actually. Some stuff like language samples will probably end up as MP3, some of the video sequences will probably be MPEG and most images will end up JPG. Anything where minute details don't matter will probably be lossy, but probably 80% or more will be text, code and other data that needs to be lossless. That's what i'm looking for here, since lossy formats for audio/video/images are pretty standard MP3/MPG/JPG, and I can tweak the loss to get the size where I want it, at least to some point.
Whether text compression needs to be lossless is actually debatable. I'm gonna veer a little off topic here, but hey, it's Slashdot...
Suppose you are compressing English text by Huffman encoding entire words at a time. However, people make typos, so the actual set of words to be encoded will be larger than a set where there were no typos. By first running a spell check you can slightly increase the compression efficiency. Strictly, this is "lossy" compression since the decoded data is not the same as the encoded data. But not only have you improved compression efficiency, you've fixed the typos.
For something like C code, it is not necessary to precisely reproduce the indentation style of the original code. It could be boiled down to a simple canonical format and then compressed with a C-grammar-aware method, then run through "indent" on the other end to recover reasonable indentation.
Though personally, I prefer 7-zip and Stuffit (can't wait for the new version).
Wait, I thought that Microsoft is already using LZIP to distribute their software.
Otherwise, how can you explain...
Oh, wait.
When choosing a standard, don't forget to check how the format deals with large files and large archives. I've run into numerous problems with software that can't deal with anything larger than 2 GB, either for individual files or total archive size.
Mea navis aericumbens anguillis abundat
Maybe this is a bit off topic, but whatever happened to the ARJ format? It used to be king. One of its best features was being able to arj something onto multiple floppy disks.
Can anyone enlighten me on the fate of this once most favoured compression algorithm?
...is to upload the material to a Gmail account, then send the recipient the account name and key. Let Google handle the data compression, backup, system maintenance, etc.
Interesting, and true. The only problem is that no spell-checker will get everything right, so it may "correct" words to the wrong thing, and that could make a huge difference in meaning.
Don't forget the concept of letter order in English (and some foreign) text, it might be possible to alpha-sort the interior of words and still present readable text, although there is an example included that shows it might not be the best idea:
A dootcr has aimttded the magltheuansr of a tageene ceacnr pintaet who deid aetfr a hatospil durg blendur
BTW, I sent you an email suggesting a few more data sets you might host on your site.
I recently tried unpacking a bzip2 package under windows. It took me ages to find something that would recognize it and extract it. Which is a shame because it is a nice format ... if you aren't doing this a lot since it takes more time.
However, winzip out of the box will open tarballs and of course zip. And gzip / unzip are pretty much universal on *nix. I have however found that very large tarballs can be a problem with Winzip (like 100+ MB) but that was a long time ago.
And I would never, for the original poster's purpose, use anything with a proprietary licence like RAR. It could easily end up like the GIF fiasco all over again.
Bitter and proud of it.