Best Format for Archive Distribution?

One other choice by gowen · 2005-03-09 03:29 · Score: 4, Insightful

tar.gz and tar.bz2 are ok for small archives (20MB or so), but if you're dealing with large archives there's only one solution.

RAR -- cross platform, built in integrity checking, and when used with Parity files, makes splitting and reassembling archives an absolute doddle.

--
Athletic Scholarships to universities make as much sense as academic scholarships to sports teams.

Re:One other choice by MindStalker · 2005-03-09 03:50 · Score: 4, Informative

Similar to RAR, I've found that ACE (www.winace.com) at maximum compression compresses most things better than RAR and is similar in functionality (it supports RAR as well).
Its Linux version is called unace and there is a macunace as well. Sadly these programs are a bit harder to find on the website, but they are there.

Luckily Gentoo knows it, so you can simply emerge unace.
Re:One other choice by Meostro · 2005-03-09 03:53 · Score: 2, Informative

I've used it on Windows forever, and I know I obtained unrar for Linux and AIX, bow cross-platform is RAR, really? Does it come standard in most distributions? I think if it does then it's probably an excellent choice, I've compressed some stuff almost 2:1 over bzip2 using RAR...
Re:One other choice by harrkev · 2005-03-09 04:04 · Score: 5, Insightful

One problem with this is that it is not a common format. For limited use (one-time distribution, short-term backup), this is OK. But what about long-term archives?

If you want to de-compress this stuff in 10 or 20 years, will you be able to find software then that can handle it? Especially if the new cell processors somehow become popular, will Windows BOHICA 2025 edition be able to run 20-year-old binaries in order to read this thing?

If the source is available, the job is easier in Linux, but if the format is not actively maintained, it may take a lot of work to modify the program to run whatever Linux looks like in 20 years.

--
"-1 Troll" is the apparently the same as "-1 I disagree with you."
Re:One other choice by Meostro · 2005-03-09 04:07 · Score: 1

doh... bow = but how
Re:One other choice by Anonymous Coward · 2005-03-09 04:08 · Score: 0

You're not going to find a format that will still be widely used in 20 years, I am almost willing to bet. So, if in 5 years you find out the format has changed, just recompress everything. It should be simple if you use scriptable programs...
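
A minimal sketch of that kind of scripted recompression, assuming the archives are .gz files being moved to .bz2. The helper name `recompress_gz_to_bz2` and the flat directory layout are hypothetical, chosen only for illustration:

```python
import bz2
import gzip
import shutil
from pathlib import Path


def recompress_gz_to_bz2(directory):
    """Rewrite every *.gz file in `directory` as *.bz2 (hypothetical helper).

    The original .gz files are left in place so the result can be checked
    before anything is deleted.
    """
    converted = []
    for src in sorted(Path(directory).glob("*.gz")):
        dst = src.with_suffix(".bz2")  # foo.tar.gz -> foo.tar.bz2
        with gzip.open(src, "rb") as fin, bz2.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)  # stream, so large files need little memory
        converted.append(dst)
    return converted
```

The same loop shape would apply to any pair of formats that have scriptable tools; only the open calls change.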
Re:One other choice by Deagol · 2005-03-09 04:35 · Score: 3, Informative

Isn't PKZip pushing the 20 year mark? And I think that Unix tar'ed and/or compress(1)'ed files are well over 20 years old.
Give me a foo.tar.Z file from the early 80s and I can still uncompress it. Give me a foo.zip from a mid-80s BBS archive and I can still see what's inside.
Also, see graphics formats.

--
Method of processing duck feet
Re:One other choice by Gnulix · 2005-03-09 04:57 · Score: 1

RAR -- cross platform, built in integrity checking, and when used with Parity files, makes splitting and reassembling archives an absolute doddle.
Now that's what I call comedy!
Re:One other choice by Jebediah21 · 2005-03-09 06:09 · Score: 1

I don't know about anybody else, but when I come across one of those nasty RAR files I have to run WinRAR under wine to open it since the Debian unrar packages are fucked. That is not anywhere close to convenient.

--

Everytime you look at porn a devil gets their horns.
Re:One other choice by fm6 · 2005-03-09 06:58 · Score: 1

Zip does all these things as well. I couldn't say whether RAR or Zip does them better.
Re:One other choice by zerblat · 2005-03-09 07:16 · Score: 1

The free unrar in Debian main can only handle older versions of RAR. Use the non-free (shareware) rar instead.

--
Please alter my pants as fashion dictates.
Re:One other choice by Em+Adespoton · 2005-03-09 11:16 · Score: 1

don't forget http://p7zip.sf.net/ when talking about large archives; the 7Zip formats regularly beat rar; I had a 280MB file compress down to 54MB with rar a -m5, and down to 17MB with 7za -ultra. 7Zip has the added benefit of being less encumbered than RAR or ACE, and more open in use of algorithms.
Re:One other choice by Jebediah21 · 2005-03-09 16:23 · Score: 1

Thanks. I have tried all the versions of unrar available. Do I instead have to install rar and not unrar for this?

--

Everytime you look at porn a devil gets their horns.
Re:One other choice by Anonymous Coward · 2005-03-11 11:31 · Score: 0

That would be "diddle".

Zip by isaac · 2005-03-09 03:29 · Score: 2, Insightful

Zip.

Zip is miles more common than anything else and compresses better (generally) than gzip. It's supported out of the box on almost every OS either natively or with bundled software. Even Solaris comes with unzip.

Forget .tar.bz2 unless your audience is the type of people you'd expect to have cygwin or 3rd-party compression tools installed on their windows peecees.

-Isaac

--
I am not a lawyer, and this is not legal advice. For Entertainment Purposes Only.

Re:Zip by Anonymous Coward · 2005-03-09 03:33 · Score: 0

Well, I find it amazing that zip compresses better than gzip as they both use the same bloody algorithm, you know the one also known as zlib....
Re:Zip by DarkDust · 2005-03-09 03:34 · Score: 1

My experience has been that ZIP doesn't compress as well as gzip, let alone bzip2. But yes, almost everyone can handle ZIPs.

BTW, WinZIP can handle .tar.gz, I'm not sure whether it can handle .tar.bz2 as well.
Re:Zip by Anonymous Coward · 2005-03-09 03:35 · Score: 0

Generally zip compresses nearly the same as gzip. I agree with you that it's much more common though.
bzip2 compresses better but is much slower. ppm compresses best but is even slower. forget z :) rar is between bzip2 and ppm usually.
Re:Zip by EvilIdler · 2005-03-09 03:40 · Score: 2, Informative

Zip and gzip use the same compression.

Zip compresses each file in an archive individually.

Tar+gzip compresses the entire contents as a whole - meaning better
compression than zip archives (unless you add uncompressed files to
an archive, THEN compress the entire archive..)
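
The per-file versus whole-archive difference can be checked with a short sketch using only Python's standard library. The file names and contents below are made up for illustration, and the size gap will vary a lot with real data:

```python
import io
import tarfile
import zipfile


def archive_sizes(files):
    """Return (zip_size, tar_gz_size) in bytes for a {name: bytes} mapping."""
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)  # each member is deflated on its own

    tar_buf = io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w:gz") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))  # gzip sees one continuous stream

    return len(zip_buf.getvalue()), len(tar_buf.getvalue())


# Illustrative data: 200 small files with nearly identical contents.
files = {
    f"src/file{i:03d}.txt": b"public static void main " * 20 + str(i).encode()
    for i in range(200)
}
zip_size, tar_gz_size = archive_sizes(files)
print(zip_size, tar_gz_size)
```

With many small, similar files like these, the tar+gzip figure should come out smaller, because redundancy between files is only visible to a compressor that sees them as one stream.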

WinZip supports tar+gzip archives, from what I remember, but WinRAR
supports .gz, tar.gz, .bz2 and .tar.bz2 files, so why use anything else
on Windows?

Then again, you could use solid RAR archives. Generally the best
size+performance ratio I've tried of these (all compressed as a whole,
some error recovery).
Re:Zip by Anonymous Coward · 2005-03-09 04:06 · Score: 0

Zip is decidedly WORSE than tar+gzip at compression. I managed a 40K java open source distro which is provided both in zip and tar+gzip. The zip distro is typically almost twice the size of the tar+gzip version.
Re:Zip by JimDabell · 2005-03-09 04:39 · Score: 2, Informative

Zip and gzip use the same compression.

According to the ZIP file format specification, ZIP can use a dynamic LZW algorithm.

The whole reason gzip exists is because the standard UNIX compress uses LZW - which, until recently, was protected by a patent (that was the problem with GIFs).

Instead of using LZW, gzip uses the unprotected LZ algorithm, which doesn't contain the improvements that Welch (the 'W' in LZW) made.

So not only do they not use the same algorithm, but that's the whole point of gzip in the first place!
Re:Zip by isaac · 2005-03-09 05:13 · Score: 1

So not only do they not use the same algorithm, but that's the whole point of gzip in the first place!

Thank you. I thought this was common knowledge.
-Isaac

--
I am not a lawyer, and this is not legal advice. For Entertainment Purposes Only.
Re:Zip by Directrix1 · 2005-03-09 05:28 · Score: 1

It can

--
Occam's razor is the blind faith in the natural selection of least resistance and in universal oversimplification. -- EF
Re:Zip by DeeKayWon · 2005-03-09 09:48 · Score: 1

It looks like you're munging compress and zip together here. gzip was created in response to the patent status of the algorithm in compress, and the GP said that gzip uses the same algorithm as zip.
So not only do they not use the same algorithm, but that's the whole point of gzip in the first place!
Well, let's look at a quote from the gzip page you linked:
The first version of the compression algorithm used by gzip appeared in zip 0.9, publicly released on July 11th 1991.
There you have it. Zip uses deflate, just like gzip does. Sure, newer versions of zip can use LZW, but how many programs actually generate zip files that use it?
Re:Zip by JimDabell · 2005-03-09 20:29 · Score: 1

gzip was created in response to the patent status of the algorithm in compress

I know. That's what I said. compress uses LZW.

the GP said that gzip uses the same algorithm as zip.

ZIP can use multiple algorithms, one of them being LZW - the very algorithm that gzip was created to avoid. ZIP and compress both use this algorithm, gzip does not.

Zip uses deflate, just like gzip does. Sure, newer versions of zip can use LZW

No, deflate is just one of the algorithms that ZIP can use. LZW is another one. gzip deliberately avoids LZW, which is probably why it does not compress as well as ZIP (which is how this thread started).
Re:Zip by DeeKayWon · 2005-03-10 06:30 · Score: 1

I asked a question: "but how many programs actually generate zip files that use [LZW]?" Please answer it.
Actually, I've done some research, and a few sources tell me that LZW is called "shrink" in zip vernacular and was only commonly used in the days of PKZip 1.1. It moved to Deflate as the default after that, and indeed, Info-Zip's unzip utility doesn't even enable unshrink by default. If LZW in zip files were common, that wouldn't be a very pragmatic thing to do, would it?
Every zip utility out there now uses Deflate, not LZW. Thus when comparing gzip to zip you're comparing Deflate to Deflate. Any differences in compression level are merely different implementations with different optimizations (cf. pngcrush, pngout, etc).
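
One way to see this for yourself is to look at the method field each container records. The sketch below uses Python's standard `zipfile` and `gzip` modules, so it only shows what those two writers produce by default, not what every zip tool does:

```python
import gzip
import io
import zipfile

payload = b"the same bytes go into both containers\n" * 50

# Build a .zip in memory with the usual "deflated" method, then read back
# the compression method recorded for the member.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("sample.txt", payload)
with zipfile.ZipFile(io.BytesIO(zip_buf.getvalue())) as zf:
    zip_method = zf.getinfo("sample.txt").compress_type

# Build a .gz in memory; the third header byte is the compression method (CM).
gz_bytes = gzip.compress(payload)
gzip_method = gz_bytes[2]

print(zip_method, gzip_method)  # both are method 8, i.e. Deflate
```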

CPIO by DarkDust · 2005-03-09 03:31 · Score: 3, Interesting

I prefer .cpio.bz2 because, unlike tar, cpio can handle special devices just fine (or am I missing some switch for tar which makes it able to handle devices and links?). Since it's also in the POSIX standard, this should be pretty portable as well.

Re:CPIO by flok · 2005-03-09 03:35 · Score: 1

yeah, but it handles sparse files rather funny

--

www.vanheusden.com - home of Multitail, HTTPing, CoffeeSaint, EntropyBroker, rsstail, bsod, listener, nagcon, nagi
Re:CPIO by Meostro · 2005-03-09 03:48 · Score: 2, Informative

Probably a good idea in general, but:
1. No obvious support on Windows for cpio
2. These are going to be test files, so I shouldn't have to worry about special devices / links / sparse files.

I know our admin here at work switched the backups from tar to cpio for exactly that reason, but it's just not universal enough to justify the departure from "normal".
Re:CPIO by ComputerSlicer23 · 2005-03-09 04:02 · Score: 1

Use a decent tar implementation. GNU tar handles block special devices just fine. It archives the block special devices, not the data you get if you open the contents of the device and read from it.
Kirby
Re:CPIO by hey! · 2005-03-09 10:18 · Score: 1

Yes, not to mention its charming syntax.

Once you get it to do something (other than the "find . -depth | cpio -pdl /destdir" kind of thing that is part of your fingers' auxiliary programming), why not round it off with breakfast at Milliways?

--
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
Re:CPIO by phoenix_rizzen · 2005-03-11 08:45 · Score: 1

Get a better tar. :) For instance, the bsdtar from FreeBSD can handle cpio, pax, and several different tar formats (for creating and extracting).

Or, use pax. It's got a much nicer syntax than cpio, and can also handle cpio, pax, and tar formats.

We used to use a horrible combination of cpio, bzip2, and split to image our servers. Was a royal pain to use, especially if you only wanted 1 or 2 files out of the backup. Switched to pax on Linux and bsdtar on *BSD, and everything is just hunky-dory.

Not completely cross-platform if you want free Windows support (although PowerArchiver does handle these tarballs), but it works for our uses.

ppm by Anonymous Coward · 2005-03-09 03:32 · Score: 0

best compression ratios with various data

It really depends... by node+3 · 2005-03-09 03:37 · Score: 2, Insightful

Zip is probably the most commonly installed archiver across all systems.

tar/tar.gz/tar.bz is supported out-of-the-box on Linux and Mac OS X, but can throw Windows users for a loop (easily remedied, but they aren't likely to have untar installed, and will find the file extension at least a bit odd). For some data tar.bz will result in noticeably smaller files, but at a greater cost of compression/decompression time.

After that, you're not really going to find an archival format that's really common.

In the end, it depends on what type of data you are archiving, and your target audience, but unless you have a specific reason otherwise, zip with an md5 checksum file is probably the solution of least effort (just make sure you back-up the archive--don't want to have a problem with the only copy you have!).
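
A minimal sketch of producing such a checksum file with Python's `hashlib`. The function names and the `.md5` suffix are illustrative assumptions; the line layout follows the "digest, two spaces, filename" convention that `md5sum` prints:

```python
import hashlib


def md5_line(path, chunk_size=1 << 20):
    """Return a line in the 'digest  filename' layout used by md5sum."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)  # hash in chunks so large archives need little memory
    return f"{digest.hexdigest()}  {path}"


def write_md5_file(archive_path):
    """Write archive_path + '.md5' next to the archive (hypothetical naming)."""
    checksum_path = archive_path + ".md5"
    with open(checksum_path, "w") as out:
        out.write(md5_line(archive_path) + "\n")
    return checksum_path
```

A downloader can then recompute the digest and compare it against the published line. This detects corruption but cannot repair it.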

Re:It really depends... by OAB_X · 2005-03-09 03:49 · Score: 2, Informative

Winrar can open tar.gz files
Re:It really depends... by Michael.Forman · 2005-03-09 05:25 · Score: 1

This is incorrect. Winzip handles "tar.gz" files out of the box, making it an excellent choice for cross-platform directory storage and compression.

Michael.

--
Linux : Mac :: VW : Mercedes

Multi-format by sporktoast · 2005-03-09 03:37 · Score: 4, Insightful

Have you considered going multi-format?

Either increase the size of your storage to handle 2 or 3 of the more popular and widely available formats (zip, rar, tar.(gz|bz2)), or use compression-on-the-fly libraries (behind a cache to reduce server load). This would allow the recipient to decide, and end up supporting perhaps a larger population.
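
A rough sketch of the compress-on-the-fly-behind-a-cache idea, assuming the uncompressed data sets fit in memory and only gzip and bzip2 are offered. `DATASETS`, `COMPRESSORS`, and `get_download` are hypothetical names, and a real server would more likely cache to disk:

```python
import bz2
import gzip
from functools import lru_cache

# Hypothetical in-memory store standing in for the site's uncompressed data sets.
DATASETS = {"sample-text": b"lorem ipsum dolor sit amet\n" * 1000}

COMPRESSORS = {
    "gz": gzip.compress,
    "bz2": bz2.compress,
}


@lru_cache(maxsize=128)
def get_download(name, fmt):
    """Compress DATASETS[name] as `fmt` on first request; repeats hit the cache."""
    if fmt not in COMPRESSORS:
        raise ValueError(f"unsupported format: {fmt}")
    return COMPRESSORS[fmt](DATASETS[name])
```

Only formats that somebody actually requests are ever generated, so rarely used combinations cost no storage until they are needed.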

--
In a related story, the IRS has recently ruled that the cost of Windows upgrades can NOT be deducted as a gambling loss.

Re:Multi-format by Meostro · 2005-03-09 04:04 · Score: 1

...handle 2 or 3 of the more popular and widely available formats...

That's where I'm leaning, something like .tar.gz for universality and RAR or similar for those that can handle it. I might offer "hot" sets in other formats, so the most popular stuff would be the most accessible but random, esoteric stuff would only come in one or two flavors.
Re:Multi-format by hackstraw · 2005-03-09 10:26 · Score: 1

... increase the size of your storage to handle 2 or 3 of the more popular and widely available formats (zip, rar, tar.(gz|bz2)

That is amusing. I've thought about having a .sig that says "bzip2, saving disk space on servers by using twice the disk space". Yes, bzip2 is a better compressor, but is it really significant today? I rarely if ever come across bandwidth problems. Disk space on my end is cheap. Bandwidth on my end is cheap. In fact, I don't keep commonly available archives on my computer any more in any compression format. Odds are if I need the copy again, I can download and have it on my harddisk in less than 5 minutes either at home or work. Many times the copy will be updated and new and improved.

Back to the original question. It depends on the target audience. If it's Windows land predominantly or even significantly, use zip. If it's pretty much *NIX land, use tar.gz. But under no circumstances should you ever use StuffIt. My god, that is the worst ever. I can't tell you how many times I've "unstuffed" something, and then spent a few minutes trying to find what came out of the archive. It's difficult to "take a peek" at a StuffIt archive to see what is in there, blah, blah. That is one program that kept me off of Macs for 10 or so years. Arggg.
Re:Multi-format by Meostro · 2005-03-10 03:21 · Score: 1

That is amusing. I've thought about having a .sig that says "bzip2, saving disk space on servers by using twice the disk space". Yes, bzip2 is a better compressor, but is it really significant today? I rarely if ever come across bandwidth problems. Disk space on my end is cheap. Bandwidth on my end is cheap. In fact, I don't keep commonly available archives on my computer any more in any compression format. Odds are if I need the copy again, I can download and have it on my harddisk in less than 5 minutes either at home or work. Many times the copy will be updated and new and improved.
You're right, it doesn't make sense, but I'm going to have to disagree with you. The reason you can have that file on your harddisk "in less than 5 minutes" is because of bz2/gz/zip. Disk and band are cheap for you because you're not saturated, you have more than you need of both. If i'm going to be providing this data for $0, I need to do everything I can to save myself as much band and disk as possible. Right now disk is cheaper than transfer, so it doesn't matter if I keep 2 or 3 or 5 copies of something for compatibility, the amount of transfer that having that smaller copy will save me is enough to justify it.

If I have a 10MB gz file that gets downloaded a thousand times, that's 10GB transfer. I could also store an 8MB bz2 version of the same thing, and if half of the people get that one instead of the 10MB version, that's only 9GB transfer. Since storage is cheap versus transfer, and since my storage:transfer ratio is going to be low, it makes much more sense to "waste" some extra disk space to save myself as much bandwidth as possible. Also from a server-load point of view, I can serve 111 more files in the same amount of time if I make both available, since i'm essentially saving 10% of transfer time too.
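
The arithmetic above can be written out as a few lines of Python. This sketch uses the post's own figures and, like the post, treats 1 GB as 1000 MB; the function name is just for illustration:

```python
def total_transfer_mb(downloads, sizes_mb, shares):
    """Total transfer in MB when `downloads` requests are split across formats.

    sizes_mb and shares are parallel lists; shares should sum to 1.
    """
    return sum(downloads * share * size for size, share in zip(sizes_mb, shares))


gz_only = total_transfer_mb(1000, [10], [1.0])              # 10000 MB, i.e. 10 GB
gz_and_bz2 = total_transfer_mb(1000, [10, 8], [0.5, 0.5])   # 9000 MB, i.e. 9 GB

# Downloads that fit in the original 10 GB budget at the new 9 MB average.
average_mb = gz_and_bz2 / 1000
extra_downloads = int(gz_only // average_mb) - 1000
print(gz_only, gz_and_bz2, extra_downloads)
```

The "111 more files" figure comes from the last step: 10 GB divided by a 9 MB average is about 1111 downloads instead of 1000.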

One point of this thread which I forgot to put in the description actually is:
If the consensus is to use what's out there now (.zip, .tar.gz and .tar.bz2), then my question becomes: within the decompression spec of each of these, how do I get the most compression?

7-Zip claims 2-10% better compression on the same source data for zip and gz formats, so gzip -9 xyz will be slightly larger than 7z a -tgzip xyz.gz xyz -mx9. What do you all know that can do better than that?
Re:Multi-format by hackstraw · 2005-03-10 04:24 · Score: 1

If I have a 10MB gz file that gets downloaded a thousand times, that's 10GB transfer. I could also store an 8MB bz2 version of the same thing, and if half of the people get that one instead of the 10MB version, that's only 9GB transfer. Since storage is cheap versus transfer, and since my storage:transfer ratio is going to be low, it makes much more sense to "waste" some extra disk space to save myself as much bandwidth as possible. Also from a server-load point of view, I can serve 111 more files in the same amount of time if I make both available, since i'm essentially saving 10% of transfer time too.

I agree. I thought about that as I was typing my response, but my advice is to make available .zip's if the target is for windows or macs, a tar.gz for the other people.

Being that you cannot predict if anyone on this planet will even want to download your content, it's premature to grobble about 2-10% savings in bandwidth. Also, if it becomes popular, then you should be able to find someone to help with the bandwidth. It's akin to that "premature optimization" quote.

If worst comes to worst, you could just dump the file on sourceforge. They seem to have OK bandwidth, and they don't even charge (for now). That's what I do.

But stick to something sane like zip or gzip (or possibly bzip if you're into it). Using 7zip, being that today is the first time I've heard of it, is pointless. There are more than enough compression types as it is, just pick one.

Look into specific compress utilities, not generic by gus+goose · 2005-03-09 03:38 · Score: 2, Informative

I have found that some formats are far better at some data types than others.

e.g. for:
text,ascii,documents: use any of bzip2, gzip, zip.
audio: use nothing if MP3/AAC/etc. flac for other "raw" formats
video: use the most appropriate encoding (mpeg4/divx,etc) and then don't try to compress.

bzip encodes/decodes slower, but has typically better compression ratios.

So, use whatever people commonly use for the data type you are compressing.

gus

--
.. if only.

Dude! by Anonymous Coward · 2005-03-09 03:40 · Score: 0

What, no Stuffit!?!

My summary.. by Chris_Jefferson · 2005-03-09 03:41 · Score: 3, Informative

I find that for my data (your data may be different) I tend to get the ordering "zip,bzip2,rar,7zip" (from best to worse), with rar and 7zip often being much smaller than bzip2 (my data tends to contain lots of similar large files, which tends to lead to unusually large differences between compressors)

Everyone has an unzipping program. I find on Windows more people have (and get) rar than bzip2, particularly if they are afraid of command lines. 7zip gives the best compression of all, so is particularly useful for big datasets (I often send around 9-20GB data sets) but eats memory like no one's business.

It really comes down to how much you want to make people download compared to how much trouble you want them to go to.

If you want to be "minimal effort", I'd advise providing a .zip along with other things, perhaps listing the size next to the files so people can see it's much bigger (like most sourceforge projects) for those windows users who can't be bothered to get anything else.

--
Combination - fun iPhone puzzling

Re:My summary.. by Chris_Jefferson · 2005-03-09 03:44 · Score: 2, Informative

damn, damn, damn, damn.. I meant of course that zip is the worst, 7zip the best... wish I could edit comments!

--
Combination - fun iPhone puzzling
Re:My summary.. by Scuff · 2005-03-09 03:58 · Score: 1

don't worry too much about it, it was easy enough to figure out what you meant as long as people read the whole comment

Know your audience by Noksagt · 2005-03-09 03:47 · Score: 1

I'm starting up a free data site to provide test data for anything you can imagine: images for compression and format interpretation, text and audio for language processing, programming language examples to test parsing, and more.

Your audience will be developers. Hopefully F/OSS developers. So distribute in a developer-friendly format. BZ2 files can be decompressed with Free software & most of the proprietary applications out there that would be decompressing alternative formats.

While 7-zip is a nice open format & archivers come with open licenses, it isn't mature enough yet. The *nix ports are still fairly young. And few third-party apps would unzip them. Finally, RAR does not use an open license. I would discourage you from using it for this project.

Re:Know your audience by the_greywolf · 2005-03-09 03:56 · Score: 1

RAR has an open decompression library that allows for derivative works that decompress RAR formats. you can link it, modify it, use it, redistribute it modified, whatever, as long as you don't try to reverse-engineer the compression scheme. go download UnRAR and read the damned license.

--
grey wolf
LET FORTRAN DIE!
Re:Know your audience by Noksagt · 2005-03-09 04:03 · Score: 1

I acknowledged this in another post. I also use an LGPLed unrarer. But the RAR compression algorithm is, as the license makes very clear, "proprietary." I've seen very little F/OSS distributed as RAR archives, and I don't think it is coincidence.
Re:Know your audience by Meostro · 2005-03-09 04:38 · Score: 1

The WinRar license states:
Neither RAR binary code, WinRAR binary code, UnRAR source or UnRAR binary code may be used or reverse engineered to re-create the RAR compression algorithm, which is proprietary, without written permission of the author.

and the source code:
The unRAR sources may be used in any software to handle RAR archives without limitations free of charge, but cannot be used to re-create the RAR compression algorithm, which is proprietary. Distribution of modified unRAR sources in separate form or as a part of other software is permitted, provided that it is clearly stated in the documentation and source comments that the code may not be used to develop a RAR (WinRAR) compatible archiver.

So the guy is basically paranoid about keeping his "trade secret" a secret, which makes perfect sense from a business perspective. As far as FOSS is concerned, he even presents decompression code free for all to use.

How much of a problem is a proprietary compressor if the end-user never has to deal with it? Is it a problem if you can't create the archives, as long as you will have free access to use them?

I see it like gzip: gzip is defined by a couple of RFCs, and it's just a file format and a specification for decompressing the data. As long as you end up with something in .gz format, it doesn't matter if you use an open algorithm or a supersecret-proprietary-patented-licensed algorithm to get there. The decompressor is the same in either case, and you already said that you have a LGPL version of the RAR decompressor.
Re:Know your audience by Noksagt · 2005-03-09 05:07 · Score: 1

the RAR compression algorithm, which is proprietary
And there lies the problem that I mentioned. The compression algorithm isn't Free.
So the guy is basically paranoid about keeping his "trade secret" a secret, which makes perfect sense from a business perspective.
But why bother catering to his business when you could use something like 7-zip?
As far as FOSS is concerned, he even presents decompression code free for all to use.
It also makes sense from a business perspective: formats which are distributed gain you more customers when they can be read by everyone. MS makes a free (as in beer) viewer for their office formats too.
How much of a problem is a proprietary compressor if the end-user never has to deal with it?
When there is a completely Free and Open format that works as well or better, it is a significant problem. I know many developers who don't have a RAR program on their system. They never write and never need to read the format. I, myself, use an LGPLed decompressor. But I don't have it on all of my boxes & even uninstall it on some of them after I use it.

Perhaps you are right & only a zealot would take the position I take. But test data benefits developers & you should be using WHATEVER archival format they use for the other software that they use and make. I don't think this will be RAR or another proprietary format.

Of course I'm sure there are plenty of unpopular Open formats that would also be unsuitable too. Your primary concern should be what everyone else is using. The license of your archiver has already impacted this, so perhaps we don't need to argue about how (un)Free RAR is.
I see it like gzip: gzip is defined by a couple of RFCs, and it's just a file format and a specification for decompressing the data. As long as you end up with something in .gz format
Well, where is the RFC or open documentation for the RAR format? Where are the free compressors? And where is the software distributed using this format? RAR is much less Free than gzip, bzip2, or 7-zip.
Re:Know your audience by M1FCJ · 2005-03-09 05:44 · Score: 1

7zip? Unmature? It is perfectly usable. I finally deleted my licensed copy of Winzip because 7zip does its work better. 'nuff said.
Re:Know your audience by Noksagt · 2005-03-09 06:01 · Score: 1

7zip? Unmature? It is perfectly usable. I finally deleted my licensed copy of Winzip because 7zip does its work better. 'nuff said.
7-zip is also on the win32 boxes I administer & winzip isn't. But this kind of use is not enough to prove the kind of maturity that I'm talking about. One piece of evidence is that most users aren't using it for 7zip archives. Another is that the *nix version of 7-zip is about 8 months old and listed as beta. Only in October did KDE and GNOME add support IN THEIR CVS. So, no, I wouldn't release any archives to the world in the format unless I had a significantly better technological reason to use it over bzip2.

.zip file ease-of-use beats out saving 4 bytes by Blakey+Rat · 2005-03-09 03:47 · Score: 2, Interesting

Sure, maybe your 17 MB file is 13.5 with .zip and 13.24 with .7z or whatever, but what it all comes down to is that every current operating system supports .zip files out-of-the-box. Do you think that extra 3 seconds of download is worth making your customer hunt down and install an entirely different program just to see your file? It's not. Why make things harder? .zip is the standard, use it.

--
Comment of the year

Re:.zip file ease-of-use beats out saving 4 bytes by mongolian · 2005-03-09 04:04 · Score: 1

The goal is not always to get data to a customer. What if I want to store some files for myself, as I often do or perhaps am transferring data to a computer that I know supports a given compression. While I will agree that for mass data distribution, more common formats like .zip are the way to go, one should not make a habit of compressing zip in all cases.
Re:.zip file ease-of-use beats out saving 4 bytes by Blakey+Rat · 2005-03-09 05:59 · Score: 1

Read the article summary. The goal *is* to get data to a customer.

--
Comment of the year

Zip bad for multiple files with same name by Jjeff1 · 2005-03-09 03:47 · Score: 1

My experience is that Zip doesn't handle archiving multiple files with the same name. Zip fails if you have a directory structure like..
foo.txt /images/foo.txt
I've also seen zip fail completely trying to compress a directory structure containing very large numbers of small files (more than 10,000).

I always use RAR unless I know the recipient can't handle a RAR file.

Whatever you choose... by BinLadenMyHero · 2005-03-09 03:55 · Score: 3, Insightful

...avoid closed formats.
Using Free software will help you achieve your number one goal: that everyone can access the data, now and forever.

RAR's license is garbage by Noksagt · 2005-03-09 03:56 · Score: 2, Informative

While there are Free rar unpackers, the primary packer/unpacker has a proprietary license. He is catering to open source developers. It is a poor choice.

md5 probably provides enough integrity checking for test data & split/cat make splitting/reassembling ANYTHING easy.

RAR doesn't have a monopoly on integrating these, though. 7zip certainly has many of the service features available in rar.

Re:RAR's license is garbage by gowen · 2005-03-09 04:10 · Score: 3, Interesting

md5 probably provides enough integrity checking for test data & split/cat make splitting/reassembling ANYTHING easy.
But RAR/PAR encoding use a Reed-Solomon error correction scheme, so you can send a few additional files and the data blocks within them can be used to replace *any* lost data blocks from the original set. It's really crafty.
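
The recovery idea can be illustrated with something much simpler than the Reed-Solomon coding described here: a single XOR parity block, which can rebuild exactly one missing block. This is a simplified stand-in and not the PAR format itself, and the function names are hypothetical:

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(blocks):
    """Build one parity block from the data blocks."""
    return xor_blocks(blocks)


def recover_missing(blocks, parity):
    """Rebuild the single block marked None using the parity block."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can rebuild exactly one missing block")
    present = [block for block in blocks if block is not None]
    repaired = list(blocks)
    repaired[missing[0]] = xor_blocks(present + [parity])
    return repaired


data = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(data)
damaged = [b"AAAA", None, b"CCCC"]
assert recover_missing(damaged, parity) == data
```

Reed-Solomon generalizes this so that several parity blocks can replace several lost blocks, which is what makes the "send a few additional files" approach work.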

You're right about the license, though.

--
Athletic Scholarships to universities make as much sense as academic scholarships to sports teams.
Re:RAR's license is garbage by Noksagt · 2005-03-09 04:36 · Score: 1

But RAR/PAR encoding use a Reed-Solomon error correction scheme, so you can send a few additional files and the data blocks within them can be used to replace *any* lost data blocks from the original set. It's really crafty.
You're right. It is clever & is better than md5 for repair purposes. Which is why I said md5 was probably enough for test data (I think bz2 archives are the way to go, so I'm a bit biased).

But, once again, this is not a unique feature of rar. I know I've seen proprietary ZIP programs that offer it. If 7-zip doesn't currently offer it, it will in the future (I know I saw it on the roadmap & think I saw an option the last time I used 7-zip on win32).
Re:RAR's license is garbage by UnrefinedLayman · 2005-03-09 09:28 · Score: 0

It bears mentioning that RAR was designed from the get-go to support the features mentioned, and has for what, seven years now? What you mention is an effort to backport the technology that hasn't even been done yet.

I'm not trying to trash 7-zip, but I'm also not going to try 7-zip when RAR has the track record it does for doing what it was designed to do.

Two formats by Tom7 · 2005-03-09 03:57 · Score: 1

Make two versions of your file available. Use "zip" for its universality, so anyone on any platform can get your file, if they want.

Then, make a more efficiently compressed one for those who know how to download and use it. Bzip2 seems to be the current favorite, especially for text.

Re:Two formats by Anonymous Coward · 2005-03-11 01:58 · Score: 0

And you have made the critical point about which format to use and which format not to use.
Use "zip" for its universality...
Absolutely. Linus, Bill, and everyone in between can read this format.
Bzip2 seems to be the current favorite... (emphasis mine)
And therein lies the problem with Bzip2. Will Bzip2 last, or is it just a passing fad? I don't know. And neither do you.

Like others have said, use what people can read. And that boils down to .zip and .tar.Z. Those have been the "universal" compressed archive format for many, many years.

Security by cswingle · 2005-03-09 03:57 · Score: 1

Don't forget about security issues. If you intend to mail these files as attachments, ZIP and RAR may be blocked by mail servers because both can be executable under Windows. tar.bz2 may be more difficult for a Windows user to figure out, but at least it's not going to infect their computer without a lot of work on their part.

My other comment is to do some experiments with *your* data -- which format actually yields the best compression rate, and how much more time do you spend doing the compression / uncompression. Is the extra time you spend worth the 5% you get? Is using a less popular format (like tar.bz2) worth the time you'll spend leading a Windows user through the uncompression process over and over again?
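
A rough sketch of that kind of experiment, using the compressors in Python's standard library as stand-ins: zlib for the gzip/zip family, bz2 for bzip2, and lzma for the 7z/xz family. The sample data is synthetic and the function name is made up, so swap in your own files before drawing conclusions.

```python
import bz2
import lzma
import time
import zlib

def benchmark(data):
    """Return {name: (ratio, compress_seconds, decompress_seconds)}."""
    codecs = {
        "zlib": (lambda d: zlib.compress(d, 9), zlib.decompress),
        "bz2": (lambda d: bz2.compress(d, 9), bz2.decompress),
        "lzma": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        t0 = time.perf_counter()
        packed = compress(data)
        t1 = time.perf_counter()
        unpacked = decompress(packed)
        t2 = time.perf_counter()
        assert unpacked == data           # lossless round trip
        results[name] = (len(packed) / len(data), t1 - t0, t2 - t1)
    return results

# Synthetic stand-in for a real dataset (assumption: CSV-like text).
sample = b"timestamp,station,temp_c\n" + b"2005-03-09,FAI,-12.5\n" * 20000
for name, (ratio, c_time, d_time) in benchmark(sample).items():
    print(f"{name:6s} ratio={ratio:.4f} pack={c_time:.4f}s unpack={d_time:.4f}s")
```

Timings vary by machine and by data, so treat the numbers as relative, not absolute.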

For the data hosted by our institution, we offer tar.gz and zip formats. Since most of the people using the data (climatologists) are running a Unix variant, most people are grabbing the tar.gz files. Just goes to show that the best format really depends on your situation and your expected users.

--
cswingle Fairbanks AK

Re:Security by Meostro · 2005-03-09 05:10 · Score: 1

My other comment is to do some experiments with *your* data -- which format actually yields the best compression rate, and how much more time do you spend doing the compression / uncompression. Is the extra time you spend worth the 5% you get?

My data doesn't exist yet, that's part of the problem. I need one or a few good general-purpose archive formats, the particulars of which may perform better on one dataset than another. Decompression time matters because it will affect end-users, but compression time can almost be discounted as free: compress once, keep forever. That extra 5% means 5% less bandwidth, forever again, so an extra 20% on compression time might be worth it to me. It also depends on the size of the dataset, 5% of a meg is nothing, but 5% of an ISO is significant.

Is using a less popular format (like tar.bz2) worth the time you'll spend leading a Windows user through the uncompression process over and over again?

In general, the people who would use the data would be relatively savvy since they're already doing development / testing, but I would try to make it as idiot-proof as possible. Whatever formats I use would have decompressors available on-site, probably with binaries for the usual suspects (well, for Windows at least).

RAR is the Ruler of The New World Order..... by Shadow_139 · 2005-03-09 03:59 · Score: 0

The Best one I've used for years now is RAR.

Running an archive backup on the NT4 & Novell 3 servers (old but robust; the Novell box has lived through every virus attack unaffected). Backing up databases from 20MB to 8 gigs...

You can easily open the files in WinRAR under Windows and access them as easily as with Explorer....

Plus it works on every OS, from good old DOS / Novell up to Linux, Winblows, Mac OS.... etc.

Re:Look into specific compress utilities, not gene by harrkev · 2005-03-09 04:12 · Score: 1

I think the point is this:

(lotsa files) -> compress -> (one archive) -> de-compress -> (lotsa files)

Audio and video codecs do not create an archive, and I think that the point is to have a general process, without having a bunch of exceptions based on file type.

BTW: most audio and video codecs are lossy (FLAC is a notable exception). Data out != data in.

--
"-1 Troll" is apparently the same as "-1 I disagree with you."

Technical format comparison chart by uler · 2005-03-09 04:14 · Score: 2, Informative

I've got a rather technical format comparison chart started up [1]. It's still a draft, but it's pretty complete.

It doesn't directly address relative compression ratios or benchmarks. And it's mostly about the formats themselves, not the libraries that implement them. But it's still good for a bird's-eye view, I think.

[1] http://darbinding.sourceforge.net/about_dar.php (The chart is at the bottom of the page.)

/dev/null by BSDFreak · 2005-03-09 04:17 · Score: 1

is the best compression mechanism I've seen. Getting your data back is a bitch, though.

Re:/dev/null by bluGill · 2005-03-09 07:10 · Score: 1

You can always get your data back just fine from /dev/random. You just have to figure out where it starts there, which can sometimes be difficult.
Re:/dev/null by aled · 2005-03-12 00:27 · Score: 1

I found that the restoring process can be enhanced greatly disabling any form of CRC checking when reading from /dev/random...

--

"I think this line is mostly filler"

don't use rar, arj, 7zip, etc by mqx · 2005-03-09 04:26 · Score: 3, Informative

These formats (rar, arj, 7zip, whatever) are not as widely supported as (a) tar/gz, tar/bz2, (b) gzip/bzip2, (c) zip. You can get (a)-(c) on just about every platform. Numerous times I've downloaded "unzip" to make (c) work. It's so simple.

Another point not mentioned elsewhere: virus scanners: at least with the popular tar/zip formats, you know that virus scanners understand them and can look for problems.

Sure, you may get a few extra features or a little more room out of non-standard archivers, but that's largely not an issue.

stuffit by Hes+Nikke · 2005-03-09 04:29 · Score: 1

even though it's not free, i'm quite fond of Stuffit's sitx format. the expander is available as a free (not as in beer) download from http://www.stuffit.com/, as well as being included on the mac platform.

--
Don't call me back. Give me a call back. Bye. So yeah. But bye our, well, but alright we are on a shirt this chill.

Re:stuffit by Kris_J · 2005-03-09 12:13 · Score: 1

The new version of Stuffit looks like it will rock. That 25-30% JPEG compression, plus a generally competitive algorithm shows great potential.
Re:stuffit by Hes+Nikke · 2005-03-09 17:07 · Score: 2, Informative

While we are on the subject of JPEG compression, i recently launched http://jpgcrunch.com/, which reduces JPEG file sizes losslessly. that can help things too :D

*Battens down the hatches for an incoming barrage of slashdot traffic*

--
Don't call me back. Give me a call back. Bye. So yeah. But bye our, well, but alright we are on a shirt this chill.

For Longevity by 4of12 · 2005-03-09 04:42 · Score: 4, Insightful

Pick any system for which the source code is available, eg .tar.bz2

Anything else is gambling.

I still gamble, but only that a C compiler will exist in the future.

--
"Provided by the management for your protection."

Re:Look into specific compress utilities, not gene by Meostro · 2005-03-09 04:46 · Score: 1

Lossy = very bad, i'm looking for a few general-purpose archivers. I'm sure there will be some MPEG stuff up there if for nothing other than a file format sample, but most stuff isn't going to be lossy-able and still make sense.

I'd be perfectly happy to have separate archivers for different formats, but my main concern is universality, even above compression ratio. FLAC or Monkey's for audio sounds great, but I need to be sure that everyone will be able to handle it. That's why I might be stuck with .zip or .tar.gz since they're the most universal, but i'll consider anything that's cross-platform or available as source.

Stuffit also has error correction by Anonymous Coward · 2005-03-09 04:58 · Score: 1, Interesting

Making sure your data doesn't get corrupted should be more of an issue than how compressed you can get it. Sadly, Stuffit is the only thing I can find that has error correction. I'm surprised, because error correction is as old as the hills. The Bose-Chaudhuri algorithm comes to mind.

I just went on a search for some ten year old data and the first place I found it, it was corrupted. Thank goodness for redundancy. I finally found an uncorrupted version but it took me a couple of days which could have been put to a better use.

Maybe someone could code an error correcting version of tar.

Re:Stuffit also has error correction by Meostro · 2005-03-09 06:52 · Score: 1

RAR at least has error correction built in, StuffIt isn't the only thing out there.

In this case, corruption is not an issue. I intend to keep redundant backups of the original datasets, so even if the web-based archives get corrupted I will be able to recover the data.

LZIP by Victor_Os · 2005-03-09 05:10 · Score: 1, Funny

lzip, of course http://sourceforge.net/projects/lzip/

Re:LZIP by larley · 2005-03-09 05:29 · Score: 1

Sure, if you WANT lossy compression. I doubt he'd really want that when he's distributing software... RTFP next time.
Re:LZIP by Anonymous Coward · 2005-03-09 05:41 · Score: 0

Wooosh...

P.S. he is not distributing software...
distribute arbitrary datasets
RTFA next time.
Re:LZIP by Anonymous Coward · 2005-03-09 07:31 · Score: 0

Hint to mods: This was a very relevant joke. It wouldn't hurt you to mod it funny just because he has a history of being a firstpost troll.
Re:LZIP by The+Bungi · 2005-03-09 10:06 · Score: 1

C'mon mods, this is funny. Check out the lzip website, you'll get it.

Re:Whatever you choose...Agreed by marcus · 2005-03-09 05:49 · Score: 1

Don't worry about cross-platform stuff. Choose something open and it can be ported, even to new, currently nonexistent platforms.

Open algorithms, open source, no BS, makes your choice easy.

--
Good judgement comes from experience, and experience comes from bad judgement.
- W. Wriston, former Citibank CEO

My Preference by vbrtrmn · 2005-03-09 06:00 · Score: 3, Informative

For my own personal archives, I have taken the methods from the masters in USENET.

OS X & UNIX: I'm lazy, so just tar.gz

For Win32, I back-up a lot more files under win32 than *nix.

Compression: WinRAR
  Compression Method: Best
  Split to Volumes: 20MB
Parity: QuickPar
  With general settings.

I back-up to decent quality DVD media, as I have had a lot of problems with CD media rotting after about a year.

--
it's a sig, wtf?

You're looking for something that doesn't exist. by pclminion · 2005-03-09 06:13 · Score: 1

The various types of data you mention are not uniformly compressible by any single algorithm. Therefore your "compression ratio" criterion is dubious. Compression ratio on what sort of data?

Dictionary-based compression schemes work well on data which might be described as "linguistic," i.e., data which has some kind of grammar describing it. English text, machine code (binaries), source code, HTML, etc. It won't work very well at all on audio or image data, at least not without some kind of preprocessing (usually called filtering). Bzip2 does somewhat better on audio because it is not a dictionary-based scheme.

There's no really simple way to get everything you are asking for. Do you want to distribute archives of multiple, compressed audio files? Then the files should be compressed individually, then archived. OTOH, if you want to distribute a bunch of .txt files, you should do the opposite -- archive first, and THEN compress.

If you are really serious about saving disk space, there's simply no way you're going to have a single, universal compressed archival format. OTOH, if the main goal is to have a format everybody can deal with, and perhaps as a side effect get a little compression out of it, I'd just use plain ZIP.

Re:Whatever you choose...Agreed by Meostro · 2005-03-09 06:31 · Score: 1

If the point of this is to make data available, then cross-platform availability needs to be my primary concern.

I can spec and write and offer to the public the SuperXtreme Archive format, and make my data available only in that format. Unless there is a compelling reason to switch to SXA for other purposes (general adoption), then it's essentially proprietary to my site and won't really be of any use to anyone.

OSS is not the be-all and end-all of utility or availability, only of portability.

Being one who also generates multi-GB to of data.. by Trelane · 2005-03-09 07:14 · Score: 2, Interesting

I've used both gzip and bzip2. I rather like bzip2 for plain text data files, but there is a rather large cost--compressing and uncompressing can take a much longer time than with gzip! These are important considerations to make, especially if you're gonna need to pull this data back off the shelf anytime soon. For this reason (time), I currently use gzip for intermediate-range storage.

--
Given enough personal experience, all stereotypes are shallow.

not what you asked, but... by leehwtsohg · 2005-03-09 07:30 · Score: 2, Informative

gzip is really horrible at error recovery. Trying to recover data from a damaged tar.gz file is hard, because gzip does not keep byte boundaries. bzip2 is much better in this respect, and recovering from problems is much easier with it.
Never back up using tar.gz - use tar.bz2 instead.
Not on topic, but I had to say it, 'cause I've been hurt (nothing like finding that your .tar.gz backup was damaged, just when you need it)

uharc by biryokumaru · 2005-03-09 07:41 · Score: 2, Informative

no one seems to remember uharc these days. uharc still compresses tighter than rar or bz2. it's hard to come by, though. personally i still use tar bz2, cuz no one can decompress uharcs, and it's usually a little better than rars, at least for source files. i think it was open source at some point, which solves the long-term dilemma faced with flash in the pan formats... if you could find the source.

--
When you're afraid to download music illegally in your own home, then the terrorists have won!

Re:You're looking for something that doesn't exist by Meostro · 2005-03-09 07:42 · Score: 1

The various types of data you mention are not uniformly compressible by any single algorithm. Therefore your "compression ratio" criterion is dubious. Compression ratio on what sort of data?

Compression on any sort of data. Give me a good general-purpose compressor. Give me a good one just for audio, just for video, just for text, whatever. I have no idea what kinds of data i'm going to get, or how much of each kind there might be. A custom compressor for audio (FLAC) will almost always outperform a general-purpose compressor (gzip) because it uses specialized knowledge of the data to make better guesses as to how to compress it, but if only eight people in the world use FLAC versus 8 million for gzip, I had better use gzip.

What's a good dictionary-based compressor? What filters are generally available? Bzip2 is a block-sorting (Burrows-Wheeler) compressor with Huffman coding, what do you recommend I use that on versus using gzip, compress, etc.? I know what I can find on Google, and I know what I use, I just don't know what "everyone" has. There are hundreds of algorithms and formats out there, and I need to pick the right two or three.

There's no really simple way to get everything you are asking for. Do you want to distribute archives of multiple, compressed audio files? Then the files should be compressed individually, then archived. OTOH, if you want to distribute a bunch of .txt files, you should do the opposite -- archive first, and THEN compress.

This is exactly the kind of thing i'm looking for. I've nearly always used the arc-then-comp method to get the best compression, I wouldn't have thought that individual files compress better than shared datasets.

tar is bad for integrity by kherr · 2005-03-09 08:01 · Score: 1

I found tar to be a dated format that has no checksumming of individual files. I ran into a situation where a large tarball was made, and tar tf foo.tar done to verify it. A later attempt at extract failed due to corruption.

There are horrors that arise with tar. First, there are multiple tar record formats. The original tar only supported 14-character file names (original unix file system limitation). Along came a second tar format, but even that ended up with variants. Most people are using the GNU tar format these days, but the old stuff still crops up. Heck, early versions of Mac OS X had two separate commands (tar, gnutar) because of this.

The worst, however, is the streaming nature of tar. A tar file header contains the length of the file data. It is assumed the data immediately following that are another tar file header. If you have corruption you get "lost" in the data stream and very few tar implementations will find a resync point so you can recover files later in the stream.
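
A small sketch of that header-to-header chain, assuming the classic 512-byte ustar layout (name in bytes 0-99, size as octal text in bytes 124-135). It builds a sample archive with Python's tarfile module and then walks it by hand. The helper names are made up for illustration, and this is not how any particular tar implementation is written.

```python
import io
import tarfile

BLOCK = 512

def build_sample_tar():
    """Create a small uncompressed tar archive in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, body in [("a.txt", b"hello"), ("b.txt", b"x" * 1000)]:
            info = tarfile.TarInfo(name)
            info.size = len(body)
            tar.addfile(info, io.BytesIO(body))
    return buf.getvalue()

def walk_headers(raw):
    """Follow the header -> data -> header chain using only the size field."""
    offset, members = 0, []
    while offset + BLOCK <= len(raw):
        header = raw[offset:offset + BLOCK]
        if header == bytes(BLOCK):            # an all-zero block marks the end
            break
        name = header[0:100].rstrip(b"\0").decode("ascii")
        size_text = header[124:136].rstrip(b"\0 ").decode("ascii")
        size = int(size_text or "0", 8)       # size is stored as octal text
        members.append((name, size))
        # Data is padded to a multiple of 512 bytes; the next header follows it.
        offset += BLOCK + ((size + BLOCK - 1) // BLOCK) * BLOCK
    return members

print(walk_headers(build_sample_tar()))
```

Because the only pointer to the next header is the current header's size field, one damaged size sends the walk into the middle of file data. A recovery tool has to scan forward for something that looks like a valid header instead.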

I had the joy of needing to do just this to recover as much as I could from a corrupt tarball. It was kind of a fun java exercise, but after I was done I swore I'd never use tar again for archiving data. Zip is my choice now. It's supported everywhere (even Java's jar files are zip format) and is much more reliable than tar.

7-zip: No Drag-n-drop by denis-The-menace · 2005-03-09 08:36 · Score: 2, Informative

7-Zip is a one-man project that needs help to add features
or at least manage its SourceForge support and RFE forums.

Drag-n-drop has been requested for almost 2 years and now,
some of its users are defecting to TUGZIP because of it.
http://sourceforge.net/tracker/index.php?func=detail&aid=663095&group_id=14481&atid=364481

Either the guy is too busy, doesn't care or just doesn't want to share control.

Maybe it's time to fork 7-zip?

--
Obama's legacy: (N)othing (S)ecure (A)nywhere and (T)error (S)imulation (A)dministration

you need real time web compression by mozkill · 2005-03-09 09:11 · Score: 2, Insightful

if you are making a site which is for people to download stuff from then why not use real-time web compression?

1. your web server can display the un-compressed version of the file on the server, 2. then the user starts the download from the browser, 3. the webserver compresses it on the fly and delivers it to the browser, which unzips it when it's done.

this saves you time from having to zip and unzip all the time and it SERVES your original purpose.
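
a rough sketch of the server-side decision in step 3, assuming plain HTTP content negotiation with the standard Accept-Encoding and Content-Encoding headers. the function name is made up, and this simplified parser ignores q-values, so it is an illustration rather than a complete implementation.

```python
import gzip

def negotiate_body(raw_body, accept_encoding):
    """Return (headers, body), gzip-compressing only when the client offers gzip."""
    offered = [token.split(";")[0].strip().lower()
               for token in accept_encoding.split(",")]
    if "gzip" in offered:
        body = gzip.compress(raw_body)
        headers = {"Content-Encoding": "gzip", "Content-Length": str(len(body))}
        return headers, body
    # Client did not offer gzip: send the stored file untouched.
    return {"Content-Length": str(len(raw_body))}, raw_body

stored_file = b"station,temp_c\n" + b"FAI,-12.5\n" * 5000
headers, body = negotiate_body(stored_file, "gzip, deflate")
print(headers["Content-Encoding"], len(stored_file), "->", len(body))
```

the trade-off is CPU on every download versus compressing once and storing the result.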

--

-- Betting on the survival of the media industry is a serious risk. I advise investing elsewhere.

Re:You're looking for something that doesn't exist by pclminion · 2005-03-09 09:29 · Score: 1

Compression on any sort of data. Give me a good general-purpose compressor.

There isn't any such thing, unless you use a very narrow definition of "general purpose." To me, general purpose would imply that I could throw any sort of data at it I please, and it would do well, provided the data is not just random data. Since we're working with very vague definitions I can only give very vague suggestions.

To answer your specific questions, a good dictionary compressor is the Flate algorithm used by gzip. Very fast decompression, moderately fast compression, which can be tuned by a single parameter ("compression level" ranging from 0 to 9).

The filters depend again on the sort of data you are compressing. For non-photographic image data, the standard PNG linear filters, as well as the Paeth filter also supported by PNG, usually do quite well. For photographic data, just use high quality JPEG and consider it lossless (it could be argued that the amount of noise injected by JPEG is on the same order of magnitude as the noise in the digital imaging process, so who cares where the noise comes from, it's there whether you want it or not). For audio data, channel decorrelation followed by linear prediction is pretty much standard.
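
A hedged sketch of the Paeth filter idea for one-byte-per-pixel image rows, followed by zlib (Deflate) compression. The predictor follows the PNG definition (left, above, upper-left neighbours); everything else, including the function names and the synthetic gradient image, is an assumption made for illustration and is not a PNG encoder.

```python
import zlib

def paeth_predictor(a, b, c):
    """PNG Paeth predictor: a = left, b = above, c = upper-left."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c

def paeth_filter(rows):
    """Replace each byte with (value - prediction) mod 256."""
    out, prev = [], bytes(len(rows[0]))
    for row in rows:
        filtered = bytearray(len(row))
        for x, value in enumerate(row):
            a = row[x - 1] if x else 0
            c = prev[x - 1] if x else 0
            filtered[x] = (value - paeth_predictor(a, prev[x], c)) % 256
        out.append(bytes(filtered))
        prev = row
    return out

def paeth_unfilter(rows):
    """Invert paeth_filter, rebuilding the original rows."""
    out, prev = [], bytes(len(rows[0]))
    for row in rows:
        restored = bytearray(len(row))
        for x, value in enumerate(row):
            a = restored[x - 1] if x else 0
            c = prev[x - 1] if x else 0
            restored[x] = (value + paeth_predictor(a, prev[x], c)) % 256
        out.append(bytes(restored))
        prev = bytes(restored)
    return out

# Synthetic stand-in image: a smooth diagonal gradient, 64x64 pixels.
image = [bytes((x + y) % 256 for x in range(64)) for y in range(64)]
filtered = paeth_filter(image)
assert paeth_unfilter(filtered) == image      # the filter itself is lossless
print("raw:     ", len(zlib.compress(b"".join(image), 9)))
print("filtered:", len(zlib.compress(b"".join(filtered), 9)))
```

The filter does not shrink anything by itself. It only turns smooth gradients into small, repetitive residuals that a generic coder handles better.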

Bzip2 does well on any sort of data with repeating patterns, or nearly repeating patterns. In general, it works well on the sorts of data that a dictionary approach would work on, but it also works fairly well on audio data (much better than a dictionary approach, but still worse than if you had applied a filter specifically designed for audio data and then compressed with something simple like Flate or even just Huffman or Rice codes).

When bzip compressing many small files it is imperative to archive first and then compress, so that the 900k block sizes are used to their fullest advantage.
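
A quick way to check that claim on your own files. This is a sketch that uses synthetic, very similar small text files as a stand-in, with Python's tarfile and bz2 modules. The helper names and file contents are made up.

```python
import bz2
import io
import tarfile

def make_files(count):
    """Generate many small, similar text files (synthetic stand-ins)."""
    return {
        f"log_{i:03d}.txt": (f"run {i}\n" + "status=ok value=42\n" * 20).encode("ascii")
        for i in range(count)
    }

def compress_each_then_total(files):
    """Total size when every file is bzip2-compressed on its own."""
    return sum(len(bz2.compress(body)) for body in files.values())

def archive_then_compress(files):
    """Size of one tar archive of all files, bzip2-compressed as a whole."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, body in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(body)
            tar.addfile(info, io.BytesIO(body))
    return len(bz2.compress(buf.getvalue()))

files = make_files(200)
print("each file separately:", compress_each_then_total(files))
print("tar, then bzip2:     ", archive_then_compress(files))
```

Each separately compressed file pays its own stream overhead and cannot share redundancy with its neighbours, which is why the single archive wins for data like this.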

What does "everyone" have? It seems to me that if you're starting a mass data archival site, you can boost the number of algorithms by making them available on your site. At the bare minimum I would expect people to support zip, flate, bzip2, jpg, flac.

Is lossy compression an option, or not an option?

PAR File Recovery by Noksagt · 2005-03-09 10:10 · Score: 1

It bears mentioning that RAR was designed from the get-go to support the features mentioned, and has for what, seven years now?
RAR was not designed from the get-go to support Reed-Solomon repair. This was a feature added for 2.0 (released 1996). So, it has supported it for 8 years, but it was also back-ported.

You also don't need to use RAR to use the same file recovery mechanisms. Use PAR on ANY type of file to benefit from it!

I'm not trying to trash 7-zip, but I'm also not going to try 7-zip when RAR has the track record it does for doing what it was designed to do.
7-zip does what it is designed to do as well: use an open source program to both create and extract archives with very high compression ratios.

Re:You're looking for something that doesn't exist by Meostro · 2005-03-09 10:55 · Score: 1

There isn't any such thing, unless you use a very narrow definition of "general purpose." To me, general purpose would imply that I could throw any sort of data at it I please, and it would do well, provided the data is not just random data. Since we're working with very vague definitions I can only give very vague suggestions.

My "general purpose" is exactly what you said, without the "would do well" part, or substituting "would not do horribly" in its place. I know that all algorithms have strengths and weaknesses, i'm just looking for what has the fewest weaknesses and the most strengths. My definition is vague because I have vague data to compress; I don't have all audio or all images or all text, i've got a mix. As I said (and I believe you implied), I don't expect to get optimal compression using one or two algorithms, but I would like something that's pretty good all-around.

The image filters you mention, AFAIK are only available in conjunction with PNG. Nothing exists today that could work like

cat image.dat | paeth_filter | gzip -c9 > image.compressed

There is no common set of transform filters that can be used as a front-end for generic coders (gzip/bzip2). I might end up creating them, with something like a Java interface to make it instantly cross-platform, with the hopes that the filters would be included in some major distros down the line.

What does "everyone" have? It seems to me that if you're starting a mass data archival site, you can boost the number of algorithms by making them available on your site. At the bare minimum I would expect people to support zip, flate, bzip2, jpg, flac.

As with your previous post, this is exactly what i'm looking for: what does everyone have, what is commonly supported. If all I needed was an OSS implementation of my compression, I'd bastardize gz, bz2, 7z and ppm together with some filters, some Rice, Huffman, Shannon, Fano, Lempel, Ziv and Welch and a bit of Burrows and Wheeler to make "Best... Compressor... Ever" and unleash it upon the world. =)

Is lossy compression an option, or not an option?

Both, actually. Some stuff like language samples will probably end up as MP3, some of the video sequences will probably be MPEG and most images will end up JPG. Anything where minute details don't matter will probably be lossy, but probably 80% or more will be text, code and other data that needs to be lossless. That's what i'm looking for here, since lossy formats for audio/video/images are pretty standard MP3/MPG/JPG, and I can tweak the loss to get the size where I want it, at least to some point.

Re:You're looking for something that doesn't exist by pclminion · 2005-03-09 11:05 · Score: 1

Anything where minute details don't matter will probably be lossy, but probably 80% or more will be text, code and other data that needs to be lossless.

Whether text compression needs to be lossless is actually debatable. I'm gonna veer a little off topic here, but hey, it's Slashdot...

Suppose you are compressing English text by Huffman encoding entire words at a time. However, people make typos, so the actual set of words to be encoded will be larger than a set where there were no typos. By first running a spell check you can slightly increase the compression efficiency. Strictly, this is "lossy" compression since the decoded data is not the same as the encoded data. But not only have you improved compression efficiency, you've fixed the typos.
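
A toy sketch of that argument: build a word-level Huffman code and count the bits needed before and after a cleanup pass. The "spell check" here is a tiny hard-coded correction table, which is purely an assumption for the demo, and the cost of storing the code table is ignored.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code(words):
    """Build a prefix code {word: bitstring} from word frequencies."""
    freq = Counter(words)
    if len(freq) == 1:                        # degenerate single-symbol case
        return {next(iter(freq)): "0"}
    tiebreak = count()                        # keeps heap entries comparable
    heap = [(n, next(tiebreak), {w: ""}) for w, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (n1 + n2, next(tiebreak), merged))
    return heap[0][2]

def encoded_bits(words):
    """Number of bits needed for the word stream (code table not counted)."""
    code = huffman_code(words)
    return sum(len(code[w]) for w in words)

raw = "the cat saw teh cat and the dog saw hte cat".split()
# Hypothetical spell-check step: a tiny fixed correction table.
fixes = {"teh": "the", "hte": "the"}
clean = [fixes.get(w, w) for w in raw]

print("with typos:   ", encoded_bits(raw), "bits")
print("after cleanup:", encoded_bits(clean), "bits")
```

Folding the typos back into "the" shrinks the symbol set and makes the common word even more common, so the cleaned stream needs fewer bits.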

For something like C code, it is not necessary to precisely reproduce the indentation style of the original code. It could be boiled down to a simple canonical format and then compressed with a C-grammar-aware method, then run through "indent" on the other end to recover reasonable indentation.

Example: High Voltage SID Collection by Kris_J · 2005-03-09 12:09 · Score: 1

The HVSC is a 40-50MB (compressed) collection of a huge number of small files. They provide the current version in zip and rar format, with a set of incremental upgrades as zips. I would start by looking at their model.

Though personally, I prefer 7-zip and Stuffit (can't wait for the new version).

MS using LZIP? by Anonymous Coward · 2005-03-09 19:52 · Score: 0

Wait, I thought that Microsoft is already using LZIP to distribute their software.

Otherwise, how can you explain...

Oh, wait.

Large File/Archive Support by Detritus · 2005-03-09 21:50 · Score: 2, Interesting

When choosing a standard, don't forget to check how the format deals with large files and large archives. I've run into numerous problems with software that can't deal with anything larger than 2 GB, either for individual files or total archive size.

--
Mea navis aericumbens anguillis abundat

Whatever happened to ARJ? by l0rd · 2005-03-09 23:20 · Score: 1

Maybe this is a bit off topic, but whatever happened to the ARJ format? It used to be king. One of its best features was being able to arj something onto multiple floppy disks.

Can anyone enlighten me on the fate of this once most favoured compression algorithm?

Best way to distribute archives... by spywarearcata.com · 2005-03-10 02:19 · Score: 1

...is to upload the material to a Gmail account, then send the recipient the account name and key. Let Google handle the data compression, backup, system maintenance, etc.

Re:You're looking for something that doesn't exist by Meostro · 2005-03-10 02:58 · Score: 1

Interesting, and true. The only problem is that no spell-checker will get everything right, so it may "correct" words to the wrong thing, and that could make a huge difference in meaning.

Don't forget the concept of letter order in English (and some foreign) text, it might be possible to alpha-sort the interior of words and still present readable text, although there is an example included that shows it might not be the best idea:

A dootcr has aimttded the magltheuansr of a tageene ceacnr pintaet who deid aetfr a hatospil durg blendur

Re:You're looking for something that doesn't exist by pclminion · 2005-03-10 04:55 · Score: 1

Interesting idea to sort the letters! That would enhance bzip's compression efficiency somewhat. On the decoding side, it would be fairly simple to map the sorted words back to the originals. An ambiguity resolver based on, say, a third-order Markov model might be able to make the right choice most of the time (similar to some OCR cleanup techniques).
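
A small sketch of the letter-sorting step and the lookup table needed to map sorted forms back to words. The vocabulary is a made-up toy list, and the Markov-model ambiguity resolver is not implemented here. The table only shows where ambiguity would arise.

```python
import re
from collections import defaultdict

def sort_interior(word):
    """Keep the first and last letters, sort the letters in between."""
    if len(word) <= 3:
        return word
    return word[0] + "".join(sorted(word[1:-1])) + word[-1]

def encode(text):
    """Apply sort_interior to every alphabetic run, leaving the rest alone."""
    return re.sub(r"[A-Za-z]+", lambda m: sort_interior(m.group()), text)

def build_decoder(vocabulary):
    """Map each sorted form back to the set of vocabulary words producing it."""
    table = defaultdict(set)
    for word in vocabulary:
        table[sort_interior(word)].add(word)
    return table

vocab = ["doctor", "patient", "hospital", "salt", "slat", "blunder"]
table = build_decoder(vocab)

print(encode("a hospital drug blunder"))
print(sorted(table[sort_interior("salt")]))   # two candidates, so context must decide
```

Any sorted form that maps to more than one word is exactly the case where a context model would have to pick the right candidate.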

BTW, I sent you an email suggesting a few more data sets you might host on your site.

zip or tgz: yes. bzip2 sadly no. by Evil+Pete · 2005-03-11 10:46 · Score: 1

I recently tried unpacking a bzip2 package under windows. It took me ages to find something that would recognize it and extract it. Which is a shame because it is a nice format ... if you aren't doing this a lot since it takes more time.

However, winzip out of the box will open tarballs and of course zip. And gzip / unzip are pretty much universal on *nix. I have however found that very large tarballs can be a problem with Winzip (like 100+ MB) but that was a long time ago.

And I would never, for the original poster's purpose, use anything with a proprietary licence like RAR. It could easily end up like the GIF fiasco all over again.

--
Bitter and proud of it.
