Question gzip Maven Jean-loup Gailly
Jean-loup Gailly is the author of gzip and, now, CTO for Mandrakesoft, purveyors of Linux-Mandrake. Jean-loup's home page tells you quite a bit about him, including some interesting peeks into his life beyond Linux and open source software. Please try to keep it down to one question per post. Submitted questions will be chosen from the highest-moderated. Answers will appear within the next week.
Unisys has contacted several open source package maintainers about the used of LZW. They have claimed that the patent not only covers compression but also decompression. As a result, authors of software that may only do decompression of LZW get harrassed by Unisys for expensive licensing or loose the ability to do decompression of LZW information. One popular case of this is the Xpdf package being forced into doing LZW indirectly. How exactly does gzip handle decompression of ".Z" files in a way that avoids the wording of the patent?
I would say the warnings, these days, are probably more in the line of disclaimers than real warnings. (ie: Don't blame me if your machine turns into a mushroom.)
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
I was reading through the docs that come with the latest bzip2 source recently. They still have all of the "WARNING: BETA SOFTWARE ALERT" messages all over the docs talking about how the software *should* work, but has not been thoroughly tested and may cause data corruption or loss.
Does anybody know how long bzip2 has been out? (Honestly curious) It's a damn good compression algorithm, even if it is slow. I've never had a single problem with data corruption or loss (with the one exception of a file that got trashed because the disk was bad, not the compression algorithm). I would think that if the bzip2 format is good enough to use to distribute the linux kernel to the world, it's probably good enough for every day files.
Any other people with comments/thoughts on the stability of bzip2? Why all those warning messages - just because it isn't v1.0 yet?
-- Truth goes out the door when rumor comes innuendo. -- Groucho Marx
When I was at the NYC Linux Expo last month, I spoke with one of Mandrake's head honchos. I don't remember his name, but he was a rather skinny guy who looked like he was in his 50's, with a mustache and a very thick french accent. Really nice guy, he kinda lamented that he didn't have as much time to code recently.
He told me that LinuxOne had done a lot of things such as printed up stickers and promotional items with the Mandrake character on them, and also other stuff with the tophat and magic wand on it to promote LinuxOne type things. For those of you who were at the Expo, you'd probably think it was pretty brazen since LinuxOne's booth was right behind Mandrakes! It was, and it was true, they were handing some of that stuff out on Friday (the last day of the show).
He said, (as I remember) "We would have given them permission to use them if they had just asked, but they never did". What he didn't say, (but that I got out of the conversation) was "Those guys have got a lotta fuckin' nerve to be doing that."
-- Truth goes out the door when rumor comes innuendo. -- Groucho Marx
I was curious as to how you percieved big money and the compention vs. cooperation debate in the open source community. In other words, how much should be cooperative between you and your fellow linux distrobuters and how much should be compentition?
;)
And i know, one per post, but I've always wondered. Should big money support developers directly by hiring them, or by donating to development groups? I sure wish a big company would pay me to work on opensource projects from home, I could do a lot more work if i didn't have to go to my job every day
That's easy: there's a fairly exhaustive patent space of arithmetic encoders. A quick search of www.patents.ibm.com on "arithmetic encoder" listed seven patents, including 4488143, which featured a convenient link to how to license the patent from IBM. The other 6 were owned by Canon (2), Mitsubishi (2), Lucent, and Sharp. IIRC, there were a couple others (I used to work in compression) owned by Sony, etc., that didn't turn up in this rudimentary (30-second) snapshot.
This is my opinion and my opinion only. Incidentally, IANAL.
MOO;IANAL.
There used to be a picture linked here.
the bwt (burrows wheeler transform---the algorithm that gave us bzip/bzip2) is now about 5-6 years old, and appears to be heading for the big time, especially for large files (i.e. >2M tarballs), where it's very strong. in the areas where it's weak (small files, compressing real-time streamed data, etc) it appears to be making few inroads.
what do you see it's future as being, in relation to gzip ?
do you see any wide-spread use of large-model compression schemes (for example PPMD) ?
If I recall correctly, AT&T holds a patent on that. The original bzip used arithmetic encoding, but then the patent came to the attention of the author. He pulled the program, redesigned the algorithm, and released it as bzip2.
Arithmetic encoding is no big deal, however. The author has stated (I forgot where) that it only gives you a 1-2% improvement in compression ratio. I think it also runs faster, but only by a similarly small margin.
iSKUNK!
When will a version of Mandrake for i386
be released?
Our user group has a number of 486 machines
we'd like to put Mandrake on. Mandrake has the
best 'out of box' desktop right now.
Help achieve Liberty in your lifetime - join the Free State Project - http://www.freestateproject.org
Algorithms differ in what they offer at what price:
Gzip is a good general-purpose compressor with the additional quality of being written in C and availalbe under the GPL.
So you have to choose according to your needs.
> I think the next big breakthrough will be a compressor that
> can take a file with not much repetition of data (therefore hard
> to compress using current algorithms) and create a file with
> much more repetition in it (and perhaps larger) and then compress
> that down.
That's precisely what the filters in the PNG graphics format do. By calculating differences between adjacent pixels, a new datastream is created that has more repetition in it, and that new datastream is handed to Jean-loup's DEFLATE engine. Because the PNG filters are reversible, the original datastream can be recovered after decompression.
The way you phrase your reply makes it sound liek you still don't even understand the most basic tenets of data compression. The idea is that if you have a body of data that holds, say, 16 separate bit-pair combinations, and they are repeated over and over, you only need 5 bits in the compressed file to represent the 16 which are produced for output. The idea of increasing the repetition in some reliable way was just an idea I had a few days ago, I'm not a charlatan or an imbecile, I've simply never read (and I've read a lot about data compression) anything that even thought about artificially creating repetition. In a simple run-length algorithm this could work. Just keep track of the position where you replaced a bit, and instead of having
(9)(B)(1)(Y)(9)(B) you'd have (19)(B)(10)Y, there, compression. 6 bytes down to 4. That's just a general idea and not the idea I had at all (mine involved bitwise operations on a couple of bytes, producing 3 bytes from which the original 2 could be reconstructed, and I think the 3 produced would have more repetition in a large random set than the 2 originals alone, I might be completely wrong about that though).
Esperandi
"Whatever your 'magical' method or algorithm would be, it is *mathematicallly impossible* to have an algorithm that works on arbitrary files of say N bits and compresses each of them to N-1 or less bits. "
I'm going to have to spell this out very carefully being as its the third time I have said it and no one seems to comprehend it. NO DUH! I have never claimed that any technique I may come up with or anyone else might come up with would be such a universal compressor. I didn't even say in my original response that the jumps in compression would be in universal compressors. You are right that no compression algorithm can be guaranteed to, say, reduce a file to exactly half of its original size. I've never thought, stated, or insinuated anything even remotely along those lines.
I'm surprised you claim that its mathematically impossible to create artificial repetition in order to improve compression considring several things. Number one, I gave an example that proves that it is possible, at least in a simple way. Number two, as per a couple replies to an earlier post of mine in this thread (whihc you must not have read), PNG and bzip2 do exactly this. So apparently they're using a different set of mathematics than you.
While it may be true that the average person sees that text files get real small with zipping and the average person takes about an hour to convince them that they can't recursively zip a file into 1 byte, I would think from my previous discussions mentioning various algorithms that it would become obvious that I'm quite well-versed in data compression (not in theoretical entropy sustaining junk, in real taking X number of bits and figuring a way to reconstruct them from a smaller set of bits).
The advances in data compression will come like MP3 did (already a disproof that there will be no great advances, audio compression before MP3 was either inefficient or extremely lossy), taking advantage of the peculiarities of the data being compressed. Video compression still has a lot of room for improvement, and that improvement will be developed by streaming media companies. 3D point data, texturing files, text, etc... there are thousands of kinds of files. In the future there'll be one compression program that uses hundreds or thousands of algorithms to deal with them. it might even use the same extension for its files to further confuse people...
Esperandi
The future is going to be fun. Everything is always getting better. Anyone who tells you the past was better is wrong.
Everyone knows MandrakeSoft was started as a French company, but I was wondering what kind of problems you encountered while trying to start a Linux based company in France.
Data compression IS impossible in the most general case, as paradoxical as that may seem. The best we can do are some heuristic algorithms that happen to reduce in size most of the inputs that such algorithms encounter in day to day usage. Lossless compression will and always must be limited by information theory; you cannot get more information out of a given output to a compression algorithm than can be represented by the set of all states of the output: a n-bit output to ANY reversible compression algorithm can have no more than 2^n preimages that when compressed produced such an output. This is not an empirical observation based on a specific compression algorithm now in use, but a theoretical one based on simple information entropy considerations.
A lossless reversible compression algorithm that signficantly (indeed at all) reduces in length more than a minute proportion of all possible inputs (please contact me if you want a clarification of this statement; I'm handwaving the precise definition of the phrase minute proportion) is impossible. It will simply not happen. The pigeonhole principle is the most basic argument against universal compressors; a given string of bits cannot represent more *distinct* other string of bits than the total number of *distinct* states of that string of bits.
Your "big breakthrough" is a standard claim made by a) well-meaning people who think they have a great new algorithm that revolutionizes data compression or b) con-artists who want to persuade people to part with their money for the next best thing. The key difference is that the first category is usually willing to learn.
For more information, please read the comp.compression FAQ.
The Wheeler-Burrow block sorting compression algorithm implemented in bzip2 does not "create a file with much more repetition in it." Rather, it takes advantage of the *heuristic* observation that on *most* of the files that people compress, adjacent bit strings will tend to clump more with certain bit strings than others. The block sorting algorithm itself produces an output with the exact same number of each byte as the input to the algorithm; the next phase in the algorithm uses a simple most-recently-used filter to create an output that will most likely have a substantially lower number of set bits than the input. A simple Huffman code is then used to compress this highly redundant output.
This notwithstanding, the simple fact remains that no algorithm can compress all possible inputs, regardless of what transformations are performed on it, out of simple uniqueness and distinguishibility considerations.
Keep up the good work.
Not everyone deserves a 320i
I've used Mandrake as my only distro ever since I got into using Linux. I really like the i586-enhanced compiled binaries, but I really miss the ability to use Netware servers with an easy client/server application.
Is Mandrake planning on including such apps the way Debian is? This would be another big plus for Mandrake, IMHO.
Keep Mandrake coming!
mad-cat
In the near future, do you see a breakthrough in the efficiency of data compression... a new algorithm or technique that could improve current compression rates by orders of magnitude?
As far as modern compression algorithms go, gzip is one of the faster and more efficient around. Since increases in memory and processor speed seem to be coming along so often these days, do you think the algorithm could be modified to wring out a little bit of extra compression, or would we be better served by using considerably more resource-intensive compression schemes such as Burrows-Wheeler[bzip2] or PPM?
Using tar makes things unnecessarily complicated. There is less support for tar around on non-UNIX platforms, and 'embedded compression/archiving' seems to cause great trouble for newbies who can just about handle WinZip and nothing more.
If gzip is to become a truly viable alternative to patented zip, I think the .tar.gz should become a thing of the past.
Remove the old legacy tape archive!
Would you think it wise to roll alternatives to the Lempel-Ziv algorithms into gzip to make other compression utilities less attractive?
It seems that this approach is adopted by other applications (ssh uses multiple encryption engines, and TIFF has allowed several compression techniques for quite a long time).
Would you support an effort to implement bzip2 within gzip? Do you think such a thing could be done while maintaining gzip's stability?
I notice you are a keen Go player... the GNOME version of Go (Iagno) seems much more attractive to me than the KDE version (kgo). I was wondering what software you use to play games, or are you not really interested in the interface at your level of play?
Regards,
Denny
# Using Linux in the UK? Check out Linux UK
Police State UK - news and
On your website, in the history section, you have a link to some information about pulsars...
Were you an astronomy student, and if so how did you go from studying pulsars to CTO of a major Linux distributor?!?
Regards,
Denny
# Using Linux in the UK? Check out Linux UK
Police State UK - news and
I have read about an agreement between Mandrake and LinuxOne to create a chinese Linux development center. Did any good come out of this? Couldn't Mandrake's otherwise excellent reputation be damaged by such relationships?
I strongly believe that trying to be clever is detrimental to your health. -- Linus Torvalds
Why do you write code like this:
z = (z = g - w) > (unsigned)l ? l : z;
It makes your code almost impossible to read. Do you even know what this line does anymore?
If you could compress anything and put it in your pocket what would you choose and why?
blog and junk
I see from your homepage you avoided patents when writing gzip, how do you feel about the current explosion of software related patents?
blog and junk
I guess that you have at least a little something to say about this.
Is the 586 optimization enough to justify Mandrake's position? Are you especially proud of any of the architectural differences between the distributions (from what I have been told, the Apache-PHP layout is quite a bit different).
How do feel about the steps that Red Hat has taken to change their distribution in reaction to yours?
The Data Compression Book was an excellent reference when it came out, but there are some hot topics in compression that it doesn't cover - frequency-domain lossy audio techniques (MP3), video techniques (MPEG2 and especially MPEG4), wavelets (Sorenson video uses these, I believe, and JPEG2000 will), and the Burrows-Wheeler transform from bzip.
Do you have any plans for a new edition of the book, or good Web references for these techniques? BZip is covered well by a Digital research note, but documentation for MPEG2 seems only to exist as source code and I can't find anything concrete about using wavelets for compression. The data is all there on the comp.compression FAQ, but the excellent exposition of the book is sorely lacking.
When I think of a game like go or chess, I think that each player develops there own algorithm to beat their opponent. If you agree, what relationships or similarities do you see between your intrest in Go and your intrest in compression?
Inquiring minds want to know.
When is gzip going to provide (transparent) support for bzip2 files and the Burrows-Wheeler algorithm?
Will BW be an algorithm option within the gzip file format itself ever?
However, much of the software you've written has started gathering a few grey hairs. Gzip, for example, has been at 1.2.4 for many, many moons, and looks about ready to collect it's gold watch.
Is compression software in a category that inherently plateus quickly, so that significant further work simply isn't possible? Or is there some other reason, such as Real Life(tm) intruding and preventing any substantial development?
(I noticed, for example, a patch for >4Gb files for gzip, which could have been rolled into the master sources to make a 1.2.5. This hasn't happened.)
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
I noticed that you allowed the people who make the Winzip product to incorporate code written for Gzip. I think it's cool that you did that, because it would be horrible if winzip couldn't handle the gzip format, but at the same time, what are your thoughts about allowing free software code to be included in closed-source products?
Just out of curiosity, (tell me it's none of my business if you want to and I'll be OK with that) did you receive a licensure fee from the company that makes Winzip for the code?
-- Truth goes out the door when rumor comes innuendo. -- Groucho Marx
The field of compression has been thronged with patents for a long time - but patents at least reveal the algorithm.
What do you think of the expansion of trade-secret algorithms (MP3 quantisation tables, Sorensen, RealAudio and RealVideo, Microsoft Streaming Media) where the format of the data stream is not documented anywhere?
Tom
The compression world has many patents, notably for Lempel-Ziv compression as used in GIF. What is your view on companies patenting non-obvious algorithms for such processes as data compression?
11.0010010000111111011010101000100010000101101000
I am a happy owner of The Data Compression Book (2nd Ed). With the increasing availability of compression routines within libraries (Java's GZIP streams spring to mind), does this make your book a little unnecessary?
Should software authors continue to write their own compression routines, or simply trust the versions available to them in library form?
I can see some definite advantages to library code, i.e. the ability to upgrade routines, and having standardized algorithms which can be read by any program which utilizes the library.
Doug
Venn ist das nurnstuck git und Slotermeyer? Ya! Beigerhund das oder die Flipperwaldt gersput!
As we all know, at first Mandrake was little more than a repackaged version of RedHat. That's changed a bit with the newer versions. My question is this: to what degree will Mandrake continue to differ from RedHat and will there ever be a "developer" version (i.e. one that is centered towards those who are a bit more technically competant)?
Brad Johnson
--We are the Music Makers, and we
are the Dreamers of Dreams
Brad Johnson