How I Completed The $5000 Compression Challenge

From the FAQ. by Anonymous Coward · 2001-04-23 23:07 · Score: 1

When the FAQ explains why it doesn't work, you'd assume he wouldn't try it. Goldman is correct.

In short, the counting argument says that if a lossless compression program compresses some files, it must expand others, *regardless* of the compression method, because otherwise there are simply not enough bits to enumerate all possible output files. Despite the extreme simplicity of this theorem and its proof, some people still fail to grasp it and waste a lot of time trying to find a counter-example. This assumes of course that the information available to the decompressor is only the bit sequence of the compressed data. If external information such as a file name, a number of iterations, or a bit length is necessary to decompress the data, the bits necessary to provide the extra information must be included in the bit count of the compressed data. Otherwise, it would be sufficient to consider any input data as a number, use this as the file name, iteration count or bit length, and pretend that the compressed size is zero. For an example of storing information in the file name, see the program lmfjyh in the 1993 International Obfuscated C Code Contest, available on all comp.sources.misc archives (Volume 39, Issue 104).
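
The counting argument quoted above is easy to check by brute force for small sizes. A minimal Python sketch, assuming "shorter" means any bit string of length 0 to n-1 (the function names are illustrative, not from the FAQ):

```python
def count_strings_of_length(n: int) -> int:
    """Number of distinct bit strings of exactly n bits."""
    return 2 ** n


def count_shorter_strings(n: int) -> int:
    """Number of distinct bit strings of length 0 .. n-1 (empty string included)."""
    return sum(2 ** k for k in range(n))  # closed form: 2**n - 1


if __name__ == "__main__":
    for n in (1, 8, 16):
        inputs = count_strings_of_length(n)
        shorter = count_shorter_strings(n)
        # There is always at least one more n-bit input than there are shorter
        # outputs, so no one-to-one (lossless) mapping can shrink every input.
        print(n, inputs, shorter, inputs - shorter)
```

Whatever the compression method, at least one n-bit input has no shorter output left to map to.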

Re:sorry! by Stephan+Schulz · 2001-04-23 22:25 · Score: 1

Right. I think a decent solution would be for Mike to match Patrick's $100 and donate the total to the EFF or some other related charity.

And Mike should probably both reword the challenge and, to avoid the appearance of conflicting interests, offer to donate future processing fees (after the challenger admits defeat) in any case.

--

Stephan

Re:Intriquing by sjames · 2001-04-23 22:38 · Score: 2

which isn't true because if this were true every number would be a random number (because every number could potentially be generated by this number-generator).

But every number CAN potentially be produced by a random number generator. The hypothetical file full of zeros doesn't violate this because you only know you got all 0s after the fact. Watching the bitstream, each bit that came out had a 50% probability of being a 1, it just happens that none of them were over that particular sample period.

Thus, given 2000 (or any arbitrary number) of 1 Mbit samples (also arbitrary) from a true random number generator, one of them MAY, upon examination, contain all 0s. That still gives you no basis to predict the contents of the remaining files (the all 0s file has exactly the same probability as any other arrangement of bits).

If you flip a coin 100 times yielding all heads (assuming a non-biased coin), the odds of heads on the next flip is 1/2.

Re:sorry! by sjames · 2001-04-23 23:56 · Score: 2

It sounds to me like the person offering the $5000 was scamming everyone because it's theoretically impossible to win.

The only place I've seen that challenge is in the compression FAQ. It's in the same section of the FAQ that explains the impossibility. It's hard to call something a scam when the 'brochure' explains that you will certainly lose your money.

It's even more understandable when you consider that the USPTO has issued several IMPOSSIBLE patents for compressors that work on random data (recursively no less)!

Re:Intriquing by sjames · 2001-04-26 02:22 · Score: 2

I don't know about you, but if I came across a coin that flipped heads 100 times in a row, I'd strongly suspect a biased coin.

Given no other information, I would suspect the same. That's why I had to specify. Replace coin with random bit generator if you prefer.

What is "total file size"? by CaseyB · 2001-04-23 21:11 · Score: 2

The problem is that the definition of "total file size" is subject to interpretation. If the guy holding the contest had said "total file size as reported by ls -l", he'd be screwed. But he could argue that "total file size" means the number of bytes of disk used to store the file.

Although, by both providing a certain file AND specifying the file size (3145728 bytes), one could argue that he *implicitly* defined file size as the number of bytes within the file. Hence, he loses.

Re:What is "total file size"? by malfunct · 2001-04-24 07:39 · Score: 1

If the person holding the challenge had defined it as the "total amount the file took on my filesystem including filesystem overhead and such", all the person taking the challenge would have to do to win is put the file on a more efficient file system. The way to stop all the trickery seems to be to specify the file name in advance, allow only one final file, and specify a sandbox with a defined (yet Turing complete) set of operators for working on the data.
Even given these restrictions the contest could still be won, because the chance that the person holding the challenge can create a file, of a size requested by the challenger, that cannot be compressed by legitimate means is very low. This is due to the fact that given any piece of data generated at random there is a decent chance of finding an algorithm for representing that specific piece of data in a smaller form.
Of course your algorithm would only work for recreating that one piece of data.
An impossible task would be for the holder of the challenge to have a file of fixed size that the challenger must compress (given the tighter restrictions mentioned in this post) because then the person holding the challenge has much more control.

--
"You can now flame me, I am full of love,"

Re:"to be disbursed at my sole discretion" by CaseyB · 2001-04-24 00:52 · Score: 2

He didn't put a phrase like "to be disbursed at my sole discretion" at the start of the description of the prize award.

However, he was also asking for $100 up front, to encourage only serious submissions. I don't think he'd get many submissions if he tried both tactics. :)

No, and it's provably false by hawk · 2001-04-23 23:47 · Score: 2

There are 2^n ways to arrange n bits. The number of all possible sequences of bits with lengths ranging from 1 to (n-1) is 2^n - 2.

There is *at least* one sequence that cannot, by an arbitrary method, be reduced to a shorter length. On top of that, you need to use some bits for the compressor.

OTOH, if you can include "external" information in your compressor (either the algorithm or a function), you can have an algorithm that will compress any stream, and compress your favorite stream to a single bit. As an example, the first bit is 1 (and the only bit) if it is your stream, and 0 followed by the entire stream for any other stream.

hawk

Re:Compression by AxelBoldt · 2001-04-24 09:13 · Score: 2

As the length of the file increases, the number of bits needed to *specify the position* of this run of zeroes also increases. And you're going to have to record that position somewhere in your decompressor or compressed file.

Nope, you don't have to record it, simply attach the part of the original file following the block of zeros to your decompressor program. The "compressed file" is the part of the original file preceding the block of zeros. Your decompressor prints out the contents of the compressed file, then 1024 zeros, then it reads from itself the remainder of the file and prints that out.
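
A minimal Python sketch of the trick described above, assuming the run is counted in zero bytes and that a run of the required length actually exists in the file (function names and the sample data are made up for illustration):

```python
def split_at_zero_run(data: bytes, run_length: int):
    """Return (prefix, suffix) around the first run of `run_length` zero bytes,
    or None if the data contains no such run."""
    pos = data.find(bytes(run_length))  # bytes(k) is k zero bytes
    if pos < 0:
        return None
    return data[:pos], data[pos + run_length:]


def rebuild(prefix: bytes, suffix: bytes, run_length: int) -> bytes:
    """What the decompressor does: emit the 'compressed file' (prefix),
    then the zeros, then the remainder it carries inside itself (suffix)."""
    return prefix + bytes(run_length) + suffix


if __name__ == "__main__":
    original = b"abc" + bytes(4) + b"xyz"  # toy stand-in for original.dat
    prefix, suffix = split_at_zero_run(original, 4)
    assert rebuild(prefix, suffix, 4) == original
    print(len(original), len(prefix) + len(suffix))
```

Note that the position of the run is not gone: it is carried by the length of the prefix file, which the file system stores outside the byte count.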

The bet can be won; not in every instance of course, but in the long run.

--

Re:Compression by AxelBoldt · 2001-04-24 23:27 · Score: 2

Yes, or simply require that the contestant submit a single program (with prescribed file name) which can regenerate original.dat and is shorter than original.dat.

In that case, the outcome of the bet clearly depends on the machine model. If the machine model is "complete Debian Linux running on Pentium", I would conjecture that the bet can't be won, but it is far from clear (obviously, a succinct powerful language is needed, maybe A+). On many machine models, the bet can be trivially won.

--

Re:Compression by AxelBoldt · 2001-04-25 01:36 · Score: 2

No, the machine architecture doesn't matter. The bet can be won only if you are very astronomically lucky.
I'd gladly stake large amounts of money on this, though it would feel like I'm running a scam.

If you allow me to specify the machine model ahead of time, you'd lose. This was observed by Anonymous Coward in article 466.

--

Re:Compression by AxelBoldt · 2001-04-26 01:30 · Score: 2

As I understood the challenge, there was no requirement that the data I give you be random, though I might ordinarily give random data for such a challenge. If that was the machine model, however, I'd just give you a 0 followed by random data, and you would lose.

Good point, the proposed machine model is not Turing complete and you can analyze it ahead of time.

If I remember my Kolmogorov complexity right, the probability that a random file can be generated by a program that's n bits shorter is roughly a*b^(-n), where a and b depend on the (Turing complete) machine model. If the model is chosen wisely to make the constants small, it should be possible to compress more than 1/50 of files by one bit. But it's not nearly as simple as I thought.

--

Re:Compression by AxelBoldt · 2001-04-26 02:31 · Score: 2

the probability that a random file can be generated by a program that's n bits shorter is roughly a*b^(-n)

It's a*2^(-n), sorry.

--

I'd comment... by Enahs · 2001-04-23 23:43 · Score: 2

...but Yahoo! seems to have taken the site down. Thank you, Yahoo!

--
Stating on Slashdot that I like cheese since 1997.

An ingenious solution... by jd · 2001-04-23 20:16 · Score: 1

Technically, -both- guys are correct, because English is a horribly ambiguous language.

The challenger is correct in that the problem of compressing an arbitrary amount of truly random data is insoluble, and I'm not going to question that this is the problem as the challenger believed he had set.

The challengee is correct in that the challenge did not specify any particular form of compression. Compression is simply the removal of redundancy. If, by splitting the file and removing a number, redundancy is removed, he HAS compressed the file.

On the flip-side, the challenger loses karma and brownie points for trying to fiddle the figures. You can't argue on the basis of inodes, as the challenge doesn't refer to a specific OS or filing system. (MSDOS, for example, uses fixed-sized FAT tables, rather than inodes.)

Also on the flip-side, the challengee almost certainly knew that the problem =as written= was NOT the problem as the challenger intended. True, it wasn't his job to second-guess, but if you enter a lion's den, you can still expect to get bitten.

My personal opinion -- BOTH sides should withdraw their various claims, and the sum total of all monies involved ($5,100) should be donated to an appropriate charity.

Overall, this should be a lesson to ALL -- Don't Assume. It can burn, and burn bad.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)

Re:An ingenious solution... by jd · 2001-04-23 22:35 · Score: 2

No. There is a difference between random DATA and a random number GENERATOR. To be random, data must follow some mathematical rules, one of which is that the probability distribution is equal at all points. Another is that the next point's probability is NOT affected by the prior point's.
You are correct, however, in saying that -any- sequence can be generated by a random number generator. However, there is no general solution to the compression of a sequence S into a smaller sequence C, where a unique 1:1 mapping exists.
HAVING SAID THAT, if you compressed S into C, and could manually determine which of the reverse mappings was S, you -would- be able to "compress" generic random data of any size and of any degree of randomness.
(This is because the user possesses some additional information, I, which is therefore redundant and can be ignored, for the purpose of the compression.)

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:An ingenious solution... by jd · 2001-04-23 22:47 · Score: 2

The 218 files would have taken no additional space on the FAT, as it's fixed size.
Further, you can format tracks down to 128 bytes per sector, leaving an average of 64 bytes of dead space per file. That's 13,952 bytes of dead space, if we store the files "conventionally".
If, however, we store all the files in a single indexed file, there's no dead space to account for.
The filenames could be considered another issue. However, again, DOS' root directory is fixed-size and therefore file entries within it take no additional space.
Lastly, it doesn't matter if the guy "admits" to exploiting vague wording. Within the wording of the rules as given, he won. Within the intent of the rules, as given, he lost.
Ergo, BOTH people have a claim to "victory". They can either split the purse, or they can shake hands and donate it to some worthy cause.
In the end, this comes down to something I've been asked all too often myself. If you had to choose, would you rather be happy or right?
IMHO, the guys should cut out the righteous act, and stick with being happy: the contestant, that he "defeated" the challenge, and the guy who set the contest up, that he achieved his goal of reducing the snake oil surplus.
Let's face it, those are worthy achievements in themselves. Be satisfied! The money is just causing grief. Neither probably needs it. So why not clear the air, and give the cash to someone who does?!

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:An ingenious solution... by jd · 2001-04-25 01:39 · Score: 2

Yawn. I know perfectly well that the FAT uses clusters of sectors. So? You can configure the disk geometry in the boot sector.
Hex disk editors are fun tools. You might want to try them, some day, when you're finished trolling and want to get some work done.
The dead space (NOBODY calls it "slack space"! Slack is what people do on Fridays) is a function of sector size, and you can define whatever sector size you like. (The defaults are the "sensible" geometry, but by no means the only one.)
I suggest you also get Peter Norton's books on MSDOS, and print out a copy of the Interrupt listings, which you can find on the Internet. The Norton Guides are extremely handy, too. Oh, and you'll want a copy of Flight Simulator 2.0, which came on a 5.25" floppy. The "copy protection" scheme is to use 1 Kb sectors, which not only stopped DOS reading them directly, it also gave the disk about 256 extra bytes per track.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:An ingenious solution... by Ed+Avis · 2001-04-23 20:53 · Score: 2

In this sort of contest, it's traditional that finding ingenious ways to exploit loopholes in the rules is a perfectly valid way to win. Then the rules are tightened up if ever the contest is re-run.

Mike Goldman didn't make it clear what 'total size' meant - I think most programmers if you asked them would assume the sum of file sizes, if you didn't specify otherwise. His fault, I feel. He should pay the $5000 and write the rules more carefully next time.

BTW, if you do assume that 'total size' is just the sum of file sizes, you can compress anything down to zero bytes. Number each possible sequence of binary digits - the empty string is zero, '0' is one, '1' is two, '00' is three, '01' is four, '10' is five, '11' is six, and so on. Then just create the required number of zero-length files.
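
The numbering above is a one-to-one mapping between binary strings and non-negative integers. A small Python sketch of one way to compute it, using the usual trick of prepending a 1 bit (the function names are illustrative):

```python
def string_to_index(bits: str) -> int:
    """Map '' -> 0, '0' -> 1, '1' -> 2, '00' -> 3, ... as in the listing above."""
    return int("1" + bits, 2) - 1


def index_to_string(index: int) -> str:
    """Inverse mapping: drop the '0b' prefix and the leading 1 bit."""
    return bin(index + 1)[3:]


if __name__ == "__main__":
    for i in range(7):
        print(i, repr(index_to_string(i)))
```

For a 3 MB input the index has roughly as many bits as the file itself, so the "required number" of zero-length files is astronomically large; the information has only moved into the file count.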

--
-- Ed Avis ed@membled.com
Re:An ingenious solution... by ethereal · 2001-04-23 22:01 · Score: 1

It's not cheating if you follow the rules of the contest as explained to you, and create a winning entry as defined by those rules. The creator of the challenge is at fault for not creating a challenge that would rigorously prove the point he was trying to make. Shame on Mike, if anything - he should know better how to construct a correct challenge, since as he said he has a FAQ about exactly this sort of thing.
I agree that in the real world, no actual compression occurred, and the compression gurus seem to be correct that a general solution to the compression challenge is impossible. Under the logic of a challenge which allowed tailoring of the compressor to the input and multiple files for the output, however, compression did occur. Unfortunately for the challenge provider, he allowed the problem space to be sufficiently dissimilar to reality for the problem to be solvable.

--
Your right to not believe: Americans United for Separation of Church and
Re:An ingenious solution... by Surak · 2001-04-23 21:27 · Score: 2

No way...the guy didn't compress shit. All he did was some data shifting. If the guy had read the FAQ, he would know that such tricks are not compression...you mention an MSDOS FAT filesystem. Even on a FAT filesystem, the 218 (or whatever) files and the decompressor would have taken up more space than the original file due to slack space...although the amount would vary from hard drive to hard drive, if you figure an average of 512 bytes of slack space per file that's 111,616 bytes of slack space, far more space than Patrick claims to have "saved."

And anyways, the guy basically admits that he cheated by exploiting the vague wording of the challenge.

I've got to hand it to Patrick, though, because if you do take the challenge out of context, he basically won...it depends on what the words "compressor" and "decompressor" mean.

--
My journal has hot /. gossip.
Re:An ingenious solution... by Surak · 2001-04-24 04:12 · Score: 2

The 218 files would have taken no additional space on the FAT, as it's fixed size.

Further, you can format tracks down to 128 bytes per sector, leaving an average of 64 bytes of dead space per file. That's 13,952 bytes of dead space, if we store the files "conventionally".

I never said that it would take additional FAT space, I'm talking slack space...if you knew anything about the FAT filesystem (which you obviously don't), you would know that it's not sectors you have to worry about, it's clusters (GROUPS of sectors). There are only so many clusters because of FAT's fixed size... I was being QUITE generous in saying that 512 bytes is a practical average. On a 1.2 GB MS-DOS formatted drive (not FAT32, mind you, we're talking DOS here, not Windows), the cluster size ends up being something like 16K, leaving the average slack space at 8K. So I was being VERY generous there.
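
The slack-space figures being argued over follow from simple arithmetic. A hedged Python sketch, assuming files are allocated in whole allocation units and that on average half of the last unit is wasted (the helper names are made up for illustration):

```python
def slack_bytes(file_size: int, unit_size: int) -> int:
    """Bytes allocated but unused when a file occupies whole allocation units."""
    if file_size == 0:
        return 0  # assumption: an empty file is allocated no units at all
    units = -(-file_size // unit_size)  # ceiling division
    return units * unit_size - file_size


def expected_total_slack(file_count: int, unit_size: int) -> int:
    """Rule-of-thumb total: half an allocation unit wasted per file."""
    return file_count * unit_size // 2


if __name__ == "__main__":
    # 128-byte units give the 64-byte average, 1 KB units the 512-byte
    # average, and 16K clusters the 8K average mentioned in this thread.
    for unit in (128, 1024, 16 * 1024):
        print(unit, expected_total_slack(218, unit))
```

Either way the total depends entirely on the allocation unit chosen, which is why the two posters arrive at such different numbers.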

--
My journal has hot /. gossip.
Re:An ingenious solution... by ChadN · 2001-04-23 20:57 · Score: 2

In particular, the filename encodes information about the order in which the files need to be reassembled; so for each byte saved by the "compressor", several bytes are needed (on average) to encode the ordering as an ASCII integer in the filename.
The challenger should have made it clear that these costs would be counted against the reconstructed filesize (or simply not allowed any deviation from the "one file" rule). The decompressor should operate by having all the files piped into stdin, thus eliminating filesystem state info entirely.
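
A rough Python sketch of that cost, assuming the pieces are numbered 1..N in decimal ASCII and taking the figure of 218 files mentioned elsewhere in this thread (the function name is illustrative):

```python
def ascii_index_bytes(file_count: int) -> int:
    """Total bytes needed to write the indices 1..file_count as decimal ASCII."""
    return sum(len(str(i)) for i in range(1, file_count + 1))


if __name__ == "__main__":
    n = 218  # file count quoted in the thread, not verified here
    total = ascii_index_bytes(n)
    print(total, total / n)  # total bytes and average bytes per filename
```

If each split saves only about one byte, as others in the thread say, an ordering cost of more than one byte per file already cancels the saving.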

--
"It's overkill, of course. But you can never have too much overkill." - Anonymous Slashdot Coward
Re:An ingenious solution... by MikeBabcock · 2001-04-24 23:13 · Score: 2

The quote everyone seems to avoid is:
I am curious, though, and I am not trying to get any sort of angle on you here, what brilliant idea have you come up with that makes you think you can compress arbitrary data? Whether or whatever you disclose won't affect this challenge and I will live up to my end of the bargain regardless.
That sums it up for me.

--
- Michael T. Babcock (Yes, I blog)
Re:An ingenious solution... by supersnail · 2001-04-23 23:11 · Score: 1

If he was serious he should have worded the competition this way:-
"I will send you a 1.44 MB file on a standard DOS formatted HD diskette".
"To earn 5,000 you must return the compressed file AND the decompressor on a similar standard 1.44MB standard formatted HD diskette".
Tricky but still possible.

--
Old COBOL programmers never die. They just code in C.
Re:An ingenious solution... by steveheath · 2001-04-23 22:21 · Score: 1

The FAT file system is interesting. On a FAT file system this actually did perform compression. The FAT (table) is fixed size and takes up the same disk space however many entries are in it (up to its limit). Thus the hiding mechanism takes advantage of an area of the file system that isn't normally used.
Still, I don't think he ever expected the $5000, and expected to lose the $100 too. Fair enough I guess.. some ppl are willing to pay a price to prove a point (look at OSS :)
Re:An ingenious solution... by susano_otter · 2001-04-23 23:26 · Score: 2

Ingenious, but not ingenious enough. The FAT table is a limited resource, and his solution uses up more of that resource than the original file. It's still not a valid compression - not to mention the fact that it's totally unportable anyway. Other OS's won't even concede this marginal point. . .

--
Any sufficiently well-organized community is indistinguishable from Government.
Re:An ingenious solution... by arfy · 2001-04-23 23:53 · Score: 1

The only problem with getting them to donate the money is that only Patrick has ponied up any cash. Looks like Mike Goldman is more concerned with money than honor; Patrick met the challenge as written and Mike weaseled out by saying that it wasn't what he meant.
Re:An ingenious solution... by EllisDees · 2001-04-23 22:05 · Score: 1

When one makes this sort of challenge, he should be sure that the wording is unambiguous. Mike really should have seen it coming when he was asked about splitting up the file. To threaten legal action was just asinine.

--
-- Give me ambiguity or give me something else!

Re:Are you for lawyers or against? by iabervon · 2001-04-24 00:03 · Score: 1

If he had not specifically accepted the possibility of more than two files, it would be a case of someone exploiting a loophole in his challenge. But the challenger actually asked about the loophole, and was told that it was okay. If you're making bets like this, you ought to be sufficiently attentive to not permit loopholes if the person asks about them before using them.

In this case, the challengee specifically asked if the challenger would permit multiple files, and the challenger said yes. It's hard to argue that, simply because he *intended* to say either no, or "yes, but you have to include metadata size on all files after the first" or something else that would actually require compression, that he should be let off the hook. He was asked a straightforward question and gave a straight and wrong answer, and should lose because of that.

Easy way to show he lost by dmahurin · 2001-04-23 22:00 · Score: 1

Create a partition of 3,145,728 bytes on a hard disk.
Put the data file there.
Now try to put the solution in that same partition somehow (as a file system, tar, or a big exe).
It will not fit.

Re:Easy way to show he lost by Chandon+Seldon · 2001-04-24 04:03 · Score: 1

That's a different problem with a different solution.

--
-- The act of censorship is always worse than whatever is being censored. Always.

Omega by Zooko · 2001-04-24 01:45 · Score: 2

He should have used Gregory Chaitin's Omega number to generate the challenge file.

Actually I really don't understand Chaitin's work well enough to know if that would have saved him the $5000, but at least he (and the challenger) would have learned something about algorithmic complexity theory.

Zooko

no genius required by slew · 2001-04-24 04:08 · Score: 2

As people have already mentioned, the numbers are probably not pseudo-random as they were
retrieved from random.org...

However, if someone decides to be cute and repeat this challenge with a pseudorandom sequence,
you don't need to be a good mathematician, you just need to know how to read a paper written
by a good mathematician... Look for the paper...

Massey, J. 1969. Shift-Register Synthesis and BCH Decoding. IEEE Transactions on Information Theory.
IT-15(1): 122-127.

Abstract -- It is shown in this paper that the iterative algorithm introduced by Berlekamp
for decoding BCH codes actually provides a general solution to the problem of synthesizing the
shortest linear feedback shift register capable of generating the prescribed finite sequence of
digits. The equivalence of the decoding problem for BCH codes to a shift-register synthesis
problem is demonstrated...

This is probably in most textbooks on linear codes for encryption and/or error detection and
allows you to recreate the shortest LFSR (linear feedback shift register, a common component
of pseudo random number generators), given a sequence of digits. Of course nobody would
use a non-cryptographically secure PRNG like random() would they?

So you take the number, pass it through this algorithm, find the shortest LFSR, and the
decompressor just takes the LFSR initial state and reproduces the sequence according to the LFSR.

Then again, the shortest LFSR can be thought of as the linear complexity of the number and who
knows it might just accidentally compress the sequence the challenger gives you (not likely,
but who knows?)...
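
For the curious, here is a minimal Python sketch of the shift-register synthesis step over GF(2) (the Berlekamp-Massey algorithm). It is an illustration under my own conventions, not code from the paper: bits are a list of 0/1 ints, and the function names are made up.

```python
def berlekamp_massey(bits):
    """Return (L, c): the linear complexity L of `bits` over GF(2) and the
    connection polynomial c[0..L] (c[0] == 1), such that for every i >= L
    bits[i] == XOR over j in 1..L of (c[j] & bits[i - j])."""
    n = len(bits)
    c = [0] * (n + 1)  # current connection polynomial
    b = [0] * (n + 1)  # copy saved at the last length change
    c[0] = b[0] = 1
    L, m = 0, -1
    for i in range(n):
        # Discrepancy between the LFSR's prediction and the actual bit.
        d = bits[i]
        for j in range(1, L + 1):
            d ^= c[j] & bits[i - j]
        if d:
            t = c[:]
            shift = i - m
            for j in range(n + 1 - shift):
                c[j + shift] ^= b[j]
            if 2 * L <= i:
                L, m, b = i + 1 - L, i, t
    return L, c[:L + 1]


def lfsr_extend(seed, c, length):
    """Regenerate `length` bits from the first L bits and the polynomial c."""
    out = list(seed)
    L = len(c) - 1
    while len(out) < length:
        nxt = 0
        for j in range(1, L + 1):
            nxt ^= c[j] & out[-j]
        out.append(nxt)
    return out[:length]


if __name__ == "__main__":
    # Toy pseudorandom stream: s[i] = s[i-3] XOR s[i-4], seeded with 1,0,0,0.
    stream = lfsr_extend([1, 0, 0, 0], [1, 0, 0, 1, 1], 64)
    L, c = berlekamp_massey(stream)
    assert lfsr_extend(stream[:L], c, len(stream)) == stream
    print(L, c)
```

The "compressed" form is the L seed bits plus the L coefficients. For a truly random stream the linear complexity comes out at about half the length, so seed plus coefficients take roughly as many bits as the input and nothing is saved.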

Always remember... "stand on the backs of giants" whenever possible ;^)

Re:Compression by Genom · 2001-04-24 00:54 · Score: 2

ahh...but would THAT 1000 bit number be random, or compressible on its own?

Once you can drop 1000 bits from the file, you can play with the offset itself, which has a good chance of being compressible or at the very least expressible in a smaller format.

A variation which might work by stevelinton · 2001-04-23 20:33 · Score: 2

As several people have observed, the attacker in this case used his "split into multi-files" trick to hide the necessary information in the directory structure of the disk. This is just barely ruled out by the "one compressed file" rule. However, suppose you hid the data in the filename of one file. For example write a program like

echo $1 | cat - $1

(about 20 bytes) and use the first 21 bytes of the raw file as the name of the compressed file.

This is in the same spirit as this attempt (hide the data in the directory) but avoids breaking the one compressed file rule.

Steve

Re:Nice try but wrong by jamiemccarthy · 2001-04-23 21:28 · Score: 1

"any number can be defined as n=a^2 + b"

That doesn't get you any closer to winning the contest or overturning the laws of information theory.

Because squaring it gets you close to n, your variable a will have approximately half the bits of n. But there's no guarantee that b will be smaller than (the other) half of n's size. In fact, with large n, the probability is overwhelming that b will be larger than half n's size, so together a and b will occupy more bits than n.

I just tried this for fun with a 15-digit number I pecked in at random (482837578298375), and got an 8-digit a (21973565) and an 8-digit b (19489150), for 16 digits in total.
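
The experiment is easy to repeat. A minimal Python sketch, assuming a is taken as the integer square root of n (the function name is illustrative):

```python
import math


def square_plus_remainder(n: int):
    """Write n as a*a + b with a = floor(sqrt(n)); then 0 <= b <= 2*a."""
    a = math.isqrt(n)
    return a, n - a * a


if __name__ == "__main__":
    n = 482837578298375
    a, b = square_plus_remainder(n)
    print(a, b)
    # Compare the bits needed for n with the bits needed for a and b together.
    print(n.bit_length(), a.bit_length() + b.bit_length())
```

Since b can be as large as 2a, it typically needs about as many bits as a does, so the pair is no shorter than n.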

Because the gap between primes increases according to a simple formula, there's probably a simple proof of the average size of b, given n. But I don't feel like calculating it. :)

Jamie McCarthy

--

Jamie McCarthy
jamie.mccarthy.vg

Re:Nice try but wrong by jamiemccarthy · 2001-04-23 21:48 · Score: 1

I wrote:

"Because the gap between primes increases according to a simple formula"

Typo, for "primes" read "squares." Sigh.

Jamie McCarthy

--

Jamie McCarthy
jamie.mccarthy.vg

Re:Mike's right by markb · 2001-04-24 04:26 · Score: 1

"Sure" is not straightforward enough???

Re:There is no solution... by ocie · 2001-04-24 00:24 · Score: 1

How about this for a compression algorithm:

1) if the input is original.dat, output 1 byte 0xff

2) otherwise, output 1 byte 0x00, followed by the input

This algorithm doesn't work that well in general, but for compressing the given data file, it should be great.
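
A minimal Python sketch of the two rules above, assuming the special file's contents are available to both sides as a `special` argument (a stand-in for original.dat; the names are hypothetical):

```python
def compress(data: bytes, special: bytes) -> bytes:
    """Rule 1: the special file becomes the single byte 0xff.
    Rule 2: anything else becomes 0x00 followed by the input."""
    if data == special:
        return b"\xff"
    return b"\x00" + data


def decompress(blob: bytes, special: bytes) -> bytes:
    """Invert compress(); the decompressor must already hold `special`."""
    if blob == b"\xff":
        return special
    if blob[:1] == b"\x00":
        return blob[1:]
    raise ValueError("not produced by compress()")


if __name__ == "__main__":
    special = b"pretend this is original.dat"  # toy stand-in
    for sample in (special, b"", b"\xff", b"anything else"):
        assert decompress(compress(sample, special), special) == sample
    print(len(compress(special, special)), len(compress(b"abc", special)))
```

The catch is visible in the signature: the decompressor has to carry a full copy of the special file, so its own size is at least that of the original.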

--
JET Program: see Japan, meet intere

Re:There is no solution... by ocie · 2001-04-25 01:13 · Score: 1

As soon as I was able to find the mirror of the original site, I figured this out. Seems like there are all sorts of tricks like having a script say:

gunzip $1

gives you all the power of the gunzip executable (72kb on my system) for only 10 bytes. Also, is opening a network connection back to a web server that has the original data forbidden? :)

--
JET Program: see Japan, meet intere

Mike defined total file size himself by A+nonymous+Coward · 2001-04-23 21:37 · Score: 2

Patrick specified a file size, and Mike generated a file of that specified size. Mike did not include any OS filesystem overhead in his file size. It seems rather hypocritical of Mike to claim OS file system overhead counts for Patrick's files but not Mike's. Mike defined FILE SIZE, not DISK SPACE USED.

--

--
Infuriate left and right

Re:Mike defined total file size himself by MO! · 2001-04-24 00:48 · Score: 1

While not explicitly stated, those bytes ARE an integral part of the "decompressor" that was supplied (it can't work without them), and so including their size is consistent with the agreed upon rules, IMO.
That the decompressor relied on the file numbering to work does not make the numbering an "integral part" any more than the OS needed to run the decompression program is "integral".
As well, things such as sector size, blocking factor, filesystem paging size, etc. can cause, for example, a 1K file to take anywhere from 1K to (many) MB to store. For this reason, the filesystem utilization should not be considered in the validity of the challenge. The source file generated was a set byte size which did not factor in filesystem and storage overhead. Therefore, the "solution" presented did meet the stated requirements.

--
I AM, therefore I THINK!
Re:Mike defined total file size himself by ChadN · 2001-04-23 22:49 · Score: 3

However, the decompressor DOES depend on the filename numbering in order to work; it is part of the algorithm (and hence the "decompressor"), and those bytes (the bytes used to number the files) should be charged against the decompressor size. Then Patrick loses.

While not explicitly stated, those bytes ARE an integral part of the "decompressor" that was supplied (it can't work without them), and so including their size is consistent with the agreed upon rules, IMO.

--
"It's overkill, of course. But you can never have too much overkill." - Anonymous Slashdot Coward
Re:Mike defined total file size himself by spoon42 · 2001-04-24 02:38 · Score: 1

While not explicitly stated, those bytes ARE an integral part of the "decompressor" that was supplied (it can't work without them), and so including their size is consistent with the agreed upon rules, IMO.

IMO, too. Patrick said he wouldn't use "cheats" like saving information in command-line options, user input, or filenames:
It's probably also possible to meet the challenge with smaller file sizes by storing information in the filename of the compressed file or the decompressor, but I think most people would consider this to be cheating.

However, the number identifying each file and its place in relation to the others is *information*, stored in the filename, and is cheating by his own admission. He only gained one byte for each file by splitting it up as he did, and it would require at least one byte each to encode each file's position in the sequence. So he loses, as I see it. He fought the law (information theory) and the law won. ;-)

--
--- this comment is presented in WIDE SCREEN STEREO!!!
Re:Mike defined total file size himself by RedAlert99 · 2001-04-24 18:11 · Score: 1

Absolutely Right! I can't believe none of the other posts I read (many) mentioned this. I'm sure the file that the Challenge-maker created was exactly 3 MB (or whatever it was supposed to be), not 3 MB less the overhead for the file-storage system. How am I sure of this? Because if he were going to send a file that had less than 3 MB of data + the filename and whatnot, it *would* depend on what OS he was using, and the challengee specifically asked if it had to run on multiple platforms or not. The defense Mike made was that that information (like filenames) is part of the data, and so in total, it's not smaller after the "compression" routine. However, that means, by his own definition, he didn't send the right "size" file to Patrick. Patrick may have violated the spirit of the challenge, but Mike VALIDATED his violation by using the same *wrong* definition of "file size."

--
Cats know what you're thinking. They don't care, but they know.
Re:Mike defined total file size himself by bigwig10001 · 2001-04-24 00:35 · Score: 1

I agree with ChadN.

The filename and EOF flags are vital to the algorithm. While these pieces of information won't show up as part of the "file size", they will take up disk space.

Reducing the memory footprint is the whole point of the comp.compression group.
Re:Mike defined total file size himself by dinivin · 2001-04-23 23:25 · Score: 1

and so including their size is consistent with the agreed upon rules, IMO.

Why? Mike didn't say that the total files had to take up less room on disk, just that their total file size had to be less. Total the file sizes and what do you find? It's less than the file size of the original file. Patrick fulfilled the requirement. Period.

Dinivin
Re:Mike defined total file size himself by dinivin · 2001-04-24 02:19 · Score: 1

But reducing memory footprint obviously wasn't the whole point of the challenge. The point of the challenge was to make a fool of whoever accepted it. In the end, it's Mike who looks like the fool.

Dinivin

Re:Intriquing by Firehawk · 2001-04-24 08:00 · Score: 1

If you flip a coin 100 times yielding all heads (assuming a non-biased coin), the odds of heads on the next flip is 1/2.

I don't know about you, but if I came across a coin that flipped heads 100 times in a row, I'd strongly suspect a biased coin.

Re:Intriquing by llywrch · 2001-04-24 05:15 · Score: 2

> What I'm thinking of, is wouldn't it be possible to create a minimalist compression algorithm that tended to not affect the file size,
> but might deviate it in either direction by about 1% ?(enough to cover the overhead of the decompressor).

All purely random strings of numbers (e.g. pi, e or the square root of 2), tend to have short strings of digits that repeat randomly. I believe this behavior is called "clustering."

Assuming a large enough file, say 3MB in size, we could find up to 26 strings of two or more digits that repeat enough times so that the compressor would be a series of substitution commands. For example:

s/a/311/g
s/b/47/g
s/c/9724/g
...
etc.

That comes out to an average of 10 characters per line (including end-of-line character), add another 20 characters for overhead, & the decompressing script could be kept to less than 300 characters. As long as each substitution applied to more than 3 places in the original, then there would be a net decrease in total file size. And since we are looking at the 26 most common strings of digits, this inductively should be possible.
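Here is a minimal sketch of that substitution scheme, assuming the input is a string of random decimal digits and the 26 lowercase letters are free to use as codes. The helper names, the pattern width of 3 and the sample size are my own choices for illustration. It also reports one thing the character count hides: the output uses a 36-symbol alphabet instead of a 10-symbol one.

```python
import math
import random
import string
from collections import Counter


def best_substitutions(digits: str, width: int, letters: str) -> dict[str, str]:
    """Map each letter to one of the most common width-digit strings."""
    counts = Counter(digits[i:i + width] for i in range(len(digits) - width + 1))
    common = [pattern for pattern, _ in counts.most_common(len(letters))]
    return dict(zip(letters, common))


def compress(digits: str, table: dict[str, str]) -> str:
    """Replace each chosen digit pattern with its letter."""
    for letter, pattern in table.items():
        digits = digits.replace(pattern, letter)
    return digits


def decompress(text: str, table: dict[str, str]) -> str:
    """Same effect as running s/a/311/g, s/b/47/g, ... over the text."""
    for letter, pattern in table.items():
        text = text.replace(letter, pattern)
    return text


random.seed(7)  # fixed seed so the sketch is repeatable
digits = "".join(random.choice(string.digits) for _ in range(200_000))
table = best_substitutions(digits, width=3, letters=string.ascii_lowercase)
packed = compress(digits, table)

assert decompress(packed, table) == digits  # the round trip is lossless

chars_before, chars_after = len(digits), len(packed)
bits_before = chars_before * math.log2(10)  # tightly packed decimal digits
bits_after = chars_after * math.log2(36)    # 10 digits + 26 letters

print(f"characters: {chars_before} -> {chars_after}")
print(f"bits at a tight packing: {bits_before:.0f} -> {bits_after:.0f}")
```

The character count does go down, but only because a decimal digit stored as a text character wastes most of its byte. Measured against a tight packing of log2(10) bits per digit, the substituted text is larger. On a file of random bytes there are no spare symbols to use as codes in the first place.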

Hmm. I seem to have reinvented pkzip.

Geoff

--
I think I see a trend here. Maybe for them it really would be easier to muzzle the entire internet than to produce p

Re:Another tack... by scrytch · 2001-04-23 23:54 · Score: 2

> My question is, has anyone put any efforts towards seeing if other, larger pieces of data could be represented in this way?

Sure: any. It was just the decss source padded out to a prime. You could do the same for mozilla.
--

--
I've finally had it: until slashdot gets article moderation, I am not coming back.

*sigh* Here's why. by Jeremy+Lee · 2001-04-23 23:32 · Score: 1

Since a couple of people have asked the same thing, I'll have a go. My mathematics is reasonable, but I'm not a pro. IANAM.

PI isn't actually random, and will skew this analysis in ways that doubtlessly earned someone a PhD, but let's ignore that for the moment. Let's pretend PI is an infinite-length random bitstream. It doesn't matter much.

Being infinite in length, it will therefore (eventually) contain any finite length random bitstring that you care to look for.

The chance of finding a match at any location is one in 2 to the power of the length of the bitstring. So, the chance of finding a 100 bit match at any index is 1 in 2^100.

You will therefore find a match, ON AVERAGE, at a distance of about 2^99 bits into PI.

Hurrah! you say. We only need a 99 bit index to represent a 100 bit number. We've made a saving!

First, let's just diverge slightly and apply a sanity check for why this can't possibly be right. If that logic holds, then we can compress any N bit number to N-1 by looking for it in PI. Therefore, we can compress the 99 bit index to a 98 bit index, then to 97... down to 1 bit. Something's wrong.
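The same sanity check can be done by counting, which is the pigeonhole argument from the FAQ. A small sketch (the function name is mine): there are 2^N inputs of N bits, but only 2^N - 1 bit strings that are shorter, so at least one input has nowhere shorter to go.

```python
def strings_shorter_than(n_bits: int) -> int:
    """Number of distinct bit strings of length 0 .. n_bits - 1."""
    return sum(2 ** k for k in range(n_bits))  # equals 2**n_bits - 1


for n in (1, 8, 100):
    inputs = 2 ** n
    shorter = strings_shorter_than(n)
    print(f"N={n}: {inputs} inputs, {shorter} shorter outputs, "
          f"short by {inputs - shorter}")
```

And that count lumps together every shorter length. If all outputs had to be exactly N-1 bits, only half of the inputs could be served.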

The issue is that you get a 99 bit index ON AVERAGE. Sometimes it will be 101 bits. Sometimes it will be 98. There's even a small chance (exactly 1 in 2^100) it will be right at the beginning. So, you can't just have a fixed 99 bit block to store the index, you need another number to store the length of the index number.

Since the number "99" requires 7 bits of storage, your average case will be to represent a 100 bit string as a 106 bit index+length code.

So, actually, you end up expanding the data, if the law of averages holds.
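You can try this at a small scale. The sketch below is only a stand-in under loud assumptions: Python's random generator plays the part of PI's bits, N is 10 instead of 100, and the function name is made up. It scans one bitstream until every 10-bit pattern has turned up, then adds up what an index plus a length field would cost.

```python
import random


def first_positions(n_bits: int, rng: random.Random) -> dict[int, int]:
    """Scan one random bitstream until every n-bit pattern has appeared once.

    Returns {pattern: index of its first occurrence in the stream}.
    """
    mask = (1 << n_bits) - 1
    window = rng.getrandbits(n_bits)  # the first n bits of the stream
    index = 0
    seen = {window: 0}
    while len(seen) < (1 << n_bits):
        # Slide the window along by one freshly generated bit.
        window = ((window << 1) | rng.getrandbits(1)) & mask
        index += 1
        seen.setdefault(window, index)
    return seen


N = 10
positions = first_positions(N, random.Random(2001))

# Bits needed to write each index down (at least 1, even for index 0).
index_bits = [max(1, pos.bit_length()) for pos in positions.values()]
# Bits needed to say how long the index is, sized for the worst case.
length_field = max(index_bits).bit_length()

average = sum(index_bits) / len(index_bits) + length_field
shorter = sum(1 for bits in index_bits if bits < N)

print(f"patterns with an index shorter than {N} bits: {shorter} of {2 ** N}")
print(f"average cost with length field: {average:.2f} bits for {N} bits of data")
```

Every pattern gets a different index, so no more than half of them can land on an index shorter than N bits, whatever the stream looks like. That is the counting argument again.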

There's no way around this, no matter how clever you get.

If you say "we'll use a 99 bit block only if it fits, otherwise we'll just store the original 100 bit number" then you need at least a 1 bit switch to indicate this.. so on average, half will be 99+1 bit blocks, and the other half will be 100+1 bits. 100.5 bits per original 100 bit block.
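The arithmetic for that flag-bit scheme, as a small sketch. The function and parameter names are mine, and the 50/50 split is the assumption made in the paragraph above.

```python
from fractions import Fraction


def expected_bits(n_bits: int, p_fits: Fraction, saved: int = 1, flag: int = 1) -> Fraction:
    """Average output size when a flag bit selects 'short index' or 'raw block'."""
    fits = (n_bits - saved) + flag  # short index plus the flag bit
    falls_back = n_bits + flag      # original block plus the flag bit
    return p_fits * fits + (1 - p_fits) * falls_back


print(float(expected_bits(100, Fraction(1, 2))))  # 100.5
```

With a one-bit saving and a one-bit flag, the average only gets down to 100 bits if the short form fits every single time, and it never goes below that.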

No, Huffman coding the index length doesn't help either. But if you asked that question, you probably already knew the answer.

Of course, you might get lucky on individual cases. That's always a possibility when dealing with truly random data. But then, that's not compression, that's gambling.

If you really want to learn about this stuff, read Claude E. Shannon's work on Entropy and coding theory. (also Turing, Huffman, and Godel) Doing so will also give you the grounding necessary for thermodynamics and quantum computing.

--
Jeremy Lee | Orinoco

Re:*sigh* Here's why. by Jeremy+Lee · 2001-04-24 03:02 · Score: 1

> And hence *on average* you win the competition.

Only if you can devise a 1 bit decompression algorithm. Good luck.

--
Jeremy Lee | Orinoco
Re:*sigh* Here's why. by Jeremy+Lee · 2001-04-24 03:03 · Score: 1

No.

--
Jeremy Lee | Orinoco
Re:*sigh* Here's why. by Jeremy+Lee · 2001-04-24 03:18 · Score: 1

No.

I'll restate. The chance of matching a 100 bit string is so remote - you need to search through so much of PI - that to simply record WHERE in PI you found the match would require about 99 bits just to store a number that big.

Incidentally, 2^99 is about 6 x 10^29. If you actually try to compute 2^99 bits of PI, you'll be here long past when the sun burns out. You have to do that both during the compression and the decompression steps BTW. But I digress...

Any number you store requires at least a little metadata about it - its length, usually - which will require more than 1 bit. Thus, there is no saving after all.

Bottom line. You cannot compress truly random data. That's the definition. If you can compress it, it wasn't random.

--
Jeremy Lee | Orinoco
Re:*sigh* Here's why. by clare-ents · 2001-04-23 23:50 · Score: 2

"
You will therefore find a match, ON AVERAGE, at a distance of about 2^99 bits into PI
"

And hence *on average* you win the competition.

In n entries you will win n/2 times and lose n/2 times - admittedly you will tend to expand the data - the amount the data expands on average in the losing cases is greater than the average saving in the winning cases. However, that doesn't matter here. Providing we can shrink the data in better than 1 in every 50 attempts we will end up making a profit.

If the guy wanted to have a winning money source he should have made the prize $199.
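The betting arithmetic, sketched with the figures from this thread ($100 to enter, $5000 prize). The function name is mine, and the win probability is whatever you believe it to be.

```python
from fractions import Fraction


def expected_profit(p_win: Fraction, prize: int = 5000, fee: int = 100) -> Fraction:
    """Contestant's average gain per entry: the fee is paid every time."""
    return p_win * prize - fee


print(expected_profit(Fraction(1, 50)))            # break-even at 1 in 50
print(expected_profit(Fraction(1, 2)))             # a coin-flip chance at $5000
print(expected_profit(Fraction(1, 2), prize=199))  # the same chance at $199
```

At 1 in 50 the bet is exactly break-even. With a $199 prize, even a coin-flip chance of winning loses the contestant fifty cents per entry on average.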

--
Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. (Einstein)
Re:*sigh* Here's why. by Drakantus · 2001-04-23 23:57 · Score: 1

Interesting stuff. But what if instead of pi, it ALSO used primes. You could encode various strings in sets of three numbers. First number indicates the prime, 0 would be used for pi. Second is the starting bit, third the ending bit. Such as 99,21,99 would mean bits 21-99 of the 99th prime number. There would be cases of course where you couldn't save any space, but wouldn't it increase the "average" by having more places to look?

--
I love going down to the elementary school, watching all the kids jump and shout, but they dont know I'm using blanks.
Re:*sigh* Here's why. by Drakantus · 2001-04-24 00:07 · Score: 1

If I understand correctly, there is some reasonable chance that a 99bit number can be found early enough in pi to really "save" space. Couldn't the algorithm simply look for different strings if the first one doesn't actually save space? Okay, the 99 bit sequence may happen to be at a bit requiring 99 bits to record, so it doesn't save space. Instead the program searches for a 98bit sequence, and so on until it finds one that results in an actual savings. Or it could even search for OTHER 99 bit sequences within the file, instead of the first which didn't save space. By the law of averages isn't it pretty close to 100% likely that one possible sequence within the random data is somewhere in pi which can be described in fewer bits?

--
I love going down to the elementary school, watching all the kids jump and shout, but they dont know I'm using blanks.
Re:*sigh* Here's why. by Drakantus · 2001-04-24 04:28 · Score: 1

Your bottom line sucks. You can most certainly compress some random data, because the set of all random possibilities includes every number ever written, a few of which have already been compressed. The definition of random data is not "data that can't be compressed". Besides, you are the one who decided to make an example with a 99bit string. Just for the record, I could easily be wrong about how often random strings can be found in other numbers, but not for the reason you gave.

--
I love going down to the elementary school, watching all the kids jump and shout, but they dont know I'm using blanks.
Re:*sigh* Here's why. by matrix29 · 2001-04-24 08:53 · Score: 1

The problem is in the average.

Higher levels of randomness above 50% (as ordered files are much below the 10% random mark) along the area of 90% random should be easily compressible. The trick is in finding something SIMILAR to the original random generation function. Any pseudo-random intersection of SQRT(prime number) and Pi should be close enough to get a space-saving gain on location bit-strings (otherwise we couldn't have irrational numbers - think about it). The big problem is TIME. These functions waste huge amounts of time & computation for something with no practical worth (except the $5000).

The implications of the counting theory being absolute extend to physics, quantum theory, pointlessly delving into Pi, and the structure of the universe. Try not to overlook the big picture for small-scale pigeon holes.

--
"Face it, a nation that maintains a 72% approval rating on George W. Bush is a nation with a very loose grip on reality.
Re:*sigh* Here's why. by Guy+Smith · 2001-04-24 04:08 · Score: 1

Well, no. 5,000 is actually a bargain, given that we're measuring file lengths in BYTES instead of bits. In order to technically compress the data, you have to reduce the file length by EIGHT bits, not one. With truly random data, you'll only do that 1/256 of the time. Hence, the break-even prize offer is 25,600 dollars. So 5,000 is really lucrative. And then add in the fact that the nature of the contest requires ADDING data to the original file; the decompressor size is also counted. So in order to profit, you have to be able to reduce the file size by 8 bits + the bitlength of the decompressor. The odds of winning are extremely remote.
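A quick check of those numbers, as a sketch under stated assumptions: file lengths come in whole bytes, the decompressor size is ignored for now, and the $100 entry fee is the one from the challenge. The function names are mine.

```python
from fractions import Fraction


def max_fraction_shrinkable(n_bytes: int) -> Fraction:
    """Upper bound on the share of n-byte files that can map to a shorter file."""
    shorter_files = sum(256 ** k for k in range(n_bytes))  # lengths 0 .. n-1
    return Fraction(shorter_files, 256 ** n_bytes)


def break_even_prize(p_win: Fraction, fee: int = 100) -> Fraction:
    """Prize at which the contestant's expected profit is zero."""
    return fee / p_win


bound = max_fraction_shrinkable(4)
print(f"at most {float(bound):.6f} of 4-byte files can shrink (1/256 = {1 / 256:.6f})")
print(f"break-even prize at 1/256: ${break_even_prize(Fraction(1, 256))}")
```

Counting every shorter length, the bound comes out just under 1/255, so the 1/256 figure is about right. Requiring room for the decompressor as well only makes the odds worse.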

Re:PAY UP MIKE by edhall · 2001-04-24 05:40 · Score: 2

If you want to be picky, Patrick utterly failed the challenge by not supplying compressed files. They were simply unaltered pieces of the original file. Sure, there were missing bits of the files that had been moved to the file names, but the fact that this sleight of hand failed to compress the data can be proved simply by renaming the files. Even if the order of files supplied to the "decompressor" is preserved, renaming them renders the original data unrecoverable. Sure, moving data into the filename is a cute trick, but it isn't compression.

The goal of the challenge is to produce compression, not to win via some semantic shell game. It's Patrick who shed his honor by resorting to a semantic smoke screen and attempting to win by wordplay with no intention of actually producing any real compression.

-Ed

Re:The page has been removed by GeoCities by Geek+In+Training · 2001-04-24 04:53 · Score: 1

My understanding is that GeoCities IMMEDIATELY shuts down pages that induce X hits per hour, because they are likely porn/warez/MP3 sites.

They THEN do investigation, and re-enable TOS-compliant sites.

Kind of sad that they have to do this...

--
SlashSigTheorem: Humorous, Political, Critical, Constructive- If you have a .sig, someone WILL complai

Re:Compression by ethereal · 2001-04-23 21:50 · Score: 1

Although you submitted it as plain text, the browser might still interpret it as a tag. /. doesn't seem to insert [pre] tags around plain old text postings, I'm not sure why.

--

Your right to not believe: Americans United for Separation of Church and

Re:According to the rules... by ethereal · 2001-04-23 22:13 · Score: 1

Which article were you reading? He had 218 compressed files! (Admittedly without real-world compression, but within the rules of the contest there was compression.)

--

Your right to not believe: Americans United for Separation of Church and

Re:Petty by ethereal · 2001-04-23 22:17 · Score: 1

I think the moral is that if you think you're so smart, and you want to set up a public challenge to prove how smart you are and/or how inviolable a certain concept is, you'd better be smart enough to word the challenge so as to prevent fairly obvious technicalities/loopholes in your challenge. This wasn't a brand-new loophole, Mike even knew about it from the FAQ to begin with and he still couldn't phrase a challenge that avoided this problem.

If you're going to put your money where your mouth is, make sure you know what you're talking about.

--

Your right to not believe: Americans United for Separation of Church and

Re:Some things can be taken for granted. by ethereal · 2001-04-23 22:42 · Score: 1

The stated metric of compression was "file size", as reported by 'wc -b', not "disk space". This metric was used by both challenger and challengee. The challenge wasn't intended to cause people to write better code, because the provider of the challenge knew that the problem was unsolvable. The only purpose of the challenge was to provide a bit of sport for the elite of the compression usenet world. If the whole thing was planned around "let's laugh at people who think they've found the Philosopher's Stone", then I don't see how turning the tables on the so-called "experts" is really contrary to the spirit of the contest. Either way, somebody thinks they've got a sure thing and that there's no way the other side can win.

I understand the argument that you're making that certain things should be able to be taken for granted in such challenges, but I don't think it's a worthwhile challenge if you don't specify everything up front and then as people solve the problem you say to them "no, you can't do it that way, everybody knows that's just cheating". The provider of the challenge supposedly had much more of a background in the topic of compression and should have been able to easily word a challenge without loopholes (just a reference to avoiding techniques which had been proved to fail from the FAQ would have been sufficient, for example).

Since Mike was willing to put up money based on a challenge which didn't really prove what he thought it did, he should hand over the money to Patrick for providing a worthwhile education in specifying your problem clearly.

I suspect Patrick knew about this hole all along, but chose to go through the challenge to find out if Mike would really admit the error in the specification of the challenge, or if he would try to back out. I only wish I'd thought of it first!

--

Your right to not believe: Americans United for Separation of Church and

Re:I have an idea... by Sloppy · 2001-04-23 23:46 · Score: 1

Yes, but the offset number probably has more digits than the length of the file you're trying to compress.

---

--
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.

Good deal for contestant by Sloppy · 2001-04-23 21:13 · Score: 2

The payoff matrix is highly in favor of the contestant. You can't just assume that a random stream won't compress. Sure, it might not, but your odds are about even, not nearly as bad as a 50-to-1 shot.

---

--
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.

Re:Good deal for contestant by dasmegabyte · 2001-04-23 22:02 · Score: 2

Correct. He no doubt weeds his files for the normal patterns of random similarity, because even ONE of these would make the contest a farce (you could replace the bytes very easily). But there must be other alternatives to simple dictionary replacement, alternatives like dictionary on a bit shift (you're guaranteed to find some stuff that way, though the compression will be slow as hell), fractal decompression (match a sequence of the random to some buildable equation and save that equation) or possibly compression of a matrix of the file itself. Of course, for each of these methods I mention, Mike has probably generated a file that's taken these into account, and that's the point of this: he wants to see the magical next step in compression, which doesn't rely on these simple notions that things are repeated or can be rerepresented. There's some elusive technology he wants us to uncover, and hence the $5000 (and why he wouldn't pay up on that sham of a compression scheme Pat generated...converting "5" to an EOF does not constitute compressing.)

--
Hey freaks: now you're ju

*Read* the article before you post. by arcade · 2001-04-23 22:17 · Score: 2

Read the damn article before you post - he didn't use 'gunzip' in his proposition.

--

--
"Rune Kristian Viken" - http://www.nwo.no - arca

Re:*Read* the article before you post. by weis3w3 · 2001-04-24 01:46 · Score: 1

Can't read it - yahoo geocities says it's unavailable.

--
-- We all get heavier as we get older because there's a lot more information in our heads.

Re:The page has been removed by GeoCities by Skapare · 2001-04-24 01:11 · Score: 2

Their 486 got overloaded

--
now we need to go OSS in diesel cars

Re:Almost. by th0m · 2001-04-23 20:14 · Score: 1

in this case it wasn't entirely random, since, although the file was sourced from random.org, goldman got the chance to review it before delivery to craig, which (presumably) gave him the chance to check that he didn't fortuitously generate three meg of pi or the decss source - or any of the number of less obvious cases which may have inadvertently yielded the very slight compressibility required to win the challenge.

--