How I Completed The $5000 Compression Challenge
Patrick Craig points to this page which details (in email) the strange life of an online challenge as posed, clarified and (at least to some eyes) met successfully. This should give pause to anyone willing to pose -- or accept -- seemingly impossible problems. In short, Patrick's not $5,000 richer, but you may think he deserves to be.
which isn't true because if this were true every number would be a random number (because every number could potentially be generated by this number-generator).
But every number CAN potentially be produced by a random number generator. The hypothetical file full of zeros doesn't violate this because you only know you got all 0s after the fact. Watching the bitstream, each bit that came out had a 50% probability of being a 1, it just happens that none of them were over that particular sample period.
Thus, given 2000 (or any arbitarry number) if 1 Mbit samples (also arbitrary) from a true random number generator,upon examination, one of them MAY contain all 0s. You still have no way to predict the contents of the others (the all 0s file has exactly the same probability as any other arrangement of bits), and you have no basis to predict the content of the remaining files.
If you flip a coin 100 times yielding all heads (assuming a non-biased coin), the odds of heads on the next flip is 1/2.
It sounds to me like the person offering the $5000 was scamming everyone because it's theoretically impossible to win.
The only place I've seen that challenge is in the compression FAQ. It's in the same section of the FAQ that explains the impossability. It's hard to call something a scam when the 'brochure' explains that you will certainly loose your money.
It's even more understandable when you consider that the USPTO has issued several IMPOSSIBLE patents for compressors that work on random data (recursivly no less)!
I don't know about you, but if I came across a coin that flipped heads 100 times in a row, I'd strongly suspect a biased coin.
Given no other information, I would suspect the same. That's why I had to specify. Replace coin with random bit generator if you prefer.
Although, by both providing a certain file AND specifying the file size (3145728 bytes), one could argue that he *implicitly* defined file size as the number of bytes within the file. Hence, he loses.
However, he was also asking for $100 up front, to encourage only serious submissions. I don't think he'd get many submissions if he tried both tactics. :)
There is *at least* one sequence that cannot, by an arbitrary method, be reduced to a shorter length. On top of that, you need to use some bits for the compressor.
OTOH, if you can include "external" information in your compressor (either the algorithm or a function), you can have an algorithm that will compress any stream, and compress your favorite stream to a single bit. As an example, the first bit is 1 (and the only bit) if it is your stream, and 0 followed by the entire stream for any other stream.
hawk
Nope, you don't have to record it, simply attach the part of the original file following the block of zeros to your decompressor program. The "compressed file" is the part of the original file preceding the block of zeros. Your decompressor prints out the contents of the compressed file, then 1024 zeros, then it reads from itself the remainder of the file and prints that out.
The bet can be won; not in every instance of course, but in the long run.
--
In that case, the outcome of the bet clearly depends on the machine model. If the machine model is "complete Debian Linux running on Pentium", I would conjecture that the bet can't be won, but it is far from clear (obviously, a succinct powerful language is needed, maybe A+). On many machine models, the bet can be trivially won.
--
I'd gladly stake large amounts of money on this, though it would feel like I'm running a scam.
If you allow me to specify the machine model ahead of time, you'd lose. This was observed by Anonymous Coward in article 466.
--
Good point, the proposed machine model is not Turing complete and you can analyze it ahead of time.
If I remember my Kolmogorov complexity right, the probability that a random file can be generated by a program that's n bits shorter is roughly a*b^(-n), where a and b depend on the (Turing complete) machine model. If the model is chosen wisely to make the constants small, it should be possible to compress more than 1/50 of files by one bit. But it's not nearly as simple as I thought.
--
It's a*2^(-n), sorry.
--
...but Yahoo! seems to have taken the site down. Thank you, Yahoo!
Stating on Slashdot that I like cheese since 1997.
You are correct, however, in saying that -any- sequence can be generated by a random number generator. However, there is no general solution to the compression of a sequence S into a smaller sequence C, where a unique 1:1 mapping exists.
HAVING SAID THAT, if you compressed S into C, and could manually determine which of the reverse mappings was S, you -would- be able to "compress" generic random data of any size and of any degree of randomness.
(This is because the user posesses some additional information, I, which is therefore redundant and can be ignored, for the purpose of the compression.)
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Further, you can format tracks down to 128 bytes per sector, leaving an average of 64 bytes of dead space per file. That's 13,592 bytes of dead space, if we store the files "conventionally".
If, however, we store all the files in a single indexed file, there's no dead space to account for.
The filenames could be considered another issue. However, again, DOS' root directory is fixed-size and therefore file entries within it take no additional space.
Lastly, it doesn't matter if the guy "admits" to exploiting vague wording. Within the wording of the rules as given, he won. Within the intent of the rules, as given, he lost.
Ergo, BOTH people have a claim to "victory". They can either split the purse, or they can shake hands and donate it to some worthy cause.
In the end, this comes down to something I've been asked all too often myself. If you had to choose, would you rather be happy or right?
IMHO, the guys should cut out the righteous act, and stick with being happy. The contestant that he "defeated" the challange, and the guy who set the contest up, that he achieved his goal of reducing the snake oil surplus.
Let's face it, those are worthy achievements in themselves. Be satisfied! The money is just causing grief. Neither probably needs it. So why not clear the air, and give the cash to someone who does?!
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Hex disk editors are fun tools. You might want to try them, some day, when you're finished trolling and want to get some work done.
The dead space (NOBODY calls it "slack space"! Slack is what people do on Fridays) is a function of sector size, and you can define whatever sector size you like. (The defaults are the "sensible" geometry, but by no means the only one.)
I suggest you also get Peter Norton's books on MSDOS, and print out a copy of the Interrupt listings, which you can find on the Internet. The Norton Guides are extremely handy, too. Oh, and you'll want a copy of Flight Simulator 2.0, which came on a 5.25" floppy. The "copy protection" scheme is to use 1 Kb sectors, which not only stopped DOS reading them directly, it also gave the disk about 256 extra bytes per track.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
He should have used Gregory Chaitin's Omega number to generate the challenge file.
Actually I really don't understand Chaitin's work well enough to know if that would have saved him the $5000, but at least he (and the challenger) would have learned something about algorithmic complexity theory.
Zooko
As people have already mentioned, the numbers are probably not pseudo-random as they were
;^)
retrieved from random.org...
However, if someone decides to be cute and repeat this challenge with a pseudorandom sequence,
you don't need to be a good mathmetician, you just need to know how to read a paper written
by a good mathmetician... Look for the paper...
Massey, J. 1969. Shift-Register Synthesis and BCH Decoding. IEEE Transactions on Information Theory.
IT-15(1): 122-127.
Abstract -- It is shown in this paper that the iterative algorithm introduced by Berlekamp
for decoding BCH codes actually provides a general solution to the problem of synthesizing the
shortest linear feedback shift register capable of generating the prescribed finite sequence of
digits. The equivalence of the decoding problem for BCH codes to a shift-register synthesis
problem is demonstrated...
This is probably in most textbooks on linear codes for encryption and/or error detection and
allows you to recreate the shortest LFSR (linear feedback shift register, a common component
of pseudo random number generators), given a sequence of digits. Of course nobody would
use a non-cryptographically secure PRNG like random() would they?
So you take the number, pass it through this algorithm, find the shortest LFSR, and the
decompressor just takes the LFSR initial state and reproduces the sequence according to the LFSR.
Then again, the shortest LFSR can be thought of as the linear complexity of the number and who
knows it might just accidentally compress the sequence the challenger gives you (not likely,
but who knows?)...
Always remember... "stand on the backs of giants" whenever possible
ahh...but would THAT 1000 bit number be random, or compressible on it's own?
Once you can drop 1000 bits from the file, you can play with the offset itself, which has a good chance of being compressible or at the very least expressible in a smaller format.
As several people have observed, the attacker in this case used his "split into multi-files" trick to hide the necessary information in the directory structure of the disk. This is just barely ruled out by the "one compressed file" rule. However, suppose you hid the data in the filename of one file. For example write a program like
echo $1 | cat - $1
(about 20 bytes) and use the first 21 bytes of the raw file as the name of the compressed file.
This is in the same spirit as this attempt (hide the data in the directory) but avoids breaking the one compressed file rule.
Steve
In this sort of contest, it's traditional that finding ingenious ways to exploit loopholes in the rules is a perfectly valid way to win. Then the rules are tightened up if ever the contest is re-run.
Mike Goldman didn't make it clear what 'total size' meant - I think most programmers if you asked them would assume the sum of file sizes, if you didn't specify otherwise. His fault, I feel. He should pay the $5000 and write the rules more carefully next time.
BTW, if you do assume that 'total size' is just the sum of file sizes, you can compress anything down to zero bytes. Number each possible sequence of binary digits - the empty string is zero, '0' is one, '1' is two, '00' is three, '01' is four, '10' is five, '11' is six, and so on. Then just create the required number of zero-length files.
-- Ed Avis ed@membled.com
Patrick specified a file size, and Mike generated a file of that specified size. Mike did not include any OS filesystem overhead in his file size. It seems rather hypocritial of Mike to claim OS file system overhead counts for Patrick's files but not Mike's. Mike defined FILE SIZE, not DISK SPACE USED.
--
Infuriate left and right
> What I'm thinking of, is wouldn't it be possible to create a minimalist compression algorithm that tended to not affect the file size,
> but might deviate it in either direction by about 1% ?(enough to cover the overhead of the decompressor).
All purely random strings of numbers (e.g. pi, e or the square root of 2), tend to have short strings of digits that repeat randomly. I believe this behavior is called ``clustering."
Assuming a large enough file, say 3MB in size, we could find up to 26 strings of two or more digits that repeat enough times so that the compressor would be a series of substitution commands. For example:
s/a/311/g
s/b/47/g
s/c/9724/g
...
etc.
That comes out to an average of 10 characters per line (including end-of-line character), add another 20 characters for overhead, & the decompressing script could be kept to less than 300 characters. As long as each substitution applied to more than 3 places in the original, then there would be net decrease in total file. And since we are looking at the 26 most common strings of digits, this inductively should be possible.
Hmm. I seem to have reinvented pkzip.
Geoff
I think I see a trend here. Maybe for them it really would be easier to muzzle the entire internet than to produce p
> My question is, has anyone put any efforts towards seeing if other, larger pieces of data could be represented in this way?
Sure: any. It was just the decss source padded out to a prime. You could do the same for mozilla.
--
I've finally had it: until slashdot gets article moderation, I am not coming back.
If you want to be picky, Patrick utterly failed the challenge by not supplying compressed files. They were simply unaltered pieces of the original file. Sure, there were missing bits of the files that had been moved to the file names, but the fact that this slight of hand failed to compress the data can be proved simply by renaming the files. Even if the order of files supplid to the "decompressor" is preserved, renaming them renders the original data unrecoverable. Sure, moving data into the filename is a cute trick, but it isn't compression.
The goal of the challenge is to produce compression, not to win via some semantic shell game. It's Patrick who shed his honor by resorting to a semantic smoke screen and attempting to win by wordplay with no intention of actually producing any real compression.
The payoff matrix is highly in favor of the contestant. You can't just assume that a random stream won't compress. Sure, it might not, but your odds are about even, not nearly as bad as a 50-to-1 shot.
---
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
Read the damn article before you post - he didn't use 'gunzip' in his proposition.
--
"Rune Kristian Viken" - http://www.nwo.no - arca
Their 486 got overloaded
now we need to go OSS in diesel cars
there's no scamming involved. goldman required the (suitably small) entry fee to deter time-wasters; craig wasn't after the money in the first place. they were both just out to prove a point (goldman, that truly random data can't be compressed in the general case; craig, that goldman's challenge was amusingly open to abuse), not to make any money.
-- in china, chinese food is just called food.
You could scan the file for the smallest possible number whose binary representation isn't contained in the file,
i.e. let's say 1001010111001010101 (I know this number is too short) is not there replace the 1000 zero-bits with that number.
But I fear now it only get's more difficult to calculate the probabilties of a "win", but the outcome will stay the same.
No way...the guy didn't compress shit. All he did was some data shifting. If the guy had read the FAQ, he would know that such tricks are not compression...you mention an MSDOS FAT filesystem. Even on a FAT filesystem, the 218 (or whatever) files and the decompressor would have taken up more space than the original file due to slack space...although the amount would vary from hard drive to hard drive, if you figure an average of 512 bytes of slack space per file thats 111,616 bytes of slack space, far more space than Patrick claims to have "saved."
And anyways, the guy basically admits that he cheated by exploiting the vague wording of the challenge.
I've got to hand it to Patrick, though, because if you do take the challenge out of context, he basically won...it depends on what the meaning of the words "compressor" and "decompressor" mean.
My journal has hot
The 218 files would have taken no additional space on the FAT, as it's fixed size.
Further, you can format tracks down to 128 bytes per sector, leaving an average of 64 bytes of dead space per file. That's 13,592 bytes of dead space, if we store the files "conventionally".
I never said that it would take additional FAT space, I'm talking slack space...if you knew anything about the FAT filesystem (which you obviously don't), you would know that its not sectors you have to worry about, it's clusters (GROUPS of sectors). there are only so many clusters because of FAT's fixed size... I was being QUITE generous in saying that 512 bytes is a practical average. On a 1.2 GB MS-DOS formatted drive (not FAT32, mind you, we're talking DOS here, not Windows), the cluster size ends up being something like 16K, leaving the average slack space to be 8K. so I was being VERY generous there.
My journal has hot
The hard drive platters were free. The coffee can lids might have worked better though because they are lighter in weight, although they wouldn't have been as rigid, so I'm not sure how well they would have supported the weight of the vehicle...
:)
I guess I could try it, though.
My journal has hot
Again, that depends on what you mean by "paper airplane." Admittedly, I won a similar contest involving rubber-band driven cars. The contest rules said that CDs or DVDs could not be used for tires due to the fact that they make for very narrow tires, thus having reduced friction. So I used hard drive platters. :-)
My journal has hot
He never said it was okay to deliver a whole bunch of 'compressed' files. He said ONE decompressor, and ONE compressed file. Though Patrick asked about this, all the guy did was restate his terms.
What Patrick did was take a small bit of information out of the file, and, rather than having a table of 'offsets' telling his decompressor where to insert 5's, he broke it up into a bunch of files, effectively using the directory/inote tables to store this information.
Had he not insisited on a single file, I'd say Patrick here would have technically won, though not at all in the spirit of the competition.
The data has to come from the guy accepting the challenge; you get to tell him how much data to compress, he gives you the data.
The challenger should have made it clear that these costs would be counted against the reconstructed filesize (or simply not allowed any deviation from the "one file" rule) The decompressor should operate by having all the files piped into stdin, thus eliminating filesystem state info entirely.
"It's overkill, of course. But you can never have too much overkill." - Anonymous Slashdot Coward
I've heard this idea conjectured a lot by various people; but although the mandelbrot set is in some sense infinite, the set of images that can be produced has bounds. Basically, choosing any fractal, and sampling at any resolution, with any slice through the set, still won't produce the entire possible set of data that could be represented in the finite space that you've chosen. Due to the sampling, you would repeat some data and never produce others. In fact, the self-similarity properties of fractals would seem to indicate that this would be a rather poor (or inefficient) way to represent general data.
"It's overkill, of course. But you can never have too much overkill." - Anonymous Slashdot Coward
In order to meet the challenge you must provide to me a compressed file and a decompressor which together total less than the size of this original.dat, and which can generate a file identical to original.dat.
He doesn't say "file size" in this or any other passage. It is Patrick who says (and assumes) this.
Also, Patrick specifically addresses the issue of filename based compression in this passage.
It's probably also possible to meet the challenge with smaller file sizes by storing information in the filename of the compressed file or the decompressor, but I think most people would consider this to be cheating.
Thus, I think it is Patrick who hadn't considered the fact that his "decompressor" is not only just the 'dfi' script, but also the filename information used to indicate file boundaries and ordering. Like I said, his algorithm (much less his actual implementation) could not properly reassemble the file if it were fed all the properly ordered "compressed" files as an input stream (through a pipe, say), since it wouldn't know where to place the missing fives. So given that Mike said "total size", in all cases, when discussing and offering the challenge, I think his position is safe.
Now, I do think that Patrick was reasonably clever, and that Mike was overly smug in both the challenge and the initial news posting. And in the exchanges Patrick did explictly ask about file sizes, but I do not fault Mike for not clarifying further, since he had no a priori knowledge of the scheme, and therefore didn't know he would have to be so specific. But his use of terminology was consistent, and I think, appropriate, to protect against this type of scheme.
Anyway, that's just my take on it. Mike can hopefully learn some humility from this, and Patrick can learn that his "decompressor" is more than just the sum of its file bytes.
"It's overkill, of course. But you can never have too much overkill." - Anonymous Slashdot Coward
As soon as Mike accepted Patrick's money, it was no longer a rhetorical challenge. It was a bet. Patrick, at that point, had caught Mike proffering a reverse sucker bet. Mike says "I bet you $100 at 50:1 odds that you can't do X." Patrick, confident that he could do X, accepted the bet and then did X. Mike says "Oh, I really meant for you to do X.1. I only tricked you into trying to do X because it's impossible." Patrick, in a really amazing display of good sportsmanship, declined to have his money refunded. (Which Mike wisely proffered...otherwise, he'd be in serious danger of fraud...)
My momma told me never to take a sucker bet. Mike got caught by his OWN sucker bet...which takes a) greed and b) foolishness. He got lucky that the person who took him was such a good sport about it.
It'd be really nice to see Mike donate the money to ACM or EFF or somebody like that...that'd be a pretty classy way of taking care of this.
Why yes, I AM a rocket scientist!
I'm not necessarily arguing a point of law, it's more of a point of honor. Wagers are well-understood in many cultures, and in this case it seems clear that Mike proffered an arguably dishonest wager (a bet where he challenged people to do something that was not possible to do to his satisfaction). Somebody hoist Mike by his own petard. Good for Patrick.
Why yes, I AM a rocket scientist!
Yeah, my exceptionally slow compressor counts the number of "one" bits and the number of "zero" bits. It stores those two numbers, along with several position-dependent checksums. Compression is very fast.
;)
Decompression can take roughly the lifetime of the universe (for large files).
Interestingly, my brilliant strategy is also a encryption algorithm
bukra fil mish mish
-
Monitor the Web, or Track your site!
Eloi, Eloi, lema sabachtani?
www.fogbound.net
(should have said "slow de-compressor" but you get the picture)
bukra fil mish mish
-
Monitor the Web, or Track your site!
Eloi, Eloi, lema sabachtani?
www.fogbound.net
Lzip should have gotten $5000 for being the best compression program.
----------
----------
Check out my blackbox styles
Too bad the Python doesn't get executed.
:-)*
Maybe in Squishdot, as long as rexec is used, of course. You could import StructuredText from Zope and your post could act like a Wiki.
- *this* would look like this
- **this** would look like this
The above would become an actual HTML bulleted list.
http://this.would.be/a/link
|| Instant Table ||
|| 1 || 2 ||
Ah, well.
I think that a person's stance on this issue, of who was right and who was wrong, will indicate that same person's approval of lawyers and much of the practice of the law today.
Data point: The corolation for me is: modern US law as practiced is harmful, and anti-productive; this contest was clearly stated, and clearly won. Mike should pay up.
But it was clear. The contest was in terms of total file size of the compressed output and decompressor program.
The proviso given (which should not have been) was that the compressed file could be provided in multiple parts. What he forgot is that an ordered sequence of files has more information than just the sum of their contents. Ooops.
This is not a wording problem, but one of a poorly thought out bet. This guy gave 1:50 odds that no one could compress truely random data. He chose to allow conditions that made the bet doable, and then did not want to pay up when the bet was won.
Actually, even without using multiple files, I think the bet's doable for any single given data file (though you cannot write a generalized compressor for all truely random data). I should come up with the submission to solve for this guy's challenge.
Sky, an extraordinary proposition gambler, responded by countering. He covered the offeror's eyes and bet he couldn't tell what color was Sky's tie. "No bet."
I say this to point out the difference between sucker betting and proposition gambling. It is a sucker bet to offer a bet challenging a mark to do something that is, in fact, impossible. In proposition gambling, the offeror gives the mark a proposition they are unlikely to win, or that is worded in such a clever way that the winning of it leaves the mark too embarassed, entertained or impressed by the cleverness to attempt to challenge having lost.
In this case, the first offer (compressing high-entropy data) was a sucker bet. Shame on him. The response, doing the sucker bet using multiple files, was a sweet proposition gamble. Shame on him. Once taken, the offeror-now-mark did weasel, but as I see it, ultimately welched.
I admit that I, too, didn't consider the import of the multiple files on first reading (but, hey, I didn't have $5K against $100 riding on the result).
At the end of the day, the sucker-bettor was the mark, and he had an earful of cider.
No sympathy.
It is my understanding of the situation that Patrick "solved" the challenge by exploiting the fact that multiple files store more implicit data in the filesystem than a single file does. I believe the only such data he used was the EOF.
This is considered "cheating" because he used 220 EOFs that are not counted in the filesize.
So, my question is this. If a contestant submits a single compressed file and a single decompressor file which are just 1 byte under the original file, does Mike void the entry because of the extra filesystem space of two files? If not, why not? If so, shouldn't the rules say that specifically, as two is the minimum total number of files one can submit?
--
gnfnrf
Its an intriquing problem. I certainly wouldn't stake $5000 on nobody being able to compress the data I supplied, good random number generator or not. Why? because while the concept that random data in general cannot be compressed is correct, there are no patterns, it is not necessarily true that a given set of random data cannot be compressed.
For example, a file full of 0's, is perfectly possible from a true random number generator, and just as likely as any other set of numbers. As any good geek knows, a file full of 0's compresses very very well.
In this case, I think the challenge has one major problem, and that is that there is no real time limit on the entry, and no requirement that the decompressor work effectively in general, it can be entirely specialised.
This means that the problem *may* be solvable, simply by throwing ungodly amounts of CPU power at it. With the appropriate code, it would be possible to simply hunt for unintentional patterns in the data and hope. If you're lucky, there will be enough indexable patterns present that it will outweigh the cost of the indexing, and the code necessary to replace the index points.
If you're unlucky, the data will have no patterns in the space you searched, and therefore you're out of luck.
This is why the guy noted that if he got a big enough file, he could win the competition. As the filesize increases, it becomes increasingly likely that there are, accidentally, enough patterns in the data to cover the space required for the decompressor and the dictionary. Of course, as the filesize increases, you also have to burn considerably more CPU in order to locate the patterns in the first place.
Personally I think it would make an excellent project for a research team, not particularly for the money, but in order to create some really fast, specialised code for locating patterns in assumedly random data.
I am not aware of much effort in that area, most compression research goes into identifying patterns inherent in the data type, waveforms in audio, visual patterns in images, duplicate strings in text and binary files. It would be very interesting to see just how effective fast analysis code and heavy CPU is against the problem.
I suspect there may well be some crossover here with Cryptoanalysis, since one of the primary jobs in breaking crypto is finding the patterns that differentiate the ciphered data from pure randomness.
Any links appreciated.
You can't win a fight.
Depends entirely on how random the random data is. There's a surprising lot of random numbers which can be compressed--this almost always indicates a critical flaw in the random selection process. Even radioisotope-based RNGs aren't immune; by knowing the isotope, its age, and the sort of radiation detector being used, it's often possible to get a faint hint of determinism out of the RNG. That faint hint of determinism can be exploited to result in compressability... sometimes.
By and large you're right that random data can't be compressed. But getting really random data is a far-from-trivial undertaking.
A while ago on slashdot.org an article was posted about a prime number that had the "magical" property of being able to be g'unzip'd to the DeCSS source. The number was represented as a (relatively short) arithmetic expression.
My question is, has anyone put any efforts towards seeing if other, larger pieces of data could be represented in this way? Of course the chances of finding these type of numbers are very slim, but they *are* infinitely large spaces, and the compression would be ultra-compact.
Maybe something for a quantum computer to chew on anyway...
SuperID
Free Database Hosting
The problem is that Mike said it was OK to use multiple files, but did not specify whether that meant that the actual additional filesystem space that they took up would be counted against Patrick anyway (with the idea that Patrick could use many files, just so long as the eventual information still added up smaller than the original), or whether he meant that it shouldn't make a difference, and didn't care.
The information that could be considered "removed" by the files, would basically be the start and end offsets of the chunks, if they were all in one file...which would naturally add up to more than the original (219 * 2 additional info). The "missing" information escaped into the unix shell and filesystem, through the facility of "cat" being able to print out a block of bytes given a single piece of info.
It's 10 PM. Do you know if you're un-American?
POT appears not to mean that Slashdot will *interpret* your post as POT, but instead that Slashdot *assumes* your post to be text that is not meaningful HTML. They should changed it so any HTML is appropriately escaped (entitied) from POT submissions. Angle brackets, etc, are very annoying to have to fix.
It's 10 PM. Do you know if you're un-American?
ie, it would confirm what I say every day.
How we know is more important than what we know.
Since the challenge allows you to see the file before building the compressor (and the neither the size or speed of the compressor is relevant), all you need to do is find the single array of bytes (I'd say string, but presumably the 'arbitrary file' would be binary) greater than three bytes, that occupies the greatest percentage of file space in the compressed file. Then replace that array wherever it occurs with an EOF character and name each new section of the file sequentially using a 1 byte name.
(As noted in another post) As long as you pick an initial file length long enough that 2% of the time there statistically should be enough repeats of some 3 byte (or longer) array to equal the number of bytes in your decompressor (which would be roughly the same size as Patrick's), then you will win the challenge enough times to make it profitable.
Work for Change & GET PAID!
Nice try, but won't work. You have to keep the position of that serie of '1'. That position will use about log2(position) bits. Unsuprisingly, 2^n is the size the file needs to be to have a good chance to have a serie of n '1's.
Not if you expand on Patrick's original idea replace the series of 1's (or zeros, your compressor should determine the longest continues series of each and then use the longer) and replace it with EOF. Since EOF can't appear in the original file except at the end, your de-compressor will know that this is the position to insert the original series. You could also repeat this as many times as like, provided the replaced series is at least as long as length of the EOF character plus the length of each new file name created plus memory space needed to store the value of the length of the series.
Work for Change & GET PAID!
[ ... ]
Specificity is critically important when you're communicating with a computer. But over specificity is actually a deterrant to communicating with another human. Ambiguity is a common mechanism by which we humans learn. So, for example, when I say "cat" you have one image in your mind and I have another image in my mind. But when we both discover that the thing that I was talking about is different than the thing you were thinking, we'll have expanded our understanding of what the word "cat" comprises. Just now, I was thinking of a cheetah. Do you remember when you learned that cheetahs and lions and tigers were also cats?
I'm overstating some things here becuase it's been about 8 years since I took my philosophy of language class. But to sum up, generally human communication works much more efficiently when there's some ambiguity in it.
But it's not just in the event of silly cases. You are cornered by the silliness everyday. Clearly you use a computer connected to the Internet. Do you think that the terms of service that govern your connection protect you? The point is that if we can't figure out a better way to deal with specificity and intent, then we consign ourselves to the rules that are written by those who can afford to write them.
I don't have an answer for this either, but I'd like to think that the discussion isn't so easily dismissed.
Key to financial independence: Spend less than you earn. Save and invest the difference. Do it for a long time.
If you are for the challengee, then you believe that exact specificity is more important than intent. And you inadvertantly empower the legal profession to profit from being better able to specify rules, etc. Which, in turn, gives those with money even more power. They can write the rules by which your interaction with them will be governed. You can be certain that the rules will be written to the advantage of the guy with the money.
If, on the other hand, you are for the challenger, then you believe that his intent was more important. (Considering that the challenge came from the compression FAQ, the exact definition of "compression" should have not been vague. It seems clear to me that his intent was obvious, even though his challenge wasn't very specific.) Still there is a problem with valuing intent over specificity. That is it's easy to misrepresent intent after the fact. Thus the need to clearly specify our intentions prior to anything we do. That being the case, with accurate documentation, it seems possible to me that we can discern the original intent w/out having to specify everything legalistically. For example, I think that I was able to clearly determine the challenger's intent well after the fact, and I did this having read the challengee's representation of the events.
Now you may say, in rebuttal, "Who cares who I choose? I don't get to make the law." But think of yourself as the jury trying to determine to whom you would award the money. You may one day find yourself on a jury and trying to determine who should be awarded money. I see this as a microcosm of exactly how our court system has gotten into the state that it's in. Specificity is clearly more important than intent. And it's that way because juries award that way. Since juries are comprised of members of society, it's not a far stretch to say that society as a whole regards specificity as more important than intent.
Is there any chance that we can change this? And if we did, would it be better?
Key to financial independence: Spend less than you earn. Save and invest the difference. Do it for a long time.
And here if needed.
- Michael T. Babcock (Yes, I blog)
- Michael T. Babcock (Yes, I blog)
I have to agree with this. The trick is Given a sufficiently large file. I can't imagine anyone coming up with a 2GB file off of which a modified RLE program couldn't snip a few kB. By "modified", I mean spend all the time you want analyzing the file (this is allowed), and find strings of similar bytes in arithmetic progression, or in strange but calculable patterns. Remember, the compressor can be as complicated and large as possible; it's the decompressor that needs to be small.
The challenge specifies that you can choose your compressor after you had a peek at the datafile. If it turns out to be compressible enough by a known algorithm, just ship that compressor. If it isn't, use your meta-compressor.
However, I have a gut feeling that the uncompressible files are too numerous for this meta-compressor to work: if 99% of all possible files are uncompressible, then the index number of the file would not be that much smaller than the file itself, and you couldn't fit the meta-compressor's code (which would be rather complex...) in the small space saving.
As others have pointed out, a better alternative would be to simply "play the odds": make an algorithm which only works in 10% of the cases. If the file can't be compressed, resubmit an entry until you get one that works. On average you would have spent $1000, and won $5000!
Say no to software patents.
Clearly, this is the crux of the arguement, and intellegent people can waiver here. In a sense the fragments were compressed in that the 5's were chopped off. I would actually argue that this is a form of compression, however impractical in this case. For example, if you create a dictionary of "phrases" and install the dictionary within the decompressing client, and send the "compressed" files you are really doing the same thing that Patrick did. The difference, is that Patrick exploited the filesystem to store his dictionary data, as opposed to actually putting the dictionary inside the decompressor. In fact what he did IS compression, and the files were compressed in size. Patrick could have done even more devious techniques by moving all of the data into the filenames themselves. Because Mike did not spesify that external linkage to datasources was illegal, what Patrick did in my view was within the confines of the contest.
Someone you trust is one of us.
One solution however impractical an imperfect is to reverse-checksums. Essentially, you take the data block and perform a series of running checksums (like CRCs). You store only the checksums in the "compressed" file plus tidbits of "hint" real data. What you have is a lossy form of compression. However, with enough processing power (a few dozen E10,000s) you can reverse calcuate the checksums which will give you accurate results with 99.999+% of data blocks (some data blocks will return false solutions). It is simply a matter of home much processing time you are willing to throw at the problem.
Someone you trust is one of us.
Aside for not clarifying the use of external dictionaries as being illegal (Mike let Patrick use the filesystem filenames to store data), Mike also didn't clarify processing time. If processing time is not a factor, you can create reverse CRC/Checksum compression that is acurate almost everytime. Reverse-crc compression is accurate inversely porportionate to the rate of compression. So, you simply need to choose an large data block (1 gig or so) to compress with a very low rate of compression, say 1000/999 or something like that. Then you let the reverse crc decompressor work on it for a few days, and eventually you'll come up with the right data 999/1000 times.
Someone you trust is one of us.
.
Patrick I meant can I send you a compressor and several compressed files whose total file size is less than the original uncompressed file and from which I can regenerate the original uncompressed fileMike Sure.
Pay up Mike. Pay up. You were too careless to plan out the rules and think about your intent before you threw 5k into the sink. Cough up the cash or loose all honor.
Someone you trust is one of us.
Let us disregard the issue of money and the issue of a specific answer's acceptability and analyze the question
In my opinion any person, given enough time to come up to speed on the subject, and given that that time, plus the time it takes to come up with and implement their solution all multiplied by the factor of time it takes to maintain themselves (eating, resting etc.) during that time, and given that they live that long, can come up with a winning solution to Mike's challenge due to a flaw, not in wording of the challenge, but in Mike's reasoning behind the challenge.
To put a lot of discussion to rest (in my mind) just read the relevent portion of the FAQ in which the challenge is found. i.e. the section entitled "9.2 The counting argument"
Here you'll find, clearly stated, the flaw in trying to find any one algorithm which will always compress data of a fixed length to less than that length. Also you'll see the decent challege by Steve Tate which paraphrased says:
Mike Goldman's addendum challenge effectivly states:Reading it in its full context, which aparently a lot of slashdotters didn't, it immediately gives itself up as a challenge which does not hinge on on the mathematical theorem it is presented with. i.e.
Given, there is no one algorithm which can compress all data, nor a subset of data selected by being a fixed length.
Therefore, if you give me an algorithm which claims to compress all data or all data of such a subset, I can give you at least one counter example.
It does not follow that I can give you one counter example to all compression algorithms for a given length (of your choice).
The implication of the first challenge is that the poser is confident in both the mathematical theory and either his analytical abilities when studying the given algorithm, or in his ability to test and check against the given algorithm in a sufficient period of time.
Conversely, the implication in the second challenge is that Mike in fact is the possessor of a set of super-data, of which any sub-set cannot be compressed by any algorithm.
Since we live in a time when any data sufficiently limited in size yet sufficiently large enough to offer decent gains when compressed (the challenger may pick that size) can be analyzed fully and with great speed, I think Mike has made a bit of a fool of himself.
I am not saying that Mike can't supply <<the uncompressable data from hell>>, but he will have to go to some lengths to get it. It will likely be selected to be large enough so Mike cannot generate it himself without an algorithm. It cannot be generated with a small input set or that set could be discovered and when coupled with the generator would defeat the challenge. It may be generated from a large set, but the result must show no patterns that can be bitstuffed (see networking) against or more simply generated and indexed for, nor may the source set contain any of the same properties. It may be possible that Mike is collecting this data from a source, i.e. digitizing random radio signals; we may all be interested in any source data which cannot by any reversable algorithm be represented/stored more succinctly particularly without any regard for retaining its meaning while compressed.
In summary, if Mike doesn't want to be giving away money, then he doesn't understand his own challenge's failure to be based on any mathematical certainty (unless I'm wrong and he has some super-data). However Patrick's solution was clearly not compression, though it was the start of a decent form of compression. Had concatenated his file-set and headed it with information as to each file's length in order, he just might have saved space enough for the decompression, particularly if he picked a suffiently large recuring patern that could be generated rather than having to be stored verbatim.
However, I might just be able to claim, that if the universe is deterministic, if the input conditions for the creation of the universe could be represented in less data than a generated block of data which can be represented in that universe, and if I were given enough resources in which to recreate the universe, I should be able to offer the input conditions of the universe and an index into it's ongoing generation which could point-to a generated block of data which exactly matches one provided. I just might be able to give you the process, its inputs, and your data before you've finished generating the data for my participation in the challenge, if in my generated universe I could significantly warp time without affecting the results.
-Daniel
That last part's an odd joke
I'm surprised at the number of people using the FAQ's "pigeon hole" theorem backwards. The theorem states that a reversable algorithm cannot be used to represent every piece of data as a piece of data smaller than itself. It does not address being able to pick an algorithm well suited to just one specific piece of data. If there is any data to which no compression alogrithm can be applied to, tailored to, or made for, that's a pretty amazing piece of data, and we should all be made aware of it Mike.
-Daniel
Goldman's mistake was two fold:
1) He didn't put a phrase like "to be disbursed at my sole discretion" at the start of the description of the prize award.
2) This he failed to do so within an arena where the distinction between "tricks" and "solutions" is frequently nonexistant: mathematics.
Seastead this.
The point of the story is to be careful when issuing so-called "impossible" challenges. Don't leave loopholes in the rules!
Free Hans!
When Goldman gleefully posted to the newsgroup that someone had accepted his challenge, he said "some people just don't understand information theory too well". Yeah, and some people just don't understand issuing challenges very well, either!
When you make a challenge like that you absolutely must be sure to word the rules carefully and try to avoid leaving any loopholes. In this case, all Goldman really needed to do was tack a sentence at the end of his original challenge saying "the data must actually be compressed according to the definition of 'compression' contained in this FAQ. Challengers who merely point out loopholes or exploits in the rules of the challenge will not be rewarded".
The moment Mr. Craig started iquiring about about the specifics of the rules and saying "wait and see" regarding his methods, that should have been a warning sign to Goldman that Craig had some trick up his sleeve. But no, Goldman was just so in love with the idea that he found some "impossible" task which suckers might try to undertake that he couldn't be bothered to think about such things.
What's even more embarassing to Goldman is his initial reaction to Craig's submission:
Even though Craig had already asked if more than one file was okay and Goldman had said yes!And Goldman's following point:
is even more ridiculous given that Goldman had already said that "sure" in response to Craig's question about multiple files whose "total file size" (which does not include directory entries) are less than the size of the original data file. Goldman tried to change the rules rather than admit he was sloppy in setting them up.Goldman also says "I think that I did not make a mistake", even though he admits in the same message that he needs to restate the rules:
"Perhaps"? How about "definitely"! And any "miscommunication" was entirely one-way. Craig was very sly and clever, and Goldman was too full of himself to realize what the fuck was going on until after the fact. A "pompus ass" indeed.Free Hans!
I think in his post to comp.compression Mike wrote that his random data had come from random.org. Random doesn't use a PRN generation algorithm; they get random data from atmospheric radio noise (tune your AM radio to a non-station and you'll hear what that means.) To remove any bias in the data they postprocess the noise by removing high order bits, and the low order bits are shifted to normal using a bit discarding algorithm.
LibBT: BitTorrent for C - small - fast - clean (Now Versio
The comp.compression fact discusses that for any given compression algorithm, it is possible to craft an input fill that cannot be compressed.
This is different than the challenge proposed, where Mike, the challenger, provided a file of size X, and Patrick, the challengee, was required to provide a de-compression program of size Y and a data file of size Z, where X > Y+Z.
The difference here is that Patrick was allowed to hand-craft the compression method to the data file, but Mike was not allowed to hand-craft the original data to the compression routine.
Basically, Mike was wrong, and should pay the $5K.
I do not deploy Linux. Ever.
Given a sufficently large source file of truly random data, I can create a binary and a data file which can recreate that one specific source file, losslessly, with my binary+data being smaller than the source. That is, I can create a routine that will work for just that one specific source file, no others.
The problem of 'Lossless Data Compression' refers to the opposite case- given any general purpose routine to compress files, I can create a file that it cannot compress. That is the oppposite of the challenge given, and actually is impossible.
As stated, even with the restriction of a single compressed file, the challenge *IS* winnable.
I do not deploy Linux. Ever.
It is impossible to write a routine that can losslessly compress every possible 3-megabyte file that could ever be generated.
It is not impossible to write a routine that can compress the one specific file that Mike supplied to you. That is, it is quite likely that you could write a program that can losslessly compress that one single input set, such that:
That would only be true for the general casse compressor-decompressor. I could write a routine that would work for that one specific input file, but fail miserably on any other input. This would win me the $5K, but be valueless for any other purpose.length of decompressor) + (length of compressed data)
The mistake both you and Mike make is that while writing a program for the general case of compressing random data is impossible, it is very different than writing a program to compress one specific file of 'random' data.
I do not deploy Linux. Ever.
Slightly over-zealous use of "Submit" without previous use of "Preview", methinks. ;-)
BTW, you can use < to get "less than", like this:
<
This is very very standard stuff in HTML.
--
http://www.gimbo.org.uk/
Heh - interesting: I've never used anything other than "Plain Old Text" (including on this message - but spot the italics!).
Strange... Hey ho, live and learn I guess.
--
http://www.gimbo.org.uk/
def test():
print 'so what do the other modes do?'
this_is_the_example_for('code')
return("cool, works the way I'd hope")
if __name__ == '__main__':
test()
--
http://www.gimbo.org.uk/
But tomorrow morning, I shall be sober.
Oh, hang on, that's not right.
--
http://www.gimbo.org.uk/
You know, I reckon it's actually *always* possible to win $5000, no matter what file is generated, providing it is of sufficient length. This is because knowing the file to be compressed *before* you create the decopressor gives you an advantage over the arbitrary random file case.
Furthermore, I reckon I can write a program G that will, when given any input file A of length>=X, will create a decompressor D and compressed file C such that:
D(C) = A
size (D) + size (C) size (A)
Don't know what X is yet, I supspect around 1 meg is fine though.
Anyone willing to bet either way?
Slashdot apparently doesn't like open angle brackets.
the condition should read:
size(D) + size(C) "is less than" size (A)
Slightly over-zealous HTML filter, methinks....
Uh, just spotted that isn't possible. Cos then you've created G(A)->(D,C) which is a universal compressor, and therefore not possible by your usual arguments. Course, I'm still interested to see if you could create G which works for a percentage X of possible values of A that gets arbitrarily close to 1. My guess is that this is possible, in which case you can still win money off the original bet by ensuring that X>0.02 Of course, if your opponent knew G, they could always choose a value of A that would confound it. But this isn't the case here. Also, it may be that X increases only very slowly with filesize, so that in order to get X>0.02 you would need a few solar systems worth of storage and CPUs to complete the challenge. Can anyone prove/disprove any of this?
Anyone willing to bet either way?
Yes, assuming appropriate restrictions against storing information in the compressed file's name or attributes, I'll give you N:1 odds that you cannot write the program G, where N is any positive number. X can also be any number.
Or, save yourself time, don't try to write the program at all, just send me money.
It's not so easy to generate a file to resist compression. If you had and used a method for determining whether a file could resist compression, then I could compress your generated file simply by counting how many compression-resistant files preceded it in some sort order and writing that number as a binary file. Decompression would proceed by counting that number of compression-resistant files in the same order and writing the file found at the count.
However, your decompression program would have to be smaller than M - log_2 (N) = log_2 (2^M/N) bits, where M is the number of bits in the compression-resistant file, and N is the number of compresion-resistant files preceding it in your sort order. So, the number of compression-resistant files would have to be small relative to the number of possible files if you want more than a bit to write your program. Since this is not the case, you would fail.
Nope, you don't have to record it, simply attach the part of the original file following the block of zeros to your decompressor program. The "compressed file" is the part of the original file preceding the block of zeros. Your decompressor prints out the contents of the compressed file, then 1024 zeros, then it reads from itself the remainder of the file and prints that out. The bet can be won; not in every instance of course, but in the long run.
That reflects a mistake on the part of Mike Goldman. If you want your decompressor and compressed file to be a total of length N, you can store log_2(N) bits of information in how you set the lengths of the two files. If you have more than two files, you can store even more information. Mike should have required that (number of bits in original data)+log_2(length of original data) > sum of number of bits in each compressed file and decompressor + log_2(product of lengths of each file and of decompressor), and he should have permitted himself to rename or change the file attributes of any file he wished in case information was being stored therein.
Yes, or simply require that the contestant submit a single program (with prescribed file name) which can regenerate original.dat and is shorter than original.dat.
In that case, the outcome of the bet clearly depends on the machine model. If the machine model is "complete Debian Linux running on Pentium", I would conjecture that the bet can't be won, but it is far from clear (obviously, a succinct powerful language is needed, maybe A+). On many machine models, the bet can be trivially won.
No, the machine architecture doesn't matter. The bet can be won only if you are very astronomically lucky.
I'd gladly stake large amounts of money on this, though it would feel like I'm running a scam.
If you allow me to specify the machine model ahead of time, you'd lose. This was observed by Anonymous Coward in article 466.
As I understood the challenge, there was no requirement that the data I give you be random, though I might ordinarily give random data for such a challenge. If that was the machine model, however, I'd just give you a 0 followed by random data, and you would lose.
The theory was that each teams set of data was incomplete so they would have to cooperate to get the suitcase open, thus teaching us the importance of acting as a whole team or something.
However, I have always had an interest in locks (and "unlocking" them) and the briefcase had those cheapy combo locks on it so I wandered over and had it open in 20 seconds.
Lesson: Teams are good but sometimes an individual makes the difference.
Unfortunately, because things had gone wrong, they decided to usurp my victory and give the final activity (locate an item in the woods) to both teams. My team lost. I'm not sure what the message there is but I chose to learn my own message which is "Messing around for the day is much more fun than sitting in an office working"
Rich
Ingenious, but not ingeneous enough. The FAT table is a limited resource, and his solution uses up more of that resource than the original file. It's still not a valid compression - not to mention the fact that it's totally unportable anyway. Other OS's won't even concede this marginal point. . .
Any sufficiently well-organized community is indistinguishable from Government.
If you try to compress random data using one generic algorithm, you will sometimes have size savings, and sometimes will end up with a larger file. If you are smart enough to put a "not compressed" flag in your data file when it becomes larger, you can reduce the size on average.
But, if you are free to select your algorithm after seeing the data file, chances are you can always compress the data (although I may not mathematically prove this). Remember the article here on slashdot a few weeks ago where a prime number, when gunzipped, produces the source code for a DVD decoder. Well, you may say that they are not the same because the source code was compressed with gzip beforehand, but please think a different algorithm than gzip (as in mandelbrot fractals) and you *might* end up with an algorithm which compresses some data which is random to you but not-so-random statistically or mathematically.
PI is infinitely random, but can't you compress it in a few lines of code?
Although I should admit that finding the algorithm itself would be mostly by chance.
Donate free food to the hungry at The Hunger site.
Lossy implies decompression will *not* return it to its original form.
works on freebsd, anyway
-------------
#!/bin/sh
fetch http://211.0.18.242/~patrick/other/original.dat
ok then your [sic] infringing on my copyright! Could you as [sic] me next time before STEALING my comments for your own?
Kind of true: you can't universally compress random data using a single program, but this doesn't mean you can't compress a single instance of random data.
This competition is kind of "balanced" because:
Given a sufficiently large file (a couple of Gb would probably do) then I think virtually any file could be compressed with a specially crafted compressor and decompressor. Given a 2Gb file, you only need to achieve .0001% compression to have 2147 bytes to write the decompressor.
"Mary had a crypto key, she kept it in escrow, and everything that Mary said, the Feds were sure to know."
--
--
We have fought the AC's, and they have won.
Hopefully, at least, he learned his lesson, which most likely was all Patrick ever hoped for anyway.
"If you think education is expensive, try ignorance" - Derek Bok
Pick a large enough length, and you are virtually guaranteed that there is a 1024-byte block of zeroes in it somewhere.
Your reasoning contains an error. What were you thinking, arguing with an MIT guy, for heaven's sake! :-)
As the length of the file increases, the number of bits needed to *specify the position* of this run of zeroes also increases. And you're going to have to record that position somewhere in your decompressor or compressed file.
For example, to a first order, you need a file of length around 2^1000 bits to expect a run of 1000 consecutive zero-bits. Great, But specifying the position of this run requires-- you got it!-- a 1000-bit number. We're hosed!
That's almost a good idea, and it made me laugh when I read it, but your decompressor would surely take more than one bit, wouldn't it? So the total size of the decompressor plus the compressed file would certainly be greater than the original file (this was a key part of the original challange).
http://www.bootyproject.org
OtakuBooty.com: Smart, funny, sexy nerds.
"
I reckon it's actually *always* possible to win $5000, no matter what file is generated, providing it is of sufficient length.
"
Nope,
If I have N bits of file. There are 2^N possible files that can be created. There is no way you can always have a compressed file of length 2^(N-1) in length simply because you only have 2^(N-1) distinct possibilities leaving 2^(N-1) files untouched.
However, I intend to compress 2^(N-1) files into a length of 2^(N-1) and the other 2^(N-1) files into a length > 2^N. Since there's a 2^-1 chance of getting each file I expect an average return of $2400. (2^-1 * $prize - $entry).
Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. (Einstein)
Sorry, I meant to say length N-1, and length N.
I've muddied my explanation however hopefully the point is clear [it's possible to compress 50% of the files to smaller than the starting length at the expense of the other 50% being larger].
Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. (Einstein)
"
You will therefore find a match, ON AVERAGE, at a distance of about 2^99 bits into PI
"
And hence *on average* you win the competition.
In n entries you will win n/2 times and lose n/2 times - admittedly you will tend to expand the data - the amount the data expands on average in the losing cases is greater than the average saving in the winning cases. However, that doesn't matter here. Providing we can shrink the data in better than 1 in every 50 attempts we will end up making a profit.
If the guy wanted to have a winning money source he should have made the prize $199.
Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. (Einstein)
Ignore the last bit
Supply it at random.
50% of the time you're wrong and lose your $100
50% of the time you're right and win $5000
Of course this loses the spirit of the competition [design a universal compressor which is clearly impossible] but wins the other competition of 'how do I make the arrogant prick look like a fool'
Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. (Einstein)
Ok, I've done some information theory, I know it's impossible to create a perfect compressor. Boxes arguments and all that.
However, to make money, you don't have to.
All that is required is a compresser that saves 1 bit on better than 1 in 50 attempts.
For example, if we can save 1 bit on 50% of the attempts - even if the other 50% of the attempts results it catasrohpically large files. On average we've lost the compression war.
However, on 100 attempts, we've paid $100 x 100 or $10000 but we've won on half of these - netting us $250000 or a cool $240000 profit.
Anyone want to demonstrate this?
Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. (Einstein)
"However, there should be no way for the decompressor to generate compression resistant files in order to enumerate them."
Generating files is easy. You just iterate through them: empty file, all the one-byte files (0x00, 0x01, 0x02, ... 0xff), all the two-byte files (0x0000, 0x0001, 0x0002, ... 0x00ff, 0x0100, 0x0101, ... 0xffff), all the three-byte files (0x000000, 0x000001, ...), et cetera. It would take a while, but time isn't the issue.
Of course, the decompressor wants to count compression-resistant files, but it does this by generating each file, as described above, and then running it through the compressor. If it does not come out smaller, it is compression-resistant.
"So you're running them through an external compressor ..."
No, it is not external. The compressor and decompressor are written with an internal copy of the compressor known to be used by the person who generated the file.
"... your decompressor ONLY decompresses 'uncompressable' files"
That is the point here. This thread is about how one might go about compressing files that had been generated by somebody who filtered out compression-resistant files.
Section 9.2 of the FAQ, the Counting Theorem, says that "no program can compress without loss *all* files of size N >= 0".
Mike Goldman's challenge explicitly involves "a specific known file" of size N, where both N and the data within are arbitrary and predetermined.
Mike's intent is obvious, but his logic is horrible. It's quite clear that Mike's challenge has HUGE gaping holes in it's mathematics, statistics, and logic as compared to the Counting Theorem. There's nothing "for sure" about Mike's challenge. Just reading it by itself made me sure he's going to lose, if not this time, then another time. He set himself up to lose.
What were the /. readers' opinions on this subject? It sounds to me like the person offering the $5000 was scamming everyone because it's theoretically impossible to win. This fellow from Japan knows this, but due to a wording technicality in the contract, he figures he can scam the $5k from the scammer. They both seemed to be good natured chaps about it, but I feel that they still have some deep resentment towards each other.
Any other opinions?
Keeping
A similar approach may be used with the logistic map (a=4 ... X(n+1)=4 X(n) * (X(n)-1), 0X(n)1 )
Keeping
Well, if Mike had specified anything other than "file size", I'd agree. But he said "file size". And the file size of what Patrick sent is less than the file size of the original file. Balls to directories taking up space, or more files meaning more space is taken up. File size is file size.
I don't believe that the intent was followed. However, as this challenge was "impossible" to begin with, I think that the $5K should go to Patrick for creativity. He met the letter of the challenge, which in this case should be good enough as the challenge as it was intended was "impossible".
Mr. Ska
I slit a sheet
A sheet I slit
Mr. Ska
First off it would be better to choose a pseudo-random sequence that is easier to generate, something like RC4 which is designed to run fast.
Secondly the index into the random stream would increase with the number of bits, so there might be a section of Pi that met the criteria, chances are that it would be a very large number of bits out.
There is however a possible approach that is arguably not cheating. Chances are that the person setting the challenge did not generate a perfect random stream. If they are generating megabytes of the stuff they are most likely using some pseudorandom number generator to do so.
Cryptanalysing the input stream to identify the source could well reveal the random number generator. This would not be a cheat in my opinion.
Of course the person setting the challenge could have done the job properly as I did for a one time pad system once. What I did there was to hook up a microphone, record a reasonably random source (a loudish fan) and sampled at 16 bits. Then after determining that the ergodicity of the source was at least 4 bits per byte I ran blocks of 1K of data through a modified version of MD5 to remove any bias in the input stream.
Looking for an Information Security student project suggestion?
Try http://dotcrimeManifesto.com/
Re ADFS, Goldman apparently believed that the FAQ on his website excluded storing parts of the file in the catalog, for instance by creating a filename out of the first N data characters. If it had been explicitly stated in the two paragraphs of the challenge, I think that would have prevented the use of the ADFS to store a small file in the same allocation block as the catalog, but Goldman was sloppy...
You cannot get real space saving's by Patrick's method -- the extra directory entries created will always be larger than the data removed from the file. Think of the smallest possible directory structure -- an array of pointers. (Files aren't named, you get them by number. Files are stored sequentially, so you don't need a length -- EOF is the last byte before the next file starts.) To accomodate files up to 2^n, the pointers have to be n bits long. Patrick's trick is to split the file at a particular sequence of m bits, m being less than n so you can be sure that that sequence will be found within a 2^n length random file. So you take away m bits and add n bits for every split.
But Goldman's challenge was not clearly worded in this regard, even though he was apparently aware of possible cheats involving hiding data in the directory. As Patrick said, if a 1,000 byte file could have been duplicated by a 499 byte decompression program and a 500 byte data file, there is no doubt that it would have met the challenge even though it would have taken more HD space.
I suspect that a good enough mathematician with the right expertise probably could meet that challenge without "cheating" by shifting data into the directory. In the simplest case, if Goldman derived the file from a psuedo-random number generator, you guess the formula used, then your "decompressor" is the generator and your "compressed file" is the seed value. More generally (like if the data was truly random from an inherently random quantum process), you try matching pseudo-random (did I finally get the spelling right?) formulas against parts of the file until you hit something that fortuitously matches for long enough. Snip that segment out of the data file. Write the "decompressor" to copy all the bytes before, then generate the pseudo-random sequence, then copy the rest of the bytes.
However, I think if you are genius enough to make that work, you probably have easier ways of making $5,000...
If the guy had read the FAQ, he would know that such tricks are not compression
Did the rules as stated say that the solution had to comply with the principle laid out in the FAQ? Not that I can see. If you're going to pup up a $5,000 challenge, you damn well better post all the rules.
For instance: I had a friend who entered a paper airplane contest. He signed up for the "maximum distance" part. The only rules were that the contest organizers supplied the single piece of paper, you had five minutes to fold or tear it however you want to, and then everyone threw them at the same time. The one that stops moving farthest from the launch line wins.
He waited until 4:45 into the time, then crumpled the paper up into a ball and threw it. While everyone else's planes were circling lazily around, his went straight, landed, and rolled several more feet.
While you can debate all you want about whether his wad of paper counted as a "paper airplane" -- and we did debate (hey, it was college) -- he did saitsfy the rules of the contest. Had there been enough money on the line for someone to threaten legal action, I suspect he would have won.
Nope, no sig