How I Completed The $5000 Compression Challenge
Patrick Craig points to this page which details (in email) the strange life of an online challenge as posed, clarified and (at least to some eyes) met successfully. This should give pause to anyone willing to pose -- or accept -- seemingly impossible problems. In short, Patrick's not $5,000 richer, but you may think he deserves to be.
And Mike should probably both reword the challenge and, to avoid the appearance of conflicting interests, offer to donate future processing fees (after the callenger admits defeat) in any case.
Stephan
which isn't true because if this were true every number would be a random number (because every number could potentially be generated by this number-generator).
But every number CAN potentially be produced by a random number generator. The hypothetical file full of zeros doesn't violate this because you only know you got all 0s after the fact. Watching the bitstream, each bit that came out had a 50% probability of being a 1, it just happens that none of them were over that particular sample period.
Thus, given 2000 (or any arbitarry number) if 1 Mbit samples (also arbitrary) from a true random number generator,upon examination, one of them MAY contain all 0s. You still have no way to predict the contents of the others (the all 0s file has exactly the same probability as any other arrangement of bits), and you have no basis to predict the content of the remaining files.
If you flip a coin 100 times yielding all heads (assuming a non-biased coin), the odds of heads on the next flip is 1/2.
It sounds to me like the person offering the $5000 was scamming everyone because it's theoretically impossible to win.
The only place I've seen that challenge is in the compression FAQ. It's in the same section of the FAQ that explains the impossability. It's hard to call something a scam when the 'brochure' explains that you will certainly loose your money.
It's even more understandable when you consider that the USPTO has issued several IMPOSSIBLE patents for compressors that work on random data (recursivly no less)!
I don't know about you, but if I came across a coin that flipped heads 100 times in a row, I'd strongly suspect a biased coin.
Given no other information, I would suspect the same. That's why I had to specify. Replace coin with random bit generator if you prefer.
Although, by both providing a certain file AND specifying the file size (3145728 bytes), one could argue that he *implicitly* defined file size as the number of bytes within the file. Hence, he loses.
However, he was also asking for $100 up front, to encourage only serious submissions. I don't think he'd get many submissions if he tried both tactics. :)
There is *at least* one sequence that cannot, by an arbitrary method, be reduced to a shorter length. On top of that, you need to use some bits for the compressor.
OTOH, if you can include "external" information in your compressor (either the algorithm or a function), you can have an algorithm that will compress any stream, and compress your favorite stream to a single bit. As an example, the first bit is 1 (and the only bit) if it is your stream, and 0 followed by the entire stream for any other stream.
hawk
Nope, you don't have to record it, simply attach the part of the original file following the block of zeros to your decompressor program. The "compressed file" is the part of the original file preceding the block of zeros. Your decompressor prints out the contents of the compressed file, then 1024 zeros, then it reads from itself the remainder of the file and prints that out.
The bet can be won; not in every instance of course, but in the long run.
--
In that case, the outcome of the bet clearly depends on the machine model. If the machine model is "complete Debian Linux running on Pentium", I would conjecture that the bet can't be won, but it is far from clear (obviously, a succinct powerful language is needed, maybe A+). On many machine models, the bet can be trivially won.
--
I'd gladly stake large amounts of money on this, though it would feel like I'm running a scam.
If you allow me to specify the machine model ahead of time, you'd lose. This was observed by Anonymous Coward in article 466.
--
Good point, the proposed machine model is not Turing complete and you can analyze it ahead of time.
If I remember my Kolmogorov complexity right, the probability that a random file can be generated by a program that's n bits shorter is roughly a*b^(-n), where a and b depend on the (Turing complete) machine model. If the model is chosen wisely to make the constants small, it should be possible to compress more than 1/50 of files by one bit. But it's not nearly as simple as I thought.
--
It's a*2^(-n), sorry.
--
...but Yahoo! seems to have taken the site down. Thank you, Yahoo!
Stating on Slashdot that I like cheese since 1997.
The challanger is correct in that the problem of compressing an arbritary amount of truly random data is insoluble, and I'm not going to question that this is the problem as the challanger believed he had set.
The challangee is correct in that the challange did not specify any particular form of compression. Compression is simply the removal of redundancy. If, by splitting the file and removing a number, redundancy is removed, he HAS compressed the file.
On the flip-side, the challenger looses karma and brownie points for trying to fiddle the figures. You can't argue on the basis of inodes, as the challange doesn't refer to a specific OS or filing system. (MSDOS, for example, uses fixed-sized FAT tables, rather than inodes.)
Also on the flip-side, the challangee almost certainly knew that the problem =as written= was NOT the problem as the challanger intended. True, it wasn't his job to second-guess, but if you enter a lion's den, you can still expect to get bitten.
My personal opinion -- BOTH sides should withdraw their various claims, and the sum total of all monies involved ($5,100) should be donated to an appropriate charity.
Overall, this should be a lesson to ALL -- Don't Assume. It can burn, and burn bad.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
If he had not specifically accepted the possibility of more than two files, it would be a case of someone exploiting a loophole in his challenge. But the challenger actually asked about the loophole, and was told that it was okay. If you're making bets like this, you ought to be sufficiently attentive to not permit loopholes if the person asks about them before using them.
In this case, the challengee specifically asked if the challenger would permit multiple files, and the challenger said yes. It's hard to argue that, simply because he *intended* to say either no, or "yes, but you have to include metadata size on all files after the first" or something else that would actually require compression, that he should be let off the hook. He was asked a straightforward question and gave a straight and wrong answer, and should lose because of that.
Create a partition of 3,145,728 bytes on hardisk.
Put the data file there.
Now try to put the solution in that same partition some how ( as file system,tar, or a big exe).
It will not fit.
He should have used Gregory Chaitin's Omega number to generate the challenge file.
Actually I really don't understand Chaitin's work well enough to know if that would have saved him the $5000, but at least he (and the challenger) would have learned something about algorithmic complexity theory.
Zooko
As people have already mentioned, the numbers are probably not pseudo-random as they were
;^)
retrieved from random.org...
However, if someone decides to be cute and repeat this challenge with a pseudorandom sequence,
you don't need to be a good mathmetician, you just need to know how to read a paper written
by a good mathmetician... Look for the paper...
Massey, J. 1969. Shift-Register Synthesis and BCH Decoding. IEEE Transactions on Information Theory.
IT-15(1): 122-127.
Abstract -- It is shown in this paper that the iterative algorithm introduced by Berlekamp
for decoding BCH codes actually provides a general solution to the problem of synthesizing the
shortest linear feedback shift register capable of generating the prescribed finite sequence of
digits. The equivalence of the decoding problem for BCH codes to a shift-register synthesis
problem is demonstrated...
This is probably in most textbooks on linear codes for encryption and/or error detection and
allows you to recreate the shortest LFSR (linear feedback shift register, a common component
of pseudo random number generators), given a sequence of digits. Of course nobody would
use a non-cryptographically secure PRNG like random() would they?
So you take the number, pass it through this algorithm, find the shortest LFSR, and the
decompressor just takes the LFSR initial state and reproduces the sequence according to the LFSR.
Then again, the shortest LFSR can be thought of as the linear complexity of the number and who
knows it might just accidentally compress the sequence the challenger gives you (not likely,
but who knows?)...
Always remember... "stand on the backs of giants" whenever possible
ahh...but would THAT 1000 bit number be random, or compressible on it's own?
Once you can drop 1000 bits from the file, you can play with the offset itself, which has a good chance of being compressible or at the very least expressible in a smaller format.
As several people have observed, the attacker in this case used his "split into multi-files" trick to hide the necessary information in the directory structure of the disk. This is just barely ruled out by the "one compressed file" rule. However, suppose you hid the data in the filename of one file. For example write a program like
echo $1 | cat - $1
(about 20 bytes) and use the first 21 bytes of the raw file as the name of the compressed file.
This is in the same spirit as this attempt (hide the data in the directory) but avoids breaking the one compressed file rule.
Steve
That doesn't get you any closer to winning the contest or overturning the laws of information theory.
Because squaring it gets you close to n, your variable a will have approximately half the bits of n. But there's no guarantee that b will be smaller than (the other) half of n's size. In fact, with large n, the probability is overwhelming that b will be larger than half n's size, so together a and b will occupy more bits than n.
I just tried this for fun with a 15-digit number I pecked in at random (482837578298375), and got an 8-digit a (21973565) and a 9-digit b (19489150).
Because the gap between primes increases according to a simple formula, there's probably a simple proof of the average size of b, given n. But I don't feel like calculating it. :)
Jamie McCarthy
Jamie McCarthy
jamie.mccarthy.vg
I wrote:
Typo, for "primes" read "squares." Sigh.
Jamie McCarthy
Jamie McCarthy
jamie.mccarthy.vg
"Sure" is not straight forward enough???
How about this for a compression algorithm:
1) if the input is original.dat, output 1 byte 0xff
2) otherwise, output 1 byte 0x00, followed by the input
This algorithm doesn't work that well in general, but for compressing the given data file, it should be great.
JET Program: see Japan, meet intere
As soon as I was able to find the mirror of the original site, I figured this out. Seems like there are all sorts of tricks like having a script say:
gunzip $1
gives you all the power of the gunzip executable (72kb on my system) for only 10 bytes. Also, is opening a network connection back to a web server that has the original data forbidden:)
JET Program: see Japan, meet intere
Patrick specified a file size, and Mike generated a file of that specified size. Mike did not include any OS filesystem overhead in his file size. It seems rather hypocritial of Mike to claim OS file system overhead counts for Patrick's files but not Mike's. Mike defined FILE SIZE, not DISK SPACE USED.
--
Infuriate left and right
I don't know about you, but if I came across a coin that flipped heads 100 times in a row, I'd strongly suspect a biased coin.
> What I'm thinking of, is wouldn't it be possible to create a minimalist compression algorithm that tended to not affect the file size,
> but might deviate it in either direction by about 1% ?(enough to cover the overhead of the decompressor).
All purely random strings of numbers (e.g. pi, e or the square root of 2), tend to have short strings of digits that repeat randomly. I believe this behavior is called ``clustering."
Assuming a large enough file, say 3MB in size, we could find up to 26 strings of two or more digits that repeat enough times so that the compressor would be a series of substitution commands. For example:
s/a/311/g
s/b/47/g
s/c/9724/g
...
etc.
That comes out to an average of 10 characters per line (including end-of-line character), add another 20 characters for overhead, & the decompressing script could be kept to less than 300 characters. As long as each substitution applied to more than 3 places in the original, then there would be net decrease in total file. And since we are looking at the 26 most common strings of digits, this inductively should be possible.
Hmm. I seem to have reinvented pkzip.
Geoff
I think I see a trend here. Maybe for them it really would be easier to muzzle the entire internet than to produce p
> My question is, has anyone put any efforts towards seeing if other, larger pieces of data could be represented in this way?
Sure: any. It was just the decss source padded out to a prime. You could do the same for mozilla.
--
I've finally had it: until slashdot gets article moderation, I am not coming back.
Since a couple of people have asked the same thing, I'll have a go. My mathematics is reasonable, but I'm not a pro. IANAM.
PI isn't actually random, and will skew this analysis in ways that doubtlessly earned someone a PhD, but let's ignore that for the moment. Let's pretend PI is an infinite-length random bitstream. It doesn't matter much.
Being infinite in length, it will therefore (eventually) contain any finite length random bitstring that you care to look for.
The chance of finding a match at any location is one in 2 to the power of the length of the bitstring. So, the chance of finding a 100 bit match at any index is 2^100.
You will therefore find a match, ON AVERAGE, at a distance of about 2^99 bits into PI.
Hurrah! you say. We only need a 99 bit index to represent a 100 bit number. We've made a saving!
First, let's just diverge slightly and apply a sanity check for why this can't possibly be right. If that logic holds, then we can compress any N bit number to N-1 by looking for it in PI. Therefore, we can compress the 99 bit index to a 98 bit index, then to 97... down to 1 bit. Something's wrong.
The issue is that you get a 99 bit index ON AVERAGE. Sometimes it will be 101 bits. Sometimes it will be 98. There's even a small chance (exactly 2^100) it will be right at the beginning. So, you can't just have a fixed 99 bit block to store the index, you need another number to store the length of the index number.
Since the number "99" requires 7 bits of storage, your average case will be to represent a 100 bit string as a 106 bit index+length code.
So, actually, you end up expanding the data, if the law of averages holds.
There's no way around this, no matter how clever you get.
If you say "we'll use a 99 bit block only if it fits, otherwise we'll just store the original 100 bit number" then you need at least a 1 bit switch to indicate this.. so on average, half will be 99+1 bit blocks, and the other half will be 100+1 bits. 100.5 bits per original 100 bit block.
No, Huffman coding the index length doesn't help either. But if you asked that question, you probably already knew the answer.
Of course, you might get lucky on individual cases. That's always a possibility when dealing with truly random data. But then, that's not compression, that's gambling.
If you really want to learn about this stuff, read Claude E. Shannon's work on Entropy and coding theory. (also Turing, Huffman, and Godel) Doing so will also give you the grounding necessary for thermodynamics and quantum computing.
Jeremy Lee | Orinoco
If you want to be picky, Patrick utterly failed the challenge by not supplying compressed files. They were simply unaltered pieces of the original file. Sure, there were missing bits of the files that had been moved to the file names, but the fact that this slight of hand failed to compress the data can be proved simply by renaming the files. Even if the order of files supplid to the "decompressor" is preserved, renaming them renders the original data unrecoverable. Sure, moving data into the filename is a cute trick, but it isn't compression.
The goal of the challenge is to produce compression, not to win via some semantic shell game. It's Patrick who shed his honor by resorting to a semantic smoke screen and attempting to win by wordplay with no intention of actually producing any real compression.
My understanding is that GeoCities IMMEDIATELY shuts down pages that induce X hits per hour, because they are likely porn/warez/MP3 sites.
They THEN do investigation, and re-enable TOS-compliant sites.
Kind of sad that they have to do this...
SlashSigTheorem: Humorous, Political, Critical, Constructive- If you have a
Although you submitted it as plain text, the browser might still interpret it as a tag. /. doesn't seem to insert [pre] tags around plain old text postings, I'm not sure why.
Your right to not believe: Americans United for Separation of Church and
Which article were you reading? He had 218 compressed files! (Admittedly without real-world compression, but within the rules of the contest there was compression.)
Your right to not believe: Americans United for Separation of Church and
I think the moral is that if you think you're so smart, and you want to set up a public challenge to prove how smart you are and/or how inviolable a certain concept is, you'd better be smart enough to word the challenge so as to prevent fairly obvious technicalities/loopholes in your challenge. This wasn't a brand-new loophole, Mike even knew about it from the FAQ to begin with and he still couldn't phrase a challenge that avoided this problem.
If you're going to put your money where your mouth is, make sure you know what you're talking about.
Your right to not believe: Americans United for Separation of Church and
The stated metric of compression was "file size", as reported by 'wc -b', not "disk space". This metric was used by both challenger and challengee. The challenge wasn't intended to cause people to write better code, because the provider of the challenge knew that the problem was unsolvable. The only purpose of the challenge was to provide a bit of sport for the elite of the compression usenet world. If the whole thing was planned around "let's laugh at people who think they've found the Philosopher's Stone", then I don't see how turning the tables on the so-called "experts" is really contrary to the spirit of the contest. Either way, somebody thinks they've got a sure thing and that there's no way the other side can win.
I understand the argument that you're making that certain things should be able to be taken for granted in such challenges, but I don't think it's a worthwhile challenge if you don't specify everything up front and then as people solve the problem you say to them "no, you can't do it that way, everybody knows that's just cheating". The provider of the challenge supposedly had much more of a background in the topic of compression and should have been able to easily word a challenge without loopholes (just a reference to avoiding techniques which had been proved to fail from the FAQ would have been sufficient, for example).
Since Mike was willing to put up money based on a challenge which didn't really prove what he thought it did, he should hand over the money to Patrick for providing a worthwhile education in specifying your problem clearly.
I suspect Patrick knew about this hole all along, but chose to go through the challenge to find out if Mike would really admit the error in the specification of the challenge, or if he would try to back out. I only wish I'd thought of it first!
Your right to not believe: Americans United for Separation of Church and
Yes, but the offset number probably has more digits than the length of the file you're trying to compress.
---
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
The payoff matrix is highly in favor of the contestant. You can't just assume that a random stream won't compress. Sure, it might not, but your odds are about even, not nearly as bad as a 50-to-1 shot.
---
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
Read the damn article before you post - he didn't use 'gunzip' in his proposition.
--
"Rune Kristian Viken" - http://www.nwo.no - arca
Their 486 got overloaded
now we need to go OSS in diesel cars
in this case it wasn't entirely random, since, although the file was sourced from random.org, goldman got the chance to review it before delivery to craig, which (presumably) gave him the chance to check that he didn't fortuitously generate three meg of pi or the decss source - or any of the number of less obvious cases which may have inadvertently yielded the very slight compressability required to win the challenge.
-- in china, chinese food is just called food.
it's more of a thought experiment than a scam - the idea being that you can't win the $5k, however bad you want it. and it's a win-win situation for goldman, since he either gets a bunch of $100 checks from a bunch of loonies who think they can compress his random data, or he ends up paying out the not-impossibly-large sum of $5k in exchange for playing a part in what would be one of the greatest mathematical revelations in recent memory, which is probably more than good enough value as far as he's concerned.
-- in china, chinese food is just called food.
of course, there's always the chance that your algorithm would fortuitously perform wonders on any given file, but i think the odds are stacked against you.
-- in china, chinese food is just called food.
is this a troll? you can represent a gzip of any data in this way.
-- in china, chinese food is just called food.
- gzip your data
- represent it as a large integer
- find the next prime number that can be represented by appending garbage bytes to your integer
so, just regular prime discovery afaik.-- in china, chinese food is just called food.
clare-ents was talking about one compressor that would win 2% of the time
-- in china, chinese food is just called food.
good point; at the level that craig was discussing, though, something as desperately simple as ensuring that you had no repeated or related sequences (given a suitably large random file) would probably be a first step towards eliminating the most obvious candidates.
-- in china, chinese food is just called food.
this is a troll, right?
-- in china, chinese food is just called food.
you've got them over a barrel! go claim that 5k!
-- in china, chinese food is just called food.
it's not as ambiguous as you seem to think it is. can you represent any integer quantity between 0 and 10 using only an integer quantity between 0 and 9? no, you can't; there are more things to represent than you have things to represent them with. i'd say that's pretty inviolate.
-- in china, chinese food is just called food.
it does not work period. storing the factors of a number takes more space than storing the original number.
-- in china, chinese food is just called food.
it'd work for numbers that were close to a prime. unfortunately, there are far more numbers that are a long way away from their nearest prime - so storing the prime index and the difference would eat more space than storing the original number. your recursive tactics don't solve that problem.
-- in china, chinese food is just called food.
ultimately it's just a bit of fun. nobody got hurt, nobody got scammed, and it entertained slashdot readers for a minute. where's the harm in that?
-- in china, chinese food is just called food.
look at the date of the article he linked to.
-- in china, chinese food is just called food.
i guess you just don't get it. some things are highly unlikely, and some things are quite literally impossible. read section 9 of the faq; maybe that'll clarify it for you.
-- in china, chinese food is just called food.
it'd work for *some* numbers (14159265 compresses wonderfully!) but not for others, which is true of every other compression scheme ever.
-- in china, chinese food is just called food.
he wouldn't void the entry if it genuinely wasn't hiding data in the filesystem, ie. the decompressor would accept the input file on stdin. that wasn't the case in patrick's attempt.
-- in china, chinese food is just called food.
there's no scamming involved. goldman required the (suitably small) entry fee to deter time-wasters; craig wasn't after the money in the first place. they were both just out to prove a point (goldman, that truly random data can't be compressed in the general case; craig, that goldman's challenge was amusingly open to abuse), not to make any money.
-- in china, chinese food is just called food.
You could scan the file for the smallest possible number whose binary representation isn't contained in the file,
i.e. let's say 1001010111001010101 (I know this number is too short) is not there replace the 1000 zero-bits with that number.
But I fear now it only get's more difficult to calculate the probabilties of a "win", but the outcome will stay the same.
Agreed. Mike clearly stated, quite boastfully in the newgroup before he even sent the file that Patrick obviously didn't read the FAQ because it was an impossible task. so who's frauding whom? He believes a task to be impossible, and alludes quite obviously to it, then sets up a challenge to meet it?
Come on, Mike, in my opinion you were just trying to make fools of people and maybe make a quick buck on the side. As you obviously were positive this was an impossible task, then you had no intentions of paying $5000, though I can't really speculate, you probably would propose the challenge whether you even had the $5000 or not. Then when he actually used your own rules and resolved the challenge in a way you didn't see coming, you quickly turned lawyer in an attempt to scare him from even trying to collect.
He met the challenge that you yourself called impossible. Pay up (assuming you even can)
Once I thought I was wrong...I was mistaken.
The hard drive platters were free. The coffee can lids might have worked better though because they are lighter in weight, although they wouldn't have been as rigid, so I'm not sure how well they would have supported the weight of the vehicle...
:)
I guess I could try it, though.
My journal has hot
Again, that depends on what you mean by "paper airplane." Admittedly, I won a similar contest involving rubber-band driven cars. The contest rules said that CDs or DVDs could not be used for tires due to the fact that they make for very narrow tires, thus having reduced friction. So I used hard drive platters. :-)
My journal has hot
He never said it was okay to deliver a whole bunch of 'compressed' files. He said ONE decompressor, and ONE compressed file. Though Patrick asked about this, all the guy did was restate his terms.
What Patrick did was take a small bit of information out of the file, and, rather than having a table of 'offsets' telling his decompressor where to insert 5's, he broke it up into a bunch of files, effectively using the directory/inote tables to store this information.
Had he not insisited on a single file, I'd say Patrick here would have technically won, though not at all in the spirit of the competition.
The data has to come from the guy accepting the challenge; you get to tell him how much data to compress, he gives you the data.
The indices into Pi will take more room to represent (in terms of bits) than the data you are trying to represent (in general).
But it is a commonly suggested technique. Good try!
"It's overkill, of course. But you can never have too much overkill." - Anonymous Slashdot Coward
I've heard this idea conjectured a lot by various people; but although the mandelbrot set is in some sense infinite, the set of images that can be produced has bounds. Basically, choosing any fractal, and sampling at any resolution, with any slice through the set, still won't produce the entire possible set of data that could be represented in the finite space that you've chosen. Due to the sampling, you would repeat some data and never produce others. In fact, the self-similarity properties of fractals would seem to indicate that this would be a rather poor (or inefficient) way to represent general data.
"It's overkill, of course. But you can never have too much overkill." - Anonymous Slashdot Coward
In order to meet the challenge you must provide to me a compressed file and a decompressor which together total less than the size of this original.dat, and which can generate a file identical to original.dat.
He doesn't say "file size" in this or any other passage. It is Patrick who says (and assumes) this.
Also, Patrick specifically addresses the issue of filename based compression in this passage.
It's probably also possible to meet the challenge with smaller file sizes by storing information in the filename of the compressed file or the decompressor, but I think most people would consider this to be cheating.
Thus, I think it is Patrick who hadn't considered the fact that his "decompressor" is not only just the 'dfi' script, but also the filename information used to indicate file boundaries and ordering. Like I said, his algorithm (much less his actual implementation) could not properly reassemble the file if it were fed all the properly ordered "compressed" files as an input stream (through a pipe, say), since it wouldn't know where to place the missing fives. So given that Mike said "total size", in all cases, when discussing and offering the challenge, I think his position is safe.
Now, I do think that Patrick was reasonably clever, and that Mike was overly smug in both the challenge and the initial news posting. And in the exchanges Patrick did explictly ask about file sizes, but I do not fault Mike for not clarifying further, since he had no a priori knowledge of the scheme, and therefore didn't know he would have to be so specific. But his use of terminology was consistent, and I think, appropriate, to protect against this type of scheme.
Anyway, that's just my take on it. Mike can hopefully learn some humility from this, and Patrick can learn that his "decompressor" is more than just the sum of its file bytes.
"It's overkill, of course. But you can never have too much overkill." - Anonymous Slashdot Coward
-- these are only opinions and they might not be mine.
As soon as Mike accepted Patrick's money, it was no longer a rhetorical challenge. It was a bet. Patrick, at that point, had caught Mike proffering a reverse sucker bet. Mike says "I bet you $100 at 50:1 odds that you can't do X." Patrick, confident that he could do X, accepted the bet and then did X. Mike says "Oh, I really meant for you to do X.1. I only tricked you into trying to do X because it's impossible." Patrick, in a really amazing display of good sportsmanship, declined to have his money refunded. (Which Mike wisely proffered...otherwise, he'd be in serious danger of fraud...)
My momma told me never to take a sucker bet. Mike got caught by his OWN sucker bet...which takes a) greed and b) foolishness. He got lucky that the person who took him was such a good sport about it.
It'd be really nice to see Mike donate the money to ACM or EFF or somebody like that...that'd be a pretty classy way of taking care of this.
Why yes, I AM a rocket scientist!
I'm not necessarily arguing a point of law, it's more of a point of honor. Wagers are well-understood in many cultures, and in this case it seems clear that Mike proffered an arguably dishonest wager (a bet where he challenged people to do something that was not possible to do to his satisfaction). Somebody hoist Mike by his own petard. Good for Patrick.
Why yes, I AM a rocket scientist!
"Last, the contestant will send me a decompressor and a compressed file, which will together total in size less than the original data file, and which will be able to restore the compressed file to the original state."
At the bottom of the page:
"There's almost universal agreement that I didn't compress anything..."
Since he didn't have any compressed file (or files), it seems he didn't meet the rules of the challenge.
Dad always thought laughter was the best medicine, which I guess is why several of us died of tuberculosis.
Yeah, my exceptionally slow compressor counts the number of "one" bits and the number of "zero" bits. It stores those two numbers, along with several position-dependent checksums. Compression is very fast.
;)
Decompression can take roughly the lifetime of the universe (for large files).
Interestingly, my brilliant strategy is also a encryption algorithm
bukra fil mish mish
-
Monitor the Web, or Track your site!
Eloi, Eloi, lema sabachtani?
www.fogbound.net
(should have said "slow de-compressor" but you get the picture)
bukra fil mish mish
-
Monitor the Web, or Track your site!
Eloi, Eloi, lema sabachtani?
www.fogbound.net
It should be possible to come up with a scheme that would work more often than 1 in 50, which is all you need to make money at this game, pretending the backer would actually let you play over and over again.
How about this:
take the file,
1 look for the longest sequence of 1s. (length x)
2 record its starting point (position y).
3 remove the ones and write a program to insert x 1s at position y.
4 choose the file size to ensure a high enough (>50% say given the terms of the bet) probability of finding enough 1s in a row to be longer than the code to reinstate the ones.
Of course the decompressor is not general but there will be one such that will work on any sufficiently large randon number and hte length of the program will only increase logarithmically with the size of the file.
I was trying to avoid using EOF as a cop out.
Is it also the case that I could not win through choice of run (zeroes, digits of pi, exponentials of small numbers)? I imagine that this wins just a few bits and the length of the program bounds the maximum number of different checks. and each method only wins about a bit.
Also the files would get very large.
Still since I only have to do better than one fiftieth it ought to be possible to get quite close.
In theory, and in practice, the actual length of the "compressed" data files is equal to the length of the original files. Why? because all that happened is that a character ("5") was replaced by another character ("\D" or EOF). The tool wc obviously does not include EOF characters in its byte counts, so it does appear that the "compressed" files are smaller, in sum, than the original. In actuallity, the compressed files each have an extra byte tacked onto the end, where that "5" used to be -- an EOF character.
Does that matter? I think so.
Has Patrick, albiet perhaps unwittingly, used a "filesystem exploit" to (try to) win? Yes. The fact that the computer removes and ignores the EOF character when processing the file data is of no concern -- the EOF is there, stored in the file, regardless.
Lzip should have gotten $5000 for being the best compression program.
----------
----------
Check out my blackbox styles
Too bad the Python doesn't get executed.
:-)*
Maybe in Squishdot, as long as rexec is used, of course. You could import StructuredText from Zope and your post could act like a Wiki.
- *this* would look like this
- **this** would look like this
The above would become an actual HTML bulleted list.
http://this.would.be/a/link
|| Instant Table ||
|| 1 || 2 ||
Ah, well.
I think that a person's stance on this issue, of who was right and who was wrong, will indicate that same person's approval of lawyers and much of the practice of the law today.
Data point: The corolation for me is: modern US law as practiced is harmful, and anti-productive; this contest was clearly stated, and clearly won. Mike should pay up.
But it was clear. The contest was in terms of total file size of the compressed output and decompressor program.
The proviso given (which should not have been) was that the compressed file could be provided in multiple parts. What he forgot is that an ordered sequence of files has more information than just the sum of their contents. Ooops.
This is not a wording problem, but one of a poorly thought out bet. This guy gave 1:50 odds that no one could compress truely random data. He chose to allow conditions that made the bet doable, and then did not want to pay up when the bet was won.
Actually, even without using multiple files, I think the bet's doable for any single given data file (though you cannot write a generalized compressor for all truely random data). I should come up with the submission to solve for this guy's challenge.
there's implicit information present in the split files that is not present in the whole file such as the sizes of the files and the order of the files (which is implicitly included in the filenames). That's why the split files tarred and gzipped together produce a larger archive than the orginal file (because this information will have to be stored also in the archive). That's why mr. Goldman asks for "a decompressor and a compressed file" and not multiple files. Now if mr. Craig delivered ONE file that is smaller than the original which dearchives into the 218 files which in turn decompress to the original file, he might have won (but it would never work).
which isn't true because if this were true every number would be a random number (because every number could potentially be generated by this number-generator). This obviously isn't true. A true random number generator doesn't randomly generate numbers but generates numbers that are truly random, that is, you can't predict what digit (x+1) will be based on the previous x digits. That is why the only true random numbers can be generated if the source is completely random such as the background noise in space, or timing nuclear disintegrations.
(or to do it anonymously ofcourse). Now please don't flame me ;-)
Sky, an extraordinary proposition gambler, responded by countering. He covered the offeror's eyes and bet he couldn't tell what color was Sky's tie. "No bet."
I say this to point out the difference between sucker betting and proposition gambling. It is a sucker bet to offer a bet challenging a mark to do something that is, in fact, impossible. In proposition gambling, the offeror gives the mark a proposition they are unlikely to win, or that is worded in such a clever way that the winning of it leaves the mark too embarassed, entertained or impressed by the cleverness to attempt to challenge having lost.
In this case, the first offer (compressing high-entropy data) was a sucker bet. Shame on him. The response, doing the sucker bet using multiple files, was a sweet proposition gamble. Shame on him. Once taken, the offeror-now-mark did weasel, but as I see it, ultimately welched.
I admit that I, too, didn't consider the import of the multiple files on first reading (but, hey, I didn't have $5K against $100 riding on the result).
At the end of the day, the sucker-bettor was the mark, and he had an earful of cider.
No sympathy.
It is my understanding of the situation that Patrick "solved" the challenge by exploiting the fact that multiple files store more implicit data in the filesystem than a single file does. I believe the only such data he used was the EOF.
This is considered "cheating" because he used 220 EOFs that are not counted in the filesize.
So, my question is this. If a contestant submits a single compressed file and a single decompressor file which are just 1 byte under the original file, does Mike void the entry because of the extra filesystem space of two files? If not, why not? If so, shouldn't the rules say that specifically, as two is the minimum total number of files one can submit?
--
gnfnrf
Its an intriquing problem. I certainly wouldn't stake $5000 on nobody being able to compress the data I supplied, good random number generator or not. Why? because while the concept that random data in general cannot be compressed is correct, there are no patterns, it is not necessarily true that a given set of random data cannot be compressed.
For example, a file full of 0's, is perfectly possible from a true random number generator, and just as likely as any other set of numbers. As any good geek knows, a file full of 0's compresses very very well.
In this case, I think the challenge has one major problem, and that is that there is no real time limit on the entry, and no requirement that the decompressor work effectively in general, it can be entirely specialised.
This means that the problem *may* be solvable, simply by throwing ungodly amounts of CPU power at it. With the appropriate code, it would be possible to simply hunt for unintentional patterns in the data and hope. If you're lucky, there will be enough indexable patterns present that it will outweigh the cost of the indexing, and the code necessary to replace the index points.
If you're unlucky, the data will have no patterns in the space you searched, and therefore you're out of luck.
This is why the guy noted that if he got a big enough file, he could win the competition. As the filesize increases, it becomes increasingly likely that there are, accidentally, enough patterns in the data to cover the space required for the decompressor and the dictionary. Of course, as the filesize increases, you also have to burn considerably more CPU in order to locate the patterns in the first place.
Personally I think it would make an excellent project for a research team, not particularly for the money, but in order to create some really fast, specialised code for locating patterns in assumedly random data.
I am not aware of much effort in that area, most compression research goes into identifying patterns inherent in the data type, waveforms in audio, visual patterns in images, duplicate strings in text and binary files. It would be very interesting to see just how effective fast analysis code and heavy CPU is against the problem.
I suspect there may well be some crossover here with Cryptoanalysis, since one of the primary jobs in breaking crypto is finding the patterns that differentiate the ciphered data from pure randomness.
Any links appreciated.
You can't win a fight.
Depends entirely on how random the random data is. There's a surprising lot of random numbers which can be compressed--this almost always indicates a critical flaw in the random selection process. Even radioisotope-based RNGs aren't immune; by knowing the isotope, its age, and the sort of radiation detector being used, it's often possible to get a faint hint of determinism out of the RNG. That faint hint of determinism can be exploited to result in compressability... sometimes.
By and large you're right that random data can't be compressed. But getting really random data is a far-from-trivial undertaking.
1. gzip your data
2. represent it as a large integer
3. find the next prime number that can be represented by appending garbage bytes to your integer
Why bother with primes at all? Find the closest n ^ k. Specify n, k, and the difference (positive or negative).
ASCII charactor EOF? What the hell kind of ASCII EOF would be in a binary file?
-- The act of censorship is always worse than whatever is being censored. Always.
Understood. I wanted to know if any research has been done to determine the discovery time and whether this would make the scheme feasible...although I'm pretty sure no, for the same reason cryptography is secure these days.
:)
Just spouting random thoughts, jeez
A while ago on slashdot.org an article was posted about a prime number that had the "magical" property of being able to be g'unzip'd to the DeCSS source. The number was represented as a (relatively short) arithmetic expression.
My question is, has anyone put any efforts towards seeing if other, larger pieces of data could be represented in this way? Of course the chances of finding these type of numbers are very slim, but they *are* infinitely large spaces, and the compression would be ultra-compact.
Maybe something for a quantum computer to chew on anyway...
SuperID
Free Database Hosting
This works well for highly composite numbers, eg.
;-). You also have the problem of deciding when you have achieved 'enough' compression.
4096 => 2^12
However for less composite numbers,
p*q = p^1, q^1, p & q prime
Its not so good. For prime numbers, you're worse off (trivially):
p = p^1
Note that I've ignored the delta - eg. for the (can't remember the name) special primes that appear as 2^n -1. Here, you run into trouble finding the delta (you have to factor p, then p-1, then p+1, then p-2... factoring being a hard problem (in NP), you're not going to have much fun implementing this
By the counting arguments expressed in the other replies to your post, there will be points where, by the time you have factored the number, counted the factors & their exponents, and the delta, you're worse off than when you started.
I would expect this to be true for very small
numbers. Note that no matter what we do, we're effectively in non-computable space because we have to factor prime numbers. So as reasonable as this algorithm is for compressing numbers, its going to take far too long to do it to be practical, at least, without quantum computers.
Cyano
Don't like my sig? I don't either.
The problem is that Mike said it was OK to use multiple files, but did not specify whether that meant that the actual additional filesystem space that they took up would be counted against Patrick anyway (with the idea that Patrick could use many files, just so long as the eventual information still added up smaller than the original), or whether he meant that it shouldn't make a difference, and didn't care.
The information that could be considered "removed" by the files, would basically be the start and end offsets of the chunks, if they were all in one file...which would naturally add up to more than the original (219 * 2 additional info). The "missing" information escaped into the unix shell and filesystem, through the facility of "cat" being able to print out a block of bytes given a single piece of info.
It's 10 PM. Do you know if you're un-American?
POT appears not to mean that Slashdot will *interpret* your post as POT, but instead that Slashdot *assumes* your post to be text that is not meaningful HTML. They should changed it so any HTML is appropriately escaped (entitied) from POT submissions. Angle brackets, etc, are very annoying to have to fix.
It's 10 PM. Do you know if you're un-American?
ie, it would confirm what I say every day.
How we know is more important than what we know.
Wrong.
If you could compress everything, you could just take any set of data and keep compressing until you have nothing left. This is quite clearly nonsense.
This is my sig. Hooray !
Since the challenge allows you to see the file before building the compressor (and the neither the size or speed of the compressor is relevant), all you need to do is find the single array of bytes (I'd say string, but presumably the 'arbitrary file' would be binary) greater than three bytes, that occupies the greatest percentage of file space in the compressed file. Then replace that array wherever it occurs with an EOF character and name each new section of the file sequentially using a 1 byte name.
(As noted in another post) As long as you pick an initial file length long enough that 2% of the time there statistically should be enough repeats of some 3 byte (or longer) array to equal the number of bytes in your decompressor (which would be roughly the same size as Patrick's), then you will win the challenge enough times to make it profitable.
Work for Change & GET PAID!
Nice try, but won't work. You have to keep the position of that serie of '1'. That position will use about log2(position) bits. Unsuprisingly, 2^n is the size the file needs to be to have a good chance to have a serie of n '1's.
Not if you expand on Patrick's original idea replace the series of 1's (or zeros, your compressor should determine the longest continues series of each and then use the longer) and replace it with EOF. Since EOF can't appear in the original file except at the end, your de-compressor will know that this is the position to insert the original series. You could also repeat this as many times as like, provided the replaced series is at least as long as length of the EOF character plus the length of each new file name created plus memory space needed to store the value of the length of the series.
Work for Change & GET PAID!
Do I get to pick the number base I represent the result in?
This can't be anything other than flamebait. I'll try and overlook it by saying that I never meant to suggest that I was universally pro- or anti-lawyer. I meant to discuss how a person's stance on this issue might lead to unintended consequences. Specifically that you might be enabling the powerful to be more powerful, by giving less value to intent and only sticking to the specificity that was set forth in his challenge.
But my assumptions about how the law does and does not work are well corrected by you. (Obviously) IANAL, so anything I said about how the law works is left to better people than me.
Cheers!
Key to financial independence: Spend less than you earn. Save and invest the difference. Do it for a long time.
What's funny about that is that I was thinking that exact thing right after I pushed submit on my last post! Nice catch.
Key to financial independence: Spend less than you earn. Save and invest the difference. Do it for a long time.
My only intent (ironic) was to start a discussion of which side you think you'd rather have, or how you combine the two into something that is acceptable. Of course, no such discussion started, and discussion about how little I understand the law is what ensued. Oh well!
Key to financial independence: Spend less than you earn. Save and invest the difference. Do it for a long time.
[ ... ]
Specificity is critically important when you're communicating with a computer. But over specificity is actually a deterrant to communicating with another human. Ambiguity is a common mechanism by which we humans learn. So, for example, when I say "cat" you have one image in your mind and I have another image in my mind. But when we both discover that the thing that I was talking about is different than the thing you were thinking, we'll have expanded our understanding of what the word "cat" comprises. Just now, I was thinking of a cheetah. Do you remember when you learned that cheetahs and lions and tigers were also cats?
I'm overstating some things here becuase it's been about 8 years since I took my philosophy of language class. But to sum up, generally human communication works much more efficiently when there's some ambiguity in it.
But it's not just in the event of silly cases. You are cornered by the silliness everyday. Clearly you use a computer connected to the Internet. Do you think that the terms of service that govern your connection protect you? The point is that if we can't figure out a better way to deal with specificity and intent, then we consign ourselves to the rules that are written by those who can afford to write them.
I don't have an answer for this either, but I'd like to think that the discussion isn't so easily dismissed.
Key to financial independence: Spend less than you earn. Save and invest the difference. Do it for a long time.
If you are for the challengee, then you believe that exact specificity is more important than intent. And you inadvertantly empower the legal profession to profit from being better able to specify rules, etc. Which, in turn, gives those with money even more power. They can write the rules by which your interaction with them will be governed. You can be certain that the rules will be written to the advantage of the guy with the money.
If, on the other hand, you are for the challenger, then you believe that his intent was more important. (Considering that the challenge came from the compression FAQ, the exact definition of "compression" should have not been vague. It seems clear to me that his intent was obvious, even though his challenge wasn't very specific.) Still there is a problem with valuing intent over specificity. That is it's easy to misrepresent intent after the fact. Thus the need to clearly specify our intentions prior to anything we do. That being the case, with accurate documentation, it seems possible to me that we can discern the original intent w/out having to specify everything legalistically. For example, I think that I was able to clearly determine the challenger's intent well after the fact, and I did this having read the challengee's representation of the events.
Now you may say, in rebuttal, "Who cares who I choose? I don't get to make the law." But think of yourself as the jury trying to determine to whom you would award the money. You may one day find yourself on a jury and trying to determine who should be awarded money. I see this as a microcosm of exactly how our court system has gotten into the state that it's in. Specificity is clearly more important than intent. And it's that way because juries award that way. Since juries are comprised of members of society, it's not a far stretch to say that society as a whole regards specificity as more important than intent.
Is there any chance that we can change this? And if we did, would it be better?
Key to financial independence: Spend less than you earn. Save and invest the difference. Do it for a long time.
There is a solution to this problem, but it takes a _long_ time to work :)
Write a script that:
1. outputs random data to a file
2. gen's an md5 sum on the file and checks it against the original md5
3. gen's another checksum that does not overlap with md5 and check this against the original files checksum
4. When the two checksum's match you have recreated the original file.
This script should be easy to make small. You then simply ask for a data set one byte larger. Eventually, you will generate the original file, but it will take a long F***ing time.
The only other problem is finding two checksum programs with large collision domains and that do not overlap. md5 is one, simply find another.if you can do this, then you can probably win the challenge, although I would hardly call this compression.
-sirket
The point of using multiple non-overlapping check-sums on a file is that the collision domains will minimally overlap.
The chances of finding two files with the same md5, the same sum (Alg 2), and the same crc32 are miniscule, and the more checks you use the smaller the odds. If neither program collides with the other you are garanteed a unique file.
md5 is a 128 bit checksum. crc32 is 32 bit. sum (alg 2 is 32 bit). Are there times in which these algortihms will produce the same answer for two different files? Yes. Are there times when they will _ALL_ produce the same answers for two different files? Yes but the chances have become very remote. The question is how many will produce useful input for a program. Probaly only one.
You could also check some of the byte sequences that you know from the original file to make sure that the randomly produced file is the correct one. If you know some small part of the original data as well as the checksums you can reduce the collsion domain to zero and garantee you have the original file.
Eventually you should be able to pick a set of check-sum programs that produce somewhat unique results, and then choose a subset of the original data, such that the two taken together produce the original data uniquely, and do so in a smaller space than the original file.
-sirket
And you would have destroyed anything anyone knew about information theory in the process! Yay!
:P
Say you had an algorithm that would take a "compressed" file of all but the last 4 bits of information, and guess at the last 4 bits. It'll get it right 1/16 of the time. That's great! You can spend $1600 and win $5000! So then all you need is a program to do this that fits in the remaining half a byte! That's not too hard, right?
Say you're a really l33t assembly coder and you can make an algorithm that would do this sort of thing in 20 bytes. You'd then have to guess at least 20 bytes of information on each attempt. The prize for the contest would have to be approximately 1.5 quindecillion dollars (1.5 x 10^48) for you to have a chance to make a profit.
--
Win dain a lotica, en vai tu ri silota
The flaw is that, for ANY file size, the probability of finding enough 1s in a row to be longer than the code to reinstate them is damn near close to 0. It will never get close to 50%. It won't even get close to 2%, the break even point. It won't even get close to a tiny fraction of a percent. You'd be better off going out and looking for a $5000 bill on the street.
--
Win dain a lotica, en vai tu ri silota
If random data could be compressed in a useful way, it would certainly be a mathematical revelation. You would be discovering that (say) 1000000 bits of pure information was equal to 999999 bits of pure information, and it follows directly from there that 1000000 = 999999, 1 = 0, TRUE = FALSE, "the universe" = "gone", etc.
--
Win dain a lotica, en vai tu ri silota
And here if needed.
- Michael T. Babcock (Yes, I blog)
According to me the challenge was aimed at proving that it was impossible to build a perfect decompressor and that all compressors make use of redundency in data. This is a dangerous assumption to make for this challenge ! If the challenge required 2 files to be compressed it would have been much closer to being safe.
Consider this:
0x23 ^ 0x34 = 0x77B352C55E214991
Thus we could "compress" the 64bit binary data to 24bit binary data by storing only 0x239434. This indicates 2x8 bit numbers separated with a mathematical symbol. This leaves 56 bits for code, etc/64 bits of data -> 87.5% of data size for code in the decompressor.
The point being that if the decompressor need only work on a single dataset there are usually cunning methods of representing the data. Such a decompressor will not be usable to compress day-to-day data (as is well known from compression theory) but the number of possibilities to find a pattern that is known to exist in a file increases the probablity of meeting such a challenge greatly.
Am I just crazy or is this a possibilty?
I have to agree with this. The trick is Given a sufficiently large file. I can't imagine anyone coming up with a 2GB file off of which a modified RLE program couldn't snip a few kB. By "modified", I mean spend all the time you want analyzing the file (this is allowed), and find strings of similar bytes in arithmetic progression, or in strange but calculable patterns. Remember, the compressor can be as complicated and large as possible; it's the decompressor that needs to be small.
True enough...if you don't consider the decompressor part of the output. Similar to what Patrick did, I could check the first byte of the file, strip it, and write a "decompressor" that sticks the byte back. However, if you include the decompressor, then the decompressor is part of the output, and the "decompression algorithm" is actually "run this program through an x86 (or whatever)". Like it says (all over the place, here), you can't do it generically. That's not to say it can't be done for a given file...maybe I can find an LCG pseudo-random number generator that creates the first 50k of the file.
The technique which was used in this case would not work for instane if there was a massive under representation of 5's in the data set.
Of course, in that case, you could use the lack of fives to your advantage. Maybe a Huffman encoding?
My BS and scam detectors burned out about three sentences into the guy's offer. "Give me $100 and I *might* give you $5000"? That is a classic scam, a variant of the "Give me $1000 ernest money and I'll split this $10000" scam.
Sorry, but technical details aside, anyone who responded to this should avoid timeshare offers, Ginsu knife commercials, and Three Card Monty players.
www.eFax.com are spammers
That only compresses if a, b and c combined use fewer bits. That's not necessarily the case. For instance, if n=5:
n = 101 : 3 bits
n = (10 * 10) + 1 : 5 bits
n = (100 * 1) + 1 : 5 bits
And that's ignoring overhead bytes to specify how many bits each number has.
Only the dead have seen the end of war.
You've clearly never heard of the pigeon-hole principle, or considered applying it to compression theory.
You can't squeeze the 2^n unique possible strings of 1s and 0s of length n into the (2^n)-1 unique possible strings of 1s and 0s of length 0..(n-1) without loss.
Only the dead have seen the end of war.
Not in the general case. No method will in the general problem of being ONE algorithm that will losslessly, strictly compress/uncompress every possible bit string.
(strictly compress => must reduce in size)
The usual proof --
* For arbitrary $n$ within $\mathcal{Z}^{+}$, there are $2^{n}$ unique binary strings of that length.
* There are only $\sum_{i=0}^{n-1} 2^{i}$ unique binary strings that are *fewer* than $n$ bits long.
* That summation adds up to... $2^{n}-1$.
* Therefore, the pigeonhole principle applies; if all $2^{n}$ strings map to strings of length $n-1$ or smaller, there must be at least one output (compressed) string which is now ambiguous since it could mean either of at least two strings of length $n$. Ergo you've lost information. Conversely, if there is no ambiguity, then at least on string of length $n$ was mapped to a string of at least length $n$ -- IOW, not compressed. And this holds for every positive $n$...
And when you combine different methods, you have to remember to account for the bits that tell you *which* decompression method to use.
Only the dead have seen the end of war.
There's a fair chance it wouldn't work. Every algorithm will have bit strings it cannot compress... so, basically, it's a search through the infinite space of possible algorithms to find one that does apply.
Only the dead have seen the end of war.
Mr Goldman is really pushing his luck here. He's bet, at 50:1 odds, that someone can't compress truly random data. That's great. Except Random.org -isn't- truly random data.
If you look at the stats, you'll see that the ent program suggests that the data being generated is at least superficially compressible, in the neighborhood of one quarter of one percent. All Mike said was that you have to compress it 1 byte. Given a large enough file, and a good statistical compressor, it's entirely possible that he'd have to cough up the dough.
It may still be possible to honestly 'trick the trickster', here.
-
Just because it works, doesn't mean it isn't broken.
The challenge specifies that you can choose your compressor after you had a peek at the datafile. If it turns out to be compressible enough by a known algorithm, just ship that compressor. If it isn't, use your meta-compressor.
However, I have a gut feeling that the uncompressible files are too numerous for this meta-compressor to work: if 99% of all possible files are uncompressible, then the index number of the file would not be that much smaller than the file itself, and you couldn't fit the meta-compressor's code (which would be rather complex...) in the small space saving.
As others have pointed out, a better alternative would be to simply "play the odds": make an algorithm which only works in 10% of the cases. If the file can't be compressed, resubmit an entry until you get one that works. On average you would have spent $1000, and won $5000!
Say no to software patents.
First of all, the decompressor was aloud to be written in shell scripts. There are all sorts of hacks one could do if you are able to link to other code for your decompressor. It should have have to have been written in C, and should have have to have relied on stdin.
Someone you trust is one of us.
reverse crc compressions:
For every block of real data:
data[blocksize]
compressed data:
data[blocksize - 2] + crc
The crc is only 1 byte and can be defined as the crc of this block and the previous crc. The decompressor then chops through the data, when a crc comes up incorrect, it goes back and chooses a different possible value for a proceeding crc until the entire series of datablocks validates.
Someone you trust is one of us.
Clearly, this is the crux of the arguement, and intellegent people can waiver here. In a sense the fragments were compressed in that the 5's were chopped off. I would actually argue that this is a form of compression, however impractical in this case. For example, if you create a dictionary of "phrases" and install the dictionary within the decompressing client, and send the "compressed" files you are really doing the same thing that Patrick did. The difference, is that Patrick exploited the filesystem to store his dictionary data, as opposed to actually putting the dictionary inside the decompressor. In fact what he did IS compression, and the files were compressed in size. Patrick could have done even more devious techniques by moving all of the data into the filenames themselves. Because Mike did not spesify that external linkage to datasources was illegal, what Patrick did in my view was within the confines of the contest.
Someone you trust is one of us.
One solution however impractical an imperfect is to reverse-checksums. Essentially, you take the data block and perform a series of running checksums (like CRCs). You store only the checksums in the "compressed" file plus tidbits of "hint" real data. What you have is a lossy form of compression. However, with enough processing power (a few dozen E10,000s) you can reverse calcuate the checksums which will give you accurate results with 99.999+% of data blocks (some data blocks will return false solutions). It is simply a matter of home much processing time you are willing to throw at the problem.
Someone you trust is one of us.
Aside for not clarifying the use of external dictionaries as being illegal (Mike let Patrick use the filesystem filenames to store data), Mike also didn't clarify processing time. If processing time is not a factor, you can create reverse CRC/Checksum compression that is acurate almost everytime. Reverse-crc compression is accurate inversely porportionate to the rate of compression. So, you simply need to choose an large data block (1 gig or so) to compress with a very low rate of compression, say 1000/999 or something like that. Then you let the reverse crc decompressor work on it for a few days, and eventually you'll come up with the right data 999/1000 times.
Someone you trust is one of us.
.
Patrick I meant can I send you a compressor and several compressed files whose total file size is less than the original uncompressed file and from which I can regenerate the original uncompressed fileMike Sure.
Pay up Mike. Pay up. You were too careless to plan out the rules and think about your intent before you threw 5k into the sink. Cough up the cash or loose all honor.
Someone you trust is one of us.
Just gzip the file and send it together with the gunzip program. The total of the 2 files bigger than the original ? No way !! It's just, for example, minus 20kb smaller :-)
OTOH... same goes the other way around... If you would solve this challenge... the guy could always say : 'Nope... it's minus 20kb bigger' and will keep his 5k in his pocket...
I don't know about anyone else, but I'd be interested in taking a gander at the original.dat file, but the link is in Japan and very slow. Anyone else mirroring it?
Let us disregard the issue of money and the issue of a specific answer's acceptability and analyze the question
In my opinion any person, given enough time to come up to speed on the subject, and given that that time, plus the time it takes to come up with and implement their solution all multiplied by the factor of time it takes to maintain themselves (eating, resting etc.) during that time, and given that they live that long, can come up with a winning solution to Mike's challenge due to a flaw, not in wording of the challenge, but in Mike's reasoning behind the challenge.
To put a lot of discussion to rest (in my mind) just read the relevent portion of the FAQ in which the challenge is found. i.e. the section entitled "9.2 The counting argument"
Here you'll find, clearly stated, the flaw in trying to find any one algorithm which will always compress data of a fixed length to less than that length. Also you'll see the decent challege by Steve Tate which paraphrased says:
Mike Goldman's addendum challenge effectivly states:Reading it in its full context, which aparently a lot of slashdotters didn't, it immediately gives itself up as a challenge which does not hinge on on the mathematical theorem it is presented with. i.e.
Given, there is no one algorithm which can compress all data, nor a subset of data selected by being a fixed length.
Therefore, if you give me an algorithm which claims to compress all data or all data of such a subset, I can give you at least one counter example.
It does not follow that I can give you one counter example to all compression algorithms for a given length (of your choice).
The implication of the first challenge is that the poser is confident in both the mathematical theory and either his analytical abilities when studying the given algorithm, or in his ability to test and check against the given algorithm in a sufficient period of time.
Conversely, the implication in the second challenge is that Mike in fact is the possessor of a set of super-data, of which any sub-set cannot be compressed by any algorithm.
Since we live in a time when any data sufficiently limited in size yet sufficiently large enough to offer decent gains when compressed (the challenger may pick that size) can be analyzed fully and with great speed, I think Mike has made a bit of a fool of himself.
I am not saying that Mike can't supply <<the uncompressable data from hell>>, but he will have to go to some lengths to get it. It will likely be selected to be large enough so Mike cannot generate it himself without an algorithm. It cannot be generated with a small input set or that set could be discovered and when coupled with the generator would defeat the challenge. It may be generated from a large set, but the result must show no patterns that can be bitstuffed (see networking) against or more simply generated and indexed for, nor may the source set contain any of the same properties. It may be possible that Mike is collecting this data from a source, i.e. digitizing random radio signals; we may all be interested in any source data which cannot by any reversable algorithm be represented/stored more succinctly particularly without any regard for retaining its meaning while compressed.
In summary, if Mike doesn't want to be giving away money, then he doesn't understand his own challenge's failure to be based on any mathematical certainty (unless I'm wrong and he has some super-data). However Patrick's solution was clearly not compression, though it was the start of a decent form of compression. Had concatenated his file-set and headed it with information as to each file's length in order, he just might have saved space enough for the decompression, particularly if he picked a suffiently large recuring patern that could be generated rather than having to be stored verbatim.
However, I might just be able to claim, that if the universe is deterministic, if the input conditions for the creation of the universe could be represented in less data than a generated block of data which can be represented in that universe, and if I were given enough resources in which to recreate the universe, I should be able to offer the input conditions of the universe and an index into it's ongoing generation which could point-to a generated block of data which exactly matches one provided. I just might be able to give you the process, its inputs, and your data before you've finished generating the data for my participation in the challenge, if in my generated universe I could significantly warp time without affecting the results.
-Daniel
That last part's an odd joke
I'm surprised at the number of people using the FAQ's "pigeon hole" theorem backwards. The theorem states that a reversable algorithm cannot be used to represent every piece of data as a piece of data smaller than itself. It does not address being able to pick an algorithm well suited to just one specific piece of data. If there is any data to which no compression alogrithm can be applied to, tailored to, or made for, that's a pretty amazing piece of data, and we should all be made aware of it Mike.
-Daniel
Goldman's mistake was two fold:
1) He didn't put a phrase like "to be disbursed at my sole discretion" at the start of the description of the prize award.
2) This he failed to do so within an arena where the distinction between "tricks" and "solutions" is frequently nonexistant: mathematics.
Seastead this.
One of the solution paths people often forget about is the one where you re-define the problem to something that's (moreeasily)solvable. :-)
Furry cows moo and decompress.
Looks like he did it to me, and the proof of his victory is all over the net. I don't see any way the sponsor of the challenge could weasel out of this one. I guess he's lucky Patrick was such a gracious winner. -Mark
I think it is clear to anyone with any intellect who reads this that he took the contest out of context. People, a great deal of misery is the result of miscommunication. I'm talking about people who hear what they want to hear as opposed to what someone else is trying to convey, and lets be frank, spoken language is bad enough of a medium to convey ideas without people intentionally looking for ways to misinterpret it to their own ends.
Although I do still think it was a clever prank to pull, but not one that I would seriously make a webpage complaining about not getting paid.
Of course the argument is "But he said...!"
Well, my only response would be "Hear what I mean, not what I say."
Sigs are awesome huh?
I find that odd. I observed a similar contest and the planes that won went much further than a crumpled up ball would have. Using a single sheet of paper the size of a typical drawing tablet I could make an airplane that goes the distance of the gymnasium where the contest took place. I don't think you could have thrown a crumped up sheet of the same paper that far, although I didn't try.
Sigs are awesome huh?
The point of the story is to be careful when issuing so-called "impossible" challenges. Don't leave loopholes in the rules!
Free Hans!
When Goldman gleefully posted to the newsgroup that someone had accepted his challenge, he said "some people just don't understand information theory too well". Yeah, and some people just don't understand issuing challenges very well, either!
When you make a challenge like that you absolutely must be sure to word the rules carefully and try to avoid leaving any loopholes. In this case, all Goldman really needed to do was tack a sentence at the end of his original challenge saying "the data must actually be compressed according to the definition of 'compression' contained in this FAQ. Challengers who merely point out loopholes or exploits in the rules of the challenge will not be rewarded".
The moment Mr. Craig started iquiring about about the specifics of the rules and saying "wait and see" regarding his methods, that should have been a warning sign to Goldman that Craig had some trick up his sleeve. But no, Goldman was just so in love with the idea that he found some "impossible" task which suckers might try to undertake that he couldn't be bothered to think about such things.
What's even more embarassing to Goldman is his initial reaction to Craig's submission:
Even though Craig had already asked if more than one file was okay and Goldman had said yes!And Goldman's following point:
is even more ridiculous given that Goldman had already said that "sure" in response to Craig's question about multiple files whose "total file size" (which does not include directory entries) are less than the size of the original data file. Goldman tried to change the rules rather than admit he was sloppy in setting them up.Goldman also says "I think that I did not make a mistake", even though he admits in the same message that he needs to restate the rules:
"Perhaps"? How about "definitely"! And any "miscommunication" was entirely one-way. Craig was very sly and clever, and Goldman was too full of himself to realize what the fuck was going on until after the fact. A "pompus ass" indeed.Free Hans!
I think in his post to comp.compression Mike wrote that his random data had come from random.org. Random doesn't use a PRN generation algorithm; they get random data from atmospheric radio noise (tune your AM radio to a non-station and you'll hear what that means.) To remove any bias in the data they postprocess the noise by removing high order bits, and the low order bits are shifted to normal using a bit discarding algorithm.
LibBT: BitTorrent for C - small - fast - clean (Now Versio
What about raising some funds to sue the moron and give Mike some good lawyers ?
"Naughty, naughty, naughty, you filthy old soomka !"
No, because in almost all instances the formula would involve operating with multi-million-digit numbers totaling more then the original MP3.
You'd waste a tremendous amount of processing trying to find applicable equations that would rival current compression.
Of course if you're willing to take some loss then it starts looking up again. But that doesn't help for this contest. Who would want lossy white-noise? It'd be rendered completly worthless!
-Andy
Read the entire file. The challenger agreed that the entry could use multiple files:
=== Begin quoted text ===
>
> I meant can I send you a compressor and several compressed files whose
> total file size is less than the original uncompressed file and from
> which I can regenerate the original uncompressed file
>
> Patrick
Sure -- but you send me a *decompressor*, I don't need the compressor.
=== End quoted text ===
I do not deploy Linux. Ever.
The comp.compression fact discusses that for any given compression algorithm, it is possible to craft an input fill that cannot be compressed.
This is different than the challenge proposed, where Mike, the challenger, provided a file of size X, and Patrick, the challengee, was required to provide a de-compression program of size Y and a data file of size Z, where X > Y+Z.
The difference here is that Patrick was allowed to hand-craft the compression method to the data file, but Mike was not allowed to hand-craft the original data to the compression routine.
Basically, Mike was wrong, and should pay the $5K.
I do not deploy Linux. Ever.
Given a sufficently large source file of truly random data, I can create a binary and a data file which can recreate that one specific source file, losslessly, with my binary+data being smaller than the source. That is, I can create a routine that will work for just that one specific source file, no others.
The problem of 'Lossless Data Compression' refers to the opposite case- given any general purpose routine to compress files, I can create a file that it cannot compress. That is the oppposite of the challenge given, and actually is impossible.
As stated, even with the restriction of a single compressed file, the challenge *IS* winnable.
I do not deploy Linux. Ever.
It is impossible to write a routine that can losslessly compress every possible 3-megabyte file that could ever be generated.
It is not impossible to write a routine that can compress the one specific file that Mike supplied to you. That is, it is quite likely that you could write a program that can losslessly compress that one single input set, such that:
That would only be true for the general casse compressor-decompressor. I could write a routine that would work for that one specific input file, but fail miserably on any other input. This would win me the $5K, but be valueless for any other purpose.length of decompressor) + (length of compressed data)
The mistake both you and Mike make is that while writing a program for the general case of compressing random data is impossible, it is very different than writing a program to compress one specific file of 'random' data.
I do not deploy Linux. Ever.
Slightly over-zealous use of "Submit" without previous use of "Preview", methinks. ;-)
BTW, you can use < to get "less than", like this:
<
This is very very standard stuff in HTML.
--
http://www.gimbo.org.uk/
Heh - interesting: I've never used anything other than "Plain Old Text" (including on this message - but spot the italics!).
Strange... Hey ho, live and learn I guess.
--
http://www.gimbo.org.uk/
def test():
print 'so what do the other modes do?'
this_is_the_example_for('code')
return("cool, works the way I'd hope")
if __name__ == '__main__':
test()
--
http://www.gimbo.org.uk/
But tomorrow morning, I shall be sober.
Oh, hang on, that's not right.
--
http://www.gimbo.org.uk/
True, the simple fact that we can't do it doesn't mean it can't be done. However, other facts do mean it can't be done.
If all files of say, 1MB, could be compressed to a smaller file, then some of those compressed files would necessarily be identical. If they are identical they can't be decompressed to different files. Therefore, not all files can be compressed. Not today, not tomorrow, not ever. It has nothing to do with not knowing how to do it. It simply can not be done.
-- Andrem
There has been a major scientific break-in
Yes exactly. Whats wrong with that?
:)
He was given a challenge where it was obvious that the giver knew it was impossible (he never said he would use random data, but it would be the obvious way to give someone an uncompressable file - and he did it)
So he took this "impossible challange" and attacked the challenge itself. The challenge said "file size" so he put the data somewhere else.
So he showed up the trickster. Of course, the trickster then got defensive and said "no wait you didn't win" and started making up new rules... typical really
-Steve
"I opened my eyes, and everything went dark again"
Yeah, I know about <.
:-)
Just that I was submitting in "Plain Old Text" so I would have thought they'd let me get away it.
You know, I reckon it's actually *always* possible to win $5000, no matter what file is generated, providing it is of sufficient length. This is because knowing the file to be compressed *before* you create the decopressor gives you an advantage over the arbitrary random file case.
Furthermore, I reckon I can write a program G that will, when given any input file A of length>=X, will create a decompressor D and compressed file C such that:
D(C) = A
size (D) + size (C) size (A)
Don't know what X is yet, I supspect around 1 meg is fine though.
Anyone willing to bet either way?
Slashdot apparently doesn't like open angle brackets.
the condition should read:
size(D) + size(C) "is less than" size (A)
Slightly over-zealous HTML filter, methinks....
Uh, just spotted that isn't possible. Cos then you've created G(A)->(D,C) which is a universal compressor, and therefore not possible by your usual arguments. Course, I'm still interested to see if you could create G which works for a percentage X of possible values of A that gets arbitrarily close to 1. My guess is that this is possible, in which case you can still win money off the original bet by ensuring that X>0.02 Of course, if your opponent knew G, they could always choose a value of A that would confound it. But this isn't the case here. Also, it may be that X increases only very slowly with filesize, so that in order to get X>0.02 you would need a few solar systems worth of storage and CPUs to complete the challenge. Can anyone prove/disprove any of this?
Take a block of data. ("Four score and twenty...")
Convert it into bits. (1010001001001...)
Arrange the bits so that they form a decimal number. (564273274687632....)
Factor the decimal number as compactly as possible, by using brute force to encode the number as a series of prime powers. So the number becomes
(2^A)+(3^B)+(5^C)+(7^D)+... - Delta
So the original message becomes (an example, I have no ideas what the actual values are, this is a thought experiment) (2^71)+(3^151)+(5^0)+(7^12)-18763
Which could be compactly encoded as (71,151,0,12,-18763).
Presto! Godel originally proposed this I believe, and I read about this in aa novel 'Starburst' by Robert Silverberg. I can't recall anymore. Any thoughts? Would this work?
www.avacal.com -- the home page of pete shaw
There are so many random files, that the length of the number of the compression resistant file would be very close (although slightly less) than the length of the file itself.
However, there should be no way for the decompressor to generate compression resistant files in order to enumerate them. If the files are compression resistant, they would have to be stored bit-for-bit inside the source of the decompressor, enlarging it.
If he just offered $5000 to whoever could meet the challenge, then I'd say that intent was very important, he could argue that the "compression" scheme was not practical...he shouldn't have to pay.
But he had no objection to accepting the $100. Money has already changed hands. Now that he lost the bet, he wants to just call the whole thing off and return the $100? I don't think so.
So you're running them through an external compressor (perhaps it removes strings with runs or something of the sort).
The problem is that you are reducing the size of compression resistant files very slightly, however you have no way to represent non-compression resistant files.
So if the random number generator happened to spit out three zeroes in a row, and your compressor would try to eliminate them, you could NOT compress the file (even increasing the size) because your decompressor ONLY decompresses "uncompressable" files.
Anyone willing to bet either way?
Yes, assuming appropriate restrictions against storing information in the compressed file's name or attributes, I'll give you N:1 odds that you cannot write the program G, where N is any positive number. X can also be any number.
Or, save yourself time, don't try to write the program at all, just send me money.
It's not so easy to generate a file to resist compression. If you had and used a method for determining whether a file could resist compression, then I could compress your generated file simply by counting how many compression-resistant files preceded it in some sort order and writing that number as a binary file. Decompression would proceed by counting that number of compression-resistant files in the same order and writing the file found at the count.
However, your decompression program would have to be smaller than M - log_2 (N) = log_2 (2^M/N) bits, where M is the number of bits in the compression-resistant file, and N is the number of compresion-resistant files preceding it in your sort order. So, the number of compression-resistant files would have to be small relative to the number of possible files if you want more than a bit to write your program. Since this is not the case, you would fail.
Nope, you don't have to record it, simply attach the part of the original file following the block of zeros to your decompressor program. The "compressed file" is the part of the original file preceding the block of zeros. Your decompressor prints out the contents of the compressed file, then 1024 zeros, then it reads from itself the remainder of the file and prints that out. The bet can be won; not in every instance of course, but in the long run.
That reflects a mistake on the part of Mike Goldman. If you want your decompressor and compressed file to be a total of length N, you can store log_2(N) bits of information in how you set the lengths of the two files. If you have more than two files, you can store even more information. Mike should have required that (number of bits in original data)+log_2(length of original data) > sum of number of bits in each compressed file and decompressor + log_2(product of lengths of each file and of decompressor), and he should have permitted himself to rename or change the file attributes of any file he wished in case information was being stored therein.
Yes, or simply require that the contestant submit a single program (with prescribed file name) which can regenerate original.dat and is shorter than original.dat.
In that case, the outcome of the bet clearly depends on the machine model. If the machine model is "complete Debian Linux running on Pentium", I would conjecture that the bet can't be won, but it is far from clear (obviously, a succinct powerful language is needed, maybe A+). On many machine models, the bet can be trivially won.
No, the machine architecture doesn't matter. The bet can be won only if you are very astronomically lucky.
I'd gladly stake large amounts of money on this, though it would feel like I'm running a scam.
If you allow me to specify the machine model ahead of time, you'd lose. This was observed by Anonymous Coward in article 466.
As I understood the challenge, there was no requirement that the data I give you be random, though I might ordinarily give random data for such a challenge. If that was the machine model, however, I'd just give you a 0 followed by random data, and you would lose.
geocity's seems to be blocking it.
t m
here's my mirror
http://streamripper.sourceforge.net/compression.h
-Jon
this is my sig.
And the winner is... CmdrTaco!
Bow-ties are cool.
I think the biggest flaw in the given solution (and one that wasn't pointed out too well) is that, regardless of what filesystem is used, how efficiently the filenames are stored, etc., the compressed files cannot be stored in less space than the original data.
Think about this. For every one byte he takes out, he adds another file to his collection. Even if you store all the files in a linear block, like a TAR file or something, and store their filenames by a mechanism that requires no storage, you still need to encode the *length* of each of those blocks - requiring at least one or two bytes, depending on the block size. The challenge didn't say anything about the cost of storing the files in a filesystem (inodes or the like) but it did say that the compressed files, with the decompressor, needed to take less space than the original file - and the solution can't be stored in a way that meets those requirements, even under very generous assumptions about the conditions of the storage.
The challenge should have been stated better, but the challenger still lost IMO.
Bow-ties are cool.
What about taking that huge set of random data, representing it as a number. If you break that number down into say, a series of smaller numbers that when one number is raised to the power of the next number gives you the orriginal number? I've been thinking about this problem all morning, and this solution hit my while driveing to get lunch.= \=\=\=\
It may be a dumb idea, it may not even work at all. If it wouldn't work could someone explain to me why it wouldn't work?
=\=\=\=\=\=\=\=\=\=\=\=\=\=\=\=\=\=\=\=\=\
Mike: Give back the $100 with hearty congrats for finding a way to meet the challenge with true bar room challenge flare (con the con).
Patrick: Accept the $100, publically concede the impossibility of meeting the spirit of the challenge and relinquish honorable claims to the $5K (as you've done in the link). Buy a round for Mike and all the patrons at your favorite pub. Save the story as a "good one" for the rest of your life.
The little guy just ain't getting it, is he?
Anyway, I think that he could go a bit further. Depending upon the random number generator, it is possible to compress the file a lot - mainly because there are no true random number generators, except if you use some natural random source such as radioactivity decay. If the random numbers are in fact "pseudo random numbers" generated by some mechanical/computational device, then it is possible to compress the data, using some particular algorithm.
Going a litttle further, it is interesting to note that the combination of the compressed data and the decompressor is not completely self contained. The decompressor includes to a lot of information that is not stored inside the executable file. For example, libraries can be linked. Even if the decompressor is written in assembly, the problem is still there - a lot of information may be contained in a single assembly instruction.
As an exercise, imagine the following scenario:
- The original data has some particular sequence repeated in some places (even a random file can have such repetitions; in fact, a file containing only ones can be possibly generated by the best random number generator; it's improbable but it is possible)
- This particular sequence can be regenerated by a simple CPU instruction
- The decoder (written in assembly) uses this assembler instruction the reconstruct the sequence, using only one byte of code.
Is it a *real* compressed file? In this case, the CPU does contain the information needed to decompress the file; even if the information is not inside the compressed file itself, the information is part of the context. This is just to complicate things a little bit further...In the end, we need to keep mind that using information that is implicit in the context is all that compression is about. We are removing redundant information that can be inferred later based on the context. In a contest like this, the context should be clearly stated to avoid this kind of discussion. Looking at things this way, maybe Patrick was right after all...
RTFS (read the fucking site ;)
From then on, I was just trying to avoid the situation where he said that I had made up either the whole thing or the emails where he agreed to multiple files. That's why I said I didn't want the $5000 or for him to reveal my identity. I thought that if he thought I wasn't going to say anything publicly it would make him feel more secure and more likely to talk to me and other people, and to tell the truth to comp.compression, thereby making it more difficult to deny everything later on. I said he could keep the $100 because I think that taking it back would be admitting that I failed or cheated.
Before you email me, remember: "There is no god!"
1. A Compressor does not need to be provided.
2. There is no time limit which limits how long the decompressor can take.
Solution:
Create a program that can generate and check. You already know what the md5sum of the result should be. In fact you know what the result should be. Systematically generate data in memory, or with a seed built into the "decompressor", and then check your results. If they match the md5 you can then write this file out to disk and you're done.
Given enough time I think I can prove mathematically that this will work.
In fact I can speed up the computation significantly if I break up the original file into chunks, and md5 those chunks. Store those md5 sums in the "decompressor". Now I know the chunk size and md5 sum, I can permutate the bits until I get it right, and then write out the bits, chunk by chunk.
Now you have a method which you can use to "decompress" the orginal file out of nothing. Of course you can argue that this is not a decompressor, so fine. Given the same orignal program you can have it "decompress" a file which contains the first x bits of the original. You can use this file as your "compressed" file, and as a means of speeding up your generation (err, I mean decompression).
With this system I believe you can beat this challenge given enough time. This system does not hide any data in the file system. This system will also take a lot of time to realize a solution, but it will happen eventually.
You can use this for arbitrary data too, as long as the original file is larger than the decoder program and the original file's SHA or md5 hashsum represented as a file.
People can't go around making promises and then refusing to stand behind them.
If Goldman doesn't pay, I'd urge the challenger to retain a tech-savvy lawyer (one in the challenger's home state). The lawyer should simply send a letter to Goldman threatening to sue. That should take care of that. If Goldman still won't pay, the challenger has the option of suing him.
Disclaimer: I am a law student. The above is not to be construed as legal advice, and no attorney-client relationship between the author and any recipient of this message should be inferred.
I am not a lawyer.
The theory was that each teams set of data was incomplete so they would have to cooperate to get the suitcase open, thus teaching us the importance of acting as a whole team or something.
However, I have always had an interest in locks (and "unlocking" them) and the briefcase had those cheapy combo locks on it so I wandered over and had it open in 20 seconds.
Lesson: Teams are good but sometimes an individual makes the difference.
Unfortunately, because things had gone wrong, they decided to usurp my victory and give the final activity (locate an item in the woods) to both teams. My team lost. I'm not sure what the message there is but I chose to learn my own message which is "Messing around for the day is much more fun than sitting in an office working"
Rich
It probably all balances out in the end though (weight vs inertia etc.)
Sure, but then your decompressor has to include the entire contents of the original file, or a compressed version thereof. In order to get this compressed version, you would, of course, need another compressor, and YOUR compressor would be more or less worthless.
First of all, the page with the story was down, so I didn't get the jist of it. But, for the argument, it I might guess that the challenger's intent was to post an insoluble problem: submit a program (or algoritm, again I didn't read the story because it was down) that is guarenteed to compress ANY random data of an arbitrary size. The problem is this: If you can compress a single bit out of truly random data, your result is truly random, thus you could compress an arbitrary length sequence to 0 or 1 bits. Obviously this is impossible. You can't represent something with nothing. At test of this is simple. The challengee could not compress a file with a single bit of 0 and a single bit of 1, if each were in a vacuum. Thats my 2 cents.
Not that I feel like trying a challenge like this but it seems if you ask for about 60gig of data and analyze the hell out of it you will be able to get sufficient patterns in the data to get the compression needed. A complete waste of time but would beat this challenge because its not very good. Of course this is what any math or science type person would be able to tell you so I'm not saying anything revolutionary.
"You can now flame me, I am full of love,"
Well since splitting the files was allowed, let the filesystem record the position with a file split. The key then would be to get a sufficiently large run that you could "throw out" to allow you to fit in the decompressor and the filehead penalty. Possible in theory but I doubt the example file had any sufficiently long runs.
"You can now flame me, I am full of love,"
This is definitely off topic but I'd think the bigger advantage of DVD's or CD's as tires is that they are far nearer to perfectly round than just about anything common that I can think of and most certainly when compared to something cut out of wood or paper.
"You can now flame me, I am full of love,"
One of the interesting caveats in this argument is the reliance on the Turing Halting Problem, which means that this proof is restricted to finite time. However, I don't think "Patrick" would have got his $5000 if the "decompressor" had taken infinite time to run, despite the slack wording of his challenge
I've been posting on the net since 1994 and I still haven't come up with a good sig!
If you try to compress random data using one generic algorithm, you will sometimes have size savings, and sometimes will end up with a larger file. If you are smart enough to put a "not compressed" flag in your data file when it becomes larger, you can reduce the size on average.
But, if you are free to select your algorithm after seeing the data file, chances are you can always compress the data (although I may not mathematically prove this). Remember the article here on slashdot a few weeks ago where a prime number, when gunzipped, produces the source code for a DVD decoder. Well, you may say that they are not the same because the source code was compressed with gzip beforehand, but please think a different algorithm than gzip (as in mandelbrot fractals) and you *might* end up with an algorithm which compresses some data which is random to you but not-so-random statistically or mathematically.
PI is infinitely random, but can't you compress it in a few lines of code?
Although I should admit that finding the algorithm itself would be mostly by chance.
Donate free food to the hungry at The Hunger site.
Since when did Data Compression become Data hiding? It is simple that he "Moved" data but never compressed it. The quest was for a Lossless Data Compression and any person involved in basic data compression would testify that this did NOT meet the requirements of Compression but the requirements of data hiding.
He lost, he complained, but simply put he paid 100 dollars to play word games with the intent.
I can program myself out of a Hello World Contest!!
If he was a REAL PROGRAMMER he would have written his own file system! Tailored to the actual data!
Oracle and unix guy.
Patrick's assertion that the size of the "compressed" files plus size of the decompressor is less than the size of the original is false. The compressed files have replaced the ASCII character "5" with the ASCII character EOF (end of file), a one-to-one substitution that doesn't reduce size.
His trick implicitly relies in the way "wc -b" operates: it doesn't count EOF in the number of bytes it returns. One could argue that the true size of the compressed files is + and construct a tool to measure file size according to this definition, which would show that Patrick failed the challenge.
However, Mike's arrogant attitude and posturing to the newsgroup gets him no brownie points either. His challenge is impossible in spirit, but his poor wording allows trickery.
>This is very very standard stuff in HTML.
Its not standard in `plain old text` mode though - at least, not around these parts! Why is this altered from the input at all?
good point. actually, i toyed with pre (i`m scared to put slashes and things in my posts now!) once, and it used a differenct font, so i stopped! I`m not really a html coder though!
Lossy implies decompression will *not* return it to its original form.
Data can be expressed as a number.
Random data can be expressed as a random number.
Take A to be the random number.
Take B to the the nearest prime number.
Take r to the difference between the prime number and the random number.
B+r=A
Likewise take C to be the nearest prime number to r and so on...
B+C+D+E+F+r=A
The data in your compressed file could consist of;
32334/2000/231/22//23222
(obviously you would need to delimit them differently)
Your decompresser would add the 32334th prime to the 2000th prime to the 231st prime to the 22nd prime and add 23222 to the result.
So long as you know the data to be compressed why could you not *rig* something like this to work?
Then again...
I assume it wouldn't...
________ semper ubi sub ubi
works on freebsd, anyway
-------------
#!/bin/sh
fetch http://211.0.18.242/~patrick/other/original.dat
ok then your [sic] infringing on my copyright! Could you as [sic] me next time before STEALING my comments for your own?
Kind of true: you can't universally compress random data using a single program, but this doesn't mean you can't compress a single instance of random data.
This competition is kind of "balanced" because:
Given a sufficiently large file (a couple of Gb would probably do) then I think virtually any file could be compressed with a specially crafted compressor and decompressor. Given a 2Gb file, you only need to achieve .0001% compression to have 2147 bytes to write the decompressor.
"Mary had a crypto key, she kept it in escrow, and everything that Mary said, the Feds were sure to know."
--
--
We have fought the AC's, and they have won.
unfortunately no, and I can't remember where I first came across it either. If you have any further information I'd be very interested, as I would find it quite embarrasing if I credit him [Derek Bok] on false grounds.
"If you think education is expensive, try ignorance" - Derek Bok
Hopefully, at least, he learned his lesson, which most likely was all Patrick ever hoped for anyway.
"If you think education is expensive, try ignorance" - Derek Bok
Pick a large enough length, and you are virtually guaranteed that there is a 1024-byte block of zeroes in it somewhere.
Your reasoning contains an error. What were you thinking, arguing with an MIT guy, for heaven's sake! :-)
As the length of the file increases, the number of bits needed to *specify the position* of this run of zeroes also increases. And you're going to have to record that position somewhere in your decompressor or compressed file.
For example, to a first order, you need a file of length around 2^1000 bits to expect a run of 1000 consecutive zero-bits. Great, But specifying the position of this run requires-- you got it!-- a 1000-bit number. We're hosed!
Many people have been citing the usual argument against a universal compression algorithm: it's impossible to uniquely map all 2^n combinations of n bits into the 2^(n-1) available combinations of n-1 bits. But that wasn't Mike's challenge. As Patrick pointed out, Mike only required a challenger to compress a single specific file, which Mike would provide. Was Mike smart in making this bet? We know that some of the 2^n files Mike could provide are sufficiently compressible: he might just randomly generate an ASCII file or a file with a long string of zeroes, for instance. So the question is what percentage of the possible files are sufficiently compressible? "Sufficiently compressible" would mean that they compress enough that whatever compression algorithm is used can fit in the freed space. Mike's bet is only a good one if less than 2% ($100/$5000) are sufficiently compressible.
That's almost a good idea, and it made me laugh when I read it, but your decompressor would surely take more than one bit, wouldn't it? So the total size of the decompressor plus the compressed file would certainly be greater than the original file (this was a key part of the original challange).
http://www.bootyproject.org
OtakuBooty.com: Smart, funny, sexy nerds.
Yeah, I thought the same thing at first, however...
If you read carefully in the email corrispondance, the random data was gotten from random.org, which offers truely random data, not pseudo-random data as generated by some algorithm. Unless you have a nifty algorithm for generating the values of radio noise (random.org's external source of white noise), this won't work.
"
I reckon it's actually *always* possible to win $5000, no matter what file is generated, providing it is of sufficient length.
"
Nope,
If I have N bits of file. There are 2^N possible files that can be created. There is no way you can always have a compressed file of length 2^(N-1) in length simply because you only have 2^(N-1) distinct possibilities leaving 2^(N-1) files untouched.
However, I intend to compress 2^(N-1) files into a length of 2^(N-1) and the other 2^(N-1) files into a length > 2^N. Since there's a 2^-1 chance of getting each file I expect an average return of $2400. (2^-1 * $prize - $entry).
Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. (Einstein)
Sorry, I meant to say length N-1, and length N.
I've muddied my explanation however hopefully the point is clear [it's possible to compress 50% of the files to smaller than the starting length at the expense of the other 50% being larger].
Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. (Einstein)
Ignore the last bit
Supply it at random.
50% of the time you're wrong and lose your $100
50% of the time you're right and win $5000
Of course this loses the spirit of the competition [design a universal compressor which is clearly impossible] but wins the other competition of 'how do I make the arrogant prick look like a fool'
Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. (Einstein)
Ok, I've done some information theory, I know it's impossible to create a perfect compressor. Boxes arguments and all that.
However, to make money, you don't have to.
All that is required is a compresser that saves 1 bit on better than 1 in 50 attempts.
For example, if we can save 1 bit on 50% of the attempts - even if the other 50% of the attempts results it catasrohpically large files. On average we've lost the compression war.
However, on 100 attempts, we've paid $100 x 100 or $10000 but we've won on half of these - netting us $250000 or a cool $240000 profit.
Anyone want to demonstrate this?
Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. (Einstein)
Shooting from the hip without having been able to read the content, it is only theoretically impossible to compress random data assuming uniform distribution of all possible characters. usually there will be enough variance in a finite sized randomly generated file to allow for some compression. the trick is making sure that the addition of the decompression map to the file does not make the file bigger than it was in the first place. (that's why using lossless compression on the same thing twice will frequently give you a bigger file)
Here ya go. GZipped. "1F8B0808EA9CE53A0003666F6F00732C2ED653F0C8CF492DD 6E3E5020019F95E3F0D000000"
blah blah blah lameness shmameness..
The face of a child can say it all, especially the mouth part of the face.
The second i read the mail about multiple files i understood what Patrick was going to do. Putting information in filenames is a very old trick that was used to get around file system quotas for a pretty long time. But it is pretty stupid of Mike not to include a clause that filenames themselve are also data.
OK, how about this: The "random" data had to be generated from some finite set of data. Whether the original "randomness" comes from a bunch of a pictures of lava lamps or static from the radio, it's still a finite amount of information at the beginning. Then, you apply a bunch of randomizing algorithms to it to generate a large file. If I request a file to be generated that is larger than the initial input and the source code to the randomizing algorithms. Then my problem is to find some initial input and combination of algorithms to generate the file - assuming a long time to brute-force it, you might be able to win in this way - the compressed file is then the initial randomized input, and the decompression algorithm is the exact same series of randomization algorithms that was used to create the random data.
How does this sound?
NB: I've not studied information theory.
A different slant on a troll, nothing to see here, move along.
-- www.globaltics.net
Political discussion for a new world
Other assumptions, obviously, are a little more obscure. Sure, Patrick would have done well to have read the FAQ, but even without having done so there are some things he certainly knows. Without having done any Information Theory, I know that the objective of any compression program is to take up less disk space, and I think it reasonable to assume that Patrick knew this as well. Therefore, as much as I applaud his creativity, what he wrote was not compression.
The only issue remaining, then, is whether the wording of the challenge should override the information Patrick could reasonably be assumed to have. I think not. Every hacker instinct I have says that the challenge was intended to spur people to write better code, or at least to get them thinking about how code can be improved. Focussing on the wording of the challenge and saying the equivalent of "well, how was I supposed to know an envelope didn't count as a filesystem? It's not my fault a magnifying glass isn't machine-readable" goes counter to the spirit of the challenge and of constructive programming.
Therefore, Patrick, I'm afraid you did not successfully meet the challenge. Reading what you wrote, I did not think about the space filenames took up until I read the email response to your final submission. I suspect you either made the same oversight as me, or chose to ignore it in favour of rubbing Mike's nose in his choice of words. Either way, you made a valiant effort, but did not earn the prize.
LAI
"How do I type "for i in *.dvi do xdvi $i done" in a GUI?"
:eof
Goldman comes across like a pompous ass in this series of emails and he deserved to be showed up.
After Patrick accepts the challenge Goldman quickly jumps on the newsgroup to proclaim how some sucker fell for his trick.
At the first sign that his little charade is in trouble he hints at legal action against the challenger.
How classy.
Patrick, you should settle this bar room bet in the same manner most bar room bets are settled: a haymaker to the jaw of the jackass challenger.
- Toby
What if you named the files randomly but recombined them in the order they'd been placed in the directory? Checking the date on the file would probably look like cheating to you, but it's quite possible to build a directory mechanism in which files are simply listed in order of creation with no explicit time stored. By your logic he only failed if he's not instantiated on such a system - but there's nothing intrinsic to his method that rules out doing just that.
<p>Of course even such a system would have to have some marker of the beginning (or end) of each file, you'll say. But consider when I compress fruit into jam. It's the same degree of compression if I end up with five one-gallon jars or on five-gallon. Sure, it uses more jars and labels the first way, but no one's going to claim I've made more jam. And if there's a cubbord I need to put the jam in that the five-gallon jar won't fit into, the one-gallon jars may even be the more effective compression scheme - less rather than more costly storage because I don't need a new cubbord.
"with their freedom lost all virtue lose" - Milton
So he upped the stakes, put your money where your post is, or run back to comp.information.hiding ....
Where's the problem?
ShunScene
After reading that page, I have to wonder if there is some language barrier that is being argued. Case in point:
Last, the contestant will send me a decompressor and a compressed file, which will together total in size less than the original data file, and which will be able to restore the compressed file to the original state.
According to the wording of the challenge, in order to "complete" the challenge you had to produce "a compressed file". Note the singular specific article. The person performing the challenge did not meet this requirement. He produced 218 uncompressed files.
Education is a better safeguard of liberty than a standing army.
Edward Everett (1794 - 1865)
As an aside, I find it amusing that one of the people he quoted regarding the outcome of the contest was one David Scott, who sounds a lot like someone who also posts to sci.crypt, and gives ./ trolls a bad name. The SCOTT16U.ZIP controversy, in which he personally blasted Bruce Schneier, was particularly amusing.
Ok, I have done number theory, I have read all the posts, I still think this can be done especially since you have prior knowledge of the file your compressing. Given all the time in the world to analyze it, this could certianly be done...sounds like a fun way to use distributed.net actually. :>
Power Corrupts,Absolute Power Corrupts Absolutely, leaving one person(group)in charge is absolutely corrupt.
Now, I don't know a thing about compression or anything else, and never would have known such a thing was impossible, but that's besides the point. Had I in the least bit cared, I would have seen this challenge, tried it out, realized the futuility, and said "oh, would you look at that. He was right."
Instead our society has raised everyone to be a bunch of babies about wording and specifics, just because everyone wants something for nothing. Despite what you say about Michael, it was obvious what he wanted of this challenge--to prove a point. Patrick completely ignored this point, took advantage of the fact that Michael didn't prepare for every nitpicking possibility, and found a loophole to an honest challenge. And to what end? Just so everyone would try to find similar innocent challenges and take advantage of them too? Simple product of a litigious society--argue everything until you win.
Oh, and to say he was trying to make money by thinking people would flood him with 100 dollar checks is just ridiculous--he stated that the challenge was impossible, and how many people are dying to beat an impossible compression challenge? Not many.
Maybe technically Michael should pay 5000 dollars... but technicalities aren't the best way to win everything.
Section 9.2 of the FAQ, the Counting Theorem, says that "no program can compress without loss *all* files of size N >= 0".
Mike Goldman's challenge explicitly involves "a specific known file" of size N, where both N and the data within are arbitrary and predetermined.
Mike's intent is obvious, but his logic is horrible. It's quite clear that Mike's challenge has HUGE gaping holes in it's mathematics, statistics, and logic as compared to the Counting Theorem. There's nothing "for sure" about Mike's challenge. Just reading it by itself made me sure he's going to lose, if not this time, then another time. He set himself up to lose.
This guy's a genius.. he should patent using gunzip $1 to compress a file!
because you can't compress random data. He can just generate a file of random data. This is just a scam so he can make $100 off of every entry.
Keeping
What were the /. readers' opinions on this subject? It sounds to me like the person offering the $5000 was scamming everyone because it's theoretically impossible to win. This fellow from Japan knows this, but due to a wording technicality in the contract, he figures he can scam the $5k from the scammer. They both seemed to be good natured chaps about it, but I feel that they still have some deep resentment towards each other.
Any other opinions?
Keeping
Sadly, so many people have overloaded the page, that I suppose GeoCities has taken the page down :-(
Fruit flies like bananas... Time flies like the wind...
On a seperate subject compression of random numbers is not imposible just very unlikely. Maybe one day we will find a way to represent any randum number in a simple formula. Because we are unable to do this today does not mean it is impossible, just unlikely ( I will not mention how many inventions we take for granted today which were thought impossible in the past)
DRM? No thanks, I'll just get it somewhere else...
but....
You've got a large amount of truly random data..... except few things on a computer are truly random- they have to come from somewhere, right?
So, lets say that Mike used a certain program to generate the file to be compressed. (Didn't he use some third party company?) Now if one knew the algorythm used to generate the file, would it be possible to reverse the program and obtain the original random seed, and then submit the random seed and the algorythm?
If Mike arbitrarily rearranged the bits, or used multiple transformations of the original data, this would be obviously be difficult or impossible to do- but what if he was lazy and only ran it through once, is recovering the random seed possible?
Alcohol, Tobacco and Firearms should be the name of a store, not a government agency.
Even with two files and without hiding data in the names you can split data between the decompressor and the file and the the ratio of sizes between the two is an extra bit of information not expressed in the sum of sizes, with a sufficiently large data file this will always allow you to solve the challenge exactly as originally stated.
The wording of challenge allows it to be easily solved... rewording it after the fact is indeed slimey.
Well, if Mike had specified anything other than "file size", I'd agree. But he said "file size". And the file size of what Patrick sent is less than the file size of the original file. Balls to directories taking up space, or more files meaning more space is taken up. File size is file size.
I don't believe that the intent was followed. However, as this challenge was "impossible" to begin with, I think that the $5K should go to Patrick for creativity. He met the letter of the challenge, which in this case should be good enough as the challenge as it was intended was "impossible".
Mr. Ska
I slit a sheet
A sheet I slit
Mr. Ska
Pity. I thought you were holding out something on us there ;-)
But couldn't you write a program that will calculate pi, and then compare pi to the random data, and in any place where they overlap, substitue the starting digit and size from pi. Maybe this doesn't work because the "starting digit" itself would usually be bigger than the data saved, but thats the only problem I can see. With the option of asking for a file of any size, I'm sure an 8gig file will have several places where the bits match up directly with some part of pi. This program would be slow- calculating pi takes time I understand- but there was no time limit or speed requirement.
I love going down to the elementary school, watching all the kids jump and shout, but they dont know I'm using blanks.
This is an obvious case of rules lawyerism,
where the contestent looks for loopholes in
the rules in order to exploit the game itself
instead of the spirit under which it was set
up. Why people should get upset when the other
person responds "but that is not what I meant"
when it should be perfectly obvious that they're
not going to mean that is beyond me.
C//
If you weren't in Japan I'd recommend a http://www.prepaidlegal.com/, which is a legal hmo that's usually been v. good for me.
"Be thankful you are not my student. You would not get a high grade for such a design
Just how smug do you have to be to bother to announce a challenge like this just because you are so proud of yourself for knowing a fundamental result in information theory. The answer: Really, really smug.
That's because good compressors remove as much redundancy as possible on the first pass.
By disguising it as a contest ... the owner of the contest might be able claim the algorithm as property of the contest and reap the rewards, or the algorithm could be made public for everyone to benefit (slim chance?).
BTW; compression via division/square root of values will tend to have remainders that must be stored, often creating more data than the original. ref: post 483
m.mmm..myyy
i think he was joking... the lossy compression was one of those wonderful April Fools posts we all loved so much.
Never con a con man?
Coffee can lids. Round, thin, cheaper than harddrive platters. I have seen many a geek go bad because they get so stuck on getting around the rules that they are trapped in to a certain train of thought. Thinking outside the box is just like realizing that the spoon doesn't exist.
This fellow paid $100 for the opportunity to stick it to some some other fellow. He made a $100 bet, on the off change that he'd get $5000. If he had a brain in his head, he'd send his address and get his money back, and laugh about joke. Sticking to the "I met the stated rules" just look dumb, especially when you're out $100.
Information THEORY aside, the challenge can be met.
Imagine the bits of a file, lined up, bit-0 through bit-n, as the output of a boolean logic function. The number of variables required to produce those bits is Z, where 2^Z = Total bits.
So, 1024 bits requires a maximum of 10 variables, but usually less, depending on the data.
Write a program that will find the boolean equation that, will produce bit 0 of the origional file if all variables are set to 0, and bit 1 of the file, if all but the last variable is set to zero (a binary count, you get the idea?)
The resulting equation will have log2(Total bits)+4 unique symbols. Now reduce the function to eliminate all unneeded terms. Now write the equation to a file, prepending the binary number equal to the number of bits in the origional file.
This procedure, if run on a large data set, like megs, will result in the highest level of compression possible, because boolean reduction WILL find the minimum possible symbols for even random data.
One reason this procedure is no good is that the processing power, and storage capacity needed to achieve it is incredibly large, and exponential with origional data size.
I've written the code and yes, it works great, but I have not been able to do files of any size due to the fact that expanding the boolean terms takes 2^(bits in origional file)/2 terms, which is quite a chunk of change.
Information THEORY is just that. If you can code up the above, and run it on data sizes in the megs, you won't need my $5000 bucks, you'll have plenty of money rolling in on your own.
Above comment is personal opinion. Poster is not a spokesperson.
how about:
100000=100^2
my compressed data reads: 100
the decompressor original=data^2
with larger numbers you probably make larger savings, especially if you can use squares.
Change my original post to read any number can be defined as n=a^2 + b. It is still true and for small (yet still huge) files ie: 100KB-1MB, you make massive savings.
Again though it works as a one off compression for a given data set. And again, I'm no mathematician and not exceptionally talented computer-wise: I could be completely wrong.
this signature is a virus, please make me your
Should have thought about it more, but it was a nice distraction from my history thesis anyway...
69 hours and counting
this signature is a virus, please make me your
no I've been working on this. The problem is you will need a key at least larger than the bytes you don't represent. If you have 500 bytes, the needed informations to represent those bytes will be 500 bytes or more.
Well, that is interesting. If you can find a mathematical equation which is smaller than the "string" to compress, that would work. However, there will always be strings you can't make equations for... Sometimes, you would need a equation of 128 bits so you would enlarge the file. You could use a mark to say "yes the following is compressed and the equation is x bytes long" or "no, the following x bytes are uncompressed". Maybe you could also use less than 256 mathematical symbols so you can make better equations... In fact, if this method could work for like 64 bytes on a 1kb file and that the method is applied to a lot of files, the size of the compressor (which could be huge, because it tests mathematical equations) could be won by the reduction of all those files. But, again, this method will not ALWAYS work... so there will be a time when a given file can't be compressed anymore.
Doesn't quite work that way, but the difficulty is very subtle. You have to tell the decompressor where in the file the string of zeros starts. A string of 1000 zeroes happens once in 2^1000 bits. So you need a 1000-bit pointer to say where to find it. The key to the impossibility of compressing the truly random file is the true randomness. Any pattern and yes it can be compressed. No pattern at all, and no compression. Cutting the file in two is sneaky, but it's information hiding, not compression. There's information in where you choose to cut the 2^1000 bit file.
Curtains for windows?
Interesting. Maybe any 'random' file of finite length is not entirely random- there is information in its length itself. Perhaps this is what allows for your loophole of the (extraordinarily unlikely, but not impossible) random file which is all zeros. This is only possible with a finite length file. An infinite length random file will demonstrably have no patterns.
Curtains for windows?
You know, various stuff like:
Who won (legally?)
Who won (morally?)
etc etc.
And something todo with cowboyneal, probably....
Two wrongs may not make a right, but three
At no time did Mike say that multiple 'compressed' files were acceptable. He only used the singular "file". However, he was asked, point blank, "Are multiple compressed files acceptable?" and gave a less than straightforward answer.
Mike ought to send the C note back. Even though Patrick did not complete the challenge.
- Dan I.
But, if you are free to select your algorithm after seeing the data file, chances are you can always compress the data The point is that this file is skew corrected so you can't find such an algorithm. Besides most of the time you can't find such an algorithm with random data (and this can be proved.) Think about it like this, compression depends on some repition or pattern. For every set of data that has such a pattern there are multiple adjustments to this data that will destroy this pattern. So in a stricktly random sence more sets of data are uncompessable than compressable. Obviously this isn't how the proof would be outlined but you can see how it would fall from there.
You don't seem to know what a ponzi scheme is... A ponzi scheme is something where the early investors are paid off with money paid in by the later investors. It grows and grows as new people join, and all the early people are happy... but soon, the rate of people joining slows and there isn't enough money to pay off people anymore and the whole thing collapses... this has nothing to do with that whatsoever...
I must burn in hell, suffer and pay for my sins
But Gods the one who's losing, Satan always wins!
Well, I don't see this other scheme as a ripoff at all... Even taking your example about fermat's last theorem... what do you care if someone is willing to pay to find out what the theorem is? I mean, there are entrance fees to enter into tournaments all the time... what if I had a quake tournament where the entrance fee was $5 and the winner, if he had been undefeated, would get prize money, otherwise nothing... whats really the difference between the two?
I must burn in hell, suffer and pay for my sins
But Gods the one who's losing, Satan always wins!
Well, he did mention that the 100 was to scare off people who weren't serious about it... I mean, I probably would have requested the data if it was free and I knew about the challenge... if he had charged even $1 though, I would have considered it too much trouble...
I must burn in hell, suffer and pay for my sins
But Gods the one who's losing, Satan always wins!
Going back over the exachange covered on the webpage, I'm more inclined to agree with you now.
I must burn in hell, suffer and pay for my sins
But Gods the one who's losing, Satan always wins!
Unbelieveable. This guy comes across a theoretically impossible task, and turns it into a scam basically, whereby anyone who wants to try to meet the challenge must pay up front. This is truly disgusting.
BR>If this were something where a person sponsored it with a little web space, publicity, and the prize money, then ok, I can understand that, that's fantastic. But this was nothing more than a very thinly veiled scheme to make money off of other people's attempts at a valid challenge.
Hey, I wish Fermat's Last Theorem hadn't been solved. Then I could put up a site, saying "I'll pay $5,000 to anyone who can solve it, but first send me $100. Once I have the money, I'll tell you what Fermat's Last Theorem is."
For your security, this post has been encrypted with ROT-13, twice.
Uh, yeah, I know that. It's called humor, dude...it's not made to fit like car parts in a Ferrari :)
In other words, I was associating one scumbag scheme with another.
For your security, this post has been encrypted with ROT-13, twice.
The difference is mostly in amount, and what purpose it really serves. If you're running a quake server (hardware investment, bandwidth costs) and charge $5 a participant...well, I can see that, particularly since you're offering a service as well. But this guy didn't really offer much, did he? Do you really think that it incurred that much cost/trouble/work to get a bunch of random data (which someone else generated, btw) and FTP it to someone?
It'd be like trying to charge $800 for a Red Hat distro, to cover shipping and handling or some strange thing like that. And what's more, it makes me question the motive behind the whole thing. My opinion is that this wasn't motivated by scientific curiosity, but just greed. He picked a challenge he didn't think anyone else could meet, and charged $100 a head just to try. Where do you think that $100 ended up?
For your security, this post has been encrypted with ROT-13, twice.
The challenge, I think, was well stated. It was clearly stated the data was to be compressed. In this case, the data was hidden by way of the filesystem. There wasn't anything clever or even novel about what that guy did. He didn't complete the challenge and the time spent on both parts was wasteful. Of course there is going to be loopholes, and literal translations, but this was simply a play on words. The point is both parties understood the terms, both were trying to trick each other AND make money at it, and both just ended up wasting time.
"I'll just chip in a bit for RedHat: I actually have that installed on my university machine." - Linus, '95
An MD5 or any other checksum is only "unique" for a given file it it has as many bits as the original file. If you have 36 bits of data and do a 32-bit checksum, there are 16 (2^(36-32)) possible files that will all have that same checksum. Even if you use two non-overlapping algorithms, you cannot recreate the original file. You can create a file that has the same MD5, the same CRC32, and the same file length, but the data will be completely different.
First off it would be better to choose a pseudo-random sequence that is easier to generate, something like RC4 which is designed to run fast.
Secondly the index into the random stream would increase with the number of bits, so there might be a section of Pi that met the criteria, chances are that it would be a very large number of bits out.
There is however a possible approach that is arguably not cheating. Chances are that the person setting the challenge did not generate a perfect random stream. If they are generating megabytes of the stuff they are most likely using some pseudorandom number generator to do so.
Cryptanalysing the input stream to identify the source could well reveal the random number generator. This would not be a cheat in my opinion.
Of course the person setting the challenge could have done the job properly as I did for a one time pad system once. What I did there was to hook up a microphone, record a reasonably random source (a loudish fan) and sampled at 16 bits. Then after determining that the ergodicity of the source was at least 4 bits per byte I ran blocks of 1K of data through a modified version of MD5 to remove any bias in the input stream.
Looking for an Information Security student project suggestion?
Try http://dotcrimeManifesto.com/
I suspect that a good enough mathematician with the right expertise probably could meet that challenge without "cheating" by shifting data into the directory. In the simplest case, if Goldman derived the file from a psuedo-random number generator, you guess the formula used, then your "decompressor" is the generator and your "compressed file" is the seed value. More generally (like if the data was truly random from an inherently random quantum process), you try matching pseudo-random (did I finally get the spelling right?) formulas against parts of the file until you hit something that fortuitously matches for long enough. Snip that segment out of the data file. Write the "decompressor" to copy all the bytes before, then generate the pseudo-random sequence, then copy the rest of the bytes.
However, I think if you are genius enough to make that work, you probably have easier ways of making $5,000...
...or did the author ask that his personal information NOT be posted to the newsgroup, but then he turned around and one-upped that, and posted it to /.!
Did I miss something...
--Those who can, do; those who can't, sue
There are some prime numbers which can be represented by (2^n)-1 (e.g. 3,7,31 but not 15 or 63)
That sounds like you're saying 15 and 63 are primes. Is that bad grammar or bad math?
Nope, no sig
Nope, no sig
If the guy had read the FAQ, he would know that such tricks are not compression
Did the rules as stated say that the solution had to comply with the principle laid out in the FAQ? Not that I can see. If you're going to pup up a $5,000 challenge, you damn well better post all the rules.
For instance: I had a friend who entered a paper airplane contest. He signed up for the "maximum distance" part. The only rules were that the contest organizers supplied the single piece of paper, you had five minutes to fold or tear it however you want to, and then everyone threw them at the same time. The one that stops moving farthest from the launch line wins.
He waited until 4:45 into the time, then crumpled the paper up into a ball and threw it. While everyone else's planes were circling lazily around, his went straight, landed, and rolled several more feet.
While you can debate all you want about whether his wad of paper counted as a "paper airplane" -- and we did debate (hey, it was college) -- he did saitsfy the rules of the contest. Had there been enough money on the line for someone to threaten legal action, I suspect he would have won.
Nope, no sig
Just find the spot in pi that contains a string of numbers that can represent the file. It's gotta be in there somewhere.
what a loser.
you know, i wouldn't give patrick the 5k either
in fact, i would have just loved it to death if this was a scam, and he lost $100 in his attempt to be clever.
however, in the same spirit, i think that the the chalengee (Mike) should send him the 5K US in whatever the Italian equivalent of the penny is, and send it COD... how many tons of coins would that be?
A: None. The Universe spins the bulb, and the Zen master merely stays out of the way.
Send me your $100 entry fee to take part in my new contest, there is a $5000 prize! To win all you have to do is get FP with a goat post in every single Slashdot article that appears over the course of the next 24 hours.
Wow alot of worthless text... damn
God spoke to me
Sorry if someone already asked this question, but this thread has had a lot of activity.
Wouldn't this be an acceptible solution to this challenge.
Recall file length can be stored in a DWORD.
Begin by searching through the data file for the most occurances of subsequences of length M > 4. For each occurance of this sequence, you can store the position in a table, costing 4 bytes, and remove the sequence from the data file, reducing it by M bytes. This gives you a savings of M-4 bytes of each occurance. Depending on your luck, you may be able to find enough occurances to allow the decompressor to work, which would need to house the following information :
1. the original subsequence (size M)
2. array of locations N (size 4*number of occurances)
3. filename to decompress (2 bytes)
4. output filename (1 byte if needn't be the same). Or alternativey 3 & 4 can be stored in the command line.
5. code to implement the reintroduction of the subsequences into the original code, something along the lines of
FILE * fin = open(INPUT_FILE, READ);
FILE * fout = creat(OUTPUT_FILE);
for (i = 0; i < sizeof(N)/4; i++)
{
while(current_position < *N) putc(getc(fin), fout);
for(j = 0; j < sizeof(M);j++) putc(M[j], fout);
}
Anyone know of a general result giving the probability of an occurance of an M digit sequence appearing more than N times in a file of arbitrary length L? With appropriate values for M,N, and L, one minus this number would be Mike's chances of not having to pay...
No sig
The challenge he accepted was to produce "a compressor and several COMPRESSED files whose
total file size [was] less than the original uncompressed file and from which [he could] regenerate the original uncompressed file".
(Emphasis mine.)
The files he produced were not compressed. They contained blocks of the original data unchanged. Ergo, he failed the challenge, even under the modified terms.
What if you tried to compress your 8 digit and 9 digit numbers (in your example) using more conventional compression methods? Could those (either both or at least one) be compressible so that your decompress routine decompresses the 8 & 9 digit numbers back to their original values before you apply n=a^2 + b
Or will a & b always be very random also when this is applied to a highly randomized n, thus making them also equally hard to compress?
Seems like if a & b are long enough, you might get lucky.
First off, if you have the ability to compress randomly occurring data (that is, each bit has an approximately 50/50 chance of occurring), i seriously doubt you'd send $100 to some guy you don't know in the hope that he'll send you $5000 back. Why, he'd just reverse engineer the decompressor, patent it, get rich, and buy an island inhabited with young bikini clad virgins before the your check had even cleared.
Secondly, i've been thinking about this problem for about 3 years now.
Imagine that you have three sets of numbers, X, Y and Z. X constitutes all currently compressible data (say, a bit of text). Y all seemingly random data (say, a text file compressed with zip, or Pi/2 rounded off to 4000 digits). And finally Z, all purely random-occurring evenly distributed data. Let A(n), B(n) and C(n) be functions that spit out the n'th number contained within the respective sets, in ascending order.
Now, for all possible combinations of bits within a 128KB file (2^1048576) how many numbers fall into each of X, Y and Z?
If an infinitely fast computer could calculate all numbers contained within the set Z, and X+Y was a few times greater than Z, then storing the value of 'n' might just compress...
awww shit, no it won't... cos the decompressor would be huge. Ah well, that's another $1 into the bogus ideas piggie bank.
Why not take the compressed file and move it to a base that is extraordinarily huge...
A 3 megabyte file in base10 is how much in base10000? And -- just include a space after every XXXXX XXXXX to seperate them... I figure a small enough algorithm could be coded in ASM to accomplish this and -- since runtime isnt an issue (just that it accomplish it): why wouldnt that work?
Disclaimer: I am sure that I have made some huge mistake here but I can't figure it out.. would love someone to point it out to me...
----------- You look at life differently while suspended upside down and gagged.
I saw this article a few days ago on BBSpot.com. I'm an avid reader of both yet I have to side with Brian for posting his originally.