Slashdot Mirror


How I Completed The $5000 Compression Challenge

Patrick Craig points to this page which details (in email) the strange life of an online challenge as posed, clarified and (at least to some eyes) met successfully. This should give pause to anyone willing to pose -- or accept -- seemingly impossible problems. In short, Patrick's not $5,000 richer, but you may think he deserves to be.

140 of 414 comments (clear)

  1. Re:Intriquing by sjames · · Score: 2

    which isn't true because if this were true every number would be a random number (because every number could potentially be generated by this number-generator).

    But every number CAN potentially be produced by a random number generator. The hypothetical file full of zeros doesn't violate this because you only know you got all 0s after the fact. Watching the bitstream, each bit that came out had a 50% probability of being a 1, it just happens that none of them were over that particular sample period.

    Thus, given 2000 (or any arbitarry number) if 1 Mbit samples (also arbitrary) from a true random number generator,upon examination, one of them MAY contain all 0s. You still have no way to predict the contents of the others (the all 0s file has exactly the same probability as any other arrangement of bits), and you have no basis to predict the content of the remaining files.

    If you flip a coin 100 times yielding all heads (assuming a non-biased coin), the odds of heads on the next flip is 1/2.

  2. Re:sorry! by sjames · · Score: 2

    It sounds to me like the person offering the $5000 was scamming everyone because it's theoretically impossible to win.

    The only place I've seen that challenge is in the compression FAQ. It's in the same section of the FAQ that explains the impossability. It's hard to call something a scam when the 'brochure' explains that you will certainly loose your money.

    It's even more understandable when you consider that the USPTO has issued several IMPOSSIBLE patents for compressors that work on random data (recursivly no less)!

  3. Re:Intriquing by sjames · · Score: 2

    I don't know about you, but if I came across a coin that flipped heads 100 times in a row, I'd strongly suspect a biased coin.

    Given no other information, I would suspect the same. That's why I had to specify. Replace coin with random bit generator if you prefer.

  4. What is "total file size"? by CaseyB · · Score: 2
    The problem is that the definition of "total file size" is subject to interpretation. If the guy holding the contest had said "total file size as reported by ls -l", he'd be screwed. But he could argue that "total file size" means the number of bytes of disk used to store the file.

    Although, by both providing a certain file AND specifying the file size (3145728 bytes), one could argue that he *implicitly* defined file size as the number of bytes within the file. Hence, he loses.

  5. Re:"to be disbursed at my sole discretion" by CaseyB · · Score: 2
    He didn't put a phrase like "to be disbursed at my sole discretion" at the start of the description of the prize award.

    However, he was also asking for $100 up front, to encourage only serious submissions. I don't think he'd get many submissions if he tried both tactics. :)

  6. No, and it's provably false by hawk · · Score: 2
    There are 2^n ways to arrange bits. The number of ways to arrange all possible sequences of bits ranging from since 1 to (n-1) is 2^(n-1).


    There is *at least* one sequence that cannot, by an arbitrary method, be reduced to a shorter length. On top of that, you need to use some bits for the compressor.


    OTOH, if you can include "external" information in your compressor (either the algorithm or a function), you can have an algorithm that will compress any stream, and compress your favorite stream to a single bit. As an example, the first bit is 1 (and the only bit) if it is your stream, and 0 followed by the entire stream for any other stream.


    hawk

  7. Re:Compression by AxelBoldt · · Score: 2
    As the length of the file increases, the number of bits needed to *specify the position* of this run of zeroes also increases. And you're going to have to record that position somewhere in your decompressor or compressed file.

    Nope, you don't have to record it, simply attach the part of the original file following the block of zeros to your decompressor program. The "compressed file" is the part of the original file preceding the block of zeros. Your decompressor prints out the contents of the compressed file, then 1024 zeros, then it reads from itself the remainder of the file and prints that out.

    The bet can be won; not in every instance of course, but in the long run.

    --

  8. Re:Compression by AxelBoldt · · Score: 2
    Yes, or simply require that the contestant submit a single program (with prescribed file name) which can regenerate original.dat and is shorter than original.dat.

    In that case, the outcome of the bet clearly depends on the machine model. If the machine model is "complete Debian Linux running on Pentium", I would conjecture that the bet can't be won, but it is far from clear (obviously, a succinct powerful language is needed, maybe A+). On many machine models, the bet can be trivially won.

    --

  9. Re:Compression by AxelBoldt · · Score: 2
    No, the machine architecture doesn't matter. The bet can be won only if you are very astronomically lucky.
    I'd gladly stake large amounts of money on this, though it would feel like I'm running a scam.

    If you allow me to specify the machine model ahead of time, you'd lose. This was observed by Anonymous Coward in article 466.

    --

  10. Re:Compression by AxelBoldt · · Score: 2
    As understood the challenge, there was no requirement that the data I give you be random, though I might ordinarily give random data for such a challenge. If that was the machine model, however, I'd just give you a 0 followed by random data, and you would lose.

    Good point, the proposed machine model is not Turing complete and you can analyze it ahead of time.

    If I remember my Kolmogorov complexity right, the probability that a random file can be generated by a program that's n bits shorter is roughly a*b^(-n), where a and b depend on the (Turing complete) machine model. If the model is chosen wisely to make the constants small, it should be possible to compress more than 1/50 of files by one bit. But it's not nearly as simple as I thought.

    --

  11. Re:Compression by AxelBoldt · · Score: 2
    the probability that a random file can be generated by a program that's n bits shorter is roughly a*b^(-n)

    It's a*2^(-n), sorry.

    --

  12. I'd comment... by Enahs · · Score: 2

    ...but Yahoo! seems to have taken the site down. Thank you, Yahoo!

    --
    Stating on Slashdot that I like cheese since 1997.
  13. Re:An ingenious solution... by jd · · Score: 2
    No. There is a difference between random DATA and a random number GENERATOR. To be random, data must follow some mathematical rules, one of which is that the probability distribution is equal at all points. Another is that the next point's probability is NOT affected by the prior point's.

    You are correct, however, in saying that -any- sequence can be generated by a random number generator. However, there is no general solution to the compression of a sequence S into a smaller sequence C, where a unique 1:1 mapping exists.

    HAVING SAID THAT, if you compressed S into C, and could manually determine which of the reverse mappings was S, you -would- be able to "compress" generic random data of any size and of any degree of randomness.

    (This is because the user posesses some additional information, I, which is therefore redundant and can be ignored, for the purpose of the compression.)

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  14. Re:An ingenious solution... by jd · · Score: 2
    The 218 files would have taken no additional space on the FAT, as it's fixed size.

    Further, you can format tracks down to 128 bytes per sector, leaving an average of 64 bytes of dead space per file. That's 13,592 bytes of dead space, if we store the files "conventionally".

    If, however, we store all the files in a single indexed file, there's no dead space to account for.

    The filenames could be considered another issue. However, again, DOS' root directory is fixed-size and therefore file entries within it take no additional space.

    Lastly, it doesn't matter if the guy "admits" to exploiting vague wording. Within the wording of the rules as given, he won. Within the intent of the rules, as given, he lost.

    Ergo, BOTH people have a claim to "victory". They can either split the purse, or they can shake hands and donate it to some worthy cause.

    In the end, this comes down to something I've been asked all too often myself. If you had to choose, would you rather be happy or right?

    IMHO, the guys should cut out the righteous act, and stick with being happy. The contestant that he "defeated" the challange, and the guy who set the contest up, that he achieved his goal of reducing the snake oil surplus.

    Let's face it, those are worthy achievements in themselves. Be satisfied! The money is just causing grief. Neither probably needs it. So why not clear the air, and give the cash to someone who does?!

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  15. Re:An ingenious solution... by jd · · Score: 2
    Yawn. I know perfectly well that the FAT uses clusters of sectors. So? You can configure the disk geometry in the boot sector.

    Hex disk editors are fun tools. You might want to try them, some day, when you're finished trolling and want to get some work done.

    The dead space (NOBODY calls it "slack space"! Slack is what people do on Fridays) is a function of sector size, and you can define whatever sector size you like. (The defaults are the "sensible" geometry, but by no means the only one.)

    I suggest you also get Peter Norton's books on MSDOS, and print out a copy of the Interrupt listings, which you can find on the Internet. The Norton Guides are extremely handy, too. Oh, and you'll want a copy of Flight Simulator 2.0, which came on a 5.25" floppy. The "copy protection" scheme is to use 1 Kb sectors, which not only stopped DOS reading them directly, it also gave the disk about 256 extra bytes per track.

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  16. Omega by Zooko · · Score: 2

    He should have used Gregory Chaitin's Omega number to generate the challenge file.

    Actually I really don't understand Chaitin's work well enough to know if that would have saved him the $5000, but at least he (and the challenger) would have learned something about algorithmic complexity theory.

    Zooko

  17. no genius required by slew · · Score: 2

    As people have already mentioned, the numbers are probably not pseudo-random as they were
    retrieved from random.org...

    However, if someone decides to be cute and repeat this challenge with a pseudorandom sequence,
    you don't need to be a good mathmetician, you just need to know how to read a paper written
    by a good mathmetician... Look for the paper...

    Massey, J. 1969. Shift-Register Synthesis and BCH Decoding. IEEE Transactions on Information Theory.
    IT-15(1): 122-127.

    Abstract -- It is shown in this paper that the iterative algorithm introduced by Berlekamp
    for decoding BCH codes actually provides a general solution to the problem of synthesizing the
    shortest linear feedback shift register capable of generating the prescribed finite sequence of
    digits. The equivalence of the decoding problem for BCH codes to a shift-register synthesis
    problem is demonstrated...

    This is probably in most textbooks on linear codes for encryption and/or error detection and
    allows you to recreate the shortest LFSR (linear feedback shift register, a common component
    of pseudo random number generators), given a sequence of digits. Of course nobody would
    use a non-cryptographically secure PRNG like random() would they?

    So you take the number, pass it through this algorithm, find the shortest LFSR, and the
    decompressor just takes the LFSR initial state and reproduces the sequence according to the LFSR.

    Then again, the shortest LFSR can be thought of as the linear complexity of the number and who
    knows it might just accidentally compress the sequence the challenger gives you (not likely,
    but who knows?)...

    Always remember... "stand on the backs of giants" whenever possible ;^)

  18. Re:Compression by Genom · · Score: 2

    ahh...but would THAT 1000 bit number be random, or compressible on it's own?

    Once you can drop 1000 bits from the file, you can play with the offset itself, which has a good chance of being compressible or at the very least expressible in a smaller format.

  19. A variation which might work by stevelinton · · Score: 2

    As several people have observed, the attacker in this case used his "split into multi-files" trick to hide the necessary information in the directory structure of the disk. This is just barely ruled out by the "one compressed file" rule. However, suppose you hid the data in the filename of one file. For example write a program like

    echo $1 | cat - $1

    (about 20 bytes) and use the first 21 bytes of the raw file as the name of the compressed file.

    This is in the same spirit as this attempt (hide the data in the directory) but avoids breaking the one compressed file rule.

    Steve

  20. Re:An ingenious solution... by Ed+Avis · · Score: 2

    In this sort of contest, it's traditional that finding ingenious ways to exploit loopholes in the rules is a perfectly valid way to win. Then the rules are tightened up if ever the contest is re-run.

    Mike Goldman didn't make it clear what 'total size' meant - I think most programmers if you asked them would assume the sum of file sizes, if you didn't specify otherwise. His fault, I feel. He should pay the $5000 and write the rules more carefully next time.

    BTW, if you do assume that 'total size' is just the sum of file sizes, you can compress anything down to zero bytes. Number each possible sequence of binary digits - the empty string is zero, '0' is one, '1' is two, '00' is three, '01' is four, '10' is five, '11' is six, and so on. Then just create the required number of zero-length files.

    --
    -- Ed Avis ed@membled.com
  21. Mike defined total file size himself by A+nonymous+Coward · · Score: 2

    Patrick specified a file size, and Mike generated a file of that specified size. Mike did not include any OS filesystem overhead in his file size. It seems rather hypocritial of Mike to claim OS file system overhead counts for Patrick's files but not Mike's. Mike defined FILE SIZE, not DISK SPACE USED.

    --

    1. Re:Mike defined total file size himself by ChadN · · Score: 3

      However, the decompressor DOES depend on the filename numbering in order to work; it is part of the algorithm (and hence the "decompressor"), and those bytes (the bytes used to number the files) should be charged against the decompressor size. Then Patrick loses.

      While not explicitly stated, those bytes ARE an integral part of the "decompressor" that was supplied (it can't work without them), and so including their size is consistent with the agreed upon rules, IMO.

      --
      "It's overkill, of course. But you can never have too much overkill." - Anonymous Slashdot Coward
  22. Re:Intriquing by llywrch · · Score: 2

    > What I'm thinking of, is wouldn't it be possible to create a minimalist compression algorithm that tended to not affect the file size,
    > but might deviate it in either direction by about 1% ?(enough to cover the overhead of the decompressor).

    All purely random strings of numbers (e.g. pi, e or the square root of 2), tend to have short strings of digits that repeat randomly. I believe this behavior is called ``clustering."

    Assuming a large enough file, say 3MB in size, we could find up to 26 strings of two or more digits that repeat enough times so that the compressor would be a series of substitution commands. For example:

    s/a/311/g
    s/b/47/g
    s/c/9724/g
    ...
    etc.

    That comes out to an average of 10 characters per line (including end-of-line character), add another 20 characters for overhead, & the decompressing script could be kept to less than 300 characters. As long as each substitution applied to more than 3 places in the original, then there would be net decrease in total file. And since we are looking at the 26 most common strings of digits, this inductively should be possible.

    Hmm. I seem to have reinvented pkzip.

    Geoff

    --
    I think I see a trend here. Maybe for them it really would be easier to muzzle the entire internet than to produce p
  23. Re:Another tack... by scrytch · · Score: 2

    > My question is, has anyone put any efforts towards seeing if other, larger pieces of data could be represented in this way?

    Sure: any. It was just the decss source padded out to a prime. You could do the same for mozilla.
    --

    --
    I've finally had it: until slashdot gets article moderation, I am not coming back.
  24. Re:PAY UP MIKE by edhall · · Score: 2

    If you want to be picky, Patrick utterly failed the challenge by not supplying compressed files. They were simply unaltered pieces of the original file. Sure, there were missing bits of the files that had been moved to the file names, but the fact that this slight of hand failed to compress the data can be proved simply by renaming the files. Even if the order of files supplid to the "decompressor" is preserved, renaming them renders the original data unrecoverable. Sure, moving data into the filename is a cute trick, but it isn't compression.

    The goal of the challenge is to produce compression, not to win via some semantic shell game. It's Patrick who shed his honor by resorting to a semantic smoke screen and attempting to win by wordplay with no intention of actually producing any real compression.

    -Ed
  25. Good deal for contestant by Sloppy · · Score: 2

    The payoff matrix is highly in favor of the contestant. You can't just assume that a random stream won't compress. Sure, it might not, but your odds are about even, not nearly as bad as a 50-to-1 shot.


    ---
    --
    As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
    1. Re:Good deal for contestant by dasmegabyte · · Score: 2

      Correct. He no doubt weeds his files for the normal patterns of random similarity, because even ONE of these would make the contest a farce (you could replace the bytes very easily). But there must be other alternatives to simple dictionary replacement, alternatives like dictionary on a bit shift (you're guaranteed to find some stuff that way, though the compression will be slow as hell), fractal decompression (match a sequence of the random to some buildable equation and save that equation) or possibly compression of a matrix of the file itself. Of course, for each of these methods I mention, Mike has probably generated a file that's taken these into account, and that's the point of this: he wants to see the magical next step in compression, which doesn't rely on these simple notions that things are repeated or can be rerepresented. There's some elusive technology he wants us to uncover, and hence the $5000 (and why he wouldn't pay up on that sham of a compression scheme Pat generated...converting "5" to an EOF does not constitute compressing.)

      --
      Hey freaks: now you're ju
  26. *Read* the article before you post. by arcade · · Score: 2

    Read the damn article before you post - he didn't use 'gunzip' in his proposition.


    --

    --
    "Rune Kristian Viken" - http://www.nwo.no - arca
  27. Re:The page has been removed by GeoCities by Skapare · · Score: 2

    Their 486 got overloaded

    --
    now we need to go OSS in diesel cars
  28. Re:sorry! by th0m · · Score: 2

    there's no scamming involved. goldman required the (suitably small) entry fee to deter time-wasters; craig wasn't after the money in the first place. they were both just out to prove a point (goldman, that truly random data can't be compressed in the general case; craig, that goldman's challenge was amusingly open to abuse), not to make any money.

    --

    -- in china, chinese food is just called food.

  29. Re:Compression by platypus · · Score: 2

    You could scan the file for the smallest possible number whose binary representation isn't contained in the file,
    i.e. let's say 1001010111001010101 (I know this number is too short) is not there replace the 1000 zero-bits with that number.

    But I fear now it only get's more difficult to calculate the probabilties of a "win", but the outcome will stay the same.

  30. Re:An ingenious solution... by Surak · · Score: 2

    No way...the guy didn't compress shit. All he did was some data shifting. If the guy had read the FAQ, he would know that such tricks are not compression...you mention an MSDOS FAT filesystem. Even on a FAT filesystem, the 218 (or whatever) files and the decompressor would have taken up more space than the original file due to slack space...although the amount would vary from hard drive to hard drive, if you figure an average of 512 bytes of slack space per file thats 111,616 bytes of slack space, far more space than Patrick claims to have "saved."

    And anyways, the guy basically admits that he cheated by exploiting the vague wording of the challenge.

    I've got to hand it to Patrick, though, because if you do take the challenge out of context, he basically won...it depends on what the meaning of the words "compressor" and "decompressor" mean.

  31. Re:An ingenious solution... by Surak · · Score: 2

    The 218 files would have taken no additional space on the FAT, as it's fixed size.

    Further, you can format tracks down to 128 bytes per sector, leaving an average of 64 bytes of dead space per file. That's 13,592 bytes of dead space, if we store the files "conventionally".


    I never said that it would take additional FAT space, I'm talking slack space...if you knew anything about the FAT filesystem (which you obviously don't), you would know that its not sectors you have to worry about, it's clusters (GROUPS of sectors). there are only so many clusters because of FAT's fixed size... I was being QUITE generous in saying that 512 bytes is a practical average. On a 1.2 GB MS-DOS formatted drive (not FAT32, mind you, we're talking DOS here, not Windows), the cluster size ends up being something like 16K, leaving the average slack space to be 8K. so I was being VERY generous there.

  32. Re:That wasn't in the challenge by Surak · · Score: 2

    The hard drive platters were free. The coffee can lids might have worked better though because they are lighter in weight, although they wouldn't have been as rigid, so I'm not sure how well they would have supported the weight of the vehicle...

    I guess I could try it, though. :)

  33. Re:That wasn't in the challenge by Surak · · Score: 4

    Again, that depends on what you mean by "paper airplane." Admittedly, I won a similar contest involving rubber-band driven cars. The contest rules said that CDs or DVDs could not be used for tires due to the fact that they make for very narrow tires, thus having reduced friction. So I used hard drive platters. :-)

  34. IF I'm not mistaken. by mindstrm · · Score: 2

    He never said it was okay to deliver a whole bunch of 'compressed' files. He said ONE decompressor, and ONE compressed file. Though Patrick asked about this, all the guy did was restate his terms.

    What Patrick did was take a small bit of information out of the file, and, rather than having a table of 'offsets' telling his decompressor where to insert 5's, he broke it up into a bunch of files, effectively using the directory/inote tables to store this information.

    Had he not insisited on a single file, I'd say Patrick here would have technically won, though not at all in the spirit of the competition.

    1. Re:IF I'm not mistaken. by mindstrm · · Score: 2

      Hmm. okay.. point taken.
      Guy should cough up 5 big ones then.

    2. Re:IF I'm not mistaken. by ChadN · · Score: 2

      I would argue that the decompressor needs the filename numbers in order to work (for ordering) It CANNOT work on a bitstream of the "compressed" files alone (even if the ordering is preserved), so it is dependent on the numbering in the filename, and thus, their size needs to be counted as part of the decompressor. If you add in these bytes, then the decompressor (file plus filename numbering bytes) and datafiles are larger than the original file.

      So, after some consideration, I would say that Patrick didn't succeed, even considering the specific wording of the challenge (and the discussion in the emails).

      --
      "It's overkill, of course. But you can never have too much overkill." - Anonymous Slashdot Coward
    3. Re:IF I'm not mistaken. by selectspec · · Score: 2

      At first, when I was reading the page, I felt that Patrick clearly cheated, given that he exploited the filesystem to store data. However, after reading your comment, I must agree. Mike should have clarified the rules. Clearly, Patrick deserves the cash.

      --

      Someone you trust is one of us.

    4. Re:IF I'm not mistaken. by dachshund · · Score: 3
      > I meant can I send you a compressor and several compressed files whose
      > total file size is less than the original uncompressed file and from
      > which I can regenerate the original uncompressed file
      >
      > Patrick

      Sure -- but you send me a *decompressor*, I don't need the compressor.

      Sure means Sure. As far as I can tell this is acceptance of Patrick's request. As Mike is the guy inventing the rules, I don't see why this wouldn't be binding.

  35. Re:Already compressed data? by mindstrm · · Score: 2

    The data has to come from the guy accepting the challenge; you get to tell him how much data to compress, he gives you the data.

  36. Re:An ingenious solution... by ChadN · · Score: 2
    In particular, the filename encodes information about the order in which the files need to be reassembled; so for each byte saved by the "compressor", several bytes are needed (on average) to encode the ordering as an ascii integer in the filename.

    The challenger should have made it clear that these costs would be counted against the reconstructed filesize (or simply not allowed any deviation from the "one file" rule) The decompressor should operate by having all the files piped into stdin, thus eliminating filesystem state info entirely.

    --
    "It's overkill, of course. But you can never have too much overkill." - Anonymous Slashdot Coward
  37. Re:Another tack... by ChadN · · Score: 2

    I've heard this idea conjectured a lot by various people; but although the mandelbrot set is in some sense infinite, the set of images that can be produced has bounds. Basically, choosing any fractal, and sampling at any resolution, with any slice through the set, still won't produce the entire possible set of data that could be represented in the finite space that you've chosen. Due to the sampling, you would repeat some data and never produce others. In fact, the self-similarity properties of fractals would seem to indicate that this would be a rather poor (or inefficient) way to represent general data.

    --
    "It's overkill, of course. But you can never have too much overkill." - Anonymous Slashdot Coward
  38. Re:it's "file size", and that's all it is by ChadN · · Score: 2
    NEVER does Mike say "file size". He uses the phrase "total in size". In fact, here is a quote:

    In order to meet the challenge you must provide to me a compressed file and a decompressor which together total less than the size of this original.dat, and which can generate a file identical to original.dat.

    He doesn't say "file size" in this or any other passage. It is Patrick who says (and assumes) this.

    Also, Patrick specifically addresses the issue of filename based compression in this passage.

    It's probably also possible to meet the challenge with smaller file sizes by storing information in the filename of the compressed file or the decompressor, but I think most people would consider this to be cheating.

    Thus, I think it is Patrick who hadn't considered the fact that his "decompressor" is not only just the 'dfi' script, but also the filename information used to indicate file boundaries and ordering. Like I said, his algorithm (much less his actual implementation) could not properly reassemble the file if it were fed all the properly ordered "compressed" files as an input stream (through a pipe, say), since it wouldn't know where to place the missing fives. So given that Mike said "total size", in all cases, when discussing and offering the challenge, I think his position is safe.

    Now, I do think that Patrick was reasonably clever, and that Mike was overly smug in both the challenge and the initial news posting. And in the exchanges Patrick did explictly ask about file sizes, but I do not fault Mike for not clarifying further, since he had no a priori knowledge of the scheme, and therefore didn't know he would have to be so specific. But his use of terminology was consistent, and I think, appropriate, to protect against this type of scheme.

    Anyway, that's just my take on it. Mike can hopefully learn some humility from this, and Patrick can learn that his "decompressor" is more than just the sum of its file bytes.

    --
    "It's overkill, of course. But you can never have too much overkill." - Anonymous Slashdot Coward
  39. Re:Are you for lawyers or against? by Moofie · · Score: 2

    As soon as Mike accepted Patrick's money, it was no longer a rhetorical challenge. It was a bet. Patrick, at that point, had caught Mike proffering a reverse sucker bet. Mike says "I bet you $100 at 50:1 odds that you can't do X." Patrick, confident that he could do X, accepted the bet and then did X. Mike says "Oh, I really meant for you to do X.1. I only tricked you into trying to do X because it's impossible." Patrick, in a really amazing display of good sportsmanship, declined to have his money refunded. (Which Mike wisely proffered...otherwise, he'd be in serious danger of fraud...)

    My momma told me never to take a sucker bet. Mike got caught by his OWN sucker bet...which takes a) greed and b) foolishness. He got lucky that the person who took him was such a good sport about it.

    It'd be really nice to see Mike donate the money to ACM or EFF or somebody like that...that'd be a pretty classy way of taking care of this.

    --
    Why yes, I AM a rocket scientist!
  40. Re:Are you for lawyers or against? by Moofie · · Score: 2

    I'm not necessarily arguing a point of law, it's more of a point of honor. Wagers are well-understood in many cultures, and in this case it seems clear that Mike proffered an arguably dishonest wager (a bet where he challenged people to do something that was not possible to do to his satisfaction). Somebody hoist Mike by his own petard. Good for Patrick.

    --
    Why yes, I AM a rocket scientist!
  41. Re:Compression & Time by angst_ridden_hipster · · Score: 2

    Yeah, my exceptionally slow compressor counts the number of "one" bits and the number of "zero" bits. It stores those two numbers, along with several position-dependent checksums. Compression is very fast.

    Decompression can take roughly the lifetime of the universe (for large files).

    Interestingly, my brilliant strategy is also a encryption algorithm ;)
    bukra fil mish mish
    -
    Monitor the Web, or Track your site!

    --
    Eloi, Eloi, lema sabachtani?
    www.fogbound.net
  42. Re:Compression & Time by angst_ridden_hipster · · Score: 2

    (should have said "slow de-compressor" but you get the picture)
    bukra fil mish mish
    -
    Monitor the Web, or Track your site!

    --
    Eloi, Eloi, lema sabachtani?
    www.fogbound.net
  43. Lzip! by PovRayMan · · Score: 2

    Lzip should have gotten $5000 for being the best compression program.

    ----------

  44. Re:Compression by jfunk · · Score: 2

    Too bad the Python doesn't get executed.

    Maybe in Squishdot, as long as rexec is used, of course. You could import StructuredText from Zope and your post could act like a Wiki.

    - *this* would look like this

    - **this** would look like this

    The above would become an actual HTML bulleted list.

    http://this.would.be/a/link

    || Instant Table ||
    || 1 || 2 ||

    Ah, well. :-)*

  45. Re:Are you for lawyers or against? by ajs · · Score: 2

    I think that a person's stance on this issue, of who was right and who was wrong, will indicate that same person's approval of lawyers and much of the practice of the law today.

    Data point: The corolation for me is: modern US law as practiced is harmful, and anti-productive; this contest was clearly stated, and clearly won. Mike should pay up.

  46. Re:Are you for lawyers or against? by ajs · · Score: 2

    But it was clear. The contest was in terms of total file size of the compressed output and decompressor program.

    The proviso given (which should not have been) was that the compressed file could be provided in multiple parts. What he forgot is that an ordered sequence of files has more information than just the sum of their contents. Ooops.

    This is not a wording problem, but one of a poorly thought out bet. This guy gave 1:50 odds that no one could compress truely random data. He chose to allow conditions that made the bet doable, and then did not want to pay up when the bet was won.

    Actually, even without using multiple files, I think the bet's doable for any single given data file (though you cannot write a generalized compressor for all truely random data). I should come up with the submission to solve for this guy's challenge.

  47. Sucker Bets and Real Proposition Gambling by werdna · · Score: 2
    This was a great story. It reminds me of Sky Masterson's soliloquy from Guys and Dolls, in rejecting a proposed sucker bet about the number of sales for various pastries at a deli:


    "Let me tell you a story. When I was a young man about to go out into the world, my father says to me a very valuable thing. 'Son,' the old guy says, 'I'm sorry that I am not able to bankroll you a very large start. But not having any potatoes to give you, I am going to stake you to some very valuable advice. One of these days in your travels, you are going to come across a guy with a nice brand new deck of cards, with the wrap and seal unbroken, and this guy is going to bet you that he can make the Jack of Spades jump out of the deck and squirt cider in your ear. But, son, you do not take this bet. For if you do, as sure as you are standing there, you are going to end up with an ear full of cider.' "


    Sky, an extraordinary proposition gambler, responded by countering. He covered the offeror's eyes and bet he couldn't tell what color was Sky's tie. "No bet."

    I say this to point out the difference between sucker betting and proposition gambling. It is a sucker bet to offer a bet challenging a mark to do something that is, in fact, impossible. In proposition gambling, the offeror gives the mark a proposition they are unlikely to win, or that is worded in such a clever way that the winning of it leaves the mark too embarassed, entertained or impressed by the cleverness to attempt to challenge having lost.

    In this case, the first offer (compressing high-entropy data) was a sucker bet. Shame on him. The response, doing the sucker bet using multiple files, was a sweet proposition gamble. Shame on him. Once taken, the offeror-now-mark did weasel, but as I see it, ultimately welched.

    I admit that I, too, didn't consider the import of the multiple files on first reading (but, hey, I didn't have $5K against $100 riding on the result).

    At the end of the day, the sucker-bettor was the mark, and he had an earful of cider.

    No sympathy.
  48. A Hypothetical Question by gnfnrf · · Score: 2

    It is my understanding of the situation that Patrick "solved" the challenge by exploiting the fact that multiple files store more implicit data in the filesystem than a single file does. I believe the only such data he used was the EOF.

    This is considered "cheating" because he used 220 EOFs that are not counted in the filesize.

    So, my question is this. If a contestant submits a single compressed file and a single decompressor file which are just 1 byte under the original file, does Mike void the entry because of the extra filesystem space of two files? If not, why not? If so, shouldn't the rules say that specifically, as two is the minimum total number of files one can submit?

    --
    gnfnrf

  49. Intriquing by PhiRatE · · Score: 3

    Its an intriquing problem. I certainly wouldn't stake $5000 on nobody being able to compress the data I supplied, good random number generator or not. Why? because while the concept that random data in general cannot be compressed is correct, there are no patterns, it is not necessarily true that a given set of random data cannot be compressed.

    For example, a file full of 0's, is perfectly possible from a true random number generator, and just as likely as any other set of numbers. As any good geek knows, a file full of 0's compresses very very well.

    In this case, I think the challenge has one major problem, and that is that there is no real time limit on the entry, and no requirement that the decompressor work effectively in general, it can be entirely specialised.

    This means that the problem *may* be solvable, simply by throwing ungodly amounts of CPU power at it. With the appropriate code, it would be possible to simply hunt for unintentional patterns in the data and hope. If you're lucky, there will be enough indexable patterns present that it will outweigh the cost of the indexing, and the code necessary to replace the index points.

    If you're unlucky, the data will have no patterns in the space you searched, and therefore you're out of luck.

    This is why the guy noted that if he got a big enough file, he could win the competition. As the filesize increases, it becomes increasingly likely that there are, accidentally, enough patterns in the data to cover the space required for the decompressor and the dictionary. Of course, as the filesize increases, you also have to burn considerably more CPU in order to locate the patterns in the first place.

    Personally I think it would make an excellent project for a research team, not particularly for the money, but in order to create some really fast, specialised code for locating patterns in assumedly random data.

    I am not aware of much effort in that area, most compression research goes into identifying patterns inherent in the data type, waveforms in audio, visual patterns in images, duplicate strings in text and binary files. It would be very interesting to see just how effective fast analysis code and heavy CPU is against the problem.

    I suspect there may well be some crossover here with Cryptoanalysis, since one of the primary jobs in breaking crypto is finding the patterns that differentiate the ciphered data from pure randomness.

    Any links appreciated.

    --
    You can't win a fight.
    1. Re:Intriquing by gotan · · Score: 2

      Since you have to provide the algorithm for uncompressing too you can regard the machine (OS/shell+tools) as one huge decompressor. It does f(compressed data+algorithm)= decompressed data. Simple counting yields that since there are at least as many (compressed data+algorithm) pairs, as there are sets of decompresssed data. Even by choosing only the shortest versions at least some of those pairs have to have a larger total length than the decompressed data (not accounting for tricks with hiding data in EOFs and filelengths).

      --
      "By the way if anyone here is in advertising or marketing... kill yourself." -- Bill Hicks
    2. Re:Intriquing by Erasmus+Darwin · · Score: 2
      because while the concept that random data in general cannot be compressed is correct, there are no patterns, it is not necessarily true that a given set of random data cannot be compressed.

      If I'm not mistaken, during the email exchange, someone mentioned that the data was probably skew-corrected. So not only would you have to hope for a nice pattern in the data, but it would have to be a pattern that the person issuing the contest didn't notice.

    3. Re:Intriquing by vidarh · · Score: 2
      Your argument is fatally flawed.

      For it to be impossible to predict the next digit,to be generated, that digit needs to be totally random and chosen without any relation to the preceeding digits. If that is true, then any digit can follow any digit with the same probability, otherwise you would be able to make predictions better than pure chance by looking at the preceeding digit.

      In any sequence of n digits, where every digit is choosen totally on random, every sequence of n digits are just as likely to occur. That means that 0,0,0,0,0 is just as likely as 5,2,7,3,1, or any other sequence of the same number of digit chosen.

      If you're given a sequence of 10.000 digits, that are totally random, and you get to look at the first 9.999 of them, and they are all zero, the chance is still just 1 in 10 that the final digit is zero.

      A number isn't random in itself, it is random if it is chosen at random. You may confuse random numbers with numbers such as Pi, which aren't chosen at random, and thus aren't random numbers, but where you can't predict digit n+1 by looking at any of the digits 1 to n, and where there are no patterns. These numbers are transcendental numbers, not random.

      Obviously, in this case, a randomly generated number isn't what one should be looking for, but a number where there are no patterns to be exploited.

  50. Almost. by rjh · · Score: 2

    Depends entirely on how random the random data is. There's a surprising lot of random numbers which can be compressed--this almost always indicates a critical flaw in the random selection process. Even radioisotope-based RNGs aren't immune; by knowing the isotope, its age, and the sort of radiation detector being used, it's often possible to get a faint hint of determinism out of the RNG. That faint hint of determinism can be exploited to result in compressability... sometimes.

    By and large you're right that random data can't be compressed. But getting really random data is a far-from-trivial undertaking.

  51. Another tack... by wct · · Score: 2

    A while ago on slashdot.org an article was posted about a prime number that had the "magical" property of being able to be g'unzip'd to the DeCSS source. The number was represented as a (relatively short) arithmetic expression.

    My question is, has anyone put any efforts towards seeing if other, larger pieces of data could be represented in this way? Of course the chances of finding these type of numbers are very slim, but they *are* infinitely large spaces, and the compression would be ultra-compact.

    Maybe something for a quantum computer to chew on anyway...

    1. Re:Another tack... by Relic+of+the+Future · · Score: 2
      You mean this story? Ummm, it was never represented by any "arithmetic expression", only in its full (hex) glory and and like this:

      48565...29443 (1401-digits)

      I'm afraid using '...' doesn't count as compression. ;)

      There are some prime numbers which can be represented by (2^n)-1 (e.g. 3,7,31 but not 15 or 63), but I think you're right that your chances of finding such numbers would be slim.

      God does not play dice with the universe. Albert Einstein

      --
      Those who fail to understand communication protocols, are doomed to repeat them over port 80.
    2. Re:Another tack... by BMazurek · · Score: 2
      One thought that has always intrigued me is the infinitely large nature of fractals.

      Given a set of data to represent, construct a fractal equation, and a set of coordinates and a resolution that when expanded reproduce the original data.

      If you could find them, that would be some mighty cool compression....

    3. Re:Another tack... by vidarh · · Score: 2
      The problem with this method is that it doesn't only require you to find a set of coordinates and a resolution, but to find a set of coordinates and resolution that can be expressed in less space than the original data.

      Which may work for some input sets, but can't possibly work for all input sets. If you believe it can, imagine this scenario: I take an arbitrarily sized block, and compress it. Then I compress the result. I keep going until the data set is one bit. No matter how large your decoder, how would you decode more than two states from one bit?

      You can't.

      The tradeoff for all lossless compression for "generic" data, is to find an algorithm that compresses well for a reasonable set of inputs.

  52. Here is a Real Challenge by superid · · Score: 2
    There is another numerical analysis challenge thats been around for several years and has only had one winner of the $100 prize. Check out the Weak Signal Challenge. You need to recover a morse code signal embedded in noise. And you don't need to send any $$$ to enter, just download the WAV and start number crunching :)


    SuperID

    Free Database Hosting

  53. The problem by Hard_Code · · Score: 2

    The problem is that Mike said it was OK to use multiple files, but did not specify whether that meant that the actual additional filesystem space that they took up would be counted against Patrick anyway (with the idea that Patrick could use many files, just so long as the eventual information still added up smaller than the original), or whether he meant that it shouldn't make a difference, and didn't care.

    The information that could be considered "removed" by the files, would basically be the start and end offsets of the chunks, if they were all in one file...which would naturally add up to more than the original (219 * 2 additional info). The "missing" information escaped into the unix shell and filesystem, through the facility of "cat" being able to print out a block of bytes given a single piece of info.

    --

    It's 10 PM. Do you know if you're un-American?
  54. Re:Compression by Hard_Code · · Score: 2

    POT appears not to mean that Slashdot will *interpret* your post as POT, but instead that Slashdot *assumes* your post to be text that is not meaningful HTML. They should changed it so any HTML is appropriately escaped (entitied) from POT submissions. Angle brackets, etc, are very annoying to have to fix.

    --

    It's 10 PM. Do you know if you're un-American?
  55. Re:One born every minute by QuantumG · · Score: 2

    ie, it would confirm what I say every day.

    --
    How we know is more important than what we know.
  56. But this solution WOULD work by Cy+Guy · · Score: 2

    Since the challenge allows you to see the file before building the compressor (and the neither the size or speed of the compressor is relevant), all you need to do is find the single array of bytes (I'd say string, but presumably the 'arbitrary file' would be binary) greater than three bytes, that occupies the greatest percentage of file space in the compressed file. Then replace that array wherever it occurs with an EOF character and name each new section of the file sequentially using a 1 byte name.

    (As noted in another post) As long as you pick an initial file length long enough that 2% of the time there statistically should be enough repeats of some 3 byte (or longer) array to equal the number of bytes in your decompressor (which would be roughly the same size as Patrick's), then you will win the challenge enough times to make it profitable.

  57. Re:Compression by Cy+Guy · · Score: 2

    Nice try, but won't work. You have to keep the position of that serie of '1'. That position will use about log2(position) bits. Unsuprisingly, 2^n is the size the file needs to be to have a good chance to have a serie of n '1's.

    Not if you expand on Patrick's original idea replace the series of 1's (or zeros, your compressor should determine the longest continues series of each and then use the longer) and replace it with EOF. Since EOF can't appear in the original file except at the end, your de-compressor will know that this is the position to insert the original series. You could also repeat this as many times as like, provided the replaced series is at least as long as length of the EOF character plus the length of each new file name created plus memory space needed to store the value of the length of the series.

  58. Re:Are you for lawyers or against? by mjh · · Score: 2
    I think that specifity is important. Any software engineer will tell you that

    [ ... ]

    The english language (in common use) is suitably ambiguous to make that impossible

    Specificity is critically important when you're communicating with a computer. But over specificity is actually a deterrant to communicating with another human. Ambiguity is a common mechanism by which we humans learn. So, for example, when I say "cat" you have one image in your mind and I have another image in my mind. But when we both discover that the thing that I was talking about is different than the thing you were thinking, we'll have expanded our understanding of what the word "cat" comprises. Just now, I was thinking of a cheetah. Do you remember when you learned that cheetahs and lions and tigers were also cats?

    I'm overstating some things here becuase it's been about 8 years since I took my philosophy of language class. But to sum up, generally human communication works much more efficiently when there's some ambiguity in it.

    Which leads us to where we are - I don't really like it when a silly case wins in court due to a loophole, but I see no alternative

    But it's not just in the event of silly cases. You are cornered by the silliness everyday. Clearly you use a computer connected to the Internet. Do you think that the terms of service that govern your connection protect you? The point is that if we can't figure out a better way to deal with specificity and intent, then we consign ourselves to the rules that are written by those who can afford to write them.

    I don't have an answer for this either, but I'd like to think that the discussion isn't so easily dismissed.

    --
    Key to financial independence: Spend less than you earn. Save and invest the difference. Do it for a long time.
  59. Are you for lawyers or against? by mjh · · Score: 5
    I think that a person's stance on this issue, of who was right and who was wrong, will indicate that same person's approval of lawyers and much of the practice of the law today.

    If you are for the challengee, then you believe that exact specificity is more important than intent. And you inadvertantly empower the legal profession to profit from being better able to specify rules, etc. Which, in turn, gives those with money even more power. They can write the rules by which your interaction with them will be governed. You can be certain that the rules will be written to the advantage of the guy with the money.

    If, on the other hand, you are for the challenger, then you believe that his intent was more important. (Considering that the challenge came from the compression FAQ, the exact definition of "compression" should have not been vague. It seems clear to me that his intent was obvious, even though his challenge wasn't very specific.) Still there is a problem with valuing intent over specificity. That is it's easy to misrepresent intent after the fact. Thus the need to clearly specify our intentions prior to anything we do. That being the case, with accurate documentation, it seems possible to me that we can discern the original intent w/out having to specify everything legalistically. For example, I think that I was able to clearly determine the challenger's intent well after the fact, and I did this having read the challengee's representation of the events.

    Now you may say, in rebuttal, "Who cares who I choose? I don't get to make the law." But think of yourself as the jury trying to determine to whom you would award the money. You may one day find yourself on a jury and trying to determine who should be awarded money. I see this as a microcosm of exactly how our court system has gotten into the state that it's in. Specificity is clearly more important than intent. And it's that way because juries award that way. Since juries are comprised of members of society, it's not a far stretch to say that society as a whole regards specificity as more important than intent.

    Is there any chance that we can change this? And if we did, would it be better?

    --
    Key to financial independence: Spend less than you earn. Save and invest the difference. Do it for a long time.
    1. Re:Are you for lawyers or against? by radish · · Score: 2


      I was very interested by your comments regarding the difference between specification and intent. However, I think there are some other factors in play here.

      You mention intent - let's look at the challenger's intent. It was NOT to find an amazing new compression scheme, in other words this contest was not meant to sponsor research in the field. He set the challenge PURELY to get one over on everyone else, to shout "told you so - ha ha ha" like some kiddie cracker. As he's essentially trying to catch people out, I think the onus is on HIM to make sure he specifies things EXACTLY. What has happened is the smartass challenger has had his whole scheme flipped up and fired back at him, which I personally relish and applaud.

      I think that specifity is important. Any software engineer will tell you that - the number one quote from a user when you deliver your system?? "Oh, I didn't MEAN I wanted it to do that...". You may think you know what the other party intended, but how can you be sure? The english language (in common use) is suitably ambiguous to make that impossible. (An aside - I'd love to know if developers in non-english speaking countries have the same problems, is english particularly ambiguous??). A lawyer would tell you that's exactly what they are trying to do, deduce intent from the actual written law (or contract or whatever). The two sides have differing interpretations, which are simply two different models of the original intent, deduced from the same written document. Where do you draw the line between the "quest for truth" and 2 bit scumbags just trying to pick holes in perfectly good laws? The simple fact is I don't think you can.

      Which leads us to where we are - I don't really like it when a silly case wins in court due to a loophole, but I see no alternative. People will disagree, they will always disagree, so how do you settle it? If you don't go with exact interpretation of the letter of the law, you end up having to go with an interpretation of the spirit of the law, which puts total (and to me unacceptable) power in the hands of whoever happens to be making that interpretation. At least an argument over english linguistics is transparent and explainable. Hopefully the inclusion of a jury in many cases helps to add the "human touch" to the process - at least I know it did when I served recently.

      --

      ---- Den ene knappen er powerknapp, den andre er Bender voice knapp "Bite My Shiny Metal Ass"

    2. Re:Are you for lawyers or against? by Theodore+Logan · · Score: 2
      In this case, it is clearly obvious that the challengers intent is either to make a fool out of someone, or to make money. Not only does this person (Mike) believe the challenge will not be successfully met, he also believes it cannot be successfully met. This is not good sportmanship to me, and the only right thing to do is to look for the little things in the way the challenge was stated, that may allow bending the rules just enough to "win". This is exactly what Partick did.

      Had Mike had a different intent with this challenge, or stated it so that he thought it would be possible, but very unlikely, that he would have to cough up, the kind of approach Patrick took may be considered wrong. But this was not the case, and it seems to me as if you're trying to make this into something that it isn't.

      --

      "If you think education is expensive, try ignorance" - Derek Bok

    3. Re:Are you for lawyers or against? by startled · · Score: 2

      I believe you're under several misconceptions, which might be remedied by sitting in on a few trials, or reading up on a few cases.

      First of all, only naive idiots are universally pro-lawyer or anti-lawyer. Lawyers, like any other large group of people, are diverse. Often, people who say they're anti-lawyer mean the asshole working for the insurance company who cheated them out of a claim; they don't mean the guy suing to protect national forests, or defending the latest Leonard Peltier.

      Next, intent plays a much larger role than you'd believe. Law is typically not a game of "gotcha!" on contract wording. Intent determines whether you're charged with manslaughter or murder. Intent quite often is used to clarify, or even override, portions of contracts. If the challengee went to court, and attempted to base his entire case on the "sure" response to his "compressor" e-mail, he'd likely get tossed out-- the challenger could simply say, "oh, I thought he meant decompressor", or, "I was just attempting to clarify-- that wasn't intended to be a modification of the contract", or some other such thing. Again, the law is not a simple game of "gotcha".

      Finally, juries are another reason specificity is often trumped. If they see a person getting cheated out of their money because they missed one line in a quite long, and very misleading, contract, they'll tend to side with that person.

      You should sit in on a case or two some time; they can be quite enlightening. Court rooms are generally open to the public, and while some of the cases may be quite dry, you'll get a lot of insight in just a few hours.

    4. Re:Are you for lawyers or against? by sulli · · Score: 2
      Which is precisely why they are made that long - so people won't read them and object.

      Back on topic, I think Mike should return the $100 (or Y12200) in a nice picture frame. Patrick nailed him good, and even though he didn't disprove the mathematical principle, he did poke some appropriate holes in his hubris. Exact intent of the contest is less important than the specifics of their exchange; Mike (who really should have known better) took the bait, hook, line, and sinker!

      --

      sulli
      RTFJ.
  60. Re:The page has been removed by GeoCities by MikeBabcock · · Score: 2

    And here if needed.

    --
    - Michael T. Babcock (Yes, I blog)
  61. Re:An ingenious solution... by MikeBabcock · · Score: 2
    The quote everyone seems to avoid is:
    I am curious, though, and I am not trying to get any sort of angle on you here, what brilliant idea have you come up with that makes you think you can compress arbitrary data? Whether or whatever you disclose won't affect this challenge and I will live up t my end of the bargain regardless.
    That sums it up for me.
    --
    - Michael T. Babcock (Yes, I blog)
  62. Re:There is no solution... by ZahrGnosis · · Score: 2

    I have to agree with this. The trick is Given a sufficiently large file. I can't imagine anyone coming up with a 2GB file off of which a modified RLE program couldn't snip a few kB. By "modified", I mean spend all the time you want analyzing the file (this is allowed), and find strings of similar bytes in arithmetic progression, or in strange but calculable patterns. Remember, the compressor can be as complicated and large as possible; it's the decompressor that needs to be small.

  63. Re:But then what about "compressible" files. by BlueUnderwear · · Score: 2
    > however you have no way to represent non-compression resistant files.

    The challenge specifies that you can choose your compressor after you had a peek at the datafile. If it turns out to be compressible enough by a known algorithm, just ship that compressor. If it isn't, use your meta-compressor.

    However, I have a gut feeling that the uncompressible files are too numerous for this meta-compressor to work: if 99% of all possible files are uncompressible, then the index number of the file would not be that much smaller than the file itself, and you couldn't fit the meta-compressor's code (which would be rather complex...) in the small space saving.

    As others have pointed out, a better alternative would be to simply "play the odds": make an algorithm which only works in 10% of the cases. If the file can't be compressed, resubmit an entry until you get one that works. On average you would have spent $1000, and won $5000!

    --
    Say no to software patents.
  64. Re:PAY UP MIKE by selectspec · · Score: 2

    Clearly, this is the crux of the arguement, and intellegent people can waiver here. In a sense the fragments were compressed in that the 5's were chopped off. I would actually argue that this is a form of compression, however impractical in this case. For example, if you create a dictionary of "phrases" and install the dictionary within the decompressing client, and send the "compressed" files you are really doing the same thing that Patrick did. The difference, is that Patrick exploited the filesystem to store his dictionary data, as opposed to actually putting the dictionary inside the decompressor. In fact what he did IS compression, and the files were compressed in size. Patrick could have done even more devious techniques by moving all of the data into the filenames themselves. Because Mike did not spesify that external linkage to datasources was illegal, what Patrick did in my view was within the confines of the contest.

    --

    Someone you trust is one of us.

  65. A Solution by selectspec · · Score: 2

    One solution however impractical an imperfect is to reverse-checksums. Essentially, you take the data block and perform a series of running checksums (like CRCs). You store only the checksums in the "compressed" file plus tidbits of "hint" real data. What you have is a lossy form of compression. However, with enough processing power (a few dozen E10,000s) you can reverse calcuate the checksums which will give you accurate results with 99.999+% of data blocks (some data blocks will return false solutions). It is simply a matter of home much processing time you are willing to throw at the problem.

    --

    Someone you trust is one of us.

    1. Re:A Solution by Animats · · Score: 2

      MD5 is supposed to be resistant to that kind of attack. Generating data with a given CRC, though, is not all that hard.

  66. Re:Compression by selectspec · · Score: 2

    Aside for not clarifying the use of external dictionaries as being illegal (Mike let Patrick use the filesystem filenames to store data), Mike also didn't clarify processing time. If processing time is not a factor, you can create reverse CRC/Checksum compression that is acurate almost everytime. Reverse-crc compression is accurate inversely porportionate to the rate of compression. So, you simply need to choose an large data block (1 gig or so) to compress with a very low rate of compression, say 1000/999 or something like that. Then you let the reverse crc decompressor work on it for a few days, and eventually you'll come up with the right data 999/1000 times.

    --

    Someone you trust is one of us.

  67. PAY UP MIKE by selectspec · · Score: 3

    .

    Patrick I meant can I send you a compressor and several compressed files whose total file size is less than the original uncompressed file and from which I can regenerate the original uncompressed file

    Mike Sure.

    Pay up Mike. Pay up. You were too careless to plan out the rules and think about your intent before you threw 5k into the sink. Cough up the cash or loose all honor.

    --

    Someone you trust is one of us.

    1. Re:PAY UP MIKE by jgerman · · Score: 2
      EVERYBODY doesn't know anything. Science is NOT a world where everything is fair.

      Let's start with the term EVERYBODY. Your use of the word has implicit assumptions that you consider self evident truths. Does EVERYBODY include the five year old down the street? I guess it's not EVERYBODY then is it. Your assumption on the meaning of everybody is vague and ambiguous. Thus the need for explicitly written rules and any loophole in the agreed upon rules is valid. Regardless of whether EVERYBODY knew what the intent was. This is a case of smug idiot screwing up the rules to his challenge.

      As far as science being fair. Fair to whom? That's a completely meaningless statement. why don't you elaborate and show how science is FAIR?

      --
      I'm the big fish in the big pond bitch.
    2. Re:PAY UP MIKE by FTL · · Score: 2
      > I meant can I send you a compressor and several compressed files...

      They weren't "compressed files". That's the only problem. They were just literal fragments of the original file.
      --

      --
      Slashdot monitor for your Mozilla sidebar or Active Desktop.
    3. Re:PAY UP MIKE by dasmegabyte · · Score: 2

      Well, this isn't just a bet. It's like a research grant. Mike wants to see compression -- meaning, the file takes less disc space -- and that isn't what he got. He got a bunch of files that no doubt took up more disc space due to the file information headers...basically, he did the same thing as those cheater programs back in the '80s that would "compress" files by moving the EOF up to the front of the file, and storing the real bits in the free space area. The trick here was in the file system, and on a file system that included the size of the header in the file size (like MacOS) it would be obvious he failed. Don't try and milk Mike's grant money for a shuyster who did nothing other than lie in an attempt to win(which is what this was...even if it wasn't especially shiftless). EVERYBODY knows the purpose of this challenge is compression, not rhetorical trickery, and the advancement of computer science is a bit more important than a scientist's offhanded approach to definition. I agree that he was careless, but science *IS* a world where things are fair...and in this case, fairness precludes and should preclude honor.

      --
      Hey freaks: now you're ju
  68. it's a bad challenge regardless of wording. by daniell · · Score: 2

    Let us disregard the issue of money and the issue of a specific answer's acceptability and analyze the question

    In my opinion any person, given enough time to come up to speed on the subject, and given that that time, plus the time it takes to come up with and implement their solution all multiplied by the factor of time it takes to maintain themselves (eating, resting etc.) during that time, and given that they live that long, can come up with a winning solution to Mike's challenge due to a flaw, not in wording of the challenge, but in Mike's reasoning behind the challenge.

    To put a lot of discussion to rest (in my mind) just read the relevent portion of the FAQ in which the challenge is found. i.e. the section entitled "9.2 The counting argument"

    Here you'll find, clearly stated, the flaw in trying to find any one algorithm which will always compress data of a fixed length to less than that length. Also you'll see the decent challege by Steve Tate which paraphrased says:

    "give me your best algorithm and the size of data it works on, and I'll give you one counter example of data in that size which will not compress with your algorithm."
    Mike Goldman's addendum challenge effectivly states:
    "I'll give you data in any size you request, you give me data and a program which in total are smaller than the original size and for which the program applied to this data produce my original data."

    Reading it in its full context, which aparently a lot of slashdotters didn't, it immediately gives itself up as a challenge which does not hinge on on the mathematical theorem it is presented with. i.e.

    Given, there is no one algorithm which can compress all data, nor a subset of data selected by being a fixed length.
    Therefore, if you give me an algorithm which claims to compress all data or all data of such a subset, I can give you at least one counter example.
    It does not follow that I can give you one counter example to all compression algorithms for a given length (of your choice).

    The implication of the first challenge is that the poser is confident in both the mathematical theory and either his analytical abilities when studying the given algorithm, or in his ability to test and check against the given algorithm in a sufficient period of time.

    Conversely, the implication in the second challenge is that Mike in fact is the possessor of a set of super-data, of which any sub-set cannot be compressed by any algorithm.

    Since we live in a time when any data sufficiently limited in size yet sufficiently large enough to offer decent gains when compressed (the challenger may pick that size) can be analyzed fully and with great speed, I think Mike has made a bit of a fool of himself.

    I am not saying that Mike can't supply <<the uncompressable data from hell>&gt, but he will have to go to some lengths to get it. It will likely be selected to be large enough so Mike cannot generate it himself without an algorithm. It cannot be generated with a small input set or that set could be discovered and when coupled with the generator would defeat the challenge. It may be generated from a large set, but the result must show no patterns that can be bitstuffed (see networking) against or more simply generated and indexed for, nor may the source set contain any of the same properties. It may be possible that Mike is collecting this data from a source, i.e. digitizing random radio signals; we may all be interested in any source data which cannot by any reversable algorithm be represented/stored more succinctly particularly without any regard for retaining its meaning while compressed.

    In summary, if Mike doesn't want to be giving away money, then he doesn't understand his own challenge's failure to be based on any mathematical certainty (unless I'm wrong and he has some super-data). However Patrick's solution was clearly not compression, though it was the start of a decent form of compression. Had concatenated his file-set and headed it with information as to each file's length in order, he just might have saved space enough for the decompression, particularly if he picked a suffiently large recuring patern that could be generated rather than having to be stored verbatim.

    However, I might just be able to claim, that if the universe is deterministic, if the input conditions for the creation of the universe could be represented in less data than a generated block of data which can be represented in that universe, and if I were given enough resources in which to recreate the universe, I should be able to offer the input conditions of the universe and an index into it's ongoing generation which could point-to a generated block of data which exactly matches one provided. I just might be able to give you the process, its inputs, and your data before you've finished generating the data for my participation in the challenge, if in my generated universe I could significantly warp time without affecting the results.

    -Daniel

    That last part's an odd joke

  69. Re:Compression by daniell · · Score: 2
    I tend to agree that it should be always possible to get the $5000 from this "challenge". You can see my extended thoughts at: article 403

    I'm surprised at the number of people using the FAQ's "pigeon hole" theorem backwards. The theorem states that a reversable algorithm cannot be used to represent every piece of data as a piece of data smaller than itself. It does not address being able to pick an algorithm well suited to just one specific piece of data. If there is any data to which no compression alogrithm can be applied to, tailored to, or made for, that's a pretty amazing piece of data, and we should all be made aware of it Mike.

    -Daniel

  70. "to be disbursed at my sole discretion" by Baldrson · · Score: 2
    The reason I said "to be disbursed at my sole discretion" in the first sentence of my rocketry prize was to make it clear that I was acting as the arbitrator of my own "contract" with the public and that therefore if anyone wanted to play "games" with me, they would quickly discover they were playing with themselves.

    Goldman's mistake was two fold:

    1) He didn't put a phrase like "to be disbursed at my sole discretion" at the start of the description of the prize award.

    2) This he failed to do so within an arena where the distinction between "tricks" and "solutions" is frequently nonexistant: mathematics.

    1. Re:"to be disbursed at my sole discretion" by Baldrson · · Score: 2
      And of course, I had at least a one-fold mistake in my comment on Goldman's second of his two fold mistake -- which comment should have read:

      2) This he failed to do within an arena where the distinction between "tricks" and "solutions" is frequently nonexistant: mathematics.

      Such mental typos being made at my decaffinated 6:54 AM brain's sole discretion.

    2. Re:"to be disbursed at my sole discretion" by Baldrson · · Score: 2

      If someone had a genuine solution -- something that actually fulfilled the intent of the challenge -- they would be likely to submit the $100 because the odds that Goldman was not scamming the compression group would be vastly higher than 1/50. :)

  71. Re:Reality? or Data Hiding? by Apotsy · · Score: 2
    The guy was clever and found a loophole in the rules of the challenge. He even said so.

    The point of the story is to be careful when issuing so-called "impossible" challenges. Don't leave loopholes in the rules!

  72. Re:Bar room bets? by Apotsy · · Score: 3
    Exactly. "Pompus ass" were the first words to come to my mind as well.

    When Goldman gleefully posted to the newsgroup that someone had accepted his challenge, he said "some people just don't understand information theory too well". Yeah, and some people just don't understand issuing challenges very well, either!

    When you make a challenge like that you absolutely must be sure to word the rules carefully and try to avoid leaving any loopholes. In this case, all Goldman really needed to do was tack a sentence at the end of his original challenge saying "the data must actually be compressed according to the definition of 'compression' contained in this FAQ. Challengers who merely point out loopholes or exploits in the rules of the challenge will not be rewarded".

    The moment Mr. Craig started iquiring about about the specifics of the rules and saying "wait and see" regarding his methods, that should have been a warning sign to Goldman that Craig had some trick up his sleeve. But no, Goldman was just so in love with the idea that he found some "impossible" task which suckers might try to undertake that he couldn't be bothered to think about such things.

    What's even more embarassing to Goldman is his initial reaction to Craig's submission:

    I have accessed your server and downloaded exactly two files, the "decompressor" (dfi) and the "compressed file" (comp). The rest of the files in the directory (comp.0 through comp.217 inclusive) I have not touched.
    Even though Craig had already asked if more than one file was okay and Goldman had said yes!

    And Goldman's following point:

    In further point of fact, the 218 parts together must occupy more space than the original file (original.dat) that you were given. Each file requires space for a directory entry, and each such directory entry requires in excess of one byte of space. Thus you have in actuality expanded the data which you were given to compress.
    is even more ridiculous given that Goldman had already said that "sure" in response to Craig's question about multiple files whose "total file size" (which does not include directory entries) are less than the size of the original data file. Goldman tried to change the rules rather than admit he was sloppy in setting them up.

    Goldman also says "I think that I did not make a mistake", even though he admits in the same message that he needs to restate the rules:

    Perhaps it will be necessary to restate the terms in a legalistic form which is not open to this sort of miscommunication...
    "Perhaps"? How about "definitely"! And any "miscommunication" was entirely one-way. Craig was very sly and clever, and Goldman was too full of himself to realize what the fuck was going on until after the fact. A "pompus ass" indeed.
  73. Re:Psuedo-random generators by kevinank · · Score: 2

    I think in his post to comp.compression Mike wrote that his random data had come from random.org. Random doesn't use a PRN generation algorithm; they get random data from atmospheric radio noise (tune your AM radio to a non-station and you'll hear what that means.) To remove any bias in the data they postprocess the noise by removing high order bits, and the low order bits are shifted to normal using a bit discarding algorithm.

    --
    LibBT: BitTorrent for C - small - fast - clean (Now Versio
  74. The challenge was *not* impossible. by Nonesuch · · Score: 2
    The challenger was mistaken, he thought that he was offering an impossible challenge, but was mistaken.

    The comp.compression fact discusses that for any given compression algorithm, it is possible to craft an input fill that cannot be compressed.

    This is different than the challenge proposed, where Mike, the challenger, provided a file of size X, and Patrick, the challengee, was required to provide a de-compression program of size Y and a data file of size Z, where X > Y+Z.

    The difference here is that Patrick was allowed to hand-craft the compression method to the data file, but Mike was not allowed to hand-craft the original data to the compression routine.

    Basically, Mike was wrong, and should pay the $5K.

  75. Re:Reality? or Data Hiding? by Nonesuch · · Score: 2
    The quest was for 'Lossless data compression', but the challenge was presented backwards-

    Given a sufficently large source file of truly random data, I can create a binary and a data file which can recreate that one specific source file, losslessly, with my binary+data being smaller than the source. That is, I can create a routine that will work for just that one specific source file, no others.

    The problem of 'Lossless Data Compression' refers to the opposite case- given any general purpose routine to compress files, I can create a file that it cannot compress. That is the oppposite of the challenge given, and actually is impossible.

    As stated, even with the restriction of a single compressed file, the challenge *IS* winnable.

  76. Re:The bogus ideas piggie bank by Nonesuch · · Score: 2
    There is a huge difference between 'have the ability to compress randomly occurring data' and 'have the ability to compress the specific file Mike supplies'.

    It is impossible to write a routine that can losslessly compress every possible 3-megabyte file that could ever be generated.

    It is not impossible to write a routine that can compress the one specific file that Mike supplied to you. That is, it is quite likely that you could write a program that can losslessly compress that one single input set, such that:
    length of decompressor) + (length of compressed data)

    Why, he'd just reverse engineer the decompressor, patent it, get rich, and buy an island inhabited with young bikini clad virgins before the your check had even cleared.
    That would only be true for the general casse compressor-decompressor. I could write a routine that would work for that one specific input file, but fail miserably on any other input. This would win me the $5K, but be valueless for any other purpose.

    econdly, i've been thinking about this problem for about 3 years now. :) I've tried so many methods and ways of interpreting and looking at random data, i'd have enough to write a book if i'd bothered to keep accurate logs. If i had $1 for every time i said "I'VE GOT IT!" i'd probably have $5000 already. Well, maybe that's a tad off the mark, but anyway, you get my point. Right, onto my conclusion...

    The mistake both you and Mike make is that while writing a program for the general case of compressing random data is impossible, it is very different than writing a program to compress one specific file of 'random' data.

  77. Re:Compression by gimbo · · Score: 2

    Slightly over-zealous use of "Submit" without previous use of "Preview", methinks. ;-)

    BTW, you can use &lt; to get "less than", like this:

    <

    This is very very standard stuff in HTML.


    --

  78. Re:Compression by gimbo · · Score: 2

    Heh - interesting: I've never used anything other than "Plain Old Text" (including on this message - but spot the italics!).

    Strange... Hey ho, live and learn I guess.
    --

  79. Re:Compression by gimbo · · Score: 2

    def test():
    print 'so what do the other modes do?'
    this_is_the_example_for('code')
    return("cool, works the way I'd hope")

    if __name__ == '__main__':
    test()

    --

  80. Re:Compression by gimbo · · Score: 2

    But tomorrow morning, I shall be sober.

    Oh, hang on, that's not right.
    --

  81. Re:Compression by mikera · · Score: 2

    You know, I reckon it's actually *always* possible to win $5000, no matter what file is generated, providing it is of sufficient length. This is because knowing the file to be compressed *before* you create the decopressor gives you an advantage over the arbitrary random file case.

    Furthermore, I reckon I can write a program G that will, when given any input file A of length>=X, will create a decompressor D and compressed file C such that:

    D(C) = A
    size (D) + size (C) size (A)

    Don't know what X is yet, I supspect around 1 meg is fine though.

    Anyone willing to bet either way?

  82. Re:Compression by mikera · · Score: 2

    Slashdot apparently doesn't like open angle brackets.

    the condition should read:

    size(D) + size(C) "is less than" size (A)

    Slightly over-zealous HTML filter, methinks....

  83. Re:Compression by mikera · · Score: 2

    Uh, just spotted that isn't possible. Cos then you've created G(A)->(D,C) which is a universal compressor, and therefore not possible by your usual arguments. Course, I'm still interested to see if you could create G which works for a percentage X of possible values of A that gets arbitrarily close to 1. My guess is that this is possible, in which case you can still win money off the original bet by ensuring that X>0.02 Of course, if your opponent knew G, they could always choose a value of A that would confound it. But this isn't the case here. Also, it may be that X increases only very slowly with filesize, so that in order to get X>0.02 you would need a few solar systems worth of storage and CPUs to complete the challenge. Can anyone prove/disprove any of this?

  84. Re:Compression by jareds · · Score: 2

    Anyone willing to bet either way?

    Yes, assuming appropriate restrictions against storing information in the compressed file's name or attributes, I'll give you N:1 odds that you cannot write the program G, where N is any positive number. X can also be any number.

    Or, save yourself time, don't try to write the program at all, just send me money.

  85. Re:Compression by jareds · · Score: 2

    It's not so easy to generate a file to resist compression. If you had and used a method for determining whether a file could resist compression, then I could compress your generated file simply by counting how many compression-resistant files preceded it in some sort order and writing that number as a binary file. Decompression would proceed by counting that number of compression-resistant files in the same order and writing the file found at the count.

    However, your decompression program would have to be smaller than M - log_2 (N) = log_2 (2^M/N) bits, where M is the number of bits in the compression-resistant file, and N is the number of compresion-resistant files preceding it in your sort order. So, the number of compression-resistant files would have to be small relative to the number of possible files if you want more than a bit to write your program. Since this is not the case, you would fail.

  86. Re:Compression by jareds · · Score: 2

    Nope, you don't have to record it, simply attach the part of the original file following the block of zeros to your decompressor program. The "compressed file" is the part of the original file preceding the block of zeros. Your decompressor prints out the contents of the compressed file, then 1024 zeros, then it reads from itself the remainder of the file and prints that out. The bet can be won; not in every instance of course, but in the long run.

    That reflects a mistake on the part of Mike Goldman. If you want your decompressor and compressed file to be a total of length N, you can store log_2(N) bits of information in how you set the lengths of the two files. If you have more than two files, you can store even more information. Mike should have required that (number of bits in original data)+log_2(length of original data) > sum of number of bits in each compressed file and decompressor + log_2(product of lengths of each file and of decompressor), and he should have permitted himself to rename or change the file attributes of any file he wished in case information was being stored therein.

  87. Re:Compression by jareds · · Score: 2

    Yes, or simply require that the contestant submit a single program (with prescribed file name) which can regenerate original.dat and is shorter than original.dat.

    In that case, the outcome of the bet clearly depends on the machine model. If the machine model is "complete Debian Linux running on Pentium", I would conjecture that the bet can't be won, but it is far from clear (obviously, a succinct powerful language is needed, maybe A+). On many machine models, the bet can be trivially won.

    No, the machine architecture doesn't matter. The bet can be won only if you are very astronomically lucky.

    I'd gladly stake large amounts of money on this, though it would feel like I'm running a scam.

  88. Re:Compression by jareds · · Score: 2

    If you allow me to specify the machine model ahead of time, you'd lose. This was observed by Anonymous Coward in article 466.

    As I understood the challenge, there was no requirement that the data I give you be random, though I might ordinarily give random data for such a challenge. If that was the machine model, however, I'd just give you a 0 followed by random data, and you would lose.

  89. Re:That wasn't in the challenge by Richy_T · · Score: 2
    We recently had a "Active Learning day" at work. The staff was divided into two teams and various activities occurred throughout the day, the aim being to accumulate clues. When all clues were accumulated, you had to use them to open a briefcase which contained the final activity of the day.

    The theory was that each teams set of data was incomplete so they would have to cooperate to get the suitcase open, thus teaching us the importance of acting as a whole team or something.

    However, I have always had an interest in locks (and "unlocking" them) and the briefcase had those cheapy combo locks on it so I wandered over and had it open in 20 seconds.

    Lesson: Teams are good but sometimes an individual makes the difference.

    Unfortunately, because things had gone wrong, they decided to usurp my victory and give the final activity (locate an item in the woods) to both teams. My team lost. I'm not sure what the message there is but I chose to learn my own message which is "Messing around for the day is much more fun than sitting in an office working"

    Rich

  90. Re:An ingenious solution... by susano_otter · · Score: 2

    Ingenious, but not ingeneous enough. The FAT table is a limited resource, and his solution uses up more of that resource than the original file. It's still not a valid compression - not to mention the fact that it's totally unportable anyway. Other OS's won't even concede this marginal point. . .

    --

    Any sufficiently well-organized community is indistinguishable from Government.

  91. Random data... by abiogenesis · · Score: 2

    If you try to compress random data using one generic algorithm, you will sometimes have size savings, and sometimes will end up with a larger file. If you are smart enough to put a "not compressed" flag in your data file when it becomes larger, you can reduce the size on average.

    But, if you are free to select your algorithm after seeing the data file, chances are you can always compress the data (although I may not mathematically prove this). Remember the article here on slashdot a few weeks ago where a prime number, when gunzipped, produces the source code for a DVD decoder. Well, you may say that they are not the same because the source code was compressed with gzip beforehand, but please think a different algorithm than gzip (as in mandelbrot fractals) and you *might* end up with an algorithm which compresses some data which is random to you but not-so-random statistically or mathematically.

    PI is infinitely random, but can't you compress it in a few lines of code?

    Although I should admit that finding the algorithm itself would be mostly by chance.

    --

    Donate free food to the hungry at The Hunger site.
  92. Re:Maybe he should have used the "lossy" lzip algo by DrXym · · Score: 2
    No he wouldn't. The goal is to compress the data, such that decompressing it returns it to its original form.

    Lossy implies decompression will *not* return it to its original form.

  93. here's a decompressor by aozilla · · Score: 2

    works on freebsd, anyway
    -------------
    #!/bin/sh

    fetch http://211.0.18.242/~patrick/other/original.dat

    --
    ok then your [sic] infringing on my copyright! Could you as [sic] me next time before STEALING my comments for your own?
  94. Re:There is no solution... by ssimpson · · Score: 2

    Kind of true: you can't universally compress random data using a single program, but this doesn't mean you can't compress a single instance of random data.

    This competition is kind of "balanced" because:

    • The contestant can attempt and mount any number of different types of compression - all he has to do is find one instance that meets the requirements of the challenge.
    • The challenge owner can choose a file of the specified length that is "strong" against compression: Test the data with tools such as Diehard and Ent - if the file doesn't seem "strong enough" then create a new one.

    Given a sufficiently large file (a couple of Gb would probably do) then I think virtually any file could be compressed with a specially crafted compressor and decompressor. Given a 2Gb file, you only need to achieve .0001% compression to have 2147 bytes to write the decompressor.

    --
    "Mary had a crypto key, she kept it in escrow, and everything that Mary said, the Feds were sure to know."
  95. The page has been removed by GeoCities by locutus074 · · Score: 5
    There is now a mirror available.

    --

    --

    --
    We have fought the AC's, and they have won.

  96. What arrogancy! by Theodore+Logan · · Score: 2
    As seen in his letter posted to the newsgroup after the challenge had been accepted, Mike was very arrogant, and quite stupid. As many slashdotters have noted already, being able to compress a single "random" file, is not the same as being able to compress any "random" file. $5000 is a lot of money for a not very clearly thought-through competition, and though it is obvious that this guy can't cough up although he should, he ought to apologize and give the $100 back. Bending the rules are what these kind of contests are about. Besides, threatening with legal action makes him lose his face completely.

    Hopefully, at least, he learned his lesson, which most likely was all Patrick ever hoped for anyway.

    --

    "If you think education is expensive, try ignorance" - Derek Bok

  97. Re:Compression by e_lehman · · Score: 2

    Pick a large enough length, and you are virtually guaranteed that there is a 1024-byte block of zeroes in it somewhere.

    Your reasoning contains an error. What were you thinking, arguing with an MIT guy, for heaven's sake! :-)

    As the length of the file increases, the number of bits needed to *specify the position* of this run of zeroes also increases. And you're going to have to record that position somewhere in your decompressor or compressed file.

    For example, to a first order, you need a file of length around 2^1000 bits to expect a run of 1000 consecutive zero-bits. Great, But specifying the position of this run requires-- you got it!-- a 1000-bit number. We're hosed!

  98. Re:50/50 compression by John_Booty · · Score: 2

    That's almost a good idea, and it made me laugh when I read it, but your decompressor would surely take more than one bit, wouldn't it? So the total size of the decompressor plus the compressed file would certainly be greater than the original file (this was a key part of the original challange).

    http://www.bootyproject.org

    --

    OtakuBooty.com: Smart, funny, sexy nerds.
  99. Re:Compression by clare-ents · · Score: 2

    "
    I reckon it's actually *always* possible to win $5000, no matter what file is generated, providing it is of sufficient length.
    "

    Nope,

    If I have N bits of file. There are 2^N possible files that can be created. There is no way you can always have a compressed file of length 2^(N-1) in length simply because you only have 2^(N-1) distinct possibilities leaving 2^(N-1) files untouched.

    However, I intend to compress 2^(N-1) files into a length of 2^(N-1) and the other 2^(N-1) files into a length > 2^N. Since there's a 2^-1 chance of getting each file I expect an average return of $2400. (2^-1 * $prize - $entry).

    --
    Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. (Einstein)
  100. Re:Compression by clare-ents · · Score: 2

    Sorry, I meant to say length N-1, and length N.

    I've muddied my explanation however hopefully the point is clear [it's possible to compress 50% of the files to smaller than the starting length at the expense of the other 50% being larger].

    --
    Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. (Einstein)
  101. Re:*sigh* Here's why. by clare-ents · · Score: 2

    "
    You will therefore find a match, ON AVERAGE, at a distance of about 2^99 bits into PI
    "

    And hence *on average* you win the competition.

    In n entries you will win n/2 times and lose n/2 times - admittedly you will tend to expand the data - the amount the data expands on average in the losing cases is greater than the average saving in the winning cases. However, that doesn't matter here. Providing we can shrink the data in better than 1 in every 50 attempts we will end up making a profit.

    If the guy wanted to have a winning money source he should have made the prize $199.

    --
    Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. (Einstein)
  102. 50/50 compression by clare-ents · · Score: 2

    Ignore the last bit

    Supply it at random.

    50% of the time you're wrong and lose your $100

    50% of the time you're right and win $5000

    Of course this loses the spirit of the competition [design a universal compressor which is clearly impossible] but wins the other competition of 'how do I make the arrogant prick look like a fool'

    --
    Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. (Einstein)
  103. Compression by clare-ents · · Score: 4

    Ok, I've done some information theory, I know it's impossible to create a perfect compressor. Boxes arguments and all that.

    However, to make money, you don't have to.

    All that is required is a compresser that saves 1 bit on better than 1 in 50 attempts.

    For example, if we can save 1 bit on 50% of the attempts - even if the other 50% of the attempts results it catasrohpically large files. On average we've lost the compression war.

    However, on 100 attempts, we've paid $100 x 100 or $10000 but we've won on half of these - netting us $250000 or a cool $240000 profit.

    Anyone want to demonstrate this?

    --
    Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. (Einstein)
    1. Re:Compression by edp · · Score: 2

      "... this challenge where the rules are 'compress this specific file which has been carefully generated to resist compression'."

      It's not so easy to generate a file to resist compression. If you had and used a method for determining whether a file could resist compression, then I could compress your generated file simply by counting how many compression-resistant files preceded it in some sort order and writing that number as a binary file. Decompression would proceed by counting that number of compression-resistant files in the same order and writing the file found at the count.

    2. Re:Compression by edp · · Score: 2

      I'll retract my previous comments. Trying to find a 1024-byte block of zeroes requires bits to record where the block is found, which consumes the potential savings. Whether counting compression-resistant files saves space depends on the choice of "compression-resistant."

    3. Re:Compression by edp · · Score: 3

      "Since this is not the case, ..."

      Sorry, not so. The proportion of compression-resistant files necessarily decreases as file length increases. To see this, consider that if a file had a nice block of 1024 bytes of zeroes somewhere in it, we could easily write a simple program that wrote the bytes before that block, then 1024 bytes of zeroes, then the bytes after that block. This program would easily fit in less than 1024 bytes, plus the bytes before and after the block of zeroes. So such a file is compressible. I will show that the proportion of files that are not compressible like this decreases as file length increases.

      To make this easy, I'll count for compressible files just the ones where a block of zeroes begins at a multiple of 1024 bytes from the beginning of a file. The chance that a single 1024-byte block (selected randomly from a uniform distribution) is not all zeroes is (2^8192-1)/2^8192, a very tiny bit less than one. The chance that all n blocks of n 1024-byte blocks are not zeroes is (2^8192-1)/2^8192)^n. Obviously this approaches zero as n increases. Thus the proportion of files that are not compressible decreases as file length increases. Pick a large enough length, and you are virtually guaranteed there is a 1024-byte block of zeroes in it somewhere.

      When we open compression up to a wider range of techniques, the same effect occurs; the proportion of files that are not compressible under any given, reasonable technique decreases as file length increases. This makes Goldman's challenge a bad offer for him -- the fact that the average file size change must be positive under any compression technique in the limit does not mean the probability of a positive change in a randomly selected file is large enough to give him a positive return on the challenge.

    4. Re:Compression by bmongar · · Score: 2

      Since you get to look at this file before hand there is little chance involved you can tweek make an algorithm to specifically compress that file, or at least drop it by a byte or two as in the above case.

      --
      As x approaches total apathy I couldn't care less.
  104. Re:Nope. by edp · · Score: 2

    "However, there should be no way for the decompressor to generate compression resistant files in order to enumerate them."

    Generating files is easy. You just iterate through them: empty file, all the one-byte files (0x00, 0x01, 0x02, ... 0xff), all the two-byte files (0x0000, 0x0001, 0x0002, ... 0x00ff, 0x0100, 0x0101, ... 0xffff), all the three-byte files (0x000000, 0x000001, ...), et cetera. It would take a while, but time isn't the issue.

    Of course, the decompressor wants to count compression-resistant files, but it does this by generating each file, as described above, and then running it through the compressor. If it does not come out smaller, it is compression-resistant.

  105. Re:But then what about "compressible" files. by edp · · Score: 2

    "So you're running them through an external compressor ..."

    No, it is not external. The compressor and decompressor are written with an internal copy of the compressor known to be used by the person who generated the file.

    "... your decompressor ONLY decompresses 'uncompressable' files"

    That is the point here. This thread is about how one might go about compressing files that had been generated by somebody who filtered out compression-resistant files.

  106. Ironic - Goldman's challenge is no proof. by ckedge · · Score: 2

    Section 9.2 of the FAQ, the Counting Theorem, says that "no program can compress without loss *all* files of size N >= 0".

    Mike Goldman's challenge explicitly involves "a specific known file" of size N, where both N and the data within are arbitrary and predetermined.

    Mike's intent is obvious, but his logic is horrible. It's quite clear that Mike's challenge has HUGE gaping holes in it's mathematics, statistics, and logic as compared to the Counting Theorem. There's nothing "for sure" about Mike's challenge. Just reading it by itself made me sure he's going to lose, if not this time, then another time. He set himself up to lose.

  107. sorry! by grammar+nazi · · Score: 2
    I will avoid splitting the sentences in the future.

    What were the /. readers' opinions on this subject? It sounds to me like the person offering the $5000 was scamming everyone because it's theoretically impossible to win. This fellow from Japan knows this, but due to a wording technicality in the contract, he figures he can scam the $5k from the scammer. They both seemed to be good natured chaps about it, but I feel that they still have some deep resentment towards each other.

    Any other opinions?

    --

    Keeping /. free of grammatical errors for ~5 years.
  108. Re:There is no solution... by grammar+nazi · · Score: 2
    you are correct that eventually the sequence of characters will come up. However, in 1/2 of the cases, your initial Z will be larger than the size of the string of characters, whether it is in magnitude or precision (or both).

    A similar approach may be used with the logistic map (a=4 ... X(n+1)=4 X(n) * (X(n)-1), 0X(n)1 )

    --

    Keeping /. free of grammatical errors for ~5 years.
  109. it's "file size", and that's all it is by mr.ska · · Score: 2
    However, the decompressor DOES depend on the filename numbering in order to work; it is part of the algorithm and those bytes should be charged against the decompressor size. Then Patrick loses.

    Well, if Mike had specified anything other than "file size", I'd agree. But he said "file size". And the file size of what Patrick sent is less than the file size of the original file. Balls to directories taking up space, or more files meaning more space is taken up. File size is file size.

    I don't believe that the intent was followed. However, as this challenge was "impossible" to begin with, I think that the $5K should go to Patrick for creativity. He met the letter of the challenge, which in this case should be good enough as the challenge as it was intended was "impossible".

    Mr. Ska

    I slit a sheet
    A sheet I slit

    --

    Mr. Ska

  110. A real solution - Re:I have an idea... by Zeinfeld · · Score: 2
    Just find the spot in pi that contains a string of numbers that can represent the file. It's gotta be in there somewhere.

    First off it would be better to choose a pseudo-random sequence that is easier to generate, something like RC4 which is designed to run fast.

    Secondly the index into the random stream would increase with the number of bits, so there might be a section of Pi that met the criteria, chances are that it would be a very large number of bits out.

    There is however a possible approach that is arguably not cheating. Chances are that the person setting the challenge did not generate a perfect random stream. If they are generating megabytes of the stuff they are most likely using some pseudorandom number generator to do so.

    Cryptanalysing the input stream to identify the source could well reveal the random number generator. This would not be a cheat in my opinion.

    Of course the person setting the challenge could have done the job properly as I did for a one time pad system once. What I did there was to hook up a microphone, record a reasonably random source (a loudish fan) and sampled at 16 bits. Then after determining that the ergodicity of the source was at least 4 bits per byte I ran blocks of 1K of data through a modified version of MD5 to remove any bias in the input stream.

    --
    Looking for an Information Security student project suggestion?
    Try http://dotcrimeManifesto.com/
  111. Re:'Did I compress anything?' by markmoss · · Score: 2

    Re ADFS, Goldman apparently believed that the FAQ on his website excluded storing parts of the file in the catalog, for instance by creating a filename out of the first N data characters. If it had been explicitly stated in the two paragraphs of the challenge, I think that would have prevented the use of the ADFS to store a small file in the same allocation block as the catalog, but Goldman was sloppy...

    You cannot get real space saving's by Patrick's method -- the extra directory entries created will always be larger than the data removed from the file. Think of the smallest possible directory structure -- an array of pointers. (Files aren't named, you get them by number. Files are stored sequentially, so you don't need a length -- EOF is the last byte before the next file starts.) To accomodate files up to 2^n, the pointers have to be n bits long. Patrick's trick is to split the file at a particular sequence of m bits, m being less than n so you can be sure that that sequence will be found within a 2^n length random file. So you take away m bits and add n bits for every split.

    But Goldman's challenge was not clearly worded in this regard, even though he was apparently aware of possible cheats involving hiding data in the directory. As Patrick said, if a 1,000 byte file could have been duplicated by a 499 byte decompression program and a 500 byte data file, there is no doubt that it would have met the challenge even though it would have taken more HD space.

  112. Psuedo-random generators by markmoss · · Score: 2

    I suspect that a good enough mathematician with the right expertise probably could meet that challenge without "cheating" by shifting data into the directory. In the simplest case, if Goldman derived the file from a psuedo-random number generator, you guess the formula used, then your "decompressor" is the generator and your "compressed file" is the seed value. More generally (like if the data was truly random from an inherently random quantum process), you try matching pseudo-random (did I finally get the spelling right?) formulas against parts of the file until you hit something that fortuitously matches for long enough. Snip that segment out of the data file. Write the "decompressor" to copy all the bytes before, then generate the pseudo-random sequence, then copy the rest of the bytes.

    However, I think if you are genius enough to make that work, you probably have easier ways of making $5,000...

    1. Re:Psuedo-random generators by markmoss · · Score: 2

      I'm not much into this sort of mathematics -- but one thing I do know is that people can see "patterns" in purely random data. (Constellations, Schiaparelli's Martian "canals", all mystics, ...) And an infite random file should include every possible finite sequence. That is, if it's really random and really long, somewhere in there should be 1,2,3,4,..., and 2,4,6,8,..., and so on. What you need for this challenge is to find just one sequence in a finite random file that continues for long enough to exceed the length of the program that generates it, by enough to also cover the code to copy the rest of the data from file to file.

      That is, you need a program less than m bytes long that copies bytes 1 to n, generates 1,2,3,...,m, and copies bytes n+1 through the end of file. Since the program size doesn't scale with the length of the sequence, there is some m for which this will be true. And then you ask for a random file long enough to ensure that 1,2,3,...,m will be somewhere in it. (I don't know how to calculate the required length, and I've got a feeling the file might have to be sent as truckloads of 40G tapes...)

      Of course, the challenge doesn't promise that the file will be entirely random. It might be the output of a quantum random number generator, filtered to eliminate obvious sequences. But if it's filtered, it's not random, and some form of actual compression should give a slight reduction.

      This is not a usable compression algorithm -- it's just a one-time trick that squeezes a few bytes out of one particular extremely large file, with far more work than compressing any file is worth. But it would meet the terms of the challenge.

  113. That wasn't in the challenge by drew_kime · · Score: 4

    If the guy had read the FAQ, he would know that such tricks are not compression

    Did the rules as stated say that the solution had to comply with the principle laid out in the FAQ? Not that I can see. If you're going to pup up a $5,000 challenge, you damn well better post all the rules.

    For instance: I had a friend who entered a paper airplane contest. He signed up for the "maximum distance" part. The only rules were that the contest organizers supplied the single piece of paper, you had five minutes to fold or tear it however you want to, and then everyone threw them at the same time. The one that stops moving farthest from the launch line wins.

    He waited until 4:45 into the time, then crumpled the paper up into a ball and threw it. While everyone else's planes were circling lazily around, his went straight, landed, and rolled several more feet.

    While you can debate all you want about whether his wad of paper counted as a "paper airplane" -- and we did debate (hey, it was college) -- he did saitsfy the rules of the contest. Had there been enough money on the line for someone to threaten legal action, I suspect he would have won.

    --
    Nope, no sig