Slashdot Mirror


First Hutter Prize Awarded

stefanb writes, "The Hutter Prize for Lossless Compression of Human Knowledge, an ongoing challenge to compress a 100-MB excerpt of the Wikipedia, has been awarded for the first time. Alexander Ratushnyak managed to improve the compression factor to 5.86 and will receive a 3,416-Euro award. Being able to compress knowledge well is believed to be related to acting intelligently." The Usenet announcement notes that at Ratushnyak's request, part of the prize will go to Przemyslaw Skibinski of the University of Wroclaw Institute of Computer Science, for his early contributions to the PAQ compression algorithm.

191 comments

  1. Just goes to show... by Anonymous Coward · · Score: 0

    It's not what you know, or who you know, but how much you can compress what you know.

    Although I could tell a fair few stories about crushing students into phone booths.

    1. Re:Just goes to show... by Anonymous Coward · · Score: 0

      Although I could tell a fair few stories about crushing students into phone booths.

      You could probably tell a lot of stories about packing fudge, too.

  2. what the hell? by hjf · · Score: 0

    last time I checked, you couldn't make a "lossy" compression of text.

    oh... nevermind. I forgot about teens with SMS.

    1. Re:what the hell? by Barny · · Score: 3, Funny

      Hehe, and remember the study a few months back that found that the order of letters within a word is unimportant, so long as the first and last are correct? Potentially big words could be listed merely by how many of each letter they have, then randomly mix them up inside, order doesn't matter :)

      --
      ...
      /me sighs
    2. Re:what the hell? by Anonymous Coward · · Score: 1, Informative
      Hehe, and remember the study a few months back that found that the order of letters within a word is unimportant, so long as the first and last are correct?

      Remember the study a few years back that proved this to be false, pointing out that slpmiy rnisreveg the ioiretnr lrettes mdae it sltnacifingiy mroe dluciffit to raed the wdros?

      Now let us never speak of this again.
    3. Re:what the hell? by SillySnake · · Score: 1

      It took me 10 minutes to figure out "ioiretnr" :-/

    4. Re:what the hell? by McFadden · · Score: 1
      It took me 10 minutes to figure out "ioiretnr" :-/
      Presumably this was after you had already figured out "reversing" and still drew a blank.
    5. Re:what the hell? by Anonymous Coward · · Score: 0

      Who cares?

      If you're looking for a place to put lossy compression to use in text, it's still an interesting effect to help choose which information is lost.

    6. Re:what the hell? by Carewolf · · Score: 1

      Not a bad idea, you could always rearrange them to the right order using a dictionary.

    7. Re:what the hell? by Trogre · · Score: 1

      Which would work fine, until it had to distinguish words like salt and slat.

      Or words not in a dictionary, like GetPidOf()

      --
      "Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
    8. Re:what the hell? by Schraegstrichpunkt · · Score: 1

      "GetPidOf()"... wasn't that renamed to "Bribe()" some time ago?

    9. Re:what the hell? by SamSim · · Score: 1

      Yeah, and remember when that turned out to be baloney?

    10. Re:what the hell? by thre5her · · Score: 1

      Weird, I just played FFI for the first time, and I could read that perfectly.

  3. Seems like a strange contest by Salvance · · Score: 4, Interesting

    It turns out that Alexander Ratushnyak was the only person to even meet the challenge's guidelines, and one of only 3 people who submitted entries. Seems like a strange thing to award up to 50,000 Euro for ... his compression was only 20% smaller than what I just quickly did with gzip. I'm rather surprised that more people didn't try though ... it's not like people are throwing money at data compression researchers.

    --
    Crack - Free with every butt and set of boobs
    1. Re:Seems like a strange contest by Barny · · Score: 2, Informative

      20% of a few petabytes is.... a lot :)

      And yes, offering cash in this way is a great incentive for programers.

      Also, if its a cpu friendly (read: reasonable cpu useage or even able to be offloaded on a processing card) method it could potentially add 20% onto your bandwidth :)

      --
      ...
      /me sighs
    2. Re:Seems like a strange contest by Raul654 · · Score: 3, Interesting

      There aren't many Wikipedia related things I don't know about (I'm a Wikipedia administrator, arbitrator, and bureacrat, featured article director, and I'm on the Wikimedia Foundation's Press Committee), and this is the first time I've ever heard of this contest. I think it's fair to say it's been relatively low-profile.

      --


      To make laws that man cannot, and will not obey, serves to bring all law into contempt.
      --E.C. Stanton
    3. Re:Seems like a strange contest by QuantumFTL · · Score: 1

      It turns out that Alexander Ratushnyak was the only person to even meet the challenge's guidelines

      Wow, that makes me really wish I had entered. There are some great multipass lossless text compression systems that would work well for Wikipedia...

    4. Re:Seems like a strange contest by Anonymous Coward · · Score: 0

      "And yes, offering cash in this way is a great incentive for programers."

      Except, with this contest as an example, it clearly isn't.

    5. Re:Seems like a strange contest by Shados · · Score: 1

      Keep in mind that the rule was that the file had to be a self executable. If you just did default gziping, you'd have to include the overhead of a self executable in the total.

    6. Re:Seems like a strange contest by Anonymous Coward · · Score: 0

      rm -r comes to mind.

    7. Re:Seems like a strange contest by Salvance · · Score: 5, Insightful

      20% is a lot, when the compression/decompression is fast. gzip/WinRAR/WinZip all compress this file to the 22-24MB mark in a minute or two on my desktop (which is comparable to their test machine). The winner's algorithm compressed the data to 16.9MB, but spent 5 hours and 900MB of RAM doing so. The contest would be far more interesting if it added a reasonable time/RAM restriction (e.g. 10 minutes and 500MB of RAM).

      --
      Crack - Free with every butt and set of boobs
    8. Re:Seems like a strange contest by muszek · · Score: 1

      but rm is lossy. you'd end up as one of those that didn't meet criteria guidelines.

    9. Re:Seems like a strange contest by gerbalblaste · · Score: 1

      the contest is still open and will likely be for a long time.

    10. Re:Seems like a strange contest by Barny · · Score: 1

      Uh, someone did it didn't they? :)

      --
      ...
      /me sighs
    11. Re:Seems like a strange contest by gameforge · · Score: 1

      The overhead for both compression and decompression on my box anyway looks to be about 50k (the size of my /bin/gzip).

      No clue whether it requires more code to compress than to decompress, but I would guess your overhead would be less than 30k. On an order of maybe 20MB of compressed data, that's not much.

      GZIP is a bad example because it also works with binary data; it's too much. If you only have to worry about encountering 128 possible values, compression can get real interesting, and there are a lot more algorithms out there, though I'm too lazy to find some examples.

      In this case, since the data looks to be fixed (this particular 100MB of text, not just any 100MB of text), it seems to me that you could really optimize the bejebus out of it - perhaps there's a large abundance of a particular word or combination of words, etc. If I was going to put some time into this, I'd run some heuristics type algorithms to find out what's unique about this specific piece of data that I could take advantage of.

      Unfortunately I am not a compression guru, so I can't really say any of this with certainty... just some thoughts.

    12. Re:Seems like a strange contest by complete+loony · · Score: 1

      Finding a way to compress just a little bit more, becomes an increasingly harder problem. You begin to hit entropy limits, and have to use methods that are only relevant to the type of data being compressed. For example, a good lossless audio compressor won't handle text data very well, and vice versa.
      Most compressors in this space use a prediction algorithm to generate a signal, and then only compress the differences between the generated signal and the actual data. To get the best possible compression, you have to build a system that has a perfect knowledge of the information being compressed and can then derive the source data again. Hence good compression is thought to be a good test of AI type systems.
      20% more compression may not seem like much on the surface. But this is a very difficult problem.

      --
      09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
    13. Re:Seems like a strange contest by Anonymous Coward · · Score: 0

      rm'ing wikipedia would be like subtracting a negative number. So its ‘loss’ would really be a net gain.

    14. Re:Seems like a strange contest by Anonymous Coward · · Score: 0

      I don't think efficiency of computational resources is the point of this contest. As I understand it, the idea is that maximal, ideal compression could only be gotten by actually *understanding* the stuff being compressed. So an algorithm better in terms of end size than any before presumably understands the Wikipedia text better than the previous algorithms did. Rinse & repeat, and hopefully you'll have an algorithm that will be closer to AI.

    15. Re:Seems like a strange contest by Anonymous Coward · · Score: 0

      True enough, though stress on the "one." :-P

    16. Re:Seems like a strange contest by QuantumG · · Score: 0, Redundant

      It was on Slashdot, how low profile is that?

      --
      How we know is more important than what we know.
    17. Re:Seems like a strange contest by CryoPenguin · · Score: 2, Informative

      Since the criteria for entry say that any new submission must beat the current record, it's no surprise that only 3 people are listed. You're not seeing any of the people who didn't win.

    18. Re:Seems like a strange contest by Firehed · · Score: 1

      The only problem is that if your heuristics are designed for this specific chunk of data, it'll end up being largely useless for a full database crunching. Mind you, they could still be useful, but you don't want them to be sample-specific when you're talking about orders of magnitude more data that will need to get compressed.

      I really don't know jack about how compression is done, but I can toss out an idea. Scan article text for all of the used characters... expect to find your A-Z a-z 0-9 and some characters - with luck, you'll still end up a decent way under 128 characters. Then scan through the articles for the (128-#) most frequently used words, and assign each of those words to have a 7-bit value. Or depending on the size of the article, use all 256 characters and give yourself a solid 150 words to do to the same thing, only with an 8-bit value. Then just do some sort of find-and-replace, turning all of the "the"s in an article to ç, "and"s to ü, and "then"s to ß. Or whatever (ideally, you'd find full phrases rather than just individual words; better yet, they'd capture anything in the 128-255 charset and eliminate the need for 8-bit encoding). Do some funky algorithm so that if you have an unusually large number of longer words (which would be reasonably likely in the more scientific articles), it'll check if replacing the dozen occurrences of "subwoofer" is more space-saving than the hundred uses of "of".

      Like I said, I'm no expert, but I don't think the above idea would be too hard to implement (if you want to try, go ahead; I'd appreciate 10% and a mention if you win a prize for it). So long as you have some sort of dynamic heuristics, if that'd be the correct term, I think it could be really efficient. Though, in honesty, I'd expect that's how it already works, at least to a limited extent.

      --
      How are sites slashdotted when nobody reads TFAs?
    19. Re:Seems like a strange contest by Anonymous Coward · · Score: 1, Insightful
      There aren't many Wikipedia related things I don't know about (I'm a Wikipedia administrator, arbitrator, and bureacrat, featured article director, and I'm on the Wikimedia Foundation's Press Committee), and this is the first time I've ever heard of this contest.

      Cut the rank pulling. Almost half a million people have a lower slashdot ID than you, thousands of them with much more important functions than you, but they don't see a need to brag about their positions every time they've given half a chance.

      I think it's fair to say it's been relatively low-profile.

      Unlike you, you mean?
    20. Re:Seems like a strange contest by Breakfast+Pants · · Score: 1

      And conversely, modding up your post would be like adding a negative number, so it would be a net loss.

      --

      --

      WHO ATE MY BREAKFAST PANTS?
    21. Re:Seems like a strange contest by tabrisnet · · Score: 1

      Lookup Huffman Encoding. It's been done.

    22. Re:Seems like a strange contest by RealGrouchy · · Score: 1
      There aren't many Wikipedia related things I don't know about ... and this is the first time I've ever heard of this contest.

      That's because the contest has nothing to do with Wikipedia; it just uses Wikipedia's content as a standard block of compressable data.

      Just like how the numbers in a Sudoku puzzle have nothing to do with math; they're simply a set of nine characters/symbols.

      - RG>
      --
      Hey pal, this isn't a pleasantforest, so don't waste my time with pleasantries!
    23. Re:Seems like a strange contest by dch24 · · Score: 1

      Does anyone know if WinRK has been submitted? It's probably pretty slow, and may go over the 10 hour time limit.

    24. Re:Seems like a strange contest by Raul654 · · Score: 1

      Why use Wikipedia then? There must be dozens (if not hundreds or thousands) of free sources of regular text out there.

      --


      To make laws that man cannot, and will not obey, serves to bring all law into contempt.
      --E.C. Stanton
    25. Re:Seems like a strange contest by Kris_J · · Score: 1

      WinRK uses PAQ technology. Since the winner in this story used a tweaked PAQ engine, I doubt that WinRK would perform better. Better than the WinZip, et al, examples people have been posting, but not better than the winning entry.

    26. Re:Seems like a strange contest by Kris_J · · Score: 1

      My Javascript Packer does that. To actually compete with the recent winner you need to start exploring far more complicated stuff.

    27. Re:Seems like a strange contest by Ninjaesque+One · · Score: 1

      That was the best post to post as AC, wasn't it?

      Anyways, the guy was talking about Wikipedia. Compare 'im to Wikipedian admins:

      http://en.wikipedia.org/wiki/Wikipedia:List_of_adm inistrators
      http://en.wikipedia.org/wiki/User:Crzrussian

      Just inserted a random admin there. Look at the long list of barnstars, like they were friggin' medals. Look at all that idiosyncratic stuff; userboxes innumerable, a random cabal-thing that I've certainly never heard of, and so on. Now, click the "administrators" category, then go to any other user-page in the list. You will see the same damn thing.

      --
      Ninjas and pirates. How piquant.
    28. Re:Seems like a strange contest by Skuto · · Score: 1

      Well, there *is* a restriction on the amount of time/CPU. Just read the website.

      Whatever is "reasonable" is just personal preference.

    29. Re:Seems like a strange contest by BoberFett · · Score: 1

      If the purpose of compressing information such as Wikipedia is for distribution for something such as the OLPC, compression time is largely irrelevant. Compression is done once, on one system, and the decompression is usually fairly fast.

    30. Re:Seems like a strange contest by r3m0t · · Score: 1

      That's a limit on the decompression program.

    31. Re:Seems like a strange contest by r3m0t · · Score: 1

      How about mine?

      http://en.wikipedia.org/wiki/User:R3m0t

      Seriously: not everybody's page is like that.

    32. Re:Seems like a strange contest by Anonymous Coward · · Score: 0

      Yes clearly, now that he's posted to the story, he still doesn't know about it.

    33. Re:Seems like a strange contest by vidarh · · Score: 3, Informative

      You miss the point. The goal isn't to achieve better usable compression, but to encourage research they believe will lead to advances in AI.

    34. Re:Seems like a strange contest by rbarreira · · Score: 1

      More importantly: why not use wikipedia? Read their FAQ before making irrelevant comments and questions here...

      --

      The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
    35. Re:Seems like a strange contest by rbarreira · · Score: 3, Insightful

      There aren't many women related things I don't know about (I have a girlfriend, I've had sex, my mother and my sisters are all women), and this is the first time I've ever heard of the "women's olympic weighlifting" contest. I think it's fair to say it's been relatively low-profile.

      --

      The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
    36. Re:Seems like a strange contest by Anonymous Coward · · Score: 0

      Hmm. Sounds like someone's got a case of jealousy...

    37. Re:Seems like a strange contest by navarroj · · Score: 1

      I wish I had mod points. Most of the "interesting" messages in this thread are plain wrong and stupid.

      If you had read TFA (I know, I know), you would knew that:
      * Alexander Ratushnyak was the only person to even meet the challenge's guidelines, because the guidelines for the contest are supposed to be hard to meet.
      * The contest is not about making a cool/fast/better general purpose compressor. It is a contest to see how good we can compress some particular body of text (a snapshot of Wikipedia), to investigate some interesting relations of compression with (artificial) intelligence.
      * The contest is not "closed" or just "finished", the contest is still open, anyone who thinks that can do better could send their copression tool for participation. As far as I remember, every time that someone can make better than the current record, he/she "wins the prize".
      * The posting about "I work a lot in Wikipedia and didn't knew about this, it should be low-profile" is just lame and stupid. This contest is trying to promote some very interesting area of research, and has nothing to do with Wikipedia other that they are using it as a body of text (human knowledge) to benchmark the tools and algorithms.

      Please visit the site and read the related information. It is actually a very interesting research topic, and the relations with maths and artificial intelligence are also insightful.

    38. Re:Seems like a strange contest by navarroj · · Score: 1

      Mod parent up. Yes, this is the goal of the contest. Not making a faster/more efficient compressor.

    39. Re:Seems like a strange contest by aug24 · · Score: 1

      The point of the exercise is not really to worry about how much time/memory it took, but to improves mechanisms for finding and then understanding structure within data.

      Perhaps two classes would be interesting: one with, and one without, time/space limitations.

      Justin.

      --
      You're only jealous cos the little penguins are talking to me.
    40. Re:Seems like a strange contest by archeopterix · · Score: 1
      The contest would be far more interesting if it added a reasonable time/RAM restriction (e.g. 10 minutes and 500MB of RAM).
      Maybe it would be more interesting, but it would also be a totally different contest. It isn't a contest for generic file compressors, goddamit! It's supposed to drive research in the field of knowledge representation, based on the supposition that in order to compress knowledge well, you have to find good representation for it. The compression factor is supposed to be a kind of benchmark for knowledge representation and the other characteristics useful in a generic file compressor (memory and cpu use) are irrelevant in this case.
    41. Re:Seems like a strange contest by GreatBunzinni · · Score: 1
      The contest would be far more interesting if it added a reasonable time/RAM restriction (e.g. 10 minutes and 500MB of RAM).

      What's the point of adding processing and/or memory requirements if the sole point of this prize is to squeeze the most information into the smallest possible package? After all in this case it's the end product that matters, not what it takes to get there.

      --
      Slashdot, fix your code or at least hire someone who is competent at it to do it for you.
    42. Re:Seems like a strange contest by Anonymous Coward · · Score: 0

      ...and most compression algorithms, including the one that won, use neural networks trained to predict what comes next (or some variant thereof), which is a way of representing knowledge (albeit "non-constructively", in that we don't know how it does it).

      The contest is doing exactly what it was designed to do.

    43. Re:Seems like a strange contest by pizza_milkshake · · Score: 1

      I tried. It's hard.

    44. Re:Seems like a strange contest by gurps_npc · · Score: 1

      Next year, his algorithm, running on a standard desktop will do it in 3 hours. By 2015, a standard computer using his algorithm will do it in that same minute or two.

      --
      excitingthingstodo.blogspot.com
    45. Re:Seems like a strange contest by Walter+Carver · · Score: 1

      Wikipedia (its structure, its events) is a much smaller topic than just "sports" or "women sports".

    46. Re:Seems like a strange contest by exp(pi*sqrt(163)) · · Score: 1
      There aren't many women related things I don't know about
      That's the funniest thing I've read in days!
      --
      Doesn't it make you feel good to know that our freedoms are protected by politicans, lawyers and journalists.
    47. Re:Seems like a strange contest by Iron+Condor · · Score: 1

      It's supposed to drive research in the field of knowledge representation, based on the supposition that in order to compress knowledge well, you have to find good representation for it.

      The problem that I see here is that we're precisely not compressing knowledge, but a certain, precisely-drawn-but-arbitrary slice of it. It might be possible to represent, say, all knowledge of classical electrodynamics in form of a bunch of equations, but how do you represent the subset of that knowledge contained in an arbitrary (but precisely proscribed) number of pages of a particular book? In the end, "dF=0" is an extremely compressed version of a lot of knowledge. More knowledge, than you can find in, say, chapter 3 of Jackson. And yet, compressing that chapter will require many more bytes than "dF=0".

      I think the problem here is that the project doesn't actually compress knowledge per se, but a certain, given representation of it (namely in the English language encoded in ASCII or whatever) and that the criterion for winning is NOT (un)packing the knowledge, but re-establishing that particular representation.

      --
      We're all born with nothing.
      If you die in debt, you're ahead.
    48. Re:Seems like a strange contest by QuantumG · · Score: 1

      No, it was on Slashdot when it was announced, I remember because I bookmarked the "Universal AI" book that it featured. here it is.. August 13.

      --
      How we know is more important than what we know.
    49. Re:Seems like a strange contest by rbarreira · · Score: 1

      But this is not a wikipedia event! It's just an event which uses wikipedia data...

      If I decide to throw a Mercedes off a cliff, is it a Mercedes event? Are Mercedes executives supposed to know about it?

      --

      The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
    50. Re:Seems like a strange contest by Walter+Carver · · Score: 1

      Good point. I was quick to reply in what I thought as sarcasm.

    51. Re:Seems like a strange contest by dch24 · · Score: 1

      In the wikipedia article I linked, it suggests WinRK could still outperform a tweaked PAQ engine. But it seems it has not been submitted.

    52. Re:Seems like a strange contest by adamgolding · · Score: 1

      yes, but of course the speed of the compression is important for AI--if the compression would take longer than the lifetime of an organism it's not a reasonable model of how that organism thinks. We probably perform fast compression in real-time as we 'understand' things, but other, slower, forms over our lifetimes.

  4. Lossless from Lossy? by Anonymous Coward · · Score: 2, Funny

    "The Hutter Prize for Lossless Compression of Human Knowledge, an ongoing challenge to compress a 100-MB excerpt of the Wikipedia"

    Wikipedia? Knowledge? Isn't that already a lossy compression mechanism?

    1. Re:Lossless from Lossy? by ClamIAm · · Score: 1

      I find that most Wikipedia articles suffer from the reverse: poor writing style and trivia make them much larger than needed. Concise, cruftless articles would probably be a better way of reducing the disk/archive needs of Wikipedia.

  5. Hmm... by Cinder6 · · Score: 4, Funny

    Am I the only one who finds it slightly ironic that (as of this writing), there is no entry for the Hutter Prize on Wikipedia?

    --
    If you can't convince them, convict them.
    1. Re:Hmm... by iamwoodyjones · · Score: 1

      Give me a few minutes...;-)

    2. Re:Hmm... by Evil+Pete · · Score: 1

      But there is for the PAQ algorithm (see link in summary) with mention of the awarding of the Hutter prize.

      --
      Bitter and proud of it.
    3. Re:Hmm... by CastrTroy · · Score: 2, Informative

      While this is not article on the Hutter Prize itself, you will be relieved to know that it is mentioned in the article on Marcus Hutter

      --

      Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
    4. Re:Hmm... by megaditto · · Score: 1

      Stand by for a picture of a giant phallus.

      Seriously though, I respect the guy for giving up half his prize to the other algorithm developer.

      --
      Obama likes poor people so much, he wants to make more of them.
    5. Re:Hmm... by cciRRus · · Score: 1

      It's there, just that you'll need to decompress it before you can read it.

      --
      w00t
  6. For comparison .... by Ignorant+Aardvark · · Score: 3, Informative

    For comparison purposes, WinRAR on its best setting only gets this down to 24MB. Doubtless 7zip could get even lower, but I don't think either could crack the 17MB mark. And certainly neither of those would be self-extracting, which this contest requires.

    1. Re:For comparison .... by Ksempac · · Score: 1

      If you read a bit further than the main page you find this :

      Relaxations
      * In lieu of a self-extracting archive, a decompressor program decomp8.exe plus a compressed file archive8.bhm may be published, where decomp8.exe produces data8 from archive8.bhm. In this case, the total size is S := length(decomp8.exe)+length(archive8.bhm).


      7zip is less than 0.9MB and i guess you could strip it down to less than that by removing a lot of useless features (GUI, Support for others formats). So that's a not big overhead, and you re free of the "self-extracting" rule.

    2. Re:For comparison .... by RuBLed · · Score: 1

      and surely you could do the compression in less than 5 hours...

    3. Re:For comparison .... by thue · · Score: 1

      I tried compressing it with gzip, bzip2, and lzma programs. I tried posting the results, but they do not fit within the lameness filter :(.

      The best result was with lzma, the algorithm used by 7zip, which got it down to 25,188,131 bytes. So the 17MB achieved in this contest is pretty impressive.

  7. I want by Anonymous Coward · · Score: 0

    a Lossless Compression of the Internet

  8. makes one wonder by v1 · · Score: 1

    How slow this code is. Usually when you are trying to squeak out another 1% of space in a file your algorythm triples in size and quadruples in runtime to get that improvement. I wonder how practical this new algorythm really is. Probably not so much an issue nowadays but I also wonder how big the compression program is. This used to be a really big deal many years ago when the compression programs could easily be larger than the data they were compressing.

    --
    I work for the Department of Redundancy Department.
    1. Re:makes one wonder by CastrTroy · · Score: 1

      Accoring to somebody who posted above (he probably read it in the article, but i'm not about to look there), the program took, 5 hours and 900MB of RAM. I find it's interesting that he only got down to 17MB, while consuming all those resources. It makes it seem like it isn't even worth it. I'm sure you could find some pretty small program that would compute pi to some very large number of digits, and find the wikipedia excerpt in the results, but It would take a very long time to run.

      --

      Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
    2. Re:makes one wonder by mh101 · · Score: 1
      FTFA:
      Restrictions: Must run in less than 10 hours on a 2GHz P4 with 1GB RAM and 10GB free HD.
      If someone's program takes close to 10 hours to compress 100MB it better get considerably more than just an additional 1% out of it!

      --
      Duct tape is like the Force. It has a light side, a dark side, and it holds the universe together.
    3. Re:makes one wonder by Alaria+Phrozen · · Score: 2, Insightful

      If you'd RTFA you'd find the running times ranged from 30 minutes to 5 hours. They have a whole table and everything.



      The whole point of the challenge was to create a self-executing compression program that made a perfect copy of their 100MB file. Final file sizes were in the 16MB range. Geeze, seriously RTFA.

    4. Re:makes one wonder by infaustus · · Score: 1

      If you'd rtfo, you'd know that it took 5 hours. But point taken.

      --
      Frosty piss posts are worthless, GNAA posts are worthless and hurtful, but they are the least of this site's neuroses.
    5. Re:makes one wonder by AxelBoldt · · Score: 2, Informative
      I'm sure you could find some pretty small program that would compute pi to some very large number of digits, and find the wikipedia excerpt in the results, but It would take a very long time to run.
      In all likelihood, that trick won't work: your program would need to know the start index of the Wikipedia text in the digit sequence of pi, and that index would be so astronomically huge that writing it down would be about as long as writing down Wikipedia itself. As a little example of the effect, think about finding the first occurrence of '0000' in the digit sequence of pi, and ask yourself how far out that is likely to happen.
    6. Re:makes one wonder by Henry+V+.009 · · Score: 4, Funny

      Assuming a random distribution of digits in pi (and why does everyone assume this? -- there is certainly no proof), the odds are that you'll see one sequence of 4 zeros every 10,000 (decimal) digits. On the other hand, you'd expect to see 4 zeros once every 16 binary digits. Wikipedia, at 100MBs would be expected to be found once every 2^800,000,000 binary digits of pi.

      Then again, maybe we're lucky, and this universe is God's way of storing the universal encyclopedia in the digits of pi. Wikipedia might be up near the front somewhere.

    7. Re:makes one wonder by arrrrg · · Score: 1

      Tangentially related and mildly amusing trivia:

      Feynman Point. From the Wikipedia article: "The Feynman point comprises the 762nd through 767th decimal places of pi ... The name refers to a remark made by the physicist Richard Feynman, expressing a wish to memorise the digits of as far as that point so that when reciting them, he would be able to end with '... nine, nine, nine, nine, nine, nine, and so on.'"

      So, I dunno about "0000," but for some interesting sequences we may get lucky :).

    8. Re:makes one wonder by Anonymous Coward · · Score: 0
    9. Re:makes one wonder by NoGuffCheck · · Score: 1

      think about finding the first occurrence of '0000' in the digit sequence of pi, and ask yourself how far out that is likely to happen

      ahh, but pi is exactly 3!

      --
      serenity now!
    10. Re:makes one wonder by arth1 · · Score: 1
      If someone's program takes close to 10 hours to compress 100MB it better get considerably more than just an additional 1% out of it!

      Indeed. Who in their right mind would spend 10 hours compressing instead of simply moving the entire dataset in a fraction of the time?

      Oh, and for a good compression scheme:

      uuencode filename filename | mail external@stora.ge

      Then all the decompresser needs to do is fetch the mail and uudecode it.

      Regards,
      --
      *Art
    11. Re:makes one wonder by Spikeles · · Score: 1
      think about finding the first occurrence of '0000' in the digit sequence of pi, and ask yourself how far out that is likely to happen.
      That's easy... The string 0000 occurs at position 13,390 counting from the first digit after the decimal point. The 3. is not counted. http://www.angio.net/pi/piquery
      --
      I don't need to test my programs.. I have an error correcting modem.
    12. Re:makes one wonder by dido · · Score: 1

      But again, this assumes that Pi is a normal number. So far, nobody knows for sure whether or not this is true. I don't know why everyone seems to make this unfounded assumption when dealing with Pi. Attempts to establish the truth of this matter are one of the main reasons why mathematicians are engaged in calculating Pi to billions and billions of digits; there is no other practical use for that many decimal places of accuracy; 39 digits of the number would be enough to calculate the circumference of the known universe given its radius to within the diameter of a hydrogen atom.

      --
      Qu'on me donne six lignes écrites de la main du plus honnête homme, j'y trouverai de quoi le faire pendre.
    13. Re:makes one wonder by Gromius · · Score: 1

      Mildly amusing, I read the article. Incidently it states that 0000 (actually 000000) occurs at position 1,699,927 (and no this wasnt added in the last hour as a joke).

    14. Re:makes one wonder by Gromius · · Score: 1

      Ah I'm an idiot and cant read the article properly. Nevermind.

    15. Re:makes one wonder by Fyz · · Score: 1

      But suppose it's somewhere near the front, except there's a speling mistake every place "speling" occurs. Would that be proof that God isn't infallible?

    16. Re:makes one wonder by x2A · · Score: 1

      How about storing how long it would take a room full of monkeys to type it?

      --
      The revolution will not be televised... but it will have a page on Wikipedia
    17. Re:makes one wonder by CastrTroy · · Score: 1

      I was just pointing out how resource intensive this program really was. I mean, sure you could use this program for compression (i'm interested on how it would perform on other pieces of data, or whether or not there was specific hacks put in for the data set), but it consumes a ton of resources and doesn't really give you something that much better than using gzip. At least not good enough that it's worth spending 5 hours decompressing 100 MB worth of data. Oh, and you could fit very large numbers in a very small space, as long as you use the correct notation. For instance, If I wanted to store the number 3.4028236692093846346337460743177e+38, you could store it as I've shown, or as a really long integer, (which I'm not going to write out) or you could store it as (2^64)^2, which is would use much less space.

      --

      Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
    18. Re:makes one wonder by oopsdude · · Score: 1

      There is no way to prove that something is random - what would the test look like? We don't have an algorithm for producing truly random numbers, so how could we test if a set of numbers are random? However, no one has ever been able to find a pattern to the digits of pi, which is as close as one can come to proving randomness.

      Also, (I could be wrong on this, but) since pi is irrational, it can't have a pattern. Having a pattern would mean that it could be expressed in the form a/b where b!=0, which would make it rational.

    19. Re:makes one wonder by AxelBoldt · · Score: 1
      Also, (I could be wrong on this, but) since pi is irrational, it can't have a pattern.
      Irrational numbers can't have a repeating pattern such as 0.123012301230123..., but they can still have a pattern, for instance 0.1230123001230001230000123...
    20. Re:makes one wonder by AxelBoldt · · Score: 1
      Oh, and you could fit very large numbers in a very small space, as long as you use the correct notation.
      That works only for very few numbers; you would have to be extraordinarily lucky for this trick to work with the starting index of Wikipedia in the digit sequence of pi.
    21. Re:makes one wonder by Anonymous Coward · · Score: 0
      We don't have an algorithm for producing truly random numbers, so how could we test if a set of numbers are random?
      What kind of argument is that? You don't need to be able to produce something with some property to test for that property. For example, we don't know any odd perfect number, but we can test if a number has the property.
    22. Re:makes one wonder by prichardson · · Score: 1

      It would be proof that God has a sense of humor.

      --
      Help I'm a rock.
    23. Re:makes one wonder by shadow_slicer · · Score: 1

      It'd be proof that man can't spell

    24. Re:makes one wonder by Anonymous Coward · · Score: 0

      Not so sure about the rest of your message (seem to imply that research into alternate ways of doing things is stupid since gzip works), but you bring up a good question. Is it possible to reduce knowledge down to a simpler collection? This would be simliar to what you did for a very specific number 3.4028236692093846346337460743177e+38 = (2^64)^2 .

  9. Lookup table? by Anonymous Coward · · Score: 0, Interesting

    I wonder if you could just store all the common english words into a lookup table indexed by a number (32 bit enough?). So every word then gets replaced by a number which would then be compressed further. Only catch is the client needs a copy of the lookup table which might be feasible.

    1. Re:Lookup table? by LordEd · · Score: 1

      It would not be feasible in this competition because either the archive must be self-extracting, or that the decompression app's size is included in the final result.

      A lookup table of 32 bits would immediately cause an overhead of over 20 GB bytes (assuming an average of 5 letters per word, which is probably too small for the number of word options in 32 bits).

    2. Re:Lookup table? by MichailS · · Score: 1

      Maybe it could be put into a universal "english library" that the decompressor looks into, and the words are put in a 1000x1000 matrix (there are about E+6 words in english)? That could be addressed with only 2x10 bits.

      Of course, words can be bent and you need spaces and such, but in average each word should be able to compress into three bytes of data. The whole bible is about 800 000 words, which would mean 2-3 MB of data compressed.

      Of course, you might need the "King James' English Decompressor Library", the "New American Standard English..." et cetera.

      Also, it would make for some really interesting transcriptions using say a german decompressor library...

    3. Re:Lookup table? by Anonymous Coward · · Score: 0

      [Disclaimer: I'm not the GP AC.]

      There are 14,480,868 words, of which 354,032 are unique and a little over 50% are only used once. Uncompressed, the "dictionary" is 3,144,996 bytes (354,032 unique words). The dictionary compresses to 1,363,936 bytes with bzip2 --best, or 1,150,615 bytes with gzip --best.

      Since there are only 354,032 words, that's 18.4 (round to 19) bits per word. 19 * 14,480,868 / 8 = 34,392,061.5. Thus, it would take about 32.8 Megs to encode all the words as 19-bit lookups into a 1.1 Meg table. That makes just under 34 Megs, or about twice as big as the current best entry, and we haven't even considered compressing the other 1% of the file: punctuation. Ouch.

      However, the word frequency varies from 1 to 714917 ("the"). Only about 10% of the words appear at least 20 times, only about 5% appear at least 58 times, and only about 1% appear at least 449 times. Those top 1% words account for 10,820,939 of the total 14,480,868 words, so we know we could save at least 7 * 10,820,939 / 8 = 9,468,321.625 or about 9 Megs using Huffman encoding. Subtracting 34 Megs - 9 Megs = 25 Megs. That's still a really long way from 17 Megs though.

  10. compress knowledge = intelligence by Profound · · Score: 2, Insightful

    Being able to compress knowledge well is believed to be related to acting intelligently. - IHNPTTT (I Have Not Passed The Turing Test), but while my brain is good at remember the gist of knowlege, but really bad at losslessly recalling it.

    1. Re:compress knowledge = intelligence by illuminatedwax · · Score: 1

      intelligence != acting like a human

      --
      Did you ever notice that *nix doesn't even cover Linux?
    2. Re:compress knowledge = intelligence by Kris_J · · Score: 1
      intelligence != acting like a human
      Truer words were never typed.
    3. Re:compress knowledge = intelligence by qbwiz · · Score: 1

      Intelligence != being able to store Wikipedia in a small space. In fact, according to Wikipedia, programs are already able to predict what comes next in English text approximately as well as humans, but they aren't as intelligent as humans.

      If we want to make something less intelligent more intelligent, it seems likely that at some point, when its intelligence is in some sense the same as the average human's, that its behavior would be more similar to a human's than it is now. Of course, that's hardly a certainty.

      --
      Ewige Blumenkraft.
    4. Re:compress knowledge = intelligence by Anonymous Coward · · Score: 0

      IHNPTTT

      What are you, a very small shell script?

    5. Re:compress knowledge = intelligence by Iron+Condor · · Score: 1

      If we want to make something less intelligent more intelligent, it seems likely that at some point, when its intelligence is in some sense the same as the average human's, that its behavior would be more similar to a human's than it is now.

      Why would it?

      The first flying machines we built weren't very good. Today we have flying machines that fly much better than birds. At now point in the intervening time did we ever have a machine that behaved anywhere similar to a bird.

      A point could be made that airplanes are useful exactly because they do NOT behave like birds. In the same vein, I do not expect intelligent mechanisms to behave like people and their value to be derived from exactly that dissimilarity.

      --
      We're all born with nothing.
      If you die in debt, you're ahead.
    6. Re:compress knowledge = intelligence by 4D6963 · · Score: 1

      IHNPTTT (I Have Not Passed The Turing Test)

      Dude, do you even know what you're talking about? Failing at the Turing Test means that a human during a 30-minute chat with you (simulataneously with a "robot") could tell you're a robot/computer program and not a human (and think the robot sounds more human than you). As retarded as you can be, you would still pass the test. Actually, you wouldn't need 30-minutes to pass this test, one line such as the one you just typed would be enough for anyone to know that Artificial Stupidity is centuries away from being able to generate such a thing ;-)

      --
      You just got troll'd!
  11. Intelligence revisited ... by foobsr · · Score: 1

    From the contest specs:
    thus reducing the slippery concept of intelligence to hard file size numbers

    I doubt the instance which wrote that was cynical.

    CC.

    --
    TaijiQuan (Huang, 5 loosenings)
    1. Re:Intelligence revisited ... by Skuto · · Score: 1

      They were certainly not, and they are right.

      The contest produces a hard, verifiable result with a hard restriction on resources that you can use to attain it.

      If you look at the state of AI research, then you will understand that introducing some cold hard numbers wont hurt.

    2. Re:Intelligence revisited ... by rbarreira · · Score: 1

      I recommend that you read this before making any further irrelevant comments.

      --

      The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
    3. Re:Intelligence revisited ... by foobsr · · Score: 1

      And you, Sir, perhaps first read some more basics before you try to educate me.

      Quote, just one example of a broader definition: A second definition of intelligence comes from "Mainstream Science on Intelligence", which was signed by 52 intelligence researchers in 1994: a very general mental capability that, among other things, involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience. It is not merely book learning, a narrow academic skill, or test-taking smarts. Rather, it reflects a broader and deeper capability for comprehending our surroundings--"catching on", "making sense" of things, or "figuring out" what to do. (reprinted in Intelligence Gottfredson, 1997, p. 13)

      And also, to let you attain some historical background, I would advise you to have a look at "Human Problem Solving" by Newell and Simon.

      Finally, a real challenge would be the self-inventing compression method, not the self-extracting algorithm.

      CC.

      --
      TaijiQuan (Huang, 5 loosenings)
    4. Re:Intelligence revisited ... by foobsr · · Score: 1

      If you have a reductionist approach, you should name it.

      Secondly, there is definitely more to consider if you talk about intelligence, be it artificial or not.

      Third, a set of reasons for the state (better the dynamics) of AI research that certainly still applies was given by Drew McDermott thirty years ago: "Artificial intelligence meets natural stupidity".

      CC.

      --
      TaijiQuan (Huang, 5 loosenings)
  12. I think compressing porn would be more useful by Anonymous Coward · · Score: 0

    Then, we can use the rest of the leftover space to store the sum of all human knowledge in an uncompressed form.

    - Dave

  13. Done before.... by gmuslera · · Score: 1

    ... by Douglas Adams. Wonder what rate of compression is putting the meaning of life, universe and everything in just the number 42.

    1. Re:Done before.... by netsharc · · Score: 1

      There was a small story in the HHG series about aliens who encoded the whole of human knowledge into a number, took a meter ruler, and made a precise mark between 0 and 1. But then they chipped the ruler while taking it home.

      Or was it some other book?

      Nice idea, but I guess you can only get to a certain precision before reaching the Nanometer limit..

      --
      What time is it/will be over there? Check with my iPhone app!
  14. I suppose the Windows equivalent of by WhatDoIKnow · · Score: 1

    wget "http://cs.fit.edu/~mmahoney/compression/enwik8.zi p" | gunzip

    doesn't count?

    :wq

  15. mods... by Anonymous Coward · · Score: 0

    ..that isn't offtopic at all, it is a rather cute and harmless example of a compressed first post that is actually ON topic. And, no I am not the submitter. Now normally I never chastise here, nor pay much attention to the frosty pissers, but c'mon, cut the dude some slack, it's a +1 funny at least.

  16. Anyone notice the judges are white supremacists? by Anonymous Coward · · Score: 0

    At least one of them, Jim Bowery. He posts online under the nickname Baldrson and his personal homepage is filled with self published articles about how the jews are responsible for the aids and such.

  17. what matters most... by ElitistWhiner · · Score: 1

    is what you get when you run the new compression backwards!

    This reminds me of the cryptographer paid to crypto-compress datastreams into musical notation. The work laid the foundation to real time packet sniffing of the Internet

    Cryptographer? Got atta boys

  18. hah! compression good, speed, ram.. nope by Anonymous Coward · · Score: 0

    it took 5 hours, with 900 megs of ram..

    i dont even have half that processing power in my own brain as well as my computer!

  19. A simple solution in 3 steps... by plaxion · · Score: 0

    Step 1: Read in the 100MB file as binary
    Step 2: Scrub/remove all the 1's (they're bigger than 0 anyway so they're obviously taking up the most space)
    Step 3: Compress the remaining 0's to just one 0 because as we all know from math class 00000000000(etc) == 0

    There, you've just converted 100MB of data into a single "0" which is just one byte long! I've got the compressor part down to a perl one-liner, but I'm still working on how to decompress it efficiently because strangely enough my decompresser is ~100MB in size, so it's still kinda alpha at the moment.

  20. not impressed by Anonymous Coward · · Score: 0

    some guys made an entire single player 3d game with doom 3 like graphics that fits into only like 128 k
    they have plenty of long 3d rendered movies that fit in 64k too.
    Im sure they could have won this thing if they had tried.

  21. Compression related to acting intelligently? by christpants · · Score: 1

    How does a computer algorithm for compression relate to the idea that human compression of knowledge corresponds to intelligent action? I think the summary is making a mistake in terminology. An intelligent person is able to compress information and access it quickly, but that doesn't really have much to do with a more efficient compression algorithm, unless we're talking some seriously advanced AI. I just really don't understand the correlation or what the summary is trying to imply... anybody else get it?

    1. Re:Compression related to acting intelligently? by Anonymous Coward · · Score: 0

      George Bush submitted an entry, and his algorithm was able to get the excerpt down to 101MB...

    2. Re:Compression related to acting intelligently? by Skuto · · Score: 1

      Read the website. (Oh wait, this is /., nobody does that).

      A system that understands the text it is compressing will compress better than one that doesn't.

      The winner improved upon previous compression programs by adding semantic modelling. Improving the modeling further would improve the results. You're heading towards language understanding.

    3. Re:Compression related to acting intelligently? by adrianmonk · · Score: 4, Interesting
      How does a computer algorithm for compression relate to the idea that human compression of knowledge corresponds to intelligent action? I think the summary is making a mistake in terminology. An intelligent person is able to compress information and access it quickly, but that doesn't really have much to do with a more efficient compression algorithm, unless we're talking some seriously advanced AI. I just really don't understand the correlation or what the summary is trying to imply...

      Nope, I had heard about the contest before seeing it today on Slashdot. The summary is fine. That is in fact the thesis that the contest is designed to investigate.

      As for why this is the case, I spent some time studying compression two or three years ago, and one of the things that you quickly realize is that there is no such thing as a truly general-purpose compression algorithm. Compression algorithms work on the principle that there is some underlying pattern or structure to your data. Once the structure is understood, the data can be transformed into a smaller format. Think of compressed files as a set of parameters to a magic function which generates a larger file. The real art is in finding the function.

      To give a concrete example, one of the simplest forms of compression is Huffman coding. You analyze your stream of symbols, and you realize that some occur more often than others. You then assign shorter bit strings to more frequently-occurring symbols and longer ones to less frequent symbols. This gives you a net gain. You were able to do this because you had the insight that in human language, some letters (like "e") occur more frequently than others (like "q").

      There are, of course, other patterns that can be exploited. You can take the frequency distribution trick above up to another level by noting that the frequency of a symbol is not really independent of the symbol before it. For example, the most likely symbol after a "q" is "u". Sure, you could have other things. But "u" is the most likely. You can exploit that to get better compression.

      But of course, you might be compressing something other than English text. There are different techniques for whatever kind of data you're trying to compress. A row of pixels in a faxed (1-bit) image tends to be very close to the same as the previous row. Each pixel is a bit. If you represent a row of pixels not as on or off directly but instead represent it as its bit value XORed with the pixel above it, you end up with 0 bits for every pixel that hasn't changed and 1 bits for every value that hasn't. Presto, you have just managed to skew the probability distribution radically towards 0 bits, and you can use further tricks to gain compression from that.

      The point of all this is that to come up with these tricks, you have to understand something about the data. One of the better definitions I've ever heard for data compression was simply "applied modeling".

      Given that, you have to ask a question: have we reached the point today where compression algorithms are as good as they're going to get at compressing English text based on its surface-level characteristics such as character frequencies and repeated strings? Have we exhausted the low-level modeling that is possible? There has been a lot of work on this, so it's possible that we may have. And if so, then any further gains in the compression ratio could very well be the result of some sort of higher-level modeling. Maybe even modeling the knowledge rather than just the language. And that is what this contest is about, as I understand it.

      It's not for sure that a winning contest entry necessarily requires the submitter to have developed some kind of machine intelligence. But it's an intriguing enough idea that it might be worthwhile to run the contest just to see if something interesting does happen.

      By the way, for some interesting reading on this subject, look up Kolmogorov Complexity. The wikipedia seems to have a pretty decent article on it (although I haven't read the whole thing).

    4. Re:Compression related to acting intelligently? by navarroj · · Score: 1

      I've actually learned about this contest and the theory from a previous story on slashdot. There are some nice mathematics and theoretical results that support this claim. This is also the motivation to have this 'contest', which for all people saying "20% is not much", "it takes too long to be used in applications"; well, this is not about creating better/faster applications to .zip your files, but more about testing the applicability of some interesting theoretical results.

      More information can be found here:
      http://cs.fit.edu/~mmahoney/compression/rationale. html

    5. Re:Compression related to acting intelligently? by dido · · Score: 1

      The idea is basically an extension of what scientific theories are supposed to do. To understand any phenomenon, it is necessary to compress it. A scientific theory, at its core, explains the data from many different experiments by means of a formula simpler than the data. Say, we did a thousand experiments measuring the energy released from the annihiliation of large numbers of electrons and positrons. We could use something like the Lagrange polynomial of the data from the experiment, and then come up with a "theoretical model" that is just as complicated as the data it tries to explain, and that would tell us nothing that we already didn't know. As Leibniz would say: "When a rule is extremely complex, that which conforms to it passes as random." But then, if we had something simpler, like E=mc^2, to explain the experiments, then we might be onto something. That equation explains the whole lot of data from the experiment, and gives an explanation for the transformation of matter into energy and vice-versa, in just a few symbols. I suppose, in the same way, compressing natural language would be a step to trying to get computers to understand natural language. If you want more on this idea, you could try reading the articles on Gregory J. Chaitin's website.

      --
      Qu'on me donne six lignes écrites de la main du plus honnête homme, j'y trouverai de quoi le faire pendre.
  22. Done on 9/25/06? by SillySnake · · Score: 2, Interesting

    According to http://prize.hutter1.net/ this happened on Sept 25 of 2006.

    The site also gives some of the requirements..
    Create a compressed version (self-extracting archive) of the 100MB file enwik8 of less than 17MB. More precisely:

            * Create a Linux or Windows executable of size S L := 17'073'018 = previous record.
            * If run, it produces (without input from other sources) a 108 byte file that is identical to enwik8.
            * If we can verify your claim, you are eligible for a prize of 50'000×(1-S/L). Minimum claim is 500.
            * Restrictions: Must run in 10 hours on a 2GHz P4 with 1GB RAM and 10GB free HD.

    1. Re:Done on 9/25/06? by Skuto · · Score: 1

      >According to http://prize.hutter1.net/ this happened on Sept 25 of 2006.

      There is waiting period for public comment/verification etc...

  23. I can do better than that. by Anonymous Coward · · Score: 0

    Here's some eleven character lossless compression of the full Wikipedia: Utter crap.

    1. Re:I can do better than that. by daverabbitz · · Score: 1

      Unfortunately that doesn't qualify as lossless compression.

      Imperceptible loss on the other hand...

      --
      What could be better than a jet powered motorcycle? http://www.youtube.com/watch?v=u8l6GTHLSWE
  24. That's nothing by snoggeramus · · Score: 1

    Compressing information is easy. Now .... let's see them compress an AOL chat room!

  25. WIkipedia-specific compression algorithms by billstewart · · Score: 1
    • 1. Itemize the articles in the Wikipedia extract.
    • 2. Edit them down to stubs.
    • 3. Compress File
    • 4. ...
    • 5. Profit!
    • 6. ...
    • 7. Other Wikipedia authors replace and augment the stubs.
    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
    1. Re:WIkipedia-specific compression algorithms by Anonymous Coward · · Score: 0
      • 1. Itemise the articles in the Wikipaedia extract.
      • 2. Re-imagine them to accentuate their brevity
      • 4. hello.jpg
      • 3. Compress File
      • 5. Profit![citation needed]
      • 6. PRES.BUSH SMELLS LIKE OLD PPL
      • 7. unneeded(how do i delete this????)


  26. Interesting related webpage by Skuto · · Score: 1

    http://www.cs.fit.edu/~mmahoney/compression/text.h tml

    Info about contenders and results of common compression programs on the testset. (All the "just use gzip/rar/winrk/..." fools can stop jabbering now...)

    1. Re:Interesting related webpage by Fulcrum2000 · · Score: 1

      And a nice other link on lossless text compression. As you can see top3 programs all use PAQ-like compression engines: http://www.maximumcompression.com/data/text.php

  27. Error in Wikipedia by Baldrson · · Score: 1
    A more precise Wikipedia link is to the erroneous section on Entropy as information content which reads:
    Experiments with human predictors show an information rate of between 1.1 and 1.6 bits per character, depending on the experimental setup; the PPM compression algorithm can achieve a compression ratio of 1.5 bits per character.
    The range for Shannon's experiments are actually between .6 and 1.3 bit per character.

    This error even caught Dr. Dobbs compression expert, Mark Nelson so I guess it isn't too surprising it caught the Wikipedia folks.

    The beauty of this sort of error in Wikipedia is it demonstrates just how even erroneous human knowledge can result in greater human knowledge if it is compressed:

    Imagine, if you will, a perfectly compressed Wikipedia (despite the fact that an optimum compression is not generally computable). This perfectly compressed Wikipedia would generally "hang together" with a great many pieces cross-checking coherently with other pieces and hence be represented with common (compressed) codes. But then, some pieces would stand out as not cross-checking. They would not "hang together" with other pieces and would require specific codes to represent them. To a Sherlock Holmes kind of mind -- the mind of a scientist -- the mind of a ruthless epistemologist -- these codes would represent "suspects" of the crime of lies or error. The idea that compressors like PPM are just as able to compress text as humans (given enough time) would not hang together -- it would not be predicted from the more general body of human knowledge and would therefore become suspect due to its relative incompressibility.

  28. Know what to compress, not how. by fahrbot-bot · · Score: 1
    Being able to compress knowledge well is believed to be related to acting intelligently.

    Bah! Knowing what to compress is more intelligenter...

    --
    It must have been something you assimilated. . . .
    1. Re:Know what to compress, not how. by Anonymous Coward · · Score: 0

      > Knowing what to compress is more intelligenter...

      Maybe you meant "the most intelligentest" ?

  29. This is the AI problem by LesPaul75 · · Score: 1
    Being able to compress knowledge well is believed to be related to acting intelligently.
    There's actually a little more to it than that. The creators of the Hutter prize believe that intelligence is the ability to compress knowledge. That's why they're offering this prize -- to solve the artificial intelligence problem. I'm not saying I buy into it, but that's their claim. That's why they call it "The most crucial technology prize of all." Here is the page describing the origin of the prize.
    1. Re:This is the AI problem by kilraid · · Score: 1

      Me and my geeky friends once spent a full week programming bots that fought each other... in rock, paper and scissors. It was interesting that the best bots were actually doing things very much related to data compression. They were guessing the opponent's next move based on the previous moves. Markov chains and such. Think if you had the dictionary in front of you. It might be possible to guess which word comes next, based on the previous ones.

      At a time there was a bot that beat all the other bots. Well, an analysis of the .class file turned out that it was actually loading the other bots' .class files and simulating them. :-)

    2. Re:This is the AI problem by rbarreira · · Score: 1
      Nice story ;) It would have been even better if:

      At a time there was a bot that beat all the other bots. Well, an analysis of the .class file turned out that it was actually loading the other bots' .class files and simulating them. :-)
      ... unknowingly, two such bots had been made and entered an infinite loop. That would teach their authors ;)
      --

      The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
  30. Compress knowledge != intelligence by Moraelin · · Score: 0

    I can think of lots of stuff which would realistically count towards intelligence, but "compress knowledge" is the kind of thing that just sounds unbelievably stupid. In fact, it sounds like the kind of idiotic publicity stunt coming from a marketting guy who never heard of AI or programming.

    The things you'd realistically need, and probably use every day unconsciously, include:

    - good indexing: Just having the information somewhere in your head, well compressed is pointless if you can't quickly find it.

    E.g., when you see a faucet, you need to immediately access such information as "turning it this way makes water come out, turning it the other way stops the water from coming out." You need to start from "faucet" and basically search for the information you need relating to "faucet". That's an easy example, but even trivial daily activities like driving your car often involve navigating a whole graph of associations, and if, say, a kid jumped in front of your car, you better not need hours to uncompress the data and find a solution.

    - good "tokenization" so to speak. Think every word or concept being not a sequence of letters, but a pointer to that very concept. Think Wikipedia's hyperlinks. It may resemble compression on a superficial level, but that's like saying that a bird resembles the space shuttle on a superficial level. The goal here isn't just to minimize size by agnostically matching any chunk of text that came before, but to create a graph of interdependencies that you can navigate to find a solution.

    E.g., again, if you started with "faucet" and went through "turn it clockwise", you need to imediately be able to reference "turn" and "clockwise" without sitting and sifting through a zip archive. And for that you need the real tokens, not any old chunk of matching text for compression reasons. A match for "turn" is good. A match for "rn i" between "turn it clockwise" and a previous occurence of "a barn in the backyard", as a compression algorithm would match, is utterly useless.

    - good filtering. Yes, the brain is sorta lossy at all stages, because it mercilessly filters out everything that doesn't look relevant or in the focus of attention. E.g., when looking at the street, you might just see "it's a car", because the details about it don't look relevant at the moment. E.g., when focusing on the car for more details, you might completely miss the guy in a pink gorilla suit doing cartwheels in the background, because it looks irrelevant to your current focus of attention.

    This is more important than it sounds, and is the _real_ compression for the most part when processing that data. The brain isn't just compressing the whole scene to save bandwidth, it's really mercilessly clipping 90% of it out, tokenizing the rest, and again clipping out most of the details about those tokens as long as they're not explicitly required. It's not just compressing the image of a car coming at you, it's transforming it into the "tokens" (so to speak) "[car] [incoming]". Anything else from that scene (e.g., the guy in a pink gorilla suit across the road) isn't compressed, it's filtered out, and so are the details about those tokens. (E.g., car model, colour, registration number, etc.) You can choose to pay attention to some of them (e.g., look at the registration number or the driver's face), and that will change the filtering criteria to give you those details at the expense of something else.

    Even what you mention ("while my brain is good at remember the gist of knowlege, but really bad at losslessly recalling it.") is just an example of such filtering. A lot of the details were just filtered out before you even memorized that (e.g., you might have gotten the idea that a giraffe has a long neck, but not its colour, because it didn't seem important at the moment), an a lot just weren't stored or indexed or given a high priority. If it didn't look useful to remember what colour a giraffe is, it didn't get stored. Again, it's filtering, not dumb stream compression.

    And so on.

    And missing all that in favour of a "let's see who compresses the best" stunt seems just stupid and clueless.

    --
    A polar bear is a cartesian bear after a coordinate transform.
    1. Re:Compress knowledge != intelligence by epine · · Score: 1
      I can think of lots of stuff which would realistically count towards intelligence, but "compress knowledge" is the kind of thing that just sounds unbelievably stupid.

      Apparently, sounds deceive. You might want to practice your hearing. A good start would be subtracting out the sound of your own voice droning on at the drop of a pin about subjects you haven't made an effort to fully appreciate.

      Kolmogorov-Chaitin complexity is the deepest theory going about the inter-relationship between compression and prediction. There used to be a great and highly readable account of omega and the math surrounding omega available online by Chaitin, but he appears to have removed the online version I used to enjoy and perhaps recycled it as a book.
    2. Re:Compress knowledge != intelligence by Abcd1234 · · Score: 1

      I can think of lots of stuff which would realistically count towards intelligence, but "compress knowledge" is the kind of thing that just sounds unbelievably stupid.

      Ooooh! Well, if you say it's stupid, then it must be true! I mean, I'm sure you hold a PhD in information theory and artificial intelligence, and as such, are in a perfect position to criticize this work, right? I'd hate to think you were attacking the work of someone far more educated than you in this subject matter out of sheer ignorance and stupidity.

    3. Re:Compress knowledge != intelligence by Moraelin · · Score: 1

      So, by the same logic, if you're not holding a Ph.D. in Economics, you're not allowed to say that an apology of Communism is bogus, right? They had plenty of Ph.D.'s in Moscow writing books about how great that system is. You wouldn't go and attack the work of people more educated than you out of sheer ignorance and stupidity, right?

      Here's a fun concept for you: appeal to false authority. Because that's the fallacy you're committing there.

      Show me someone who's actually built a working AI, and then we'll talk. That's a real authority I'm willing to accept any time. Anything else is, for the scope of discussing what's _needed_ for AI, just an unproven hypothesis. No more, no less. Even if their other maths is good, the supposition that's indeed needed for inteligent behaviour is no more than an unproven hypothesis pulled out of the ass. Arguing that since someone is an expert in the first, they must be automatically right about the second, is no more and no less than an appeal to false authority. It's like arguing that since Newton was good at physics, he _must_ have also be right about alchemy. (Which is another thing he studied, btw.) It's just that stupid.

      So seeing that noone actually produced anything that can take a RL information stream and do anything even remotely intelligent with it, I'm not sure what makes them irrefutable authorities on the subject. Where _is_ the proof-of-concept program that uses data compression to actually extract useful knowledge out of Wikipedia and act intelligently about it? No, seriously. Anything else is just taking a guess. So exactly what there says that you should automatically stop thinking for yourself and take his unproven ideas for absolute truth?

      In the meantime, however, we have this thing called a human brains that actually exhibits intelligence. And we're starting to understand some things that it does. Such as that, as I've mentioned, it does _not_ work by simply compressing the whole stream, but by extracting the relevant bits and utterly ignoring the rest.

      We have _millenia_ of proof that if you're interested enough in the magician's left hand, you completely miss what his right is doing, or that if you focus on the miracle horse doing maths, you miss the owner giving it hints when to stop tapping. Heck, proofs that if they get your attention on something in the foreground, you'll completely miss the guy in a gorilla suit jumping up and down in the back, even made it onto mainstream TV.

      So between some unproven hypotheses, and something that actually works that way, I don't know about you, but I'll go with what works here and now. I'll go with how the brain actually seems to work.

      --
      A polar bear is a cartesian bear after a coordinate transform.
    4. Re:Compress knowledge != intelligence by Abcd1234 · · Score: 1

      Here's a fun concept for you: appeal to false authority. Because that's the fallacy you're committing there.

      And the fallacy you've committed is decrying a theory without even reading about it, let alone understanding it. Tell me, which is worse?

      Anything else is just taking a guess. So exactly what there says that you should automatically stop thinking for yourself and take his unproven ideas for absolute truth?

      I don't recall saying that. My point is that, with absolutely no basis in actual fact, you've decided to disregard what is, on the surface, a reasonably valid theory (to me, anyway), based on your personal "feelings". Meanwhile, those far more educated than you in the subject matter consider it a valid area of research. So, what makes you believe that you're qualified to disregard their theories out of hand? Other than, of course, sheer arrogance, which you apparently possess in spades.

      I'll go with how the brain actually seems to work.

      And now you claim to know how the brain works. Are you really *that* arrogant? Yikes...

      BTW, I might point out that people like you used to discount quantum theory. "The world around me doesn't work that way, so it can't possibly be true!" they would say. Guess who was right?

    5. Re:Compress knowledge != intelligence by Moraelin · · Score: 1

      BTW, I might point out that people like you used to discount quantum theory. "The world around me doesn't work that way, so it can't possibly be true!" they would say. Guess who was right?

      "They laughed at Einstein. They laughed at the Wright Brothers. But they also laughed at Bozo the Clown." - Carl Sagan

      Did I mention fallacies? I think I did. In this case the Association Fallacy (sharing one quality or in this case one group of enemies with quantum physics doesn't actually prove this one true too), salted with a bit of Ad Hominem ("people like you...").

      Here's a fun reality check for you: quantum theory actually had some actual experimental data to show, and made some easily falsifiable claims. (Falsifiable meaning it could have been easily proven false, if it were false.) And in the meantime it's shown a mountain of evidence and made possible practical stuff such as LASER or the Zener diode. _That_ is what makes it true, not sophistry like who laughed at it or who didn't. That's how science works.

      Unfortunately that idea is lost on some CS clowns who think that just arguing personal beliefs, or occasionally preferences, is some form of science. And that expertise in some unrelated domain makes them automatically right. ("I'm the compiler guru, therefore I can tell you which methodology is more productive." Or "I'm the data structures guru, therefore I can tell you what's needed for intelligence.") No, it isn't. Show me the actual experimental proof, and then we'll talk. Until then, it's no better than the work of all those who wrote from positions of authority about what's needed to transmute lead into gold.

      And now you claim to know how the brain works. Are you really *that* arrogant? Yikes...

      Believe it or not, _some_ things about it are known by now. There are more psychologists, neurologists, anthropologists and so on, all the way to stage "magicians", who've studied those various aspects, than CS clowns busy postulating they just know what's needed for intelligence. And the fun part is, the former group actually has some falsifiable data to show for their efforts.

      At any rate, it doesn't matter if I'm arrogant (yes, I am) or what you choose to be incredulous about. Show me the experimental data and then we'll talk. That's how real science works.

      Otherwise it ends up an "argument ouf ignorance" fallacy. When you've postulated that compression is _needed_ for intelligence, you've made a claim about the brain too. So it ends up basically, "We don't know how the brain on the whole works, so it must be true that it needed data compression to achieve intelligence." Which is a textbook example of that fallacy.

      And the fallacy you've committed is decrying a theory without even reading about it, let alone understanding it. Tell me, which is worse?

      Your using a fallacy yet _again_? Dunno about "worse", but it looks pretty bad. What you assume about me still hasn't proved or disproved the theory you're busy defending. What _would_ even start to prove it would be at least _one_ example of an AI where pure data compression was essential to its workings.

      My point is that, with absolutely no basis in actual fact, you've decided to disregard what is, on the surface, a reasonably valid theory (to me, anyway), based on your personal "feelings".

      Looking reasonable valid on the surface can be very mis-leading. Alchemy looked reasonably valid on the surface. Communism looked reasonably valid on the surface. Inteligent Design looks reasonably valid on the surface for a _lot_ of people. Etc.

      Unfortunately, again, that's not how science works. Science works more along the lines of "show me the experiment". Show me that data compression is even useful at all in a working AI, before making broad claims that it's _needed_ for intelligence. It's that simple.

      Meanwhile, those far

      --
      A polar bear is a cartesian bear after a coordinate transform.
  31. some standard compression tools tested by elmartinos · · Score: 1

    I have tested some of the standard compression tools, here are the compressed sizes:

    100000000 | original size
      36445241 | gzip --best
      29008758 | bzip2
      28860178 | rzip -9
      24864529 | 7zr -mx=9
      24753293 | rar a -m5 -mdG

    7z does not do so well, I think because not so much tuned for text compression, it is much better at compressing binaries. For text compression PPMD and the variants are quite good, so I guess you will see good results with WinRK, Compressia, PAQ and the like.

    1. Re:some standard compression tools tested by Anonymous Coward · · Score: 0

      7zip already uses PPMD var.H for text compression, as does RAR 3.00 or later, if they detect text content.

  32. So somebody did it! by ESqVIP · · Score: 1

    In the end, somebody managed to get through the technobabble!

  33. Hutter Prize by johansalk · · Score: 1

    I really misread this as a Hitler Prize.

    1. Re:Hutter Prize by Anonymous Coward · · Score: 0

      I really misread this as a Hitler Prize.

      I guess the goal of that program would be to compress Jews?

      [I know... go straight to hell, do not pass GO...]

    2. Re:Hutter Prize by kingofc · · Score: 1

      well, agree, that Hitler himself had a greedy reductionist approach to human intelligence as well. some "hard numbers" from wikipedia: About 220,000 Sinti and Roma were murdered in the Holocaust (some estimates are as high as 800,000), between a quarter to a half of the European population. Other groups deemed "racially inferior" or "undesirable": Poles (6 million killed, of whom 3 million were Catholic/Christian, and the rest Jewish), Serbs (estimates vary between 500,000 and 1.2 million killed, mostly by Croat Ustase), Soviet military prisoners of war and civilians in occupied territories including Russians and other East Slavs, the mentally or physically disabled, homosexuals, Blacks, Jehovah's Witnesses, Atheists,[citation needed] Communists and political dissidents, trade unionists, Freemasons, Eastern Christians, and Catholic and Protestant clergy, were also persecuted and killed. Many scholars do not include the Nazi persecution of all of these groups in the definition of the Holocaust, with some scholars limiting the Holocaust to the genocide of the Jews; some to genocide of the Jews, Roma, and disabled; and some to all groups targeted by Nazi racism.[2] Taking all these other groups into account, however, the total death toll rises considerably, estimates generally place the total number of Holocaust victims at 9 to 11 million, though some estimates have been as high as 26 million.[3]

  34. Netopsystems FEAD Optimizer by Schraegstrichpunkt · · Score: 1

    Recomposing...

  35. For all the people laughing at this contest by rbarreira · · Score: 1

    For all the people laughing at this contest: Read this and this before making any more posts about how ridiculous this contest seems, how stupid it is to compress wikipedia, how compression is such a dumb process, and how bad the compressor was for taking a lot of time to process the text.

    --

    The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
    1. Re:For all the people laughing at this contest by Viol8 · · Score: 1

      These sorts of algorithms might not be ridiculus but they have limited scope. To compress text in this way you have to understand the language and its idioms. So the algo that did this would for example be useless at compressing French and possibly even English written in some colloquial form. It might have uses in some specific instances but as a general purpose text compression algorithm it leaves a lot to be desired.

    2. Re:For all the people laughing at this contest by rbarreira · · Score: 1

      We can't know that until the algorithms are invented. And I guess the important thing here is not the algorithms, but the theories behind them (which hopefully will be possible to apply in a general way to all the languages).

      --

      The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
  36. Am I the only one that read the news as by lord_rob+the+only+on · · Score: 1

    First Hustler Prize Awarded ?

    Maybe I am the only Slashdot reader that enjoys pr0n ...

  37. What about DNA? by Circlotron · · Score: 1

    A single strand of DNA contains the equivalent of about 3GB, so I read. Given the right circumstances it can uncompress into an entire human being, including a brain that is said to be the most complex thing in the known universe. That's gotta be the most over-the-top blowout of a data compression feat ever! You could probably fit the entire internet onto a single floppy disk with technology that advanced :-)

    1. Re:What about DNA? by Anonymous Coward · · Score: 0

      Yeah,

      But a brain is mostly blank after decompression. Also an avaredge human has 2.3 mutations (diffects) so it's not without errors.

  38. Unfortunately for Ratushnyak... by Bohnanza · · Score: 1

    ...the prize was compressed by a factor of 5.86 as well.

    --

    -----

    Sorry, I'm only a 1336 h4x0r.

  39. OK, I'll bite by rbarreira · · Score: 1
    Have you actually read the links I sent you? There's a mathematical proof that compressing natural language text implies being able to pass the Turing test, which is commonly thought to be an AI breakthrough AI when it happens. While that may not imply "intelligence" for all its definitions, it's surely something worthy of note, which your original comment doesn't seem to acknowledge.

    Finally, a real challenge would be the self-inventing compression method, not the self-extracting algorithm.

    So you're saying that making a computer pass a Turing test is not a real challenge? Do it then!
    --

    The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
    1. Re:OK, I'll bite by smallfries · · Score: 1

      OK. I'm not the OP that you're arguing with but you cme across as someone who thinks they understand Hutter's proof so I've got a few questions for you. From the link that you gave, I've read the background before and skmmed through the ECML paper to remember what he'd done.

      The general AI algorithm seems to suffer from two main problem:
      1. Specification. The measurement critera is "wooly". If we assume that certain input symbols are nominated rewards from the environment then we are limiting ourselfs to certain types of problen. Not all intellegence problems can be specificed using such a simple class of interaction. In particular what evidence is there that the method generalises to more complex forms of interaction such as deferred and/or partial rewards?

      2. Computation. Both the set of p being computed, and the bound t must be increased n order for the search to be universal. Some computable problems may require extreme values of t and/or p, and there is no general strategy to find them.

      In short, the general AI model / algorithm doesn't appear to claim anything other than if a solution is computable, and we work in a time-bounded model to avoid non-termination then we will find it sooner or later. This does absolutely nothing to solve the fundamental problems in the field, which are of the nature how. Rather than if.

      --
      Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
    2. Re:OK, I'll bite by foobsr · · Score: 1

      There's a mathematical proof that compressing natural language text implies being able to pass the Turing test

      I cannot see it.

      From the article in question:

      What Hutter proved is that the optimal behavior of an agent is to guess that the environment is controlled by the shortest program that is consistent with all of the interaction observed so far.

      Wisely avoiding to mention that the agent is dealing with a computable environment, just to make things clear. Besides, this feels like a guise of Kolmogorov complexity. Finally, none of the caveats that Hutter addresses are named.

      ... Shannon capacity theorem states that the optimal (shortest average) code for s has length log2 1/P(s) bits

      average pointwise entropy? p(s)?

      Our two assumptions state that both humans would answer any given question the same way

      Simple question: What is your name?

      However humans must implicitly know P(s) because our speech and writing, by definition, follows this distribution.

      So I implicitly, by definition, knew the probability of your answers so far?

      Now suppose that the machine can compute P(s) for all s.

      p(s) or P(s) without "for all s" - anyway - strong assumption, at best.

      ... The article is bogus. It does not at all link to the imitation game abd does not give any hint at WSD.

      However, thank you for giving me the motivation to dive into the past and have fun with the associated nostalgia.

      CC.

      --
      TaijiQuan (Huang, 5 loosenings)
  40. Even simpler by Anonymous Coward · · Score: 0

    Take the 100MB data and make a number between 0 and 1 with it. Then make a notch in a stick of wood at the place that exactly splits the stick to this ratio. Done.
    That's how I always backup my hard disk.

  41. Stuffit is even smaller by j0kkk3l · · Score: 1

    24310603 | sitx Best Text Compression, compression level 16, memory consumption 25 (39,6 MB), Optimizers on

    Well Stuffit 9.02 seems to win this round. It took under 3 minutes on a 2 GHz iMac G5.

  42. And the answer is... by rumblin'rabbit · · Score: 1
    The answer in hex is 2A. Or 42, if you prefer decimal,


    Gotter down to 1 byte. Not bad at all.

  43. I can make it 99% small!! by lordmage · · Score: 1

    See I have a reference file Zipper. That means that I take long strings and make a tag outta them. Right now it has 1 tag and using one huge 100 meg string reference file. I will update this later to include more. I think I should get the money! It also takes 20 seconds to run!

    --
    I can program myself out of a Hello World Contest!!
  44. Trying to understand this by jmcwork · · Score: 1
    I am way out of my league with AI at this level so I will try not to be too dumb with this.

    So data compression tries to replace repeated chunks of data (character strings) with a common, hopefully much smaller, reference value (my simplistic view). If you are trying to compress knowledge, are you trying to find repeated (or at least similar) concepts and replace those with a common access point? For example, "Boiling water is hot", "Steaming coffee is hot", "Cheese from pizza right out of the oven can burn your mouth" (to me) have some knowledge relationships that keep you from burning yourself. So could you 'compress' these to a single concept like 'do not touch' or 'be careful'? Again, just trying to get a handle on this concept.

    1. Re:Trying to understand this by Kris_J · · Score: 1
      It's a bit late, but I enjoy compression talk so I'm going to have a go at this.
      "Boiling water is hot"
      Let's take those four words. Rather than looking for standard repeating patterns or common letters, a compressor that understands language might go something like;
      1. Boil -- I predict that "Boil" will be followed by "ing". The chance is higher than for "boil" because the "B" means its at the beginning of a sentence.
      2. Cool, there's an "ing", so I don't actually have to store anything because the algorithm predicts Boiling.
      3. Now, there's typically a space after "ing", even more so after "Boiling".
      4. Space it is, nothing to store here.
      5. Okay, so we have "Boiling " -- The next word will probably be "oil" or "water".
      6. It's "water", so I'll just store a single bit to mark which one of the top two predicted outcomes it is.
      7. And there's the space I was expecting.
      8. (Here my English gets a little sketchy.) We have an adjective and a noun, so we're expecting a predicate. This needs a verb and the most common one at this point is "is".
      9. Well, what do you know, an "is", with a space.
      10. Now that we have "Boiling water is ", what is boiling water. With enough understanding of the world around us, we can guess the next word is "hot". In fact, it's probably such a safe bet that there's a test somewhere in the world where you have to fill in the blank of "Boiling water is ___". So, again, nothing much to store since the algorithm predicted the entire phrase from just the "Boil".
      Now, if you'd said "Boiling water is cold", the algorithm would have needed to store "cold" in some way, because it was unexpected. Similarly, "Boilfish notlob Frank ARRRRRGH!" would need a lot more storage because it's not as predictable.

      Of course, I'm not a qualified expert in this field, so I may well be talking out of my notlob.

  45. Here's some clue for you by Moraelin · · Score: 1

    Here's some free, complimentary clue for you: "All that glitters is not gold." Just because someone wrote a ton of maths about compression, doesn't mean it's anywhere near applicable to building an AI.

    And in fact, noone actually built yet anything even vaguely resembling intelligence yet, so maybe, just maybe, all those hypotheses are missing some vital point. Maybe, just maybe, just because you've compressed something to the minimum number of bits, doesn't mean jack squat about being able to use it effectively in real time. Maybe, just maybe, just because you've run arithmetic compression over a text, doesn't mean jack squat about actually being able to parse and navigate the semantic relationships in it. You just have a compression program and need some hours just to get your stream of bytes back, and be no wiser as to how to act intelligently on it.

    In the meantime, however, we have a ton of psychologists, neurologists and, don't laugh, even stage magicians who've explored how the brain does work. You know, the only actual thing that actually exhibits some intelligence. We've worked out a lot of odds and ends about what happens there. We have some idea of what it filters out and when, how many seconds do some of the buffers hold the data, and so on.

    So maybe you can take some hint from there. Maybe the information you need to extract and act on is really something like "[car][incoming]" instead of ivory-tower exercises like calculating the minimum number of bits you can compress the incoming stream to. That's the hard part: extracting the useful information there.

    Or do you need something with lots of formulae? Read something about AI, lemming. There are a lot of people who've worked on various inference machines and such, and, surprise, none of them involved compression algorithms.

    So, you know, speaking of "ubtracting out the sound of your own voice droning on at the drop of a pin about subjects you haven't made an effort to fully appreciate", maybe you can take your own advice there. Just because you get a boner about some maths, doesn't mean jack squat about applying it to a practical AI problem. Maybe start using your own head about how you'd actually act on that information instead. Just a thought, you know.

    --
    A polar bear is a cartesian bear after a coordinate transform.
  46. Virtue from necessity by Baldrson · · Score: 1
    Since you have to choose at least one representation from which you derive your knowledge English text is a pretty good place to start. Wikipedia consists of more than English text of course and that means those representations will need to be compressed too.

    The result of this is pretty much what you need for epistemology, software and legal disciplines: Optimal language models telling you precisely how language maps to knowledge.

    There was some debate about using the multilingual versions of Wikipedia but those projects are less important than just getting natural language -- any natural language -- to be modeled well enough to construct natural language UI's with some reasonable level of rigor that your mother could use (assuming she can read English).

  47. Kolmogorov Complexity by xtagestax · · Score: 1

    So I don't know if anyone's mentioned this yet, but it might be interesting / worthwhile to point out that the goal of the contest is to compress the 100 MB of wikipedia knowledge into a self-extracting program as efficiently as possible is nearly equivalent to asking "what's the Kolmogorov Complexity of this data? There is a definite (unknown) lower bound on the size of the program that will describe the knowledge, which is the size of the compressed data + the size of the decompressing program. This is a non-trivial problem -- especially with the added constraints of limited memory and time -- that could yield surprising and insightful results.

  48. Your post has more mistakes than a random post by rbarreira · · Score: 1

    The two other replies to your post have already done a good job at proving you're wrong, so I'll just add that if pi is proved random it will most likely not by a test of its digits, but by a more abstract proof, since it has an infinite number of digits.

    --

    The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
  49. Re:what the... by Lotharus · · Score: 1
    I would think that a study "a few months ago" could (I'm no scientist) replace knowledge from "a few years back". Remember the study 3,000 years ago that proved the world was flat?

    I didn't have any real difficulty reading that sentence. Only "reversing" and "interior" posed much trouble. Plus, in sorting algorithms (that's really what this is.. the brain resorting the letters in the word), it's not impossible for a sort procedure on a fully-reverse-sorted list to be less efficient than a sort procedure on a randomized list (of course this depends on the algorithm in use). In other words, fully reversing the interior letters could well present a more difficult problem for the brain than randomizing the interior letters. I'm not a neuroscientist either; I don't know how the brain actually handles something like sorting letters.

    Now let us never speak of this again.
    That's what your mom said last night!! BURN!!!
  50. paq8hp6 by Baldrson · · Score: 1

    Matt Mahoney reports that Alexander Ratushnyak already has a new program, paq8hp6.exe which appears to improve on paq8hp5 by the 1% threshold required for another payout -- given it survives the 30 day comment period.