First Hutter Prize Awarded
stefanb writes, "The Hutter Prize for Lossless Compression of Human Knowledge, an ongoing challenge to compress a 100-MB excerpt of the Wikipedia, has been awarded for the first time. Alexander Ratushnyak managed to improve the compression factor to 5.86 and will receive a 3,416-Euro award. Being able to compress knowledge well is believed to be related to acting intelligently." The Usenet announcement notes that at Ratushnyak's request, part of the prize will go to Przemyslaw Skibinski of the University of Wroclaw Institute of Computer Science, for his early contributions to the PAQ compression algorithm.
It's not what you know, or who you know, but how much you can compress what you know.
Although I could tell a fair few stories about crushing students into phone booths.
last time I checked, you couldn't make a "lossy" compression of text.
oh... nevermind. I forgot about teens with SMS.
It turns out that Alexander Ratushnyak was the only person to even meet the challenge's guidelines, and one of only 3 people who submitted entries. Seems like a strange thing to award up to 50,000 Euro for ... his compression was only 20% smaller than what I just quickly did with gzip. I'm rather surprised that more people didn't try though ... it's not like people are throwing money at data compression researchers.
Crack - Free with every butt and set of boobs
"The Hutter Prize for Lossless Compression of Human Knowledge, an ongoing challenge to compress a 100-MB excerpt of the Wikipedia"
Wikipedia? Knowledge? Isn't that already a lossy compression mechanism?
Am I the only one who finds it slightly ironic that (as of this writing), there is no entry for the Hutter Prize on Wikipedia?
If you can't convince them, convict them.
For comparison purposes, WinRAR on its best setting only gets this down to 24MB. Doubtless 7zip could get even lower, but I don't think either could crack the 17MB mark. And certainly neither of those would be self-extracting, which this contest requires.
Cyde Weys Musings - Scrutinizing the inscrutable
a Lossless Compression of the Internet
How slow this code is. Usually when you are trying to squeak out another 1% of space in a file your algorythm triples in size and quadruples in runtime to get that improvement. I wonder how practical this new algorythm really is. Probably not so much an issue nowadays but I also wonder how big the compression program is. This used to be a really big deal many years ago when the compression programs could easily be larger than the data they were compressing.
I work for the Department of Redundancy Department.
I wonder if you could just store all the common english words into a lookup table indexed by a number (32 bit enough?). So every word then gets replaced by a number which would then be compressed further. Only catch is the client needs a copy of the lookup table which might be feasible.
Being able to compress knowledge well is believed to be related to acting intelligently. - IHNPTTT (I Have Not Passed The Turing Test), but while my brain is good at remember the gist of knowlege, but really bad at losslessly recalling it.
From the contest specs:
thus reducing the slippery concept of intelligence to hard file size numbers
I doubt the instance which wrote that was cynical.
CC.
TaijiQuan (Huang, 5 loosenings)
Then, we can use the rest of the leftover space to store the sum of all human knowledge in an uncompressed form.
- Dave
... by Douglas Adams. Wonder what rate of compression is putting the meaning of life, universe and everything in just the number 42.
wget "http://cs.fit.edu/~mmahoney/compression/enwik8.zi p" | gunzip
:wq
doesn't count?
..that isn't offtopic at all, it is a rather cute and harmless example of a compressed first post that is actually ON topic. And, no I am not the submitter. Now normally I never chastise here, nor pay much attention to the frosty pissers, but c'mon, cut the dude some slack, it's a +1 funny at least.
At least one of them, Jim Bowery. He posts online under the nickname Baldrson and his personal homepage is filled with self published articles about how the jews are responsible for the aids and such.
is what you get when you run the new compression backwards!
This reminds me of the cryptographer paid to crypto-compress datastreams into musical notation. The work laid the foundation to real time packet sniffing of the Internet
Cryptographer? Got atta boys
it took 5 hours, with 900 megs of ram..
i dont even have half that processing power in my own brain as well as my computer!
Step 1: Read in the 100MB file as binary
Step 2: Scrub/remove all the 1's (they're bigger than 0 anyway so they're obviously taking up the most space)
Step 3: Compress the remaining 0's to just one 0 because as we all know from math class 00000000000(etc) == 0
There, you've just converted 100MB of data into a single "0" which is just one byte long! I've got the compressor part down to a perl one-liner, but I'm still working on how to decompress it efficiently because strangely enough my decompresser is ~100MB in size, so it's still kinda alpha at the moment.
some guys made an entire single player 3d game with doom 3 like graphics that fits into only like 128 k
they have plenty of long 3d rendered movies that fit in 64k too.
Im sure they could have won this thing if they had tried.
How does a computer algorithm for compression relate to the idea that human compression of knowledge corresponds to intelligent action? I think the summary is making a mistake in terminology. An intelligent person is able to compress information and access it quickly, but that doesn't really have much to do with a more efficient compression algorithm, unless we're talking some seriously advanced AI. I just really don't understand the correlation or what the summary is trying to imply... anybody else get it?
According to http://prize.hutter1.net/ this happened on Sept 25 of 2006.
:= 17'073'018 = previous record.
The site also gives some of the requirements..
Create a compressed version (self-extracting archive) of the 100MB file enwik8 of less than 17MB. More precisely:
* Create a Linux or Windows executable of size S L
* If run, it produces (without input from other sources) a 108 byte file that is identical to enwik8.
* If we can verify your claim, you are eligible for a prize of 50'000×(1-S/L). Minimum claim is 500.
* Restrictions: Must run in 10 hours on a 2GHz P4 with 1GB RAM and 10GB free HD.
Here's some eleven character lossless compression of the full Wikipedia: Utter crap.
Compressing information is easy. Now .... let's see them compress an AOL chat room!
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
http://www.cs.fit.edu/~mmahoney/compression/text.h tml
Info about contenders and results of common compression programs on the testset. (All the "just use gzip/rar/winrk/..." fools can stop jabbering now...)
This error even caught Dr. Dobbs compression expert, Mark Nelson so I guess it isn't too surprising it caught the Wikipedia folks.
The beauty of this sort of error in Wikipedia is it demonstrates just how even erroneous human knowledge can result in greater human knowledge if it is compressed:
Imagine, if you will, a perfectly compressed Wikipedia (despite the fact that an optimum compression is not generally computable). This perfectly compressed Wikipedia would generally "hang together" with a great many pieces cross-checking coherently with other pieces and hence be represented with common (compressed) codes. But then, some pieces would stand out as not cross-checking. They would not "hang together" with other pieces and would require specific codes to represent them. To a Sherlock Holmes kind of mind -- the mind of a scientist -- the mind of a ruthless epistemologist -- these codes would represent "suspects" of the crime of lies or error. The idea that compressors like PPM are just as able to compress text as humans (given enough time) would not hang together -- it would not be predicted from the more general body of human knowledge and would therefore become suspect due to its relative incompressibility.
Seastead this.
Bah! Knowing what to compress is more intelligenter...
It must have been something you assimilated. . . .
I can think of lots of stuff which would realistically count towards intelligence, but "compress knowledge" is the kind of thing that just sounds unbelievably stupid. In fact, it sounds like the kind of idiotic publicity stunt coming from a marketting guy who never heard of AI or programming.
The things you'd realistically need, and probably use every day unconsciously, include:
- good indexing: Just having the information somewhere in your head, well compressed is pointless if you can't quickly find it.
E.g., when you see a faucet, you need to immediately access such information as "turning it this way makes water come out, turning it the other way stops the water from coming out." You need to start from "faucet" and basically search for the information you need relating to "faucet". That's an easy example, but even trivial daily activities like driving your car often involve navigating a whole graph of associations, and if, say, a kid jumped in front of your car, you better not need hours to uncompress the data and find a solution.
- good "tokenization" so to speak. Think every word or concept being not a sequence of letters, but a pointer to that very concept. Think Wikipedia's hyperlinks. It may resemble compression on a superficial level, but that's like saying that a bird resembles the space shuttle on a superficial level. The goal here isn't just to minimize size by agnostically matching any chunk of text that came before, but to create a graph of interdependencies that you can navigate to find a solution.
E.g., again, if you started with "faucet" and went through "turn it clockwise", you need to imediately be able to reference "turn" and "clockwise" without sitting and sifting through a zip archive. And for that you need the real tokens, not any old chunk of matching text for compression reasons. A match for "turn" is good. A match for "rn i" between "turn it clockwise" and a previous occurence of "a barn in the backyard", as a compression algorithm would match, is utterly useless.
- good filtering. Yes, the brain is sorta lossy at all stages, because it mercilessly filters out everything that doesn't look relevant or in the focus of attention. E.g., when looking at the street, you might just see "it's a car", because the details about it don't look relevant at the moment. E.g., when focusing on the car for more details, you might completely miss the guy in a pink gorilla suit doing cartwheels in the background, because it looks irrelevant to your current focus of attention.
This is more important than it sounds, and is the _real_ compression for the most part when processing that data. The brain isn't just compressing the whole scene to save bandwidth, it's really mercilessly clipping 90% of it out, tokenizing the rest, and again clipping out most of the details about those tokens as long as they're not explicitly required. It's not just compressing the image of a car coming at you, it's transforming it into the "tokens" (so to speak) "[car] [incoming]". Anything else from that scene (e.g., the guy in a pink gorilla suit across the road) isn't compressed, it's filtered out, and so are the details about those tokens. (E.g., car model, colour, registration number, etc.) You can choose to pay attention to some of them (e.g., look at the registration number or the driver's face), and that will change the filtering criteria to give you those details at the expense of something else.
Even what you mention ("while my brain is good at remember the gist of knowlege, but really bad at losslessly recalling it.") is just an example of such filtering. A lot of the details were just filtered out before you even memorized that (e.g., you might have gotten the idea that a giraffe has a long neck, but not its colour, because it didn't seem important at the moment), an a lot just weren't stored or indexed or given a high priority. If it didn't look useful to remember what colour a giraffe is, it didn't get stored. Again, it's filtering, not dumb stream compression.
And so on.
And missing all that in favour of a "let's see who compresses the best" stunt seems just stupid and clueless.
A polar bear is a cartesian bear after a coordinate transform.
I have tested some of the standard compression tools, here are the compressed sizes:
100000000 | original size
36445241 | gzip --best
29008758 | bzip2
28860178 | rzip -9
24864529 | 7zr -mx=9
24753293 | rar a -m5 -mdG
7z does not do so well, I think because not so much tuned for text compression, it is much better at compressing binaries. For text compression PPMD and the variants are quite good, so I guess you will see good results with WinRK, Compressia, PAQ and the like.
Open Source Alternatives
In the end, somebody managed to get through the technobabble!
I really misread this as a Hitler Prize.
Recomposing...
http://outcampaign.org/
For all the people laughing at this contest: Read this and this before making any more posts about how ridiculous this contest seems, how stupid it is to compress wikipedia, how compression is such a dumb process, and how bad the compressor was for taking a lot of time to process the text.
The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
First Hustler Prize Awarded ?
Maybe I am the only Slashdot reader that enjoys pr0n ...
A single strand of DNA contains the equivalent of about 3GB, so I read. Given the right circumstances it can uncompress into an entire human being, including a brain that is said to be the most complex thing in the known universe. That's gotta be the most over-the-top blowout of a data compression feat ever! You could probably fit the entire internet onto a single floppy disk with technology that advanced :-)
...the prize was compressed by a factor of 5.86 as well.
-----
Sorry, I'm only a 1336 h4x0r.
So you're saying that making a computer pass a Turing test is not a real challenge? Do it then!
The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
Take the 100MB data and make a number between 0 and 1 with it. Then make a notch in a stick of wood at the place that exactly splits the stick to this ratio. Done.
That's how I always backup my hard disk.
24310603 | sitx Best Text Compression, compression level 16, memory consumption 25 (39,6 MB), Optimizers on
Well Stuffit 9.02 seems to win this round. It took under 3 minutes on a 2 GHz iMac G5.
Gotter down to 1 byte. Not bad at all.
See I have a reference file Zipper. That means that I take long strings and make a tag outta them. Right now it has 1 tag and using one huge 100 meg string reference file. I will update this later to include more. I think I should get the money! It also takes 20 seconds to run!
I can program myself out of a Hello World Contest!!
So data compression tries to replace repeated chunks of data (character strings) with a common, hopefully much smaller, reference value (my simplistic view). If you are trying to compress knowledge, are you trying to find repeated (or at least similar) concepts and replace those with a common access point? For example, "Boiling water is hot", "Steaming coffee is hot", "Cheese from pizza right out of the oven can burn your mouth" (to me) have some knowledge relationships that keep you from burning yourself. So could you 'compress' these to a single concept like 'do not touch' or 'be careful'? Again, just trying to get a handle on this concept.
Here's some free, complimentary clue for you: "All that glitters is not gold." Just because someone wrote a ton of maths about compression, doesn't mean it's anywhere near applicable to building an AI.
And in fact, noone actually built yet anything even vaguely resembling intelligence yet, so maybe, just maybe, all those hypotheses are missing some vital point. Maybe, just maybe, just because you've compressed something to the minimum number of bits, doesn't mean jack squat about being able to use it effectively in real time. Maybe, just maybe, just because you've run arithmetic compression over a text, doesn't mean jack squat about actually being able to parse and navigate the semantic relationships in it. You just have a compression program and need some hours just to get your stream of bytes back, and be no wiser as to how to act intelligently on it.
In the meantime, however, we have a ton of psychologists, neurologists and, don't laugh, even stage magicians who've explored how the brain does work. You know, the only actual thing that actually exhibits some intelligence. We've worked out a lot of odds and ends about what happens there. We have some idea of what it filters out and when, how many seconds do some of the buffers hold the data, and so on.
So maybe you can take some hint from there. Maybe the information you need to extract and act on is really something like "[car][incoming]" instead of ivory-tower exercises like calculating the minimum number of bits you can compress the incoming stream to. That's the hard part: extracting the useful information there.
Or do you need something with lots of formulae? Read something about AI, lemming. There are a lot of people who've worked on various inference machines and such, and, surprise, none of them involved compression algorithms.
So, you know, speaking of "ubtracting out the sound of your own voice droning on at the drop of a pin about subjects you haven't made an effort to fully appreciate", maybe you can take your own advice there. Just because you get a boner about some maths, doesn't mean jack squat about applying it to a practical AI problem. Maybe start using your own head about how you'd actually act on that information instead. Just a thought, you know.
A polar bear is a cartesian bear after a coordinate transform.
The result of this is pretty much what you need for epistemology, software and legal disciplines: Optimal language models telling you precisely how language maps to knowledge.
There was some debate about using the multilingual versions of Wikipedia but those projects are less important than just getting natural language -- any natural language -- to be modeled well enough to construct natural language UI's with some reasonable level of rigor that your mother could use (assuming she can read English).
Seastead this.
So I don't know if anyone's mentioned this yet, but it might be interesting / worthwhile to point out that the goal of the contest is to compress the 100 MB of wikipedia knowledge into a self-extracting program as efficiently as possible is nearly equivalent to asking "what's the Kolmogorov Complexity of this data? There is a definite (unknown) lower bound on the size of the program that will describe the knowledge, which is the size of the compressed data + the size of the decompressing program. This is a non-trivial problem -- especially with the added constraints of limited memory and time -- that could yield surprising and insightful results.
The two other replies to your post have already done a good job at proving you're wrong, so I'll just add that if pi is proved random it will most likely not by a test of its digits, but by a more abstract proof, since it has an infinite number of digits.
The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
I didn't have any real difficulty reading that sentence. Only "reversing" and "interior" posed much trouble. Plus, in sorting algorithms (that's really what this is.. the brain resorting the letters in the word), it's not impossible for a sort procedure on a fully-reverse-sorted list to be less efficient than a sort procedure on a randomized list (of course this depends on the algorithm in use). In other words, fully reversing the interior letters could well present a more difficult problem for the brain than randomizing the interior letters. I'm not a neuroscientist either; I don't know how the brain actually handles something like sorting letters.
That's what your mom said last night!! BURN!!!
http://undecidedgames.blogspot.com
Matt Mahoney reports that Alexander Ratushnyak already has a new program, paq8hp6.exe which appears to improve on paq8hp5 by the 1% threshold required for another payout -- given it survives the 30 day comment period.
Seastead this.