First Hutter Prize Awarded
stefanb writes, "The Hutter Prize for Lossless Compression of Human Knowledge, an ongoing challenge to compress a 100-MB excerpt of the Wikipedia, has been awarded for the first time. Alexander Ratushnyak managed to improve the compression factor to 5.86 and will receive a 3,416-Euro award. Being able to compress knowledge well is believed to be related to acting intelligently." The Usenet announcement notes that at Ratushnyak's request, part of the prize will go to Przemyslaw Skibinski of the University of Wroclaw Institute of Computer Science, for his early contributions to the PAQ compression algorithm.
It turns out that Alexander Ratushnyak was the only person to even meet the challenge's guidelines, and one of only 3 people who submitted entries. Seems like a strange thing to award up to 50,000 Euro for ... his compression was only 20% smaller than what I just quickly did with gzip. I'm rather surprised that more people didn't try though ... it's not like people are throwing money at data compression researchers.
Crack - Free with every butt and set of boobs
"The Hutter Prize for Lossless Compression of Human Knowledge, an ongoing challenge to compress a 100-MB excerpt of the Wikipedia"
Wikipedia? Knowledge? Isn't that already a lossy compression mechanism?
Hehe, and remember the study a few months back that found that the order of letters within a word is unimportant, so long as the first and last are correct? Potentially big words could be listed merely by how many of each letter they have, then randomly mix them up inside, order doesn't matter :)
...
Am I the only one who finds it slightly ironic that (as of this writing), there is no entry for the Hutter Prize on Wikipedia?
If you can't convince them, convict them.
For comparison purposes, WinRAR on its best setting only gets this down to 24MB. Doubtless 7zip could get even lower, but I don't think either could crack the 17MB mark. And certainly neither of those would be self-extracting, which this contest requires.
Cyde Weys Musings - Scrutinizing the inscrutable
How slow this code is. Usually when you are trying to squeak out another 1% of space in a file your algorythm triples in size and quadruples in runtime to get that improvement. I wonder how practical this new algorythm really is. Probably not so much an issue nowadays but I also wonder how big the compression program is. This used to be a really big deal many years ago when the compression programs could easily be larger than the data they were compressing.
I work for the Department of Redundancy Department.
Remember the study a few years back that proved this to be false, pointing out that slpmiy rnisreveg the ioiretnr lrettes mdae it sltnacifingiy mroe dluciffit to raed the wdros?
Now let us never speak of this again.
Being able to compress knowledge well is believed to be related to acting intelligently. - IHNPTTT (I Have Not Passed The Turing Test), but while my brain is good at remember the gist of knowlege, but really bad at losslessly recalling it.
From the contest specs:
thus reducing the slippery concept of intelligence to hard file size numbers
I doubt the instance which wrote that was cynical.
CC.
TaijiQuan (Huang, 5 loosenings)
... by Douglas Adams. Wonder what rate of compression is putting the meaning of life, universe and everything in just the number 42.
wget "http://cs.fit.edu/~mmahoney/compression/enwik8.zi p" | gunzip
:wq
doesn't count?
is what you get when you run the new compression backwards!
This reminds me of the cryptographer paid to crypto-compress datastreams into musical notation. The work laid the foundation to real time packet sniffing of the Internet
Cryptographer? Got atta boys
It took me 10 minutes to figure out "ioiretnr" :-/
How does a computer algorithm for compression relate to the idea that human compression of knowledge corresponds to intelligent action? I think the summary is making a mistake in terminology. An intelligent person is able to compress information and access it quickly, but that doesn't really have much to do with a more efficient compression algorithm, unless we're talking some seriously advanced AI. I just really don't understand the correlation or what the summary is trying to imply... anybody else get it?
According to http://prize.hutter1.net/ this happened on Sept 25 of 2006.
:= 17'073'018 = previous record.
The site also gives some of the requirements..
Create a compressed version (self-extracting archive) of the 100MB file enwik8 of less than 17MB. More precisely:
* Create a Linux or Windows executable of size S L
* If run, it produces (without input from other sources) a 108 byte file that is identical to enwik8.
* If we can verify your claim, you are eligible for a prize of 50'000×(1-S/L). Minimum claim is 500.
* Restrictions: Must run in 10 hours on a 2GHz P4 with 1GB RAM and 10GB free HD.
Compressing information is easy. Now .... let's see them compress an AOL chat room!
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
http://www.cs.fit.edu/~mmahoney/compression/text.h tml
Info about contenders and results of common compression programs on the testset. (All the "just use gzip/rar/winrk/..." fools can stop jabbering now...)
This error even caught Dr. Dobbs compression expert, Mark Nelson so I guess it isn't too surprising it caught the Wikipedia folks.
The beauty of this sort of error in Wikipedia is it demonstrates just how even erroneous human knowledge can result in greater human knowledge if it is compressed:
Imagine, if you will, a perfectly compressed Wikipedia (despite the fact that an optimum compression is not generally computable). This perfectly compressed Wikipedia would generally "hang together" with a great many pieces cross-checking coherently with other pieces and hence be represented with common (compressed) codes. But then, some pieces would stand out as not cross-checking. They would not "hang together" with other pieces and would require specific codes to represent them. To a Sherlock Holmes kind of mind -- the mind of a scientist -- the mind of a ruthless epistemologist -- these codes would represent "suspects" of the crime of lies or error. The idea that compressors like PPM are just as able to compress text as humans (given enough time) would not hang together -- it would not be predicted from the more general body of human knowledge and would therefore become suspect due to its relative incompressibility.
Seastead this.
Bah! Knowing what to compress is more intelligenter...
It must have been something you assimilated. . . .
It would not be feasible in this competition because either the archive must be self-extracting, or that the decompression app's size is included in the final result.
A lookup table of 32 bits would immediately cause an overhead of over 20 GB bytes (assuming an average of 5 letters per word, which is probably too small for the number of word options in 32 bits).
I have tested some of the standard compression tools, here are the compressed sizes:
100000000 | original size
36445241 | gzip --best
29008758 | bzip2
28860178 | rzip -9
24864529 | 7zr -mx=9
24753293 | rar a -m5 -mdG
7z does not do so well, I think because not so much tuned for text compression, it is much better at compressing binaries. For text compression PPMD and the variants are quite good, so I guess you will see good results with WinRK, Compressia, PAQ and the like.
Open Source Alternatives
Not a bad idea, you could always rearrange them to the right order using a dictionary.
In the end, somebody managed to get through the technobabble!
Which would work fine, until it had to distinguish words like salt and slat.
Or words not in a dictionary, like GetPidOf()
"Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
I really misread this as a Hitler Prize.
Unfortunately that doesn't qualify as lossless compression.
Imperceptible loss on the other hand...
What could be better than a jet powered motorcycle? http://www.youtube.com/watch?v=u8l6GTHLSWE
"GetPidOf()"... wasn't that renamed to "Bribe()" some time ago?
http://outcampaign.org/
Recomposing...
http://outcampaign.org/
For all the people laughing at this contest: Read this and this before making any more posts about how ridiculous this contest seems, how stupid it is to compress wikipedia, how compression is such a dumb process, and how bad the compressor was for taking a lot of time to process the text.
The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
Yeah, and remember when that turned out to be baloney?
qntm.org
First Hustler Prize Awarded ?
Maybe I am the only Slashdot reader that enjoys pr0n ...
A single strand of DNA contains the equivalent of about 3GB, so I read. Given the right circumstances it can uncompress into an entire human being, including a brain that is said to be the most complex thing in the known universe. That's gotta be the most over-the-top blowout of a data compression feat ever! You could probably fit the entire internet onto a single floppy disk with technology that advanced :-)
Weird, I just played FFI for the first time, and I could read that perfectly.
...the prize was compressed by a factor of 5.86 as well.
-----
Sorry, I'm only a 1336 h4x0r.
So you're saying that making a computer pass a Turing test is not a real challenge? Do it then!
The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
Apparently, sounds deceive. You might want to practice your hearing. A good start would be subtracting out the sound of your own voice droning on at the drop of a pin about subjects you haven't made an effort to fully appreciate.
Kolmogorov-Chaitin complexity is the deepest theory going about the inter-relationship between compression and prediction. There used to be a great and highly readable account of omega and the math surrounding omega available online by Chaitin, but he appears to have removed the online version I used to enjoy and perhaps recycled it as a book.
24310603 | sitx Best Text Compression, compression level 16, memory consumption 25 (39,6 MB), Optimizers on
Well Stuffit 9.02 seems to win this round. It took under 3 minutes on a 2 GHz iMac G5.
Gotter down to 1 byte. Not bad at all.
I can think of lots of stuff which would realistically count towards intelligence, but "compress knowledge" is the kind of thing that just sounds unbelievably stupid.
Ooooh! Well, if you say it's stupid, then it must be true! I mean, I'm sure you hold a PhD in information theory and artificial intelligence, and as such, are in a perfect position to criticize this work, right? I'd hate to think you were attacking the work of someone far more educated than you in this subject matter out of sheer ignorance and stupidity.
Maybe it could be put into a universal "english library" that the decompressor looks into, and the words are put in a 1000x1000 matrix (there are about E+6 words in english)? That could be addressed with only 2x10 bits.
Of course, words can be bent and you need spaces and such, but in average each word should be able to compress into three bytes of data. The whole bible is about 800 000 words, which would mean 2-3 MB of data compressed.
Of course, you might need the "King James' English Decompressor Library", the "New American Standard English..." et cetera.
Also, it would make for some really interesting transcriptions using say a german decompressor library...
See I have a reference file Zipper. That means that I take long strings and make a tag outta them. Right now it has 1 tag and using one huge 100 meg string reference file. I will update this later to include more. I think I should get the money! It also takes 20 seconds to run!
I can program myself out of a Hello World Contest!!
So data compression tries to replace repeated chunks of data (character strings) with a common, hopefully much smaller, reference value (my simplistic view). If you are trying to compress knowledge, are you trying to find repeated (or at least similar) concepts and replace those with a common access point? For example, "Boiling water is hot", "Steaming coffee is hot", "Cheese from pizza right out of the oven can burn your mouth" (to me) have some knowledge relationships that keep you from burning yourself. So could you 'compress' these to a single concept like 'do not touch' or 'be careful'? Again, just trying to get a handle on this concept.
Here's some free, complimentary clue for you: "All that glitters is not gold." Just because someone wrote a ton of maths about compression, doesn't mean it's anywhere near applicable to building an AI.
And in fact, noone actually built yet anything even vaguely resembling intelligence yet, so maybe, just maybe, all those hypotheses are missing some vital point. Maybe, just maybe, just because you've compressed something to the minimum number of bits, doesn't mean jack squat about being able to use it effectively in real time. Maybe, just maybe, just because you've run arithmetic compression over a text, doesn't mean jack squat about actually being able to parse and navigate the semantic relationships in it. You just have a compression program and need some hours just to get your stream of bytes back, and be no wiser as to how to act intelligently on it.
In the meantime, however, we have a ton of psychologists, neurologists and, don't laugh, even stage magicians who've explored how the brain does work. You know, the only actual thing that actually exhibits some intelligence. We've worked out a lot of odds and ends about what happens there. We have some idea of what it filters out and when, how many seconds do some of the buffers hold the data, and so on.
So maybe you can take some hint from there. Maybe the information you need to extract and act on is really something like "[car][incoming]" instead of ivory-tower exercises like calculating the minimum number of bits you can compress the incoming stream to. That's the hard part: extracting the useful information there.
Or do you need something with lots of formulae? Read something about AI, lemming. There are a lot of people who've worked on various inference machines and such, and, surprise, none of them involved compression algorithms.
So, you know, speaking of "ubtracting out the sound of your own voice droning on at the drop of a pin about subjects you haven't made an effort to fully appreciate", maybe you can take your own advice there. Just because you get a boner about some maths, doesn't mean jack squat about applying it to a practical AI problem. Maybe start using your own head about how you'd actually act on that information instead. Just a thought, you know.
A polar bear is a cartesian bear after a coordinate transform.
So, by the same logic, if you're not holding a Ph.D. in Economics, you're not allowed to say that an apology of Communism is bogus, right? They had plenty of Ph.D.'s in Moscow writing books about how great that system is. You wouldn't go and attack the work of people more educated than you out of sheer ignorance and stupidity, right?
Here's a fun concept for you: appeal to false authority. Because that's the fallacy you're committing there.
Show me someone who's actually built a working AI, and then we'll talk. That's a real authority I'm willing to accept any time. Anything else is, for the scope of discussing what's _needed_ for AI, just an unproven hypothesis. No more, no less. Even if their other maths is good, the supposition that's indeed needed for inteligent behaviour is no more than an unproven hypothesis pulled out of the ass. Arguing that since someone is an expert in the first, they must be automatically right about the second, is no more and no less than an appeal to false authority. It's like arguing that since Newton was good at physics, he _must_ have also be right about alchemy. (Which is another thing he studied, btw.) It's just that stupid.
So seeing that noone actually produced anything that can take a RL information stream and do anything even remotely intelligent with it, I'm not sure what makes them irrefutable authorities on the subject. Where _is_ the proof-of-concept program that uses data compression to actually extract useful knowledge out of Wikipedia and act intelligently about it? No, seriously. Anything else is just taking a guess. So exactly what there says that you should automatically stop thinking for yourself and take his unproven ideas for absolute truth?
In the meantime, however, we have this thing called a human brains that actually exhibits intelligence. And we're starting to understand some things that it does. Such as that, as I've mentioned, it does _not_ work by simply compressing the whole stream, but by extracting the relevant bits and utterly ignoring the rest.
We have _millenia_ of proof that if you're interested enough in the magician's left hand, you completely miss what his right is doing, or that if you focus on the miracle horse doing maths, you miss the owner giving it hints when to stop tapping. Heck, proofs that if they get your attention on something in the foreground, you'll completely miss the guy in a gorilla suit jumping up and down in the back, even made it onto mainstream TV.
So between some unproven hypotheses, and something that actually works that way, I don't know about you, but I'll go with what works here and now. I'll go with how the brain actually seems to work.
A polar bear is a cartesian bear after a coordinate transform.
The result of this is pretty much what you need for epistemology, software and legal disciplines: Optimal language models telling you precisely how language maps to knowledge.
There was some debate about using the multilingual versions of Wikipedia but those projects are less important than just getting natural language -- any natural language -- to be modeled well enough to construct natural language UI's with some reasonable level of rigor that your mother could use (assuming she can read English).
Seastead this.
Here's a fun concept for you: appeal to false authority. Because that's the fallacy you're committing there.
And the fallacy you've committed is decrying a theory without even reading about it, let alone understanding it. Tell me, which is worse?
Anything else is just taking a guess. So exactly what there says that you should automatically stop thinking for yourself and take his unproven ideas for absolute truth?
I don't recall saying that. My point is that, with absolutely no basis in actual fact, you've decided to disregard what is, on the surface, a reasonably valid theory (to me, anyway), based on your personal "feelings". Meanwhile, those far more educated than you in the subject matter consider it a valid area of research. So, what makes you believe that you're qualified to disregard their theories out of hand? Other than, of course, sheer arrogance, which you apparently possess in spades.
I'll go with how the brain actually seems to work.
And now you claim to know how the brain works. Are you really *that* arrogant? Yikes...
BTW, I might point out that people like you used to discount quantum theory. "The world around me doesn't work that way, so it can't possibly be true!" they would say. Guess who was right?
"They laughed at Einstein. They laughed at the Wright Brothers. But they also laughed at Bozo the Clown." - Carl Sagan
Did I mention fallacies? I think I did. In this case the Association Fallacy (sharing one quality or in this case one group of enemies with quantum physics doesn't actually prove this one true too), salted with a bit of Ad Hominem ("people like you...").
Here's a fun reality check for you: quantum theory actually had some actual experimental data to show, and made some easily falsifiable claims. (Falsifiable meaning it could have been easily proven false, if it were false.) And in the meantime it's shown a mountain of evidence and made possible practical stuff such as LASER or the Zener diode. _That_ is what makes it true, not sophistry like who laughed at it or who didn't. That's how science works.
Unfortunately that idea is lost on some CS clowns who think that just arguing personal beliefs, or occasionally preferences, is some form of science. And that expertise in some unrelated domain makes them automatically right. ("I'm the compiler guru, therefore I can tell you which methodology is more productive." Or "I'm the data structures guru, therefore I can tell you what's needed for intelligence.") No, it isn't. Show me the actual experimental proof, and then we'll talk. Until then, it's no better than the work of all those who wrote from positions of authority about what's needed to transmute lead into gold.
Believe it or not, _some_ things about it are known by now. There are more psychologists, neurologists, anthropologists and so on, all the way to stage "magicians", who've studied those various aspects, than CS clowns busy postulating they just know what's needed for intelligence. And the fun part is, the former group actually has some falsifiable data to show for their efforts.
At any rate, it doesn't matter if I'm arrogant (yes, I am) or what you choose to be incredulous about. Show me the experimental data and then we'll talk. That's how real science works.
Otherwise it ends up an "argument ouf ignorance" fallacy. When you've postulated that compression is _needed_ for intelligence, you've made a claim about the brain too. So it ends up basically, "We don't know how the brain on the whole works, so it must be true that it needed data compression to achieve intelligence." Which is a textbook example of that fallacy.
Your using a fallacy yet _again_? Dunno about "worse", but it looks pretty bad. What you assume about me still hasn't proved or disproved the theory you're busy defending. What _would_ even start to prove it would be at least _one_ example of an AI where pure data compression was essential to its workings.
Looking reasonable valid on the surface can be very mis-leading. Alchemy looked reasonably valid on the surface. Communism looked reasonably valid on the surface. Inteligent Design looks reasonably valid on the surface for a _lot_ of people. Etc.
Unfortunately, again, that's not how science works. Science works more along the lines of "show me the experiment". Show me that data compression is even useful at all in a working AI, before making broad claims that it's _needed_ for intelligence. It's that simple.
A polar bear is a cartesian bear after a coordinate transform.
So I don't know if anyone's mentioned this yet, but it might be interesting / worthwhile to point out that the goal of the contest is to compress the 100 MB of wikipedia knowledge into a self-extracting program as efficiently as possible is nearly equivalent to asking "what's the Kolmogorov Complexity of this data? There is a definite (unknown) lower bound on the size of the program that will describe the knowledge, which is the size of the compressed data + the size of the decompressing program. This is a non-trivial problem -- especially with the added constraints of limited memory and time -- that could yield surprising and insightful results.
The two other replies to your post have already done a good job at proving you're wrong, so I'll just add that if pi is proved random it will most likely not by a test of its digits, but by a more abstract proof, since it has an infinite number of digits.
The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
I didn't have any real difficulty reading that sentence. Only "reversing" and "interior" posed much trouble. Plus, in sorting algorithms (that's really what this is.. the brain resorting the letters in the word), it's not impossible for a sort procedure on a fully-reverse-sorted list to be less efficient than a sort procedure on a randomized list (of course this depends on the algorithm in use). In other words, fully reversing the interior letters could well present a more difficult problem for the brain than randomizing the interior letters. I'm not a neuroscientist either; I don't know how the brain actually handles something like sorting letters.
That's what your mom said last night!! BURN!!!
http://undecidedgames.blogspot.com
Matt Mahoney reports that Alexander Ratushnyak already has a new program, paq8hp6.exe which appears to improve on paq8hp5 by the 1% threshold required for another payout -- given it survives the 30 day comment period.
Seastead this.