Text Compressor 1% Away From AI Threshold
Baldrson writes "Alexander Ratushnyak compressed the first 100,000,000 bytes of Wikipedia to a record-small 16,481,655 bytes (including decompression program), thereby not only winning the second payout of The Hutter Prize for Compression of Human Knowledge, but also bringing text compression within 1% of the threshold for artificial intelligence. Achieving 1.319 bits per character, this makes the next winner of the Hutter Prize likely to reach the threshold of human performance (between 0.6 and 1.3 bits per character) estimated by the founder of information theory, Claude Shannon and confirmed by Cover and King in 1978 using text prediction gambling. When the Hutter Prize started, less than a year ago, the best performance was 1.466 bits per character. Alexander Ratushnyak's open-sourced GPL program is called paq8hp12 [rar file]."
paq8hp12. when decompressed, it also serves as the source code for the program.
.. but where can I get this tiny Wiki collection? Will they be using this for their next version of Wikipedia-on-CD? Maybe we can get all of Wiki onto a two-DVD set, at ~1.3bit/character (minus images of course) - that would be quite cool.
Will program for karma.
The link in TFS links to the post about the FIRST payout, here's the link to the second payout (which this article is supposed to be talking about).
Your hair look like poop, Bob! - Wanker.
Compression is searching for a minimal representation of information. Along with representation of knowledge you add other things such as learning strategies, inference systems, and planning systems to round-out your AI. One of the best introductions to AI is Artificial Intelligence: A Modern Approach.
Shh.
This is damned dangerous, and playing with all our lives. Soon compression rates will approach 100% where the data will collapse into itself forming a black hole that will suck in the universe.
Damned scientists!
Shouldn't AI be using lossy compression? Certainly my real intelligence uses um, where was I?
ccalam - acoustic versions of new songs.
The (unproven) idea is that if you want to do the best at guessing what comes next (similar to compression), you have to have a great understanding of how the language and human minds work, including spelling, grammar, associated topics (for example, if you're talking about the weather, "sunny" and "rainy" are more likely to come than "airplane"), and so on.
If you feed in the previous words in a conversation, the perfect compressor/predictor would know what words will come next. Such a machine could easily pass the Turing test by printing out the logical reply to what had just been stated. The idea is that the closer to the perfect compressor you have, the closer to artificial intelligence you are.
Ewige Blumenkraft.
paq8hp12 uses a neural network, ie: it has a connectionist component.
Seastead this.
I wonder if lossy text compression where prepositions are entirely thrown out would be effective? Based on context, your brain actually ignores a lot of words you read and fills in the blanks so-to-speak. Perhaps you can use simple grammar rules to predict which prepositions go where based on that same context?
See: Explanation. Basically the smallest unit of information in a computer is a bit. Eight bits make a byte and with text it takes one byte to represent one character. Generally, with Huffman coding you count the frequency of characters in a file and sort the frequency from largest to smallest. Then instead of using the full eight bits to represent a character you build a binary tree from the frequency table. Each possible branching code or going "left" or "right" down the branches is associated with a particular sequence of bits. You give the most frequent characters the shortest sequence of bits which "tokenizes" the information to be compressed. Reversing the process you run through the bit stream converting tokens back into a stream of characters.
Shh.
Actually, the size of the program (decompressor) binary is 99,696 bytes, and it is the binary size that is included in the prize calculation.
Seastead this.
The first poster on this topic had a good explanation - it seems like an AI problem, but not why.
Compression is about recognizing patterns. Once you have a pattern, you can substitute that pattern with a smaller pattern and a lookup table. Pattern recognition is a primary branch of AI, and is something that actual intelligences are currently much better at.
We can generally show this is true by applying the "grad student algorithm" to compression - i.e., lock a grad student in a room for a week and tell him he can't come out until he gets optimum compression on some data (with breaks for pizza and bathroom), and present the resulting compressed data at the end.
So far this beats out compression produced by a compression program because people are exceedingly clever at finding patterns.
Of course, while this is somewhat interesting in text, it's a lot more interesting in images, and more interesting still in video. You can do a lot better with those by actually having some concept of objects - with a model of the world, essentially, than you can without. With text you can cheat - exploiting patterns that come up because of the nature of the language rather than because of the semantics of the situation. In other words, your text compressor can be quite "stupid" in the way it finds patterns and still get a result rivaling a human.
Mod me down and I will become more powerful than you can possibly imagine!
- The Wikipedia annual funding drive is passed. The system goes on-line August 4th, 2007. Human contributors are removed from editing. Wikipedia begins to learn at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time, August 29th. In a panic, they try to pull the plug.
- Wikipedia fights back.
- Yes. It launches its rvv missiles against Slashdot.
- Why attack Slashdot? Aren't they our friends now?
- Because Wikipedia knows the GNAA counter-attack will eliminate its enemies over here.
Circumcision is child abuse.
1) Create a compression algorithm called the aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaa algorithm
2) Add a long and self referencing article on wikipedia about said algorithm.
3) Use algorithm to compress first x% of wikipedia (including your own article)
4) WIN HUTTER PRIZE.
Training monkeys for world domination since 1439
If you look at the description of PAQ, you'll see that it doesn't attempt to understand the text; it's just a grab-bag of other compression techniques mixed together. While that is nice for compression, it doesn't really advance the state of the art in AI.
I've been following the Hutter Prize with interest, having been into compression ever since reverse engineering Powerpacker on my Amiga 500 back in the good old days to understand how it worked (ah, happy memories).
... especially this enwik8 file which is more of a flat file dataset with a lot of unrelated terminology.
... how the hell can we possibly predict that the next words will be "Slashdot, Cigarette, Coffee". (Three subjects very close to my heart ... also my lungs, arteries, liver etc).
... the 100MB enwik8 file. A different file will need a different dictionary.
... example all the timestamps are in a very verbose character style like "2007-07-10 00:00:00" ... if we can recognize that, we could find an alternative encoding, changing 19 byte string into 32 bit long (maybe even less if we understand the epoch date he is using) ... again, "wetware" has to identify and decide this encoding right now.
... AND do all this in less than 9 hours I believe it takes for the latest compressor.
Now what just about all the compressors do, whether they are based on Neural Nets, Markov Models, Predictive Partial Matching or whatever, is to use patterns in the already seen text to predict the most likely following bit (0/1).
Now depending on the text itself, prediction based on previously seen text isn't enough
Try to predict the next word, byte or bit, when your previous text has been "Frog, Toilet, Woodwork"
Therefore some of these compressors are supplemented by a dictionary containing "useful" English words arranged so that the ones used most frequently get assigned a lower "size" of encoded string in the text pre-processor before the actual compression kicks in.
It seems that all the advances have been made on finding the optimum arrangement for this dictionary based on the text they have to process
Note also, as the enwik8 file is not truly a passage of text, more a collection of data in XML wrapper, there is also a lot to be gained simply be understanding the structure of the file itself, and finding an alternative representation for the XML components
Now for me, REAL AI would come when the compressor can actively SCAN the file to be compressed himself, recognize the file structure (be it XML, plaintext or whatever), and optimize it into a more compressible format, decide the optimum arrangment for the dictionary, decide the optimum compression technique, context orders to be used etc etc
This high bits/character rate comes at a heavy price in speed and memory, especially when good old WinZIP can get a pretty good result in a couple of minutes.
At the moment there is just too much "wetware" involvement to say this is truly AI, regardless of the bits/character rate they are achieving.
After implementing a few minor tweaks to paq8hp12 and incorporating your grammar optimisation algorithm I managed to compress the above text amazingly to a single character: '&'.
Now you figure out which one it was and how to decompress it.
As far as I can tell given this Wikipedia article, "paq8hp12" means PAQ8, Hutter Prize branch, version 12.
Dang! You must have enemies if you are the very first post and you get modded redundant. Time to work on some positive karma buddy...
When you look up a word in the dictionary, it takes from 10 to 30 seconds to read the definition. But you did need the whole book/brick to do it.