Text Compressor 1% Away From AI Threshold
Baldrson writes "Alexander Ratushnyak compressed the first 100,000,000 bytes of Wikipedia to a record-small 16,481,655 bytes (including decompression program), thereby not only winning the second payout of The Hutter Prize for Compression of Human Knowledge, but also bringing text compression within 1% of the threshold for artificial intelligence. Achieving 1.319 bits per character, this makes the next winner of the Hutter Prize likely to reach the threshold of human performance (between 0.6 and 1.3 bits per character) estimated by the founder of information theory, Claude Shannon and confirmed by Cover and King in 1978 using text prediction gambling. When the Hutter Prize started, less than a year ago, the best performance was 1.466 bits per character. Alexander Ratushnyak's open-sourced GPL program is called paq8hp12 [rar file]."
How many bad car analogies, inaccurate law advice, and duplicate stories an AI bot could possibly hold in his head. Imagine what kind of person all of the "knowledge" of Slashdot would create.
The horror.
Life is rarely fair. Cherish the moments when there is a right answer.
so...wikipedia dumps will now be using paq8hp12 instead of l33t 7zip ?
paq8hp12. when decompressed, it also serves as the source code for the program.
.. but where can I get this tiny Wiki collection? Will they be using this for their next version of Wikipedia-on-CD? Maybe we can get all of Wiki onto a two-DVD set, at ~1.3bit/character (minus images of course) - that would be quite cool.
Will program for karma.
The link in TFS links to the post about the FIRST payout, here's the link to the second payout (which this article is supposed to be talking about).
Your hair look like poop, Bob! - Wanker.
Compression is searching for a minimal representation of information. Along with representation of knowledge you add other things such as learning strategies, inference systems, and planning systems to round-out your AI. One of the best introductions to AI is Artificial Intelligence: A Modern Approach.
Shh.
This is damned dangerous, and playing with all our lives. Soon compression rates will approach 100% where the data will collapse into itself forming a black hole that will suck in the universe.
Damned scientists!
Could someone out there please explain how being able to compress text is equivalent to artificial intelligence?
Is this to suggest that the algorithm is able to learn, adapt and change enough to show evidence of intelligence?
Do it yourself, because no one else will do it yourself. [beta blockade 10-17 Feb]
I've worked with some general purpose compression algorithms like zlib, lossy audio compression like mp3, and also lossless audio.
Each is very different and interesting in its own right. MP3 especially, because the compression model is built on what the ears+brain can perceive.
This algorithm I guess would be sort of like mp3 in that it contains some human-based element, maybe a language structure or something, but more like FLAC in that it might use predictors to say what word is likely to come next, with an error bitstream to point to progressively less likely words using bit sequences whose is inversely related to the probability of that word. But that's just a guess from an audio guy.
Can somebody who's looked at this post a synopsis of how it works?
Shouldn't AI be using lossy compression? Certainly my real intelligence uses um, where was I?
ccalam - acoustic versions of new songs.
It is not equivalent, so I'm not surprised you didn't get it. As far as I know, the reasoning goes as follows: Shannon estimated that each character contains 1.something bits of information. Shannon was an intelligent human being. So if a program reaches this limit, it is as smart as Shannon.
And yes, that's absolute bollocks. Shannon's number was just an estimate and only applied to serial transmission of characters, because that's what he was interested in. Since then, a lot of work has been done in statistical natural language processing, and I would be surprised if the number couldn't be lowered.
Anyway, since the program doesn't learn or think to reach this limit, nor gives a explanation how this level of compression is intrinsically linked to the language/knowledge it compresses, it cannot be called AI; e.g., it doesn't know how to skip irrelevant bits of information in the text. That would be intelligence...
The (unproven) idea is that if you want to do the best at guessing what comes next (similar to compression), you have to have a great understanding of how the language and human minds work, including spelling, grammar, associated topics (for example, if you're talking about the weather, "sunny" and "rainy" are more likely to come than "airplane"), and so on.
If you feed in the previous words in a conversation, the perfect compressor/predictor would know what words will come next. Such a machine could easily pass the Turing test by printing out the logical reply to what had just been stated. The idea is that the closer to the perfect compressor you have, the closer to artificial intelligence you are.
Ewige Blumenkraft.
paq8hp12 uses a neural network, ie: it has a connectionist component.
Seastead this.
See: Explanation. Basically the smallest unit of information in a computer is a bit. Eight bits make a byte and with text it takes one byte to represent one character. Generally, with Huffman coding you count the frequency of characters in a file and sort the frequency from largest to smallest. Then instead of using the full eight bits to represent a character you build a binary tree from the frequency table. Each possible branching code or going "left" or "right" down the branches is associated with a particular sequence of bits. You give the most frequent characters the shortest sequence of bits which "tokenizes" the information to be compressed. Reversing the process you run through the bit stream converting tokens back into a stream of characters.
Shh.
Actually, the size of the program (decompressor) binary is 99,696 bytes, and it is the binary size that is included in the prize calculation.
Seastead this.
The first poster on this topic had a good explanation - it seems like an AI problem, but not why.
Compression is about recognizing patterns. Once you have a pattern, you can substitute that pattern with a smaller pattern and a lookup table. Pattern recognition is a primary branch of AI, and is something that actual intelligences are currently much better at.
We can generally show this is true by applying the "grad student algorithm" to compression - i.e., lock a grad student in a room for a week and tell him he can't come out until he gets optimum compression on some data (with breaks for pizza and bathroom), and present the resulting compressed data at the end.
So far this beats out compression produced by a compression program because people are exceedingly clever at finding patterns.
Of course, while this is somewhat interesting in text, it's a lot more interesting in images, and more interesting still in video. You can do a lot better with those by actually having some concept of objects - with a model of the world, essentially, than you can without. With text you can cheat - exploiting patterns that come up because of the nature of the language rather than because of the semantics of the situation. In other words, your text compressor can be quite "stupid" in the way it finds patterns and still get a result rivaling a human.
Mod me down and I will become more powerful than you can possibly imagine!
Actually, the size of the program (decompressor) binary is 99,696 bytes, and it is the binary size that is included in the prize calculation.
.exe by itself is 124KB. Where did you get 99,696?
Wha wha wha? So why couldn't I just include a 100MB data file with my decompressor and claim an infinite compression ratio with just the following shell script: "cat datafile"
Maybe I'm misunderstanding the contents of that rar file. Are both of those data files needed? The
- The Wikipedia annual funding drive is passed. The system goes on-line August 4th, 2007. Human contributors are removed from editing. Wikipedia begins to learn at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time, August 29th. In a panic, they try to pull the plug.
- Wikipedia fights back.
- Yes. It launches its rvv missiles against Slashdot.
- Why attack Slashdot? Aren't they our friends now?
- Because Wikipedia knows the GNAA counter-attack will eliminate its enemies over here.
Circumcision is child abuse.
They argue that predicting which characters are most likely to occur next in a text sequence requires vast real-world knowledge.
The apparent empirical result is that predicting which characters are most likely to occur next in a text sequence requires either
1) vast real-world knowledge
OR
2) vast real-world derived statistical databases and estimation machinery
but there can be a difference in their utility. The point of course, is that humans can do enormously more powerful things with that vast real-world knowledge in addition to symbolic estimation.
The underlying question is whether physical natural intelligence is really just real-world derived statistical databases and estimation machinery. Modern neuroscience says,
"depends on what the meaning of 'is' is, but it's at least halfway there."
However would completing mathematical theorems by searching through Google (statistical pattern matching, which might sort of work for all known theorems on Google) work?
Clearly natural intelligence includes many tasks which can be now well solved with data-oriented sophisticated statistical approaches, perhaps with equal or better performance. Modern algorithms like 'independent components analysis' now can estimate individual sources in audition, "the cocktail party effect" a problem some once thought was a clear sign of true 'intelligence'. Turns out that some sufficiently clever signal processing and nonlinear objective functions can do it---so maybe that's what neurons do too.
The still unsolved question is whether there are some tasks which are clearly 'intelligence' where this class of methods will profoundly fail. Maybe like creating really new mathematics?
1) Create a compression algorithm called the aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaa algorithm
2) Add a long and self referencing article on wikipedia about said algorithm.
3) Use algorithm to compress first x% of wikipedia (including your own article)
4) WIN HUTTER PRIZE.
Training monkeys for world domination since 1439
If you look at the description of PAQ, you'll see that it doesn't attempt to understand the text; it's just a grab-bag of other compression techniques mixed together. While that is nice for compression, it doesn't really advance the state of the art in AI.
The problem with this approach is that there are many ways the say the same thing, and that this compression/decompression algorithm is tested using strict text-comparison only. A real AI might compress 'The sky is blue today' and decompress to 'Today it's beatiful weather' and not be wrong.
Religion is what happens when nature strikes and groupthink goes wrong.
I suppose I have to post this anonymously, or the Hutter Prize thralls will just mod it down; I like my karma.
/. submissions), Mahoney (who wrote a thesis on this crap), and Ratushnyak, who seems to enjoy wasting his time on this AI-obsessed prize.
I am at a loss as to how this meaningless charade keeps getting posted on Slashdot. Anyone with half a brain who reads TFA (or any of the previous FA Slashdot has posted on this stupid prize) can see this for what it really is: a handful of crazy people who think that this has meaning beyond above-average technogarbage.
There are all of four people seriously involved in this Hutter Prize: Hutter himself, Bowery (who's made all the
PAQ8HP12 may be able to compress the corpus extremely efficiently, but it has obvious and real drawbacks for any real-world application: it's tuned for this specific corpus ("H[utter]P[rize]" is even in the name of the compressor), it's slow as fuck, and it consumes 2GB of memory. Yes, 2GB of memory for 100MB of input data. This is not a real-world algorithm; this is CS weenies wanking off.
And what's with the obsession with Wikipedia? It is not the be-all, end-all of human knowledge, and, despite its devotees' claims, never will be; just look at the internal politics, and you'll see that it simply can't scale to that size. Is it a useful resource? Of course. Is it something worthy of adoration and fawning over? No.
And then, of course, there's the obsession with AI. These people seem to be of the opinion that a text compressor will actually lead to artificial intelligence -- with no other tuning! An absurd claim if I've ever heard one; the predictive capabilities of a good text compressor are something that would no doubt be useful to an AI, but there's one hell of a lot more to general intelligence than just pattern matching and statistical algorithms for compression.
If one really wanted to sponsor an AI prize, it would probably be much better to focus on creating, say, an effective chatbot -- something that really can predict a desirable response and pass Turing's test.
Not this compression bullshit.
Indeed Russell & Norvig is a very good book, well worth a read if you're interested in AI. All the same, when I did my BSc in Artificial Intelligence I found Rich & Knight a much better, more understandable book for the purposes of an introductory text. It is a little dated now, but so is Russell & Norvig, to be honest.
Singularity: a belief in the "God" idea with the "demiurge" relation inverted.
I've been following the Hutter Prize with interest, having been into compression ever since reverse engineering Powerpacker on my Amiga 500 back in the good old days to understand how it worked (ah, happy memories).
... especially this enwik8 file which is more of a flat file dataset with a lot of unrelated terminology.
... how the hell can we possibly predict that the next words will be "Slashdot, Cigarette, Coffee". (Three subjects very close to my heart ... also my lungs, arteries, liver etc).
... the 100MB enwik8 file. A different file will need a different dictionary.
... example all the timestamps are in a very verbose character style like "2007-07-10 00:00:00" ... if we can recognize that, we could find an alternative encoding, changing 19 byte string into 32 bit long (maybe even less if we understand the epoch date he is using) ... again, "wetware" has to identify and decide this encoding right now.
... AND do all this in less than 9 hours I believe it takes for the latest compressor.
Now what just about all the compressors do, whether they are based on Neural Nets, Markov Models, Predictive Partial Matching or whatever, is to use patterns in the already seen text to predict the most likely following bit (0/1).
Now depending on the text itself, prediction based on previously seen text isn't enough
Try to predict the next word, byte or bit, when your previous text has been "Frog, Toilet, Woodwork"
Therefore some of these compressors are supplemented by a dictionary containing "useful" English words arranged so that the ones used most frequently get assigned a lower "size" of encoded string in the text pre-processor before the actual compression kicks in.
It seems that all the advances have been made on finding the optimum arrangement for this dictionary based on the text they have to process
Note also, as the enwik8 file is not truly a passage of text, more a collection of data in XML wrapper, there is also a lot to be gained simply be understanding the structure of the file itself, and finding an alternative representation for the XML components
Now for me, REAL AI would come when the compressor can actively SCAN the file to be compressed himself, recognize the file structure (be it XML, plaintext or whatever), and optimize it into a more compressible format, decide the optimum arrangment for the dictionary, decide the optimum compression technique, context orders to be used etc etc
This high bits/character rate comes at a heavy price in speed and memory, especially when good old WinZIP can get a pretty good result in a couple of minutes.
At the moment there is just too much "wetware" involvement to say this is truly AI, regardless of the bits/character rate they are achieving.
PAQ8HP12 -7 enwik8.paq8hp12 enwik8
and moving the enwik8 archive to the parent directory:
./PAQ8HP12 -7 enwik8.paq8hp12
. ..
../enwik8/enwik8
JamesBowery@oldatlantis ~/hutter/paq8hp12
$ time
100000000 enwik8: extracted
16381959 -> 100000000 (1.3106 bpc) in 22398.08 sec (4.465 KB/sec), 941315 Kb
real 373m19.379s
user 0m0.031s
sys 0m0.030s
JamesBowery@oldatlantis ~/hutter/paq8hp12
$ ls -al
total 114216
drwxrwxrwx+ 2 JamesBowery None 0 Jul 1 00:43
drwxrwxrwx+ 6 JamesBowery None 0 Jun 30 14:59
-rwxrwxrwx 1 JamesBowery None 99696 May 14 09:16 PAQ8HP12.EXE
-rwxrwxrwx 1 JamesBowery None 100000000 Jul 1 06:54 enwik8
-rwxrwxrwx 1 JamesBowery None 16381959 Jun 30 21:09 enwik8.paq8hp12
-rwxrwxrwx 1 JamesBowery None 465211 Jul 1 00:43 temp_HKCC_dict1.dic
JamesBowery@oldatlantis ~/hutter/paq8hp12
$ diff enwik8
JamesBowery@oldatlantis ~/hutter/paq8hp12
$
Seastead this.
Connectionist models are models. Any model needs to be interpreted to be understood.
Seastead this.
that is an idea or a concept. Interpreting an idea or concept in different ways is meaningful
only by its context.
ex1 the sky is blue => it's beautifull weather (context: you're making a walk)
ex2: the sky is blue => use #0000FF for the sky area (context: graphic work)
If you say, "the weather is beautifull" to an artist he may draw you yellowish-reddish sunset,
which is not the correct interpretation of "the sky is blue" you had in mind" So the context is vital.
I imagine a real AI would evaluate the context and predict what are the next words most likely to be put forward. If it
succeded to translate a concept to an another in a meaningful context "the sky is blue => it's a beautiful weather let's get down the nasa shuttle"
it would no longer be an AI but an I
"A real AI might compress 'The sky is blue today' and decompress to 'Today it's beatiful weather' and not be wrong." That might be a good example of acceptable *lossy* AI text compression. One step further and it will compress articles into a proper, readable summary.
Visit http://ringbreak.dnd.utwente.nl/~mrjb/growingbettersoftware to download your free copy of the book
After implementing a few minor tweaks to paq8hp12 and incorporating your grammar optimisation algorithm I managed to compress the above text amazingly to a single character: '&'.
Now you figure out which one it was and how to decompress it.
Which is a shame, because the weather wasn't good.
++ Say to Elrond "Hello.".
Elrond says "No.". Elrond gives you some lunch.
Seriously, this not inveted for mobile hand held devices. At this moment without compression you could probably store enough text on a mobile phone to keep you constantly reading for a month.
Anything that is science is math.
Ok, computer programming is not necessarily a lot of maths.
But this article is about something that is really computer science... as opposed to making a CRUD screen in VB.net, which is akin to programming a VCR.
Parsing, compiling, linear programming, sorting, searching, indexing, compressing, walking graphs, drawing graphics, designing circuits, optimizing circuits, these are activities that are computer science and that are all maths.
Edsger Dijkstra once said: "Computers are to computer science what telescopes are to astronomy".
It's better to be the foot on the boot than the face on the pavement. ~~ tkx Kadin2048
Compression has reached a point of diminishing returns, getting less and less return for more and more work. And at best it's asymptotically approaching the theoretical limit. You could offer a billion dollar prize and get back maybe a few percent of improvement, while making any further improvement more difficult.
Meanwhile data storage and data transmission technology keeps improving many percent a year, with each improvement compounding on the previous ones.
In Other Words, IMHO money would be better spent on the second area rather than the first.
As far as I can tell given this Wikipedia article, "paq8hp12" means PAQ8, Hutter Prize branch, version 12.
Surely you don't need any mathematical skills to do this kind of work...
= 1&mode=thread&commentsort=0&sid=247781&op=Reply ;)
http://science.slashdot.org/comments.pl?threshold
A work that expires before its copyright never enters the public domain and thus enjoys eternal copyright protection.
"When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science."
--Lord Kelvin
in mathematical form:
science = math + measurements
That's it. Science is:
1. measure phenomena,
2. figure out the formulas,
3. predict new phenomena,
4. measure new phenomena,
5. if Ok, back to stage 3; if not, back to stage 2.
(ok, ok, 6. (...), 7. Profit!!!, just to appease the masses)
notice stages 1 and 4 are measurements, stages 2 and 3 are maths.
It's better to be the foot on the boot than the face on the pavement. ~~ tkx Kadin2048
You need to read Michael Faraday, or some of his predecessors.
Math is a relatively late addition to science. Yes, it's proved very useful. But science happened long before they introduced math.
Well, thinking again, this depends on what you mean by math. Leonardo used math to figure out perspective. Does this means that art depends on math? If so, then science depends on math, and so does walking across the room. And I can see a valid argument to be made along those lines, but that's not what people normally mean. If we look at what people normally mean, then science didn't depend on math until around the time of Kepler. Perhaps you want to call everything earlier engineering rather than science, but engineering depends on math just as heavily as science.
What actually happened was that after algebra was invented, and arabic numerals, it became a lot easier to describe things in math, so people gradually switched away from describing things in ordinary language and to describing them in math. This has had both advantages and disadvantages. Certainly precision has improved. But comprehension by "ordinary folk" has declined, and not entirely because of the arcane subject matter, but also because they needed to learn a new language in order to understand what was being talked about.
OTOH, can you imagine talking about computer programming without using "jargon"?
I think we've pushed this "anyone can grow up to be president" thing too far.
Until Shannon type experiments, involving humans doing next character predictions of enwik8, are performed, the bounds of enwik8's entropy range must remain unknown but is likely lower than 0.6 to 1.3. As such an experiment would be expensive, it is going to be difficult to say with any simple bpc measure when the Hutter Prize is breaching the threshold of AI. What the Hutter Prizes bpc metric gives us, however, is a clear measure of progress.
My apologies to the other members of the Hutter Prize Committee and the /. community for this error.
PS: Another area of concern raised by Mahoney is that enwik8, at 10e8 characters, is only as much verbal information as a 2 or 3 year old has encountered so although it is sufficient to demonstrate AI capabilities well beyond the current state of the art, his preference is for a much larger contest with fewer resource restrictions focusing on the 10e9 character enwik9 which is more likely to produce a the AI equivalent of an adult with encyclopedic knowledge.
Seastead this.