Text Compressor 1% Away From AI Threshold

Posted by kdawson on Monday July 9, 2007 @06:10PM from the second-hutter-prize dept.

Baldrson writes "Alexander Ratushnyak compressed the first 100,000,000 bytes of Wikipedia to a record-small 16,481,655 bytes (including the decompression program), thereby not only winning the second payout of The Hutter Prize for Compression of Human Knowledge, but also bringing text compression within 1% of the threshold for artificial intelligence. At 1.319 bits per character, the result makes the next winner of the Hutter Prize likely to reach the threshold of human performance (between 0.6 and 1.3 bits per character) estimated by the founder of information theory, Claude Shannon, and confirmed by Cover and King in 1978 using text prediction gambling. When the Hutter Prize started, less than a year ago, the best performance was 1.466 bits per character. Alexander Ratushnyak's open-sourced GPL program is called paq8hp12 [rar file]."
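
The bits-per-character figure in the summary can be checked with a little arithmetic. This is only a sanity check, assuming one character per byte of the 100,000,000-byte input; the variable names are illustrative.

```python
# Sanity check of the bits-per-character figure quoted in the summary.
# Assumption: one character per byte of the original input.
compressed_bytes = 16_481_655   # archive plus decompressor, per the summary
original_chars = 100_000_000    # first 100,000,000 bytes of Wikipedia

# Eight bits per compressed byte, spread over every original character.
bits_per_char = compressed_bytes * 8 / original_chars
print(f"{bits_per_char:.3f} bits per character")
```

The result rounds to the 1.319 quoted above.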

39 of 442 comments

interesting program name by digitalderbs · 2007-07-09 18:18 · Score: 5, Funny

paq8hp12. when decompressed, it also serves as the source code for the program.
1. Re:interesting program name by OverlordQ · 2007-07-09 18:26 · Score: 5, Informative
  
  Since I know people are going to be asking about the name, might I suggest the wiki article about PAQ compression for the reasons behind the weird naming scheme.
  
  --
  Your hair look like poop, Bob! - Wanker.
That's cool.. by Rorian · 2007-07-09 18:19 · Score: 4, Interesting

.. but where can I get this tiny Wiki collection? Will they be using this for their next version of Wikipedia-on-CD? Maybe we can get all of Wiki onto a two-DVD set, at ~1.3bit/character (minus images of course) - that would be quite cool.

--
Will program for karma.
1. Re:That's cool.. by Cato · 2007-07-09 18:38 · Score: 5, Interesting
  
  Or more usefully, compress Wikipedia onto a single SD card in my mobile phone (Palm Treo) - with SDHC format cards, it can do 8 GB today.
  
  Compression format would need to make it possible to randomly access pages, of course, and an efficient search index would be needed as well, so it's not quite that simple.
2. Re:That's cool.. by Kadin2048 · 2007-07-09 18:58 · Score: 5, Informative
  
  Given that it takes something like ~17 hours (based on my rough calculations using the figures on WP) to compress 100MB of data using this algorithm on a reasonably fast computer ... I don't think you'd really want to use it for browsing from CD. No decompression figure is given but I don't see any reason why it would be asymmetric. (Although if there's some reason why it would be dramatically asymmetric, it'd be great if someone would fill me in.)
  
  Mobile use is right out too, at least with current-generation equipment.
  
  Looking at the numbers this looks like it's about on target for the usual resources/space tradeoff. It's a bit smaller than other algorithms, but much, much more resource intensive. It's almost as if there's an asymptotic curve as you approach the absolute-minimum theoretical compression ratio, where resources just climb ridiculously.
  
  Maybe the next big challenge should be for someone to achieve compression in a very resource-efficient way; a prize for coming in with a new compressor/decompressor that's significantly beneath the current resource/compression curve...
  
  --
  "Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
3. Re:That's cool.. by Hal_Porter · 2007-07-09 19:10 · Score: 4, Funny
  
  a text spk version of wiki shud fit in 8gb i think
  its only becoz people are such grammar noobs that they need to waste $
  dood shud filta to txtspk b4 he compress
  
  --
  echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
4. Re:That's cool.. by Anonymous Coward · 2007-07-09 19:29 · Score: 5, Informative
  
  No decompression figure is given but I don't see any reason why it would be asymmetric. (Although if there's some reason why it would be dramatically asymmetric, it'd be great if someone would fill me in.)
  When compressing a file, the program has to figure out the best way to represent the data in compressed form before it actually compresses it; when decompressing, all it has to do is put the data back together according to the method the program previously picked.
  
  This isn't true of all compression techniques, but it's true for many of them, especially advanced techniques, e.g. compressing a short video into MPEG4 can take hours, but most computers don't have a lot of trouble decompressing it in real time.
5. Re:That's cool.. by neonmonk · 2007-07-09 19:49 · Score: 5, Funny
  
  a txt spk vrsion of wiki shd fit 8gb i fink
  its only becoz ppl r sch grmmr noobs tat tey nid 2 wste $
  dud shd filta 2 txtspk b4 he cmpres
  
  There, fixed that for ya.
6. Re:That's cool.. by arun_s · 2007-07-09 20:31 · Score: 4, Insightful
  
  Maybe someone could sell the whole thing in a book-sized rectangular box with a tiny keyboard and 'DON'T PANIC' inscribed in large, comforting letters in the front.
  Now that'd be cool.
  
  --
  I can explain it for you, but I can't understand it for you.
7. Re:That's cool.. by Archimonde · 2007-07-09 21:03 · Score: 5, Funny
  
  aTxtSpkVrsionOfWikiShdFit8gbIFink
  itsOnlyBecozPplRSchGrmmrNoobsTatTeyNid2Wste$
  dudShdFilta2TxtspkB4HeCmpres
  
  Fixed even more.
  
  --
  Trolls are like broken clocks. They show the truth two times a day. The rest of the day they talk nonsense.
8. Re:That's cool.. by thomasj · 2007-07-09 21:09 · Score: 4, Funny
  
  1txtspk #.#/wiki = 8G!
  ~ppl r grm0.1 -> -$
  |txtspk|gzip
  
  --
  :-) = I am happy
  :^) = I am happy with my big nose
  C:\> = I am happy with my OS
9. Re:That's cool.. by imroy · 2007-07-09 21:29 · Score: 5, Informative
  
  Probably not the best example. MPEG4 encoding takes so much time because it's not classical compression: the encoder has to figure out which pieces are less psychorelevant to the big picture, and throw them away.
  
  No, the most time-consuming part of most video encoders (including h.263 and h.264) is finding how the blocks have moved - searching for good matches between one frame and another. For best results, h.264 allows for the matches to not only come from the last frame, but up to the last 16! That allows for h.264 to handle flickering content much better, or situations where something is quickly covered and uncovered again, e.g. a person or car moving across frame, briefly covering parts of the background. Previous codecs did not handle those situations well and had to waste bandwidth redrawing blocks that were on screen just a moment prior.
  
  The point does remain, most "compression" involves some sort of searching which is not performed when decompressing.
10. Re:That's cool.. by Ed+Avis · 2007-07-09 22:02 · Score: 4, Funny
  
  According to Wikipedia, the average per-character entropy of English text has tripled in the last six months!
  
  --
  -- Ed Avis ed@membled.com
11. Re:That's cool.. by Anonymous Coward · 2007-07-09 22:53 · Score: 5, Funny
  
  Perl: The only language that looks the same before and after RSA encryption.
Where's the Mods? by OverlordQ · 2007-07-09 18:30 · Score: 4, Informative

The link in TFS links to the post about the FIRST payout, here's the link to the second payout (which this article is supposed to be talking about).

--
Your hair look like poop, Bob! - Wanker.
Re:Huh? by headkase · 2007-07-09 18:32 · Score: 4, Informative

Compression is searching for a minimal representation of information. Along with representation of knowledge you add other things such as learning strategies, inference systems, and planning systems to round out your AI. One of the best introductions to AI is Artificial Intelligence: A Modern Approach.

--
Shh.
Dangerous by mhannibal · 2007-07-09 18:33 · Score: 4, Funny

This is damned dangerous, and playing with all our lives. Soon compression rates will approach 100% where the data will collapse into itself forming a black hole that will suck in the universe.

Damned scientists!
1. Re:Dangerous by SoulDrift · 2007-07-09 21:28 · Score: 5, Funny
  
  Actually, I can give you 100% compression already. It's just a bit lossy.
2. Re:Dangerous by KylePflug · 2007-07-09 22:25 · Score: 5, Funny
  
  humour
  Humor.
  
  See? American English is actually just essentially lossless compression...
3. Re:Dangerous by smallfries · 2007-07-09 23:30 · Score: 5, Funny
  
  See? American English is actually just essentially lossless compression...
  Sure, sure it is. Not exactly optimal though...
  
  --
  Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
4. Re:Dangerous by Welshalian · 2007-07-10 00:13 · Score: 5, Funny
  
  humour
  Humor. See? American English is actually just essentially lossless compression...
  I respectfully disagree. Most of the fun in British humour gets lost in the translation to American humor.
Re:Artificial Intelligence? by MoonFog · 2007-07-09 18:44 · Score: 4, Informative

Shamelessly copied from the wikipedia article on the Hutter Prize:

The goal of the Hutter Prize is to encourage research in artificial intelligence (AI). The organizers believe that text compression and AI are equivalent problems. Hutter proved that the optimal behavior of a goal seeking agent in an unknown but computable environment is to guess at each step that the environment is controlled by the shortest program consistent with all interaction so far. Unfortunately, there is no general solution because Kolmogorov complexity is not computable. Hutter proved that in the restricted case (called AIXItl) where the environment is restricted to time t and space l, a solution can be computed in time O(t·2^l), which is still intractable. Thus, AI remains an art.

The organizers further believe that compressing natural language text is a hard AI problem, equivalent to passing the Turing test. Thus, progress toward one goal represents progress toward the other. They argue that predicting which characters are most likely to occur next in a text sequence requires vast real-world knowledge. A text compressor must solve the same problem in order to assign the shortest codes to the most likely text sequences.
Lossy compression? by niceone · 2007-07-09 18:45 · Score: 5, Funny

Shouldn't AI be using lossy compression? Certainly my real intelligence uses um, where was I?

--
ccalam - acoustic versions of new songs.
1. Re:Lossy compression? by Phat_Tony · 2007-07-10 03:33 · Score: 4, Interesting
  
  That's my opinion of this. By excluding lossy compression, they're also excluding the likelihood of applicability to AI that is the point of the contest.
  
  Humans achieve good compression on things like encyclopedia knowledge because we don't remember the words at all. We remember the idea, and we have our own dictionary in our heads, and we re-apply words to the idea to reconstruct the entry, rather than memorizing the data. That's why we get great compression; we throw out most of the data, and just remember the "gist" of it, the argument, the facts, in an internal structure of raw ideas stored independently of the words to explain them.
  
  By restricting the contest to lossless compression, they eliminate the ability to use any AI-like compression techniques. The machine can not extract the ideas and then re-assign words, because it would have to be able to do so using the exact voice of each of thousands of different Wikipedia contributors. That's hopeless.
  
  So the entrants are restricted to clever algorithms that do endless mathematical optimizations to compress the data, a method of compression that's entirely alien to the methods of our only known intelligence. We don't remember things by figuring out clever tricks to compress the data in our own memory. We don't say "Oscar Schindler saved Jews In WWII" and then say, OK, that data had 5 spaces in it, and 4 "S's," and if I remember the positions of the spaces and the S's, I could use less memory space to store this in my head, and then just think back through the algorithm I used to take the spaces and "s's" out and put them back in where they go, and I'll have the name again, and then sit there and carefully work out in our heads what the original data must have been after our compression methods. It doesn't work that way at all. To us it apparently "just comes to us." The compression probably comes from things like remembering sounds, and then reconstructing the name's exact spelling based upon known rules of grammar. We store the name Oscar Schindler in relation to various facts regarding Jews and WWII, but we store them as ideas, and then pull the words back out, and each time someone asks us about Schindler, we'd be likely to say something similar in meaning but different in expression. So this contest is restricted to the least interesting kind of compression for intelligence; the kind that can't use it.
  
  Interesting compressions are things like JPEG and MP3, where they built the compression model on the human perceptual model, first saying "what about this exact data is less relevant to a human observer, that we can therefore throw away?" For JPEG's, it turns out that (among other things) we're much more sensitive to differences in color than to absolute colors, and among differences in color, we're much more perceptive in the color ranges closer to human skin tone. MIDI is actually probably closer to the compression used by human intelligence than any recorded music standard.
  
  Along these lines, I'd say storing the HTML formatting data exactly borders on ridiculous. It's a hugely inefficient waste of space. For instance, if you just run the HTML through one of the free online utilities that strips irrelevant data, you get the identical presentation of the data; you've only thrown out entirely worthless data. But you've already violated the contest rules. You should be able to strip the HTML entirely, as long as your compression/decompression system ends up with conveniently readable formatting in the end. Reconstructing the actual HTML in a character-identical way is so non-intelligent when you're trying to save space, it seems hard to believe it's going to lead to intelligence.
  
  Regarding this contest: I'm curious what level of compression you can get if you just histogram the words and then, in order of frequency for anything with enough occurrences to save memory by using a look-up table, you assign sequential numeric values for the words in order of frequency of occurrence. Then start your data with a look-up
  
  --
  Can anyone tell me how to set my sig on Slashdot?
Re:Artificial Intelligence? by qbwiz · 2007-07-09 18:53 · Score: 5, Interesting

Could someone out there please explain how being able to compress text is equivalent to artificial intelligence?

Is this to suggest that the algorithm is able to learn, adapt and change enough to show evidence of intelligence?

The (unproven) idea is that if you want to do the best at guessing what comes next (similar to compression), you have to have a great understanding of how the language and human minds work, including spelling, grammar, associated topics (for example, if you're talking about the weather, "sunny" and "rainy" are more likely to come than "airplane"), and so on.

If you feed in the previous words in a conversation, the perfect compressor/predictor would know what words will come next. Such a machine could easily pass the Turing test by printing out the logical reply to what had just been stated. The idea is that the closer to the perfect compressor you have, the closer to artificial intelligence you are.
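
The link between guessing and compressing can be made concrete. An ideal arithmetic coder spends about -log2(p) bits on a symbol its model predicted with probability p, so better prediction means fewer bits. Below is a minimal sketch with a deliberately simple adaptive character-frequency model; the function name, the Laplace smoothing, and the 256-symbol alphabet are assumptions for illustration, not how paq8hp12 works.

```python
import math
from collections import Counter

def ideal_code_length_bits(text, alphabet_size=256):
    """Bits an ideal coder would spend using an adaptive order-0 model."""
    counts = Counter()
    seen = 0
    total_bits = 0.0
    for ch in text:
        # Laplace-smoothed probability of this character given the past.
        p = (counts[ch] + 1) / (seen + alphabet_size)
        # A symbol predicted with probability p costs -log2(p) bits.
        total_bits += -math.log2(p)
        # Learn from the character only after "paying" for it.
        counts[ch] += 1
        seen += 1
    return total_bits

predictable = "ab" * 200
# 400 characters cycling through 95 printable ASCII symbols.
varied = "".join(chr(32 + (i * 37) % 95) for i in range(400))

print(ideal_code_length_bits(predictable) / len(predictable))
print(ideal_code_length_bits(varied) / len(varied))
```

The repetitive string costs far fewer bits per character than the varied one, because the model's guesses are right more often. A model that also understood grammar and topic would push the number lower still.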

--
Ewige Blumenkraft.
Re:ai threshold? by Baldrson · 2007-07-09 18:53 · Score: 4, Informative

non-connectionist previous attempts (the stuff that came from the functionalists) has come up pretty short - and will continue to do so even if scaled up massively.
paq8hp12 uses a neural network, ie: it has a connectionist component.

--
Seastead this.
Re:I'll be reading the source... by phatvw · 2007-07-09 18:56 · Score: 5, Interesting

I wonder if lossy text compression where prepositions are entirely thrown out would be effective? Based on context, your brain actually ignores a lot of words you read and fills in the blanks so-to-speak. Perhaps you can use simple grammar rules to predict which prepositions go where based on that same context?
Huffman Example by headkase · 2007-07-09 18:57 · Score: 4, Informative

See: Explanation. Basically the smallest unit of information in a computer is a bit. Eight bits make a byte, and with plain text it typically takes one byte to represent one character. Generally, with Huffman coding you count the frequency of characters in a file and sort the frequencies from largest to smallest. Then instead of using the full eight bits to represent a character you build a binary tree from the frequency table. Each step "left" or "right" down the branches corresponds to one bit, so every character ends up associated with a particular sequence of bits. You give the most frequent characters the shortest sequences of bits, which "tokenizes" the information to be compressed. Reversing the process, you run through the bit stream converting tokens back into a stream of characters.
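
A minimal sketch of the scheme described above. It uses a heap to repeatedly merge the two least frequent subtrees rather than a one-off sort, which is the usual way to build the tree. The function names are illustrative, and the "bit stream" is kept as a string of '0'/'1' characters for readability.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Map each character to a bit string; frequent characters get shorter ones."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tiebreak, {symbol: code built so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Going "left" prepends a 0, going "right" prepends a 1.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def encode(text, codes):
    return "".join(codes[ch] for ch in text)

def decode(bits, codes):
    reverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in reverse:  # codes are prefix-free, so the first match is right
            out.append(reverse[current])
            current = ""
    return "".join(out)

text = "abracadabra"
codes = huffman_codes(text)
bits = encode(text, codes)
print(codes, len(bits), "bits instead of", 8 * len(text))
```

A real compressor would also have to store the frequency table (or the tree) alongside the bit stream so the decoder can rebuild the same codes.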

--
Shh.
Re:Program size is 1.02 MB! by Baldrson · 2007-07-09 18:58 · Score: 4, Informative

Actually, the size of the program (decompressor) binary is 99,696 bytes, and it is the binary size that is included in the prize calculation.

--
Seastead this.
Re:Artificial Intelligence? by fireboy1919 · 2007-07-09 18:59 · Score: 5, Insightful

The first poster on this topic had a good explanation - it seems like an AI problem, but not why.

Compression is about recognizing patterns. Once you have a pattern, you can substitute that pattern with a smaller pattern and a lookup table. Pattern recognition is a primary branch of AI, and is something that actual intelligences are currently much better at.

We can generally show this is true by applying the "grad student algorithm" to compression - i.e., lock a grad student in a room for a week and tell him he can't come out until he gets optimum compression on some data (with breaks for pizza and bathroom), and present the resulting compressed data at the end.
So far this beats out compression produced by a compression program because people are exceedingly clever at finding patterns.

Of course, while this is somewhat interesting in text, it's a lot more interesting in images, and more interesting still in video. You can do a lot better with those by actually having some concept of objects - with a model of the world, essentially, than you can without. With text you can cheat - exploiting patterns that come up because of the nature of the language rather than because of the semantics of the situation. In other words, your text compressor can be quite "stupid" in the way it finds patterns and still get a result rivaling a human.

--
Mod me down and I will become more powerful than you can possibly imagine!
Obligatory... by Stormwatch · 2007-07-09 19:08 · Score: 4, Funny

- The Wikipedia annual funding drive is passed. The system goes on-line August 4th, 2007. Human contributors are removed from editing. Wikipedia begins to learn at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time, August 29th. In a panic, they try to pull the plug.
- Wikipedia fights back.
- Yes. It launches its rvv missiles against Slashdot.
- Why attack Slashdot? Aren't they our friends now?
- Because Wikipedia knows the GNAA counter-attack will eliminate its enemies over here.

--
Circumcision is child abuse.
How to win the Hutter Prize by seanyboy · 2007-07-09 19:21 · Score: 5, Funny

1) Create a compression algorithm called the aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaa algorithm
2) Add a long and self referencing article on wikipedia about said algorithm.
3) Use algorithm to compress first x% of wikipedia (including your own article)
4) WIN HUTTER PRIZE.

--
Training monkeys for world domination since 1439
which only goes to show... by nanosquid · 2007-07-09 19:34 · Score: 4, Informative

If you look at the description of PAQ, you'll see that it doesn't attempt to understand the text; it's just a grab-bag of other compression techniques mixed together. While that is nice for compression, it doesn't really advance the state of the art in AI.
Not truly == AI just yet by Anonymous Coward · 2007-07-09 20:51 · Score: 5, Interesting

I've been following the Hutter Prize with interest, having been into compression ever since reverse engineering Powerpacker on my Amiga 500 back in the good old days to understand how it worked (ah, happy memories).

Now what just about all the compressors do, whether they are based on Neural Nets, Markov Models, Prediction by Partial Matching or whatever, is to use patterns in the already seen text to predict the most likely following bit (0/1).

Now depending on the text itself, prediction based on previously seen text isn't enough ... especially this enwik8 file which is more of a flat file dataset with a lot of unrelated terminology.

Try to predict the next word, byte or bit, when your previous text has been "Frog, Toilet, Woodwork" ... how the hell can we possibly predict that the next words will be "Slashdot, Cigarette, Coffee". (Three subjects very close to my heart ... also my lungs, arteries, liver etc).

Therefore some of these compressors are supplemented by a dictionary containing "useful" English words arranged so that the ones used most frequently get assigned a lower "size" of encoded string in the text pre-processor before the actual compression kicks in.
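
A toy version of that pre-processing step, assuming whitespace-separated words and a dictionary built from the text itself. The dictionaries used by the actual prize entries are more elaborate, and every name here is illustrative.

```python
from collections import Counter

def build_dictionary(text):
    """Rank words by frequency; the most frequent word gets index 0."""
    counts = Counter(text.split())
    # Break frequency ties alphabetically so the ranking is deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked)}

def substitute(text, dictionary):
    """Replace each word by its rank; small numbers are cheap to encode later."""
    return [dictionary[w] for w in text.split()]

def restore(indices, dictionary):
    by_rank = {rank: word for word, rank in dictionary.items()}
    return " ".join(by_rank[i] for i in indices)

sample = "the cat and the dog and the bird"
dictionary = build_dictionary(sample)
indices = substitute(sample, dictionary)
print(indices)
```

The list of small integers then goes to the real compressor, which benefits from the most common words all mapping to the lowest values.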

It seems that all the advances have been made on finding the optimum arrangement for this dictionary based on the text they have to process ... the 100MB enwik8 file. A different file will need a different dictionary.

Note also, as the enwik8 file is not truly a passage of text, more a collection of data in an XML wrapper, there is also a lot to be gained simply by understanding the structure of the file itself, and finding an alternative representation for the XML components ... for example, all the timestamps are in a very verbose character style like "2007-07-10 00:00:00" ... if we can recognize that, we could find an alternative encoding, changing a 19-byte string into a 32-bit long (maybe even less if we understand the epoch date he is using) ... again, "wetware" has to identify and decide this encoding right now.
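
A sketch of the timestamp idea, assuming the "YYYY-MM-DD HH:MM:SS" layout quoted above, UTC times, and the Unix epoch (1970-01-01) as the reference date. The function names are made up for illustration, and the exact timestamp layout in enwik8 may differ.

```python
import calendar
import struct
import time

FMT = "%Y-%m-%d %H:%M:%S"  # assumed layout, taken from the example above

def pack_timestamp(text):
    """Turn a 19-character timestamp into 4 bytes (seconds since 1970, UTC)."""
    seconds = calendar.timegm(time.strptime(text, FMT))
    return struct.pack(">I", seconds)  # unsigned 32-bit, big-endian

def unpack_timestamp(blob):
    """Rebuild the original 19-character string exactly."""
    (seconds,) = struct.unpack(">I", blob)
    return time.strftime(FMT, time.gmtime(seconds))

blob = pack_timestamp("2007-07-10 00:00:00")
print(len(blob), unpack_timestamp(blob))
```

Because the round trip is exact, the transform stays lossless, which the contest requires. Choosing an epoch closer to the earliest timestamp in the file would leave even fewer significant bits to encode.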

Now for me, REAL AI would come when the compressor can actively SCAN the file to be compressed itself, recognize the file structure (be it XML, plaintext or whatever), and optimize it into a more compressible format, decide the optimum arrangement for the dictionary, decide the optimum compression technique, context orders to be used etc etc ... AND do all this in less than the 9 hours I believe it takes for the latest compressor.

This low bits/character rate comes at a heavy price in speed and memory, especially when good old WinZIP can get a pretty good result in a couple of minutes.

At the moment there is just too much "wetware" involvement to say this is truly AI, regardless of the bits/character rate they are achieving.
super-grammar-improved paq8hp12 by superbrose · 2007-07-09 21:42 · Score: 4, Funny

After implementing a few minor tweaks to paq8hp12 and incorporating your grammar optimisation algorithm I managed to compress the above text amazingly to a single character: '&'.

Now you figure out which one it was and how to decompress it.
1. Re:super-grammar-improved paq8hp12 by pla · 2007-07-10 05:35 · Score: 5, Funny
  
  Now you figure out which one it was and how to decompress it.
  
  Well, with only 256 choices, it didn't take long to check all possible decodings for one that makes sense. Ended up working for "}".
  
  Oddly, though, the algorithm not only restored, but improved the original! I get:
  
  "The King's English version of Wikipedia should fit in eight gigabits, I do believe. Only humanity's sphexish adherence to grammatical rules limits the attainable compression ratio; the good gentleman might wish to consider filtering to a more base patois prior to applying his algorithm".
  
  Amazing... This discovery could single-handedly render the next generation (nearly) intelligible!
PAQ8, Hutter Prize branch, version 12 by tepples · 2007-07-09 23:35 · Score: 4, Informative

As far as I can tell given this Wikipedia article, "paq8hp12" means PAQ8, Hutter Prize branch, version 12.
Re:new compression standard by aicrules · 2007-07-10 00:39 · Score: 4, Funny

Dang! You must have enemies if you are the very first post and you get modded redundant. Time to work on some positive karma buddy...
Re:Not made for mobile devices by Yvan256 · 2007-07-10 00:57 · Score: 4, Insightful

At this moment without compression you could probably store enough text on a mobile phone to keep you constantly reading for a month.
That may be, however we're talking about Wikipedia here. It's not about storing so much text that you can't go through it within a month, it's about storing everything so that you can access it as a reference.

When you look up a word in the dictionary, it takes from 10 to 30 seconds to read the definition. But you did need the whole book/brick to do it.