Slashdot Mirror


Text Compressor 1% Away From AI Threshold

Baldrson writes "Alexander Ratushnyak compressed the first 100,000,000 bytes of Wikipedia to a record-small 16,481,655 bytes (including decompression program), thereby not only winning the second payout of The Hutter Prize for Compression of Human Knowledge, but also bringing text compression within 1% of the threshold for artificial intelligence. Achieving 1.319 bits per character, this makes the next winner of the Hutter Prize likely to reach the threshold of human performance (between 0.6 and 1.3 bits per character) estimated by the founder of information theory, Claude Shannon and confirmed by Cover and King in 1978 using text prediction gambling. When the Hutter Prize started, less than a year ago, the best performance was 1.466 bits per character. Alexander Ratushnyak's open-sourced GPL program is called paq8hp12 [rar file]."

12 of 442 comments (clear)

  1. Re:interesting program name by OverlordQ · · Score: 5, Informative

    Since I know people are going to be asking about the name, might I suggest the wiki article about PAQ compression for the reasons behind the weird naming scheme.

    --
    Your hair look like poop, Bob! - Wanker.
  2. Where's the Mods? by OverlordQ · · Score: 4, Informative

    The link in TFS links to the post about the FIRST payout, here's the link to the second payout (which this article is supposed to be talking about).

    --
    Your hair look like poop, Bob! - Wanker.
  3. Re:Huh? by headkase · · Score: 4, Informative

    Compression is searching for a minimal representation of information. Along with representation of knowledge you add other things such as learning strategies, inference systems, and planning systems to round-out your AI. One of the best introductions to AI is Artificial Intelligence: A Modern Approach.

    --
    Shh.
  4. Re:Artificial Intelligence? by MoonFog · · Score: 4, Informative
    Shamelessly copied from the wikipedia article on the Hutter Prize:

    The goal of the Hutter Prize is to encourage research in artificial intelligence (AI). The organizers believe that text compression and AI are equivalent problems. Hutter proved that the optimal behavior of a goal seeking agent in an unknown but computable environment is to guess at each step that the environment is controlled by the shortest program consistent with all interaction so far. Unfortunately, there is no general solution because Kolmogorov complexity is not computable. Hutter proved that in the restricted case (called AIXItl) where the environment is restricted to time t and space l, that a solution can be computed in time O(t2l), which is still intractable. Thus, AI remains an art.

    The organizers further believe that compressing natural language text is a hard AI problem, equivalent to passing the Turing test. Thus, progress toward one goal represents progress toward the other. They argue that predicting which characters are most likely to occur next in a text sequence requires vast real-world knowledge. A text compressor must solve the same problem in order to assign the shortest codes to the most likely text sequences.
  5. Re:ai threshold? by Baldrson · · Score: 4, Informative
    non-connectionist previous attempts (the stuff that came from the functionalists) has come up pretty short - and will continue to do so even if scaled up massively.

    paq8hp12 uses a neural network, ie: it has a connectionist component.

  6. Huffman Example by headkase · · Score: 4, Informative

    See: Explanation. Basically the smallest unit of information in a computer is a bit. Eight bits make a byte and with text it takes one byte to represent one character. Generally, with Huffman coding you count the frequency of characters in a file and sort the frequency from largest to smallest. Then instead of using the full eight bits to represent a character you build a binary tree from the frequency table. Each possible branching code or going "left" or "right" down the branches is associated with a particular sequence of bits. You give the most frequent characters the shortest sequence of bits which "tokenizes" the information to be compressed. Reversing the process you run through the bit stream converting tokens back into a stream of characters.

    --
    Shh.
  7. Re:Program size is 1.02 MB! by Baldrson · · Score: 4, Informative

    Actually, the size of the program (decompressor) binary is 99,696 bytes, and it is the binary size that is included in the prize calculation.

  8. Re:That's cool.. by Kadin2048 · · Score: 5, Informative

    Given that it takes something like ~17 hours (based on my rough calculations using the figures on WP) to compress 100MB of data using this algorithm on a reasonably fast computer ... I don't think you'd really want to use it for browsing from CD. No decompression figure is given but I don't see any reason why it would be asymmetric. (Although if there's some reason why it would be dramatically asymmetric, it'd be great if someone would fill me in.)

    Mobile use is right out too, at least with current-generation equipment.

    Looking at the numbers this looks like it's about on target for the usual resources/space tradeoff. It's a bit smaller than other algorithms, but much, much more resource intensive. It's almost as if there's an asymptotic curve as you approach the absolute-minimum theoretical compression ratio, where resources just climb ridiculously.

    Maybe the next big challenge should be for someone to achieve compression in a very resource-efficient way; a prize for coming in with a new compressor/decompressor that's significantly beneath the current resource/compression curve...

    --
    "Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
  9. Re:That's cool.. by Anonymous Coward · · Score: 5, Informative

    No decompression figure is given but I don't see any reason why it would be asymmetric. (Although if there's some reason why it would be dramatically asymmetric, it'd be great if someone would fill me in.)
    When compressing a file the program has to figure out the best way to represent the data in compressed form before it actually compresses it, when decompressing all it has to do is put it back together according to the method the program previously picked.

    This isn't true of all compression techniques, but it's true for many of them, especially advanced techniques, i.e. to compress a short video into MPEG4 can take hours, but most computers don't have a lot of trouble decompressing them in real time.
  10. which only goes to show... by nanosquid · · Score: 4, Informative

    If you look at the description of PAQ, you'll see that it doesn't attempt to understand the text; it's just a grab-bag of other compression techniques mixed together. While that is nice for compression, it doesn't really advance the state of the art in AI.

  11. Re:That's cool.. by imroy · · Score: 5, Informative

    Probably not the best example. MPEG4 encoding takes so much time because it's not classical compression, the encoder has to figure out which pieces are less psychorelevant to big picture, and throw them away.

    No, the most time-consuming part of most video encoders (including h.263 and h.264) is finding how the blocks have moved - searching for good matches between one frame and another. For best results, h.264 allows for the matches to not only come from the last frame, but up to the last 16! That allows for h.264 to handle flickering content much better, or situations where something is quickly covered and uncovered again e.g a person or car moving across frame, briefly covering parts of the background. Previous codecs did not handle those situations well and had to waste bandwidth redrawing blocks that were on screen just a moment prior.

    The point does remain, most "compression" involves some sort of searching which is not performed when decompressing.

  12. PAQ8, Hutter Prize branch, version 12 by tepples · · Score: 4, Informative

    As far as I can tell given this Wikipedia article, "paq8hp12" means PAQ8, Hutter Prize branch, version 12.