Text Compressor 1% Away From AI Threshold

Posted by kdawson on Monday July 9, 2007 @06:10PM from the second-hutter-prize dept.

Baldrson writes "Alexander Ratushnyak compressed the first 100,000,000 bytes of Wikipedia to a record-small 16,481,655 bytes (including the decompression program), thereby not only winning the second payout of The Hutter Prize for Compression of Human Knowledge, but also bringing text compression within 1% of the threshold for artificial intelligence. At 1.319 bits per character, the result makes the next winner of the Hutter Prize likely to reach the threshold of human performance (between 0.6 and 1.3 bits per character) estimated by the founder of information theory, Claude Shannon, and confirmed by Cover and King in 1978 using text prediction gambling. When the Hutter Prize started, less than a year ago, the best performance was 1.466 bits per character. Alexander Ratushnyak's open-sourced GPL program is called paq8hp12 [rar file]."
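
The bits-per-character figure follows directly from the two byte counts in the summary, if each input byte is treated as one character. A quick check in Python (the constant and function names are illustrative assumptions, not anything defined by the prize):

```python
# Figures quoted in the summary above.
ORIGINAL_BYTES = 100_000_000     # first 10^8 bytes of Wikipedia
COMPRESSED_BYTES = 16_481_655    # archive size, decompression program included


def bits_per_character(compressed_bytes: int, original_chars: int) -> float:
    """Compressed bits spent per input character (one byte = one character)."""
    return compressed_bytes * 8 / original_chars


bpc = bits_per_character(COMPRESSED_BYTES, ORIGINAL_BYTES)
ratio = COMPRESSED_BYTES / ORIGINAL_BYTES

print(f"{bpc:.3f} bits per character")       # 1.319 bits per character
print(f"{ratio:.2%} of the original size")   # 16.48% of the original size
```

The same function applied to any other entry gives a number that can be compared against the 1.466 starting point or the 0.6 to 1.3 range quoted above.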

4 of 442 comments shown

Re:interesting program name by OverlordQ · 2007-07-09 18:26 · Score: 5, Informative

Since I know people are going to be asking about the name, might I suggest the wiki article about PAQ compression for the reasons behind the weird naming scheme.

--
Your hair look like poop, Bob! - Wanker.
Re:That's cool.. by Kadin2048 · 2007-07-09 18:58 · Score: 5, Informative

Given that it takes roughly 17 hours (based on my rough calculations using the figures on WP) to compress 100MB of data using this algorithm on a reasonably fast computer ... I don't think you'd really want to use it for browsing from CD. No decompression figure is given but I don't see any reason why it would be asymmetric. (Although if there's some reason why it would be dramatically asymmetric, it'd be great if someone would fill me in.)

Mobile use is right out too, at least with current-generation equipment.

Looking at the numbers, this seems about on target for the usual resources/space tradeoff. It's a bit smaller than other algorithms, but much, much more resource intensive. It's almost as if there's an asymptotic curve as you approach the absolute-minimum theoretical compression ratio, where resources just climb ridiculously.

Maybe the next big challenge should be for someone to achieve compression in a very resource-efficient way; a prize for coming in with a new compressor/decompressor that's significantly beneath the current resource/compression curve...

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
Re:That's cool.. by Anonymous Coward · 2007-07-09 19:29 · Score: 5, Informative

"No decompression figure is given but I don't see any reason why it would be asymmetric. (Although if there's some reason why it would be dramatically asymmetric, it'd be great if someone would fill me in.)"
When compressing a file, the program has to figure out the best way to represent the data in compressed form before it actually compresses it. When decompressing, all it has to do is put the data back together according to the method the program previously picked.

This isn't true of all compression techniques, but it's true for many of them, especially advanced techniques. For example, compressing a short video into MPEG4 can take hours, but most computers don't have a lot of trouble decompressing it in real time.
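
The encode/decode asymmetry described above can be sketched with a toy LZ77-style coder. This is only an illustration of search-based compression in general; it does not model paq8hp12, and the names `lz77_encode` and `lz77_decode`, the window size, and the minimum match length are assumptions chosen for the example. The encoder counts every candidate position it tries, while the decoder only copies.

```python
def lz77_encode(data: bytes, window: int = 255, max_len: int = 15):
    """Greedy LZ77-style encoder; returns (tokens, candidate positions tried)."""
    tokens, tries, i = [], 0, 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Search the recent window for the longest match. The decoder never
        # repeats this search; it just follows the recorded offsets.
        for j in range(max(0, i - window), i):
            tries += 1
            k = 0
            while k < max_len and i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= 3:
            tokens.append((best_off, best_len))  # back-reference
            i += best_len
        else:
            tokens.append(data[i])               # literal byte
            i += 1
    return tokens, tries


def lz77_decode(tokens) -> bytes:
    """Rebuild the data: one append per output byte, no searching."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            offset, length = token
            for _ in range(length):  # byte-wise copy handles overlapping matches
                out.append(out[-offset])
        else:
            out.append(token)
    return bytes(out)


sample = b"the quick brown fox jumps over the lazy dog. " * 20
tokens, tries = lz77_encode(sample)
restored = lz77_decode(tokens)
print(len(sample), "bytes in,", len(tokens), "tokens out,", tries, "positions tried")
```

The decoder does work proportional to the output size, while the encoder's work grows with the window it searches, which is where the extra time goes.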
Re:That's cool.. by imroy · 2007-07-09 21:29 · Score: 5, Informative

"Probably not the best example. MPEG4 encoding takes so much time because it's not classical compression, the encoder has to figure out which pieces are less psychorelevant to big picture, and throw them away."

No, the most time-consuming part of most video encoders (including h.263 and h.264) is finding how the blocks have moved - searching for good matches between one frame and another. For best results, h.264 allows for the matches to come not only from the last frame, but from up to the last 16! That allows h.264 to handle flickering content much better, or situations where something is quickly covered and uncovered again, e.g. a person or car moving across the frame, briefly covering parts of the background. Previous codecs did not handle those situations well and had to waste bandwidth redrawing blocks that were on screen just a moment prior.

The point does remain: most "compression" involves some sort of searching which is not performed when decompressing.
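
The block search described above can be sketched as an exhaustive match over a small window. This is a minimal illustration, not any real encoder's algorithm: the function names, the 4x4 block size, the search radius, and the synthetic frames are all assumptions made for the example, and production encoders use much faster search strategies. Searching several reference frames would repeat the same loop once per frame.

```python
def sad(cur, cx, cy, ref, rx, ry, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(
        abs(cur[cy + r][cx + c] - ref[ry + r][rx + c])
        for r in range(size)
        for c in range(size)
    )


def find_motion(ref, cur, x, y, size=4, radius=2):
    """Encoder side: try every offset in the window, keep the best match."""
    height, width = len(ref), len(ref[0])
    best, evaluated = None, 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            rx, ry = x + dx, y + dy
            if 0 <= rx <= width - size and 0 <= ry <= height - size:
                evaluated += 1
                cost = sad(cur, x, y, ref, rx, ry, size)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best[1], best[2], evaluated


def copy_block(ref, x, y, dx, dy, size=4):
    """Decoder side: no search, just copy the block the vector points at."""
    return [row[x + dx : x + dx + size] for row in ref[y + dy : y + dy + size]]


# Synthetic 12x12 frames: the current frame is the reference frame with its
# content moved 2 pixels right and 1 pixel down (edges filled with 0).
ref = [[r * 16 + c for c in range(12)] for r in range(12)]
cur = [[ref[r - 1][c - 2] if r >= 1 and c >= 2 else 0 for c in range(12)]
       for r in range(12)]

dx, dy, evaluated = find_motion(ref, cur, x=4, y=4)
predicted = copy_block(ref, 4, 4, dx, dy)
print("motion vector:", (dx, dy), "after", evaluated, "block comparisons")
```

The encoder compares the block against every candidate position in the window, while the decoder performs a single copy per block using the stored vector.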