Compress Wikipedia and Win AI Prize

Posted by ryuzaki0 on Sunday August 13, 2006 @10:50AM from the what-does-this-mean dept.

Baldrson writes "If you think you can compress a 100M sample of Wikipedia better than paq8f, then you might want to try winning some of a (at present) 50,000 Euro purse. Marcus Hutter has announced the Hutter Prize for Lossless Compression of Human Knowledge the intent of which is to incentivize the advancement of AI through the exploitation of Hutter's theory of optimal universal artificial intelligence. The basic theory, for which Hutter provides a proof, is that after any set of observations the optimal move by an AI is to find the smallest program that predicts those observations and then assume its environment is controlled by that program. Think of it as Ockham's Razor on steroids. Matt Mahoney provides a writeup of the rationale for the prize including a description of the equivalence of compression and general intelligence."
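The link between prediction and compression mentioned in the summary can be made concrete with a small sketch. It assumes a simple adaptive character model (not paq8f or anything from the prize itself) and adds up the bits an ideal arithmetic coder would spend, which is -log2 of the probability the model gave each character before seeing it. The function name, the smoothing, and the sample text are all illustrative assumptions.

```python
import math
from collections import defaultdict

def ideal_code_length_bits(text: str, order: int = 2, alphabet_size: int = 256) -> float:
    """Bits an ideal arithmetic coder would need when driven by an
    adaptive order-`order` character model with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count
    totals = defaultdict(int)                       # context -> symbols seen
    bits = 0.0
    for i, ch in enumerate(text):
        ctx = text[max(0, i - order):i]
        # Probability is taken from what was seen BEFORE this character,
        # so a decoder running the same model could reproduce it.
        p = (counts[ctx][ch] + 1) / (totals[ctx] + alphabet_size)
        bits += -math.log2(p)
        counts[ctx][ch] += 1
        totals[ctx] += 1
    return bits

sample = "the cat sat on the mat. " * 200   # hypothetical, highly predictable text
raw_bits = 8 * len(sample)
model_bits = ideal_code_length_bits(sample)
print(f"raw: {raw_bits} bits, predicted-code length: {model_bits:.0f} bits")
```

The better the model predicts the next character, the fewer bits are charged, which is the sense in which a stronger predictor is a stronger compressor.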

46 of 324 comments

WikiPedia on iPod! by network23 · 2006-08-13 10:51 · Score: 2, Interesting

I'd love to be able to have the whole WikiPedia available on my iPod (or cell phone), but without destroying
info.edu.org - Speedy information and news from the Top 10 educational organisations.
1. Re:WikiPedia on iPod! by Fred+Porry · 2006-08-13 11:08 · Score: 3, Insightful
  
  Then it would be an encyclopedia, not a Wiki; that's another reason why I say: forget about it. Would be nice though. ;)
2. Re:WikiPedia on iPod! by CastrTroy · 2006-08-13 11:25 · Score: 3, Insightful
  
  Well, since it's currently only 1 Gig, you could probably put it on a flash card and read it from a handheld. It wouldn't be an iPod, but it probably wouldn't require destroying a perfectly good piece of equipment either. You could probably even get weekly updates (hopefully in a diff file) to make sure your copy is in sync with the rest of the internet. Now that I think about it, this would be a really good application. There are lots of times when I'd like to look something up on Wikipedia but am not connected to the internet.
  
  --
  
  Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
3. Re:WikiPedia on iPod! by Asztal_ · 2006-08-13 12:16 · Score: 4, Funny
  
  Umm... which of the 5 thousand links is the article?
But captain by Anonymous Coward · 2006-08-13 10:52 · Score: 5, Funny

Marcus Hutter has announced the Hutter Prize for Lossless Compression of Human Knowledge the intent of which is to incentivize the advancement of AI through the exploitation of Hutter's theory of optimal universal artificial intelligence.

But captain, if we reverse the tachyon inverter drives then we will have insufficient dilithium crystals to traverse the neutrino warp.
1. Re:But captain by Anonymous Coward · 2006-08-13 11:09 · Score: 5, Funny
  
  You left out the part involving the deflector shield. Remember, the first rule of Star Trek technobabble is to always involve the deflector in some way.
Painful to read by CuriHP · 2006-08-13 10:54 · Score: 3, Insightful

For the love of god, proofread!

--
If it's not on fire, it's a software problem.
1. Re:Painful to read by Threni · 2006-08-13 11:24 · Score: 2, Insightful
  
  > For the love of god, proofread!
  
  Yeah, I just read the write-up twice and have no idea if this is an AI contest, something to do with compression, or what. In fact, all I can remember now is the word "incentivize" which is the sort of thing I expect some bullshit salesman at work to say.
2. Re:Painful to read by ameline · 2006-08-13 14:49 · Score: 2, Insightful
  
  Agreed -- why can they not MOTIVATE us instead?
  
  No, they need to verbize another noun when there was a perfectly good word in the language that means *exactly* what they want. feh.
  
  --
  Ian Ameline
3. Re:Painful to read by PeeAitchPee · 2006-08-13 21:10 · Score: 4, Funny
  
  He did, but Slashdot's AI compressed it for him.
  
  :-D
lossy compression by RenoRelife · 2006-08-13 10:55 · Score: 5, Insightful

Using the same data lossy compressed, with an algorithm that was able to permute data in a similar way to the human mind, seems like it would come closer to real intelligence than the lossless compression would
1. Re:lossy compression by Anonymous Coward · 2006-08-13 11:49 · Score: 3, Insightful
  
  Funny? That's the most intelligent and insightful remark I've seen here in months, albeit rather naively stated.
  The human brain is a fuzzy clustering algorithm; that's what neural networks do: they reduce the space of a large data set by eliminating redundancy and mapping only its salient features onto a smaller data set, which in bio systems is the weights for triggering sodium/potassium spikes at a given junction. If such a thing existed, a neural compression algorithm would be capable of immense data reduction. The downside is that, like us humans, it may be unreliable/non-deterministic in retrieving data because of this fuzziness. It would also be able to make "sideways" associations and draw inferences from the data set, which in essence would be a weak form of artificial intelligence. Now give him his +5 insightful you silly people.
2. Re:lossy compression by Vo0k · 2006-08-13 11:54 · Score: 3, Insightful
  
  that's one piece, but not necessarily - "lossy" nature of human mind compression can be overcome by "additional checks".
  
  Lossy relational/dictionary based compression is the base. You hardly ever remember text by order of letters or sound of voice reading it. You remember the meaning of a sentence, plus optionally some rough pattern (like voice rhythm) to reproduce the exact text from rewording the meaning. So you understand meaning of sentence, store it as relation to known meanings (pointers to other entries!) then when recalling, you put it back in words, and for exact citation you try to match possible wordings against remembered pattern.
  
  So imagine this: the compressor analyzes a sentence lexically, spots regularities and irregularities, and transforms the sentence into a relational set of tokens containing the meaning of the sentence. These tokens are small and easy to store and unambiguously describe the meaning of the sentence, but don't contain the exact wording. Then an extra checksum of the text of the sentence is added.
  
  Decompressor tries to build a sentence that reflects given idea according to rules of grammar, picking different synonyms, word ordering and such, and then brute-forcing or otherwise matching against the checksum to find which of the created sentences matches exactly.
  
  Look, the best compressor in the world:
  sha1sum /boot/vmlinuz
  647fb0def3809a37f001d26abe58e7f900718c46 /boot/vmlinuz
  
  Linux kernel compressed to set: { string: "647fb0def3809a37f001d26abe58e7f900718c46", info: "it's a Linux kernel for i386" }
  
  You just need to re-create a file that matches the md5sum and still follows the rules of a Linux kernel. It is extremely unlikely that any other file can both be recognized as some kind of Linux kernel and match the hash. Of course there are countless blocks of data that still match, but very few will follow the ruleset of "ELF kernel executable" structure which can be deduced numerically. So theoretically you could use the hash to rebuild THE kernel just by brute-force creating random files and checking whether they match both the hash and "general properties of a kernel".
  
  The problem obviously lies in unrealistic "brute force" part. The subset of possible rebuilds of the data must be heavily limited. You can do this by lossy compression that allows for limited "informed guess" results - ones that make sense in context of a linux kernel - style, compiler optimizations, use of macros transformed into repeatable binary code. And have the original analysed using the same methods before compression, storing all inconsistencies with the model separately.
  
  So the compression file would consist of:
  - a set of logical tokens describing meaning of given piece of data (in relation to the rest of "knowledge base")
  - a set of exceptions (where logic fails)
  - a checksum or other pattern allowing to verify an exact match.
  
  Most lossy compression schemes are meant to hide the lost data. If you use one that instead allows the lost data to be rebuilt according to a limited ruleset ("informed guess" + verification), you'd get lossless compression of comparable efficiency.
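A toy sketch of the "informed guess + verification" scheme described above, under heavy assumptions: the "knowledge base" is a hypothetical three-entry synonym table, the meaning tokens are just synonym-class names, and SHA-1 stands in for the checksum. It is only meant to show the shape of the idea, not a practical compressor.

```python
import hashlib
from itertools import product

# Hypothetical toy "knowledge base": meaning token -> wordings the model allows.
SYNONYMS = {
    "BIG": ["big", "large", "huge"],
    "DOG": ["dog", "hound"],
    "RUN": ["runs", "sprints", "dashes"],
}
WORD_TO_TOKEN = {word: token for token, words in SYNONYMS.items() for word in words}

def checksum(text: str) -> str:
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def compress(sentence: str):
    """Keep only the meaning tokens, plus a checksum of the exact wording."""
    tokens = [WORD_TO_TOKEN.get(word, word) for word in sentence.split()]
    return tokens, checksum(sentence)

def decompress(tokens, digest):
    """Enumerate every wording the model allows and return the one
    whose checksum matches; None means an 'exception' entry is needed."""
    choices = [SYNONYMS.get(token, [token]) for token in tokens]
    for words in product(*choices):
        candidate = " ".join(words)
        if checksum(candidate) == digest:
            return candidate
    return None

tokens, digest = compress("the large hound sprints home")
print(tokens)
print(decompress(tokens, digest))
```

Here the decompressor only has 3 x 2 x 3 = 18 candidates to try. The search space grows exponentially with sentence length, and the checksum can only single out the original while the number of model-consistent candidates stays far below the number of possible checksum values.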
  
  --
  Anagram("United States of America") == "Dine out, taste a Mac, fries"
3. Re:lossy compression by swillden · 2006-08-13 15:14 · Score: 4, Insightful
  
  You just need to re-create a file that matches the md5sum and still follows the rules of a Linux kernel. It is extremely unlikely that any other file can both be recognized as some kind of Linux kernel and match the hash. Of course there are countless blocks of data that still match, but very few will follow the ruleset of "ELF kernel executable" structure which can be deduced numerically.
  Mmmm, no. You were fine up until you said "very few will follow the ruleset". That's not true. To see that it's not true, take your kernel, which consists of around 10 million bits. Now find, say, 512 of those bits that can be changed, independently, while still producing a valid-looking kernel executable. The result doesn't even have to be a valid, runnable kernel, but it wouldn't be too hard to do it even with that restriction.
  So you now have 2^512 variants of the Linux kernel, all of which look like a valid kernel. But there are only 2^128 possible hashes, so, on average, there will be 2^384 kernels for each hash value, and the odds are very, very good that your "real" kernel's hash is also matched by at least one of them. If by some chance it isn't, I can always generate a whole bunch more kernel variants. How about 2^2^10 of them?
  A hash plus a filter ruleset does not constitute a lossless compression of a large file, even if computation power/time is unbounded.
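The counting argument can be checked at toy scale. The sketch below assumes a 12-bit hash (the first 12 bits of SHA-1) in place of a 128-bit one and 18 free bits in place of 512; the "KERNEL" header and the helper names are made up for illustration. The same pigeonhole reasoning applies whatever the real digest width is.

```python
import hashlib

HASH_BITS = 12   # toy hash width, standing in for 128
FREE_BITS = 18   # independently changeable bits, standing in for 512

def tiny_hash(data: bytes) -> int:
    """First HASH_BITS bits of the SHA-1 digest, as an integer."""
    digest = hashlib.sha1(data).digest()
    return int.from_bytes(digest[:2], "big") >> (16 - HASH_BITS)

def variant(n: int) -> bytes:
    """A 'valid-looking file': fixed header plus FREE_BITS of free payload."""
    return b"KERNEL" + n.to_bytes(3, "big")

ORIGINAL = 12345                      # arbitrary choice of the "real" file
target = tiny_hash(variant(ORIGINAL))

# Brute-force every valid-looking variant and keep those matching the hash.
matches = [n for n in range(2 ** FREE_BITS) if tiny_hash(variant(n)) == target]

print(f"{len(matches)} of {2 ** FREE_BITS} variants share the original's hash")
print(f"expected on average: about {2 ** (FREE_BITS - HASH_BITS)}")

# The same division at full scale: variants per hash value.
print(2 ** 512 // 2 ** 128 == 2 ** 384)
```

With many more valid candidates than hash values, the brute-force search returns a crowd of files rather than the one original, so the hash alone cannot identify it.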
  
  --
  Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
As long as it is Wiki that we are talking about... by gatkinso · 2006-08-13 10:56 · Score: 3, Funny

There. All of wiki, in 31 bytes.

--
I am very small, utmostly microscopic.
Who'da thunk... by blueadept1 · 2006-08-13 10:59 · Score: 5, Funny

Man, WinRar is taking its bloody time. But oh god, when it's done, I'll be rich!
Lossy Compression? by Millenniumman · 2006-08-13 11:06 · Score: 3, Funny

Convert it to AOL! tis wikpedia, teh fri enpedia . teh bst in da wrld.

--
Stupidity is like nuclear power, it can be used for good or evil. And you don't want to get any on you.
1. Re:Lossy Compression? by blueadept1 · 2006-08-13 11:20 · Score: 2, Insightful
  
  That is actually an interesting idea. What if you added a layer of compression that converted every possible common acronym, made contractions, etc...
2. Re:Lossy Compression? by larry+bagina · 2006-08-13 11:28 · Score: 2, Interesting
  
  1) it wouldn't be lossless and 2) most compression techniques use a dictionary of commonly used words.
  
  --
  Do you even lift?
  These aren't the 'roids you're looking for.
Comparison by ronkronk · 2006-08-13 11:08 · Score: 2, Informative

There are some amazing compression programs out there; the trouble is they tend to take a while and consume lots of memory. PAQ gives some impressive results, but the latest benchmark figures are regularly improving. Let's not forget that compression is no good unless it is integrated into a usable tool. 7-zip seems to be the new archiver on the block at the moment. A closely related, but different, set of tools are the archivers, of which there are lots, with many older formats still not supported by open source tools.
1. Re:Comparison by joshier · 2006-08-13 12:55 · Score: 3, Funny
  
  Well, if I knew that 15 years ago, I would indeed have been a genuis, sadly I realized too late and my genuis talents are wasted yet again.
  
  Have no fear though, I'm working on a new one.
It's a big world out there by Harmonious+Botch · 2006-08-13 11:09 · Score: 4, Interesting

"The basic theory...is that after any set of observations the optimal move by an AI is find the smallest program that predicts those observations and then assume its environment is controlled by that program." In a finite discrete environment ( like Shurdlu: put the red cylinder on top of the blue box ) that may be possible. But in the real world the problem is knowing that one's observations are all - or even a significant percentage - of the possible observations.
This - in humans, at least - can lead to the cyclic reinforcement of one's belief system. The belief system that explains observations initially is used to filter observations later.

TFA is a neat idea theoreretically, but it's progeny will never be able to leave the lab.

--
I figured out how to get a second 120-byte sig! Mod me up and I'll tell you how you can have one too.
1. Re:It's a big world out there by gardyloo · 2006-08-13 11:54 · Score: 4, Funny
  
  TFA is a neat idea theoreretically, but it's progeny will never be able to leave the lab.
  
  Your use of "TFA" is a good compressional technique, but you could change "it's" to "its" and actually GAIN in meaning while losing a character! You're well on your way...
2. Re:It's a big world out there by DrJimbo · 2006-08-13 12:13 · Score: 4, Informative
  
  Harmonious Botch said:
  This - in humans, at least - can lead to the cyclic reinforcement of one's belief system. The belief system that explains observations initially is used to filter observations later.
  I encourage you to read E. T. Jaynes' book: Probability Theory: The Logic of Science. It used to be available on the Web in pdf form before a published version became available.
  
  In it, Jaynes shows that an optimal decision maker shares this same tendency of reinforcing existing belief systems. He even gives examples where new information reinforces the beliefs of optimal observers who have reached opposite conclusions (due to differing initial sets of data). Each observer believes the new data further supports their own view.
  
  Since even an optimal decision maker has this undesirable trait, I don't think the existence of this trait is a good criterion for rejecting decision making models.
  
  --
  We don't see the world as it is, we see it as we are.
  -- Anais Nin
3. Re:It's a big world out there by Ignis+Flatus · 2006-08-13 12:22 · Score: 2, Insightful
  
  I think the original premise is wrong. Real world intelligence is not lossless. The algorithms only have to be right most of the time to be effective. And our intelligence is incredibly redundant. If you want robust AI, you're going to have to accept redundancy and imperfection. Same goes for data transmission. Sure, you compress, but then you also add in error-correcting codes with a level of redundancy based on the known reliability of the network.
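The "compress, then add redundancy back" pipeline can be sketched as follows. This assumes the crudest possible error-correcting code, a three-fold repetition code with a per-bit majority vote, chosen only for brevity; real links use far more efficient codes, and the helper names here are hypothetical.

```python
import zlib

def add_redundancy(data: bytes, copies: int = 3) -> bytes:
    """Repetition code: send every byte `copies` times."""
    return bytes(b for b in data for _ in range(copies))

def remove_redundancy(coded: bytes, copies: int = 3) -> bytes:
    """Decode by majority vote on each bit across the copies of a byte."""
    out = bytearray()
    for i in range(0, len(coded), copies):
        group = coded[i:i + copies]
        byte = 0
        for bit in range(8):
            ones = sum((g >> bit) & 1 for g in group)
            if ones * 2 > len(group):
                byte |= 1 << bit
        out.append(byte)
    return bytes(out)

message = b"compress first, then protect the result " * 20
packed = zlib.compress(message)            # remove redundancy we don't need
sent = bytearray(add_redundancy(packed))   # add redundancy we do need
sent[5] ^= 0xFF                            # simulate noise: corrupt one byte
received = remove_redundancy(bytes(sent))
print(zlib.decompress(received) == message)
```

The number of copies plays the role of the "level of redundancy": a noisier channel would call for more copies, or for a stronger code.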
4. Re:It's a big world out there by Baldrson · 2006-08-13 12:49 · Score: 2, Informative
  
  In it, Jaynes shows that an optimal decision maker shares this same tendency of reinforcing existing belief systems. He even gives examples where new information reinforces the beliefs of optimal observers who have reached opposite conclusions (due to differing initial sets of data). Each observer believes the new data further supports their own view.
  I think what Hutter has shown is that there is a solution which unifies the new data with the old within a new optimum, which is most likely unique. I think it is based on the idea that Kolmogorov complexity is a unique value for any string and is most likely represented by a single optimum program (the "self-extracting archive" of the string).
  
  --
  Seastead this.
5. Re:It's a big world out there by kognate · 2006-08-13 13:45 · Score: 3, Interesting
  
  Yeah, but you can use Turbo codes to achieve near Shannon limit, and you don't have to worry too much about the addition of the ECC. Remember kids: study that math, you never know when information theory can suddenly pay off.
  
  Just to help (and so you don't think I made Turbo Codes up -- it sounds like I did 'cause it's such a bad name)
  http://en.wikipedia.org/wiki/Turbo_code
Re:for those who rtfa by kfg · 2006-08-13 11:15 · Score: 2, Informative

a) how big the compressed size was

18MB

b) how many bytes was wikipedia before it was compressed

A sample of 100MB

Your goal:
.

KFG
Er, I'm not so sure about this. by aiken_d · 2006-08-13 11:15 · Score: 3, Interesting

Given that the hypothesis is valid (which is arguable), it seems to me that compressing wikipedia is a fairly useless way of supporting it. It seems like an abstraction error: Wikipedia is *not* a set of rules that predict the observations in it. It's a list of observations, sure, but there's no ruleset involved. Now, someone/thing who can read and parse language can get educated based on the knowledge in wikipedia, but then the intelligence is providing the ruleset, just training itself with the raw data in wiki.

It really seems like one of those mistaking-the-map-for-the-territory errors.

-b

--
If I wanted a sig I would have filled in that stupid box.
Re:Can it be "lossy" compression? by richdun · 2006-08-13 11:15 · Score: 4, Funny

Hmmm...well in that case, someone go edit the Wikipedia entry on "computers" and allow them to store data at the bit level. Also, I heard somewhere that computers in Africa have tripled in the past six months!
Re:Can it be "lossy" compression? by Bill+Kilgore · 2006-08-13 11:22 · Score: 5, Funny

I have a program that compresses 100M of Wikipedia to one bit with no loss at all. The program is somewhat special-purpose, and at 100,024,076 bytes, a little chunkier than I'd like.

--
Rediculous: A word indicating the writer is ridiculously ignorant.
Solution. by Funkcikle · 2006-08-13 11:34 · Score: 5, Funny

Removing all the incorrect and inaccurate data from the Wikipedia sample should "compress" it down to at least 20mb.

Then just apply your personal favourite compression utility.

I like lharc, which according to Wikipedia was invented in 1904 as a result of bombarding President Lincoln, who plays Commander Tucker in Star Trek: Enterprise with neutrinos.
Incentivize? by noidentity · 2006-08-13 11:47 · Score: 5, Funny

the intent of which is to incentivize the advancement of AI

Sorry, anything which uses the word "incentivize" does not involve intelligence, natural or artificial.
Wrong contest by Baldrson · 2006-08-13 11:55 · Score: 3, Informative

That's another contest that is useless for the reason you cite.
The contest for the Hutter Prize requires the compressed corpus to be a self-extracting archive -- or failing that to add the size of the compressor to the compressed corpus.
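A minimal sketch of that accounting rule, assuming sizes are simply added in bytes; the function name and the exact rounding of a one-bit archive up to one byte are illustrative assumptions, not the official scoring code. The program size is the 100,024,076 bytes quoted in the one-bit joke earlier in the thread.

```python
CORPUS_BYTES = 10 ** 8   # size of the test file

def contest_size(archive_bytes: int, decompressor_bytes: int = 0) -> int:
    """Size counted against an entry: the archive plus whatever is needed
    to unpack it. For a self-extracting archive, pass its full size as
    archive_bytes and leave decompressor_bytes at 0."""
    return archive_bytes + decompressor_bytes

# The "one bit" entry: a 1-bit archive (rounded up to 1 byte here)
# plus a 100,024,076-byte special-purpose program.
one_bit_entry = contest_size(archive_bytes=1, decompressor_bytes=100_024_076)
print(one_bit_entry, one_bit_entry > CORPUS_BYTES)
```

Counted this way, hiding the corpus inside the decompressor gains nothing, because the entry ends up larger than the uncompressed file.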

--
Seastead this.
I'll try: by dcapel · 2006-08-13 11:59 · Score: 5, Funny

printf '#!/bin/sh\nwget en.wikipedia.org/enwiki/\n' > archive

Mine wins as it is roughly 40 bytes total. To get your results, you simply need to run the self-extracting archive, and wait. Be warned, it will take a while, but that is the cost of such a great compression scheme!

--
DYWYPI?
1. Re:I'll try: by MarkRose · 2006-08-13 12:37 · Score: 3, Funny
  
  echo '#!/bin/cat /dev/tty0' > archive
  
  Here's one that's even shorter, but you have to type in the decryption key exactly right.
  
  --
  Be relentless!
Is lossless really best by Anonymous Coward · 2006-08-13 12:10 · Score: 2, Interesting

I would argue that lossless compression really is not the best measure of intelligence. Humans are inherently lossy in nature. Everything we see, hear, feel, smell, and taste is pared down to its essentials when we understand it. It is this process of discarding irrelevant details and making generalizations that is truly intelligence. If our minds had lossless compression we could regurgitate textbooks, but never be able to apply the knowledge contained within. If we really understand, we could reproduce what we've read, but not verbatim. A better measure of intelligence would be lossy text compression that still retains the knowledge contained within the corpus.
C++ by The+Bungi · 2006-08-13 12:32 · Score: 2, Funny

Interestingly enough, the source code for the compressor is C++. One would expect the thing to be written in pure C.
A (good) sign of the times, I guess.
Re:Can it be "lossy" compression? by KiloByte · 2006-08-13 12:37 · Score: 2, Informative

Why so? The test file is exactly 10^8 bytes.
I downloaded the corpus, and indeed, you're right -- it's 10^8 bytes. The article is incorrect, it says 100M where it means 95.3M.

This inconsistency doesn't have any effect on the challenge, though -- that 50kEUR[1] is offered for compressing the given data corpus, not for compressing a string of 100MB.

[1] 1kEUR=1000EUR. 1M EUR=1000000EUR. 1KB=1024B. 1MB=1048576B.
And by the way, what about fixing Slash to finally allow Unicode -- either natively or at least as HTML entities?
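The byte counts above are easy to verify, taking the footnote's definition of 1MB as 1,048,576 bytes:

```python
CORPUS_BYTES = 10 ** 8   # the test file
MB = 1024 ** 2           # 1,048,576 bytes, as defined in the footnote

size_in_mb = CORPUS_BYTES / MB
print(f"{size_in_mb:.2f}")   # 95.37
print(100 * MB)              # bytes in a binary "100MB": 104857600
```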

--
The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
And there was me thinking.. by baz1860 · 2006-08-13 13:46 · Score: 2, Funny

that the entire knowledge of the world could simply be compressed without loss to

yeah, you guessed it..

42...

--
He who would trade liberty for some temporary security, deserves neither liberty nor security
wikicast by VolciMaster · 2006-08-13 14:14 · Score: 3, Funny

a method for periodically re-syncing...

So, we need a WikiCast - remember folks, you heard it here first!

--
antipaucity
Would be useful for images by aliquis · 2006-08-13 14:14 · Score: 4, Funny

... now all we need is a dictionary for nudity and we could save a lot of bandwidth on the Internet!
Barebones Windows or Linux by Baldrson · 2006-08-13 14:24 · Score: 2, Informative

See the detailed rules for specifics but generally the rules are just what you would expect: The program runs (and completes in a reasonable time) on a relatively recent system running Windows (currently XP) or Linux with no external inputs, e.g. no dynamically loaded libraries not included in the submission, no net communication, and no disk I/O that isn't generated by the program itself.
Points are not awarded for attempting to circumvent the intent of the competition. I expect such attempts would result in future submissions from the same source being ignored.

--
Seastead this.
Hutter's Theory - Disproved by giafly · 2006-08-13 14:57 · Score: 3, Insightful

The basic theory, for which Hutter provides a proof, is that after any set of observations the optimal move by an AI is to find the smallest program that predicts those observations and then assume its environment is controlled by that program. Think of it as Ockham's Razor on steroids.
A "Hutter AI" will be at a disadvantage when competing against an opponent which knows it's acting as above and can do the same calculations. Under these circumstances, the opponent will be one step ahead. The Hutter AI is predictable and so can be outmanoeuvered. Hence the Hutter AI's moves are not optimal.

Human poker players address this issue by deliberately introducing slight randomness into their play. I think a "Hutter AI" will make better real-world decisions if it does the same (see Game Theory).

Occam's razor (also spelled Ockham's razor) is a principle attributed to the 14th-century English logician and Franciscan friar William of Ockham. Originally a tenet of the reductionist philosophy of nominalism, it is more often taken today as a heuristic maxim that advises economy, parsimony, or simplicity in scientific theories. Occam's razor states that the explanation of any phenomenon should make as few assumptions as possible - Wikipedia
Occam's razor is also highly suspect. There's the issue of cultural bias when counting assumptions. And all programmers will be aware of how they fixed "the bug" that caused all the problems in an application, only to find there were other bugs that caused identical symptoms.

--
Reduce, reuse, cycle
Compress Wikipedia and win a prize? by Dachannien · 2006-08-13 14:59 · Score: 4, Funny

Can't I just punch the monkey for $20 instead?
Re:Not sure if that's a joke. by Andrew+Kismet · 2006-08-13 17:14 · Score: 3, Funny

Of course he was joking. If he was serious he would've said "verbificate".