Compress Wikipedia and Win AI Prize
Baldrson writes "If you think you can compress a 100M sample of Wikipedia better than paq8f, then you might want to try winning win some of a (at present) 50,000 Euro purse. Marcus Hutter has announced the Hutter Prize for Lossless Compression of Human Knowledge the intent of which is to incentivize the advancement of AI through the exploitation of Hutter's theory of optimal universal artificial intelligence. The basic theory, for which Hutter provides a proof, is that after any set of observations the optimal move by an AI is find the smallest program that predicts those observations and then assume its environment is controlled by that program. Think of it as Ockham's Razor on steroids. Matt Mahoney provides a writeup of the rationale for the prize including a description of the equivalence of compression and general intelligence."
I'd love to be able to have the whole WikiPedia available on my iPod (or cell phone), but without destroying
info.edu.org - Speedy information and news from the Top 10 educational organisations.
Marcus Hutter has announced the Hutter Prize for Lossless Compression of Human Knowledge the intent of which is to incentivize the advancement of AI through the exploitation of Hutter's theory of optimal universal artificial intelligence.
But captain, if we reverse the tachyon inverter drives then we will have insufficient dilithium crystals to traverse the neutrino warp.
For the love of god, proofread!
If it's not on fire, it's a software problem.
Using the same data lossy compressed, with an algorithm that was able to permute data in a similar way to the human mind, seems like it would come closer to real intelligence than the lossless compression would
There. All of wiki, in 31 bytes.
I am very small, utmostly microscopic.
how about telling us:
a) how big the compressed size was
b) how many bytes was wikipedia before it was compressed
It is 1 a.m. here, I just came home, slightly drunk, slightly high, and I just tried to understand this article. It has indeed blown my mind, so to speak.
If lossy compression is OK, I can easily squish all of Wikipedia down to 1 bit.
Man, WinRar is taking its bloody time. But oh god, when its done, I'll be rich!
arj
Non impediti ratione cogitationus.
Convert it to AOL! tis wikpedia, teh fri enpedia . teh bst in da wrld.
Stupidity is like nuclear power, it can be used for good or evil. And you don't want to get any on you.
There are some amazing compression programs out there, trouble is they tend to take a while and consume lots of memory. PAQ gives some impressive results, but the latest benchmark figures are regularly improving. Let's not forget that compression is not good unless it is integrated into a usable tool. 7-zip seems to be the new archiver on the block at the moment. A closely related, but different, set of tools are the archivers, of which there are lots with many older formats still not supported by open source tools
"The basic theory...is that after any set of observations the optimal move by an AI is find the smallest program that predicts those observations and then assume its environment is controlled by that program." In a finite discrete environment ( like Shurdlu: put the red cylinder on top of the blue box ) that may be possible. But in the real world the problem is knowing that one's observations are all - or even a significant percentage - of the possible observations.
This - in humans, at least - can lead to the cyclic reinforcement of one's belief system. The belief system that explains observations initially is used to filter observations later.
TFA is a neat idea theoreretically, but it's progeny will never be able to leave the lab.
--
I figured out how to get a second 120-byte sig! Mod me up and I'll tell you how you can have one too.
Just use lzip! 100% compression on any data, even if it's already been compressed by another utility! It works fantastically, but you may run into trouble if you try uncompressing the data.
One key to the compression of data is to reduce the set of tokens
used such that all sentences can be constructed using the smallest
dictionary. That means not inventing new words and saying "incentivize"
where "motivate" would do.
Seriously, if the Wiki was written using good English and correct spelling
that would be a major step forward to data reduction.
Given that the hypothesis is valid (which is arguable), it seems to me that compressing wikipedia is a fairly useless way of supporting it. It seems like an abstraction error: Wikipedia is *not* a set of rules that predict the observations in it. It's a list of observations, sure, but there's no ruleset involved. Now, someone/thing who can read and parse language can get educated based on the knowledge in wikipedia, but then the intelligence is providing the ruleset, just training itself with the raw data in wiki.
It really seems like one of those mistaking-the-map-for-the-territory errors.
-b
If I wanted a sig I would have filled in that stupid box.
Sure, the common formats such as RAR and bzip2 produce highly compressed files, but they would also be impractical to use in an interactive sense just because the compression and decompression operations are slooooow.
On the other hand, zlib is a commonly used format since it compresses quick, abliet not as well as other formats. Even quicker is lzo, but again it doesn't compress as well.
Of course, you'll need a ~100MB program to extract it...
== Jez ==
Do you miss Firefox? Try Pale Moon.
Removing all the incorrect and inaccurate data from the Wikipedia sample should "compress" it down to at least 20mb.
Then just apply your personal favourite compression utility.
I like lharc, which according to Wikipedia was invented in 1904 as a result of bombarding President Lincoln, who plays Commander Tucker in Star Trek: Enterprise with neutrinos.
I wrote a script right here. It will programatically generate all of Wikipedia. Eventually.
Quidquid latine dictum sit, altum sonatur.
there it goes!
That's easy, all you have to do is run your program on a computer that users 32-bit bytes. That way you can fit more bits in your bytes, and automatically beat the record by 4 times.
Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Doesn't the dictionary in PAQ8A,B,C,D result in smaller filesizes if you're talking about a 100M+ large file?
------ The best brain training is now totally free : )
I proofread TFA, and there are no errors that I can see. Ok, the grammar is a bit more complex than Dr Suess, but /.ers have to learn to deal with english sooner or later.
Beside, it IS an article on compression...
This is the type of contest we need to see more of.
Sorry, anything which uses the word "incentivize" does not involve intelligence, natural or artificial.
The contest for the Hutter Prize requires the compressed corpus to be a self-extracting archive -- or failing that to add the size of the compressor to the compressed corpus.
Seastead this.
echo "!#/bin/sh\nwget en.wikipedia.org/enwiki/" > archive
Mine wins as it is roughly 40 bytes total.To get your results, you simply need to run the self-extracting archive, and wait. Be warned, it will take a while, but that is the cost of such a great compression scheme!
DYWYPI?
If someone could write a program that simulates 1000 virtual monkeys, all typing on typewriters, eventually they would end up with wikipedia. Or maybe they would end up with the equivalent of wikipedia very quickly.
Register the editry.
Not to be concited, but I thought of that over a decade ago. I labeled it Transformation Theory. The theory essentially says, given an input and a desired output am explicit mapping can be drawn between the two and algorithms (and hence AI) derive from applying reductions to the mapping (eg. compression). I later dubbed it Reductionary Transformation Theory, so as not to confuse it with another meme by the same name.
:T:R:A:N:S:
Save Wikipedia millions, win a free t-shirt. (S&H Extra)
Sign me up!
I would argue that lossless compression really is not the best measure of intelligence. Humans are inherently lossy in nature. Everything we see, hear, fear, smell, and taste is pared down to its essentials when we understand it. It is this process of discarding irrelevant detials and making generalizations that is truly intelligence. If our minds had lossless compression we could regurgitate textbooks, but never be able to apply the knowledge contained within. If we really understand, we could reproduce what we've read, but not verbatim. A better measure of intelligence would be lossy text compression that still retains the knowledge contained within the corpus.
I can get that sample down to 1 byte. The compression algorithm is so complex, however, that the compiled decompressor binary is about 100M.
Sounds quite similar to Solomonoffs' universal prediction.
"If anything can go wrong, it will." - Murphy
Since it's wikipedia, am I allowed to edit the entries before I compress them? ;-)
This guy's the limit!
The premise of this contest shows a remarkable ignorance of compression theory and technology. True, in theory the more knowledge one has about the world, the better one should be able to compress, but only in the most abstract sense. In practice the vast majority of the bits any "image" of the world like a picture or a piece of text are almost wholly unrelated to high-level features of the world, and thus compression algorithms rightly focus on low-level features of the "image" that are almost wholly separate from the features AI researchers care about.
For example, suppose one wishes to compress two consecutive pictures of a chess tournament. Does knowing the rules of chess help much? Sure, one might use knowledge of chess to deduce a few bits (literally) of information about the second image (what move was made), but when trying to compress two 10-megabit images, who cares? Much better to focus on low-level features such as world smoothnesss, lighting variations, motion flow, et cetera.
Similarly for text and speech. How much does understanding the topic of conversation help? Not much, compared with the knowledge of the most recent several words, which is why all good compression (and prediction) algorithms essentially ignore "understanding" and focus on carefully calculating the mixing of probabilities derived from low-frequency observations.
The winner of a Wiki-compression contest is going to be a variation of an ordinary text compression algorithm, and will use techniques that do not in any obvious way translate to "smart" applications, in the same way that the highly successful N-gram models of speech recognition and machine translation do not have any properties one would normally associate with intelligence.
One can build an "intelligent" compressor, but it's the LAST thing any compression researcher is going to do, because they know high-level intelligence doesn't have many bits of information to provide.
Print it out... Then it doesn't use any data space, only physical space
I wonder how much linguistic patterns (word inflictions, synonyms, etc) would help in writing a better compression algorithm. There is a free framework one could use to write sophisticated language patterns to target those found in Wikipedia articles, after all, most Wikipedia articles follow certain conventions (e.g. most start with a ISA clause: A roguelike is a computer game that borrows some of the elements of another computer game). The grammar would definitely take time to build, but the results of such a method might be pretty interesting.
A (good) sign of the times, I guess.
I can compress a significant amount of Wikipedia's information down to two bytes: 0x42 0x53 (BS)
Here's my compression of 100M of Wikipedia: "Always consult more than one source of information" Make check payable to ...
echo I'mright!NoI'mright!You'reanasshole!So'syourmother !! >distilled-wiki.txt
And indeed you can come up with all kinds of rules to help you predict the characters in a stream like Wikipedia, not just linguistic rules, but higher level rules are less likely to become crucial with the 100M contest and may need to wait for the 1G (or complete archive of Wikipedia) contest. I can imagine that certain rules relating to relational similarity (analogy) might come into play at the 100M level, particularly since there are now programs that can perform as well as college bound seniors on the SAT verbal analogies test (a heavily 'g' loaded test).
Seastead this.
# tar cvf wikipedia.tar /wikipediapath
# bzip2 wikipedia.tar
No Sigs!
Prizes have the danger of causing people who could have been focusing their energy in something their true talent is in to pursue and (more often) fail in acheiving what somebody else figured was useful.
I'm not usually a negativist.
That said this could be useful so I dont wish to discourage anyone from pursuing it especially people who'd waste their time otherwise.
Start with: "the" "and" Then remove periods commas semicolons Speling mistkes ok if readable If other words are commonly used those could be removed It would be a first a fill in the blank-as-you-read encyclopedia Sales staff for door to door enclopedias can convince you new edition saves trees
;)
The above text is a demonstration. No punctuation. Left out "and" and "the". Since some entries are hard to read, the fill-in-the-blank interpretation adds to the experience!
If this new compression is added to the existing compression, I'm sure it'll beat everything.
...be a program that creates an "encyclopedia" website and invites humans to contribute to it?
"How to Do Nothing," kids activities, back in print!
42
Stupidity is like nuclear power, it can be used for good or evil. And you don't want to get any on you.
All the articles worth having will be deleted within 4 month's time, anyway.
that the entire knowledge of the world could simply be compressed without loss to
yeah, you guessed it..
42...
He who would trade liberty for some temporary security, deserves neither liberty nor security
For what architecture?
Absolutely! The "right answer" for best *intelligent* compression would only store a *minimal* set of pertinent data points and then would use intellect to flesh out the details on decompression. Although you could end up with http://www.reducedshakespeare.com/, I'll worry that when some AI researcher starts down this path
Minor disagreement with what you said... The details are relevant; they just aren't important enough to store in and of themselves. Sort of like mathematics where you only need to learn the principle (pertinent data) and how to apply it. Then every example is just a specific case (flesh out the details).
It takes real intelligence to know which priciple to apply or that none apply and you're doing real research. Genius is being able to find such a solution where none existed before. The details are relevant but just not important enough to bother storing since they can be reasonably recreated by applying real inteligence. So, real intelligent compression would mean that different people decompressing the same article at different times would *not* get the same text but enough of the pertinent data would be the same that they all come away from reading the article with the same meaning. That would be real intelligent compression and decompression.
Cheers,
Dave
They that can give up essential liberty to obtain a little temporary safety deserve neither safety nor liberty.
Ben
What libraries is this executable allowed to call?
for any given dataset there is a pseudo-random number generator that, with the proper parameters, produce that dataset of bytes in sequence.
The problem is that it is non-trivial to find the proper algorithm and parameters. However, if trial-and-error computation is not a problem, a large library of potential generators can be tried with all possible parameters.
So, we need a WikiCast - remember folks, you heard it here first!
antipaucity
... now all we need is a dictionary for nudity and we could save a lot of bandwidth on the Internet!
The way I see it, there's mainly one good way to compress text (Disclaimer : I do not know if this idea has ever been used before, but if it has i'd be glad to know about it). Let's say you index the values of the most frequently used characters, and that you keep a 'joker' value to insert a custom character after, let's say that it sums up to about 45 characters for these indexed characters. You replace each character by its indexed value, and instead of having it in base 256, you calculate it all in base 45.
If we consider that the unfrequent bytes' weight is unsignificant, we go from 100 MB of text (or HTML, the characters for the tags aren't that many), we get 100/256*45 = 17.5 MB. I guess that one top of that you could compress it using some traditional algorithm, although most of the job must have been done with it.
Is there a major flaw in my reasoning or does it sound like a suitable idea?
You just got troll'd!
douginadress
Ace
Points are not awarded for attempting to circumvent the intent of the competition. I expect such attempts would result in future submissions from the same source being ignored.
Seastead this.
I've managed to get wikipedia to fit on a knoppix ISO image. I started off with the CD iso , turned on mysql and apache , installed mediawiki , and put on the iso a wikipedia database file , which I made from the sql dumps that the wikipedia foundation put out. (although now they don't , its xml only apparently)
So basically you boot the disk , start a web browser , go to http://localhost/wikipedia/index.php , and you can start searching for articles.
That , plus all the images I could fit , is about 4.1G. I've got a first version done , I'm just trying to get the thing uploaded somewhere so people can help with further development. I hope to go to a dual layer dvd , to fit on about 4G more images.
The fact that you have to reboot your machine to use it is a bit of a bummer , hopefully there's some way to run qemu or vmware off the disk , so the disk can boot itself inside a VM.
is anyone interested in this?
What a poorly crafted sentence! Surely that should be "to incentivize to advancize AI and exploitize..."
I hope the prize fund becomes very large and someone like you comes up with an algorithm and gets rich enough to retire.
Seastead this.
two digits....42. Bring on the prize!
Hutter's theory is about an active AI agent seeking to maximize some utility function rather than passive prediction. However, it is probably the case that prizes for human knowledge compression should have been in place a long time ago -- probably as long ago as William of Ockham. IMHO guys like Solomonov provided sufficient formal rigor to justify massive government funding of a prize competition for compression of human knowledge decades ago, but it seems to be the case that people who control money generally just can't bring themselves to fund prize competitions for objective criteria. This generally doesn't speak well of people who control money. The exceptions are highly notable.
Seastead this.
The fact that there are local optimal in utility functions in which you can get stuck is a problem for all learning systems but to varying degrees depending on the details of how they look around the space of "programs". The space of programs measured for Kolmogorov complexity has a lot of discontinuity.
Seastead this.
Human poker players address this issue by deliberately introducing slight randomness into their play. I think a "Hutter AI" will make better real-world decisions if it does the same (see Game Theory).
Occam's razor is also highly suspect. There's the issue of cultural bias when counting assumptions. And all programmers will be aware of how they fixed "the bug" that caused all the problems in an application, only to find there were other bugs that caused identical symptoms.
Reduce, reuse, cycle
Can't I just punch the monkey for $20 instead?
I made a mistake in stating the criteria. The size of the compressor is irrelevant. Only the size of the decompressor is measured.
Seastead this.
I mean, I hate it when people "verbize" nouns.
compress: s/%/\\%/g s/ the /%/g
and so on. You just saved a bit.
What ever happened to all the super markov models we were supposed to have now with all the extra powerful computers sitting on everyone's desk? Phil Katz was talking about this almost a decade ago, nothing ever came of it.
Does the smallest resulting file matter if it takes a week to compress and a day to uncompress? Why does (lack of) speed have no factor in this?
Easiest thing I can think of without getting too complicated is to create an index of all words on the site and then do replacement in the page compositions with pointers.
ie: "approximately 990,000" according to WikiPedia itself but with slang, etc. make it a cool million.
This should compress the archive as a whole substantially....
Next would be to use an incremental versioning system for incremental edits. Rather than keeping a copy of all article versions... just keep what's changed and pointers back to the original... though this may already be happening (not too familiar with how wikis store versions).
hmm well that's all I can think of on a Sunday evening. Oh yeah.. use some hashing techniques in there somewhere.. something like what is used in Content Addressed Storage archiving systems... that software manages to get huge compression losslessly across diverse sets of text-type data easily.
A fool throws a stone into a well and a thousand sages can not remove it.
Ultimately, you can see the corpus as a space within which various hypotheses about practical AI can be experimentally compared. That's the crucial importance of choosing something general like Wikipedia.
Seastead this.
There's a real word for this. I'm pretty sure "incite" is what everybody's looking for.
I fail to see how putting monkeys at computers would result in something with quality that is significantly different from the current content of Wikipedia
Interested in a Flash-based MAME front end? Visit mame.danzbb.com
Look, I'm no Phd, just a dude that has read a lot. Why does all the stuff I have ever read about Artificial Intelligence sound like bullshit? I mean, where are the Rudy Rucker type robots? Anything even close to AI? It's all failure. All the theory, all the work, and we can't even build a fucking BUG! I know the whole thing contains some of the hardest problems in science, but shit, where is ANY PROGRESS? Should we be trying build positronic brains or something?
It compresses, decompresses, and is always 248 bytes!
(It's a joke...)Well, I don't know about Hutter's proof, but I do know that the "basic theory" as described above is surely wrong. The simple reason: the smallest program that predicts those observations will depend on the programming language used; the optimal AI move however obviously does not depend on the programming language.
The money handout scheme is written right now is
(1-(new record/previous record))*50000.
The contest is seeded with already a 18 MB entry,
So suppose you have labored to get an amazing compressor to compress to 9 megs. and hand that solution in, you will get 25000 euros.
Even, if you padd your solution so that it only shows a 1% improvement each time, you will get somewhere like 35000 euros.
Oh, it gets so much so worse. Kernels can be of arbitrary length, so determining what they do basically falls under the tragic category of the Acceptance Problem -- developing an algorithm that only accepts candidate kernels that actually do perform the business of being a kernel properly. Unless the OP happens to have solved the Acceptance Problem, his proposal doesn't work.
Strictly speaking, the effectiveness of any compression technique has to be measured by including the algorithm's length as part of the length of the output. So if you compress a 100kb document with a 30kb build of bzip2, and the output file is 70kb, you've got a compression ratio of 1. This isn't so much a concern in practice (particularly with something like bzip2 that you reuse), but when you're discussing theoretical "bests", it becomes supremely important for exactly the reason you describe. Kolmogorov complexity, and all that.
Hutter naming this theory after himself is silly. It's called the theory of "minimum description length" and it's been around for a while (well before the 2004 copyright on Hutter's book). The idea is to find some model which minimizes the sum: size of model + incorrectness of model's predictions. The Linguistica folks use it, and in fact would probably be in the running for this prize.
That's been done, and it didn't work.
All they ever wrote about was bananas.
POV! POV!
Help poke pirates in the eyepatch, arr.
The contest requirements add the length of the decompressor program to the size of the compressed file; therefore a dynamic dictionary you build yourself (and store in the file) will be better than using those generic ones. From the faq:
The typical size of current decompressors is less than 100KB. So obscuring them by making them shorter will give you at most an 0.1% = 100KB/100MB advantage (not enough to be eligible for the prize). On the other hand, for fairness it is necessary to include the size of the decompressor. Take a compressor like PAQ8H that contains 800KB of tokens. Clearly it can achieve better compression than one from scratch. If you're not convinced by this argument, consider a an extreme "decompressor" of size 100MB that simply outputs enwik8 byte by byte (from a zero 0 byte archive), thus achieving a compressed size of 0.
What if I don't want to wait 30 minutes and use 1GB of RAM for a trivial 3MB file I want to compress?
Compressing 1GB... ETA March 2036... please wait.
- Adam L. Beberg - The Cosm Project - http://www.mithral.com/
Now I've dabbled in compression (enough to write a basic compressor) and what Matt suggests I believe relies on an unproven axiom. He suggests AI would need to think like a human as we have understand the function P(s) where this hasn't yet been modelled in machines. However our understanding of the function P(s) relies on a lossy format, that is (e.g.) we can roughly remember what an image is and is show it again confirm that, but we cannot replicate it from one instance of viewing, even if we could, it wouldn't be a perfect copy (e.g. of an image of the Mona Lisa down to the millimetre). Lossy thinking would mean that some detail is left out, meaning if replicated in AI it would fail the test, so what Matt actually proposes is an improvement on human thinking (our thought process with perfect decompression), which as we don't understand the latter yet properly I believe is not yet possible. It might be our thought processes *rely* on the fact that we store memories lossy and which degrade over time, in which case it may not be possible to pass the benchmark at all (as it is for lossless compression).
42
There. Who is Al, and what is this prize of Al's that I win?
>Removing all the incorrect and inaccurate data...
;)
removing all the redundant and repetitive adjectives will compress it even further
Just zip the file up, and then continue to zip the resultant zipped file until it is really small. Easy.
Training monkeys for world domination since 1439
The Encyclopedia Galactica is one step closer to fruition!
i'm betting you use greasemonkey to automatically insert your extra sig into the comment field whenever you submit a comment on slashdot.
(1.21 gigawatts) / (88 miles per hour) = 30 757 874 newtons
...if ya just hadn't specified "Lossless". :-)
Coder's Stone: The programming language quick ref for iPad
The question in the post above this is still relevant.
Where is my prize
50'000×(1-S/L). I think this prize should be the approaching 50,000E prize. At some point there is a lower possible limit on S.
Herbert Simon, the father of AI, deduced that humans use what he called satisficing (and a more formal definition). That is, they tend to the first available solution that works. Nothing about "optimal moves" or "smallest program". In fact solutions were usually suboptimal.
Since humans are the only examples of intelligence, artificial or otherwise, that most would agree are "intelligent", I think Herbert Simon's work trumps Hutter's.
Psychological studies show no human tendency to develop any "smallest moves" or even "simplest explanation" (a' la' Occam's Razor) without significant training. Instead humans start with a base set of assumptions and then constantly patch it to adapt to new information. An overall revision may often not occur at all without some significant triggering event (death of a child, accident, etc.).
submit a random character generator. it matches wikipedia's accuracy, and will result in no fewer arguments.
go get it
I'm eagerly awaiting your decompressor.
The Tao of math: The numbers you can count are not the real numbers.
We know that putting monkeys on typewriters [for an extremely long period of time ] will result in the works of Shakespeare
You missed my point, which is that the current quality of information (and writing) on Wikipedia is extremely poor. In fact, poorer than what could be produced by the monkeys on computers given a long period to work.
The idea of the "monkeys on typewriters" illustration is that with an unlimited amount of time, a random process could produce something that would appear "structured." At this point, Wikipedia hasn't been around long enough for the monkey-randomness to produce anything like Shakespeare. And my expectation is that it won't live long enough to get anywhere close.
Further, I think that any writer (which is my day job) can tell you that the quality of Wikipedia isn't likely to get better, because it's actually in a worse situation than the monkeys on computers. The non-random, article-by-committee nature of the writing is doomed to read like something, well... written by a committee. Which is to say: poorly. Add to this the "any-12-year-old can edit articles on physics" openness, and you've got something that not even a monkey could love.
Interested in a Flash-based MAME front end? Visit mame.danzbb.com
Well, while it isn't the same thing this guys is looking for, I have created a new semi-random access compression format specifically because I was having to process Wikipedia data. A friend of mine and I were working on some data mining in Wikipedia and found that uncompressing the 300 Gig file to be unrealistic. So I created RAZ (Random Access Zip) to help us out. It is a very simple system. It simply uses another compressor to compress relatively small chunks, strings them together, and prepends a header with index information. Using normal 7-zip, we get the 300 gigabytes down to about 2.1 GB. Using the RAZ format using 7-Zip, we get it down to about 2.4. We've written a Python module that "opens" the file and allows for random access through seek and read. I'm eventually going to put RAZ on source forge when time allows. For now, I've just written a post about it on my blog (sethnielson.com). The code is still experimental, so I haven't posted it, but you can email me if you're interested in it and I'll send you a copy. -- Seth N.
Herewith is my entry to compress all the useful knowledge of Wikipedida down to the minimum possible:
touch wp.gz
Done! Please donate my prize to the Red Cross.
If you think you can compress a 100M sample of Wikipedia better than paq8f
Anyone can write a program that can compress that sample down to zero bytes. The simplest such implementation of the program will be slightly bigger than the sample however and could only be used to decompress that sample.
Down to one byte, it could work with up to 256 different samples, but only those, and would still be slightly bigger than the sum of those 256 samples.
(Basically, given a byte, regurgitate the whole text which was precompiled into the program.)
A condition of the contest should be that the combination of the program and compressed data should be smaller than both the uncompressed data and the combination of paq8f and the compressed data.
Dare I recoin a phrase: Any sufficiently advanced compression algorithm is indistinguishable from a filing system.
Oh, say does that Star-Spangled Banner entwine / The myrtle of Venus with Bacchus's vine?
Rubbish
Calvin: I like to verb words.
Hobbes: What?
Calvin: I take nouns and adjectives and use them as verbs. Remember when "access" was a thing? Now, it's something you do. It got verbed. Verbing weirds language.
Hobbes: Maybe we can eventually make language a complete impediment to understanding.
Snowclone is the new clich
Technical details aside, *any* lossless compression algorithm can only remove the inefficiencies of representation in the original form. It cannot destroy information and then reproduce it at a later time. Even decesion-based "AI" algorithms destroy information. If there was a decision fork in the road and the algorithm chose the one less traveled by, the later decompressed version would only have memory of the road chosen. For some areas of knowledge, that is unacceptable.
How much did you say the prize would be?
It's very easy to write a program that decompresses Wikipedia pages with perfect accuracy from a very small source file.
1. Download the Wikipedia article for, say, Jellied eels.
2. Follow the 'edit' link.
3. Change the page content to a 100 megabyte repeating sequence of 'I am a fish I am a fish'.
4. Generate this as output.
-- Ed Avis ed@membled.com