Compress Wikipedia and Win AI Prize
Baldrson writes "If you think you can compress a 100M sample of Wikipedia better than paq8f, then you might want to try winning win some of a (at present) 50,000 Euro purse. Marcus Hutter has announced the Hutter Prize for Lossless Compression of Human Knowledge the intent of which is to incentivize the advancement of AI through the exploitation of Hutter's theory of optimal universal artificial intelligence. The basic theory, for which Hutter provides a proof, is that after any set of observations the optimal move by an AI is find the smallest program that predicts those observations and then assume its environment is controlled by that program. Think of it as Ockham's Razor on steroids. Matt Mahoney provides a writeup of the rationale for the prize including a description of the equivalence of compression and general intelligence."
I'd love to be able to have the whole WikiPedia available on my iPod (or cell phone), but without destroying
info.edu.org - Speedy information and news from the Top 10 educational organisations.
Marcus Hutter has announced the Hutter Prize for Lossless Compression of Human Knowledge the intent of which is to incentivize the advancement of AI through the exploitation of Hutter's theory of optimal universal artificial intelligence.
But captain, if we reverse the tachyon inverter drives then we will have insufficient dilithium crystals to traverse the neutrino warp.
For the love of god, proofread!
If it's not on fire, it's a software problem.
Using the same data lossy compressed, with an algorithm that was able to permute data in a similar way to the human mind, seems like it would come closer to real intelligence than the lossless compression would
There. All of wiki, in 31 bytes.
I am very small, utmostly microscopic.
If lossy compression is OK, I can easily squish all of Wikipedia down to 1 bit.
Man, WinRar is taking its bloody time. But oh god, when its done, I'll be rich!
arj
Non impediti ratione cogitationus.
Convert it to AOL! tis wikpedia, teh fri enpedia . teh bst in da wrld.
Stupidity is like nuclear power, it can be used for good or evil. And you don't want to get any on you.
There are some amazing compression programs out there, trouble is they tend to take a while and consume lots of memory. PAQ gives some impressive results, but the latest benchmark figures are regularly improving. Let's not forget that compression is not good unless it is integrated into a usable tool. 7-zip seems to be the new archiver on the block at the moment. A closely related, but different, set of tools are the archivers, of which there are lots with many older formats still not supported by open source tools
"The basic theory...is that after any set of observations the optimal move by an AI is find the smallest program that predicts those observations and then assume its environment is controlled by that program." In a finite discrete environment ( like Shurdlu: put the red cylinder on top of the blue box ) that may be possible. But in the real world the problem is knowing that one's observations are all - or even a significant percentage - of the possible observations.
This - in humans, at least - can lead to the cyclic reinforcement of one's belief system. The belief system that explains observations initially is used to filter observations later.
TFA is a neat idea theoreretically, but it's progeny will never be able to leave the lab.
--
I figured out how to get a second 120-byte sig! Mod me up and I'll tell you how you can have one too.
Just use lzip! 100% compression on any data, even if it's already been compressed by another utility! It works fantastically, but you may run into trouble if you try uncompressing the data.
a) how big the compressed size was
18MB
b) how many bytes was wikipedia before it was compressed
A sample of 100MB
Your goal:
.
KFG
Given that the hypothesis is valid (which is arguable), it seems to me that compressing wikipedia is a fairly useless way of supporting it. It seems like an abstraction error: Wikipedia is *not* a set of rules that predict the observations in it. It's a list of observations, sure, but there's no ruleset involved. Now, someone/thing who can read and parse language can get educated based on the knowledge in wikipedia, but then the intelligence is providing the ruleset, just training itself with the raw data in wiki.
It really seems like one of those mistaking-the-map-for-the-territory errors.
-b
If I wanted a sig I would have filled in that stupid box.
I mentioned this above in a reply too.
What I dont understand is why we cannot zip up open source programs to an extreme amount.
For example, the zip archive tool scans the whole code, it finds repeats in the code (1001010), abbreviates them and then indexes them in a seperate file within the archived file, then when the other computer begins to extract, it takes on the work of plonking back the repeated code, making the archive tiny.
Yes, takes a lot more CPU work, but the download is so much smaller.
If you're being serious then, that's briliant, where can I download your app?
Removing all the incorrect and inaccurate data from the Wikipedia sample should "compress" it down to at least 20mb.
Then just apply your personal favourite compression utility.
I like lharc, which according to Wikipedia was invented in 1904 as a result of bombarding President Lincoln, who plays Commander Tucker in Star Trek: Enterprise with neutrinos.
I wrote a script right here. It will programatically generate all of Wikipedia. Eventually.
Quidquid latine dictum sit, altum sonatur.
the zip archive tool scans the whole code, it finds repeats in the code (1001010), abbreviates them and then indexes them in a seperate file within the archived file, then when the other computer begins to extract, it takes on the work of plonking back the repeated code, making the archive tiny.
Huh? What do you think the compression software is doing right now? It searches through the file for blocks of similar information, and then replaces those blocks with pointers to an index, where it stores the block itself. The more repetition you have in the data, the more compression you can get out of it. Good algorithms use (I believe) variable block sizes, and reset the indices as they work their way through the file based on what is probably optimal.
Putting an extra layer of compression on top of that, by replacing frequently used blocks of code, or replacing English words with abbreviations and re-inflating them at the opposite end (as someone else in this thread suggested), would only make the data being fed into the "real" compression utility a little harder to compress. It's unlikely that you could do a better job than any reasonable compression program does already with the same data.
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
That's easy, all you have to do is run your program on a computer that users 32-bit bytes. That way you can fit more bits in your bytes, and automatically beat the record by 4 times.
Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
If you read the rules of the contest it states that a submission has to be a single executable that produces the 100 mb file. It's the size of that decompressor that counts. So no, you couldn't do that.
Doesn't the dictionary in PAQ8A,B,C,D result in smaller filesizes if you're talking about a 100M+ large file?
------ The best brain training is now totally free : )
Indexing into that extra file table takes bits. For example, if you have 1024 different possible abbreviations, you need to use on average 10 bits (at the least) to index that table for each abbrevation. You could use less bits for the more common ones (like huffman coding), but you are still wasting space indexing.
This is the type of contest we need to see more of.
Sorry, anything which uses the word "incentivize" does not involve intelligence, natural or artificial.
And as we advance in AI, we come closer to making computers just like you!
ALERT:
I do not understand your command due to error 1446486428499 (Drunken stupor)
I put the 't' in electrical engineering.
The contest for the Hutter Prize requires the compressed corpus to be a self-extracting archive -- or failing that to add the size of the compressor to the compressed corpus.
Seastead this.
echo "!#/bin/sh\nwget en.wikipedia.org/enwiki/" > archive
Mine wins as it is roughly 40 bytes total.To get your results, you simply need to run the self-extracting archive, and wait. Be warned, it will take a while, but that is the cost of such a great compression scheme!
DYWYPI?
If someone could write a program that simulates 1000 virtual monkeys, all typing on typewriters, eventually they would end up with wikipedia. Or maybe they would end up with the equivalent of wikipedia very quickly.
Register the editry.
Not to be concited, but I thought of that over a decade ago. I labeled it Transformation Theory. The theory essentially says, given an input and a desired output am explicit mapping can be drawn between the two and algorithms (and hence AI) derive from applying reductions to the mapping (eg. compression). I later dubbed it Reductionary Transformation Theory, so as not to confuse it with another meme by the same name.
:T:R:A:N:S:
If you're unfamiliar with the current state of the art in a field that's under intensive study, it's a near guarantee that your new ideas either don't work or are already a very basic part of the current technology. There's really no shortcut to doing the background reading and understanding how things are currently done before setting out to improve them. If you want a starting point in this area of study, here's one; once you've gotten through that you can start on the remaining 22 years' worth of research.
I would argue that lossless compression really is not the best measure of intelligence. Humans are inherently lossy in nature. Everything we see, hear, fear, smell, and taste is pared down to its essentials when we understand it. It is this process of discarding irrelevant detials and making generalizations that is truly intelligence. If our minds had lossless compression we could regurgitate textbooks, but never be able to apply the knowledge contained within. If we really understand, we could reproduce what we've read, but not verbatim. A better measure of intelligence would be lossy text compression that still retains the knowledge contained within the corpus.
Sounds quite similar to Solomonoffs' universal prediction.
"If anything can go wrong, it will." - Murphy
Since it's wikipedia, am I allowed to edit the entries before I compress them? ;-)
This guy's the limit!
The premise of this contest shows a remarkable ignorance of compression theory and technology. True, in theory the more knowledge one has about the world, the better one should be able to compress, but only in the most abstract sense. In practice the vast majority of the bits any "image" of the world like a picture or a piece of text are almost wholly unrelated to high-level features of the world, and thus compression algorithms rightly focus on low-level features of the "image" that are almost wholly separate from the features AI researchers care about.
For example, suppose one wishes to compress two consecutive pictures of a chess tournament. Does knowing the rules of chess help much? Sure, one might use knowledge of chess to deduce a few bits (literally) of information about the second image (what move was made), but when trying to compress two 10-megabit images, who cares? Much better to focus on low-level features such as world smoothnesss, lighting variations, motion flow, et cetera.
Similarly for text and speech. How much does understanding the topic of conversation help? Not much, compared with the knowledge of the most recent several words, which is why all good compression (and prediction) algorithms essentially ignore "understanding" and focus on carefully calculating the mixing of probabilities derived from low-frequency observations.
The winner of a Wiki-compression contest is going to be a variation of an ordinary text compression algorithm, and will use techniques that do not in any obvious way translate to "smart" applications, in the same way that the highly successful N-gram models of speech recognition and machine translation do not have any properties one would normally associate with intelligence.
One can build an "intelligent" compressor, but it's the LAST thing any compression researcher is going to do, because they know high-level intelligence doesn't have many bits of information to provide.
Print it out... Then it doesn't use any data space, only physical space
Since the decimals along Pi are essentially an infinate lookup table (there are phone number lookup programs which find information in pi), couldn't you use the calculation of pi to produce your data tree.
The entire 100mb file and everything else should be identifiable by a single location along pi.
mmmmmmmm pi.
liqbase
A (good) sign of the times, I guess.
wikiped doubleplusungood ref compression rewrite fullwise newspeak upsub antefiling
echo I'mright!NoI'mright!You'reanasshole!So'syourmother !! >distilled-wiki.txt
Nope, the number of digits needed to address any significant data contained in pi would be phenomenally large.
And indeed you can come up with all kinds of rules to help you predict the characters in a stream like Wikipedia, not just linguistic rules, but higher level rules are less likely to become crucial with the 100M contest and may need to wait for the 1G (or complete archive of Wikipedia) contest. I can imagine that certain rules relating to relational similarity (analogy) might come into play at the 100M level, particularly since there are now programs that can perform as well as college bound seniors on the SAT verbal analogies test (a heavily 'g' loaded test).
Seastead this.
# tar cvf wikipedia.tar /wikipediapath
# bzip2 wikipedia.tar
No Sigs!
Start with: "the" "and" Then remove periods commas semicolons Speling mistkes ok if readable If other words are commonly used those could be removed It would be a first a fill in the blank-as-you-read encyclopedia Sales staff for door to door enclopedias can convince you new edition saves trees
;)
The above text is a demonstration. No punctuation. Left out "and" and "the". Since some entries are hard to read, the fill-in-the-blank interpretation adds to the experience!
If this new compression is added to the existing compression, I'm sure it'll beat everything.
...be a program that creates an "encyclopedia" website and invites humans to contribute to it?
"How to Do Nothing," kids activities, back in print!
Very simple way to win this contest:
Couldn't be easier!
I don't care if it's 90,000 hectares. That lake was not my doing.
Stupidity is like nuclear power, it can be used for good or evil. And you don't want to get any on you.
All the articles worth having will be deleted within 4 month's time, anyway.
In theory, you can end up with a location that takes a number so large that you need more than 100MB (mb = milibits ?) to store it.
... hummm ... kind of fun, I suppose.
Also, you would have to either store enough digits of pi (which would defeat any compression ideas), or calculate it "on site", which would be
morcego
"The end result of this is a very fast way to find a seven digit sequence within PI. So have fun, search for your phone number or someone elses."
;-)
I have an eight-digit phone number, you insensitive clod.
"The number 0000000 was found starting at position 3794572 to the right of the decimal point."
So replacing one seven digit number with another one, good work.
--
no sig for you. come back one year.
that the entire knowledge of the world could simply be compressed without loss to
yeah, you guessed it..
42...
He who would trade liberty for some temporary security, deserves neither liberty nor security
For what architecture?
Absolutely! The "right answer" for best *intelligent* compression would only store a *minimal* set of pertinent data points and then would use intellect to flesh out the details on decompression. Although you could end up with http://www.reducedshakespeare.com/, I'll worry that when some AI researcher starts down this path
Minor disagreement with what you said... The details are relevant; they just aren't important enough to store in and of themselves. Sort of like mathematics where you only need to learn the principle (pertinent data) and how to apply it. Then every example is just a specific case (flesh out the details).
It takes real intelligence to know which priciple to apply or that none apply and you're doing real research. Genius is being able to find such a solution where none existed before. The details are relevant but just not important enough to bother storing since they can be reasonably recreated by applying real inteligence. So, real intelligent compression would mean that different people decompressing the same article at different times would *not* get the same text but enough of the pertinent data would be the same that they all come away from reading the article with the same meaning. That would be real intelligent compression and decompression.
Cheers,
Dave
They that can give up essential liberty to obtain a little temporary safety deserve neither safety nor liberty.
Ben
What libraries is this executable allowed to call?
Very simple way to win this contest:...
The only problem is that your index will be as big as your original file.
Let's say you want to store your ten-digit phone number using that PI technique. A ten-digit phone number will be found at an average index of 10^10, or 10,000,000,000, the problem is that index is 11 digits long, so your really not winning. In this particular case, you could hope to find your index around 8^100,000,000, which would take 100,000,000 bytes (or 100 MB, the original size) to store.
Couldn't be easier! ;^)
hehehe, right.
You just got troll'd!
for any given dataset there is a pseudo-random number generator that, with the proper parameters, produce that dataset of bytes in sequence.
The problem is that it is non-trivial to find the proper algorithm and parameters. However, if trial-and-error computation is not a problem, a large library of potential generators can be tried with all possible parameters.
So, we need a WikiCast - remember folks, you heard it here first!
antipaucity
... now all we need is a dictionary for nudity and we could save a lot of bandwidth on the Internet!
The way I see it, there's mainly one good way to compress text (Disclaimer : I do not know if this idea has ever been used before, but if it has i'd be glad to know about it). Let's say you index the values of the most frequently used characters, and that you keep a 'joker' value to insert a custom character after, let's say that it sums up to about 45 characters for these indexed characters. You replace each character by its indexed value, and instead of having it in base 256, you calculate it all in base 45.
If we consider that the unfrequent bytes' weight is unsignificant, we go from 100 MB of text (or HTML, the characters for the tags aren't that many), we get 100/256*45 = 17.5 MB. I guess that one top of that you could compress it using some traditional algorithm, although most of the job must have been done with it.
Is there a major flaw in my reasoning or does it sound like a suitable idea?
You just got troll'd!
douginadress
Ace
Points are not awarded for attempting to circumvent the intent of the competition. I expect such attempts would result in future submissions from the same source being ignored.
Seastead this.
I've managed to get wikipedia to fit on a knoppix ISO image. I started off with the CD iso , turned on mysql and apache , installed mediawiki , and put on the iso a wikipedia database file , which I made from the sql dumps that the wikipedia foundation put out. (although now they don't , its xml only apparently)
So basically you boot the disk , start a web browser , go to http://localhost/wikipedia/index.php , and you can start searching for articles.
That , plus all the images I could fit , is about 4.1G. I've got a first version done , I'm just trying to get the thing uploaded somewhere so people can help with further development. I hope to go to a dual layer dvd , to fit on about 4G more images.
The fact that you have to reboot your machine to use it is a bit of a bummer , hopefully there's some way to run qemu or vmware off the disk , so the disk can boot itself inside a VM.
is anyone interested in this?
I hope the prize fund becomes very large and someone like you comes up with an algorithm and gets rich enough to retire.
Seastead this.
two digits....42. Bring on the prize!
Hutter's theory is about an active AI agent seeking to maximize some utility function rather than passive prediction. However, it is probably the case that prizes for human knowledge compression should have been in place a long time ago -- probably as long ago as William of Ockham. IMHO guys like Solomonov provided sufficient formal rigor to justify massive government funding of a prize competition for compression of human knowledge decades ago, but it seems to be the case that people who control money generally just can't bring themselves to fund prize competitions for objective criteria. This generally doesn't speak well of people who control money. The exceptions are highly notable.
Seastead this.
The fact that there are local optimal in utility functions in which you can get stuck is a problem for all learning systems but to varying degrees depending on the details of how they look around the space of "programs". The space of programs measured for Kolmogorov complexity has a lot of discontinuity.
Seastead this.
Well... maybe. It depends a lot on what data you are trying to compress. As a best-case scenario, it can compress the entire decimal expansion of PI down to a single digit: "0". Infinite digits compressed to a single byte, not bad eh?
I don't care if it's 90,000 hectares. That lake was not my doing.
Human poker players address this issue by deliberately introducing slight randomness into their play. I think a "Hutter AI" will make better real-world decisions if it does the same (see Game Theory).
Occam's razor is also highly suspect. There's the issue of cultural bias when counting assumptions. And all programmers will be aware of how they fixed "the bug" that caused all the problems in an application, only to find there were other bugs that caused identical symptoms.
Reduce, reuse, cycle
Can't I just punch the monkey for $20 instead?
I made a mistake in stating the criteria. The size of the compressor is irrelevant. Only the size of the decompressor is measured.
Seastead this.
I mean, I hate it when people "verbize" nouns.
compress: s/%/\\%/g s/ the /%/g
and so on. You just saved a bit.
Easiest thing I can think of without getting too complicated is to create an index of all words on the site and then do replacement in the page compositions with pointers.
ie: "approximately 990,000" according to WikiPedia itself but with slang, etc. make it a cool million.
This should compress the archive as a whole substantially....
Next would be to use an incremental versioning system for incremental edits. Rather than keeping a copy of all article versions... just keep what's changed and pointers back to the original... though this may already be happening (not too familiar with how wikis store versions).
hmm well that's all I can think of on a Sunday evening. Oh yeah.. use some hashing techniques in there somewhere.. something like what is used in Content Addressed Storage archiving systems... that software manages to get huge compression losslessly across diverse sets of text-type data easily.
A fool throws a stone into a well and a thousand sages can not remove it.
Ultimately, you can see the corpus as a space within which various hypotheses about practical AI can be experimentally compared. That's the crucial importance of choosing something general like Wikipedia.
Seastead this.
Well... maybe. It depends a lot on what data you are trying to compress.
Well yes and no. Either your data is the decimals of PI, as you just said, or it is some particular case that is extremely unlikely to happen. Correct me if I'm wrong, but you would have about 1 chance out of 8 to gain a byte in your index data (or, for example, you would have about 1 chance out of 1 billion to save 10 bytes, out of 100 MB, well that's if I didn't make any mistake)
You just got troll'd!
There's a real word for this. I'm pretty sure "incite" is what everybody's looking for.
I fail to see how putting monkeys at computers would result in something with quality that is significantly different from the current content of Wikipedia
Interested in a Flash-based MAME front end? Visit mame.danzbb.com
Look, I'm no Phd, just a dude that has read a lot. Why does all the stuff I have ever read about Artificial Intelligence sound like bullshit? I mean, where are the Rudy Rucker type robots? Anything even close to AI? It's all failure. All the theory, all the work, and we can't even build a fucking BUG! I know the whole thing contains some of the hardest problems in science, but shit, where is ANY PROGRESS? Should we be trying build positronic brains or something?
It compresses, decompresses, and is always 248 bytes!
(It's a joke...)Well, I don't know about Hutter's proof, but I do know that the "basic theory" as described above is surely wrong. The simple reason: the smallest program that predicts those observations will depend on the programming language used; the optimal AI move however obviously does not depend on the programming language.
The money handout scheme is written right now is
(1-(new record/previous record))*50000.
The contest is seeded with already a 18 MB entry,
So suppose you have labored to get an amazing compressor to compress to 9 megs. and hand that solution in, you will get 25000 euros.
Even, if you padd your solution so that it only shows a 1% improvement each time, you will get somewhere like 35000 euros.
In theory, you can end up with a location that takes a number so large that you need more than 100MB (mb = milibits ?) to store it.
Not only can, but will the vast majority of the time. Most strings are incompressible by *any* algorithm. (Strings consisting of human-readable text are of course exceptions). And yet, past a certain size it is unprovable that any specific string is incompressible. Algorithmic information theory is wacky.
How to solve most of our problems: 1.Lots of nuclear plants. 2.Cure aging.
Oh, it gets so much so worse. Kernels can be of arbitrary length, so determining what they do basically falls under the tragic category of the Acceptance Problem -- developing an algorithm that only accepts candidate kernels that actually do perform the business of being a kernel properly. Unless the OP happens to have solved the Acceptance Problem, his proposal doesn't work.
Strictly speaking, the effectiveness of any compression technique has to be measured by including the algorithm's length as part of the length of the output. So if you compress a 100kb document with a 30kb build of bzip2, and the output file is 70kb, you've got a compression ratio of 1. This isn't so much a concern in practice (particularly with something like bzip2 that you reuse), but when you're discussing theoretical "bests", it becomes supremely important for exactly the reason you describe. Kolmogorov complexity, and all that.
Hutter naming this theory after himself is silly. It's called the theory of "minimum description length" and it's been around for a while (well before the 2004 copyright on Hutter's book). The idea is to find some model which minimizes the sum: size of model + incorrectness of model's predictions. The Linguistica folks use it, and in fact would probably be in the running for this prize.
POV! POV!
Help poke pirates in the eyepatch, arr.
The contest requirements add the length of the decompressor program to the size of the compressed file; therefore a dynamic dictionary you build yourself (and store in the file) will be better than using those generic ones. From the faq:
The typical size of current decompressors is less than 100KB. So obscuring them by making them shorter will give you at most an 0.1% = 100KB/100MB advantage (not enough to be eligible for the prize). On the other hand, for fairness it is necessary to include the size of the decompressor. Take a compressor like PAQ8H that contains 800KB of tokens. Clearly it can achieve better compression than one from scratch. If you're not convinced by this argument, consider a an extreme "decompressor" of size 100MB that simply outputs enwik8 byte by byte (from a zero 0 byte archive), thus achieving a compressed size of 0.
What if I don't want to wait 30 minutes and use 1GB of RAM for a trivial 3MB file I want to compress?
Compressing 1GB... ETA March 2036... please wait.
- Adam L. Beberg - The Cosm Project - http://www.mithral.com/
Now I've dabbled in compression (enough to write a basic compressor) and what Matt suggests I believe relies on an unproven axiom. He suggests AI would need to think like a human as we have understand the function P(s) where this hasn't yet been modelled in machines. However our understanding of the function P(s) relies on a lossy format, that is (e.g.) we can roughly remember what an image is and is show it again confirm that, but we cannot replicate it from one instance of viewing, even if we could, it wouldn't be a perfect copy (e.g. of an image of the Mona Lisa down to the millimetre). Lossy thinking would mean that some detail is left out, meaning if replicated in AI it would fail the test, so what Matt actually proposes is an improvement on human thinking (our thought process with perfect decompression), which as we don't understand the latter yet properly I believe is not yet possible. It might be our thought processes *rely* on the fact that we store memories lossy and which degrade over time, in which case it may not be possible to pass the benchmark at all (as it is for lossless compression).
Nahhhhh you misunderstand the concept here,
Let's suppose the binary string I am trying to compress is "141592653589793238462643383279502884", this exists at position 1 to the right of the decimal point.
Since Pi is infinite there will be a matching location for *any* string somewhere within pi, you just have to tell them the index and they can calculate it (negating the fact that they might require a few millennia to decompress the "hello world" message).
The same should also be possible by supplying coordinates and vectors into the mandelbrot set, letting the user regenerate the results based upon this lookup.
Never having to store the dictionary away would enable higher compression ratios.
Its certainly not practical, but it isn't impossible to conceive.
liqbase
>Removing all the incorrect and inaccurate data...
;)
removing all the redundant and repetitive adjectives will compress it even further
Just zip the file up, and then continue to zip the resultant zipped file until it is really small. Easy.
Training monkeys for world domination since 1439
You get the text of Wikipedia by just running this program often enough.
As a bonus, you also get the complete works of Shakespeare.
For greater clarity of the algorithm, you may want to make a symlink to
Now, where can I get my money?
The Tao of math: The numbers you can count are not the real numbers.
i'm betting you use greasemonkey to automatically insert your extra sig into the comment field whenever you submit a comment on slashdot.
(1.21 gigawatts) / (88 miles per hour) = 30 757 874 newtons
...if ya just hadn't specified "Lossless". :-)
Coder's Stone: The programming language quick ref for iPad
The question in the post above this is still relevant.
Well, then the solution is easy: Instead of searching the position in pi for "compression" and reading from for "decompression", do it the opposite way: For compression, look up what is at the corresponding position of pi, and for decompression, find the position in pi where the coded information can be found. Given that the first method vastly increases the size almost always on "compression" and therefore vastly decreases the size on "decompression", doing the reverse should of course vastly decrease the size on compression and vastly increase it on decompression.
The Tao of math: The numbers you can count are not the real numbers.
submit a random character generator. it matches wikipedia's accuracy, and will result in no fewer arguments.
go get it
I'm eagerly awaiting your decompressor.
The Tao of math: The numbers you can count are not the real numbers.
We know that putting monkeys on typewriters [for an extremely long period of time ] will result in the works of Shakespeare
You missed my point, which is that the current quality of information (and writing) on Wikipedia is extremely poor. In fact, poorer than what could be produced by the monkeys on computers given a long period to work.
The idea of the "monkeys on typewriters" illustration is that with an unlimited amount of time, a random process could produce something that would appear "structured." At this point, Wikipedia hasn't been around long enough for the monkey-randomness to produce anything like Shakespeare. And my expectation is that it won't live long enough to get anywhere close.
Further, I think that any writer (which is my day job) can tell you that the quality of Wikipedia isn't likely to get better, because it's actually in a worse situation than the monkeys on computers. The non-random, article-by-committee nature of the writing is doomed to read like something, well... written by a committee. Which is to say: poorly. Add to this the "any-12-year-old can edit articles on physics" openness, and you've got something that not even a monkey could love.
Interested in a Flash-based MAME front end? Visit mame.danzbb.com
Well, while it isn't the same thing this guys is looking for, I have created a new semi-random access compression format specifically because I was having to process Wikipedia data. A friend of mine and I were working on some data mining in Wikipedia and found that uncompressing the 300 Gig file to be unrealistic. So I created RAZ (Random Access Zip) to help us out. It is a very simple system. It simply uses another compressor to compress relatively small chunks, strings them together, and prepends a header with index information. Using normal 7-zip, we get the 300 gigabytes down to about 2.1 GB. Using the RAZ format using 7-Zip, we get it down to about 2.4. We've written a Python module that "opens" the file and allows for random access through seek and read. I'm eventually going to put RAZ on source forge when time allows. For now, I've just written a post about it on my blog (sethnielson.com). The code is still experimental, so I haven't posted it, but you can email me if you're interested in it and I'll send you a copy. -- Seth N.
If you think you can compress a 100M sample of Wikipedia better than paq8f
Anyone can write a program that can compress that sample down to zero bytes. The simplest such implementation of the program will be slightly bigger than the sample however and could only be used to decompress that sample.
Down to one byte, it could work with up to 256 different samples, but only those, and would still be slightly bigger than the sum of those 256 samples.
(Basically, given a byte, regurgitate the whole text which was precompiled into the program.)
A condition of the contest should be that the combination of the program and compressed data should be smaller than both the uncompressed data and the combination of paq8f and the compressed data.
Dare I recoin a phrase: Any sufficiently advanced compression algorithm is indistinguishable from a filing system.
Oh, say does that Star-Spangled Banner entwine / The myrtle of Venus with Bacchus's vine?
Just as soon as the Judge's computer finishes decoding it. It's slow going but - as an unexpected side effect - we are also getting the complete Library of Babel.
Calvin: I like to verb words.
Hobbes: What?
Calvin: I take nouns and adjectives and use them as verbs. Remember when "access" was a thing? Now, it's something you do. It got verbed. Verbing weirds language.
Hobbes: Maybe we can eventually make language a complete impediment to understanding.
Snowclone is the new clich
ERROR: The files are, like, talkin' to me, man! Oh... I get it! The I/O ports are colors! No, they're not representing colors, they are colors! It all makes sense now!
How much did you say the prize would be?
Interestingly I got the correct text in the first try. Unfortunately it was encrypted with a one-time pad, and I'm still struggling to get the key ...
The Tao of math: The numbers you can count are not the real numbers.
It's very easy to write a program that decompresses Wikipedia pages with perfect accuracy from a very small source file.
1. Download the Wikipedia article for, say, Jellied eels.
2. Follow the 'edit' link.
3. Change the page content to a 100 megabyte repeating sequence of 'I am a fish I am a fish'.
4. Generate this as output.
-- Ed Avis ed@membled.com