Compress Wikipedia and Win AI Prize

WikiPedia on iPod! by network23 · 2006-08-13 10:51 · Score: 2, Interesting

I'd love to be able to have the whole WikiPedia available on my iPod (or cell phone), but without destroying

info.edu.org - Speedy information and news from the Top 10 educational organisations.

Re:WikiPedia on iPod! by Fred+Porry · 2006-08-13 11:08 · Score: 3, Insightful

Then it would be an encyclopedia, not a Wiki, thats another point why I say: forget about it. Would be nice though. ;)
Re:WikiPedia on iPod! by CastrTroy · 2006-08-13 11:25 · Score: 3, Insightful

Well, since it's currently only 1 Gig, you could probably put it on a flash card and read it from a handheld. It wouldn't be an ipod. but probably wouldn't require destroying a perfectly good piece of equipment either. You could probably even get weekly updates (hopefully in a diff file) to make sure your copy is in sync with the rest of the internet. Now that I think about it, this would be a really good application. There's lots of times when I'd like to look up something off wikipedia, but not connected to the internet.

--

Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Re:WikiPedia on iPod! by Nutria · 2006-08-13 11:59 · Score: 1

Well, since it's currently only 1 Gig?

You didn't RTFA, did you?

--
"I don't know, therefore Aliens" Wafflebox1
Re:WikiPedia on iPod! by Asztal_ · 2006-08-13 12:16 · Score: 4, Funny

Umm... which of the 5 thousand links is the article?
Re:WikiPedia on iPod! by Millenniumman · 2006-08-13 13:20 · Score: 1

The 3897th one.

--
Stupidity is like nuclear power, it can be used for good or evil. And you don't want to get any on you.
Re:WikiPedia on iPod! by timboc007 · 2006-08-13 13:42 · Score: 1

Then it would be an encyclopedia, not a Wiki

Not if it were combined with a method for periodically re-syncing... then you would still get the best of wiki (regular updates as they happen) but have it available offline also.
Re:WikiPedia on iPod! by Fred+Porry · 2006-08-13 14:03 · Score: 1

combined with a method for periodically re-syncing
If it would be doable then this would be just perfect- I guess, thats the task...
Re:WikiPedia on iPod! by Fordiman · 2006-08-13 14:11 · Score: 1

I'm thinking, why not just store a copy of Wikipedia (as just webpages - a full mirror) in a squashfs (or similarly block-compressed) archive.

It's the whole instant-access thing.

--
110100 1101000 1101000 1100110 0 1101111 1101000 1100011 1
Re:WikiPedia on iPod! by Kyoushu · 2006-08-14 00:52 · Score: 1, Funny

I like to hitchhike, and I could use a guide like this on my travels.
Re:WikiPedia on iPod! by andrewman327 · 2006-08-14 01:52 · Score: 1

That sounds a lot like AvantGo, only that uses Palm OS so you can use the bigger touchscreen. If you are going to have an encyclopedia put on your iPod, why not use a real one like Britannica instead of Wikipedia? That way you could cite it,

--
Information wants a fueled airplane waiting at the hangar and no one gets hurt.
Re:WikiPedia on iPod! by Nuskrad · 2006-08-14 05:53 · Score: 1

You should never cite any encyclopedia in a university level paper. Encyclopedia never provide the subject depth needed to write an essay.
Re:WikiPedia on iPod! by andrewman327 · 2006-08-14 06:06 · Score: 1

I am aware of that which is why I have a library card with the Library of Congress. There are times, however, when you need to use a specific fact and citing an encyclopedia makes sense. The years that major treaties are enacted, for example.

--
Information wants a fueled airplane waiting at the hangar and no one gets hurt.
Re:WikiPedia on iPod! by trentblase · 2006-08-14 18:38 · Score: 1

Can't you just cite the treaty itself?

But captain by Anonymous Coward · 2006-08-13 10:52 · Score: 5, Funny

Marcus Hutter has announced the Hutter Prize for Lossless Compression of Human Knowledge the intent of which is to incentivize the advancement of AI through the exploitation of Hutter's theory of optimal universal artificial intelligence.

But captain, if we reverse the tachyon inverter drives then we will have insufficient dilithium crystals to traverse the neutrino warp.

Re:But captain by Anonymous Coward · 2006-08-13 11:09 · Score: 5, Funny

You left out the part involving the deflector shield. Remember, the first rule of star trek technobabel is always involve the deflector in some way.
Re:But captain by bcat24 · 2006-08-13 12:00 · Score: 1

Come on, just reverse the damn polarity already.
Re:But captain by WilliamSChips · 2006-08-13 12:25 · Score: 1, Offtopic

I thought the first rule of Star Trek was that the redshirts always die.

--
Please, for the good of Humanity, vote Obama.
Re:But captain by Anonymous Coward · 2006-08-13 13:05 · Score: 0

Captain, without the help of the deflector when we reverse the tachyon inverter drives, we won't have insufficient dilithium crystals to traverse the neutrino warp. Unfortunately, the deflector was last seen wearing a red shirt and beaming down to the planet with the away team.
Re:But captain by ScrewMaster · 2006-08-13 13:06 · Score: 1

And the second rule is to always use a verteron pulse wherever possible, preferably transmitted through subspace.

--
The higher the technology, the sharper that two-edged sword.
Re:But captain by Amouth · 2006-08-13 13:41 · Score: 1

first.. it is always first.. then you might have the chance of killing of a yellow shirt 0or maming a blue one

--
'...if only "Jumping to a Conclusion" was an event in the Olympics.'
Re:But captain by Anonymous Coward · 2006-08-13 14:33 · Score: 0

Damnit, the first rule is don't talk about it.
Re:But captain by Anonymous Coward · 2006-08-13 14:36 · Score: 1, Funny

Off Topic, I know... but I must ask...

Do you think that Shatner, while just standing around, like in the supermarket, or waiting in line at the post office, just kinda gets this faraway look on his face, and screams out in a tortured voice...

KAAAAHHHHHHHHNNNNN!!! ...Then just goes back to what he was doing like nothing happened?
Re:But captain by Anonymous Coward · 2006-08-13 14:47 · Score: 1, Funny

No, but I suppose people do that *to* him!
Re:But captain by StarkRG · 2006-08-13 19:40 · Score: 1

Star Trek may have popularised technobabble but Dr. Who did it before! The worst of it was usually reversing the tacheon pulse. At least that was what Pertwee always did, not sure what they did before him...
Re:But captain by D-Cypell · 2006-08-13 20:25 · Score: 1

Not in TNG, then it was the command officers in the red shirts. My theory, is that by this point in time, the federation were beginning to run a little low on the red shirts and figured they better start issuing the remainder to the less expendable members of the crew. Either that or it was a gesture of the memorial to all the red shirted folks that Kirk lead to their deaths.
Re:But captain by escher · 2006-08-14 06:08 · Score: 1

You can only do that in the last five minutes. Trying to reverse the polarity before then will result in a new catastrophy that you will have to fix some other way in the last five minutes unless it's a "hidden obvious solution", in which case you just reverse the polarity again in the last five minutes.

Painful to read by CuriHP · 2006-08-13 10:54 · Score: 3, Insightful

For the love of god, proofread!

--
If it's not on fire, it's a software problem.

Re:Painful to read by Threni · 2006-08-13 11:24 · Score: 2, Insightful

> For the love of god, proofread!

Yeah, I just read the write-up twice and have no idea if this is an AI contest, something to do with compression, or what. In fact, all I can remember now is the word "incentivize" which is the sort of thing I expect some bullshit salesman at work to say.
Re:Painful to read by Anonymous Coward · 2006-08-13 11:55 · Score: 0

Agreed - what a dreadful summary. It has senseless grammar, invented words, and nine links. Good job, Taco - you've phoned another one in.
Re:Painful to read by Anonymous Coward · 2006-08-13 12:03 · Score: 1, Interesting

Mahoney is a professor at my college. I've taken his classes. He talks just how he writes.
Re:Painful to read by maximthemagnificent · 2006-08-13 13:44 · Score: 0, Redundant

Amen!
Re:Painful to read by ameline · 2006-08-13 14:49 · Score: 2, Insightful

Agreed -- why can they not MOTIVATE us instead?

No, they need to verbize another noun when there was a perfectly good word in the language that means *exactly* what they want. feh.

--
Ian Ameline
Re:Painful to read by PeeAitchPee · 2006-08-13 21:10 · Score: 4, Funny

He did, but Slashdot's AI compressed it for him.

:-D
Re:Painful to read by maxwell+demon · 2006-08-14 00:42 · Score: 1

I'm still trying to find out the difference between verbing a noun, verbizing it, verbifying it and verbificating it ...

--
The Tao of math: The numbers you can count are not the real numbers.
Re:Painful to read by Crayon+Kid · 2006-08-14 03:55 · Score: 1

Missa is Mahoiney. Missa gonna teach you the numbers.

--
i ate crayons when i was a kid and now i have two braincells and the blue ones taste nicer
Re:Painful to read by doctrbl · 2006-08-14 06:24 · Score: 1

Agreed -- why can they not MOTIVATE us instead?

No, they need to verbize another noun when there was a perfectly good word in the language that means *exactly* what they want. feh.

I disagree. My understanding is that a person might be motivated by carrot (something good that you want), stick (something negative that you want to avoid), or a combination of these. To incentivize is to motivate with the carrot alone and NOT the stick.

Very different in my opinion, but I work in an office and maybe my brain just works that way now :(
Re:Painful to read by PriceIke · 2006-08-14 06:34 · Score: 1

> No, they need to verbize another noun when there was a perfectly good word in the language that means *exactly* what they want. feh.

Not exactly. To motivate someone, you are making someone want to do something. You motivate a person. You incentivize a goal, meaning you provide an incentive for whomever achieves that goal, which is one way to motivate someone but not the only way. Likewise you can incent someone to perform a task for you by offering an incentive, which is a form of motivation, but the concept is not always interchangable with "to motivate".

I'm not crazy about either word (incent or incentivize) but they do have their uses in certain rare circumstances .. regardless, whenever I find myself tempted to use either of the words, I figure there's probably a better sentence out there to deliver the idea I'm trying to express, and I start over.

--
It's not a lie. It's the truth with lossy compression.
Re:Painful to read by PriceIke · 2006-08-14 06:37 · Score: 1

Heh. I've found a thread where my sig actually sounds relevant.

--
It's not a lie. It's the truth with lossy compression.
Re:Painful to read by ameline · 2006-08-14 09:46 · Score: 1

So the desire not to be beaten with a stick (possibly painted orange to look like a carrot :-) is not motivation?

--
Ian Ameline
Re:Painful to read by ameline · 2006-08-14 09:48 · Score: 1

oops I meant "an incentive", not "motivation" in that post. Apologies for typing too fast.

--
Ian Ameline

lossy compression by RenoRelife · 2006-08-13 10:55 · Score: 5, Insightful

Using the same data lossy compressed, with an algorithm that was able to permute data in a similar way to the human mind, seems like it would come closer to real intelligence than the lossless compression would

Re:lossy compression by Anonymous Coward · 2006-08-13 11:49 · Score: 3, Insightful

Funny? That's most intelligent and insightful remark I've seen here in months, albeit rather naively stated.
The human brain is a fuzzy clustering algorithm, that's what neural networks do, they reduce the space of a large
data set by eliminating redundancy and mapping only the salient features of it onto a smaller data set, which in bio systems
is the weights for triggering sodium/potassium spikes at a given junction. If such a thing existed a neural compression algorithm would be capable of immense data reduction. The downsides are that, like us humans, they may be unreliable/non-deterministic in retrieving data because of this fuzzyness. It would also be able make "sideways" associations and draw inferences from the data set, which in essence would be a weak form of artificial intelligence. Now give him his +5 insightful you silly people.
Re:lossy compression by Vo0k · 2006-08-13 11:54 · Score: 3, Insightful

that's one piece, but not necessarily - "lossy" nature of human mind compression can be overcome by "additional checks".

Lossy relational/dictionary based compression is the base. You hardly ever remember text by order of letters or sound of voice reading it. You remember the meaning of a sentence, plus optionally some rough pattern (like voice rhythm) to reproduce the exact text from rewording the meaning. So you understand meaning of sentence, store it as relation to known meanings (pointers to other entries!) then when recalling, you put it back in words, and for exact citation you try to match possible wordings against remembered pattern.

So imagine this: the compressor analyzes a sentence lexically, spots regularities and irregularities, transforms the sentence into a relational set of tokens containing the meaning of the sentence, which are small and easy to store, unambigiously describe the meaning of the sentence but don't contain exact wording. Then an extra checksum of the text of the sentence is added.

Decompressor tries to build a sentence that reflects given idea according to rules of grammar, picking different synonyms, word ordering and such, and then brute-forcing or otherwise matching against the checksum to find which of the created sentences matches exactly.

Look, the best compressor in the world:
sha1sum /boot/vmlinuz
647fb0def3809a37f001d26abe58e7f900718c46 /boot/vmlinuz

Linux kernel compressed to set: { string: "647fb0def3809a37f001d26abe58e7f900718c46", info: "it's a Linux kernel for i386" }

You just need to re-create afile that matches the md5sum and still follows the rules of a Linux kernel. It is extremely unlikely any other file that can be recognized as some kind of Linux kernel and matches. Of course there are countless blocks of data that still match, but very few will follow the ruleset of "ELF kernel executable" structure which can be deduced numerically. So theoretically you could use the hash to rebuild THE kernel just by brute-force creating random files and checking them if they match both the hash and "general properties of a kernel".

The problem obviously lies in unrealistic "brute force" part. The subset of possible rebuilds of the data must be heavily limited. You can do this by lossy compression that allows for limited "informed guess" results - ones that make sense in context of a linux kernel - style, compiler optimizations, use of macros transformed into repeatable binary code. And have the original analysed using the same methods before compression, storing all inconsistencies with the model separately.

So the compression file would consist of:
- a set of logical tokens describing meaning of given piece of data (in relation to the rest of "knowledge base"
- a set of exceptions (where logic fails)
- a checksum or other pattern allowing to verify an exact match.

Most of lossy compressions are meant to obfuscate the lost data. If you use one that instead allows for rebuilding lost data according to certain limited ruleset ("informed guess" + verification) you'd get a lossless compression of comparable efficiency.

--
Anagram("United States of America") == "Dine out, taste a Mac, fries"
Re:lossy compression by Lord+Kano · 2006-08-13 12:11 · Score: 1

Using the same data lossy compressed, with an algorithm that was able to permute data in a similar way to the human mind, seems like it would come closer to real intelligence than the lossless compression would

I have an idea, if you can take the algorithm from this program and put it into a library we'd be a huge step closer to an AI that mimicks traditional intelligence.

LK

--
"Hi. This is my friend, Jack Shit, and you don't know him." - Lord Kano
Re:lossy compression by Anonymous Coward · 2006-08-13 12:27 · Score: 0

> You just need to re-create afile that matches the md5sum and still follows the rules of a Linux kernel. It is extremely unlikely any other file that can be recognized as some kind of Linux kernel and matches.

no. No. NO. There are only 2^128 possible hashes, and the number of possible kernel executables is vastly larger than that. The general idea of your post is interesting, but this part is bullshit.
Re:lossy compression by MindStalker · 2006-08-13 13:01 · Score: 1

I think the idea is to do a lossy compression combined with the hash could reproduce the original by guessing... Maybe..
Re:lossy compression by swillden · 2006-08-13 15:14 · Score: 4, Insightful

You just need to re-create afile that matches the md5sum and still follows the rules of a Linux kernel. It is extremely unlikely any other file that can be recognized as some kind of Linux kernel and matches. Of course there are countless blocks of data that still match, but very few will follow the ruleset of "ELF kernel executable" structure which can be deduced numerically.
Mmmm, no. You were fine up until you said "very few will follow the ruleset". That's not true. To see that it's not true, take your kernel, which consists of around 10 million bits. Now find, say, 512 of those bits that can be changed, independently, while still producing a valid-looking kernel executable. The result doesn't even have to be a valid, runnable kernel, but it wouldn't be too hard to do it even with that restriction.
So you now have 2^512 variants of the Linux kernel, all of which look like a valid kernel. But there are only 2^128 possible hashes, so, on average, there will be four kernels for each hash value, and the odds are very, very good that your "real" kernel's hash is also matched by at least one of them. If by some chance it isn't, I can always generate a whole bunch more kernel variants. How about 2^2^10 of them?
A hash plus a filter ruleset does not constitute a lossless compression of a large file, even if computation power/time is unbounded.

--
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
Re:lossy compression by Eivind · 2006-08-13 18:01 · Score: 1

It's worse than that. Actually with 2^512 kernels and 2^128 possible hash-values there are (about) 2^384 kernels matching each possible hash. That is 39402006196394479212279040100143613805079739270465 44666794829340424572177149721061141426625488491564 0806627990306816 which you will notice is *sligthly* more than your "4".
Re:lossy compression by AxelBoldt · 2006-08-13 18:26 · Score: 1

So you now have 2^512 variants of the Linux kernel, all of which look like a valid kernel. But there are only 2^128 possible hashes, so, on average, there will be four kernels for each hash value,
Make that 2^384 kernels for each hash value, on average.
Re:lossy compression by OverflowingBitBucket · 2006-08-13 18:33 · Score: 1

A probabilistic reconstruction of an executable from a hash and general properties sounds very cool, but why on earth should you want to use a hash that is designed to be resistant to reconstructing the original data? Also, you might want to look at the size of the space for possible executables. There may be small amounts of structure within an executable file, but the code and data will not be so helpful.
Re:lossy compression by Vo0k · 2006-08-13 21:52 · Score: 1

There are only 2^128 possible hashes, and the number of possible kernel executables is vastly larger than that.
Not really. Of course if you took that dummy brute-force approach you'd come up with a large number of completely bogus kernel-like files from which some could match. But if you took the "informed guess" approach, you'd limit the number to number of versioned kernels x number of possible .config setups x number of compilers. The resulting number would be possibly somewhat greater than 2^128 but then you put additional restrains: the info didn't say it's some unusual kernel (it would likely say that if it was) so discard all obscure architectures and odd hardware and modules. Then start from most likely, most recent versions of kernel and compiler and move down the probability tree picking more obscure setups. There are maybe 2^20 of "vanilla kernels" floating around the market and the match would be within them.

sha1sum is a very bad hash for this purpose too, because it doesn't in any way indicate approaching the right solution. You have only 1-bit "correct/incorrect" flag. A hash that gets "more similar" if you change your solution in good direction would be much better.

--
Anagram("United States of America") == "Dine out, taste a Mac, fries"
Re:lossy compression by Vo0k · 2006-08-13 21:58 · Score: 1

I picked it because this is what I had at hand. I know it's horrible for this task, but it simply IS. I didn't want to use a made-up or very obscure example. Of course a hash that helps to reconstruct the file placing certain visible restrains would be much better.

--
Anagram("United States of America") == "Dine out, taste a Mac, fries"
Re:lossy compression by OverflowingBitBucket · 2006-08-13 22:30 · Score: 1

I picked it because this is what I had at hand. I know it's horrible for this task, but it simply IS. I didn't want to use a made-up or very obscure example. Of course a hash that helps to reconstruct the file placing certain visible restrains would be much better.

I think your idea is the seed to something very groovy. Perhaps it would be a good thing to use the hash to check the final result, and some other easily reversible function to "compress" the code in a lossy way such that you have a good chance (say 90%) to reconstruct it. Whilst people are going to wonder what a non-guaranteed compression will offer, one place lossy compression could be useful is with sending data across a link with terrible bandwidth but decent latency. If the result is small enough, it could be sent in advance of the real data. You could even check the result to see if you can "decompress" on the sending side before sending it, and keep dropping back to successively worse schemes each time it fails until you find a good one. If the result is too bad, just send the uncompressed version. Add one bit to the start signaling "uncompressed" or "compressed" and you're set. ;)
Re:lossy compression by OverflowingBitBucket · 2006-08-13 22:39 · Score: 1

I just reread your original comment and you were hinting towards this sort of thing anyway. Sorry that I didn't notice this initially.
Re:lossy compression by swillden · 2006-08-14 11:17 · Score: 1

LOL.
Duh. Excuse me while I go slap myself.

--
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
Re:lossy compression by Eivind · 2006-08-14 21:20 · Score: 1

Consider yourself excused. Happens to the best of us. A friend of mine made that particular gaffe at the oral examination for his master thesis in cryptography. (on improvements to elliptic curve cryptography making the keys more compact without reducing the computational complexity of breaking them)
He still managed to somehow utter that 2^128 was twice as large a number as 2^64. Didn't even fail him, as closer asking showed that he didn't really believe that, it's just that it sorta "looks" that way, cause everyone knows 128 is twice as large as 64.

As long as it is Wiki that we are talking about... by gatkinso · 2006-08-13 10:56 · Score: 3, Funny

There. All of wiki, in 31 bytes.

--
I am very small, utmostly microscopic.

for those who rtfa by Anonymous Coward · 2006-08-13 10:57 · Score: 0

how about telling us:

a) how big the compressed size was

b) how many bytes was wikipedia before it was compressed

Re:for those who rtfa by kfg · 2006-08-13 11:15 · Score: 2, Informative

a) how big the compressed size was

18MB

b) how many bytes was wikipedia before it was compressed

A sample of 100MB

Your goal:
.

KFG
Re:for those who rtfa by maxwell+demon · 2006-08-14 01:05 · Score: 1

Well, it's quite simple. Here's a decompressor for Linux:
#!/bin/sh dd if=/dev/random of=wikipedia bs=1024 size=102400
Instructions for decompression:

You get the text of Wikipedia by just running this program often enough.
As a bonus, you also get the complete works of Shakespeare.
For greater clarity of the algorithm, you may want to make a symlink to /dev/random named /dev/monkey and use that in the program.

Now, where can I get my money? :-)

--
The Tao of math: The numbers you can count are not the real numbers.
Re:for those who rtfa by JamesGecko · 2006-08-14 04:19 · Score: 1

Just as soon as the Judge's computer finishes decoding it. It's slow going but - as an unexpected side effect - we are also getting the complete Library of Babel.
Re:for those who rtfa by maxwell+demon · 2006-08-14 07:30 · Score: 1

Interestingly I got the correct text in the first try. Unfortunately it was encrypted with a one-time pad, and I'm still struggling to get the key ...

--
The Tao of math: The numbers you can count are not the real numbers.

wow! by g253 · 2006-08-13 10:57 · Score: 0, Redundant

It is 1 a.m. here, I just came home, slightly drunk, slightly high, and I just tried to understand this article. It has indeed blown my mind, so to speak.

Re:wow! by XnavxeMiyyep · 2006-08-13 11:47 · Score: 1

And as we advance in AI, we come closer to making computers just like you!

ALERT:
I do not understand your command due to error 1446486428499 (Drunken stupor)

--
I put the 't' in electrical engineering.
Re:wow! by escher · 2006-08-14 06:16 · Score: 1

ERROR: The files are, like, talkin' to me, man! Oh... I get it! The I/O ports are colors! No, they're not representing colors, they are colors! It all makes sense now!

Can it be "lossy" compression? by RicheB · 2006-08-13 10:58 · Score: 1

If lossy compression is OK, I can easily squish all of Wikipedia down to 1 bit.

Re:Can it be "lossy" compression? by Anonymous Coward · 2006-08-13 11:08 · Score: 0

No you can't, computers manipulate and store data by the byte.
Re:Can it be "lossy" compression? by richdun · 2006-08-13 11:15 · Score: 4, Funny

Hmmm...well in that case, someone go edit the Wikipedia entry on "computers" and allow them to store data at the bit level. Also, I heard somewhere where computers in Africa have tripled in the past six months!
Re:Can it be "lossy" compression? by Anonymous Coward · 2006-08-13 11:20 · Score: 0

If lossy compression is OK, I can easily squish all of Wikipedia down to 1 bit.
No, it's not. The deliverable for the competition is an SFX that reproduces the 100 MB test file.
Re:Can it be "lossy" compression? by kfg · 2006-08-13 11:21 · Score: 1

No you can't, computers manipulate and store data by the byte.

That may well be true of your computer, but I've got one right in front of me that manipulates data by the bit.

It's even got several of them.

KFG
Re:Can it be "lossy" compression? by Bill+Kilgore · 2006-08-13 11:22 · Score: 5, Funny

I have a program that compresses 100M of Wikipedia to one bit with no loss at all. The program is somewhat special-purpose, and at 100,024,076 bytes, a little chunkier than I'd like.

--
Rediculous: A word indicating the writer is ridiculously ignorant.
Re:Can it be "lossy" compression? by KiloByte · 2006-08-13 11:32 · Score: 1

Compared to 104857600, you at least got some compression.

--
The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
Re:Can it be "lossy" compression? by Anonymous Coward · 2006-08-13 11:49 · Score: 0

Why so? The test file is exactly 10^8 bytes.
Re:Can it be "lossy" compression? by bcat24 · 2006-08-13 12:03 · Score: 1

Now that's an elephant of a tale.
Re:Can it be "lossy" compression? by StikyPad · 2006-08-13 12:22 · Score: 1

That's ELEPHANTS. Computers in ELEPHANTS have tripled in the past 6 months. Get it right.

--
https://www.eff.org/https-everywhere
Re:Can it be "lossy" compression? by KiloByte · 2006-08-13 12:37 · Score: 2, Informative

Why so? The test file is exactly 10^8 bytes.
I downloaded the corpus, and indeed, you're right -- it's 10^8 bytes. The article is incorrect, it says 100M where it means 95.3M.

This inconsistency doesn't have any effect on the challenge, though -- that 50kEUR[1] is offered for compressing the given data corpus, not for compressing a string of 100MB.

[1] 1kEUR=1000EUR. 1M EUR=1000000EUR. 1KB=1024B. 1MB=1048576B.
And by the way, what about fixing Slash to finally allow Unicode -- either natively or at least as HTML entities?

--
The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
Re:Can it be "lossy" compression? by Anonymous Coward · 2006-08-13 12:39 · Score: 0

When I was reading Wikipedia this morning, I learned that Digg is Slashdot's Portugal.
Re:Can it be "lossy" compression? by Anonymous Coward · 2006-08-13 12:51 · Score: 0

10^8 bytes = 100 MB = 95.367 MiB.

"100M" is so ambiguous it is meaningless.
Re:Can it be "lossy" compression? by Anonymous Coward · 2006-08-13 13:09 · Score: 0

No, computers manipulate and store data by the word.
Re:Can it be "lossy" compression? by Anonymous Coward · 2006-08-13 13:49 · Score: 0

I'd like to see you make a one-bit-long file. Go ahead, try!
Re:Can it be "lossy" compression? by kfg · 2006-08-13 14:00 · Score: 1

I'd like to see you make a one-bit-long file.

This not the appropriate media for that.

Go ahead, try!

But I tried, succeeded; and when run the display indicates a value of "true."

The way your computer works is not indicative of what computers can or cannot do.

KFG
Re:Can it be "lossy" compression? by ameline · 2006-08-13 15:07 · Score: 1

That is definitely a needlessly large compressor. I can losslesly compress 100meg of wikipedia down to 1 bit, and my compressor program is only a small handful of bytes. My *decompressor*, on the other hand is large. (but conveniently automatically generated by the compressor program :-)

--
Ian Ameline
Re:Can it be "lossy" compression? by Anonymous Coward · 2006-08-13 15:36 · Score: 0

Some computers.
Re:Can it be "lossy" compression? by Anonymous Coward · 2006-08-13 22:09 · Score: 0

You should talk to this guy:

http://www.geocities.com/patchnpuki/other/compress ion.htm
Re:Can it be "lossy" compression? by Lorkki · 2006-08-14 01:02 · Score: 1

I downloaded the corpus, and indeed, you're right -- it's 10^8 bytes. The article is incorrect, it says 100M where it means 95.3M.

Not incorrect, merely ambiguous. 100 MB in SI terms is exactly 10^8 bytes, and you can run that by just about any mass storage manufacturer. Powers of two are commonly used to refer to memory, and even to mass storage in software, but it's not uncommon for the "standards" to clash.
Re:Can it be "lossy" compression? by Cryptnotic · 2006-08-14 05:38 · Score: 1

100M means 100 meters, which has meaning, but doesn't make any sense in the context of a wikipedia corpus.

--
My other first post is car post.
Re:Can it be "lossy" compression? by L7_ · 2006-08-14 06:13 · Score: 1

Actually 100M is undefined, as 'meters' are always abbreviated lowercase. Only those units that are named after scientists are capatalized in thier abbreviation (be it english or latin letters). Think about it.

Who'da thunk... by blueadept1 · 2006-08-13 10:59 · Score: 5, Funny

Man, WinRar is taking its bloody time. But oh god, when its done, I'll be rich!

Re:Who'da thunk... by mobby_6kl · 2006-08-13 11:15 · Score: 1

Damn you and your WinRAR! When is the deadline? WinRK says it needs 3 days 14 hours, you might be finished before then, but I'll surely take the prize... when it's done®
Re:Who'da thunk... by bky1701 · 2006-08-13 14:12 · Score: 1

Make sure to set it to best...

--
Great Intellect...
Re:Who'da thunk... by 3D+Lover · 2006-08-13 15:53 · Score: 1

The the file size goes right back up when your usenet clients re-encodes with yenc. At least it's not as bad as uuencode was.

Easy! by RyuuzakiTetsuya · 2006-08-13 10:59 · Score: 1, Funny

arj

--
Non impediti ratione cogitationus.

Lossy Compression? by Millenniumman · 2006-08-13 11:06 · Score: 3, Funny

Convert it to AOL! tis wikpedia, teh fri enpedia . teh bst in da wrld.

--
Stupidity is like nuclear power, it can be used for good or evil. And you don't want to get any on you.

Re:Lossy Compression? by blueadept1 · 2006-08-13 11:20 · Score: 2, Insightful

That is actually an interesting idea. What if you added a layer of compression that converted every possible common acronym, made contractions, etc...
Re:Lossy Compression? by larry+bagina · 2006-08-13 11:28 · Score: 2, Interesting

1) it wouldn't be lossless and 2) most compression techniques use a dictionary of common used words.

--
Do you even lift?
These aren't the 'roids you're looking for.
Re:Lossy Compression? by blueadept1 · 2006-08-13 14:41 · Score: 1

It would be lossless. You could create an algorithm to re-expand those acronyms, and who cares about the contractions, they still mean the same thing.
Re:Lossy Compression? by Anonymous Coward · 2006-08-13 18:58 · Score: 0

It would be lossless. You could create an algorithm to re-expand those acronyms, and who cares about the contractions, they still mean the same thing.
Huh? So you're saying it's *not* lossless, it's an acceptable loss. That's not the same thing.

But if you were reproducing a famous speech verbatim, or for something like poetry or Shakespeare where the rhythm of the words is key, abbreviations are clearly unacceptable.
Re:Lossy Compression? by Mac+Degger · 2006-08-14 01:18 · Score: 1

"and 2) most compression techniques use a dictionary of common used words."

Well, shit. I'm no compressionist/encryptologist/whatever, but that was my first thought when I read the article/blurb. I even started looking into median word lenghts for a language and bit lenghts versus word letter lenghts in a language and average texts. Did a nice little plain unecoded word-bitlenght/corresponding dictionary-number bitlenght comparison to see when you would get compression (and some looking into encoding to see how best to keep short words short when encoded)... ...and you tell me I coulda just checked j.random compression link found on google to see that has been implemented long before :(

--
-- Waht? Tehr's a preveiw buottn?

Comparison by ronkronk · 2006-08-13 11:08 · Score: 2, Informative

There are some amazing compression programs out there, trouble is they tend to take a while and consume lots of memory. PAQ gives some impressive results, but the latest benchmark figures are regularly improving. Let's not forget that compression is not good unless it is integrated into a usable tool. 7-zip seems to be the new archiver on the block at the moment. A closely related, but different, set of tools are the archivers, of which there are lots with many older formats still not supported by open source tools

Re:Comparison by joshier · 2006-08-13 11:14 · Score: 1

If many of the same words are repeated, why not abbreviate them, and get the end computer to compile the structure where they belong?

for example, "the" "it" "has" will be repeated millions of times, why not just write a script to put X amount of 'the' is put in at x x x position in this paragraph.
Simple.
Re:Comparison by DarkProphet · 2006-08-13 11:23 · Score: 1

erm... isn't that already a part of how data compression algorithms like ZIP work right now?

--
What could possibly hurt the security of the American people more than giving our own government the ability to hide its
Re:Comparison by Breakfast+Pants · 2006-08-13 11:29 · Score: 1

Whoa, you are an inventive genious! Oh wait, that's kinda how nearly all compression works.

--

--

WHO ATE MY BREAKFAST PANTS?
Re:Comparison by joshier · 2006-08-13 12:55 · Score: 3, Funny

Well, if I knew that 15 years ago, I would indeed have been a genuis, sadly I realized too late and my genuis talents are wasted yet again.

Have no fear though, I'm working on a new one.

It's a big world out there by Harmonious+Botch · 2006-08-13 11:09 · Score: 4, Interesting

"The basic theory...is that after any set of observations the optimal move by an AI is find the smallest program that predicts those observations and then assume its environment is controlled by that program." In a finite discrete environment ( like Shurdlu: put the red cylinder on top of the blue box ) that may be possible. But in the real world the problem is knowing that one's observations are all - or even a significant percentage - of the possible observations.
This - in humans, at least - can lead to the cyclic reinforcement of one's belief system. The belief system that explains observations initially is used to filter observations later.

TFA is a neat idea theoreretically, but it's progeny will never be able to leave the lab.

--
I figured out how to get a second 120-byte sig! Mod me up and I'll tell you how you can have one too.

Re:It's a big world out there by Baldrson · 2006-08-13 11:16 · Score: 1

But in the real world the problem is knowing that one's observations are all - or even a significant percentage - of the possible observations.
This is precisely the assumption of Hutter's theory.
Chapter 2 of his book "Simplicity & Uncertainty" deals with this in more detail but the link provided does do an adequate job of stating:
The universal algorithmic agent AIXI. AIXI is a universal theory of sequential decision making akin to Solomonoff's celebrated universal theory of induction. Solomonoff derived an optimal way of predicting future data, given previous observations, provided the data is sampled from a computable probability distribution. AIXI extends this approach to an optimal decision making agent embedded in an unknown environment.

--
Seastead this.
Re:It's a big world out there by Baldrson · 2006-08-13 11:32 · Score: 1

This - in humans, at least - can lead to the cyclic reinforcement of one's belief system. The belief system that explains observations initially is used to filter observations later.
There is no allowance for lossy compression. The requirement of lossless compression is there for precisely the reason you state.

--
Seastead this.
Re:It's a big world out there by nacturation · 2006-08-13 11:46 · Score: 1

... but it's progeny will never be able to leave the lab.

"it is progeny"? Damn, I thought we'd fixed that bug. Back to the lab with you!

--
Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
Re:It's a big world out there by Kjella · 2006-08-13 11:50 · Score: 1

But in the real world the problem is knowing that one's observations are all - or even a significant percentage - of the possible observations.

No, that's just deductive science. I (or we, as a society) haven't tested that every cup, glass and plate in my kitchen (or the world) is affected by gravity, but I'm preeeeeeetty sure they are.

The problem - and the really hard AI problem - is that there is no single "program", there's in fact several billion independent "programs" running. These "programs" operate in many different ways, and would act differently in the same situation. Without understanding the individual "programs", at best it'll find some "common sense", which would be as useful as the average family having 2.4 children. It'll get completely thrown off track by individuality.

--
Live today, because you never know what tomorrow brings
Re:It's a big world out there by gardyloo · 2006-08-13 11:54 · Score: 4, Funny

TFA is a neat idea theoreretically, but it's progeny will never be able to leave the lab.

Your use of "TFA" is a good compressional technique, but you could change "it's" to "its" and actually GAIN in meaning while losing a character! You're well on your way...
Re:It's a big world out there by DrJimbo · 2006-08-13 12:13 · Score: 4, Informative

Harmonious Botch said:
This - in humans, at least - can lead to the cyclic reinforcement of one's belief system. The belief system that explains observations initially is used to filter observations later.
I encourage you to read E. T. Jaynes' book: Probability Theory: The Logic of Science. It used to be available on the Web in pdf form before a published version became available.

In it, Jaynes shows that an optimal decision maker shares this same tendency of reinforcing exiting belief systems. He even gives examples where new information reinforces the beliefs of optimal observers who have reached opposite conclusions (due to differing initial sets of data). Each observer believes the new data further supports their own view.

Since even an optimal decision maker has this undesirable trait, I don't think the existence of this trait is a good criteria for rejecting decision making models.

--
We don't see the world as it is, we see it as we are.
-- Anais Nin
Re:It's a big world out there by Anonymous Coward · 2006-08-13 12:19 · Score: 0

Cyclic reinforcement is redundant.

--
I figured out how to get a sig as an AC. Mod parent down and I'll tell you how!
Re:It's a big world out there by Ignis+Flatus · 2006-08-13 12:22 · Score: 2, Insightful

I think the original premise is wrong. Real world intelligence is not lossless. The algorithms only have to be right most of the time to be effective. And our intelligence is incredibly redundant. If you want robust AI, you're going to have to accept redundancy and imperfection. Same goes for data transmission. Sure, you compress, but then you also add in self-error correcting codes with a level on redundancy based on the known reliability of the network.
Re:It's a big world out there by Baldrson · 2006-08-13 12:49 · Score: 2, Informative

In it, Jaynes shows that an optimal decision maker shares this same tendency of reinforcing exiting belief systems. He even gives examples where new information reinforces the beliefs of optimal observers who have reached opposite conclusions (due to differing initial sets of data). Each observer believes the new data further supports their own view.
I think what Hutter has shown is that there is a solution which unifies the new data with the old within a new optimum, which is most likely unique. I think it is based on the idea that Kolmogorov complexity is a unique value for any string and is most likely represented by a single optimum program (the "self-extracting archive" of the string).

--
Seastead this.
Re:It's a big world out there by smug_lisp_weenie · 2006-08-13 13:08 · Score: 1

> Real world intelligence is not lossless. The algorithms only have to be right most of the time to be effective.

Right- then all you need to do is run the data through the AI system and make a list of the few times it is wrong- This would be a small list. Then add this list to the end of the compressed data- If the AI is any good then you should still have fantastic compression.

...so now you've taken intelligence that is "lossy" and made it "lossless" in a highly efficient manner.

...I can't see how your argument disproves the central premise...
Re:It's a big world out there by jpellino · 2006-08-13 13:19 · Score: 1

"The belief system that explains observations initially is used to filter observations later. "

Hoo-ray for empiricism! Which as Hume pointed out is circular reasoning.
It does keep one from getting hit by buses, and is the driving force behind the alarm clock industry.
So it's not a total waste, nevermind the nasty philosophical bit.

--
"Win treats sysadmins better than users. Mac treats users better than sysadmins. Linux treats everyone like sysadmins."
Re:It's a big world out there by kognate · 2006-08-13 13:45 · Score: 3, Interesting

Yeah, but you can use Turbo codes to achieve near Shannon limit, and you don't have to worry too much about the addition of the ECC. Remember kids: study that math, you never know when information theory can suddenly pay off.

Just to help (and so you don't think I made Turbo Codes up -- it's sounds like I did 'cause it's such a bad name)
http://en.wikipedia.org/wiki/Turbo_code
Re:It's a big world out there by Short+Circuit · 2006-08-13 14:01 · Score: 1

No, that's just deductive science. I (or we, as a society) haven't tested that every cup, glass and plate in my kitchen (or the world) is affected by gravity, but I'm preeeeeeetty sure they are.

I thought that was inductive logic.

--
tasks(723) drafts(105) languages(484) examples(29106)
Re:It's a big world out there by DrJimbo · 2006-08-13 14:15 · Score: 1

Be that as it may (and I highly doubt there will always be unique solutions) it is a separate issue entirely from what the original poster and I were talking about. We were talking about the problem of deciders becoming prejudiced over time, falling into a rut where new information tends to confirm and reinforce existing "beliefs".

This effect is often seem clearly when as I said before and you quoted:
... optimal observers who have reached opposite conclusions (due to differing initial sets of data)
The concept of a unique optimal solution giving one set particular set of new and old data doesn't really enter into the discussion we were having.

But that's okay, you still get a wonderful parting gift. Thanks for playing!

--
We don't see the world as it is, we see it as we are.
-- Anais Nin
Re:It's a big world out there by NichG · 2006-08-13 15:59 · Score: 1

Not necessarily. One has to consider what information is meaningful in the context of the AI and what information is not meaningful. Take something in which you have a signal that contains certain ordered information and a certain gaussian noise term. You are interested in the ordered information, so extracting that and storing that is lossy - it may reduce the signal from a megabyte to 5 bytes which correspond to five symbols transmitted over the lossy medium. But then even if you write 'add gaussian noise with this standard deviation and that amplitude' you have a lossy compressor because byte-for-byte you get a different signal even if its statistically similar. To make it lossless, you have to include a record of a lot of useless random information which the AI may not care about because it is meaningless in the AI's context.

Every intelligence we know of emerged in an environment where stresses made certain factors important and other factors unimportant. By taking that into account, intelligences can be excellent compressors because they know what to discard. This doesn't even have anything to do with clever decomposition of data into generators - you achieve a lot simply by throwing away what isn't necessary.
Re:It's a big world out there by squoozer · 2006-08-14 03:31 · Score: 1

It seems to still be available in e format - not that I have the faintest clue what it's going on about.

--
I used to have a better sig but it broke.

lzip! by w00d · 2006-08-13 11:11 · Score: 1

Just use lzip! 100% compression on any data, even if it's already been compressed by another utility! It works fantastically, but you may run into trouble if you try uncompressing the data.

Re:lzip! by Jeremi · 2006-08-13 13:04 · Score: 1

Just use lzip! 100% compression on any data, even if it's already been compressed by another utility! It works fantastically, but you may run into trouble if you try uncompressing the data.

Bah, lzip is nothing compared to azip, which has all the infinite-compression goodness of lzip, but also supports lossless decompression. The downside is that it currently only runs under BeOS, but it comes with source so it should be easy enough to port to (whatever).

--

I don't care if it's 90,000 hectares. That lake was not my doing.

Easy compression rule by Anonymous Coward · 2006-08-13 11:12 · Score: 0

One key to the compression of data is to reduce the set of tokens
used such that all sentences can be constructed using the smallest
dictionary. That means not inventing new words and saying "incentivize"
where "motivate" would do.

Seriously, if the Wiki was written using good English and correct spelling
that would be a major step forward to data reduction.

Re:Easy compression rule by joshier · 2006-08-13 11:24 · Score: 1

I mentioned this above in a reply too.

What I dont understand is why we cannot zip up open source programs to an extreme amount.

For example, the zip archive tool scans the whole code, it finds repeats in the code (1001010), abbreviates them and then indexes them in a seperate file within the archived file, then when the other computer begins to extract, it takes on the work of plonking back the repeated code, making the archive tiny.

Yes, takes a lot more CPU work, but the download is so much smaller.
Re:Easy compression rule by Kadin2048 · 2006-08-13 11:37 · Score: 1

the zip archive tool scans the whole code, it finds repeats in the code (1001010), abbreviates them and then indexes them in a seperate file within the archived file, then when the other computer begins to extract, it takes on the work of plonking back the repeated code, making the archive tiny.

Huh? What do you think the compression software is doing right now? It searches through the file for blocks of similar information, and then replaces those blocks with pointers to an index, where it stores the block itself. The more repetition you have in the data, the more compression you can get out of it. Good algorithms use (I believe) variable block sizes, and reset the indices as they work their way through the file based on what is probably optimal.

Putting an extra layer of compression on top of that, by replacing frequently used blocks of code, or replacing English words with abbreviations and re-inflating them at the opposite end (as someone else in this thread suggested), would only make the data being fed into the "real" compression utility a little harder to compress. It's unlikely that you could do a better job than any reasonable compression program does already with the same data.

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
Re:Easy compression rule by rm999 · 2006-08-13 11:41 · Score: 1

Indexing into that extra file table takes bits. For example, if you have 1024 different possible abbreviations, you need to use on average 10 bits (at the least) to index that table for each abbrevation. You could use less bits for the more common ones (like huffman coding), but you are still wasting space indexing.
Re:Easy compression rule by Anonymous Coward · 2006-08-13 12:03 · Score: 1, Insightful

If you're unfamiliar with the current state of the art in a field that's under intensive study, it's a near guarantee that your new ideas either don't work or are already a very basic part of the current technology. There's really no shortcut to doing the background reading and understanding how things are currently done before setting out to improve them. If you want a starting point in this area of study, here's one; once you've gotten through that you can start on the remaining 22 years' worth of research.
Re:Easy compression rule by LiquidCoooled · 2006-08-13 12:29 · Score: 1

Since the decimals along Pi are essentially an infinate lookup table (there are phone number lookup programs which find information in pi), couldn't you use the calculation of pi to produce your data tree.

The entire 100mb file and everything else should be identifiable by a single location along pi.
mmmmmmmm pi.

--
liqbase :: faster than paper
Re:Easy compression rule by ampathee · 2006-08-13 12:33 · Score: 1

wikiped doubleplusungood ref compression rewrite fullwise newspeak upsub antefiling
Re:Easy compression rule by bhaberman · 2006-08-13 12:56 · Score: 1

Nope, the number of digits needed to address any significant data contained in pi would be phenomenally large.
Re:Easy compression rule by Jeremi · 2006-08-13 13:15 · Score: 1
It's unlikely that you could do a better job than any reasonable compression program does already with the same data.

Very simple way to win this contest:
1. Write a program that generates the digits of PI
2. Have the program run until it comes across a sequence of digits in PI that are equivalent to the Wikipedia file
3. At that point, the program can just print out the index/offset into PI where the sequence starts, and that is your compressed output. 'Decompressing' it again is just a matter of looking up that sequence of digits in PI.
Couldn't be easier! ;^)
--

I don't care if it's 90,000 hectares. That lake was not my doing.
Re:Easy compression rule by morcego · 2006-08-13 13:41 · Score: 1

In theory, you can end up with a location that takes a number so large that you need more than 100MB (mb = milibits ?) to store it.

Also, you would have to either store enough digits of pi (which would defeat any compression ideas), or calculate it "on site", which would be ... hummm ... kind of fun, I suppose.

--
morcego
Re:Easy compression rule by lazybeam · 2006-08-13 13:42 · Score: 1

"The end result of this is a very fast way to find a seven digit sequence within PI. So have fun, search for your phone number or someone elses."

I have an eight-digit phone number, you insensitive clod.

"The number 0000000 was found starting at position 3794572 to the right of the decimal point."

So replacing one seven digit number with another one, good work. ;-)

--
-- no sig for you. come back one year.
Re:Easy compression rule by 4D6963 · 2006-08-13 13:57 · Score: 1

Very simple way to win this contest:...
The only problem is that your index will be as big as your original file.
Let's say you want to store your ten-digit phone number using that PI technique. A ten-digit phone number will be found at an average index of 10^10, or 10,000,000,000, the problem is that index is 11 digits long, so your really not winning. In this particular case, you could hope to find your index around 8^100,000,000, which would take 100,000,000 bytes (or 100 MB, the original size) to store.
Couldn't be easier! ;^)
hehehe, right.

--
You just got troll'd!
Re:Easy compression rule by Anonymous Coward · 2006-08-13 14:53 · Score: 0

On a similar note, I wonder how much of Wikipedia consists of typos. I tend to find one every 10 or 15 articles, and I'm not even looking for that sort of thing when I'm reading. Maybe Wikipedia should run articles through a spell check before letting you submit anything.
Re:Easy compression rule by Jeremi · 2006-08-13 14:54 · Score: 1

The only problem is that your index will be as big as your original file.

Well... maybe. It depends a lot on what data you are trying to compress. As a best-case scenario, it can compress the entire decimal expansion of PI down to a single digit: "0". Infinite digits compressed to a single byte, not bad eh? ;^)

--

I don't care if it's 90,000 hectares. That lake was not my doing.
Re:Easy compression rule by 4D6963 · 2006-08-13 16:04 · Score: 1

Well... maybe. It depends a lot on what data you are trying to compress.
Well yes and no. Either your data is the decimals of PI, as you just said, or it is some particular case that is extremely unlikely to happen. Correct me if I'm wrong, but you would have about 1 chance out of 8 to gain a byte in your index data (or, for example, you would have about 1 chance out of 1 billion to save 10 bytes, out of 100 MB, well that's if I didn't make any mistake)

--
You just got troll'd!
Re:Easy compression rule by bnenning · 2006-08-13 17:51 · Score: 1

In theory, you can end up with a location that takes a number so large that you need more than 100MB (mb = milibits ?) to store it.

Not only can, but will the vast majority of the time. Most strings are incompressible by *any* algorithm. (Strings consisting of human-readable text are of course exceptions). And yet, past a certain size it is unprovable that any specific string is incompressible. Algorithmic information theory is wacky.

--
How to solve most of our problems: 1.Lots of nuclear plants. 2.Cure aging.
Re:Easy compression rule by LiquidCoooled · 2006-08-13 23:10 · Score: 1

Nahhhhh you misunderstand the concept here,
Let's suppose the binary string I am trying to compress is "141592653589793238462643383279502884", this exists at position 1 to the right of the decimal point.

Since Pi is infinite there will be a matching location for *any* string somewhere within pi, you just have to tell them the index and they can calculate it (negating the fact that they might require a few millennia to decompress the "hello world" message).

The same should also be possible by supplying coordinates and vectors into the mandelbrot set, letting the user regenerate the results based upon this lookup.

Never having to store the dictionary away would enable higher compression ratios.

Its certainly not practical, but it isn't impossible to conceive.

--
liqbase :: faster than paper
Re:Easy compression rule by maxwell+demon · 2006-08-14 02:27 · Score: 1

In theory, you can end up with a location that takes a number so large that you need more than 100MB (mb = milibits ?) to store it.

Not only can, but will the vast majority of the time.

Well, then the solution is easy: Instead of searching the position in pi for "compression" and reading from for "decompression", do it the opposite way: For compression, look up what is at the corresponding position of pi, and for decompression, find the position in pi where the coded information can be found. Given that the first method vastly increases the size almost always on "compression" and therefore vastly decreases the size on "decompression", doing the reverse should of course vastly decrease the size on compression and vastly increase it on decompression. :-)

--
The Tao of math: The numbers you can count are not the real numbers.

Er, I'm not so sure about this. by aiken_d · 2006-08-13 11:15 · Score: 3, Interesting

Given that the hypothesis is valid (which is arguable), it seems to me that compressing wikipedia is a fairly useless way of supporting it. It seems like an abstraction error: Wikipedia is *not* a set of rules that predict the observations in it. It's a list of observations, sure, but there's no ruleset involved. Now, someone/thing who can read and parse language can get educated based on the knowledge in wikipedia, but then the intelligence is providing the ruleset, just training itself with the raw data in wiki.

It really seems like one of those mistaking-the-map-for-the-territory errors.

-b

--
If I wanted a sig I would have filled in that stupid box.

Re:Er, I'm not so sure about this. by Baldrson · 2006-08-13 12:12 · Score: 1, Offtopic

Wikipedia is a representation of accumulated human knowledge (experience) presented primarily in natural language. The smallest self-extracting archive will necessarily have rules that imply that knowledge and in such a way that the rules of natural language are represented as well.
The distinction between compressed experience and rules is an illusion. Rules _are_ compressed experience in the same sense that "x+y=z" is a compressed representation of the table of all ordered pairs (x,y) of numbers and their sum (z). If one has 2**32 distinct rows in such a table then one can physically represent it in a very small program. Even if one has only a small sample of the rows, one may still quite profitably compress those rows with the program.

--
Seastead this.

Too vague to consider whats best... by Anonymous Coward · 2006-08-13 11:24 · Score: 0

Sure, the common formats such as RAR and bzip2 produce highly compressed files, but they would also be impractical to use in an interactive sense just because the compression and decompression operations are slooooow.

On the other hand, zlib is a commonly used format since it compresses quick, abliet not as well as other formats. Even quicker is lzo, but again it doesn't compress as well.

I can convert the data to 1 bit. by jez9999 · 2006-08-13 11:24 · Score: 0, Redundant

Of course, you'll need a ~100MB program to extract it...

--
== Jez ==
Do you miss Firefox? Try Pale Moon.

Re:I can convert the data to 1 bit. by joshier · 2006-08-13 11:27 · Score: 1

If you're being serious then, that's briliant, where can I download your app?
Re:I can convert the data to 1 bit. by gkhan1 · 2006-08-13 11:39 · Score: 1

If you read the rules of the contest it states that a submission has to be a single executable that produces the 100 mb file. It's the size of that decompressor that counts. So no, you couldn't do that.

Solution. by Funkcikle · 2006-08-13 11:34 · Score: 5, Funny

Removing all the incorrect and inaccurate data from the Wikipedia sample should "compress" it down to at least 20mb.

Then just apply your personal favourite compression utility.

I like lharc, which according to Wikipedia was invented in 1904 as a result of bombarding President Lincoln, who plays Commander Tucker in Star Trek: Enterprise with neutrinos.

Re:Solution. by Anonymous Coward · 2006-08-13 11:45 · Score: 0

lharc according to the Wikipedia.
Re:Solution. by Dr.Ruud · 2006-08-14 01:10 · Score: 1

All your millibits are belong to us.

got it by syrinx · 2006-08-13 11:35 · Score: 1

I wrote a script right here. It will programatically generate all of Wikipedia. Eventually.

--
Quidquid latine dictum sit, altum sonatur.

WHOOSH! by Anonymous Coward · 2006-08-13 11:37 · Score: 0

there it goes!

That's easy by CastrTroy · 2006-08-13 11:39 · Score: 1

That's easy, all you have to do is run your program on a computer that users 32-bit bytes. That way you can fit more bits in your bytes, and automatically beat the record by 4 times.

--

Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.

Re:That's easy by Anonymous Coward · 2006-08-13 13:35 · Score: 0

Why stop at 4-bit bytes? Design a computer (emulated, unless you have a chip fab in your garage or something...) that uses something like 1 terabit bytes. Let's see someone do better than 1 byte! ;)
Re:That's easy by fireboy1919 · 2006-08-13 14:43 · Score: 1

Bytes are always eight bits. Nibbles are always four bits. Kilobytes are always 1024 bytes. Are you seeing a pattern?

The word you should be using is, well, "word." That's what we used to call the bus width of the system. Today, though, the world is much more complicated. Lots of different buses, lots of different bus widths. CISC instruction sets are more CISC than they used to be (outside of mainframes).

So really, that doesn't apply very well even in the way you intended it.

--
Mod me down and I will become more powerful than you can possibly imagine!
Re:That's easy by Anonymous Coward · 2006-08-13 18:19 · Score: 0

Octals are always eight bits. Bytes are *usually* (not always) eight bits.
Re:That's easy by vidarh · 2006-08-13 21:41 · Score: 1

Either you're a newbie or you have poor memory. The word "byte" most certainly has been applied to a wide variety of number of bits.

Well.. doesn't the dictionary make it smaller??? by popo · 2006-08-13 11:40 · Score: 1

Doesn't the dictionary in PAQ8A,B,C,D result in smaller filesizes if you're talking about a 100M+ large file?

--
------ The best brain training is now totally free : )

Painful to read - only if you expect simple stuff by Anonymous Coward · 2006-08-13 11:40 · Score: 0

I proofread TFA, and there are no errors that I can see. Ok, the grammar is a bit more complex than Dr Suess, but /.ers have to learn to deal with english sooner or later.
Beside, it IS an article on compression...

Finally a Positive Slashdot article by theuser22 · 2006-08-13 11:45 · Score: 1

This is the type of contest we need to see more of.

Incentivize? by noidentity · 2006-08-13 11:47 · Score: 5, Funny

the intent of which is to incentivize the advancement of AI

Sorry, anything which uses the word "incentivize" does not involve intelligence, natural or artificial.

Re:Incentivize? by DavidD_CA · 2006-08-13 13:31 · Score: 1

Please join me in my personal crusade to eliminate the word "incent" and "incentivize" from our culture.

The root word is "incentive" and it wasn't until the last few decades that people came up with "incent", "incentivize", and -- god forbid -- "disincent".

May I suggest these alternate words:
- encourage
- give incentive
- influence
- motivate
- stimulate

Hell, even "prod" is a better word. Now let us raise our torches and pitch forks and put these rogue words to rest.

--
-David
Re:Incentivize? by ockegheim · 2006-08-13 15:52 · Score: 1

Incentivation was a word bandied about Australian political circles for a while. I can hardly believe it began 19 years ago, but I remember the word got short shrift from anyone who cared about the English language.

--
I’m old enough to remember 16K of memory being described as “whopping”
Re:Incentivize? by nrlightfoot · 2006-08-13 17:29 · Score: 1

Give him a break, he's a physicist gone computer scientist. They are notorious for inventing words. Why, the last time I had a class (fortran) with one he used the word deliminator several times.

--
what sig?

Wrong contest by Baldrson · 2006-08-13 11:55 · Score: 3, Informative

That's another contest that is useless for the reason you cite.

The contest for the Hutter Prize requires the compressed corpus to be a self-extracting archive -- or failing that to add the size of the compressor to the compressed corpus.

--
Seastead this.

I'll try: by dcapel · 2006-08-13 11:59 · Score: 5, Funny

echo "!#/bin/sh\nwget en.wikipedia.org/enwiki/" > archive

Mine wins as it is roughly 40 bytes total.To get your results, you simply need to run the self-extracting archive, and wait. Be warned, it will take a while, but that is the cost of such a great compression scheme!

--
DYWYPI?

Re:I'll try: by WaltFrench · 2006-08-13 12:21 · Score: 1

> There are two kinds of people: 1) those who start arrays with one and 1) those who start them with zero.

New computer math: learn to count to 3. Pascal programmers -- people, almost all of 'em -- easily write things like, VAR a: ARRAY[-128..127] OF REAL; Much harder to screw up array bounds when both lower and upper bounds are explicit, of any countable type and run-time checked.

--
"Inquiring Minds Want to Know!"
Re:I'll try: by MarkRose · 2006-08-13 12:37 · Score: 3, Funny

echo "!#/bin/cat /dev/tty0" > archive

Here's one that's even shorter, but you have to type in the decryption key exactly right.

--
Be relentless!

Good idea by wetfeetl33t · 2006-08-13 12:01 · Score: 1

If someone could write a program that simulates 1000 virtual monkeys, all typing on typewriters, eventually they would end up with wikipedia. Or maybe they would end up with the equivalent of wikipedia very quickly.

--
Register the editry.

Re:Good idea by Feyr · 2006-08-13 13:37 · Score: 1

no that's wrong. the posit is for an INFINITE number of monkeys typing for infinity. 1000 isn't enough
Re:Good idea by psm321 · 2006-08-13 13:42 · Score: 1

No, even 1 should be enough, given infinite time.

Reductionary Transformation Theory by transami · 2006-08-13 12:02 · Score: 1

Not to be concited, but I thought of that over a decade ago. I labeled it Transformation Theory. The theory essentially says, given an input and a desired output am explicit mapping can be drawn between the two and algorithms (and hence AI) derive from applying reductions to the mapping (eg. compression). I later dubbed it Reductionary Transformation Theory, so as not to confuse it with another meme by the same name.

--
:T:R:A:N:S:

Re:Reductionary Transformation Theory by Anonymous Coward · 2006-08-13 12:15 · Score: 0

Not to be concited, but I thought of that over a decade ago. I labeled it Transformation Theory. The theory essentially says, given an input and a desired output am explicit mapping can be drawn between the two and algorithms (and hence AI) derive from applying reductions to the mapping (eg. compression). I later dubbed it Reductionary Transformation Theory, so as not to confuse it with another meme by the same name.

Since no one has ever heard of you or your theory (and never will), it seems like you've got fair license to name it whatever the hell you want.
Re:Reductionary Transformation Theory by greg1104 · 2006-08-13 20:02 · Score: 1

Not to be concited, but I thought of that over a decade ago

No problem there, conceited people have better spelling than you.

Sweet! by Anonymous Coward · 2006-08-13 12:09 · Score: 0

Save Wikipedia millions, win a free t-shirt. (S&H Extra)

Sign me up!

Is lossless really best by Anonymous Coward · 2006-08-13 12:10 · Score: 2, Interesting

I would argue that lossless compression really is not the best measure of intelligence. Humans are inherently lossy in nature. Everything we see, hear, fear, smell, and taste is pared down to its essentials when we understand it. It is this process of discarding irrelevant detials and making generalizations that is truly intelligence. If our minds had lossless compression we could regurgitate textbooks, but never be able to apply the knowledge contained within. If we really understand, we could reproduce what we've read, but not verbatim. A better measure of intelligence would be lossy text compression that still retains the knowledge contained within the corpus.

Re:Is lossless really best by vidarh · 2006-08-13 21:37 · Score: 1

You're right. But such a competition would be near impossible to judge because it would require reading through the original and the result and ensuring they correspond sufficiently. Lossless compression is a tradeoff.

my algorithm by Anonymous Coward · 2006-08-13 12:12 · Score: 0

I can get that sample down to 1 byte. The compression algorithm is so complex, however, that the compiled decompressor binary is about 100M.

Hutter's theory? by Prof.Phreak · 2006-08-13 12:17 · Score: 1

Sounds quite similar to Solomonoffs' universal prediction.

--

"If anything can go wrong, it will." - Murphy

wikipedia by gEvil+(beta) · 2006-08-13 12:20 · Score: 1

Since it's wikipedia, am I allowed to edit the entries before I compress them? ;-)

--
This guy's the limit!

Good compression != intelligence by Anonymous Coward · 2006-08-13 12:21 · Score: 1, Insightful

The premise of this contest shows a remarkable ignorance of compression theory and technology. True, in theory the more knowledge one has about the world, the better one should be able to compress, but only in the most abstract sense. In practice the vast majority of the bits any "image" of the world like a picture or a piece of text are almost wholly unrelated to high-level features of the world, and thus compression algorithms rightly focus on low-level features of the "image" that are almost wholly separate from the features AI researchers care about.

For example, suppose one wishes to compress two consecutive pictures of a chess tournament. Does knowing the rules of chess help much? Sure, one might use knowledge of chess to deduce a few bits (literally) of information about the second image (what move was made), but when trying to compress two 10-megabit images, who cares? Much better to focus on low-level features such as world smoothnesss, lighting variations, motion flow, et cetera.

Similarly for text and speech. How much does understanding the topic of conversation help? Not much, compared with the knowledge of the most recent several words, which is why all good compression (and prediction) algorithms essentially ignore "understanding" and focus on carefully calculating the mixing of probabilities derived from low-frequency observations.

The winner of a Wiki-compression contest is going to be a variation of an ordinary text compression algorithm, and will use techniques that do not in any obvious way translate to "smart" applications, in the same way that the highly successful N-gram models of speech recognition and machine translation do not have any properties one would normally associate with intelligence.

One can build an "intelligent" compressor, but it's the LAST thing any compression researcher is going to do, because they know high-level intelligence doesn't have many bits of information to provide.

Re:Good compression != intelligence by Anonymous Coward · 2006-08-13 14:46 · Score: 0

This is the best AC post I've seen in months. Why do you post AC when you have something insightful to say? ( I save AC for trivial stuff like this )
Re:Good compression != intelligence by vidarh · 2006-08-13 21:33 · Score: 1

Similarly for text and speech. How much does understanding the topic of conversation help? Not much, compared with the knowledge of the most recent several words, which is why all good compression (and prediction) algorithms essentially ignore "understanding" and focus on carefully calculating the mixing of probabilities derived from low-frequency observations.
This is true up to a point. However, if I know that the discussion is related to a specific concept, I have a context that allow me to leave out a lot of details that can be reconstructed from that context later. If I know that the persons in the conversation are trying to impress each other with repeating as much as they can remember or calculate in their heads of the fibonacci sequence, pi, e, for instance, I can trivially outperform a typical text compression algorithm. But to be able to do that I need to recognise more than the word patterns - I need to recognise the progression and know what the correct way to continue is.
Ultimately the optimal compression a large enough database of human knowledge will include lots of special case information representing concepts needed to understand and handle those special cases in an efficient way. The size of the database will decide how many special cases can be added without a net loss when factoring in the size of the compressor.
How much can be gained this way for Wikipedia? I don't know. That in itself is an interesting research question, as it would involve some sort of measure of information density or Wikipedia.
Of course, a lot better results could be achieved if semantic equivalence was sufficient - but then judging the result would be near impossible.
Re:Good compression != intelligence by Anonymous Coward · 2006-08-13 23:09 · Score: 1, Insightful

We should make a distinction between practical compression algorithms that operate under time constraints, and the hypothetical compressor that uses an extremely complicated ("AI") model.

Consider the compression of text using the best available N-gram models and whatever is the state of the art right now. If you use the same stochastic model to generate a random text of moderate length, you will end up with nonsense with a probability of approximately 100%. This means that most of the potential compressed bitstrings that can be output by the state-of-the-art compressor are useless, and that a better compressor could in principle be more efficient.

Of course, the fact that better compression is possible in principle does not mean that it is achievable in reasonable time on anything resembling our current computer technology.

Out of the box... by MerrickStar · 2006-08-13 12:26 · Score: 1

Print it out... Then it doesn't use any data space, only physical space

Lingusitic Pattern Matching by Anonymous Coward · 2006-08-13 12:27 · Score: 0

I wonder how much linguistic patterns (word inflictions, synonyms, etc) would help in writing a better compression algorithm. There is a free framework one could use to write sophisticated language patterns to target those found in Wikipedia articles, after all, most Wikipedia articles follow certain conventions (e.g. most start with a ISA clause: A roguelike is a computer game that borrows some of the elements of another computer game). The grammar would definitely take time to build, but the results of such a method might be pretty interesting.

C++ by The+Bungi · 2006-08-13 12:32 · Score: 2, Funny

Interestingly enough, the source code for the compressor is C++. One would expect the thing to be written in pure C.

A (good) sign of the times, I guess.

Re:C++ by Anonymous Coward · 2006-08-13 14:29 · Score: 0

Don't worry, they can make it go faster by wrapping the code with this:
#ifdef _cplusplus
extern "C" {
#endif

Another suggestion I have is to shorten those lengthy comments which must take the program a significant amount of time to read; of course, you can't cut them too short or else the computer won't know what it's supposed to do.

I win! by Anonymous Coward · 2006-08-13 12:35 · Score: 0

I can compress a significant amount of Wikipedia's information down to two bytes: 0x42 0x53 (BS)

I win! by multimediavt · 2006-08-13 12:37 · Score: 0, Redundant

Here's my compression of 100M of Wikipedia: "Always consult more than one source of information" Make check payable to ...

My entry by mbstone · 2006-08-13 12:39 · Score: 1

echo I'mright!NoI'mright!You'reanasshole!So'syourmother !! >distilled-wiki.txt

Yes by Baldrson · 2006-08-13 12:59 · Score: 1

And indeed you can come up with all kinds of rules to help you predict the characters in a stream like Wikipedia, not just linguistic rules, but higher level rules are less likely to become crucial with the 100M contest and may need to wait for the 1G (or complete archive of Wikipedia) contest. I can imagine that certain rules relating to relational similarity (analogy) might come into play at the 100M level, particularly since there are now programs that can perform as well as college bound seniors on the SAT verbal analogies test (a heavily 'g' loaded test).

--
Seastead this.

how about.... by ChrisGilliard · 2006-08-13 13:00 · Score: 1

# tar cvf wikipedia.tar /wikipediapath
# bzip2 wikipedia.tar

--
No Sigs!

Re:how about.... by MadMidnightBomber · 2006-08-13 13:16 · Score: 1

# bzip2 wikipedia.tar

Don't forget to bzip2 it a couple more times to make it even smaller!

--
"It doesn't cost enough, and it makes too much sense."

Prizes by Anonymous Coward · 2006-08-13 13:02 · Score: 0

Prizes have the danger of causing people who could have been focusing their energy in something their true talent is in to pursue and (more often) fail in acheiving what somebody else figured was useful.

I'm not usually a negativist.

That said this could be useful so I dont wish to discourage anyone from pursuing it especially people who'd waste their time otherwise.

Remove words and punctuation by failedlogic · 2006-08-13 13:03 · Score: 1

Start with: "the" "and" Then remove periods commas semicolons Speling mistkes ok if readable If other words are commonly used those could be removed It would be a first a fill in the blank-as-you-read encyclopedia Sales staff for door to door enclopedias can convince you new edition saves trees

The above text is a demonstration. No punctuation. Left out "and" and "the". Since some entries are hard to read, the fill-in-the-blank interpretation adds to the experience! ;)

If this new compression is added to the existing compression, I'm sure it'll beat everything.

Re:Remove words and punctuation by maxwell+demon · 2006-08-14 03:07 · Score: 1

Ok, please write the decompressor for this.

--
The Tao of math: The numbers you can count are not the real numbers.
Re:Remove words and punctuation by failedlogic · 2006-08-14 11:21 · Score: 1

Decompress? I was just planning to leave it alone.

But then again, I guess there are rules to said competition. ;)

Could the winning entry... by dpbsmith · 2006-08-13 13:04 · Score: 1

...be a program that creates an "encyclopedia" website and invites humans to contribute to it?

--

"How to Do Nothing," kids activities, back in print!

Two bytes. by Anonymous Coward · 2006-08-13 13:07 · Score: 0

42

I've got it! by Millenniumman · 2006-08-13 13:35 · Score: 1

I've got it!

while (result != wikipedia) { result = [randomGenerator randomStringOfLength:[wikipedia length]]; }

--
Stupidity is like nuclear power, it can be used for good or evil. And you don't want to get any on you.

Re:I've got it! by ValiantSoul · 2006-08-13 14:35 · Score: 1

You must be a Mac guy :) Me too!
Re:I've got it! by Anonymous Coward · 2006-08-13 15:02 · Score: 0

how about this:

result = [randomGenerator randomStringOfLength:[wikipedia length]];
if (result != wikipedia)
{
destroyUniverse();
}

guaranteed to work on the first iteration.
Re:I've got it! by bnenning · 2006-08-13 17:57 · Score: 1

You need to define the wikipedia variable, but that's easy enough to do by first executing another instance of the program and setting wikipedia to the output.

--
How to solve most of our problems: 1.Lots of nuclear plants. 2.Cure aging.
Re:I've got it! by Anonymous Coward · 2006-08-13 19:00 · Score: 0

Wrong. You want:

while (![result isEqualTo:wikipedia])

2 lines of code, and a bug. Amazing. :)
Re:I've got it! by Angostura · 2006-08-13 21:50 · Score: 1

Well, I thought it was funny.

No Need for Lossless Compression by DrunkenTerror · 2006-08-13 13:39 · Score: 1

All the articles worth having will be deleted within 4 month's time, anyway.

And there was me thinking.. by baz1860 · 2006-08-13 13:46 · Score: 2, Funny

that the entire knowledge of the world could simply be compressed without loss to

yeah, you guessed it..

42...

--
He who would trade liberty for some temporary security, deserves neither liberty nor security

Self-extracting on what platform? by tepples · 2006-08-13 13:47 · Score: 1

The contest for the Hutter Prize requires the compressed corpus to be a self-extracting archive

For what architecture?

Bingo by DaveAtFraud · 2006-08-13 13:48 · Score: 1

Absolutely! The "right answer" for best *intelligent* compression would only store a *minimal* set of pertinent data points and then would use intellect to flesh out the details on decompression. Although you could end up with http://www.reducedshakespeare.com/, I'll worry that when some AI researcher starts down this path

Minor disagreement with what you said... The details are relevant; they just aren't important enough to store in and of themselves. Sort of like mathematics where you only need to learn the principle (pertinent data) and how to apply it. Then every example is just a specific case (flesh out the details).

It takes real intelligence to know which priciple to apply or that none apply and you're doing real research. Genius is being able to find such a solution where none existed before. The details are relevant but just not important enough to bother storing since they can be reasonably recreated by applying real inteligence. So, real intelligent compression would mean that different people decompressing the same article at different times would *not* get the same text but enough of the pertinent data would be the same that they all come away from reading the article with the same meaning. That would be real intelligent compression and decompression.

Cheers,
Dave

--
They that can give up essential liberty to obtain a little temporary safety deserve neither safety nor liberty.
Ben

Re:Bingo by maxwell+demon · 2006-08-14 02:58 · Score: 1

So, real intelligent compression would mean that different people decompressing the same article at different times would *not* get the same text but enough of the pertinent data would be the same that they all come away from reading the article with the same meaning.

I don't think your criterion could be met even if all those people would read the very same original text.

--
The Tao of math: The numbers you can count are not the real numbers.

libwikipedia? by tepples · 2006-08-13 13:50 · Score: 1

If you read the rules of the contest it states that a submission has to be a single executable that produces the 100 mb file.

What libraries is this executable allowed to call?

Re:libwikipedia? by gkhan1 · 2006-08-13 17:18 · Score: 1

This is the exact rule (from http://cs.fit.edu/~mmahoney/compression/textrules. html ):
"The decompressor must be a single Windows (x86 32 bit) or Linux (x86 32 or 64 bit) executable file. It is not zipped. The size of the source code (if available) is not used. The decompressor must not depend on any other files not normally part of Windows or Linux such as dictionaries, configuration files, etc.
Which means that you cannot make a 100mb library to go with your file, I guess.
Another thing to note is that the file, enwiki8, isn't actually 100mb, it's 100,000,000 byte, ie 10^8 byte, not 100*2^20 byte. So not really 100mb. To me this is strange, I mean any real CS guy would have gone with binary, right? Only a real newbie would go with the exact decimal number that make very little sense in computer terms.
Re:libwikipedia? by maxwell+demon · 2006-08-14 02:40 · Score: 1

Another thing to note is that the file, enwiki8, isn't actually 100mb, it's 100,000,000 byte, ie 10^8 byte, not 100*2^20 byte. So not really 100mb. To me this is strange, I mean any real CS guy would have gone with binary, right? Only a real newbie would go with the exact decimal number that make very little sense in computer terms.

Of course the file is not just 100 millibits (100mb); that would amount to 1/10 of a bit. It is however 100 MB (100 000 000 bytes). It is, of course, not 100 MiB (104 857 600 Bytes).
See here to get enlightened.

--
The Tao of math: The numbers you can count are not the real numbers.
Re:libwikipedia? by bogado · 2006-08-14 05:11 · Score: 1

This means that it can depend on the dictionary of english, words that is present on several linux distributions? Using pre installed dictionaries on your system seems to me a good start to a good compressor for english texts. simply using indexes for each word is a good way to compress many texts. :-)

--
[]'s Victor Bogado da Silva Lins
^[:wq
Re:libwikipedia? by gkhan1 · 2006-08-14 09:09 · Score: 1

Yes, thank you, I am very aware of the different unit standards mebibytes and all, but I typed "mb" instead of "MB" for two reasons, a) I think it looks better, b) I'm lazy and c) absolutely no one is confused because a "millibit" is impossible by definition (and if you bring up this thing I'm gonna slap you hard). It's incredibly rude and petty to point something like that out. I suggest you mind your manners in the future.
Re:libwikipedia? by maxwell+demon · 2006-08-15 01:13 · Score: 1

It's incredibly rude and petty to point something like that out. I suggest you mind your manners in the future.

Please re-read the last paragraph of your post which I was replying to, and then reconsider if you are really in a position to complain about rudeness.

--
The Tao of math: The numbers you can count are not the real numbers.

There is a theory that by cdrguru · 2006-08-13 14:13 · Score: 1

for any given dataset there is a pseudo-random number generator that, with the proper parameters, produce that dataset of bytes in sequence.

The problem is that it is non-trivial to find the proper algorithm and parameters. However, if trial-and-error computation is not a problem, a large library of potential generators can be tried with all possible parameters.

Re:There is a theory that by ltbarcly · 2006-08-13 17:13 · Score: 1

This is false. There is not such a generator for any dataset, such that the generator can be described using less bytes than the size of the dataset in question. In more clear terms, the cardinality of the set of datasets of size N-bits is larger than the cardinality of the set of psuedo-random number generators which can be described using less than N-bits.

I'll even include a proof. Suppose you could find a psuedo random number generating program for any dataset of size N-bits, and each pseudo random number generator could be described in less than N-bits. Now take that psuedo number generator and take it to be the dataset, and repeat this process to find a smaller pseudo number generating program which generates this pseudo number generating program. Since by hypothesis this can be repeated for any dataset, and therefore any pseudo number generating program, by applying this process over and over we eventually get a program which is one byte long, or one bit long. This is clearly a contradiction.

I apologize for any embarrassment this posting may have caused.
Re:There is a theory that by Mac+Degger · 2006-08-14 01:32 · Score: 1

's not neccesarily a proof....it's incomplete. There may be limits (your proof might be wrong, until you reach a minimum lenght, at which point your proof does hold). Disproof is the fact that I can write a program of a few bytes which creates a larger program.

--
-- Waht? Tehr's a preveiw buottn?
Re:There is a theory that by ltbarcly · 2006-08-14 12:24 · Score: 1

Ok, you don't understand my posting, but that is ok.

My proof does not specify any set size, therefore it is valid for a any size dataset. However, my proof does not claim what you think it claims. You think my proof says that you can't find a program to create a dataset which is smaller than the dataset, for some specified dataset X. Clearly this is false, as you can compress many datasets. However, there are some datasets which cannot be compressed at all using any method. My proof, although probably not nearly perfect, does show this.

Once again, suppose you could compress any dataset of length N to size N-M, where M >=1. This is just saying that you can shrink it by compressing it by at least one byte (the program used to compress the dataset should be included in the total length of the compressed dataset). Now you can repeat this process and get an even smaller dataset. By hypothesis, (this just means the thing we assumed to begin with, that is that we can compress to size N-M a dataset of size N) we can repeat this process and get a dataset of size N-M-M2, where M2 >= 1. Repeating this N times leaves us with a size of equal to or less than 0, proving that you cannot in fact compress any dataset.

This is a complete proof that there exists some dataset which cannot be compressed using any method. In fact, this proves that there are infinitely many such sets, at least one of each size N. This of course assumes that you include the decompression program as a component of the length of the compressed dataset, which is reasonable because otherwise you could just have the decompression program copy the dataset, giving perfect compression (although this is highly dubious it might be argued by the pedantic).
Re:There is a theory that by Mac+Degger · 2006-08-20 12:49 · Score: 1

Your original statement: "There is not such a generator for any dataset, such that the generator can be described using less bytes than the size of the dataset in question."

Now you say: "You think my proof says that you can't find a program to create a dataset which is smaller than the dataset, for some specified dataset X. Clearly this is false, as you can compress many datasets. However, there are some datasets which cannot be compressed at all using any method. My proof, although probably not nearly perfect, does show this."

"Any" dataset or "some"...which is it? That is a crucial (and somewhat monumentally huge) difference. I did understand your post, btw. I had an exam on the subject of mathematical proofs of a computanional nature just two months back. I scored an 8/10, which in european terms is pretty damn good (US equivalent would be an A). Maybe because I was slightly drunk at the time of my posting I wasn't as clear in what I was pointing out. Sorry about that.

--
-- Waht? Tehr's a preveiw buottn?
Re:There is a theory that by ltbarcly · 2006-08-26 03:40 · Score: 1

Ok, you don't even know the basic terms, don't lecture me.

I proved that the statement "every dataset is compressable" is false, using contradiction. In other words "there exists a dataset (possibly many) which cannot be compressed". These statements are logically identical.

To restate, there are some datasets which cannot be compressed. This directly proves your statement that every dataset can be generated by a pseudorandom number generator FALSE, at least in those cases where the pseudo random number generator can itself be described with less data than the dataset you are trying to produce.

To recap yet again, my proof shows that there are some datasets which cannot be created using a pseudorandom number generator of length less than that of the dataset, proving your assertion that there is such a generator for every possible dataset FALSE.

Once more: Your initial posting was incorrect. I corrected it and supplied a mathematical proof that it was incorrect, which you didn't (and apparently still do not) understand, regardless of your score on your quiz.

wikicast by VolciMaster · 2006-08-13 14:14 · Score: 3, Funny

a method for periodically re-syncing...

So, we need a WikiCast - remember folks, you heard it here first!

--
antipaucity

Would be useful for images by aliquis · 2006-08-13 14:14 · Score: 4, Funny

... now all we need is a dictionary for nudity and we could save a lot of bandwidth on the Internet!

Compressing text by 4D6963 · 2006-08-13 14:16 · Score: 1

The way I see it, there's mainly one good way to compress text (Disclaimer : I do not know if this idea has ever been used before, but if it has i'd be glad to know about it). Let's say you index the values of the most frequently used characters, and that you keep a 'joker' value to insert a custom character after, let's say that it sums up to about 45 characters for these indexed characters. You replace each character by its indexed value, and instead of having it in base 256, you calculate it all in base 45.

If we consider that the unfrequent bytes' weight is unsignificant, we go from 100 MB of text (or HTML, the characters for the tags aren't that many), we get 100/256*45 = 17.5 MB. I guess that one top of that you could compress it using some traditional algorithm, although most of the job must have been done with it.

Is there a major flaw in my reasoning or does it sound like a suitable idea?

--
You just got troll'd!

Re:Compressing text by Anonymous Coward · 2006-08-13 14:26 · Score: 0

try log2(45)/log2(256) instead ... you'll actually get about a 25% reduction (rounding log2(45) to 6).
Re:Compressing text by 4D6963 · 2006-08-13 15:59 · Score: 1

try log2(45)/log2(256) instead ... you'll actually get about a 25% reduction (rounding log2(45) to 6).
Excuse me but I don't quite understand what you're trying to say. What one should do log2(45)/log2(256) for?

--
You just got troll'd!
Re:Compressing text by Euler · 2006-08-13 16:15 · Score: 1

yes, sounds like Huffman encoding.
Re:Compressing text by Anonymous Coward · 2006-08-13 16:33 · Score: 0

A base 256 character uses 8 bits. A base 45 characters uses 6 bits. So you only get 25% compression.
Re:Compressing text by vidarh · 2006-08-13 21:13 · Score: 1

I see where you're going, and you're far off. Storing a single base 45 symbol requires 6 bits when you convert it to binary. Hence you get at most a 25% compression ratio.
On top of that it would work only for latin scripts (in other scripts you'll find you need far more than 45 characters on a frequent basis).
For US-ASCII or latin-X charsets, sure, it would achieve something, however it would also mask a lot of other similarities from any byte oriented compressors, and so you'd risk that it would reduce the compression ratio with more than you've gained if you try to use it in combination with some types of other compressors (of course a byte oriented compressor will never be optimal in the first place exactly because it can get affected badly by stuff like this)
Re:Compressing text by 4D6963 · 2006-08-14 08:02 · Score: 1

you mean that Huffman encoding is all about indexing values? I would have rather thought it was mainly about eliminating redundancy.

--
You just got troll'd!
Re:Compressing text by Euler · 2006-08-14 10:20 · Score: 1

it's both. In a nutshell, you make a tree of the most common sequences then index those values using sequences of bits (possibly shorter than a byte.) The shortest bit sequences get assigned to the most commonly-occuring characters and strings.

Doug in a dress? by Ace905 · 2006-08-13 14:21 · Score: 1

douginadress

--

Ace

Barebones Windows or Linux by Baldrson · 2006-08-13 14:24 · Score: 2, Informative

See the detailed rules for specifics but generally the rules are just what you would expect: The program runs (and completes in a reasonable time) on a relatively recent system running Windows (currently XP) or Linux with no external inputs, eg no dynamically loaded libraries not included in the submission, no net communication and no disk I/O that isn't generated by the program itself.

Points are not awarded for attempting to circumvent the intent of the competition. I expect such attempts would result in future submissions from the same source being ignored.

--
Seastead this.

I can compress it to about 4.1 gig... by jasontheking · 2006-08-13 14:24 · Score: 1

I've managed to get wikipedia to fit on a knoppix ISO image. I started off with the CD iso , turned on mysql and apache , installed mediawiki , and put on the iso a wikipedia database file , which I made from the sql dumps that the wikipedia foundation put out. (although now they don't , its xml only apparently)

So basically you boot the disk , start a web browser , go to http://localhost/wikipedia/index.php , and you can start searching for articles.

That , plus all the images I could fit , is about 4.1G. I've got a first version done , I'm just trying to get the thing uploaded somewhere so people can help with further development. I hope to go to a dual layer dvd , to fit on about 4G more images.

The fact that you have to reboot your machine to use it is a bit of a bummer , hopefully there's some way to run qemu or vmware off the disk , so the disk can boot itself inside a VM.

is anyone interested in this?

incentivize by Anonymous Coward · 2006-08-13 14:26 · Score: 0

the intent of which is to incentivize the advancement of AI through the exploitation

What a poorly crafted sentence! Surely that should be "to incentivize to advancize AI and exploitize..."

Cool... by Baldrson · 2006-08-13 14:27 · Score: 1

I hope the prize fund becomes very large and someone like you comes up with an algorithm and gets rich enough to retire.

--
Seastead this.

Re:Cool... by gardyloo · 2006-08-14 04:40 · Score: 1

I hope the prize fund becomes very large and someone like you comes up with an algorithm and gets rich enough to retire.

Yeah, me too! For large "me" values of "you".

I can compress the answer to all questions down to by dalesmatrix · 2006-08-13 14:29 · Score: 1

two digits....42. Bring on the prize!

He cites Solomonov by Baldrson · 2006-08-13 14:42 · Score: 1

Hutter's theory is about an active AI agent seeking to maximize some utility function rather than passive prediction. However, it is probably the case that prizes for human knowledge compression should have been in place a long time ago -- probably as long ago as William of Ockham. IMHO guys like Solomonov provided sufficient formal rigor to justify massive government funding of a prize competition for compression of human knowledge decades ago, but it seems to be the case that people who control money generally just can't bring themselves to fund prize competitions for objective criteria. This generally doesn't speak well of people who control money. The exceptions are highly notable.

--
Seastead this.

Local optima by Baldrson · 2006-08-13 14:48 · Score: 1

The fact that there are local optimal in utility functions in which you can get stuck is a problem for all learning systems but to varying degrees depending on the details of how they look around the space of "programs". The space of programs measured for Kolmogorov complexity has a lot of discontinuity.

--
Seastead this.

Re:Local optima by DrJimbo · 2006-08-13 16:51 · Score: 1

Gosh I hope this is a Turing test.

I really do think you are talking about something entirely different. AFAIK, the fact that two identical, optimal deciders/reasoners can have diverging views of the same world when they are seeded with different a priori information is not the same thing as getting stuck in local maxima.

I looked at your resume. You seem to be a very bright and creative guy. I really encourage you to take a look at Jaynes' book. I'm pretty sure you can still get most of it from pdf's floating around the Web so you don't have to invest any money to take a look at it and find out if it interests you. I think it provides some very interesting mathematical insight into how people reason. Here is the opening paragraph:

Suppose some dark night a policeman walks down a street, apparently deserted; but suddenly he hears a burglar alarm, looks across the street, and sees a jewelry store with a broken window. Then a gentleman wearing a mask comes crawling out through the broken window, carrying a bag which turns out to be full of expensive jewelry. The policeman doesn't hesitate at all in deciding that this gentleman is dishonest. But by what reasoning process does he arrive at this conclusion? Let us first take a leisurely look at the general nature of such problems.
It starts out with ideas similar to those of Polya in Mathematics and Plausible Reasoning but where Polya does a lot of qualitative hand waving, Jaynes is almost 100% quantitative.

--
We don't see the world as it is, we see it as we are.
-- Anais Nin
Re:Local optima by scottyokim · 2006-08-15 04:48 · Score: 1

There are links to other books by Jaynes here: http://www.singinst.org/action/readinglist.html/

Hutter's Theory - Disproved by giafly · 2006-08-13 14:57 · Score: 3, Insightful

The basic theory, for which Hutter provides a proof, is that after any set of observations the optimal move by an AI is find the smallest program that predicts those observations and then assume its environment is controlled by that program. Think of it as Ockham's Razor on steroids.

A "Hutter AI" will be at a disadvantage when competing against an opponent which knows it's acting as above and can do the same calculations. Under these circumstances, the opponent will be one step ahead. The Hutter AI is predictable and so can be outmanoeuvered. Hence the Hutter AI's moves are not optimal.

Human poker players address this issue by deliberately introducing slight randomness into their play. I think a "Hutter AI" will make better real-world decisions if it does the same (see Game Theory).

Occam's razor (also spelled Ockham's razor) is a principle attributed to the 14th-century English logician and Franciscan friar William of Ockham. Originally a tenet of the reductionist philosophy of nominalism, it is more often taken today as a heuristic maxim that advises economy, parsimony, or simplicity in scientific theories. Occam's razor states that the explanation of any phenomenon should make as few assumptions as possible - Wikepedia

Occam's razor is also highly suspect. There's the issue of cultural bias when counting assumptions. And all programmers will be aware of how they fixed "the bug" that caused all the problems in an application, only to find there were other bugs that caused identical symptoms.

--
Reduce, reuse, cycle

Re:Hutter's Theory - Disproved by NoOneInParticular · 2006-08-13 19:03 · Score: 1

As the opponent is part of the AI's environment, the Hutter AI would be able to outmanouver the opponent in this particular scheme. Given that the Hutter AI needs complete knowledge of the environment, this is of no great difficulty. Actually calculating a Hutter AI is difficult however, as the thing is intrinsically incomputable.
The scenario that you sketch is actually addressed by Hutter in his technical tutorial-like report.
Re:Hutter's Theory - Disproved by khallow · 2006-08-13 19:10 · Score: 1

Human poker players address this issue by deliberately introducing slight randomness into their play. I think a "Hutter AI" will make better real-world decisions if it does the same (see Game Theory).

Huh, I thought you were going to be more profound than that. My take is that the game in question is the player against the environment. Ie, you don't have a second player to anticipate you. And adding randomness does thwart this limited form of competition. But you have a deeper problem. Namely, the rules of the game can change especially if one of the players knows that will throw off the Hutter AI. Eg, everyone else starts playing gin rummy instead of poker and then switches to something else when the AI starts to catch on.

Occam's razor is also highly suspect. There's the issue of cultural bias when counting assumptions. And all programmers will be aware of how they fixed "the bug" that caused all the problems in an application, only to find there were other bugs that caused identical symptoms.

One doesn't "count" assumptions in mathematics. Too often they are simply incomparable. Instead, one can speak of "finer" and "coarser" theories. Ie, if a second theory is just the first theory with additional nontrivial assumptions, then the second theory is "finer" than the first while the first is "coarser" than the second. Then given a series of observations, any theory which explains the observations and is coarsest possible (ie, no coarser theory exists) might satisfy the conditions of Occam's razor. At this point, you end up with one or more theories (possibly an infinite number) with no obvious way to sort them. So far no cultural bias has happened. Introducing heuristics at this point also introduces the "cultural bias" you refer to.
My take is that most of the time, it's not that hard a problem. The final coarsest theories end up with one or a few simple natural theories, a bunch of complex natural theories that depend on the observation being at the threshhold of detection for those theories, and then a host of theories dependent on supernatural causes.
Re:Hutter's Theory - Disproved by asolipsist · 2006-08-13 22:39 · Score: 1

"As the opponent is part of the AI's environment, the Hutter AI would be able to outmanouver the opponent in this particular scheme. Given that the Hutter AI needs complete knowledge of the environment, this is of no great difficulty. "

So the Hutter AI is a compression of the complete environment as well, down to every single atom of every single neuron in it's opponents brains? To my untrained eye Hutter AI appears indistinguishable from mathematical flatulence, rather than any attempt at real AI.

"Actually calculating a Hutter AI is difficult however, as the thing is intrinsically incomputable."

Of course it is, if it were computable it might actually be useful :) It seems like Hutter AI is just a somewhat tautological and uphelpful definition of problem solving.
Re:Hutter's Theory - Disproved by giafly · 2006-08-13 23:28 · Score: 1

Huh, I thought you were going to be more profound than that. My take is that the game in question is the player against the environment. Ie, you don't have a second player to anticipate you. And adding randomness does thwart this limited form of competition.
I guess I'm flattered. I think randomness also helps in player-vs-environment games btw, because it introduces an element of experimentation and feedback. If an AI always make an allegedly perfect decision (as a Hutter AI will do), there's little opportunity to learn. But if it keeps varying its decisions, it gets additional information as to what works better. Thus by making slightly random decisions now, it can make better decisions in future.

--
Reduce, reuse, cycle
Re:Hutter's Theory - Disproved by Scarblac · 2006-08-14 01:09 · Score: 1

Sure. If there's a game of A vs B, in which it's important to know your opponent's intentions (like rock paper scissors), then when player A can predict how play B will play while B can't do the same, A will always have an advantage. However, that says nothing about how good the AI of B is.

In a game like chess, an optimal move is an optimal move, even if the opponent knows that it's going to be played.

In general, Hutter aims for optimal performance given all the available information (and not just for games - for any situation). If your opponent has better information (like your source code), he can still be better in a game like poker, but that cannot be helped.

I bet that Hutter's AI will crack a human's "randomness" rather quickly, by the way. We're not good at being random. Meanwhile, you still don't know the machine's cards.

--
I believe posters are recognized by their sig. So I made one.
Re:Hutter's Theory - Disproved by Scarblac · 2006-08-14 01:35 · Score: 1

Occam's razor is also highly suspect. There's the issue of cultural bias when counting assumptions.

The way Hutter uses it is in the sense of Kolmogorov complexity.

Let's say you have a series of inputs 1 2 3 4 5 6 7 8 and want to predict how it will continue from there.

Kolmogorv complexity says that the most likely explanation for this series is given by the shortest universal Turing machine program that will produce that series.

--
I believe posters are recognized by their sig. So I made one.
Re:Hutter's Theory - Disproved by Anonymous Coward · 2006-08-14 03:18 · Score: 0

The Hutter AI *changes* its model based on what it observes. Thus, if the environment was trying to do the same calculations and plan according to that, the Hutter AI would "realize" this. Of course two Hutter AI's going against each other would lead to a very complicated series of moves, but I think the AI is meant for interacting with the environment, not for playing games, so it will do just fine for most tasks.
Re:Hutter's Theory - Disproved by khallow · 2006-08-14 05:34 · Score: 1

I guess I'm flattered. I think randomness also helps in player-vs-environment games btw, because it introduces an element of experimentation and feedback. If an AI always make an allegedly perfect decision (as a Hutter AI will do), there's little opportunity to learn. But if it keeps varying its decisions, it gets additional information as to what works better. Thus by making slightly random decisions now, it can make better decisions in future.
Optimal decisions usually have some randomness in them. Sometimes, they have a lot as in the case of "Rock, Paper, Sissors" where the optimal strategy is randomly chosing between the three choices with equal weight.
Re:Hutter's Theory - Disproved by Anonymous Coward · 2006-08-14 07:52 · Score: 0

> A "Hutter AI" will be at a disadvantage when competing against an opponent which knows it's acting as above and can do the same calculations. Under these circumstances, the opponent will be one step ahead. The Hutter AI is predictable and so can be outmanoeuvered. Hence the Hutter AI's moves are not optimal.

You could phrase that more succinctly as saying that being predictable is a disadvantage, except, of course, when the opponent can't do anything about it (e.g. games like tic-tac-toe are easily solvable, and end in a draw for any sensible player... the decision tree can easily be mapped out by hand, being unpredictable could only put you at a disadvantage).

Compress Wikipedia and win a prize? by Dachannien · 2006-08-13 14:59 · Score: 4, Funny

Can't I just punch the monkey for $20 instead?

Re:Compress Wikipedia and win a prize? by MarkRose · 2006-08-13 20:20 · Score: 1

You could probably get your monkey spanked for about that price.

--
Be relentless!

Erratum: compressor - decompressor by Baldrson · 2006-08-13 15:18 · Score: 1

I made a mistake in stating the criteria. The size of the compressor is irrelevant. Only the size of the decompressor is measured.

--
Seastead this.

Not sure if that's a joke. by monoqlith · 2006-08-13 15:20 · Score: 1

I mean, I hate it when people "verbize" nouns.

Re:Not sure if that's a joke. by Andrew+Kismet · 2006-08-13 17:14 · Score: 3, Funny

Of course he was joking. If he was serious he would've said "verbificate".
Re:Not sure if that's a joke. by Bloke+down+the+pub · 2006-08-13 23:29 · Score: 1

I'm whooooshing you.

--
It's true I tell you, feller at work's next door neighbour read it in the paper.
Re:Not sure if that's a joke. by escher · 2006-08-14 06:22 · Score: 1

I always liked "verbitronize".
Re:Not sure if that's a joke. by ameline · 2006-08-14 09:55 · Score: 1

If I was serious, it I would have used "verbify" :-)

However, don't let that discourage you -- verbificate is a perfectly cromulent word.

I just wish people would stop cromulizing the language. :-)

--
Ian Ameline

Easy. by eosp · 2006-08-13 15:28 · Score: 1

compress: s/%/\\%/g s/ the /%/g and so on. You just saved a bit.

super markov models? by Anonymous Coward · 2006-08-13 15:35 · Score: 0

What ever happened to all the super markov models we were supposed to have now with all the extra powerful computers sitting on everyone's desk? Phil Katz was talking about this almost a decade ago, nothing ever came of it.

Does the smallest resulting file matter if it takes a week to compress and a day to uncompress? Why does (lack of) speed have no factor in this?

Indexed by word set by foniksonik · 2006-08-13 15:44 · Score: 1

Easiest thing I can think of without getting too complicated is to create an index of all words on the site and then do replacement in the page compositions with pointers.

ie: "approximately 990,000" according to WikiPedia itself but with slang, etc. make it a cool million.

This should compress the archive as a whole substantially....

Next would be to use an incremental versioning system for incremental edits. Rather than keeping a copy of all article versions... just keep what's changed and pointers back to the original... though this may already be happening (not too familiar with how wikis store versions).

hmm well that's all I can think of on a Sunday evening. Oh yeah.. use some hashing techniques in there somewhere.. something like what is used in Content Addressed Storage archiving systems... that software manages to get huge compression losslessly across diverse sets of text-type data easily.

--
A fool throws a stone into a well and a thousand sages can not remove it.

Re:Indexed by word set by vidarh · 2006-08-13 20:52 · Score: 1

The problem you'd run into is that there are plenty of common compression algorithms around that does what you have described as an implicit effect of how they work, without making arbitrary decisions, like encoding "words" (in most documents there will be more efficient sequences of bytes that doesn't have to be split on word boundaries) or limit deltas to just incremental edits.
Re:Indexed by word set by foniksonik · 2006-08-14 02:02 · Score: 1

Hmmm I didn't read the article but i was under the impression that this was for online compression of the database itself... ie: there ARE lots of ways to compress it offline and why would they be asking for something so general which applies to all data eveywhere offline.... so with that assumption I was suggesting ways to reduce the data set within the live database.

--
A fool throws a stone into a well and a thousand sages can not remove it.
Re:Indexed by word set by foniksonik · 2006-08-14 02:08 · Score: 1

Just read the article.... hmmm what a dumb task request... i was wrong, they are looking for something that applies to all data everywhere. Billions of dollars have been spent on this in the past 20 years.... there are several patented algorithms from EMC, from DataDomain, from Avamar Technologies and others which deal specifically with this question... commonality factoring is what they call them.

Why not just call up the experts then and ask them to demonstrate what they can do.

--
A fool throws a stone into a well and a thousand sages can not remove it.

Experiments by Baldrson · 2006-08-13 15:58 · Score: 1

The markup language portions of the corpus will be very rapidly reduced to a set of rules governing their syntax leaving mainly their "raw" meaning (as meta knowledge in the case of the XML markup of the corpus and human interface knowledge in the case of the Wiki markup). Although there is good reason for these to be the subject of research for their own merit, researchers who want to win money via compression of more generally valuable knowledge will find lots of opportunities and needn't be distracted from primary research routes.

Ultimately, you can see the corpus as a space within which various hypotheses about practical AI can be experimentally compared. That's the crucial importance of choosing something general like Wikipedia.

--
Seastead this.

Incite by XanC · 2006-08-13 16:11 · Score: 1

There's a real word for this. I'm pretty sure "incite" is what everybody's looking for.

Re:Incite by maxwell+demon · 2006-08-14 02:45 · Score: 1

Mod parent inciteful!

--
The Tao of math: The numbers you can count are not the real numbers.

Yes - Wikipedia is already monkey-generated by blueZ3 · 2006-08-13 16:22 · Score: 1

I fail to see how putting monkeys at computers would result in something with quality that is significantly different from the current content of Wikipedia

--
Interested in a Flash-based MAME front end? Visit mame.danzbb.com

Re:Yes - Wikipedia is already monkey-generated by maxwell+demon · 2006-08-14 02:49 · Score: 1

Let's analyze your statement.

We know that putting monkeys on typewriters will result in the works of Shakespeare.
We know that Shakespeare has significantly better quality than Wikipedia.
You claim that monkeys on computers will produce something with comparable quality to Wikipedia.

So what makes you think that monkeys would produce less quality texts on computers than on typewriters?

--
The Tao of math: The numbers you can count are not the real numbers.

AI: Why does it always sound like bullshit? by fracskul · 2006-08-13 16:27 · Score: 1

Look, I'm no Phd, just a dude that has read a lot. Why does all the stuff I have ever read about Artificial Intelligence sound like bullshit? I mean, where are the Rudy Rucker type robots? Anything even close to AI? It's all failure. All the theory, all the work, and we can't even build a fucking BUG! I know the whole thing contains some of the hardest problems in science, but shit, where is ANY PROGRESS? Should we be trying build positronic brains or something?

Re:AI: Why does it always sound like bullshit? by njdj · 2006-08-13 18:38 · Score: 1

but shit, where is ANY PROGRESS?

There has been a lot of progress, but as soon as a problem in AI is solved, that problem is re-classified as "not AI". Thus, although a lot of AI problems have been solved, there are no solved AI problems. (If that seems nonsensical to you, re-read the sentence preceding it.)

An example is the problem of playing a good game of chess. 50 years ago, this was definitely seen as a problem in AI. Now it is solved - we have chess programs that play about as well as the current World Chess Champion. But it is not seen as "AI".
Re:AI: Why does it always sound like bullshit? by vidarh · 2006-08-13 20:47 · Score: 1

In a sense this actually does make sense - most of the problems that are seen as part of AI research involve solutions that doesn't have anything to do with intelligence. The problem is rightfully part of AI research, but the solutions largely aren't.
Imagine being an architect in a world where no building material are known. You'd spend most of your career inventing the basic building blocks and the techniques to use them to allow you to start designing houses. Doing that work would be part of being an architect, but once the work is done, the techniques you've invented would be part of a different field.
Of course the downside is that people asking what your successes as an architect are, based on the "proper" definition of an architect will think you're a perpetual loser, as you haven't completed any house designs, since you need to know what materials and constraints you have to work with first.

Windows to the rescue! by SimplyI · 2006-08-13 16:42 · Score: 1

I knew I'd find a use for Alternate Data Streams...

@ECHO OFF if (%1) == (de) GOTO DECOMPRESS echo Compressing %1... more < %1 > compress.bat:comp echo Compression complete. GOTO END :DECOMPRESS echo Decompressing... more < compress.bat:comp > %2 echo Decompression complete. :END

It compresses, decompresses, and is always 248 bytes!

(It's a joke...)

Hutter's basic theory obviously wrong by AxelBoldt · 2006-08-13 17:27 · Score: 1

The basic theory, for which Hutter provides a proof, is that after any set of observations the optimal move by an AI is find the smallest program that predicts those observations and then assume its environment is controlled by that program.

Well, I don't know about Hutter's proof, but I do know that the "basic theory" as described above is surely wrong. The simple reason: the smallest program that predicts those observations will depend on the programming language used; the optimal AI move however obviously does not depend on the programming language.

Re:Hutter's basic theory obviously wrong by Cederic · 2006-08-13 20:46 · Score: 1

You're thinking of size in code terms. Realign your size measurement to use (e.g.) function points, or processing cycles, or some other non-language-dependent factor.
Re:Hutter's basic theory obviously wrong by AxelBoldt · 2006-08-14 04:29 · Score: 1

Realign your size measurement to use (e.g.) function points, or processing cycles, or some other non-language-dependent factor.
Whatever notion of size measurement you pick, its formal definition will always have to depend on some underlying formally defined framework, such as Turing machine, lambda calculus, random access machine, or some programming language. These formal frameworks all contain lots of arbitrary conventions, and your notion of size will by necessity depend on those conventions. The notion of "optimal AI move" does not depend on those conventions however.
Re:Hutter's basic theory obviously wrong by Cederic · 2006-08-14 05:44 · Score: 1

Then I guess we agree that the measurement mechanism doesn't matter. I still disagree with your conclusion that this invalidates Hutter's theory.

He is expressing the optimal AI move as the smallest program possible; I think that how you measure that size is not relevant to that theory. I don't see that lacking an appropriate measurement approach free from arbitrary constraints invalidates the theory.
Re:Hutter's basic theory obviously wrong by AxelBoldt · 2006-08-14 11:24 · Score: 1

The measurement process matters insofar as different measurement processes will yield different minimal programs and therefore, according to Hutter, different optimal AI moves. But in the real world there is only one, or at best a very small number of, optimal AI moves, not the plethora produced by Hutter.
Re:Hutter's basic theory obviously wrong by Cederic · 2006-08-15 03:10 · Score: 1

Then the smallest program as measured by any given method may not the actual smallest. The argument exist that it is the measurement method that is flawed rather than the theory.

Admittedly this makes the theory particularly complicated to prove..

no one will get the 50000 by savage_panda · 2006-08-13 17:37 · Score: 1

The money handout scheme is written right now is
(1-(new record/previous record))*50000.

The contest is seeded with already a 18 MB entry,
So suppose you have labored to get an amazing compressor to compress to 9 megs. and hand that solution in, you will get 25000 euros.

Even, if you padd your solution so that it only shows a 1% improvement each time, you will get somewhere like 35000 euros.

Re:no one will get the 50000 by maxwell+demon · 2006-08-14 03:26 · Score: 1

Even, if you padd your solution so that it only shows a 1% improvement each time, you will get somewhere like 35000 euros.

That's an interesting observation. Given that the size of the decompressor is also weighted in, you can even optimize this scheme by first writing the decompressor in an non-size-optimal way (e.g. expanding function calls into their calling functions, etc.) and then gradually improve your code without changing one bit of the algorithm!

Of course this assumes that you find that well-compressing algorithm first :-)

--
The Tao of math: The numbers you can count are not the real numbers.

Reconstruction by Mark_MF-WN · 2006-08-13 18:21 · Score: 1

Oh, it gets so much so worse. Kernels can be of arbitrary length, so determining what they do basically falls under the tragic category of the Acceptance Problem -- developing an algorithm that only accepts candidate kernels that actually do perform the business of being a kernel properly. Unless the OP happens to have solved the Acceptance Problem, his proposal doesn't work.

Re:Reconstruction by Anonymous Coward · 2006-08-13 19:42 · Score: 0

In this case make your program be the init that is ran by the kernel at startup.

Compression by Mark_MF-WN · 2006-08-13 18:26 · Score: 1

Strictly speaking, the effectiveness of any compression technique has to be measured by including the algorithm's length as part of the length of the output. So if you compress a 100kb document with a 30kb build of bzip2, and the output file is 70kb, you've got a compression ratio of 1. This isn't so much a concern in practice (particularly with something like bzip2 that you reuse), but when you're discussing theoretical "bests", it becomes supremely important for exactly the reason you describe. Kolmogorov complexity, and all that.

Minimum Description Length by Anonymous Coward · 2006-08-13 18:27 · Score: 1, Interesting

Hutter naming this theory after himself is silly. It's called the theory of "minimum description length" and it's been around for a while (well before the 2004 copyright on Hutter's book). The idea is to find some model which minimizes the sum: size of model + incorrectness of model's predictions. The Linguistica folks use it, and in fact would probably be in the running for this prize.

Not really a good idea by Anonymous Coward · 2006-08-13 18:52 · Score: 0

That's been done, and it didn't work.
All they ever wrote about was bananas.

Monkeys throwing feces at each other... by patio11 · 2006-08-13 19:24 · Score: 1

... are far too refined to ever simulate the inanity of a wiki revert war.

POV! POV!

--
Help poke pirates in the eyepatch, arr.

Re:Well.. doesn't the dictionary make it smaller?? by greg1104 · 2006-08-13 19:58 · Score: 1

The contest requirements add the length of the decompressor program to the size of the compressed file; therefore a dynamic dictionary you build yourself (and store in the file) will be better than using those generic ones. From the faq:

The typical size of current decompressors is less than 100KB. So obscuring them by making them shorter will give you at most an 0.1% = 100KB/100MB advantage (not enough to be eligible for the prize). On the other hand, for fairness it is necessary to include the size of the decompressor. Take a compressor like PAQ8H that contains 800KB of tokens. Clearly it can achieve better compression than one from scratch. If you're not convinced by this argument, consider a an extreme "decompressor" of size 100MB that simply outputs enwik8 byte by byte (from a zero 0 byte archive), thus achieving a compressed size of 0.

Yipes! by Duncan3 · 2006-08-13 20:09 · Score: 1

What if I don't want to wait 30 minutes and use 1GB of RAM for a trivial 3MB file I want to compress?

Compressing 1GB... ETA March 2036... please wait.

--
- Adam L. Beberg - The Cosm Project - http://www.mithral.com/

Re:Yipes! by maxwell+demon · 2006-08-14 03:34 · Score: 1

What if I don't want to wait 30 minutes and use 1GB of RAM for a trivial 3MB file I want to compress?

Easy: Use the disk to store all data, then you'll need less than 1GB of RAM. This will not solve your time problem (quite the opposite!), but since your problem was to have to wait 30 minutes and use 1GB of RAM, your problem is solved by solving just one of the two partial problems.

--
The Tao of math: The numbers you can count are not the real numbers.

Analog Lossy to Digital Compression possible? by shibbie · 2006-08-13 21:14 · Score: 1

Now I've dabbled in compression (enough to write a basic compressor) and what Matt suggests I believe relies on an unproven axiom. He suggests AI would need to think like a human as we have understand the function P(s) where this hasn't yet been modelled in machines. However our understanding of the function P(s) relies on a lossy format, that is (e.g.) we can roughly remember what an image is and is show it again confirm that, but we cannot replicate it from one instance of viewing, even if we could, it wouldn't be a perfect copy (e.g. of an image of the Mona Lisa down to the millimetre). Lossy thinking would mean that some detail is left out, meaning if replicated in AI it would fail the test, so what Matt actually proposes is an improvement on human thinking (our thought process with perfect decompression), which as we don't understand the latter yet properly I believe is not yet possible. It might be our thought processes *rely* on the fact that we store memories lossy and which degrade over time, in which case it may not be possible to pass the benchmark at all (as it is for lossless compression).

Done! by Anonymous Coward · 2006-08-13 21:21 · Score: 0

42

There. Who is Al, and what is this prize of Al's that I win?

how to compress it even smaller by commodoresloat · 2006-08-13 23:23 · Score: 1

>Removing all the incorrect and inaccurate data...

removing all the redundant and repetitive adjectives will compress it even further ;)

This is easy to win. by seanyboy · 2006-08-13 23:37 · Score: 1

Just zip the file up, and then continue to zip the resultant zipped file until it is really small. Easy.

--
Training monkeys for world domination since 1439

Finally! by Anonymous Coward · 2006-08-14 00:45 · Score: 0

The Encyclopedia Galactica is one step closer to fruition!

extra sig by Ginger+Unicorn · 2006-08-14 01:09 · Score: 1

i'm betting you use greasemonkey to automatically insert your extra sig into the comment field whenever you submit a comment on slashdot.

--
(1.21 gigawatts) / (88 miles per hour) = 30 757 874 newtons

I could have done it... by slapout · 2006-08-14 01:22 · Score: 1

...if ya just hadn't specified "Lossless". :-)

--
Coder's Stone: The programming language quick ref for iPad

Re:Tough call... by Oligonicella · 2006-08-14 01:45 · Score: 1

The question in the post above this is still relevant.

I can get it down to 0% by Anonymous Coward · 2006-08-14 01:49 · Score: 0

Where is my prize

sorta 50,000Euro by Boanerge · 2006-08-14 02:01 · Score: 0

50'000×(1-S/L). I think this prize should be the approaching 50,000E prize. At some point there is a lower possible limit on S.

Herbert Simon's Research Contradicts Hutter by Anonymous Coward · 2006-08-14 02:34 · Score: 0

The basic theory, for which Hutter provides a proof, is that after any set of observations the optimal move by an AI is find the smallest program that predicts those observations and then assume its environment is controlled by that program.

Herbert Simon, the father of AI, deduced that humans use what he called satisficing (and a more formal definition). That is, they tend to the first available solution that works. Nothing about "optimal moves" or "smallest program". In fact solutions were usually suboptimal.

Since humans are the only examples of intelligence, artificial or otherwise, that most would agree are "intelligent", I think Herbert Simon's work trumps Hutter's.

Psychological studies show no human tendency to develop any "smallest moves" or even "simplest explanation" (a' la' Occam's Razor) without significant training. Instead humans start with a base set of assumptions and then constantly patch it to adapt to new information. An overall revision may often not occur at all without some significant triggering event (death of a child, accident, etc.).

simple by macsox · 2006-08-14 03:10 · Score: 1

submit a random character generator. it matches wikipedia's accuracy, and will result in no fewer arguments.

--
go get it

Re:I can compress the answer to all questions down by maxwell+demon · 2006-08-14 03:11 · Score: 1

I'm eagerly awaiting your decompressor.

--
The Tao of math: The numbers you can count are not the real numbers.

Not even close by blueZ3 · 2006-08-14 03:16 · Score: 1

We know that putting monkeys on typewriters [for an extremely long period of time ] will result in the works of Shakespeare

You missed my point, which is that the current quality of information (and writing) on Wikipedia is extremely poor. In fact, poorer than what could be produced by the monkeys on computers given a long period to work.

The idea of the "monkeys on typewriters" illustration is that with an unlimited amount of time, a random process could produce something that would appear "structured." At this point, Wikipedia hasn't been around long enough for the monkey-randomness to produce anything like Shakespeare. And my expectation is that it won't live long enough to get anywhere close.

Further, I think that any writer (which is my day job) can tell you that the quality of Wikipedia isn't likely to get better, because it's actually in a worse situation than the monkeys on computers. The non-random, article-by-committee nature of the writing is doomed to read like something, well... written by a committee. Which is to say: poorly. Add to this the "any-12-year-old can edit articles on physics" openness, and you've got something that not even a monkey could love.

--
Interested in a Flash-based MAME front end? Visit mame.danzbb.com

Re:Not even close by Baldrson · 2006-08-14 04:21 · Score: 1

This turns out to be a virtue for the Hutter Prize actually. The AI emerging from the Hutter Prize will be better able to model ambiguities arising from poor usage and hence be more capable of understanding people. This, of course, makes the compression more difficult but the point of the Hutter Prize is to make it simple for funding sources to create ever greater incentives for solving the critical problems for AI by literally throwing money at the problem and being guaranteed results for their money.

--
Seastead this.
Re:Not even close by maxwell+demon · 2006-08-14 07:27 · Score: 1

You missed my point

Whoosh!

--
The Tao of math: The numbers you can count are not the real numbers.

Not the same thing, but related. by sethjn · 2006-08-14 03:34 · Score: 1

Well, while it isn't the same thing this guys is looking for, I have created a new semi-random access compression format specifically because I was having to process Wikipedia data. A friend of mine and I were working on some data mining in Wikipedia and found that uncompressing the 300 Gig file to be unrealistic. So I created RAZ (Random Access Zip) to help us out. It is a very simple system. It simply uses another compressor to compress relatively small chunks, strings them together, and prepends a header with index information. Using normal 7-zip, we get the 300 gigabytes down to about 2.1 GB. Using the RAZ format using 7-Zip, we get it down to about 2.4. We've written a Python module that "opens" the file and allows for random access through seek and read. I'm eventually going to put RAZ on source forge when time allows. For now, I've just written a post about it on my blog (sethnielson.com). The code is still experimental, so I haven't posted it, but you can email me if you're interested in it and I'll send you a copy. -- Seth N.

The winner! by Anonymous Coward · 2006-08-14 03:34 · Score: 0

Herewith is my entry to compress all the useful knowledge of Wikipedida down to the minimum possible:

touch wp.gz

Done! Please donate my prize to the Red Cross.

Total compression by HTH+NE1 · 2006-08-14 03:36 · Score: 1, Interesting

If you think you can compress a 100M sample of Wikipedia better than paq8f

Anyone can write a program that can compress that sample down to zero bytes. The simplest such implementation of the program will be slightly bigger than the sample however and could only be used to decompress that sample.

Down to one byte, it could work with up to 256 different samples, but only those, and would still be slightly bigger than the sum of those 256 samples.

(Basically, given a byte, regurgitate the whole text which was precompiled into the program.)

A condition of the contest should be that the combination of the program and compressed data should be smaller than both the uncompressed data and the combination of paq8f and the compressed data.

Dare I recoin a phrase: Any sufficiently advanced compression algorithm is indistinguishable from a filing system.

--
Oh, say does that Star-Spangled Banner entwine / The myrtle of Venus with Bacchus's vine?

Re:Total compression by imsabbel · 2006-08-14 05:23 · Score: 1

Read the rules:
Either the archive has to be self extracting, or to be tarred with the decompressor.
You are not the first genius to get that idea...

--
HI O WISE PRINCE. WHT TOOK U SO DAM LONG?

My attempt attached ... (7 chars) ... beat that! by Anonymous Coward · 2006-08-14 03:39 · Score: 0

Rubbish

Calvin said it best: by Cattus+Curiosus · 2006-08-14 04:49 · Score: 1

Calvin: I like to verb words.
Hobbes: What?
Calvin: I take nouns and adjectives and use them as verbs. Remember when "access" was a thing? Now, it's something you do. It got verbed. Verbing weirds language.
Hobbes: Maybe we can eventually make language a complete impediment to understanding.

--
Snowclone is the new clich

The obvious by Anonymous Coward · 2006-08-14 06:48 · Score: 0

Technical details aside, *any* lossless compression algorithm can only remove the inefficiencies of representation in the original form. It cannot destroy information and then reproduce it at a later time. Even decesion-based "AI" algorithms destroy information. If there was a decision fork in the road and the algorithm chose the one less traveled by, the later decompressed version would only have memory of the road chosen. For some areas of knowledge, that is unacceptable.

Telescope, Tweezers, and a little box. by lcsjk · 2006-08-14 06:57 · Score: 1

You take a telescope and turn it around backwards and look at Wikipedia. Then, while it is very small you reach out with the tweezers, pick it up and put it in the little box. Close the box and put the telescope away.

How much did you say the prize would be?

Re:Painful to read - only if you expect simple stu by Raenex · 2006-08-14 10:08 · Score: 1

there are no errors that I can see

then you might want to try winning win

How to compress Wikipedia by Ed+Avis · 2006-08-15 02:47 · Score: 1

It's very easy to write a program that decompresses Wikipedia pages with perfect accuracy from a very small source file.

1. Download the Wikipedia article for, say, Jellied eels.
2. Follow the 'edit' link.
3. Change the page content to a 100 megabyte repeating sequence of 'I am a fish I am a fish'.
4. Generate this as output.

--
-- Ed Avis ed@membled.com

Slashdot Mirror

Compress Wikipedia and Win AI Prize

324 comments