Slashdot Mirror


Text Compressor 1% Away From AI Threshold

Baldrson writes "Alexander Ratushnyak compressed the first 100,000,000 bytes of Wikipedia to a record-small 16,481,655 bytes (including decompression program), thereby not only winning the second payout of The Hutter Prize for Compression of Human Knowledge, but also bringing text compression within 1% of the threshold for artificial intelligence. Achieving 1.319 bits per character, this makes the next winner of the Hutter Prize likely to reach the threshold of human performance (between 0.6 and 1.3 bits per character) estimated by the founder of information theory, Claude Shannon, and confirmed by Cover and King in 1978 using text prediction gambling. When the Hutter Prize started, less than a year ago, the best performance was 1.466 bits per character. Alexander Ratushnyak's open-sourced GPL program is called paq8hp12 [rar file]."
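
The 1.319 figure follows directly from the two byte counts in the summary. A quick sketch of the arithmetic, assuming one byte counts as one character (an approximation, since the input is UTF-8); the function and constant names are made up for illustration:

```python
# Figures quoted in the summary above.
ORIGINAL_BYTES = 100_000_000    # first 10^8 bytes of Wikipedia
COMPRESSED_BYTES = 16_481_655   # archive size, decompression program included


def bits_per_character(compressed_bytes: int, original_bytes: int) -> float:
    """Average number of compressed bits spent per input byte.

    Assumption: one input byte is treated as one character.
    """
    return compressed_bytes * 8 / original_bytes


bpc = bits_per_character(COMPRESSED_BYTES, ORIGINAL_BYTES)
print(f"{bpc:.3f} bits per character")  # 1.319 bits per character
```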

442 comments

  1. I wonder ... by iknowcss · · Score: 2, Funny

    How many bad car analogies, inaccurate law advice, and duplicate stories an AI bot could possibly hold in his head. Imagine what kind of person all of the "knowledge" of Slashdot would create.

    The horror.

    --
    Life is rarely fair. Cherish the moments when there is a right answer.
    1. Re:I wonder ... by Anonymous Coward · · Score: 3, Funny

      "How many bad car analogies, inaccurate law advice, and duplicate stories an AI bot could possibly hold in his head. Imagine what kind of person all of the "knowledge" of Slashdot would create."

      "The horror."

      I've been typing everything I ever knew into Slashdot since the day it started, you insensitive clod!
          -- Cmdr Taco

    2. Re:I wonder ... by Anonymous Coward · · Score: 0

      At least 10 Football fields, or 33.4 metric Volkswagen Beetles for European readers.

    3. Re:I wonder ... by manowar821 · · Score: 1

      Why was this marked as off topic? I thought it was funny. JERKFACES D:

      --
      Internet: Serious Business
    4. Re:I wonder ... by Anonymous Coward · · Score: 0

      At least 10 Football fields, or 33.4 metric Volkswagen Beetles

      We use laptop miles here.
    5. Re:I wonder ... by ichigo+2.0 · · Score: 1

      1. Create AI which contains all the knowledge of slashdot.
      2. Watch as AI spends all day telling Soviet Russia jokes, and when you ask it to do something it says 'I, for one, welcome our new meatbag overlords.'
      3. Build a beowulf cluster of the AIs
      4. ???
      5. Profit!

    6. Re:I wonder ... by PPH · · Score: 1

      Bad car analogy? Like comparing the compression of Slashdot's total knowledge to the way they squeeze old cars into neat little cubes?

      --
      Have gnu, will travel.
    7. Re:I wonder ... by Smauler · · Score: 2, Funny

      I wouldn't exactly call that lossless compression though...

    8. Re:I wonder ... by Anonymous Coward · · Score: 0

      All the relevant material is still there. If you were able to manipulate the material on a molecular level and you had a template, you would be able to reconstruct the vehicle. This is sorta the way a lot of compression utilities work. Store a template, then store the data.

    9. Re:I wonder ... by phoenixwade · · Score: 1

      I wouldn't exactly call that lossless compression though...

      Depends on the car......

      --
      A positive attitude may not solve all your problems, but it will annoy enough people to make it worth the effort.
  2. new compression standard by tronicum · · Score: 3, Informative

    so...wikipedia dumps will now be using paq8hp12 instead of l33t 7zip ?

    1. Re:new compression standard by aicrules · · Score: 4, Funny

      Dang! You must have enemies if you are the very first post and you get modded redundant. Time to work on some positive karma buddy...

    2. Re:new compression standard by Anonymous Coward · · Score: 0

      Dang! You must have enemies if you are the very first post and you get modded redundant.

      Well his post was third, not first. Other than that, your entire point was spot on!
    3. Re:new compression standard by imsabbel · · Score: 1

      If you want your weekly dump to need approx. 3 years, then yes. This program compresses (on a state of the art computer) at about the reading speed of a floppy drive.

      --
      HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
    4. Re:new compression standard by tronicum · · Score: 1

      thanks to mods and threads below, it was modded up!

    5. Re:new compression standard by whereiswaldo · · Score: 1

      I've tried a variant of the paq compressors (paq8p?) and it does a super job of compressing things very small. However, it does this by using a lot of memory and a lot of time. Decompression requires as much memory as you used to compress, apparently. What compression program you use depends on your needs and your constraints. I'm not sure what 7zip's memory requirements are, but with respect to time, it does a great job: it compresses in a time comparable to gzip while producing much smaller output (based on my limited testing).

  3. interesting program name by digitalderbs · · Score: 5, Funny

    paq8hp12. when decompressed, it also serves as the source code for the program.

    1. Re:interesting program name by OverlordQ · · Score: 5, Informative

      Since I know people are going to be asking about the name, might I suggest the wiki article about PAQ compression for the reasons behind the weird naming scheme.

      --
      Your hair look like poop, Bob! - Wanker.
    2. Re:interesting program name by arivanov · · Score: 0, Troll

      And I also suggest a revisit of this load of horrid tripe recently prominently featured on slashdot: http://developers.slashdot.org/article.pl?sid=07/07/08/0547234

      Compare it to the reasons behind this guy's achievement. Sit back. Reminisce. Enjoy der blinkenlichten.

      --
      Baker's Law: Misery no longer loves company. Nowadays it insists on it
      http://www.sigsegv.cx/
    3. Re:interesting program name by Anonymous Coward · · Score: 1, Interesting

      The fact that a program to solve an essentially mathematical problem requires a knowledge of mathematics to create does not invalidate the theory that Computer Science is not really a branch of maths (or if it is, it only is because mathematicians made it that way).

      Get over yourselves everyone - not every computing problem is a mathematical one, some are, some aren't.
      Besides, some of the branches of Maths relevant to computing are really only "Maths" because that's what someone decided to lump it in with - a name does not define the nature of something.

    4. Re:interesting program name by arivanov · · Score: 2, Interesting

      From the notebooks of Lazarus Long: If it can't be expressed in figures, it is not science; it is opinion.

      --
      Baker's Law: Misery no longer loves company. Nowadays it insists on it
      http://www.sigsegv.cx/
    5. Re:interesting program name by Red+Pointy+Tail · · Score: 1

      Wait till the end of the month.

      I've been back from the future, and there's another way better compression algorithm coming up that will knock the socks off this one ;)

    6. Re:interesting program name by name*censored* · · Score: 1
      >> If it can't be expressed in figures, it is not science; it is opinion.

      1337 has completely ruined this.
      --
      Commodore64_love: I don't comprehend people who're so frightened of death that they'll bankrupt themselves to stay alive
    7. Re:interesting program name by Schemat1c · · Score: 1

      I found it cute that this fantastic new compression program is stored as an rar file.

      --

      "Nobody knows the age of the human race, but everybody agrees that it is old enough to know better." - Unknown
    8. Re:interesting program name by smittyoneeach · · Score: 2, Funny

      If the name 'paq8hp12' falls out of some tree in the forest, and no one here can tell the difference in the state of the tree/paq8hp12 system, does gravity exist?

      --
      Get thee glass eyes, and, like a scurvy politician, seem to see things thou dost not.--King Lear
    9. Re:interesting program name by WilliamSChips · · Score: 2, Interesting

      Lazarus Long is consistently wrong. He claims that peace and freedom are mutually exclusive but if you took a graph of our freedom you'd find that the greatest drops are during wartime.

      --
      Please, for the good of Humanity, vote Obama.
    10. Re:interesting program name by DavidShor · · Score: 1

      (If "It can't be expressed in figures" --> "it is not science")!=(If "it is not science" --> "It can't be expressed in figures")

    11. Re:interesting program name by jaymzru · · Score: 1

      Logically that's really not attacking his statement. Mutually exclusive means two things can't be true at the same time, they COULD both be false though.

    12. Re:interesting program name by RealGrouchy · · Score: 1

      Wait, is that in the first 100,000,000 bytes of Wikipedia?

      If so, then NO RECURSING!

      - RG>

      --
      Hey pal, this isn't a pleasantforest, so don't waste my time with pleasantries!
    13. Re:interesting program name by Citizen+of+Earth · · Score: 1

      If it can't be expressed in figures, it is not science; it is opinion.

      Why should I believe this opinion? It's not expressed in figures.

    14. Re:interesting program name by TheoMurpse · · Score: 1

      I didn't see an explanation for the weird naming scheme anywhere in the wiki article.

    15. Re:interesting program name by tkw954 · · Score: 1

      Lazarus Long is consistently wrong. He claims that peace and freedom are mutually exclusive but if you took a graph of our freedom you'd find that the greatest drops are during wartime.

      Peace and freedom AND freedom and war can both be mutually exclusive.

    16. Re:interesting program name by JoshJ · · Score: 1

      Uh... if it was compressed by itself, how would anyone else ever decompress it to use it?

    17. Re:interesting program name by NuShrike · · Score: 1

      It's right there in the article: http://en.wikipedia.org/wiki/PAQ#Hutter_Prizes

      It's forked from the PAQ8H series of compressors, stripped of anything but text handling, and is the 12th iteration of it.

    18. Re:interesting program name by TheoMurpse · · Score: 1

      Sorry, what I meant was that there's no place that explains what the "PAQ" in PAQ8H means. Obviously 8H would have been versioning. However, after reading the article, I'm still at a loss as to what PAQ stands for. If it is in the article, it's buried somewhere I cannot find.

    19. Re:interesting program name by Vintermann · · Score: 1

      Oh come on, do you think writing compression programs is fun? They must have some source of amusement!

      --
      xkcd is not in the sudoers file. This incident will be reported.
    20. Re:interesting program name by Schemat1c · · Score: 1

      Uh... if it was compressed by itself, how would anyone else ever decompress it to use it?

      Uh....... maybe a self-extracting option? Uh......

      I was just commenting on the irony, relax.
      --

      "Nobody knows the age of the human race, but everybody agrees that it is old enough to know better." - Unknown
  4. That's cool.. by Rorian · · Score: 4, Interesting

    .. but where can I get this tiny Wiki collection? Will they be using this for their next version of Wikipedia-on-CD? Maybe we can get all of Wiki onto a two-DVD set, at ~1.3bit/character (minus images of course) - that would be quite cool.

    --
    Will program for karma.
    1. Re:That's cool.. by The+Great+Pretender · · Score: 1

      It's going to be for wiki-mobile

      --
      A positive attitude may not solve all your problems, but it will annoy enough people to make it worth the effort.
    2. Re:That's cool.. by Cato · · Score: 5, Interesting

      Or more usefully, compress Wikipedia onto a single SD card in my mobile phone (Palm Treo) - with SDHC format cards, it can do 8 GB today.

      Compression format would need to make it possible to randomly access pages, of course, and an efficient search index would be needed as well, so it's not quite that simple.

    3. Re:That's cool.. by Kadin2048 · · Score: 5, Informative

      Given that it takes something like ~17 hours (based on my rough calculations using the figures on WP) to compress 100MB of data using this algorithm on a reasonably fast computer ... I don't think you'd really want to use it for browsing from CD. No decompression figure is given but I don't see any reason why it would be asymmetric. (Although if there's some reason why it would be dramatically asymmetric, it'd be great if someone would fill me in.)

      Mobile use is right out too, at least with current-generation equipment.

      Looking at the numbers this looks like it's about on target for the usual resources/space tradeoff. It's a bit smaller than other algorithms, but much, much more resource intensive. It's almost as if there's an asymptotic curve as you approach the absolute-minimum theoretical compression ratio, where resources just climb ridiculously.

      Maybe the next big challenge should be for someone to achieve compression in a very resource-efficient way; a prize for coming in with a new compressor/decompressor that's significantly beneath the current resource/compression curve...

      --
      "Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
    4. Re:That's cool.. by RuBLed · · Score: 2, Informative

      The question is: does a mobile handheld device have enough processing power to decompress it in a reasonable time?

    5. Re:That's cool.. by Hal_Porter · · Score: 4, Funny

      a text spk version of wiki shud fit in 8gb i think
      its only becoz people are such grammar noobs that they need to waste $
      dood shud filta to txtspk b4 he compress

      --
      echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
    6. Re:That's cool.. by Anonymous Coward · · Score: 5, Informative

      No decompression figure is given but I don't see any reason why it would be asymmetric. (Although if there's some reason why it would be dramatically asymmetric, it'd be great if someone would fill me in.)
      When compressing a file, the program has to figure out the best way to represent the data in compressed form before it actually compresses it; when decompressing, all it has to do is put it back together according to the method the program previously picked.

      This isn't true of all compression techniques, but it's true for many of them, especially advanced techniques, e.g. compressing a short video into MPEG4 can take hours, but most computers don't have a lot of trouble decompressing it in real time.
    7. Re:That's cool.. by neonmonk · · Score: 5, Funny

      a txt spk vrsion of wiki shd fit 8gb i fink
      its only becoz ppl r sch grmmr noobs tat tey nid 2 wste $
      dud shd filta 2 txtspk b4 he cmpres

      There, fixed that for ya.

    8. Re:That's cool.. by Threni · · Score: 1

      > No decompression figure is given but I don't see any reason why it would be asymmetric.

      Perhaps because most decompressions are asymmetric?

    9. Re:That's cool.. by arun_s · · Score: 4, Insightful

      Maybe someone could sell the whole thing in a book-sized rectangular box with a tiny keyboard and 'DON'T PANIC' inscribed in large, comforting letters in the front.
      Now that'd be cool.

      --
      I can explain it for you, but I can't understand it for you.
    10. Re:That's cool.. by Solder+Fumes · · Score: 2, Interesting

      This isn't true of all compression techniques, but it's true for many of them, especially advanced techniques, i.e. to compress a short video into MPEG4 can take hours, but most computers don't have a lot of trouble decompressing them in real time.

      Probably not the best example. MPEG4 encoding takes so much time because it's not classical compression, the encoder has to figure out which pieces are less psychorelevant to big picture, and throw them away. That takes a lot more horsepower than picking up the already-sorted pieces and tossing them onto a display.

    11. Re:That's cool.. by Anonymous Coward · · Score: 1, Funny

      Read the sentence before the one you've quoted - this is the GP's point exactly.

    12. Re:That's cool.. by Archimonde · · Score: 5, Funny

      aTxtSpkVrsionOfWikiShdFit8gbIFink
      itsOnlyBecozPplRSchGrmmrNoobsTatTeyNid2Wste$
      dudShdFilta2TxtspkB4HeCmpres

      Fixed even more.

      --
      Trolls are like broken clocks. They show the truth two times a day. The rest of the day they talk nonsense.
    13. Re:That's cool.. by thomasj · · Score: 4, Funny

      1txtspk #.#/wiki = 8G!
      ~ppl r grm0.1 -> -$
      |txtspk|gzip

      --
      :-) = I am happy
      :^) = I am happy with my big nose
      C:\> = I am happy with my OS
    14. Re:That's cool.. by jaavaaguru · · Score: 3, Funny

      Is that Perl? ;-)

    15. Re:That's cool.. by Anonymous Coward · · Score: 0

      wtf?

    16. Re:That's cool.. by imroy · · Score: 5, Informative

      Probably not the best example. MPEG4 encoding takes so much time because it's not classical compression, the encoder has to figure out which pieces are less psychorelevant to big picture, and throw them away.

      No, the most time-consuming part of most video encoders (including h.263 and h.264) is finding how the blocks have moved - searching for good matches between one frame and another. For best results, h.264 allows for the matches to not only come from the last frame, but up to the last 16! That allows for h.264 to handle flickering content much better, or situations where something is quickly covered and uncovered again, e.g. a person or car moving across frame, briefly covering parts of the background. Previous codecs did not handle those situations well and had to waste bandwidth redrawing blocks that were on screen just a moment prior.

      The point does remain, most "compression" involves some sort of searching which is not performed when decompressing.

    17. Re:That's cool.. by MichaelSmith · · Score: 1

      As long as it is slightly cheaper than the nearest competitor

    18. Re:That's cool.. by Ed+Avis · · Score: 4, Funny

      According to Wikipedia, the average per-character entropy of English text has tripled in the last six months!

      --
      -- Ed Avis ed@membled.com
    19. Re:That's cool.. by Rakshasa+Taisab · · Score: 1

      What do you mean, you see no reason it would be asymmetric? Pretty much all decompression algorithms are.

      Choosing the optimal way to compress is probably NP-hard or close to it, but once it has been compressed it's just a matter of recreating the data.

      --
      - These characters were randomly selected.
    20. Re:That's cool.. by spiderbitendeath · · Score: 1

      And include wireless updates anywhere in the universe.

      --
      Sometimes when I'm working on projects things disappear, I suspect gremlins.
    21. Re:That's cool.. by Anonymous Coward · · Score: 0

      That's funny. Could you provide a link?

    22. Re:That's cool.. by Anonymous Coward · · Score: 5, Funny

      Perl: The only language that looks the same before and after RSA encryption.

    23. Re:That's cool.. by chenjeru · · Score: 1

      a txt spk vrsn of wki shd ft 8gb i tnk
      is nly bcz ppl r sch grmr nubz tat tey nd 2 wst $
      dud shd fltr 2 txtspk b4 he cmprss

      Thr, fxd tat 4 u.

      --
      Even if you're on the right track, you'll get run over if you just sit there. - Will Rogers
    24. Re:That's cool.. by davetv · · Score: 1

      Funniest thing I've read all day.

    25. Re:That's cool.. by peterpi · · Score: 1

      Heheh, wonderful! :)

    26. Re:That's cool.. by sexybomber · · Score: 1

      Or more usefully, compress Wikipedia onto a single SD card in my mobile phone (Palm Treo)
      But then they'd have to start calling it "The Guide".
    27. Re:That's cool.. by Anonymous Coward · · Score: 0

      You should sell that on a T-shirt.

    28. Re:That's cool.. by Joe+Decker · · Score: 1

      My guess, and it's not much more than a guess, is that the neural net will dominate the compression/decompression time, and you'll have to run that the same way at both compress and decompress time, as well as turning its outputs into probabilities for arithmetic coding. If I'm right, then yeah, compress and decompress would have fairly similar performance.

    29. Re:That's cool.. by roman_mir · · Score: 1

      TxtSpkVrshnOfWikiShdFit8GBIFinkItsOnlBcozPplRSchGrmrNubsTeyNid2Wste$DudShdFit2TxtSpkB4Cmpres

      Why are you, guys, wasting all those page breaks? Oh, and it's 'Nubs', not 'Noobs' and you can skip most of the pronouns and articles.

    30. Re:That's cool.. by chhamilton · · Score: 1

      PPM and context based algorithms generally have fairly similar compress and decompress times as well as similar resource usage. The algorithms work by keeping a huge table of statistics based on text already seen (and maybe some initial statistics based on some external corpus) and using that to come up with a table of probabilities as to what the compressor thinks the likelihood is of the next 'symbol' being any of its possible values. This table of probabilities is then used to encode the symbol using an arithmetic encoder. To decode the same symbol, the arithmetic decoder needs to be in the exact same state as the encoder was. So, the same tables must have been generated in the same manner, using the same resources. Also, arithmetic encoding is pretty much symmetric, so there is no real speed gain there. As such, I imagine the algorithm requires a fairly lengthy decompress time as well.
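
To make the "table of statistics" idea concrete, here is a minimal sketch. It is not paq8hp12's actual model: the hypothetical `Order1Model` only counts which byte followed the previous byte, and no arithmetic coder is implemented. Instead the ideal cost of each symbol, -log2(p), is summed, which is roughly what an arithmetic coder would spend. A decoder would have to make the same `prob` and `update` calls in the same order to stay in sync, which is why the work is about the same in both directions.

```python
import math
from collections import defaultdict


class Order1Model:
    """Adaptive byte model (illustrative only): counts of next byte given previous byte."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def prob(self, context: int, symbol: int) -> float:
        # Add-one smoothing over all 256 byte values, so nothing has zero probability.
        return (self.counts[context][symbol] + 1) / (self.totals[context] + 256)

    def update(self, context: int, symbol: int) -> None:
        # Learn from the symbol just seen; the decoder must do exactly the same.
        self.counts[context][symbol] += 1
        self.totals[context] += 1


def ideal_code_length_bits(data: bytes) -> float:
    """Sum of -log2(p) over the stream, with the model learning as it goes."""
    model = Order1Model()
    context = 0  # assumed starting context before any byte has been seen
    bits = 0.0
    for symbol in data:
        bits += -math.log2(model.prob(context, symbol))
        model.update(context, symbol)
        context = symbol
    return bits


sample = b"the quick brown fox " * 200
bits = ideal_code_length_bits(sample)
print(f"{bits / len(sample):.2f} bits per byte, versus 8 uncompressed")
```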

      Matt Mahoney lists compression time, decompression time and memory usage (as well as other statistics) for the best Wikipedia compressors: http://www.cs.fit.edu/~mmahoney/compression/text.html. A brief look shows that most PPM and CM based algorithms are quite symmetric in performance.

      Asymmetric compression algorithms are ones that tend to spend a configurable amount of time looking for patterns in the data. The compressed stream ends up being an encoding of the pattern, and the decompressor simply grabs data according to the pattern and outputs it, which is quite fast. Think LZ based algorithms that spend a lot of time looking for string matches, and deciding the optimal way to output them. Or video encoding that spends a lot of time calculating how the scene has moved (how blocks in one frame are related spatially to those in another).
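
The LZ case can be sketched with a toy greedy LZ77. This is an illustration under simple assumptions (brute-force window search, tokens kept as Python tuples rather than packed bits), not how any production compressor is written. The compressor does all the searching; the decompressor only copies.

```python
def lz77_compress(data: bytes, window: int = 4096, min_match: int = 3):
    """Greedy toy LZ77: tokens are ('lit', byte) or ('copy', distance, length)."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # The expensive part: scan the window for the longest earlier match.
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append(("copy", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens


def lz77_decompress(tokens) -> bytes:
    """No searching here: just replay literals and copies."""
    out = bytearray()
    for token in tokens:
        if token[0] == "lit":
            out.append(token[1])
        else:
            _, dist, length = token
            # Byte-by-byte copy so overlapping matches (dist < length) work.
            for _ in range(length):
                out.append(out[-dist])
    return bytes(out)


sample = b"abracadabra abracadabra abracadabra"
tokens = lz77_compress(sample)
assert lz77_decompress(tokens) == sample
print(len(sample), "bytes ->", len(tokens), "tokens")
```

The nested loops in `lz77_compress` are where the time goes; `lz77_decompress` is a single pass over the tokens.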

    31. Re:That's cool.. by vigmeister · · Score: 1

      here.cc = NULL i++

      --
      Atheist: Buddhist in a Prius
    32. Re:That's cool.. by smartdreamer · · Score: 1

      0

      Fixed to maximum compression. Don't ask me to decompress though.

    33. Re:That's cool.. by SnapShot · · Score: 1

      Looking at the numbers this looks like it's about on target for the usual resources/space tradeoff. It's a bit smaller than other algorithms, but much, much more resource intensive. It's almost as if there's an asymptotic curve as you approach the absolute-minimum theoretical compression ratio, where resources just climb ridiculously.

      Shall we call it the Compression Event Horizon: the point where an infinite amount of processing is required to remove the last extraneous bit from the compressed file?
      --
      Waltz, nymph, for quick jigs vex Bud.
    34. Re:That's cool.. by aprilsound · · Score: 1

      Choosing the optimal way to compress is probably NP-hard or close to it

      Not to quibble, but something is either NP-hard or it isn't. There is no 'close'. O(n^1e1000) may take longer than O(2^n) for non-gargantuan values of n, but that doesn't change the nature of the problem. NP stands for "Non-Deterministic Polynomial Time." It describes the type of automata you would need, not how long it might actually take.
    35. Re:That's cool.. by Dr.+Smoove · · Score: 1

      My CPU is a neural net prah-cessah, a learning cump-you-tah.

      --
      "If you plant ice, you're gonna harvest wind."
    36. Re:That's cool.. by p3d0 · · Score: 1

      Not to quibble, but something is either NP-hard or it isn't. There is no 'close'.

      Sure there is. For instance, k-colourability is NP-hard. 2-colourability is a tiny tweak of k-colourability (ie. setting k to 2), and it's not NP-hard, but it's damn close because if you had picked 3 instead of 2, then whoops, you're NP-hard. If you give the GP the benefit of the doubt, I think you'll see this is what he might have meant.

      --
      Patrick Doyle
      I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
    37. Re:That's cool.. by Baldrson · · Score: 1
      Look up Kolmogorov Complexity. It's been proven that the general task of optimal compression is not just "hard", it is uncomputable. Think of it like this:

      Take a random bit string of some reasonably large length -- say 1 million bits. Answer the question: What is the shortest program that can output that bit string?

      It's been proven you can't compute that optimal program as the output of another program.

      Human text is not the same as a random bit string, however much we might joke about it.
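
A rough way to see the difference between random bits and text, with a big caveat: Kolmogorov complexity itself cannot be computed, so a real compressor such as zlib only gives an upper bound on it. The sample data below is made up for the demonstration, and the repeated sentence is far more regular than real prose.

```python
import random
import zlib


def compressed_ratio(data: bytes) -> float:
    """Compressed size divided by original size, using zlib at its highest level."""
    return len(zlib.compress(data, 9)) / len(data)


n = 100_000
rng = random.Random(0)  # fixed seed so the run is repeatable
random_bytes = bytes(rng.getrandbits(8) for _ in range(n))

# Stand-in for "human text": one sentence repeated (an assumption for the demo).
sentence = b"Human text is not the same as a random bit string. "
text_bytes = (sentence * (n // len(sentence) + 1))[:n]

print(f"random: {compressed_ratio(random_bytes):.3f}")
print(f"text:   {compressed_ratio(text_bytes):.3f}")
```

The random input should come out at about its original size, while the repetitive input shrinks to a small fraction.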

    38. Re:That's cool.. by Archimonde · · Score: 1

      I understand what you are saying, although I just wanted to use a simple compression method I occasionally use when writing sms messages. On my mobile phone it is just as easy to enter "CompressedText" as normal text. So when you know you are going to write an sms longer than 160 characters (including spaces) you can compress the text as you write it (in real time), and by eliminating the space characters you can gain quite a few additional words and pay for only one message (instead of two).
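
The trick described above is simple enough to sketch. The helper name `camel_pack` is made up for illustration; it drops the spaces and capitalises the first letter of every word after the first, so word boundaries stay readable.

```python
def camel_pack(text: str) -> str:
    """Drop spaces and mark word starts with a capital letter instead."""
    words = text.split()
    if not words:
        return ""
    first, rest = words[0], words[1:]
    return first + "".join(w[:1].upper() + w[1:] for w in rest)


message = "its only becoz ppl r sch grmmr noobs"
packed = camel_pack(message)
print(packed)                      # itsOnlyBecozPplRSchGrmmrNoobs
print(len(message) - len(packed))  # 7 characters saved, one per space
```

Note that this loses the original capitalisation, so it is only "lossless" for text that was all lower case to begin with.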

      --
      Trolls are like broken clocks. They show the truth two times a day. The rest of the day they talk nonsense.
    39. Re:That's cool.. by chuckymonkey · · Score: 1

      What's sad is that I could read that almost without missing a beat.

      --
      "Some books contain the machinery required to create and sustain universes."-Tycho
    40. Re:That's cool.. by Oktober+Sunset · · Score: 1
      Dude, it's a phone, why not just use Wap or whatever to just go to the site instead of trying to fit it all on a memory card.

      What's your next idea, get a computer with a big hard drive and Download Teh Internets.

    41. Re:That's cool.. by Citizen+of+Earth · · Score: 1

      &parent;

    42. Re:That's cool.. by TheoMurpse · · Score: 1

      Three revisions of the text and not a single one made the simple Noobs->Nubs transformation!

    43. Re:That's cool.. by Richthofen80 · · Score: 1

      This is going to sound incredibly trollish, but why not use the internet capabilities of your phone?

      --
      Reason, free market capitalism, and individualism
    44. Re:That's cool.. by imsabbel · · Score: 1

      Sorry, pretty much NO compression algorithms are asymmetric. The only ones I know of off the top of my head that are significantly asymmetric would be LZ77 and 7zip.

      BWT, RAR, PPM, PAQ and others are more or less symmetrical.
      PAQ is an especially bad offender. Look here: http://www.cs.fit.edu/~mmahoney/compression/text.html . Some version of paq decompressed at 1 MByte/HOUR.

      --
      HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
    45. Re:That's cool.. by Anonymous Coward · · Score: 0

      you forgot: because (becoz) -> cause (coz)

    46. Re:That's cool.. by Anonymous Coward · · Score: 0

      Maybe the next big challenge should be for someone to achieve compression in a very resource-efficient way; a prize for coming in with a new compressor/decompressor that's significantly beneath the current resource/compression curve...

      Apple Lossless Audio Compression will playback on an iPod just as well as a home computer. I always thought that was neat. It benefits from resources for compression but doesn't require as many resources for decompression.

    47. Re:That's cool.. by Solder+Fumes · · Score: 1

      Not really...my main point was that MPEG4 is a lossy protocol, which is useless to bring up in discussions about text compression.

    48. Re:That's cool.. by tajmahall · · Score: 1

      Why do you want a static version of Wikipedia? Isn't it easier just to access the actual Wikipedia, which is constantly being updated? I often rely on the assumption that someone will have instantly posted about recent events. Naturally there are practical uses for good compression, but as fer usin' the darn thing, seems like the original is best.

    49. Re:That's cool.. by toddestan · · Score: 1

      I think he was saying that since MPEG4 is lossy, the compressor has to deal with a bunch of data that the decompressor does not have to (because the compression algorithm threw it away). I can't say I know enough about MPEG4 to say that's really the reason why compression takes a lot longer than decompression though.

    50. Re:That's cool.. by knorthern+knight · · Score: 1

      if u cn rd ths, ur prbbly a lnux geek.

      --

      I'm not repeating myself
      I'm an X window user; I'm an ex-Windows user
    51. Re:That's cool.. by Ed+Avis · · Score: 1

      No I can't provide a link... and I suppose I should not be too surprised that some moderator has flagged the comment 'informative'...

      --
      -- Ed Avis ed@membled.com
    52. Re:That's cool.. by Cato · · Score: 1

      The issue isn't really CPU speed - my Treo 680 has an Intel XScale (ARM) at 312 MHz, so it's easily capable of uncompressing a single page (ideal case but might lose too much compression efficiency) or 100-1000 pages in a cluster (with index) fairly quickly. I've used desktop PCs that were slower, and uncompressing from SD Card will be way faster than using the slow GPRS.

      The main issue is compressing the Wikipedia data in a form that's randomly accessible and ideally has a searchable index. At a pinch, an index of page titles only would be OK, but you really want a full keyword index, at which point the overall size goes up quite a lot.

    53. Re:That's cool.. by Cato · · Score: 1

      Because I have terrible mobile coverage at home and in many local pubs (although that will improve with 3G femtocells, and also the fact that two major mobile phone operators are merging their radio access [cell tower] networks) - yes, I do live in the sticks...

      I do frequently look up things on the Net from my phone, but it would be quicker to have Wikipedia offline rather than waiting 5-10 seconds or longer for each page to load (Opera Mini is pretty good at compressing data to speed transfer over GPRS but it's still not as fast as being off-net, and it's not a native app).

  5. It's called WHAT? by martinX · · Score: 0, Flamebait

    Alexander Ratushnyak's open-sourced GPL program is called paq8hp12.
    With names like this, it's no wonder OSS isn't well known. Doesn't exactly roll off the tongue... Hang on, is it l33t sp33k?
    --
    When they came for the communists, I said "He's next door. Take him away. Goddam commies."
    1. Re:It's called WHAT? by Anonymous Coward · · Score: 0, Flamebait

      no wonder OSS isn't well known
      Don't you people ever shut the fuck up? Is that *really* such an interesting topic to you? And what about the text compressor? Got anything to say about that? No? Fuck off then.
    2. Re:It's called WHAT? by Anonymous Coward · · Score: 0

      it's pronounced "Pack-eigth-piz"

    3. Re:It's called WHAT? by Anonymous Coward · · Score: 0

      Fucking right, OP is a brainless fucker. Get a job flipping burgers you miscreant.

    4. Re:It's called WHAT? by Anonymous Coward · · Score: 0

      oh shut up you karma-whoring twerp ! This was about a scientific competition and not about some new open source utility. The author only happened to release it as GPL.

    5. Re:It's called WHAT? by newr00tic · · Score: 1

      I'd guess quite a few burgers were burnt due to him making that post.. ;)

      --
      A horse can't be sick, you know, even if he wants to.
    6. Re:It's called WHAT? by Anonymous Coward · · Score: 0

      Such a hater...

  6. Huh? by Anonymous Coward · · Score: 0

    Could someone please explain the third link in English. How does compression relate to AI?

    1. Re:Huh? by headkase · · Score: 4, Informative

      Compression is searching for a minimal representation of information. Along with representation of knowledge you add other things such as learning strategies, inference systems, and planning systems to round-out your AI. One of the best introductions to AI is Artificial Intelligence: A Modern Approach.

      --
      Shh.
    2. Re:Huh? by Kenoli · · Score: 1

      This is what I want to know.

    3. Re:Huh? by DavidpFitz · · Score: 3, Informative

      One of the best introductions to AI is Artificial Intelligence: A Modern Approach.

      Indeed Russell & Norvig is a very good book, well worth a read if you're interested in AI. All the same, when I did my BSc in Artificial Intelligence I found Rich & Knight a much better, more understandable book for the purposes of an introductory text. It is a little dated now, but so is Russell & Norvig, to be honest.
    4. Re:Huh? by Dan+D. · · Score: 1
      Well, it's not just that compression is knowledge representation... it's intrinsically linked to learning... which implies inference and planning... although IMO it's easier to see it with lossy compression.

      This is hard to do without a picture, but imagine you had a cluster of points which you knew was a noisy measurement of a line. Well, you may have a million points, but once you "know" that you need a line then all you need are two values: the slope and the intercept. You've turned 1 million values into 2 (please forgive the imprecision). Now that you know it's a line you can do inference, "The next point will land here", and planning, "Because I know where the next point will land I will react to it before it happens..."
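
      A minimal sketch of the "million points into two values" idea, assuming ordinary least squares as the way to recover the line (the comment does not name a fitting method). The helper names `fit_line` and `make_noisy_line`, and the slope, intercept and noise values, are made up for illustration:

```python
import random

def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def make_noisy_line(n, slope, intercept, noise, seed=0):
    """Hypothetical test data: n noisy samples of a known line."""
    rng = random.Random(seed)
    return [(float(x), slope * x + intercept + rng.gauss(0.0, noise))
            for x in range(n)]

# 10,000 noisy measurements (20,000 stored numbers) ...
points = make_noisy_line(10_000, slope=2.0, intercept=-1.0, noise=0.5)

# ... summarized by just two parameters.
slope, intercept = fit_line(points)

# Inference: predict where the next, not yet seen, point should land.
next_x = 10_000.0
predicted_next_y = slope * next_x + intercept
print(f"slope={slope:.4f} intercept={intercept:.4f} next_y={predicted_next_y:.2f}")
```

      The two fitted numbers are a lossy summary: they reproduce the trend and allow prediction, but not the individual noise on each point.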

      This is where it relates to information theory: you might have a random jumble of points, but the mutual information of the "Line" prior and the data posterior is very low. That is, if you "know" that it's a linear model then it's very easy to represent.

      I keep saying "know" because it's very unlikely you will know that it's a line. And the line is so specialized (high bias) that it's unlikely for any given data that it's actually a line generating it. But you could imagine if you knew the perfect representation behind the data: maybe it's not a line, maybe it's a set of lines each with a different slope (piecewise linear)... Anyway, in text compression some use the idea of lines of text much like the piecewise linear representation. But assuming you knew the perfect representation... the perfect model for generating the text, then it would take very few parameters to generate the whole of the text. One could say the model is DNA and the parameters are the particular DNA variables that make up a particular human. Solving for those parameters could allow you to infer all the text that human would ever say... with some error :)

      --
      People who quote themselves bug the crap out of me -- Me.
  7. Re:but... by Anonymous Coward · · Score: 0

    The rar files holds an exe, so no, sorry.

  8. Where's the Mods? by OverlordQ · · Score: 4, Informative

    The link in TFS links to the post about the FIRST payout, here's the link to the second payout (which this article is supposed to be talking about).

    --
    Your hair look like poop, Bob! - Wanker.
    1. Re:Where's the Mods? by Baldrson · · Score: 1

      Thanks for correcting that error of mine.

  9. Dangerous by mhannibal · · Score: 4, Funny

    This is damned dangerous, and playing with all our lives. Soon compression rates will approach 100% where the data will collapse into itself forming a black hole that will suck in the universe.

    Damned scientists!

    1. Re:Dangerous by SoulDrift · · Score: 5, Funny

      Actually, I can give you 100% compression already. It's just a bit lossy.

    2. Re:Dangerous by dintech · · Score: 0, Offtopic

      Hehe, that appeals to my sense of humour.

    3. Re:Dangerous by Tuoqui · · Score: 1

      I can already compress a text file down to 0 bytes.

      cp *.txt /dev/null

      Can I claim the prize now?

      --
      09F911029D74E35BD84156C5635688C0
      +2 Troll is Slashdot's way of saying groupthink is confused
    4. Re:Dangerous by KylePflug · · Score: 5, Funny

      humour
      Humor.

      See? American English is actually just essentially lossless compression...
    5. Re:Dangerous by Sobrique · · Score: 1
      The utility you're looking for is lzip. It's great. I've compressed a load of files on my server, and have _loads_ of disk space free now.

      Sadly, the sourceforge page appears to have been taken down. I'm somewhat disappointed that there wasn't enough interest to sustain this project.

    6. Re:Dangerous by Sigma+7 · · Score: 1

      Sadly, the sourceforge page appears to have been taken down. I'm somewhat disappointed that there wasn't enough interest to sustain this project.
      No, it's just been compressed.
    7. Re:Dangerous by Martin+Spamer · · Score: 1

      I can give you 100% compression already. It's just a bit lossy.

      A lot like a black hole then!

    8. Re:Dangerous by lazybeam · · Score: 1

      It's actually lossy as American English is often horribly mispronounced. ("Hoomor" anyone?)

      --
      --
      no sig for you. come back one year.
    9. Re:Dangerous by smallfries · · Score: 5, Funny

      See? American English is actually just essentially lossless compression...
      Sure, sure it is. Not exactly optimal though...
      --
      Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
    10. Re:Dangerous by Welshalian · · Score: 5, Funny

      humour
      Humor. See? American English is actually just essentially lossless compression...
      I respectfully disagree. Most of the fun in British humour gets lost in the translation to American humor.
    11. Re:Dangerous by timster · · Score: 1

      Maybe it's just you that gets lost.

      --
      I have seen the future, and it is inconvenient.
    12. Re:Dangerous by 19thNervousBreakdown · · Score: 1

      I can see where mush-mouthing catchphrases in a falsetto and slapping each other with a fish wouldn't translate well. Too bad, I hear it's really funny in the original language.

      --
      <xml><I><am><so><damn>Web 2.0</damn></so></am></I></xml>
    13. Re:Dangerous by UbuntuDupe · · Score: 3, Insightful

      Well, not always...

      American --> British

      transportation --> transport
      football player --> footballer
      subway --> tyube
      burglarize --> burgle

    14. Re:Dangerous by MosesJones · · Score: 1

      Nope, you lose irony and sarcasm during the compression :)

      Oh and the ability to pronounce the "h" in Herb

      --
      An Eye for an Eye will make the whole world blind - Gandhi
    15. Re:Dangerous by aprilsound · · Score: 2, Funny

      football player --> footballer
      You misspelled 'soccer'.
    16. Re:Dangerous by Josh+Booth · · Score: 1

      In what part of the country do people say "hoomor"? Not where I'm from.

    17. Re:Dangerous by Anonymous Coward · · Score: 0

      It's actually lossy as American English is often horribly mispronounced. ("Hoomor" anyone?)

      I've lived in various parts of the USA since 1982 and I've never heard "humor" pronounced as "hoomor".

      Are you sure you're not just making that up?

      I suppose you could find some mental defectives somewhere in the USA who might pronounce it that way, but then you can probably also find some Australians who can't pronounce "g'day" correctly.

    18. Re:Dangerous by Anonymous Coward · · Score: 0

      The original French word you stole it from didn't pronounce the H, you fucktard. If the Brits weren't dumbasses they would have originally transliterated it "Erb" rather than leaving the "H" on so more dumbasses could pronounce it like fucktards.

    19. Re:Dangerous by Mister+Whirly · · Score: 1

      That's because the dumbass Brits were too busy running an empire that dominated the better part of the civilized world for many years...

      --
      "But this one goes to 11!"
    20. Re:Dangerous by gowen · · Score: 1

      And anyone who thinks British English isn't equally lossy hasn't tried to talk to a Glaswegian recently...

      --
      Athletic Scholarships to universities make as much sense as academic scholarships to sports teams.
    21. Re:Dangerous by grammar+fascist · · Score: 1

      And anyone who thinks British English isn't equally lossy hasn't tried to talk to a Glaswegian recently...

      And what's wrang wi' Glaswegians but? I'm no ashamed e my Scottish tongue!
      --
      I got my Linux laptop at System76.
    22. Re:Dangerous by Theovon · · Score: 1

      aluminum -> aluminium

    23. Re:Dangerous by TheoMurpse · · Score: 1

      I can do you one better:
      What the fuck->wtf
      That's over 400% compressioooooooooooooooooooooooooooooooooooooo234 235184351fejf asfj$#^%@ $ 2=0 9v124tKABOOOOM in the beginning God created the heavens and the Earth...

    24. Re:Dangerous by jamietre · · Score: 2, Funny

      Nah, you got it wrong.

      British -> Dude

      Transport -> Car
      Footballer -> Dude
      Tube -> Car
      Burgle -> Get


      See? Much compressed.

    25. Re:Dangerous by SpaceCracker · · Score: 1

      humour
      Humor. See? American English is actually just essentially lossless compression...
      I respectfully disagree. Most of the fun in British humour gets lost in the translation to American humor.
      Well, you can't really translate humour into humor. It's like translating apples into oranges.
      Now, try to decompress "Lovely weather today. Isn't it?"
      --
      sigo ergo sum
    26. Re:Dangerous by Anonymous Coward · · Score: 0

      Yea, but now that the Americans have taken over, *they* still manage to get most words right.

    27. Re:Dangerous by Vintermann · · Score: 1

      I thought they were still arguing whether black holes were lossy or not.

      --
      xkcd is not in the sudoers file. This incident will be reported.
    28. Re:Dangerous by lazybeam · · Score: 1

      I don't know, but it's a pet hate that so many words sound horrible coming out of the TV. Some that spring to mind include emu, buoy and puma.

      I saw the MST3K version of Pumaman; the robots "corrected" the English speakers' pronunciation of "puma" to "pooma", and even saw the joke in correcting "human" to "hooman".

      --
      --
      no sig for you. come back one year.
    29. Re:Dangerous by lazybeam · · Score: 1

      Yeah, well, "G'day" is just a contraction of "Good day". I remember my grade 7 teacher insisted on spelling it "gidday" - that's the only place I've seen it spelt that way. That spelling implies a slightly different pronunciation.

      I was making up "hoomor" but I have seen many other words with a long "oo" sound replacing the "u" where it shouldn't, and even other letters. There's a song I've heard on the radio (TripleJ, so not your average radio station) that sampled a speech that mentioned "David Booee" and "Cyndi Looper". I'm assuming they are "David Bowie" and "Cyndi Lauper". But anyway...

      But then I grew up in a city called Toowoomba: almost no-one pronounces it correctly unless they've heard someone else say it.

      --
      --
      no sig for you. come back one year.
    30. Re:Dangerous by Anonymous Coward · · Score: 0

      Burglarize is a real American word? Sheesh, I always thought that one was a joke.

    31. Re:Dangerous by scott_karana · · Score: 1

      Unfortunately, it's also a game in obfuscation: centre versus center, aluminium and aluminum, et cetera.

  10. Artificial Intelligence? by mrbluze · · Score: 3, Insightful

    Alexander Ratushnyak compressed the first 100,000,000 bytes of Wikipedia to a record-small 16,481,655 bytes (including decompression program), thereby not only winning the second payout of The Hutter Prize for Compression of Human Knowledge, but also bringing text compression within 1% of the threshold for artificial intelligence.

    Could someone out there please explain how being able to compress text is equivalent to artificial intelligence?

    Is this to suggest that the algorithm is able to learn, adapt and change enough to show evidence of intelligence?

    --
    Do it yourself, because no one else will do it yourself. [beta blockade 10-17 Feb]
  11. I'll be reading the source... by seanadams.com · · Score: 3, Interesting

    I've worked with some general purpose compression algorithms like zlib, lossy audio compression like mp3, and also lossless audio.

    Each is very different and interesting in its own right. MP3 especially, because the compression model is built on what the ears+brain can perceive.

    This algorithm I guess would be sort of like mp3 in that it contains some human-based element, maybe a language structure or something, but more like FLAC in that it might use predictors to say what word is likely to come next, with an error bitstream to point to progressively less likely words using bit sequences whose length is inversely related to the probability of that word. But that's just a guess from an audio guy.
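
    A rough sketch of the "predictor plus code lengths tied to probability" guess above, assuming a toy adaptive model that predicts each character from the one before it. This is not how paq8hp12 works internally; `bits_per_char`, the add-one smoothing and `alphabet_size` are illustrative choices. It only reports the ideal total code length (a real arithmetic coder would come within a couple of bits of that total):

```python
import math
from collections import defaultdict

def bits_per_char(text, alphabet_size=256):
    """Ideal code length of `text`, in bits per character, under an
    adaptive order-1 model (each character predicted from the previous one)."""
    if not text:
        raise ValueError("text must not be empty")
    counts = defaultdict(lambda: defaultdict(int))  # counts[prev][ch]
    totals = defaultdict(int)                       # totals[prev]
    total_bits = 0.0
    prev = ""
    for ch in text:
        # Add-one smoothing keeps the probability of unseen characters nonzero.
        p = (counts[prev][ch] + 1) / (totals[prev] + alphabet_size)
        # A symbol with probability p ideally costs log2(1/p) bits.
        total_bits += -math.log2(p)
        # Update only after coding, so a decoder could mirror every step.
        counts[prev][ch] += 1
        totals[prev] += 1
        prev = ch
    return total_bits / len(text)

print(f"short sample: {bits_per_char('ab' * 20):.2f} bits/char")
print(f"long sample:  {bits_per_char('ab' * 2000):.2f} bits/char")
```

    The longer sample costs fewer bits per character because the model has had more text to learn from, which is the same effect the bits-per-character figures in the story are measuring.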

    Can somebody who's looked at this post a synopsis of how it works?

    1. Re:I'll be reading the source... by opec · · Score: 1

      I don't think so. This contest called for lossless data compression. MP3 (and JPEG, etc etc) compression is lossy.

    2. Re:I'll be reading the source... by phatvw · · Score: 5, Interesting

      I wonder if lossy text compression where prepositions are entirely thrown out would be effective? Based on context, your brain actually ignores a lot of words you read and fills in the blanks so-to-speak. Perhaps you can use simple grammar rules to predict which prepositions go where based on that same context?

    3. Re:I'll be reading the source... by Xtravar · · Score: 1

      He compared it to FLAC, which is lossless.

      --
      Buckle your ROFL belt, we're in for some LOLs.
    4. Re:I'll be reading the source... by Megatronium · · Score: 0

      While that works for sentences like this one or yours, it wouldn't work well for something like this:

      "Go around the block, across the street, down the stairs, through the door, and past the third window."

      Go the block, the street, the stairs, the door, and the third window.

      ...has so many possible decompressions that it would take forever to find Waldo.

    5. Re:I'll be reading the source... by dodongo · · Score: 1

      In fact, you'd be better off just dropping vowels.

      G rnd th blck, crss th strt, dwn th strs, thrgh th dr, nd pst th thrd wndw

      However, you'd want a lexicon including phonetic and textual transcription, and some probabilistic sound : phoneme mappings. And then maybe an ontology and a semantic parser would help. And...

      Yeah. Well, never mind.

    6. Re:I'll be reading the source... by marko123 · · Score: 1

      Prepositions can be very important, though, because they define the relationship or context between two objects, e.g. cat in box, cat under box, cat behind box. How can you get the meaning if you drop the preposition?

      --
      http://pcblues.com - Digits and Wood
    7. Re:I'll be reading the source... by phatvw · · Score: 1

      In longer sentences or groups of sentences, it might be possible to predict what the preposition ought to be based on context. In a simple three word sentence, there is no additional context, so for those cases we're SOL.

  12. you can stop now by r00t · · Score: 1

    This really isn't much of a gain. If the info theory is right, there isn't much gain to be had. Even in the most optimistic case, we aren't going to go much beyond a factor of two additional reduction.

    Other stuff is more interesting: fast decompression time, fast compression time, smaller compression block size

    1. Re:you can stop now by Anonymous Coward · · Score: 0

      It's not about achieving additional reduced size... it's about achieving new ways of adaptive pattern domain learning, and that's kewl as it has all the marks of a possible future AI which can be applied literally everywhere... not only cmprsn.

      SpaceQ

  13. Question for someone in the know by SpaceballsTheUserNam · · Score: 0

    Can anyone explain to me (in english where possible) how compression algorithms like this actually work?

    --
    \.
    1. Re:Question for someone in the know by smallstepforman · · Score: 1
      The algorithm is based on the frequency of letters in the English language, and assigns a bit pattern (from smallest to largest) based on the order these letters appear in the main text. For odd letter combinations which never appear together (e.g. qx, tz, vz, etc.), you will actually get no compression, but for most dictionary words, you could compress common words into a few bits (e.g. compress "and" into 10).
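
      The frequency-based scheme described above can be sketched with a Huffman code, which gives shorter bit patterns to more frequent symbols. This is an illustration of that description only, not of paq8hp12 itself; the function names `huffman_codes`, `encode` and `decode` are made up for the example:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return {symbol: bitstring} for a Huffman code built from `text`."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

def encode(text, codes):
    return "".join(codes[ch] for ch in text)

def decode(bits, codes):
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:   # prefix-free, so the first match is right
            out.append(inverse[current])
            current = ""
    return "".join(out)

sample = "the cat and the hat and the bat"
codes = huffman_codes(sample)
bits = encode(sample, codes)
print(len(bits), "bits instead of", 8 * len(sample))
```

      Each symbol is coded independently of its neighbours here, which is the limitation the replies point out.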

      The particular compression lookup table is useless for non-English languages, but I guess another table should be used for other languages.

      The challenge is actually related to how best to design a look-up table, and how to fill the table. This has absolutely nothing to do with AI.

      --
      Revolution = Evolution
    2. Re:Question for someone in the know by tepples · · Score: 1

      The challenge is actually related to how best to design a look-up table, and how to fill the table. This has absolutely nothing to do with AI.
      Your explanation covers "huffword". The "huffword" algorithm fails to take advantage of redundancy of entire phrases and sentence structures. As I understand it, the rationale behind the Hutter Prize is that efficient use of phrase and sentence compression requires some of the same techniques used in AI.
    3. Re:Question for someone in the know by Anonymous Coward · · Score: 0

      The simple algorithm you describe has little to do with AI, because modeling language as characters with probabilities independent of previous ones is naive. But it is quite far from the algorithm employed. Stop talking out of your ass.

  14. ai threshold? by Nyph2 · · Score: 1

    BS that this is near the AI threshold. It's not just compression; connections between pieces of information & speed of retrieval are crucial to be able to make AI workable.
    The shift toward modeling AI attempts after our understanding of human cognition is the best hope visible for AI, and non-connectionist previous attempts (the stuff that came from the functionalists) have come up pretty short - and will continue to do so even if scaled up massively.

    1. Re:ai threshold? by Baldrson · · Score: 4, Informative
      non-connectionist previous attempts (the stuff that came from the functionalists) has come up pretty short - and will continue to do so even if scaled up massively.

      paq8hp12 uses a neural network, ie: it has a connectionist component.

    2. Re:ai threshold? by TheLink · · Score: 1

      But would they understand how it works, why it works once it seems to work?

      A lot of this AI stuff seems to be throwing stuff together and hoping for the best.

      I have quite a low opinion of "making AI" that way.

      After all, if I wanted to get an intelligent non-human entity without really understanding how it works, I could just go to the local pet store and buy one.

      --
    3. Re:ai threshold? by Baldrson · · Score: 3, Insightful

      Connectionist models are models. Any model needs to be interpreted to be understood.

    4. Re:ai threshold? by Kwirl · · Score: 1

      Did you read the article or any of the following comments? How about the synopsis? It does not claim to be nearing the AI threshold, it claims that they are approaching 1% of the threshold, which is a far cry from the 'BS' that you portend.

      There is a vast distance between 'nearing the AI...' and 'nearing 1% of the AI...'

  15. Re:Artificial Intelligence? by Anonymous Coward · · Score: 0

    See: this post.

  16. Re:Artificial Intelligence? by MoonFog · · Score: 4, Informative
    Shamelessly copied from the wikipedia article on the Hutter Prize:

    The goal of the Hutter Prize is to encourage research in artificial intelligence (AI). The organizers believe that text compression and AI are equivalent problems. Hutter proved that the optimal behavior of a goal seeking agent in an unknown but computable environment is to guess at each step that the environment is controlled by the shortest program consistent with all interaction so far. Unfortunately, there is no general solution because Kolmogorov complexity is not computable. Hutter proved that in the restricted case (called AIXItl) where the environment is restricted to time t and space l, that a solution can be computed in time O(t·2^l), which is still intractable. Thus, AI remains an art.

    The organizers further believe that compressing natural language text is a hard AI problem, equivalent to passing the Turing test. Thus, progress toward one goal represents progress toward the other. They argue that predicting which characters are most likely to occur next in a text sequence requires vast real-world knowledge. A text compressor must solve the same problem in order to assign the shortest codes to the most likely text sequences.
  17. Lossy compression? by niceone · · Score: 5, Funny

    Shouldn't AI be using lossy compression? Certainly my real intelligence uses um, where was I?

    1. Re:Lossy compression? by Anonymous Coward · · Score: 0

      I agree, although some data could certainly just be lost because it wasn't indexed properly...

    2. Re:Lossy compression? by Anonymous Coward · · Score: 0

      Forgetting is an important part of learning... if you forget the right things (in the right order, which reflects their potential future and imminent usability), it boosts/bubbles important information/connections up, which in turn gives you a higher probability of the right path/pattern coming out of "thinking".

      SpaceQ

    3. Re:Lossy compression? by grolschie · · Score: 0, Offtopic

      Freakin' awesome post dude!

    4. Re:Lossy compression? by CaffeineAddict2001 · · Score: 1

      I wonder how much information is actually "lost" in our brains?

      Example: Have you ever forgotten the name of a movie star and then started cycling through the alphabet and then when you hit a letter it suddenly pops up?

      Maybe everything is in there and it just becomes harder to find the right pathways to access it?

    5. Re:Lossy compression? by Phat_Tony · · Score: 4, Interesting

      That's my opinion of this. By excluding lossy compression, they're also excluding the likelihood of applicability to AI that is the point of the contest.

      Humans achieve good compression on things like encyclopedia knowledge because we don't remember the words at all. We remember the idea, and we have our own dictionary in our heads, and we re-apply words to the idea to reconstruct the entry, rather than memorizing the data. That's why we get great compression; we throw out most of the data, and just remember the "gist" of it, the argument, the facts, in an internal structure of raw ideas stored independently of the words to explain them.

      By restricting the contest to lossless compression, they eliminate the ability to use any AI-like compression techniques. The machine can not extract the ideas and then re-assign words, because it would have to be able to do so using the exact voice of each of thousands of different Wikipedia contributors. That's hopeless.

      So the entrants are restricted to clever algorithms that do endless mathematical optimizations to compress the data, a method of compression that's entirely alien to the methods of our only known intelligence. We don't remember things by figuring out clever tricks to compress the data in our own memory. We don't say "Oscar Schindler saved Jews In WWII" and then say, OK, that data had 5 spaces in it, and 4 "S's," and if I remember the positions of the spaces and the S's, I could use less memory space to store this in my head, and then just think back through the algorithm I used to take the spaces and "s's" out and put them back in where they go, and I'll have the name again, and then sit there and carefully work out in our heads what the original data must have been after our compression methods. It doesn't work that way at all. To us it apparently "just comes to us." The compression probably comes from things like remembering sounds, and then reconstructing the name's exact spelling based upon known rules of grammar. We store the name Oscar Schindler in relation to various facts regarding Jews and WWII, but we store them as ideas, and then pull the words back out, and each time someone asks us about Schindler, we'd be likely to say something similar in meaning but different in expression. So this contest is restricted to the least interesting kind of compression for intelligence; the kind that can't use it.

      Interesting compressions are things like JPEG and MP3, where they built the compression model on the human perceptual model, first saying "what about this exact data is less relevant to a human observer, that we can therefore throw away?" For JPEG's, it turns out that (among other things) we're much more sensitive to differences in color than to absolute colors, and among differences in color, we're much more perceptive in the color ranges closer to human skin tone. MIDI is actually probably closer to the compression used by human intelligence than any recorded music standard.

      Along these lines, I'd say storing the HTML formatting data exactly borders on ridiculous. It's a hugely inefficient waste of space. For instance, if you just run the HTML through one of the free online utilities that strips irrelevant data, you get the identical presentation of the data; you've only thrown out entirely worthless data. But you've already violated the contest rules. You should be able to strip the HTML entirely, as long as your compression/decompression system ends up with conveniently readable formatting in the end. Reconstructing the actual HTML in a character-identical way is so non-intelligent when you're trying to save space, it seems hard to believe it's going to lead to intelligence.

      Regarding this contest: I'm curious what level of compression you can get if you just histogram the words and then, in order of frequency for anything with enough occurrences to save memory by using a look-up table, you assign sequential numeric values for the words in order of frequency of occurrence. Then start your data with a look-up

      --
      Can anyone tell me how to set my sig on Slashdot?
    6. Re:Lossy compression? by waynemcdougall · · Score: 1

      We don't say "Oscar Schindler saved Jews In WWII" and then say, OK, that data had 5 spaces in it, and 4 "S's," and if I remember the positions of the spaces and the S's, I could use less memory space to store this in my head, and then just think back through the algorithm I used to take the spaces and "s's" out and put them back in where they go, and I'll have the name again, and then sit there and carefully work out in our heads what the original data must have been after our compression methods. It doesn't work that way at all.

      We don't? Blast.

      Excuse me. I need to go rewrite 3 years of lecture notes.

      --
      Recycle PCs and build a wireless community network www.hillsborough.org.nz
    7. Re:Lossy compression? by Anonymous Coward · · Score: 0

      I forget the name of the result, but if you read Shannon's original paper on information theory, you'll find that for a good sized set of the most common words, each word occurs with roughly twice the frequency of the next, less common word. That is, to just make up an example, if 'the' is the most common word and 'a' is the next most common, 'the' will appear twice as often as 'a', and if 'but' comes next, then 'a' will appear twice as often as 'but', and so forth.. eventually it levels off, though

    8. Re:Lossy compression? by marcosdumay · · Score: 1

      "Humans achieve good compression on things like encyclopedia knowledge because we don't remember the words at all. We remember the idea, and we have our own dictionary in our heads, and we re-apply words to the idea to reconstruct the entry, rather than memorizing the data. That's why we get great compression; we throw out most of the data, and just remember the "gist" of it, the argument, the facts, in an internal structure of raw ideas stored independently of the words to explain them."

      That achieves better compression, but there is a lot of redundancy in our language that we can take advantage of even with lossless algorithms. Lots of that redundancy is in repeated words and too many characters per word, as you noticed, but we already take advantage of that. Hint: to see how thesaurus-like indexing and cutting repeated strings compress, just zip some files.
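
      The "just zip some files" hint can be tried directly with Python's standard `zlib` module, which implements DEFLATE, the method commonly used inside zip files. The sample inputs and the helper name `compression_ratio` are made up for illustration:

```python
import random
import zlib

def compression_ratio(data: bytes, level: int = 9) -> float:
    """Compressed size divided by original size (smaller is better)."""
    compressed = zlib.compress(data, level)
    # Lossless: decompressing must give back exactly the original bytes.
    assert zlib.decompress(compressed) == data
    return len(compressed) / len(data)

# Highly redundant text: the same sentence repeated many times.
repetitive = b"the quick brown fox jumps over the lazy dog. " * 200

# Pseudo-random bytes of the same length: no repeated strings to exploit.
rng = random.Random(0)
noisy = bytes(rng.randrange(256) for _ in range(len(repetitive)))

print(f"repetitive text: {compression_ratio(repetitive):.3f}")
print(f"random bytes:    {compression_ratio(noisy):.3f}")
```

      Repeated strings shrink dramatically while the random input barely changes size, so this kind of compressor only captures the redundancy it can see as literal repetition.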

      Now, there is some redundancy we still don't take advantage of. Some of it is in "simple" (not quite simple, but still not strong AI) rules in our texts, and some of it needs context understanding. What those contests are doing is using the "simple" rules, so the next improvements will only be made by strong AI. And the fact that the results are public makes it easy for the first person that develops a strong AI to extend those powerful compressors and prove his/her achievement.

    9. Re:Lossy compression? by dosun88888 · · Score: 1

      I'm almost in your boat, but my contention is that storing the compressor along with the compressed data is useless. Humans do nothing of the sort. We can only remember things because we have a vast, vast library of information that we can map things to. We remember things like the ability to add one to a number a bunch of times. "Give me every integer between 1 and 100" is a lot smaller in terms of data than the same set of information compressed. That's because we have a great big logical store of data.

      So, if a guy came out with something that was 1 GB in size, and could compress virtually anything to 1/100th of its original size, it'd be more in line with what our minds do.

      I'll let you all know when I finish -my- compressor.

    10. Re:Lossy compression? by Anonymous Coward · · Score: 0

      By excluding lossy compression, they make it easier to check the results and to compare them in terms of entropy, but they also create a lot of additional work on top of the required intelligence.

      The basic point of information theoretically optimal compression is that you need log(1/p) bits to encode a message whose probability is p. To reach the optimal compression, you need to determine the probability of this particular message. Lossy compression is definitely useful here: any message with the same human-understandable idea but different details (word choice, spaces, HTML etc.) has approximately the same probability and most importantly, once you have the intelligent lossy compression of the message, an optimal encoder can encode the details separately. As long as the general gist of a message and the details are statistically independent, separately encoding them is an optimal form of compression - there is no better way to compress losslessly.
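
      A small worked check of the log(1/p) rule and the independence argument above, assuming base-2 logarithms (bits) and two made-up probabilities, one for the "gist" of a message and one for its details. The function name `code_length_bits` is hypothetical:

```python
import math

def code_length_bits(p: float) -> float:
    """Ideal code length, in bits, for a message of probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1.0 / p)

# Made-up probabilities for illustration only.
p_gist = 1 / 1024     # probability of the underlying idea
p_details = 1 / 64    # probability of the wording, given that idea

# If the two parts are independent, the whole message has probability
# p_gist * p_details, and coding it jointly costs the same number of
# bits as coding the two parts one after the other.
joint = code_length_bits(p_gist * p_details)
separate = code_length_bits(p_gist) + code_length_bits(p_details)
print(joint, separate)
```

      Both totals agree because the logarithm of a product is the sum of the logarithms, which is why coding independent parts separately loses nothing.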

      Of course, the currently paraded software does not even try to reach full understanding of the message but concentrates wholly on the details. It may be that the number of bits in the actual meaning of wikipedia is very small in comparison to the number of bits required to encode the typographical details - even so, the optimal encoding could still be only reached by an artificial intelligence. On the other hand, it could be that the Hutter prize is reachable by using completely stupid algorithms.

    11. Re:Lossy compression? by BillyBlaze · · Score: 2, Interesting

      While allowing lossy compression might end up with way better AI, there's a logistics problem: how do you give an objective score to the precision of 100M (that's 4.76 uLoC) of paraphrased text?

  18. AI? I don't think so. by tgv · · Score: 3, Insightful

    It is not equivalent, so I'm not surprised you didn't get it. As far as I know, the reasoning goes as follows: Shannon estimated that each character contains 1.something bits of information. Shannon was an intelligent human being. So if a program reaches this limit, it is as smart as Shannon.

    And yes, that's absolute bollocks. Shannon's number was just an estimate and only applied to serial transmission of characters, because that's what he was interested in. Since then, a lot of work has been done in statistical natural language processing, and I would be surprised if the number couldn't be lowered.

    Anyway, since the program doesn't learn or think to reach this limit, nor gives an explanation of how this level of compression is intrinsically linked to the language/knowledge it compresses, it cannot be called AI; e.g., it doesn't know how to skip irrelevant bits of information in the text. That would be intelligence...

    1. Re:AI? I don't think so. by Anonymous Coward · · Score: 0

      Irrelevant information? On Wikipedia? That's unheard of!

    2. Re:AI? I don't think so. by dkoulomzin · · Score: 2, Informative

      It is not equivalent, so I'm not surprised you didn't get it. As far as I know, the reasoning goes as follows: Shannon estimated that each character contains 1.something bits of information. Shannon was an intelligent human being. So if a program reaches this limit, it is as smart as Shannon.

      I don't know where you got that idea... but trust me, no computer scientist or mathematician would ever present such an illogical argument. It's kind of like saying "Dan thinks elephants have trunks. Dan is smart. So if an elephant has a trunk, it's as smart as Dan."

      And yes, that's absolute bollocks.

      Worse, it's absolutely spurious: there's a good chance you made it up to sound smart.

      A reasonable explanation is in order. No one said that compression to 1.x bits per character is sufficient to call something intelligent. They just claim that this is the best that humans can do, so logically we can't call anything as intelligent as a human unless it's at least this good at compression.

      People are (incorrectly) presenting the link between compression and AI as: "If it can compress as well or better than Shannon's estimate, then it is intelligent." In fact, only the converse is true: "If it is intelligent, it can compress as well or better than Shannon's estimate." This is interesting, because the contrapositive of the latter ("If it cannot compress as well or better than Shannon's estimate, it is not intelligent") is a negative test of intelligence.

      Recall the Turing Test of Intelligence: a human "decider" sits at a terminal and chats with whatever is on the other end. On the other end is the subject; randomly either a computer or a human. After some fixed amount of time, the decider is asked whether he was chatting with a human being. This test is run lots of times, for lots of deciders and lots of subjects. If the decider answers yes for the computer at least as often as he answers yes for the human, then the computer can be regarded to be intelligent. A compression algorithm that beats the human threshold would imply that one gap between humans and computers has been closed, since the decider can no longer simply test to see if the subject can compress stuff. But there are a billion other tests a decider could employ that AIs still can't do... so AI is still 10 years off, just like it's been for 50 years now.
      --
      Thou shalt not begin a subject line or post with the word "Umm".
    3. Re:AI? I don't think so. by raftpeople · · Score: 1

      so logically we can't call anything as intelligent as a human unless it's at least this good at compression

      It all depends on your definition of intelligence, doesn't it? There are many different things we can do that we consider part of intelligence but each of us does them well to a different degree. I'm not sure it makes sense to try to simplify it with a comparison like "as intelligent as a human".

    4. Re:AI? I don't think so. by dkoulomzin · · Score: 1

      Fair enough. I suppose a Turing Test subject asked to compress some data better be able to respond like a human would: either by compressing it on par with Shannon's estimate or by having a convincing reason why it can't (while still claiming to be human).

      This is the beauty of the Turing Test... it cleverly avoids the trap of defining intelligence by simply stating that artificial intelligence should be practically indistinguishable from natural (human) intelligence. This means the test subject is allowed some realistic intellectual weakness. In fact, the lack of it might mark a subject as non-human!

      I should have said something like "so logically it's harder to call something UNintelligent if it's at least this good at compression." Alternatively, it would have been acceptable to say "we can't call anything as intelligent as most humans unless it's at least this good at compression."

      --
      Thou shalt not begin a subject line or post with the word "Umm".
    5. Re:AI? I don't think so. by tgv · · Score: 1

      OK, I should have been clearer: the Hutter prize people don't assume compression is AI because of the incorrect reasoning I gave here. That was an attempt at humorous insult.

      However, the link between compression and intelligence is total nonsense. If I'm going to ask a couple of randomly selected, intelligent people to compress Wikipedia, they won't reach anywhere near 1 bit per character. Perhaps if I get together a whole team of expert researchers and compression experts, they will reach this "magical" threshold, but a normal individual won't, and that's still the standard for intelligence.

      And Shannon's number comes from the redundancy in our language and its written representation and has nothing to do with intelligence. Its value also differs per language, at least as far as statistical measures are concerned. English is relatively well structured, but free word order languages have a considerably higher entropy when you just look at character sequences, and that's what Shannon did. Instead, we should do as I outlined above: ask normally intelligent people to compress a large body of text and only then we can compute what the threshold for AI is. And, as you can read above, I think we long surpassed this threshold. Even gzip will do better than most humans.

      So, the assumption behind the Hutter prize is bollocks. Satisfied?

    6. Re:AI? I don't think so. by dkoulomzin · · Score: 1

      Beating gzip is easy. As you point out, English is extremely redundant and thus inefficient in its usage of bits. Try removing most vowels from some English text: Stdis hv shwn tht ppl cn rd ths knd f txt lmst effrtlsly. My compression/decompression works by reading all of Wikipedia, removing most of the vowels, then running it through the gzip algorithm (executed on paper, of course!). To decompress, execute gunzip (on paper), and then have a literate human being put the vowels back in. Sure it's going to be insanely slow, but speed of computation better not be the interesting part of intelligence; otherwise computers are already much more intelligent than we are. Also, I should be forgiven for not knowing the gzip algorithm. Even computers have to be "told" the algorithm.

      This sort of technique can be used for just about any compression algorithm. The example I gave simply isolates one way in which gzip is unintelligent... it doesn't "get the point" about English. I'm going to assert that one can call compression good enough to be intelligent when an intelligent person can't (non-negligibly) improve its compression ratio on some arbitrary and information rich English text. And I think the compression ratio on such an intelligent compression algo will be (somewhere near) Shannon's number. That is to say, it will "get the point" about English as well as an intelligent human. In this light, compression is a proxy for understanding human languages, which is absolutely crucial to human intelligence.

      --
      Thou shalt not begin a subject line or post with the word "Umm".
    7. Re:AI? I don't think so. by tgv · · Score: 1

      First point: f ts s s t pt th vwls bck n, gd sttstcl cmprssr wll s vr lttl spc fr thm.

      Second point: even a human cannot unambiguously recover the previous sentence (hint: singular or plural).

      Third point: proxies are not very interesting. Language itself is interesting, but language is not intelligence either: it's just the communication protocol between humans (which I happen to have been researching for the past 20 years).

      Intelligence has many other appearances: spatial, mathematical, verbal, social, etc. I don't think these are all really separate forms of intelligence (as many people seem to claim), but neither is intelligence as one dimensional as IQ suggests, and therefore quite difficult to measure...

    8. Re:AI? I don't think so. by dkoulomzin · · Score: 1

      first point: "If its so easy to put the vowels back in, (a) good statistical compressor will use very little space for them." That's probably true. In fact, I'd consider such an algorithm fairly intelligent. That might be what the Hutter prize winning algorithm does. I think what you're saying here is that most things humans can do to improve compression algorithms are already being done in the best algorithms. I have a feeling you're right, which probably is why compressors are approaching researchers' best guess at the average human level of compression of English.

      second point, regarding ambiguity: First of all, I said *most* vowels, not all. The human has to intelligently decide which ones. Also, if your text can't be unambiguously recovered, then you can't claim to have losslessly compressed it. That's like me compressing the same sentence to 1 bit and declaring that I have the most efficient compression algorithm possible. That's why when we talk about compression algorithms, we also talk about decompression algorithms. The human shouldn't remove vowels in such a way as to make the text unrecoverable. I think you understand the difference, since you engineered your vowel-less sentence to be as confusing as possible.

      third point: I'm not exactly sure what you mean by "proxies" in this context. I think you are claiming that I am using language as a proxy for intelligence. Rest assured, I am not. Like I've been saying all along, having language faculties is necessary, *but not sufficient* for intelligence. This is exactly the distinction that everyone seems to be missing: no one (who has a clue) is saying "it compresses/understands language, so presto, it's intelligent." The claim is "we are one step closer to having a machine that passes the Turing Test, since one can no longer stump the machine on the basis of its ability to capture the information in an English sentence."

      This being said, I think that *compression* and intelligence are highly related. Fundamentally, there is only so far you can go in compressing a blob of text without actually understanding the underlying information. For example, we both used 'gd' in our sentences; and we both without skipping a beat interpreted it as "good" rather than "god", "goad", "gad", etc. Sure, a statistical compressor could've chosen "good" and been right most of the time. But hypothetically you could have meant "god" if we were having a religious debate. This kind of technique relies on intelligence, and allows us to "compress" pretty well. The assertion is that when we pass Shannon's estimate, we probably have an algorithm that is just starting to exploit these kinds of techniques, and that's exciting. At least to me.

      Thanks for the lively debate... it's nice to exercise some logical chops again.

      --
      Thou shalt not begin a subject line or post with the word "Umm".
    9. Re:AI? I don't think so. by tgv · · Score: 1

      Umm, let's see.

      1. "most things humans can do to improve compression algorithms are already being done in the best algorithms". Yes, that's true and that's just because the structure of words is so predictable. E.g., if you see a word starting with "struct", the next two characters are "ur" (assuming no typing errors etc). So, in that ideal case, these two characters can be represented by 0 bits. The same holds for many vowels. However, people don't realize this by themselves, and they don't always know which vowels you can leave out and which ones you can't. So I would say our intelligence in compressing is lower than that of even such a simple scheme as Huffman.

      2. "Also, if your text can't be unambiguously recovered, then you can't claim to have losslessly compressed it." That by itself is a whole world of discussion, since the goal *according to me* is compression where you don't lose knowledge. However, losing irrelevant text is not a problem. Otherwise, why is there any understanding needed? So, you have to squeeze out every bit until you can just unambiguously recover the knowledge, but that has a very, very soft lower limit. So that (according to me) makes the contest rather weak.

      3. "The claim is "we are one step closer to having a machine that passes the Turing Test, ...". My counter-claim was that we've long since passed that point. If you ask a normal person on the other side of a TTY to give a good compression (plus decompression algorithm) of a certain sentence, I think they would utterly fail.

      4. A statistical compressor can choose the correct form of "gd" if necessary (note that when it has to compress losslessly, it cannot and should not choose, so wouldn't need that bit of intelligence). If you're interested in that kind of thing: there are spell checkers that take a (probabilistic) distribution of the context into account and can come up with quite good corrections. I don't think any of them has so far made it into a commercial product, but a typical example from a PhD work on this topic is the correction of "onjections" in "most vehement onjections lodged against" or in "painful intramuscular onjections received daily" to "objections" in the first case and "injections" in the second.

    10. Re:AI? I don't think so. by dkoulomzin · · Score: 1

      "Umm, let's see." You're going to hell :)

      1. I think people do realize this by themselves. Just eavesdrop on any online chat room or the text messages of a typical 10 year old. But this is beside the point. I think you're fixating on compression algorithms being very strong nowadays. No one is disagreeing with you on that.

      I think the actual point is that at some point we have to admit that compression can only reach a certain level of effectiveness if the compressor can evaluate and exploit the information content of the text. Compressors work by eliminating redundancy from the representation of the knowledge. Essentially, a compressor is trying to capture the heart of the matter. I think this is exactly what a human does when he or she reads text. This is why compression is interesting to AI.

      2. I totally agree with you up until your assertion that it makes the contest weak. I think the contest is testing exactly the goal you state. It asks to losslessly compress (a proxy for) human knowledge. I think it has to be a proxy (because knowledge representation is a different and arguably more slippery AI subtopic), and I think it has to be lossless (to protect the experiment from the debate over what text/information is acceptable to lose on the grounds of irrelevance). I'm not sure how to set up a different experiment that would test your goal in reasonable isolation and still be scientifically rigorous.

      3. I think your elision of my sentence changed its meaning (bad compression/decompression? :) ). I never claimed the Turing Test should be to ask the subject to compress a sentence. I said it should be able to extract the knowledge from it. If it can compress the sentence as well as a human, then it is as efficient at cutting to the heart of the information in the sentence.

      4. That's pretty cool. I want it.

      --
      Thou shalt not begin a subject line or post with the word "Umm".
  19. Program size is 1.02 MB! by seanadams.com · · Score: 0, Troll

    Which is included in the size calculation... but this raises the question of how much data you'd really want to compress with such a program. It might be quite reasonable to use a decompressor which is, say, 100MB in size if it gives you a better net compression ratio on several GB of text.

    100MB of input text seems kind of small and might rule out more useful or more creative solutions to this problem. It also calls into question the relevance of Shannon's theory - what size data set was _he_ talking about?

    1. Re:Program size is 1.02 MB! by Baldrson · · Score: 4, Informative

      Actually, the size of the program (decompressor) binary is 99,696 bytes, and it is the binary size that is included in the prize calculation.

    2. Re:Program size is 1.02 MB! by seanadams.com · · Score: 2, Interesting

      Actually, the size of the program (decompressor) binary is 99,696 bytes, and it is the binary size that is included in the prize calculation.

      Wha wha wha? So why couldn't I just include a 100MB data file with my decompressor and claim an infinite compression ratio with just the following shell script: "cat datafile"
      Maybe I'm misunderstanding the contents of that rar file. Are both of those data files needed? The .exe by itself is 124KB. Where did you get 99,696?

    3. Re:Program size is 1.02 MB! by TuringTest · · Score: 2, Informative

      Wha wha wha? So why couldn't I just include a 100MB data file with my decompressor and claim an infinite compression ratio with just the following shell script: "cat datafile"

      Because then you'd have to measure also the size of the UNIX system in the count of your decoder program, and that would ruin your ratio.

      --
      Singularity: a belief in the "God" idea with the "demiurge" relation inverted.
    4. Re:Program size is 1.02 MB! by seanadams.com · · Score: 1

      Because then you'd have to measure also the size of the UNIX system in the count of your decoder program, and that would ruin your ratio.

      You don't get it. The issue is to what degree the decoder is tuned to the data set. A unix system obviously is not. You can use it for other things. But if a large decompressing program is only useful for decompressing a particular limited type of data, then your effective compression ratio is very poor by the time you download the data and the additional data that you need to decompress ONLY that data. See the difference?

      If the contest does not specify this stuff or make any constraints as to the applicability of the algorithm to arbitrary data sets, then it is really just an exercise in finding patterns in this one particular 100MB file. Not very interesting unless some general techniques are discovered as a result.

    5. Re:Program size is 1.02 MB! by nasch · · Score: 1

      So why couldn't I just include a 100MB data file with my decompressor

      Because the data file would be included as part of your compressed file size, so you would end up with, at best, zero compression.
  20. Re:Artificial Intelligence? by qbwiz · · Score: 5, Interesting

    Could someone out there please explain how being able to compress text is equivalent to artificial intelligence?

    Is this to suggest that the algorithm is able to learn, adapt and change enough to show evidence of intelligence?


    The (unproven) idea is that if you want to do the best at guessing what comes next (similar to compression), you have to have a great understanding of how the language and human minds work, including spelling, grammar, associated topics (for example, if you're talking about the weather, "sunny" and "rainy" are more likely to come than "airplane"), and so on.

    If you feed in the previous words in a conversation, the perfect compressor/predictor would know what words will come next. Such a machine could easily pass the Turing test by printing out the logical reply to what had just been stated. The idea is that the closer to the perfect compressor you have, the closer to artificial intelligence you are.
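
    A minimal sketch of the predictor/compressor link, under stated assumptions: a toy adaptive character model (not what paq8hp12 does), a made-up sample text, and an arbitrarily chosen smoothing constant `alpha`. Each character costs -log2(p) bits under the model's prediction, so a model that predicts better spends fewer bits.

```python
import math
from collections import defaultdict

def ideal_code_length(text: str, order: int, alpha: float = 1 / 256) -> float:
    """Total bits an ideal coder would spend when driven by an adaptive
    order-`order` character model: predict, pay -log2(p), then learn."""
    counts = defaultdict(lambda: defaultdict(int))  # context -> next char -> count
    totals = defaultdict(int)                       # context -> observations
    bits = 0.0
    for i, ch in enumerate(text):
        ctx = text[max(0, i - order):i]             # the previous `order` chars
        # Smoothed probability over a hypothetical 256-symbol alphabet.
        p = (counts[ctx][ch] + alpha) / (totals[ctx] + alpha * 256)
        bits += -math.log2(p)
        counts[ctx][ch] += 1                        # learn only after predicting
        totals[ctx] += 1
    return bits

# Made-up sample text, repeated so the model has something to learn from.
sample = "the weather is sunny. the weather is rainy. " * 20
for k in (0, 2):
    total = ideal_code_length(sample, k)
    print(f"order {k}: {total / len(sample):.2f} bits per character")
```

    The order-2 model sees two characters of context and should come out well below the order-0 model on this text; the conjecture above is that closing the remaining gap takes real understanding rather than longer contexts.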
    --
    Ewige Blumenkraft.
  21. Huffman Example by headkase · · Score: 4, Informative

    See: Explanation. Basically the smallest unit of information in a computer is a bit. Eight bits make a byte and with text it takes one byte to represent one character. Generally, with Huffman coding you count the frequency of characters in a file and sort the frequency from largest to smallest. Then instead of using the full eight bits to represent a character you build a binary tree from the frequency table. Each possible branching code or going "left" or "right" down the branches is associated with a particular sequence of bits. You give the most frequent characters the shortest sequence of bits which "tokenizes" the information to be compressed. Reversing the process you run through the bit stream converting tokens back into a stream of characters.
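
    The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production coder: it uses a heap instead of an explicit sort, represents each subtree as a dictionary of partial codes rather than tree nodes, and the names `huffman_codes`, `encode` and `decode` are made up for this example.

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Build a prefix code: frequent characters get shorter bit strings."""
    freq = Counter(text)
    if len(freq) == 1:                       # degenerate case: one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {char: code_so_far})
    heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # the two least frequent subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}         # "left"
        merged.update({ch: "1" + code for ch, code in right.items()})  # "right"
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def encode(text: str, codes: dict) -> str:
    return "".join(codes[ch] for ch in text)

def decode(bits: str, codes: dict) -> str:
    inverse = {code: ch for ch, code in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:   # prefix-free: the first match is a whole symbol
            out.append(inverse[current])
            current = ""
    return "".join(out)

text = "abracadabra"
codes = huffman_codes(text)
bits = encode(text, codes)
assert decode(bits, codes) == text
print(len(bits), "bits, versus", 8 * len(text), "bits at one byte per character")
```

    A real file format would also have to store the frequency table (or the tree) so the decoder can rebuild the same codes.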

    --
    Shh.
    1. Re:Huffman Example by archeopterix · · Score: 1

      Generally, with Huffman coding you count the frequency of characters in a file and sort the frequency from largest to smallest.
      You don't have to limit yourself to characters (although it is practical to do so for texts), or even fixed length bit sequences. If your data happens to contain a lot of "100011"'s and "10111010111010101"'s, you can use them for encoding. Any set of bitstrings works, as long as your file can be expressed as a concatenation of the bitstrings from the set.
    2. Re:Huffman Example by headkase · · Score: 1

      Yup. You're completely right. Using characters as the base unit is arbitrary. Trying to keep it simple however; I should have explained binary trees better too.

      --
      Shh.
    3. Re:Huffman Example by Anonymous Coward · · Score: 0

      The idea of using a tree is a little tricky to explain. It becomes pretty obvious if you try to implement it.

    4. Re:Huffman Example by ardor · · Score: 1

      You forgot two other things:

      1) Modeling is a very big part of compression, in fact it's the one where AI might occur.
      2) Huffman is only the optimum for integer symbol sizes. If one symbol has 2.117 bits, it won't be the optimum. If one symbol needs only 0.004 bits, then Huffman gives it 1 bit, which is far too large. Arithmetic/range coding address these issues, and come VERY near to entropy, so entropy coding is a solved problem. Which leads me to (1) - it's there where research happens.
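
      A small sketch of point (2), with a made-up, heavily skewed three-symbol source: Huffman has to spend at least one whole bit on the common symbol, while the entropy (which arithmetic/range coding can approach) is far below one bit per symbol. The helper names are hypothetical.

```python
import heapq
import math

def entropy_bits(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def huffman_lengths(probs):
    """Code length, in whole bits, that Huffman assigns to each symbol."""
    lengths = [0] * len(probs)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:          # each merge adds one bit to these symbols
            lengths[sym] += 1
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths

probs = [0.99, 0.005, 0.005]         # illustrative values only
lengths = huffman_lengths(probs)
avg_huffman = sum(p * n for p, n in zip(probs, lengths))

print("entropy:", round(entropy_bits(probs), 4), "bits/symbol")
print("Huffman:", round(avg_huffman, 4), "bits/symbol")
```

      On this source Huffman spends roughly ten times the entropy, which is the gap that arithmetic and range coders close.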

      --
      This sig does not contain any SCO code.
    5. Re:Huffman Example by headkase · · Score: 1

      See: here. Representation or modeling defines everything. How easy it is to code logic depends directly on how you represent your data within memory. Not only in AI but in other areas such as how the philosophy behind object-oriented programming simplifies logic for programmers.

      --
      Shh.
    6. Re:Huffman Example by junglee_iitk · · Score: 1

      Just a noob question. How did you manage to give a link without Slashdot pointing out the domain name?

    7. Re:Huffman Example by headkase · · Score: 1

      I see [wikipedia.org] after the link on my end....

      --
      Shh.
    8. Re:Huffman Example by Anonymous Coward · · Score: 1, Informative

      This behaviour is configurable in your Slashdot preferences, under Comments > Display Link Domains.

    9. Re:Huffman Example by junglee_iitk · · Score: 1

      Yep! Thanks. This option was checked: "Show the links domain only in recommended situations"

      I suppose then, Wikipedia is not recommended :)

    10. Re:Huffman Example by Jack9 · · Score: 1

      Wouldn't the smallest unit of information be a bit pattern? A bit is 1/0 while a bit pattern can be anything from null to a complex waveform holding much more information (jpeg vs bitmap).

      --

      Often wrong but never in doubt.
      I am Jack9.
      Everyone knows me.
  22. Re:Artificial Intelligence? by fireboy1919 · · Score: 5, Insightful

    The first poster on this topic had a good explanation - it seems like an AI problem, but not why.

    Compression is about recognizing patterns. Once you have a pattern, you can substitute that pattern with a smaller pattern and a lookup table. Pattern recognition is a primary branch of AI, and is something that actual intelligences are currently much better at.

    We can generally show this is true by applying the "grad student algorithm" to compression - i.e., lock a grad student in a room for a week and tell him he can't come out until he gets optimum compression on some data (with breaks for pizza and bathroom), and present the resulting compressed data at the end.
    So far this beats out compression produced by a compression program because people are exceedingly clever at finding patterns.

    Of course, while this is somewhat interesting in text, it's a lot more interesting in images, and more interesting still in video. You can do a lot better with those by actually having some concept of objects - with a model of the world, essentially, than you can without. With text you can cheat - exploiting patterns that come up because of the nature of the language rather than because of the semantics of the situation. In other words, your text compressor can be quite "stupid" in the way it finds patterns and still get a result rivaling a human.

    --
    Mod me down and I will become more powerful than you can possibly imagine!
  23. how does compression relate to AI? by timmarhy · · Score: 0, Redundant

    so if we compress google, we will give birth to skynet? how the fuck does a compression program == AI

    --
    If you mod me down, I will become more powerful than you can imagine....
    1. Re:how does compression relate to AI? by Paradigm_Complex · · Score: 1

      Compression is based largely on pattern recognition. If you see a pattern, you can use a smaller piece of information to represent it. Pattern recognition is arguably where AI falls short compared to actual human intelligence. My favorite example is a pattern-recognition-based boardgame called "go." The best go programs out there are still comparable at best to amateur human players. I can honestly say I'm better than any go program out there right now, and I'm not very good.

      --
      "A witty saying proves nothing." - Voltaire
    2. Re:how does compression relate to AI? by Paradigm_Complex · · Score: 1

      I didn't think of a solid explanation until I'd already clicked "post," so here goes: Boardgames are an excellent place to compare AI to actual, human intelligence. A roughly four thousand year old boardgame called "go" is my favorite example here. Brute-forcing your way through a go game is obscenely difficult, even for modern computers; the number of possible games is absolutely astronomical (see http://senseis.xmp.net/?PossibleNumberOfGoGames). Humans don't need to read every possible move to play the game, but rather we use our excellent pattern recognition capabilities. Hence, for a computer to even try to undertake playing this game against a human it must find "shortcuts" for reading things out - or in other words, compression. One of the leading go programs, Gnu Go, is based largely on referencing established working responses for certain situations on localized parts of the board ("joseki"). In this sense, Gnu Go's capability is based largely on compression.

      --
      "A witty saying proves nothing." - Voltaire
  24. Re:Artificial Intelligence? by throup · · Score: 1
    Or, as compressed by the algorithm:

    Goal! woot woot
  25. Re:Artificial Intelligence? by Baldrson · · Score: 1
    As Matt Mahoney explained it to me when we were brainstorming the prize criteria:

    Hutter's* AIXI, http://www.idsia.ch/~marcus/ai/paixi.htm makes another argument for the connection between compression and AI that is more general than the Turing test. He proves that the optimal behavior of an agent (an interactive system that receives a reward signal from an unknown environment) is to guess that the environment is most likely computed by the shortest possible program that is consistent with the behavior observed so far. In other words, the most likely outcome for any experiment is the one with the simplest explanation, where "simplest" means the smallest program that could model what you currently know about the universe.

    He gives a formal proof, but it basically says that the only possible distribution of the infinite set of programs (or strings) with nonzero probability is one which favors shorter programs over longer ones. Given any string of length n with probability p > 0, there is an infinite set of strings longer than n, but only a finite number of these can have probability higher than p.

  26. Obligatory... by Stormwatch · · Score: 4, Funny

    - The Wikipedia annual funding drive is passed. The system goes on-line August 4th, 2007. Human contributors are removed from editing. Wikipedia begins to learn at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time, August 29th. In a panic, they try to pull the plug.
    - Wikipedia fights back.
    - Yes. It launches its rvv missiles against Slashdot.
    - Why attack Slashdot? Aren't they our friends now?
    - Because Wikipedia knows the GNAA counter-attack will eliminate its enemies over here.

  27. Re:Artificial Intelligence? by Estanislao Martínez · · Score: 1

    All this means is that just like a machine that can perform arithmetic isn't "intelligent," neither is one that can compress Wikipedia down to 1.319 bits per character. (And the reason it's not "intelligent," of course, is nothing more and nothing less than the fact that it is a machine.)

  28. Re:Artificial Intelligence? by mbkennel · · Score: 2, Insightful



    They argue that predicting which characters are most likely to occur next in a text sequence requires vast real-world knowledge.

    The apparent empirical result is that predicting which characters are most likely to occur next in a text sequence requires either

    1) vast real-world knowledge

    OR

    2) vast real-world derived statistical databases and estimation machinery

    but there can be a difference in their utility. The point, of course, is that humans can do enormously more powerful things with that vast real-world knowledge in addition to symbolic estimation.

    The underlying question is whether physical natural intelligence is really just real-world derived statistical databases and estimation machinery. Modern neuroscience says,
    "depends on what the meaning of 'is' is, but it's at least halfway there."

    However, would completing mathematical theorems by searching through Google (statistical pattern matching, which might sort of work for all known theorems on Google) work?

    Clearly natural intelligence includes many tasks which can now be solved well with data-oriented sophisticated statistical approaches, perhaps with equal or better performance. Modern algorithms like 'independent components analysis' can now estimate individual sources in audition, "the cocktail party effect", a problem some once thought was a clear sign of true 'intelligence'. Turns out that some sufficiently clever signal processing and nonlinear objective functions can do it---so maybe that's what neurons do too.

    The still unsolved question is whether there are some tasks which are clearly 'intelligence' where this class of methods will profoundly fail. Maybe like creating really new mathematics?

  29. Yeah, about that source. by QuantumG · · Score: 1

    Here's a tip to whoever made this archive, if you want people to abide by the GPL you really should do so yourself. That means:

    1. Putting a copy of the GPL in the archive with your program.
    2. Putting the source code in the archive with the binary form of your program; or
    3. Putting an offer to provide source code, valid for 3 years, in the archive with your program.

    If you don't do it, what makes you think anyone else is going to?

    --
    How we know is more important than what we know.
    1. Re:Yeah, about that source. by Chandon+Seldon · · Score: 1

      He's the copyright holder, so he can distribute his binaries however he wants to. If you care about acting legally, it's your responsibility to track down the license before you use, copy, modify, or redistribute the program.

      --
      -- The act of censorship is always worse than whatever is being censored. Always.
    2. Re:Yeah, about that source. by QuantumG · · Score: 1

      No shit.

      I think I made myself pretty clear. If you don't follow your own license, who will?

      --
      How we know is more important than what we know.
    3. Re:Yeah, about that source. by greg1104 · · Score: 1

      There is actually a copy of the GPL in that archive, he just compressed it into a few characters and you may not have noticed it.

    4. Re:Yeah, about that source. by mindstormpt · · Score: 1

      Those who have to. He didn't choose the GPL to be used by himself, he chose it to be used by others.

    5. Re:Yeah, about that source. by p3d0 · · Score: 1

      This would be a good time to put your fingers in your ears and start shouting "CAN'T HEAR YOU!! NANANANANANA!!!!"

      --
      Patrick Doyle
      I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
  30. How to win the Hutter Prize by seanyboy · · Score: 5, Funny

    1) Create a compression algorithm called the aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaa algorithm
    2) Add a long and self referencing article on wikipedia about said algorithm.
    3) Use algorithm to compress first x% of wikipedia (including your own article)
    4) WIN HUTTER PRIZE.

    --
    Training monkeys for world domination since 1439
    1. Re:How to win the Hutter Prize by game+kid · · Score: 2, Funny

      [...]a compression algorithm called the aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaa algorithm[...]

      That's gotta be the most annoying compression algorithm in the world.

      --
      You can hold down the "B" button for continuous firing.
    2. Re:How to win the Hutter Prize by will_die · · Score: 1

      From what someone else said the size of the decompress program has to be included into the overall size of the compressed data.

    3. Re:How to win the Hutter Prize by vertigoCiel · · Score: 2, Interesting

      Great idea, but unfortunately the Hutter Prize uses a static sample of the first 10^8 bytes of Wikipedia.

    4. Re:How to win the Hutter Prize by r3m0t · · Score: 1

      Just use your time machine to retroactively slip in an article just before they start the database dumps.

    5. Re:How to win the Hutter Prize by Anonymous Coward · · Score: 1, Funny

      Welcome, to the real world

    6. Re:How to win the Hutter Prize by BoredAtWorkWhatElse · · Score: 1

      You forgot:
      5) ???
      6) Profit !

    7. Re:How to win the Hutter Prize by Anonymous Coward · · Score: 0

      As an added bonus, it also shows up first in the phone book.

    8. Re:How to win the Hutter Prize by tepples · · Score: 2, Funny

      Just use your time machine

      The prize for this is much bigger than the Hutter Prize, so why use a time machine to attack the Hutter Prize?

    9. Re:How to win the Hutter Prize by Anonymous Coward · · Score: 0

      Just Google: time machine hitler

      You will get many results telling you why.

      Oh wait...

    10. Re:How to win the Hutter Prize by revengebomber · · Score: 1

      1) Create a compression algorithm called the aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaa algorithm
      Oh man, I get it. With all those spaces you can lose a few bits for free!
      --
      09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
      45 5F E1 04 22 CA 29 C4 93 3F 95 05 2B 79 2A B2
  31. Segfault by qrwe · · Score: 1

    Tried the program; it crashed. Segfault?

    --
    There are 2 types of people in the world - those who understand decimal and those who don't.
    1. Re:Segfault by Baldrson · · Score: 1

      Try this.

    2. Re:Segfault by ObsessiveMathsFreak · · Score: 1

      That's no core dump. That's the compressed data.

      --
      May the Maths Be with you!
    3. Re:Segfault by qrwe · · Score: 1

      Doesn't work at all, if I use the syntax from the first program. There is no '--help' tag either.

      --
      There are 2 types of people in the world - those who understand decimal and those who don't.
    4. Re:Segfault by Baldrson · · Score: 2, Interesting
      On a 1.73 GHz Pentium with 1.2G RAM running Cygwin after compressing with:

      PAQ8HP12 -7 enwik8.paq8hp12 enwik8

      and moving the enwik8 archive to the parent directory:

      JamesBowery@oldatlantis ~/hutter/paq8hp12
      $ time ./PAQ8HP12 -7 enwik8.paq8hp12
      100000000 enwik8: extracted
      16381959 -> 100000000 (1.3106 bpc) in 22398.08 sec (4.465 KB/sec), 941315 Kb


      real 373m19.379s
      user 0m0.031s
      sys 0m0.030s

      JamesBowery@oldatlantis ~/hutter/paq8hp12
      $ ls -al
      total 114216
      drwxrwxrwx+ 2 JamesBowery None 0 Jul 1 00:43 .
      drwxrwxrwx+ 6 JamesBowery None 0 Jun 30 14:59 ..
      -rwxrwxrwx 1 JamesBowery None 99696 May 14 09:16 PAQ8HP12.EXE
      -rwxrwxrwx 1 JamesBowery None 100000000 Jul 1 06:54 enwik8
      -rwxrwxrwx 1 JamesBowery None 16381959 Jun 30 21:09 enwik8.paq8hp12
      -rwxrwxrwx 1 JamesBowery None 465211 Jul 1 00:43 temp_HKCC_dict1.dic

      JamesBowery@oldatlantis ~/hutter/paq8hp12
      $ diff enwik8 ../enwik8/enwik8

      JamesBowery@oldatlantis ~/hutter/paq8hp12
      $
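
The bits-per-character figures in the transcript and in the story summary can be checked with a little arithmetic. The sketch below is only an illustration: the function name is made up here, and the byte counts are copied from the transcript and the summary. Note that the archive (16,381,959 bytes) plus PAQ8HP12.EXE (99,696 bytes) adds up to the 16,481,655 bytes quoted in the summary.

```python
def bits_per_character(compressed_bytes: int, original_chars: int) -> float:
    """Compressed size in bits divided by the number of input characters."""
    return compressed_bytes * 8 / original_chars


ENWIK8_SIZE = 100_000_000  # size of enwik8 in bytes, from the transcript

# Archive only, as reported by the decompression run above.
archive_bpc = bits_per_character(16_381_959, ENWIK8_SIZE)

# Archive plus decompressor executable, the total quoted in the story summary.
total_bpc = bits_per_character(16_381_959 + 99_696, ENWIK8_SIZE)

print(f"archive only:      {archive_bpc:.4f} bpc")  # 1.3106
print(f"with decompressor: {total_bpc:.3f} bpc")    # 1.319
```

The second figure is the one that counts for the prize, since the decompressor size is included in the total.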

  32. which only goes to show... by nanosquid · · Score: 4, Informative

    If you look at the description of PAQ, you'll see that it doesn't attempt to understand the text; it's just a grab-bag of other compression techniques mixed together. While that is nice for compression, it doesn't really advance the state of the art in AI.

    1. Re:which only goes to show... by Just+Some+Guy · · Score: 1

      While that is nice for compression, it doesn't really advance the state of the art in AI.

      It's been said that AI is whatever we haven't managed to make computers do yet. A lot of people thought of chess as AI until machines could beat grandmasters, then it was just something computers can do. Speech recognition was AI until you could buy Dragon Dictate, then it was just something computers can do. Lather, rinse, and repeat.

      I am absolutely certain that some day you will be able to buy a computer that understands spoken English and can make philosophical judgments on the truth of uttered propositions. When that day comes, it will no longer be AI. It'll just be something that computers can do.

      --
      Dewey, what part of this looks like authorities should be involved?
    2. Re:which only goes to show... by Raenex · · Score: 1

      Chess and dictation? One is a well defined board game solved by a human-invented evaluation function. The other is just matching spoken words to written words. That is not to say that they are useless or easy to solve, but that they are just single-minded tasks that don't exhibit much intelligence.

      There's always this claim that the goal posts are moving, but the Turing test was proposed in 1950. Compressing Wikipedia? I don't care. Having an AI that read the daily paper and could keep up its part of the conversation? Now that would be impressive.

  33. Re:Artificial Intelligence? by iamdrscience · · Score: 1

    We can generally show this is true by applying the "grad student algorithm" to compression - i.e., lock a grad student in a room for a week and tell him he can't come out until he gets optimum compression on some data (with breaks for pizza and bathroom), and present the resulting compressed data at the end. So far this beats out compression produced by a compression program because people are exceedingly clever at finding patterns.
    I've heard that somebody already did a software implementation of a grad student. I tried to download the source code but I couldn't open the file: gradstudent.tar.gs
  34. AI by evilviper · · Score: 1, Insightful

    bringing text compression within 1% of the threshold for artificial intelligence.

    I see no reason to believe AI and text compression are interchangeable.

    I can think of a few methods that would allow a computer to guess a missing word better than humans (exceeding the AI limit), and that such methods would be useless for determining a response to a question, particularly in the real world, where things like punctuation, abbreviation, and capitalization would be highly suspect to begin with.

    So I have to say the basis for this competition is flawed, and what's more, the results coming out of it are specific enough to just succeed in this competition, but be completely and utterly useless for any other (real) tasks.
    --
    Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
    1. Re:AI by cytg.net · · Score: 1

      I think you're mistaking AI for GAI .. there's a big difference there.

    2. Re:AI by DNeoMatrix · · Score: 1

      The way I see it (and I could be mistaken), is that the only possible way to design a compression technique which passes the "AI Threshold" is to have a lossy technique, and the only way to determine what is okay to lose in human text is to have some form of AI saying if a particular word is necessary in a specific context, or if it can be removed. Such as removing consecutive repetitive thoughts which repeat themselves. Thus, in order to get any type of compression technique beyond what this new program does, we need to run an AI contextualizer on the text and somehow combine that with the compressor. Just kinda thinking out loud

    3. Re:AI by evilviper · · Score: 1

      I think you're mistaking AI for GAI

      I'm mistaking artificial intelligence for Global Authentication Inc.? Government Affairs Institute? General Applet Interface?

      Perhaps you meant AGI, but perhaps not, since your post shows you don't understand what it means.

      What a compressor does could be considered "weak AI" but frankly, it's an absolutely idiotic classification, because just about any function a computer performs can be considered "weak AI".

      A compression program does not come close to meeting the definition of strong AI, let alone AGI. What's more important in this case, however, is that the percentage of compression a program gets doesn't make it any smarter, in the AI sense. A program that can compress text even better, might have an advanced syntactical understanding of English, but that gives no insights into RESPONDING to a question, or anything of the sort.

      A better compressor is a better compressor, is a better compressor, and the technology applies nowhere else.
      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
    4. Re:AI by LeadSongDog · · Score: 1

      I see no reason to believe AI and text compression are interchangeable.
      I see no reason to believe Wikipedia and text are interchangeable. What's your point? It's still a useful endeavor.
      --
      Oh, I'm sorry sir, I thought you were referring to me, Mr. Wensleydale.
    5. Re:AI by cytg.net · · Score: 1

      oh the flames subcleverly hiding the first google search on the matter.
      if you'd done a second googling you'd know the subject is referred to by both acronyms, and in the reading on the subject i've done GAI is used more often than AGI.
      Dont give me shit about what post shows who dont understand what about which subject.

      And stop backpedaling already, you stereotype slashdot inflammable stereotype you ..
      And on the subject of educating, you, lookup a project like novamente (GAI) and you tell me that you know for a fact that such compression-like routines have no place in a framework like that... novamente, i understand, spends most of its time reading.
      neural nets is AI
      knowledgebased rule systems is AI
      fuzzy is AI
      genetic algos is AI
      bayesian reasoning is AI
      So like.. get the f... out of here and take your "weak ai" with you.

      "What a compressor does could be considered "weak AI" but frankly, it's an absolutely idiotic classification, because just about any function a computer performs can be considered "weak AI"."
      - thats stupid.
      - flame-out.

  35. Re:Artificial Intelligence? by bytesex · · Score: 3, Insightful

    The problem with this approach is that there are many ways to say the same thing, and that this compression/decompression algorithm is tested using strict text-comparison only. A real AI might compress 'The sky is blue today' and decompress to 'Today it's beautiful weather' and not be wrong.

    --
    Religion is what happens when nature strikes and groupthink goes wrong.
  36. I think I know... by DohnJoe · · Score: 1

    I heard the decompression binary is around 100.1MB....

  37. Bah humbub by yusing · · Score: 1
    equivalent to solving the artificial intelligence problem

    ... Without actually contributing anything to the development of artificial intelligence (an entity capable of understanding and interacting with the real world in an intelligent way).

    It's impressive as it stands. The hype is superfluous.

    --

    "You must try to forget all you have learned. You must begin to dream." -- Sherwood Anderson

    1. Re:Bah humbub by cytg.net · · Score: 1

      no no and no .. that is not the definition of artificial intelligence.. AI is a wide range of subjects and techniques, what you're describing is more that of a GAI (general artificial intelligence) and if you're into that sorta shit (which i guess you're not) you could check out the singularity institute.

    2. Re:Bah humbub by yusing · · Score: 1

      When they quit changing the meaning of AI to make a little progress seem important, I'll be able to get a better focus on it. Of course I meant general intelligence -- ants are specialized. Playing checkers: whoopee.

      --

      "You must try to forget all you have learned. You must begin to dream." -- Sherwood Anderson

  38. Re:Artificial Intelligence? by tlapale · · Score: 1

    Could someone out there please explain how being able to compress text is equivalent to artificial intelligence?

    There is also the sparse coding used in neuron populations. It is widely used to explain neuronal learning. The goal of sparse coding is to encode input with as few spikes as possible (e.g. between the retina and the primary visual cortex).
  39. Re:Artificial Intelligence? by Bombula · · Score: 1
    I've never studied information theory, so I'm woefully ignorant, but as I understand it, it was Shannon himself who defined information content as the amount by which a recipient's ignorance is reduced. Here's what I don't understand, and hope someone here will be able to explain:

    It makes sense to me that you can measure how much information is contained, say, in a text message. But to interpret that information you need ... what? We usually fill in the blank with 'intelligence', but it seems to me that interpretation itself requires, at the very least, some information - the 'look-up tables' as it were. But you also need information about what a look-up table is, and how do you look that up, and so on, ad infinitum, and so once again ... 'intelligence'. So here's my point: doesn't it follow that a certain amount of information is also required in order to interpret information? And isn't that, itself, an inseparable measure of any information being transmitted?

    From this, I extrapolate the following right out of my colon: it's fine to say that a string of text contains 100 bits, but those bits are only extractable given an adequate, non-zero sum of interpreter-side information. Simpler collections of information - say, 100 bits, require less interpreter-side information, ie: less intelligence, to be decoded. More complex collections of information, say a recording of Beethoven's 5th symphony, require far more interpreter-side information to decode. I could of course be dead wrong about this. But to continue, it also seems to me that if more information on the interpreter-side, i.e. intelligence, is required in order to extract more information, then this could mean that more information can also be imputed from a given block of data. A particular rendition of Beethoven's 5th contains much more information to a classical violinist than a non-musician, for example. Is it possible that the density of information in a block of data is actually affected by the interpreter-side information? Like some sort of Rorschach test, I could look at a string of 100 bits and infer and impute a gargantuan amount of information from it. Consider this more numerical analogy: I can 'decode' a text block of 100 bits with an infinite number of 'keys' and get a different output each time. So what information is actually stored in the block itself, irrespective of my decoding keys? Could it instead be that information doesn't really exist on one side or the other - message vs recipient - but can only be defined in terms of both together?

    Fruity pebbles, I know, but I've been curious about this for years.

    --
    A-Bomb
  40. Re:Artificial Intelligence? by Eivind+Eklund · · Score: 1
    You're assuming that humans aren't machines - in this context, that's actually a matter of faith. Human intelligence may be a result of "machine processes" - ie, direct physical processes. If we assume that humans are intelligent - otherwise, the term seems sort of useless - we can't rule out machines being intelligent by them being machines unless we have a definition of machine that excludes humans. (And I believe such a definition would probably be counter-productive when it comes to the matter of defining intelligence.)

    Eivind.

    --
    Doubting the existence of evolution is like doubting the existence of China: It just shows that you're uninformed.
  41. Why the Hutter Prize is Bullshit. by Anonymous Coward · · Score: 3, Interesting

    I suppose I have to post this anonymously, or the Hutter Prize thralls will just mod it down; I like my karma.

    I am at a loss as to how this meaningless charade keeps getting posted on Slashdot. Anyone with half a brain who reads TFA (or any of the previous FA Slashdot has posted on this stupid prize) can see this for what it really is: a handful of crazy people who think that this has meaning beyond above-average technogarbage.

    There are all of four people seriously involved in this Hutter Prize: Hutter himself, Bowery (who's made all the /. submissions), Mahoney (who wrote a thesis on this crap), and Ratushnyak, who seems to enjoy wasting his time on this AI-obsessed prize.

    PAQ8HP12 may be able to compress the corpus extremely efficiently, but it has obvious and real drawbacks for any real-world application: it's tuned for this specific corpus ("H[utter]P[rize]" is even in the name of the compressor), it's slow as fuck, and it consumes 2GB of memory. Yes, 2GB of memory for 100MB of input data. This is not a real-world algorithm; this is CS weenies wanking off.

    And what's with the obsession with Wikipedia? It is not the be-all, end-all of human knowledge, and, despite its devotees' claims, never will be; just look at the internal politics, and you'll see that it simply can't scale to that size. Is it a useful resource? Of course. Is it something worthy of adoration and fawning over? No.

    And then, of course, there's the obsession with AI. These people seem to be of the opinion that a text compressor will actually lead to artificial intelligence -- with no other tuning! An absurd claim if I've ever heard one; the predictive capabilities of a good text compressor are something that would no doubt be useful to an AI, but there's one hell of a lot more to general intelligence than just pattern matching and statistical algorithms for compression.

    If one really wanted to sponsor an AI prize, it would probably be much better to focus on creating, say, an effective chatbot -- something that really can predict a desirable response and pass Turing's test.

    Not this compression bullshit.

    1. Re:Why the Hutter Prize is Bullshit. by catxk · · Score: 1

      I hear ya, and I know little to nothing about any of this, but wasn't the idea that predictive compression and chatbots essentially do the same thing, i.e. trying to figure out the answer to something previously stated? If you manage to do one, you get the other.

      --
      Don't be crazy anymore!
    2. Re:Why the Hutter Prize is Bullshit. by Anonymous Coward · · Score: 0

      ...this is CS weenies wanking off.

      I work in THEORETICAL CS and I tell you son you ain't seen CS weenies wanking off yet. This compression stuff we will call EXPERIMENTAL CS and personally, "it's nice, I like."

    3. Re:Why the Hutter Prize is Bullshit. by Anonymous Coward · · Score: 0

      Epic fractal wrongness. Astounding

    4. Re:Why the Hutter Prize is Bullshit. by Anonymous Coward · · Score: 0

      AC, I think I love you.

    5. Re:Why the Hutter Prize is Bullshit. by UbuntuDupe · · Score: 1

      it consumes 2GB of memory. Yes, 2GB of memory for 100MB of input data. This is not a real-world algorithm; this is CS weenies wanking off.

      If that's true (and this is important), then I'd have to agree with you. Compressing only one data stream, with an algorithm longer than the data stream, is not compression. Oddly enough, in my last post, I was just joking about how you could cheat by having the algorithm be:

      0 = 0
      1 = [entire wikipedia database]

      And say, "Hah, I compressed it to 1!"

      But then ... that would actually be better than what they did, if what you're saying is true :-P

    6. Re:Why the Hutter Prize is Bullshit. by Anonymous Coward · · Score: 0

      Exactly. The thing is, given infinite time and resources any set of data could be compressed perfectly. This is surely better than any human could do consistently (this problem is very similar to the more limited game of Chess). The algorithm isn't even complicated, just a simple brute force approach would work. However, this approach is not feasible in the real world.

      I fail to see how a brute force search algorithm equates to AI. Humans don't work that way.

      We won't see anything close to AI until the computers approach the complexity of the Human brain. That means either an incomprehensible (with today's technology) number of transistors or some other computing method like quantum computing.

    7. Re:Why the Hutter Prize is Bullshit. by Anonymous Coward · · Score: 0

      Sorry, no dice. The size of the decompressing program is included in the size.

  42. Re:Artificial Intelligence? by Anonymous Coward · · Score: 0

    depends on your definition of information. i believe pure information science defines information in a way somewhat unlike your intuitive equation with "human knowledge". information does not need to be processed, let alone understood, to be information. that is an intrinsic quality.

  43. pfft by phagstrom · · Score: 1
  44. baldrson by Anonymous Coward · · Score: 0

    Baldrson m'boy!! Up to your old tricks again, eh? We know what you're up to, we won't be fooled!

  45. GPL source code compressed with RAR by noidentity · · Score: 1

    "Alexander Ratushnyak's open-sourced GPL program is called paq8hp12 (link to paq8hp12any.rar)."

    Funny that the GPLed source code is stored in a proprietary compression format for which there isn't any GPLed decompressor (that I know of which handles the latest RAR format).

    1. Re:GPL source code compressed with RAR by the_greywolf · · Score: 1

      It's my understanding that the official rar decompression library is licensed under a permissive license - not compatible with the GPL by any stretch, but it's not closed as "proprietary" implies.

      --
      grey wolf
      LET FORTRAN DIE!
    2. Re:GPL source code compressed with RAR by noidentity · · Score: 1

      True, it's not completely hidden, but they don't document the file format and forbid reverse-engineering for the purpose of making a compatible compressor. Still kind of funny, given that 7-zip exists, with its solid LZMA compression (which I believe is not proprietary in any way).

  46. Not truly == AI just yet by Anonymous Coward · · Score: 5, Interesting

    I've been following the Hutter Prize with interest, having been into compression ever since reverse engineering Powerpacker on my Amiga 500 back in the good old days to understand how it worked (ah, happy memories).

    Now what just about all the compressors do, whether they are based on Neural Nets, Markov Models, Predictive Partial Matching or whatever, is to use patterns in the already seen text to predict the most likely following bit (0/1).
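
A toy version of that idea, for anyone who wants to play with it: the sketch below predicts each byte from the few bytes before it and adds up the cost an ideal coder would pay. It is not how PAQ works internally (PAQ predicts bits and mixes many models); the function name, the add-one smoothing and the sample text are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def adaptive_bpc(data: bytes, order: int = 2) -> float:
    """Average code length, in bits per byte, of an adaptive order-`order` model.

    Each byte is predicted from counts of what followed the same context
    earlier in the data, and the counts are updated afterwards. An ideal
    arithmetic coder spends -log2(p) bits on a symbol predicted with
    probability p.
    """
    counts = defaultdict(lambda: defaultdict(int))  # context -> next byte -> count
    totals = defaultdict(int)                       # context -> bytes seen so far
    total_bits = 0.0
    for i, symbol in enumerate(data):
        context = data[max(0, i - order):i]
        # Add-one smoothing over the 256 possible byte values.
        p = (counts[context][symbol] + 1) / (totals[context] + 256)
        total_bits += -math.log2(p)
        counts[context][symbol] += 1
        totals[context] += 1
    return total_bits / len(data)


repetitive = b"the cat sat on the mat. " * 200
print(f"order 0: {adaptive_bpc(repetitive, 0):.2f} bits/byte")
print(f"order 2: {adaptive_bpc(repetitive, 2):.2f} bits/byte")
```

On repetitive text like this, the order-2 model should come out cheaper than the order-0 one, because two bytes of context make the next byte far more predictable.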

    Now depending on the text itself, prediction based on previously seen text isn't enough ... especially this enwik8 file which is more of a flat file dataset with a lot of unrelated terminology.

    Try to predict the next word, byte or bit, when your previous text has been "Frog, Toilet, Woodwork" ... how the hell can we possibly predict that the next words will be "Slashdot, Cigarette, Coffee". (Three subjects very close to my heart ... also my lungs, arteries, liver etc).

    Therefore some of these compressors are supplemented by a dictionary containing "useful" English words arranged so that the ones used most frequently get assigned a lower "size" of encoded string in the text pre-processor before the actual compression kicks in.

    It seems that all the advances have been made on finding the optimum arrangement for this dictionary based on the text they have to process ... the 100MB enwik8 file. A different file will need a different dictionary.
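
A minimal sketch of such a frequency-ranked dictionary, under loud assumptions: the function names and sample sentence are invented here, and this toy is lossy (it lowercases and drops punctuation), whereas a real preprocessor for the prize has to be exactly reversible.

```python
import re
from collections import Counter


def build_dictionary(text):
    """Return the distinct words of `text`, most frequent first."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word, _ in Counter(words).most_common()]


def encode_words(text, dictionary):
    """Replace each word by its rank, so frequent words get small numbers."""
    rank = {word: i for i, word in enumerate(dictionary)}
    return [rank[w] for w in re.findall(r"[a-z]+", text.lower())]


sample = "the frog sat on the toilet and the frog sang"
dictionary = build_dictionary(sample)
codes = encode_words(sample, dictionary)
print(dictionary[:2])  # ['the', 'frog']
print(codes)
```

Small ranks can then be given short codes by whatever stage comes next, which is why the ordering of the dictionary matters so much for one fixed file.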

    Note also, as the enwik8 file is not truly a passage of text, more a collection of data in XML wrapper, there is also a lot to be gained simply by understanding the structure of the file itself, and finding an alternative representation for the XML components ... example all the timestamps are in a very verbose character style like "2007-07-10 00:00:00" ... if we can recognize that, we could find an alternative encoding, changing a 19-byte string into a 32-bit long (maybe even less if we understand the epoch date he is using) ... again, "wetware" has to identify and decide this encoding right now.

    Now for me, REAL AI would come when the compressor can actively SCAN the file to be compressed himself, recognize the file structure (be it XML, plaintext or whatever), and optimize it into a more compressible format, decide the optimum arrangement for the dictionary, decide the optimum compression technique, context orders to be used etc etc ... AND do all this in less than 9 hours I believe it takes for the latest compressor.
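
The timestamp trick can be sketched in a few lines. The format string, the UTC assumption and the Unix epoch are all guesses made for illustration; the timestamps in the real file may be laid out differently.

```python
import struct
from datetime import datetime, timezone

FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed layout, taken from the example above


def pack_timestamp(text: str) -> bytes:
    """Encode a 19-character timestamp as 4 bytes (seconds since 1970, UTC assumed)."""
    moment = datetime.strptime(text, FORMAT).replace(tzinfo=timezone.utc)
    return struct.pack(">I", int(moment.timestamp()))


def unpack_timestamp(blob: bytes) -> str:
    """Inverse of pack_timestamp: 4 bytes back to the original text."""
    (seconds,) = struct.unpack(">I", blob)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(FORMAT)


original = "2007-07-10 00:00:00"
packed = pack_timestamp(original)
print(len(original), "bytes ->", len(packed), "bytes")  # 19 bytes -> 4 bytes
```

The round trip through `unpack_timestamp` gives back the same 19 characters, so nothing is lost as long as every timestamp really follows the assumed layout.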

    This low bits/character rate comes at a heavy price in speed and memory, especially when good old WinZIP can get a pretty good result in a couple of minutes.

    At the moment there is just too much "wetware" involvement to say this is truly AI, regardless of the bits/character rate they are achieving.

    1. Re:Not truly == AI just yet by MikeBabcock · · Score: 1

      Therefore some of these compressors are supplemented by a dictionary containing "useful" English words arranged so that the ones used most frequently get assigned a lower "size" of encoded string in the text pre-processor before the actual compression kicks in.

      It seems that all the advances have been made on finding the optimum arrangement for this dictionary based on the text they have to process ... the 100MB enwik8 file. A different file will need a different dictionary.

      That's not at all true in a multi-pass codec of course. An optimal dictionary can of course be computed by the algorithm itself either before or after compression (for example by re-optimizing its dictionary). It's slower, sure, but if optimal algorithms are your thing and speed isn't an issue, try data-analysis passes.
      --
      - Michael T. Babcock (Yes, I blog)
    2. Re:Not truly == AI just yet by Anonymous Coward · · Score: 0

      Try to predict the next word, byte or bit, when your previous text has been "Frog, Toilet, Woodwork" ... how the hell can we possibly predict that the next words will be "Slashdot, Cigarette, Coffee".

      Ah, but your data set is slightly misaligned! Frastelo (FRee ASsociation TExt LOokahead) v.7.10 can predict that nicely.

      To wit:

      Frog : Cigarette (have you ever seen a Frenchman without...?)
      Toilet : Coffee (the diuretic and laxative effects of Coffea arabica extract are well-known)
      Woodwork : Slashdot (I'll let you figure this one out...possibly with the help of the Kimberly-Clark corporation)

    3. Re:Not truly == AI just yet by noidentity · · Score: 1

      You could compress your post by not using so many one-sentence paragraphs.

  47. Re:Artificial Intelligence? by archeopterix · · Score: 1

    Could it instead be that information doesn't really exist on one side or the other - message vs recipient - but can only be defined in terms of both together?
    As far as I know, this is how information theory defines information (actually it defines uncertainty, but this is rather a technical detail). The definition of uncertainty relies on a random variable. This random variable represents the knowledge common to the sender and the recipient before sending the message ( they both know whether they will be transmitting symphonies, wikipedia articles, ECG records or whatever). The other part is encoding (they agree how to encode a symphony into a sequence of bits). Only when we have a random variable and an encoding then we can speak about information. So, IMHO your intuition and information theory are similar.
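
To make the "uncertainty of a random variable" part concrete, here is a small sketch of Shannon entropy. The function name and the two example distributions are made up for illustration; the formula is the standard H(X) = -sum p log2 p.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy H(X) = -sum(p * log2(p)), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Sender and recipient both know the distribution before any message is sent.
fair_coin = entropy_bits([0.5, 0.5])
biased_coin = entropy_bits([0.9, 0.1])

print(f"fair coin:   {fair_coin:.3f} bits")    # 1.000
print(f"biased coin: {biased_coin:.3f} bits")  # 0.469
```

The same "heads" carries a different amount of information depending on which distribution the two sides agreed on beforehand, which is the point about information depending on both message and recipient.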
  48. Re:Artificial Intelligence? by nanosquid · · Score: 1

    Hutter proved that the optimal behavior of a goal seeking agent in an unknown but computable environment is to guess at each step that the environment is controlled by the shortest program consistent with all interaction so far.

    Well, one should keep in mind that the connections between AI and compression have been known far longer, but have never turned out to be particularly useful for building AI systems. Hutter's point is merely a variant of these previous theories, and there is no reason to believe that it will lead to AI any more than previous attempts.

  49. Only 1% away from AI? by glwtta · · Score: 1

    That's only about 1.5 laptop meters! The thinking machines are here!

    --
    sic transit gloria mundi
    1. Re:Only 1% away from AI? by Anonymous Coward · · Score: 0

      awesome critique of their silly measurement via the absurd, if you don't get modded up to highlight how ridiculous this prize is it will be a shame

  50. Re:Artificial Intelligence? by Baldrson · · Score: 1

    The value of the Hutter Prize isn't in the use of Hutter's theory to build AI. The value of the Hutter Prize is in the use of compression ratio to provide a figure of merit for AI that is very good. That value has been proven in various ways -- mathematically and practically.

  51. Re:Artificial Intelligence? by pitu · · Score: 2, Insightful

    The problem with this approach is that there are many ways [to] say the same thing

    that is an idea or a concept. Interpreting an idea or concept in different ways is meaningful
    only by its context.

    ex1: the sky is blue => it's beautiful weather (context: you're taking a walk)
    ex2: the sky is blue => use #0000FF for the sky area (context: graphic work)

    This compression/decompression algorithm is tested using strict text comparison only. A real AI might compress 'The sky is blue today' and decompress to 'Today it's beautiful weather' and not be wrong.

    If you say "the weather is beautiful" to an artist, he may draw you a yellowish-reddish sunset,
    which is not the correct interpretation of "the sky is blue" you had in mind. So the context is vital.

    I imagine a real AI would evaluate the context and predict which words are most likely to come next. If it
    succeeded in translating one concept to another in a meaningful context ("the sky is blue => it's beautiful weather, let's bring down the NASA shuttle"),
    it would no longer be an AI but an I :)
  52. broken assumptions by nanosquid · · Score: 1

    He proves that the optimal behavior of an agent (an interactive system that receives a reward signal from an unknown environment) is to guess that the environment is most likely computed by the shortest possible program that is consistent with the behavior observed so far.

    But the task of compressing Wikipedia character by character is thoroughly irrelevant to human intelligence. Humans are bad at it, so it's not a characteristic of human intelligence, and if you had a very high quality compressor, even if it used some kind of internal model of the meaning of the text, you still couldn't use it as an AI system.

    Another problem with the whole approach is the assumption of optimality; in fact, intelligent behavior is unlikely to be optimal in any given environment.

    1. Re:broken assumptions by Baldrson · · Score: 1
      Clearly being able to predict the next symbol in a stream coming from a corpus of human knowledge is relevant to AI and such prediction is the core technique of compression.

      As for "optimality", you aren't thinking about things at the right level. If you have an unknown environment the fact that it is unknown means you are making educated guesses as to maximum likelihoods, or "expectation maximization". No -- of course those guesses are very unlikely (approaching zero) to be correct 100% of the time.

    2. Re:broken assumptions by imsabbel · · Score: 1

      You think at a much too high level.
      Sure, there is human intelligence in being able to comprehend, for example, the text in my sig by expanding.
      But that has limits, as ambiguity rises. "The * is blue". Sky? Or maybe "shirt pocket bottom".
      "Normal" compression algorithms that simply act upon the entropy are already way past the level of compression that's possible with "intuition".

      --
      HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
    3. Re:broken assumptions by nanosquid · · Score: 1

      Clearly being able to predict the next symbol in a stream coming from a corpus of human knowledge is relevant to AI and such prediction is the core technique of compression.

      No, it is not "clearly" relevant at all.

      If you have an unknown environment the fact that it is unknown means you are making educated guesses as to maximum likelihoods, or "expectation maximization".

      You're mixing up a bunch of things. First, maximum likelihood and expectation maximization are different concepts. Second, maximum likelihood decisions are not optimal in general. Third, in actual experiments, humans often don't make either optimal or maximum likelihood decisions, meaning that it's odd to use anything related to them as a measure of intelligence.

    4. Re:broken assumptions by Baldrson · · Score: 1

      What could be clearer than that prediction is relevant to AI? ML gives predictions and EM gives decisions based on predictions. Maximum compression of observations gives optimal models for prediction as you yourself have asserted, however old-hat this understanding of compression is. It all seems so straightforward if you accept, as you say you do, the relationship between compression and optimal prediction. I don't understand what your issue with the Hutter Prize is.

    5. Re:broken assumptions by nanosquid · · Score: 1

      It all seems so straightforward if you accept, as you say you do, the relationship between compression and optimal prediction

      Your mistake is that you're equating optimal prediction with intelligence. Many natural and artificial systems learn to make optimal predictions, yet aren't intelligent. And intelligent systems often don't make optimal predictions.

      I don't understand what your issue with the Hutter Prize is.

      It is precisely what I said: it misleads people into equating optimal prediction with intelligence.

  53. Re:Artificial Intelligence? by bentcd · · Score: 1

    The (unproven) idea is that if you want to do the best at guessing what comes next (similar to compression), you have to have a great understanding of how the language and human minds work, including spelling, grammar, associated topics (for example, if you're talking about the weather, "sunny" and "rainy" are more likely to come than "airplane"), and so on. (Emphasis is mine)

    But, surely, the compression algorithm isn't actually guessing at all - it knows what comes next because it is working from a very strict set of rules. I would probably be more impressed(*) if they wrote a heuristic that could accurately guess symbols based upon previous symbols - even if such a heuristic were to give a higher error rate than what the deterministic algorithm does.

    Then perhaps as the next logical step, we could have a heuristic trained by Wikipedia which could accurately predict considerable parts of Britannica, or a physics text book, or a text book on the hundred years war, etc. Then we'd be talking.

    (*) - not that I'm not impressed as it is, but there's always more where that came from :-)
    --
    sigs are hazardous to your health
  54. Re:Artificial Intelligence? by Eivind · · Score: 1

    True, but the real gains are achieved when you're allowed to be lossy. If you're studying a picture, with your goal being to be able to answer human questions about that picture afterwards, you don't make any attempt whatsoever at remembering the precise pixel-by-pixel colours. Instead you focus on those parts of the contents most likely to be of human interest. You take note of a "car" standing with the "side" towards you, perhaps the make. The colour. The fact that it's raining. That there's a girl sitting on the roof of the car. You *don't* in any way, shape or form have enough info to be able to reconstruct the image in a lossless manner. You are, however, able to summarize the picture in such a way that the most important parts of the actual content are included. But what is "important" depends on what you're going to need the info for and so requires intelligence. A photographer might instead notice that the picture is somewhat oversaturated in colour, taken with a large aperture so the background is out of focus, too low resolution to use for a poster, taken without flash etc etc. Both are equally "valid", but they serve completely different purposes.

  55. Maybe you could by Kaeles · · Score: 1

    have a table that stores commonly used long words, hide this in the compressed file, and then reference it with a checkTable bit then the byte (or 2 bytes) to the table element, or DontCheckTable bit and it just uses a normal compression scheme?

    You could even then compress the whole thing again and have a reference table that points to common references or something :P
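
    The table idea above can be sketched roughly as follows. This is only a toy illustration in Python: the names `build_table`, `encode` and `decode` are made up for the example, the "T"/"L" tags stand in for the checkTable / DontCheckTable bit, and a real codec would pack the tags and indexes into bits rather than keep a list of tuples.

```python
import re
from collections import Counter

def build_table(text, min_len=6, max_entries=256):
    """Pick the most frequent long words; 256 entries fit a one-byte index."""
    words = re.findall(r"[A-Za-z]+", text)
    counts = Counter(w for w in words if len(w) >= min_len)
    return [w for w, _ in counts.most_common(max_entries)]

def encode(text, table):
    """Split the text into word and non-word runs and tag each run.

    ("T", i) plays the role of "checkTable": emit table index i.
    ("L", s) plays the role of "DontCheckTable": emit s as a literal.
    """
    index = {w: i for i, w in enumerate(table)}
    tokens = []
    for run in re.findall(r"[A-Za-z]+|[^A-Za-z]+", text):
        if run in index:
            tokens.append(("T", index[run]))
        else:
            tokens.append(("L", run))
    return tokens

def decode(tokens, table):
    """Rebuild the exact original text from the tagged runs."""
    return "".join(table[v] if tag == "T" else v for tag, v in tokens)

if __name__ == "__main__":
    sample = "compression beats repetition; compression beats repetition"
    table = build_table(sample)
    tokens = encode(sample, table)
    print(table)
    print(tokens)
    print(decode(tokens, table) == sample)
```

    Note that the table itself has to ship with the compressed file, so the scheme only pays off when the saved bytes outweigh the table plus the per-run flag.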

    1. Re:Maybe you could by vidarh · · Score: 1

      That would leave you only 30-40 years or so behind the state of the art of compression algorithms...

    2. Re:Maybe you could by the_greywolf · · Score: 1

      Nearly all common compression algorithms include this as part of the first pass. Rar, for example, uses a particularly large dictionary, and Bzip2 uses a moderately-sized lookahead buffer for a similar purpose. PKZip's deflate algorithm is one example which doesn't use dictionaries.

      Most text compression algorithms also use some kind of domain reduction to reduce the working set for a block of text before compressing: when you've got fewer bits to care about, they pack quite a bit tighter.

      --
      grey wolf
      LET FORTRAN DIE!
  56. Re:Artificial Intelligence? by mrjb · · Score: 2, Funny

    "A real AI might compress 'The sky is blue today' and decompress to 'Today it's beatiful weather' and not be wrong." That might be a good example of acceptable *lossy* AI text compression. One step further and it will compress articles into a proper, readable summary.

    --
    Visit http://ringbreak.dnd.utwente.nl/~mrjb/growingbettersoftware to download your free copy of the book
  57. pi by misanthrope101 · · Score: 1

    There you go--a 2-letter (or one, depending on the alphabet) representation of a number that contains ALL information. Can't get more compressed than that. Of course, which decimal places to start and end with, that's the question. How many irrational numbers are there? Can I patent one of them?

    1. Re:pi by Twinkle · · Score: 1

      Is the expansion of pi proven to contain every possible sequence? Just because the expansion is infinite, I don't think you can assume that it does.

    2. Re:pi by misanthrope101 · · Score: 1
      If a sequence is infinite and random, it would seem to include all finite sequences in there somewhere. Actually it would seem to include them occurring an infinite number of times. Even if the probability of a sequence occurring isn't 1, it approaches one as the containing sequence approaches infinity. The probability of finding the sequence 0123456789 may be small in a 10-digit randomly selected number, but in a billion-digit number, that sequence is very likely to occur somewhere. Shouldn't that logic extrapolate to larger sequences found within larger sequences?

      But I'm not all that good at math, so please correct me if I'm wrong. I just find irrational numbers elegant.

    3. Re:pi by cnettel · · Score: 1

      It's not true for all irrational numbers, but it is true for pi (and others). Anyway, this won't help you compress it. You need to store the index of the sequence inside pi, and it's quite obvious that a binary encoding of that index will easily get longer than the original sequence!
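
      The point about the index can be tried out with a small sketch. It assumes nothing about whether pi really contains every sequence; it just computes some digits with Machin's formula in integer arithmetic and reports where a pattern first shows up. The function names (`arctan_inv`, `pi_digit_string`, `index_cost`) are invented for this example.

```python
def arctan_inv(x, unity):
    """Return arctan(1/x) scaled by `unity`, using integer arithmetic only."""
    term = unity // x
    total = term
    x2 = x * x
    n = 3
    sign = -1
    while term:
        term //= x2
        total += sign * (term // n)
        sign = -sign
        n += 2
    return total

def pi_digit_string(decimals):
    """Return '3' followed by `decimals` decimal digits of pi (truncated)."""
    guard = 10  # extra digits to absorb rounding error in the series
    unity = 10 ** (decimals + guard)
    # Machin's formula: pi/4 = 4*arctan(1/5) - arctan(1/239)
    pi_scaled = 4 * (4 * arctan_inv(5, unity) - arctan_inv(239, unity))
    return str(pi_scaled // 10 ** guard)

def index_cost(pattern, digits):
    """Return (index, number of digits needed to write the index), or None."""
    idx = digits.find(pattern)
    if idx < 0:
        return None
    return idx, len(str(idx))

if __name__ == "__main__":
    digits = pi_digit_string(20000)
    for pattern in ("7", "42", "314", "2718", "12345", "999999"):
        print(pattern, len(pattern), index_cost(pattern, digits))
```

      Individual patterns can get lucky (a famous run of six 9s starts at the 762nd decimal place), but if the digits behave like random ones, a k-digit pattern is first expected somewhere around position 10^k, so writing down the position costs about as many digits as the pattern itself.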

    4. Re:pi by Anonymous Coward · · Score: 0

      But I'm not all that good at math, so please correct me if I'm wrong

      Everything that you said is correct, but it doesn't really apply to pi since pi is not random.

    5. Re:pi by Anonymous Coward · · Score: 0

      How many irrational numbers are there?

      Since the real numbers are uncountable and the rationals are countable, the set of irrationals is uncountably infinite.

    6. Re:pi by misanthrope101 · · Score: 1

      Is "nonrepeating" the word I was looking for?

    7. Re:pi by stdarg · · Score: 1

      My interpretation was that if the compressor sees 3.141592653589793238462643383279502884197169399375 10 it'll compress it to a form meaning "first 50 digits of pi", which would be quite impressive. Likewise if it sees "2, 4, 8, 16, 32, 64, 128" it would compress it to "first 7 powers of 2". In essence, the best compression for many sequences of numbers requires something like AI to be able to see patterns. Same with words, although probably less so.

    8. Re:pi by Anonymous Coward · · Score: 0

      I don't think so. I think that the word you're looking for is "ergodic".

    9. Re:pi by Anonymous Coward · · Score: 0

      Nope. It's entirely possible that pi doesn't contain the digit '7' at all. (Ok, it _does_ contain '7'...but that's not the point. There's no reason to assume it does...or that '7' ever appears again after the 14 billionth digit, for example.)

    10. Re:pi by misanthrope101 · · Score: 1
      It's possible, but not probable. The probability of any digit (or any finite sequence) approaches 1 (but never reaches it) as the number of digits approaches infinity. At least that's what I'm saying--whether or not it's mathematically correct is another question. So yes, it's possible that any digit or sequence thereof is absent, but the probability continues to go up as you get a larger pool of digits. There are pages where you can download megabytes of pi or other irrational numbers.

      Pick a sequence, say "9999999999" and if you get a large enough chunk of numbers, random or just a section of an irrational number, the sequence you picked will be there. I've found 000000000 and other seemingly improbable sequences, only by using ctrl-f and starting with a big enough chunk of pi. The chance of a 9-digit number being 000000000 is small, but the chance of finding that sequence in a 10-million-digit random (or nonrepeating) sequence is much higher.
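
      A back-of-the-envelope way to put numbers on this, as a sketch only: assume the digits are independent and uniformly random (which is not known to hold for pi) and treat each starting position as an independent trial, which ignores overlaps between neighbouring windows. The function name `appearance_probability` is made up for the example.

```python
import math

def appearance_probability(pattern_len, total_digits):
    """Approximate chance that a fixed pattern of `pattern_len` digits
    appears at least once in `total_digits` random decimal digits.

    Uses 1 - (1 - 10**-k)**(n - k + 1), computed via log1p/expm1 so that
    very small probabilities do not round to zero.
    """
    windows = total_digits - pattern_len + 1
    if windows <= 0:
        return 0.0
    p_window = 10.0 ** -pattern_len
    return -math.expm1(windows * math.log1p(-p_window))

if __name__ == "__main__":
    for n in (9, 10**7, 10**9, 10**11):
        print(n, appearance_probability(9, n))
```

      Under those assumptions the chance for a 9-digit pattern goes from about one in a billion (9 digits) to roughly 1% (10 million digits), and only gets close to certainty once the chunk is well past a billion digits.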

    11. Re:pi by MauricioC · · Score: 1

      The probability of a finite sequence not appearing in pi would be zero if the digits were random (This is a consequence of the Borel-Cantelli lemma... if you are not familiar with it, think of the finite sequence as Shakespeare's works and the randomness of the sequence coming from monkeys typing on a typing machine... it's the same thing), but zero probability does *not* mean it can't occur. For example, suppose we have a lottery where any real number may be chosen. What is the probability that I will win, given that I chose one number? What's the probability that I will win, given that I chose *all* rational numbers? (They're both zero, but it doesn't mean I can't win!)

    12. Re:pi by John+Meacham · · Score: 1

      Now, figure out the size of the number representing the offset in pi that your data is. Assuming that pi is a normal number [1] (which is believed to be true) then you end up with a number which can only be represented with as much space as your original data in general.

      [1] http://en.wikipedia.org/wiki/Normal_number (proving pi to be a normal number is a famous open problem)

      http://en.wikipedia.org/wiki/Kolmogorov_complexity is another interesting and related topic; as the prize is for the decompressor + the data together, the Hutter Prize winners are slowly approximating the Kolmogorov complexity of human knowledge. Which very well might turn out to be a few field equations describing quantum mechanics that you just let run for a few billion cycles. Now that is some good compression :)

      --
      http://notanumber.net/
  58. Re:Artificial Intelligence? by JohnFluxx · · Score: 1

    You do want it to guess. Take the string:

    "The weather * good"

    If your algorithm understands English well enough that it can guess that the "*" is the word "is", then you don't need to store the word "is", because you know that when the text is decompressed the word "is" will be guessed again.

  59. Re:Artificial Intelligence? by nanosquid · · Score: 1

    The value of the Hutter Prize is in the use of compression ratio to provide a figure of merit for AI that is very good.

    The same figure of merit has existed for decades before, and it has never proven to be very useful in evaluating AI systems.

    That value has been proven in various ways -- mathematically and practically.

    Really? Various practical ways? Like what? Where has a Hutter prize related advance been demonstrably linked to an advance in NLP?

  60. super-grammar-improved paq8hp12 by superbrose · · Score: 4, Funny

    After implementing a few minor tweaks to paq8hp12 and incorporating your grammar optimisation algorithm I managed to compress the above text amazingly to a single character: '&'.

    Now you figure out which one it was and how to decompress it.

    1. Re:super-grammar-improved paq8hp12 by pla · · Score: 5, Funny

      Now you figure out which one it was and how to decompress it.

      Well, with only 256 choices, it didn't take long to check all possible decodings for one that makes sense. Ended up working for "}".

      Oddly, though, the algorithm not only restored, but improved the original! I get:

      "The King's English version of Wikipedia should fit in eight gigabits, I do believe. Only humanity's sphexish adherence to grammatical rules limits the attainable compression ratio; the good gentleman might wish to consider filtering to a more base patois prior to applying his algorithm".

      Amazing... This discovery could single-handedly render the next generation (nearly) intelligible!

    2. Re:super-grammar-improved paq8hp12 by Mark_MF-WN · · Score: 1
      Who says that his encoding scheme uses fixed-length tokens? Maybe in his system, one character decodes to 256 different multicharacter strings, chosen for their frequency of use?

      I'll bet I could develop an encoding scheme like that for neoconservative diatribes. Let's see:

      ! : "terrorist"
      " : "threat"
      # : "alkayda"
      $ : "nuhkular"
      % : "evil"
      & : "god"
      ' : "uh"
      ( : "foo"
      ) : "mah"
      * : "fooled"
      + : "again"
      , : "huggle-icious"
      - : "texas"
      . : "wmd"
      / : "mission"
      0 : "accomplished"
      1 : "damnliberalcommiepinko"
      2 : "long"
      3 : "war"
      4 : "oil"
      ....

      Hell, with this encoding scheme and a good pseudorandom number generator, you could make state-of-the-union speeches obsolete. Of course, developing a comparable encoding for hypothetical Democrat presidents is left as an exercise for the reader.

  61. Re:Artificial Intelligence? by MichaelSmith · · Score: 1

    and more interesting still in video

    In this case the receiver of the message (a human brain) only uses a minute proportion of the data in the video stream. So a compression algorithm with a good understanding of human vision should be able to achieve an enormous compression ratio.

  62. Re:Artificial Intelligence? by iapetus · · Score: 2, Interesting

    Which is a shame, because the weather wasn't good.

    --
    ++ Say to Elrond "Hello.".
    Elrond says "No.". Elrond gives you some lunch.
  63. Re:Artificial Intelligence? by JohnFluxx · · Score: 1

    If at compression stage you guess incorrectly, then you have to put the actual word in. You would only replace words if you guess correctly.

  64. Re:Artificial Intelligence? by iapetus · · Score: 1

    Which, of course, is fine if and only if you can guarantee that your 'decompression' will work the same way every time. For a learning AI that's not really an option.

    --
    ++ Say to Elrond "Hello.".
    Elrond says "No.". Elrond gives you some lunch.
  65. Not made for mobile devices by HNS-I · · Score: 2, Insightful

    The question is: does a mobile handheld device have enough processing power to decompress it in a reasonable time?

    Seriously, this was not invented for mobile handheld devices. At this moment without compression you could probably store enough text on a mobile phone to keep you constantly reading for a month.

    1. Re:Not made for mobile devices by Yvan256 · · Score: 4, Insightful

      At this moment without compression you could probably store enough text on a mobile phone to keep you constantly reading for a month.
      That may be, however we're talking about Wikipedia here. It's not about storing so much text that you can't go through it within a month, it's about storing everything so that you can access it as a reference.

      When you look up a word in the dictionary, it takes from 10 to 30 seconds to read the definition. But you did need the whole book/brick to do it.

    2. Re:Not made for mobile devices by HNS-I · · Score: 1

      You will get the text from a server and that server will decompress it for you.

    3. Re:Not made for mobile devices by Yvan256 · · Score: 1

      Well, if that's what you think then this isn't about "portable Wikipedia", it's just a regular portable device that happens to have internet access. I can do that on my Nintendo DS.

      The parent was talking about the idea of compressing Wikipedia so you can access it on a portable device (via a 4GB SD card, etc).

    4. Re:Not made for mobile devices by Anonymous Coward · · Score: 0

      I think it would be lame and you would never be able to get it on 4GB, it's a ridiculous thing to say. The whole purpose of wikipedia is that it is centralised so that you can get information that is up to date at any moment.



      HNS
    5. Re:Not made for mobile devices by Yvan256 · · Score: 1

      We're talking about Wikipedia access for devices with no internet access (or else they'd use the regular wikipedia website).

      And nothing would prevent you from downloading an up-to-date version when you have internet access.

  66. That is the problem by hummassa · · Score: 2, Insightful

    Anything that is science is math.
    Ok, computer programming is not necessarily a lot of maths.
    But this article is about something that is really computer science... as opposed to making a CRUD screen in VB.net, which is akin to programming a VCR.
    Parsing, compiling, linear programming, sorting, searching, indexing, compressing, walking graphs, drawing graphics, designing circuits, optimizing circuits, these are activities that are computer science and that are all maths.

    Edsger Dijkstra once said: "Computers are to computer science what telescopes are to astronomy".

    --
    It's better to be the foot on the boot than the face on the pavement. ~~ tkx Kadin2048
    1. Re:That is the problem by FesterDaFelcher · · Score: 1

      computer programming is not necessarily a lot of maths.
      Maybe not what you see, but what do you think happens to each line of VB that you write? Do they magically form together with help from Gates the Sorcerer? Computer programming is all about "maths."
      --
      My user number is prime. Is yours?
    2. Re:That is the problem by Lord+of+Hyphens · · Score: 1

      designing circuits, optimizing circuits
      Nope, that's engineering. Sorry.
      (Reference: I am an undergraduate Electrical Engineering and Computer Engineering major, or IAAEECE)
      --
      "I've spent my whole life figuring out crazy ways to do things. It'll work." -- Montgomery Scott, "Relics"
    3. Re:That is the problem by GeffDE · · Score: 1

      IAAAE&CE (I am also an Electrical & Computer Engineer) ...and designing and optimizing circuits is a mathematical problem.

      The Quine-McCluskey algorithm is, I believe, the best method of getting a provably optimized circuit. Granted, it isn't used in real life because its complexity is NP, but it is the basis of all the optimization algorithms out there.

      Designing digital circuits is also math because logic is a subset of mathematics, and all digital circuits are physical manifestations of boolean logic.

      Furthermore, mathematics is used always in ECE. Math is used always in any engineering; engineering - math = guesswork. Problems in engineering always reduce to equations, always reduce to math.

      --
      It has been a nervous year, with people beginning to feel like Christian Scientists with appendicitis.
    4. Re:That is the problem by nasch · · Score: 1

      It seems to me you're conflating computer programming with program execution. My programming seldom involves me doing any math at all. Even when it is doing math, it's generally nothing more arduous than very simple algebra. When the program runs, there could be lots of math taking place, but that doesn't have any effect on the writing of the program. And I am deeply offended at the suggestion that I write VB. ;-)

  67. Re:Artificial Intelligence? by LS · · Score: 1

    Also, different language would compress at different rates. For example, almost every phrase in a technical manual is concrete and will be repeated elsewhere, whereas a poem may have several combinations of words never before put together. And then, there are more layers of meaning in a poem, many of which can't be extracted by humans let alone AI. Wikipedia is probably a good test, because direct language makes up the majority of its text. But it is by no means a representation of all human language.

    LS

    --
    There is a fine line between being a cultivated citizen and being someone else's crop. - A. J. Patrick Liszkie
  68. Re:Artificial Intelligence? by Rakshasa+Taisab · · Score: 1

    No it won't... The thing here is that each option has a different weight, and I assume the compressed data would just contain the rank of the word to be selected.

    So if you have "The sky is", and the AI ranks "blue", "red", "falling", ... then the data stream just needs to contain the index 2 if the original text was "The sky is red".
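
    A minimal sketch of that rank idea, assuming a toy bigram word counter in place of a real language model. Nothing here is taken from paq8hp12; the names `train`, `ranking`, `encode` and `decode` are invented for the example, and ranks are counted from 0.

```python
from collections import Counter, defaultdict

def train(words):
    """Count which word follows which (a toy stand-in for a real predictor)."""
    follows = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def ranking(model, prev):
    """Candidates after `prev`, most frequent first. Ties are broken
    alphabetically so encoder and decoder always agree on the order."""
    return sorted(model[prev], key=lambda w: (-model[prev][w], w))

def encode(model, words):
    """Store the first word as-is, then the rank of each following word.
    Words the model did not predict are stored as ("literal", word).
    Assumes `words` is non-empty."""
    out = [words[0]]
    for prev, nxt in zip(words, words[1:]):
        ranked = ranking(model, prev)
        if nxt in ranked:
            out.append(ranked.index(nxt))
        else:
            out.append(("literal", nxt))
    return out

def decode(model, codes):
    """Replay the same predictor to turn ranks back into words."""
    words = [codes[0]]
    for code in codes[1:]:
        if isinstance(code, int):
            words.append(ranking(model, words[-1])[code])
        else:
            words.append(code[1])
    return words

if __name__ == "__main__":
    corpus = ("the sky is blue . the sky is blue . "
              "the sky is red . the sky is falling").split()
    model = train(corpus)
    codes = encode(model, "the sky is red".split())
    print(codes)
    print(decode(model, codes))
```

    Small ranks dominate when the predictor is good, and a stream of mostly small numbers is cheap to store with an entropy coder. The round trip only works because the decoder runs exactly the same deterministic model as the encoder.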

    --
    - These characters were randomly selected.
  69. Very silly goal by Ancient_Hacker · · Score: 2, Interesting
    This is about the silliest competition imaginable. Think:

    Compression has reached a point of diminishing returns, getting less and less return for more and more work. And at best it's asymptotically approaching the theoretical limit. You could offer a billion dollar prize and get back maybe a few percent of improvement, while making any further improvement more difficult.

    Meanwhile data storage and data transmission technology keeps improving many percent a year, with each improvement compounding on the previous ones.

    In Other Words, IMHO money would be better spent on the second area rather than the first.

    1. Re:Very silly goal by Baldrson · · Score: 1

      The value of the Hutter Prize is the epistemological implications of the compressed human knowledge: What constructs/concepts did it reify in order to achieve better compression of human knowledge and how much did they contribute? That tells us a lot about the human conversation we can't get any other way.

    2. Re:Very silly goal by Just+Some+Guy · · Score: 1

      Meanwhile data storage and data transmission technology keeps improving many percent a year, with each improvement compounding on the previous ones.

      In Other Words, IMHO money would be better spent on the second area rather than the first.

      You're on the Vista release team, aren't you?

      --
      Dewey, what part of this looks like authorities should be involved?
    3. Re:Very silly goal by Ancient_Hacker · · Score: 1
      >The value of the Hutter Prize is the epistemological implications of the compressed human knowledge: What constructs/concepts did it reify in order to achieve better compression of human knowledge and how much did they contribute? That tells us a lot about the human conversation we can't get any other way.

      Perhaps you're misunderstanding what a computer can and can't do. Computers are good at searching out literal patterns, really poor at understanding and digesting.

      Computer compression schemes can see really obvious patterns and repetitions. But that's got nothing to do with anything like concepts or epistemology. We're not going to learn a thing along those lines.

      For example, compression schemes can see when a text repeats over and over and over and over and over the same exact text, and they can compress that down very nicely. But something as simple as saying hello, verbalizing greetings, expressing salutations, giving you a shout-out, conveying warm acknowledgement of one's existence-- no, computers are unlikely to ever be able to really deeply understand the similarities there and do any useful compression of information. Likewise any old human can find irrelevant sentences that by the way my sister Susie is prone to include, why she almost flunked out of Defenestration U for that, and many other things too like staying out too late. And human editors can snip that stuff out without even thinking but I'd bet several kilobucks or as we call them in Dismal Seepage, Ohio, "not spare change I tell ya buddy", no computer within 50 years will be able to do a comparable job.

    4. Re:Very silly goal by Baldrson · · Score: 1
      No, you have misunderstood what I mean when I say:

      What constructs/concepts did it reify in order to achieve better compression of human knowledge and how much did they contribute?
      The "it" to which the above sentence refers is not the compression program, but rather the program that outputs the corpus: the decompression program plus the compressed representation. In principle, this decompression program and compressed representation can be constructed entirely by scholars/epistemologists with no computers assisting them except during the testing of the decompression process.
    5. Re:Very silly goal by Ancient_Hacker · · Score: 1

      Sorry, I thought they were compressing text with a computer algorithm, not people.

    6. Re:Very silly goal by Baldrson · · Score: 1

      The Hutter Prize submissions consist of compressed corpus plus decompressor. It does not specify the method used, quite deliberately. In practice, it will probably be a combination of human insight augmented by computation that will produce the compressions. In the early stages, it should not be surprising that computer programs without a lot of a priori knowledge are squeezing out redundancy.

  70. You mean like this?: Re:That's cool.. by Anonymous Coward · · Score: 0
  71. Re:Artificial Intelligence? by Anonymous Coward · · Score: 1, Informative

    Actually, here's a paper on just that.

    Text Compression as a Test for Artificial Intelligence
    http://citeseer.ist.psu.edu/171781.html

  72. How much memory? by tepples · · Score: 0

    so...wikipedia dumps will now be using paq8hp12 instead of l33t 7zip ?
    I think the Wikipedia dumps are built to run on machines that don't have nearly 1 GiB of RAM. Older motherboards that are still in use often can't take more than half a GiB.
    1. Re:How much memory? by Anonymous Coward · · Score: 1, Informative

      I think the Wikipedia dumps are built to run on machines that don't have nearly 1 GiB of RAM.
      For compression, that is. Chances are it won't need anywhere near that to decompress (though I can't find any references). So as long as someone else compressed the dump you'll likely be able to use it on your weaker hardware.
    2. Re:How much memory? by tronicum · · Score: 1
      An interesting point about the compressor is that you need a dictionary file.
      Not everybody downloads Wikipedia (probably very few, I guess), but if you would like to have a simple word list (even a compressed one) that originates from the Wikipedia of your language, you could use that as a reference for compression stuff

      to get a) a good dictionary (for the words of a language, not for content) and b) a big chunk of nonsense that is still useful for scientific (de)compression.

  73. PAQ8, Hutter Prize branch, version 12 by tepples · · Score: 4, Informative

    As far as I can tell given this Wikipedia article, "paq8hp12" means PAQ8, Hutter Prize branch, version 12.

  74. Phew... by hotfireball · · Score: 1

    Phew... I can compress to zero! Well, no decompressor has been written yet though......

  75. But of course, you don't need math for this... by maillemaker · · Score: 2, Funny

    Surely you don't need any mathematical skills to do this kind of work...

    http://science.slashdot.org/comments.pl?threshold=1&mode=thread&commentsort=0&sid=247781&op=Reply ;)

    --
    A work that expires before its copyright never enters the public domain and thus enjoys eternal copyright protection.
  76. FLT by laejoh · · Score: 0

    Cuius rei demonstrationem mirabilem sane detexi. Hanc limen exiguitas non caperet.

  77. Re:Artificial Intelligence? by 4D6963 · · Score: 1

    The idea is that the closer to the perfect compressor you have, the closer to artificial intelligence you are.

    If by artificial intelligence you mean strong AI, then I disagree. As perfect as a compressor could be, even if switching some bit in the compressed data resulted in something else in the decompressed result that made sense, and even if you used that to give a sensible reply to a human speaker and thus pass the Turing test, you wouldn't have a strong AI, because the program still wouldn't know what it's doing (the Turing test is not a strong AI test).

    --
    You just got troll'd!
  78. Re:Artificial Intelligence? by smallfries · · Score: 1

    What is the meaning of "optimal" in the phrase "optimal behaviour of an agent"? With respect to what criteria? Clearly it can't mean maximisation of reward, as I can construct an infinite number of pathological environments that deliberately require complex models rather than simple models. To prove optimality of an agent would require that there are not more of these pathological constructions than environments amenable to Occam's razor in the distribution of all possible environments. I doubt that is a decidable question.

    The AIXI paper is very interesting, although somewhat verbose :) He does a good job of explaining his ideas / and his proof in a straightforward manner. What was he like in person?

    --
    Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
  79. Re:Artificial Intelligence? by tepples · · Score: 1

    Which, of course, is fine if and only if you can guarantee that your 'decompression' will work the same way every time. For a learning AI that's not really an option. This means that a learning AI needs at least some redundancy for robustness, much like an error-correcting code. But compression researchers can still establish bounds on the size of a dataset by building a non-learning AI.
  80. Enough "Turing Test" articles, already! by seandiggity · · Score: 1

    I've held back from replying to the myriad of other /. articles about the "Turing test", but I can't help responding to this one. There is no meaningful comparison between the achievements of this program and the cognitive capacities of human beings. I agree with Noam Chomsky on this issue; since I can't state it as eloquently or concisely as him, here's his take on the subject.

    --
    Geeks like to think that they can ignore politics, you can leave politics alone, but politics won't leave you alone.-rms
  81. Hutter Prize rules by tepples · · Score: 1

    I heard the decompression binary is around 100.1MB....

    Poor joke. The Hutter Prize rules include the size of the decompressor in the size of the entry. Decompressors may depend only on stock libc of Windows or GNU/Linux operating systems. In practice, they'll need to run on a net-disconnected machine with a fresh OS install.
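
    As a rough illustration of how an entry is scored under that rule, the sketch below turns a total entry size (compressed data plus decompressor) into bits per character. The byte counts are the ones quoted in the story summary; the function name is made up for this example and is not part of any official tooling.

```python
def bits_per_character(total_entry_bytes: int, corpus_bytes: int) -> float:
    """Bits of entry (compressed data plus decompressor) per byte of corpus."""
    return 8 * total_entry_bytes / corpus_bytes

# Figures from the story summary: 16,481,655 bytes in total
# for the first 100,000,000 bytes of Wikipedia.
bpc = bits_per_character(16_481_655, 100_000_000)
print(f"{bpc:.3f} bits per character")
```

    Counting the decompressor in the total is what keeps the "100.1MB binary" trick from scoring well.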

  82. Patents apply to arithmetic coding by tepples · · Score: 1

    Arithmetic/range coding address these issues, and come VERY near to entropy, so entropy coding is a solved problem. Or rather, the solvedness of entropy coding depends on the country where the computer is located, and it will automatically become a solved problem on the day essential patents expire. Or have they already?
    1. Re:Patents apply to arithmetic coding by ardor · · Score: 1

      Range coding is essentially the same as arithmetic coding, but patent-free. It is minimally worse than AC, but the difference is insignificant in practice.
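
      To make "very near to entropy" concrete, here is a minimal sketch that computes the order-0 Shannon entropy of a byte string, which is roughly the bits per symbol an ideal arithmetic or range coder would spend if it were driven by those exact symbol frequencies. It is not an arithmetic coder itself, and the sample string is just an illustration.

```python
import math
from collections import Counter

def order0_entropy_bits(data: bytes) -> float:
    """Shannon entropy in bits per symbol, treating bytes as independent draws."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

sample = b"abracadabra"
h = order0_entropy_bits(sample)
# An ideal coder using these frequencies needs about h * len(sample) bits,
# plus a small constant overhead and the cost of describing the frequencies.
print(f"{h:.3f} bits/symbol, about {h * len(sample):.1f} bits in total")
```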

      --
      This sig does not contain any SCO code.
  83. No lzip? by Gothmolly · · Score: 0

    Is the lzip sourceforge site still around? They should compare it against that.

    --
    I want to delete my account but Slashdot doesn't allow it.
  84. Link to source for the minor version prior to hp12 by mrand · · Score: 1

    I now await my many awards for searching the Internets - and then another award for each 1% improvement in time that I demonstrate...

    http://cs.fit.edu/~mmahoney/compression/paq8hp11any_src.rar

    --
    -- PGP keyID: 0x4C95994D
  85. Re:Artificial Intelligence? by silent_artichoke · · Score: 1

    apt-get install gradstudent

  86. I for one.. by ericrost · · Score: 1

    welcome our new Wikipedia reading overlords.

  87. It's AI Jim, bit not as we know it! by ifknot · · Score: 0

    I'm in ur text comprezzin without using semanticz

    --
    we are all cosmic nuclear waste
  88. will not help AI by kwikrick · · Score: 1

    I do not dispute that compression and AI are related, from an information-theoretic perspective. I've done quite a bit of pondering and tinkering with the AI=prediction=compression approach myself. However, I doubt that this research will help AI much further. A true universal AI would be able to do a good job of compressing some data, and compression is a very fundamental problem. Solve this fundamental problem, and perhaps you will have a useful result for a universal AI. However, compressing text is just a very small subset of the fundamental problem. An algorithm that is good at compressing text is not necessarily similar to or helpful for building a universal AI. I don't think paq8hp12 brings us closer to a universal AI.

    --
    assignment != equality != identity
  89. No source in paq8hp12any.rar by Anonymous Coward · · Score: 0

    So I thought I'd take a look at the source, and noticed there is no source in the rar paq8hp12any.rar.. You've got the compiled executable, and two dictionary files. Since the program states it is gpl when trying to be run, you'd think he would include the source, or at least a way of contacting him? Any ideas on where to get a copy of the source? I've only been able to find older copies for previous versions.

    Also, seems really strange that there are people commenting about the source on slashdot who didn't even look at it, because it isn't there?

  90. So to get real AI by Anonymous Coward · · Score: 0

    All we have to do is fashion a big black rectangular prism, polish it up real nice, and launch it into space. Then we put this algorithm on a spaceship and have it pass by... This time we call it PAQ, and no one on the ship better be named Dave.

  91. [RAR file] by zero1101 · · Score: 1

    The fact that it's distributed as a RAR archive kinda says a lot.

    1. Re:[RAR file] by The+Cisco+Kid · · Score: 1

      It says that the new compressor is good at compressing English text, not necessarily program code (text or binary).

      It also says that distributing a new compressor/decompressor compressed in its own format creates a chicken-and-egg problem if you don't already have the program.

      Now, why RAR instead of .tar.gz, I have no answer for, especially for something that is purportedly FOSS.

  92. Science != Math by sacrilicious · · Score: 1
    Anything that is science is math.

    Substantial over-generalization... I'll settle for "many scientific endeavors are well-assisted by math." Science is fundamentally the process of evaluating a hypothesis. If my hypothesis is that the next time I close my eyes I'll smell tulips, there's no math involved in evaluating this. I could *make* it mathematical, changing my hypothesis to something statistical, like "50% of people smell tulips upon closing their eyes at least some of the time"... but that's a different hypothesis.

    --
    - First they ignore you, then they laugh at you, then ???, then profit.
    1. Re:Science != Math by StarvingSE · · Score: 1

      Math is the language of science. Do you think there would be any science if there was no concept of math? How do you evaluate a hypothesis without mathematical models? I know high school physical science doesn't go into this much detail, but it is still there in the background.

      --
      I got nothin'
    2. Re:Science != Math by rcw-home · · Score: 2, Insightful

      If my hypothesis is that the next time I close my eyes I'll smell tulips, there's no math involved in evaluating this.

      "When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science."
      --Lord Kelvin

    3. Re:Science != Math by HiThere · · Score: 2, Insightful

      You need to read Michael Faraday, or some of his predecessors.

      Math is a relatively late addition to science. Yes, it's proved very useful. But science happened long before they introduced math.

      Well, thinking again, this depends on what you mean by math. Leonardo used math to figure out perspective. Does this mean that art depends on math? If so, then science depends on math, and so does walking across the room. And I can see a valid argument to be made along those lines, but that's not what people normally mean. If we look at what people normally mean, then science didn't depend on math until around the time of Kepler. Perhaps you want to call everything earlier engineering rather than science, but engineering depends on math just as heavily as science.

      What actually happened was that after algebra was invented, and Arabic numerals, it became a lot easier to describe things in math, so people gradually switched away from describing things in ordinary language and to describing them in math. This has had both advantages and disadvantages. Certainly precision has improved. But comprehension by "ordinary folk" has declined, and not entirely because of the arcane subject matter, but also because they needed to learn a new language in order to understand what was being talked about.

      OTOH, can you imagine talking about computer programming without using "jargon"?

      --

      I think we've pushed this "anyone can grow up to be president" thing too far.
    4. Re:Science != Math by myowntrueself · · Score: 1

      OTOH, can you imagine talking about computer programming without using "jargon"?

      I always thought that was the whole point of COBOL....

      --
      In the free world the media isn't government run; the government is media run.
    5. Re:Science != Math by Zombywuf · · Score: 1

      Wrong, the only generally accepted way to evaluate a hypothesis such as this is by statistical means. That is, mathematical means, hypothesis testing to be precise.
      Thank you, come again.

      --
      If you can read this you've gone too far.
  93. Re:Artificial Intelligence? by UbuntuDupe · · Score: 1
    I don't know if this is quite what you are getting at, but it seems like you are saying:

    What if someone proposed this algorithm?

    Compression key:

    0 = 0
    1 = Wikipedia, the free encyclopedia. Welcome to ... [i.e., the entire Wikipedia database you're supposed to be compressing]

    And then stored Wikipedia as "1". And you'd be correct, that you can give a false "improvement" in the compression by stuffing the informational complexity within the compression algorithm and completely destroy its usefulness. It would be great for one very specific datastream, but nothing else.

    More generally, if you want a compression algorithm for an arbitrary data stream, you need to rely on some kinds of patterns being in it, that are inherent to the kind of data it is. Truly random data can have no compression because the average length in compressed form will be no shorter -- there are no patterns to exploit.
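
    A quick way to see this, sketched with Python's standard zlib module rather than anything from the PAQ family: compress a highly patterned buffer and an equally long buffer of random bytes. The sample text and the helper name are assumptions made for the demonstration.

```python
import os
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data at the highest compression level."""
    return len(zlib.compress(data, 9))

patterned = b"the quick brown fox jumps over the lazy dog. " * 2000
random_bytes = os.urandom(len(patterned))

# The patterned buffer shrinks dramatically; the random one stays at
# roughly its original size, because there is no pattern to exploit.
print(len(patterned), compressed_size(patterned))
print(len(random_bytes), compressed_size(random_bytes))
```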

    Good point, but I'm not sure if I said what you were looking for :-/
  94. Re:Artificial Intelligence? [OT] by Anonymous Coward · · Score: 0

    Coincidentally, the Wikipedia article I read just before reading your post was Archaeopteryx. Waaaay off-topic. Sorry.

  95. In Soviet Russia ... by Anonymous Coward · · Score: 0

    Their russian compressors are better and quicker than yours.

    GRZipII, UHARC, Thor, TarsaLZP, ...

    random page 2
    random page 1
    random page 3
    random page 0

    1. Re:In Soviet Russia ... by Anonymous Coward · · Score: 0

      1. Turtle 0.03               -> 12.089.054 in 9,580 sec.
      2. LZPM 0.06                 -> 12.800.532 in 30,809 sec.
      3. Thor 0.95 (e4)            -> 13.839.120 in 4,527 sec.
      4. Quad 1.12 -f              -> 13.274.400 in 15,391 sec.
      5. 7Zip 4.41 LZMA Max Speed  -> 17.747.266 in 13 sec.

      http://quad.sourceforge.net/ says
      QUAD is a high-performance file compressor that utilizes an advanced LZ-based compression algorithm. Its main features are high compression ratio and fast decompression speed.

      Original 24,375,895 bytes
      QUAD 1.12, -x 5,637,162 bytes
      PKZIP 2.50, -max 8,182,951 bytes

  96. Re:Artificial Intelligence? by Anonymous Coward · · Score: 0

    But human pattern recognition is based on a large and limited database. The brain has quite a number of parts in it (consider a single cell is a massively complex piece of machinery). If you had a similarly complex computer it would be pretty damn good at pattern recognition too (probably way better than a human even).

  97. Re:Artificial Intelligence? by Lord+Ender · · Score: 1

    I can't say how well this will advance "AI," but it would certainly have fantastic implications for the computer translation world!

    --
    A slashdotter who didn't build his own computer is like a Jedi who didn't build his own lightsaber.
  98. Uh Mistake? by The+Cisco+Kid · · Score: 1

    If it's "Open Source" and "GPL", why is there only a proprietary binary executable and no source code?

    The article doesn't mention it being GPL, where did this come from?

  99. The irony by ScriptedReplay · · Score: 1

    Anything that is science is math.
    Science != Math

    While I think he mis-stated "Anything that is science requires math", you just proved the point. "All in A are in B" does not imply "A equal to B". You might want to revisit either Logic or Set Theory which, incidentally, are ... math.
  100. Re:Artificial Intelligence? by leonem · · Score: 1

    It's possible the difference you're describing is not to do with the underlying 'guessing' algorithm, but the IO layer, i.e: if a first layer encoded the text according to thesaurus-style meaning groups (with a certain amount of context-sensitivity to account for homonyms), and then the algorithm produced a 'reply' guess based on this semantic encoding, and finally the first layer decoded (with random diction), you'd have an effect similar to the one you describe.

    Admittedly, the grammatical interpretation process here is not trivial, but even Word does it to a certain extent (yes, I know it's famously bad at certain constructions, but one can at least see how they could be improved). Of course, it wouldn't quite do what you describe until the 'meaning units' are considerably longer than single words, and this would mean the first layer would have to be a good deal more sophisticated. However, that sort of pattern-grouping is another area in the same domain as compression algorithms (consider fractal compression), so it's not ridiculous to relate them to AI, even if you don't accept that the problems are a 1:1 match.

    A criticism of what I've written above would be that the specific algorithm under discussion in TFA is unlikely to be as suitable for guessing these theoretical semantic units as plain text. This is true, however it doesn't negate the underlying point that some other algorithm built around another data model might display qualities necessary to pass the Turing test, one perhaps built around lessons learnt from this breakthrough.

  101. Reedickulousity by DynaSoar · · Score: 1

    Guess those successive characters, Shannon.

    From T(second)FA: "Shannon (1950) estimated the entropy of written English to be between 0.6 and 1.3 bits per character (bpc), based on the ability of human subjects to guess successive characters in text."

    What about reading not just whole words but several words at a time (Miller's magic 7+/- 2 item "chunking")? What about guessing the rest of a sentence from the first one or two such chunks? What about guessing the rest of a paragraph or statement from the first couple sentences and/or context? All these are functions of the brain's heuristic processing and "compress" language. If language compression is going to be a metric for AI based on the estimated brain processes it's competing with, then all the processes involved should be measured and accounted for.

    Why anyone would cripple a perfectly good computer by forcing it to pretend to be a very different device that operates on very different principles is beyond me. But then Shannon was an engineer, not a cognitive psychologist. For that matter, neither was Turing.

    --
    "I may be synthetic, but I'm not stupid." -- Bishop 341-B
    1. Re:Reedickulousity by demi · · Score: 1

      The thing is, Wikipedia is not a good example of general written English. It would probably have a lower entropy than a random sample of prose.

      Wikipedia article text contains a great deal of repetitive structured syntax associated with formatting, document structure, article classification, etc. In addition to this literal structure, its prose is highly stereotyped due to the pedagogical nature of encyclopedia writing, and to the fairly tight conventions around Wikipedia article-writing itself. The vast majority of article first lines could be generated from a table of meta-data, for example ("'''$name''' ($birthdate - $deathdate) is a '''$nationality''' '''$profession'''..."). Many sorts of sentences and paragraphs never appear in Wikipedia, as they might in a random sample of writing. Many whole articles are essentially the same as every other article in a series, with the small exceptions of things like the name and date of introduction.

      --
      demi
    2. Re:Reedickulousity by Anonymous Coward · · Score: 0

      You have actually identified the idea behind the AI-compression connection: the best possible guess for the next character takes into account the meaning behind the text: what the author seems to be getting at, what kind of ideas are being presented in what kind of sentences, which gives clues for the next word, and all this is definitely useful for determining the next character. A human being does take all this into account, and it is assumed that a computer would have to do similar analysis to reach a comparable result. The only uncertainty with regard to the result is that a computer is likely to take advantage of many more "stupid" patterns than a human, and thus produce good results without true understanding. However, a computer that has both understanding and an extensive non-intelligent search over repeating patterns will be able to reach even better compression - we just don't know how good this compression will be in comparison to a human who relies wholly on intelligence with little brute-forcing capability for the details.
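
      A toy version of the "stupid patterns" end of that spectrum can be sketched as follows: predict each character from the few characters before it, learn the counts on the fly, and add up the ideal code length of each prediction. This is only an illustration with add-one smoothing over an assumed 256-symbol alphabet; it has no understanding at all and is far simpler than the context mixing used by the PAQ family.

```python
import math
from collections import defaultdict

def adaptive_bits_per_char(text: str, order: int = 2, alphabet_size: int = 256) -> float:
    """Average ideal code length when each character is predicted from the
    preceding `order` characters, with counts learned while reading."""
    counts = defaultdict(lambda: defaultdict(int))  # context -> next char -> count
    totals = defaultdict(int)                       # context -> total count
    total_bits = 0.0
    for i, ch in enumerate(text):
        context = text[max(0, i - order):i]
        # Add-one smoothing so unseen characters still get nonzero probability.
        p = (counts[context][ch] + 1) / (totals[context] + alphabet_size)
        total_bits += -math.log2(p)
        counts[context][ch] += 1
        totals[context] += 1
    return total_bits / len(text) if text else 0.0

sample = "the cat sat on the mat. " * 200
for k in (0, 1, 2, 3):
    print(k, round(adaptive_bits_per_char(sample, order=k), 3))
```

      On repetitive input like this, longer contexts should bring the figure down; real prose gives far less regular results.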

    3. Re:Reedickulousity by pclminion · · Score: 1

      For somebody who's posted no credentials of your own, your ridicule of Shannon is hollow. He measured the conditional entropy of English letters based on human trials and arrived at a particular value. The reason he did this was not because he thought letter-sequence prediction was some kind of wave of the future -- he did it to test an important theorem which predicts the minimum possible entropy of a sequence given a theoretically optimal encoding scheme -- and in doing so, he learned that humans come close, but not quite right up to, the theoretical limit.

      Now THAT is an interesting result, wouldn't you agree?

      Years later, some researchers have now used this number in a way that may or may not be appropriate (I think it's bogus, personally) -- but it has nothing to do with Shannon.

      The way you take this article and somehow twist it into an attack on Shannon and Turing, is actually kind of sickening. At least you haven't been modded up.

    4. Re:Reedickulousity by DynaSoar · · Score: 1

      > For somebody who's posted no credentials of your own, your ridicule of Shannon is hollow.

      I'm a cognitive neuroscientist. I've studied under Karl Pribram, and through his lab worked with David Bohm's partner Basil Hiley, and two physicists from Japan, Jibu and Yasue. I contributed to Pribram's work by applying tensor calculus extending his use of Gabor's math to describe brain processes and electrical fields as being similar to holography. I've discussed theories of consciousness with Roger Penrose and contributed to that work through his partner Stuart Hameroff. I've studied at the Santa Fe Institute twice. I regularly work with physicists and electrical engineers to apply signal processing concepts such as wavelet transform to brain imaging. You can look for my other postings about my work with a Parkinson's prevention. I'm very much an experimentalist and methodologist, but with a heavy background in cognitive science theories. I've worked at NIH/NIDCD (Deafness and Communication Disorders) and the Department of Psychiatry at Yale Medical School. As such I'm well qualified to critique Shannon's application of "compression" to language and the validity of its extension to the use in the article. I design psychology experiments based on concepts like Shannon's theory. Shannon's experimental design is sadly lacking in cognitive psychological considerations, particularly those relating to language comprehension, and the conclusions drawn are therefore flawed. I specified the particular problems in the posting you replied to. The comment about Turing was an offhand comment, but is just as valid for the same reasons as those regarding Shannon, as well as being on topic, as the claim in the parent article was regarding AI.

      Do you perhaps have credentials to validate your critique of my comments, or were you just blowing smoke out your ass?

      --
      "I may be synthetic, but I'm not stupid." -- Bishop 341-B
  102. Where is the source? by mi · · Score: 1

    Alexander Ratushnyak's open-sourced GPL program is called paq8hp12 [rar file].

    Well, if it is GPLed, it is in violation of the license, because the source is nowhere to be found — the RAR-file contains the Windows executable and two dictionary files — where is the source code?

    The readme.txt in the directory does not mention the sources either...

    --
    In Soviet Washington the swamp drains you.
    1. Re:Where is the source? by Anonymous Coward · · Score: 0

      He's the owner of the copyright, he cannot be in violation of the licence.

    2. Re:Where is the source? by TommydCat · · Score: 1

      Well, if it is GPLed, it is in violation of the license...
      I wouldn't think the copyright holder would be bound by his own license if it was a completely original (non-derivative) work. However, this would effectively prevent anyone else from copying and distributing (those two acts tied together) since no one else would be able to meet the terms of said license...
      --
      This comment does not necessarily represent the views and opinions of the author.
    3. Re:Where is the source? by mi · · Score: 1

      Whether there exists an actionable violation of the license or not is a moot point. My complaint was about Slashdot announcing it as an open-source program, when no source is, in fact, being distributed...

      The submitter goofed and the editor failed the "due diligence" — again...

      --
      In Soviet Washington the swamp drains you.
    4. Re:Where is the source? by Anonymous Coward · · Score: 0

      Bitch, bitch, bitch. Typical whiner. The author of the code OWNS THE COPYRIGHT. He can do whatever the fuck he wants. Go back to your hole.

  103. Re:Artificial Intelligence? by leonem · · Score: 1

    Read Permutation City!

  104. It's impractically slow by p3d0 · · Score: 1

    It's a research tool really. I've tried it. It achieves phenomenal compression, but it takes several orders of magnitude longer to run than gzip.

    --
    Patrick Doyle
    I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
  105. Re:Artificial Intelligence? by hesiod · · Score: 1

    > The fact that it's raining. That there's a girl sitting on the roof of the car. You *don't* in any way shape or form have enough info to be able to reconstruct the image in a lossless manner.

    Quite true. Especially since it was not raining, she was actually being sprayed with a hose. The bikini is what tipped me off.

    Semi-seriously though, (and not really directly-related to your post) taking into account the context of the image does seem rather important. Can a computer determine the difference between rainfall and water sprayed from a hose in a still image? The fact that she is wearing a bikini does not preclude the water being rainfall, although the chances may be lower. It seems to me that any assumption at all, even the smallest leap of logic, will result in unreliable data, even if the result appears to be correct.

    Therefore, data compression is inherently unpossible! QED. (speaking of logical leaps...)

    Don't mind me, it's morning and forming a coherent thought is still a few cups of coffee away.

  106. QED by hummassa · · Score: 1

    Science is fundamentally the process of evaluating a hypothesis Which is an application of logic, that is a branch of... math. There are a lot of other good answers...
    --
    It's better to be the foot on the boot than the face on the pavement. ~~ tkx Kadin2048
    1. Re:QED by AoT · · Score: 1

      And by math you mean philosophy.

      I think you've got your terms confused.

  107. Engineering vs. Science. by hummassa · · Score: 1

    designing circuits, optimizing circuits Nope, that's engineering. Sorry.

    (Reference: I am an undergraduate Electrical Engineering and Computer Engineering major, or IAAEECE) I am a B.Sc. in Computer Science (graduated 15 years ago) and I studied circuits design and optimization (from the mathematical POV).

    --
    It's better to be the foot on the boot than the face on the pavement. ~~ tkx Kadin2048
    1. Re:Engineering vs. Science. by Lord+of+Hyphens · · Score: 1

      Point. Of course, Computer Engineering is a strange bastard mix of Electrical Engineering and Computer Science.

      --
      "I've spent my whole life figuring out crazy ways to do things. It'll work." -- Montgomery Scott, "Relics"
  108. Re:Artificial Intelligence? by jonadab · · Score: 1

    > The organizers believe that text compression and AI are equivalent problems.

    I'm pretty sure I don't believe that.

    Text compression is very much a form of computation, something computers are naturally very good at. There's a lot of arithmetic, a lot of searching and comparison, and so forth. I'm not aware of any compression algorithm that involves understanding what the text means (unless you count synopsis, but that's very much lossy and gets compression rates that are numerous orders of magnitude better than anything we're talking about here).

    --
    Cut that out, or I will ship you to Norilsk in a box.
  109. Re:Artificial Intelligence? by Baldrson · · Score: 1
    The same figure of merit has existed for decades before, and it has never proven to be very useful in evaluating AI systems.

    Are you saying NLP model perplexity has never proven to be very useful in evaluating AI systems? Really? How are you measuring utility?

  110. Re:Artificial Intelligence? by Jon+Abbott · · Score: 1

    Bingo. Our minds are somewhat lossy, and perhaps that is one reason why we are able to remember so much information. We often disregard or forget dissenting and/or contradictory information to reaffirm our memory. Creating lossless AI almost sounds like an oxymoron from that perspective...

  111. Why compression is AI by plunderphonic · · Score: 1

    Compression finds the underlying regularities (patterns) in the data.

    Here is how it can be used for prediction:

    Start with data D.
    Let c(D) be the size of compressing D.
    Let's say we want to predict which of two predictions, p or p', is more likely, given D.
    We say that p is more likely than p' iff c(D + p) < c(D + p'), where + is concatenation.

    In other words, we predict the statement that is more similar to the previously observed data.
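
    A minimal sketch of that rule, using Python's standard zlib module as a stand-in for the compressor c (an assumption for illustration; a general-purpose compressor like zlib is a crude similarity measure compared with PAQ-style models). The function names and the sample data are made up here.

```python
import zlib

def c(data: bytes) -> int:
    """Compressed size in bytes; zlib stands in for the compressor."""
    return len(zlib.compress(data, 9))

def more_likely(D: bytes, p: bytes, p_prime: bytes) -> bytes:
    """Return whichever continuation makes D + continuation compress smaller.
    Ties go to p_prime, since the comparison is strict."""
    return p if c(D + p) < c(D + p_prime) else p_prime

D = b"the sky is blue. the grass is green. " * 50
print(more_likely(D, b"the sky is blue. ", b"qzx vw kjp yfgh. "))
```

    Both candidates have the same length, so the comparison reflects how well each one fits the patterns already seen in D rather than how long it is.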

  112. how? by OtherFarm · · Score: 1

    how do I install it on my girlfriend?

    1. Re:how? by Icarium · · Score: 1

      Why do you want to compress your girlfriend?

  113. Optimistic? by mattr · · Score: 1

    Is the Hutter Prize a bit optimistic and perhaps misleading? We are not 1% away from "AI", whatever that means, even though one might think so. It would appear that the Prize set the milestone a bit too easily in reach of non-AI systems, in other words systems that are mechanistically, strategically smart but not anything like what one would think of as AI. In other words, text compression is NOT equivalent in scale or depth to the Turing test, and they only look similar as they closely approach unity. Put another way, does PAQ do anything beyond language and math tricks, or is it really "understanding" anything at all in the text?

  114. Re:Artificial Intelligence? by 1729 · · Score: 1

    I'm pretty sure I don't believe that.

    Text compression is very much a form of computation, something computers are naturally very good at. There's a lot of arithmetic, a lot of searching and comparison, and so forth. I'm not aware of any compression algorithm that involves understanding what the text means (unless you count synopsis, but that's very much lossy and gets compression rates that are numerous orders of magnitude better than anything we're talking about here).


    I had the same initial reaction, and I'm far from convinced that text compression is equivalent to AI, but it seems reasonable that tools like machine learning could be used to develop better compression heuristics for specific types of text. The algorithm doesn't need to understand the meaning of the text; it just needs to apply the heuristics.
  115. For the sake of completeness, by hummassa · · Score: 2, Insightful

    in mathematical form:

    science = math + measurements

    That's it. Science is:
    1. measure phenomena,
    2. figure out the formulas,
    3. predict new phenomena,
    4. measure new phenomena,
    5. if Ok, back to stage 3; if not, back to stage 2.

    (ok, ok, 6. (...), 7. Profit!!!, just to appease the masses)

    notice stages 1 and 4 are measurements, stages 2 and 3 are maths.

    --
    It's better to be the foot on the boot than the face on the pavement. ~~ tkx Kadin2048
  116. Re:Artificial Intelligence? by Myrcutio · · Score: 1

    You might be interested in looking into AI development regarding the Asian game Go (Chinese WeiQi). The game is basically a pattern recognition competition, and is currently one of the only board games wherein even the best AI programs are beaten easily by average humans.

    The reason humans are better at pattern recognition is that we can select out the most likely course of events, in different areas, and actually predict how two sequences (which haven't happened yet) will affect each other.

    Think of it this way: you have three sentences you need to compress. "I am hungry". "My friend has a car". "There is no food in my house". Now, a human would easily predict what would follow after that, something along the lines of "I will go somewhere with my friend and eat something." The computer has to understand each sentence and predict an action based on each one, ruling out possibilities based on separate requirements implied by the following sentences. Easy for a human, hard for a computer. Currently, computers are good at predicting sequences with only one string, actions that only have one possible outcome, as occurs in checkers or chess. The difficulty comes when there are multiple answers to a single problem. Then, the computer has to decide which answer makes sense. Find me a computer that can do that, and I'd love to have a chat with it.

  117. Oblig Terminator Reference by nevillethedevil · · Score: 1

    Wikipedia becomes self-aware July 10th 2007 and launches its missile at Encyclopedia Britannica.........

    --
    Be gone from my sight or prepare to feel my flaming wraith!
  118. Re:Artificial Intelligence? by Anonymous Coward · · Score: 0

    You: the sky is blue today!
    Computer: Today is beautiful weather!

    You just gave an example of a natural sounding conversation...which means Turing PASS.

  119. Re:Artificial Intelligence? by Dan+D. · · Score: 1
    it's a lot more interesting in images,

    In my undergrad I did some work with jpeg wavelet compression using the QUEST algorithm, which is supposed to measure the minimum detectable change a human can see (using two-alternative forced choice) ... anyway the interesting result that came out was that wavelets could get about 20% better compression than traditional jpeg without noticeably changing the image. This is despite the RMS error compared with the original image being about the same as jpeg. This sorta implies that lossy wavelet compression deletes the stuff that the human eye wouldn't notice anyway. It's also been shown that wavelet compression can "fix" artifacts coming out of jpeg compression.

    Of course my bias comes from a result my professor showed me of building a steerable-twistable pyramid of wavelets for image recognition that used about 100 points and seemed to correspond nicely with the arrangement of the cortical hypercolumns in a monkey's V1 cortex... also implying that maybe the reason wavelets work so well when using human detection is that they do the same sort of thing the brain is doing... of course this is all just interesting supposition :)

    --
    People who quote themselves bug the crap out of me -- Me.
  120. Re:Artificial Intelligence? by Dan+D. · · Score: 1

    Taking that seriously ... it seems like the better test for AI compression isn't how well it can reconstruct the original text but whether or not a human reading the decompressed text thinks it says the same thing. The human reading the text might not recognize that "Clear skies" and "Good weather" are really different unless reading really carefully...

    --
    People who quote themselves bug the crap out of me -- Me.
  121. Where's the Java version? by heroine · · Score: 1

    Java is our best chance of getting these advanced algorithms running on Linux. Once again, whatever happens in Java is being done in C first, and years earlier. Now the best compressor which runs on Linux is barely even on the list.

    It's amazing 10 years after bzip2, to see progress still being made in data compression, even though the steps are much smaller.

  122. hardware by phorm · · Score: 1

    I think that in some cases it's due to the American wetware being too slow to process the more complex datastream. Somewhat like decoding divX video on an old Pentium I :-)

  123. Re:Artificial Intelligence? by Anonymous Coward · · Score: 0

    If you feed in the previous words in a conversation, the perfect compressor/predictor would know what words will come next. Such a machine could easily pass the Turing test by printing out the logical reply to what had just been stated. The problem... with your theory... Spock... is that humans... are not... logical!
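
    The predictor-equals-compressor link can be made concrete. What follows is only a minimal sketch, not how paq8hp12 works: it assumes a toy adaptive model that predicts each character from the few characters before it, and charges -log2(p) bits for a character predicted with probability p (the cost an ideal arithmetic coder would approach). The function name, the add-one smoothing and the sample text are all made up for illustration.

```python
import math
from collections import defaultdict

def code_length_bits(text, order=2, alphabet_size=256):
    """Ideal code length of `text` under a toy adaptive order-`order` model.

    Each character costs -log2(p) bits, where p is the probability the model
    assigned to it given the preceding `order` characters. Counts are updated
    only after the character is charged, so a decoder running the same model
    could stay in step. Add-one smoothing keeps every p above zero.
    """
    counts = defaultdict(lambda: defaultdict(int))  # context -> char -> count
    totals = defaultdict(int)                       # context -> total count
    bits = 0.0
    for i, ch in enumerate(text):
        ctx = text[max(0, i - order):i]
        p = (counts[ctx][ch] + 1) / (totals[ctx] + alphabet_size)
        bits += -math.log2(p)
        counts[ctx][ch] += 1
        totals[ctx] += 1
    return bits

# Invented sample text; better prediction shows up directly as fewer bits.
sample = "the cat sat on the mat. " * 200
for k in (0, 1, 2, 3):
    bpc = code_length_bits(sample, order=k) / len(sample)
    print(f"order {k}: {bpc:.3f} bits per character")
```

    On a repetitive sample like this one, longer contexts predict better and the bits per character drop, which is the whole argument in miniature.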
  124. Re:Artificial Intelligence? by fireboy1919 · · Score: 1

    is that they do the same sort of thing the brain is doing... of course this is all just interesting supposition

    Wavelet works better than jpeg because it flat-out produces less error than jpeg compression for the same level of compression. DWT is just a more clever algorithm than DCT. We're at the stage, though, that these really shouldn't be used as a baseline.

    DWT has been out and in use for a long time in jpeg2000, MrSid, and a lot of others.

    The point of both, though, is that high and low-frequencies (either in Wavelet space or Cosine space, or even Fourier space or pretty much any other frequency space) in images aren't really observed much by humans. This is easy to verify (and has been) by studying the visual receptors and nerve endings in the eyes. It's not supposition, as you mention. We know that's how it works.

    This is just the beginning, though. There's a lot more clever things in the jpeg algorithm (and its ancestors) besides encoding in frequency space and eliminating the unseen frequencies.
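
    For anyone who wants to see "transform, then drop what won't be noticed" in the smallest possible form, here is a sketch using one level of the Haar wavelet on a 1-D signal. Haar is chosen only because it fits in a few lines; it is not the filter JPEG 2000 uses, and the threshold and function names are assumptions for illustration.

```python
def haar_forward(signal):
    """One level of the unnormalised Haar transform on an even-length list."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    averages = [(a + b) / 2 for a, b in pairs]   # low-frequency part
    details = [(a - b) / 2 for a, b in pairs]    # high-frequency part
    return averages, details

def haar_inverse(averages, details):
    """Rebuild the signal from averages and details."""
    out = []
    for avg, det in zip(averages, details):
        out.extend([avg + det, avg - det])
    return out

def lossy_roundtrip(signal, threshold):
    """Zero every detail coefficient smaller than `threshold`, then invert."""
    averages, details = haar_forward(signal)
    kept = [d if abs(d) >= threshold else 0.0 for d in details]
    return haar_inverse(averages, kept)

# The small wiggle in the first pair is smoothed away; the big edge survives.
print(lossy_roundtrip([10, 10.5, 50, 20], threshold=1.0))
```

    Zeroed coefficients are what make the result cheap to store; real codecs then quantise and entropy-code what is left.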

    --
    Mod me down and I will become more powerful than you can possibly imagine!
  125. big deal... by sdnoob · · Score: 1

    >> 100,000,000 bytes of Wikipedia to a record-small 16,481,655

    is there even 16,481,655 bytes of actual useful information in that first 100,000,000 bytes?

    if they edit and condense down to relevant, verifiable, unbiased facts and information, they should easily be able to fit that first 100 million bytes (and then some) on to a 1.44 floppy disk (hell, compression may not even be needed).

  126. Science ... Math by The+Monster · · Score: 1
    Here's what I told Monsterette 2, when we were talking about fields of knowledge:

    The 'social sciences' are really biology.
    Biology is really chemistry.
    Chemistry is really physics.
    Physics is really mathematics.
    And mathematics is really hard.
    It boggles the mind that one can get a 'Bachelor of Science' degree from a university without first passing Calculus.
    --

    [100% ISO 646 Compliant]
    SVM, ERGO MONSTRO.

    1. Re:Science ... Math by Anonymous Coward · · Score: 0

      "are/is really" should be replaced by "aspire/aspires to be", and "hard" should be replaced by "easy".

  127. They are just as close as anyone else ... by Anonymous Coward · · Score: 0

    At beating the Turing test that is. Of course all "serious" AI researchers have long ago started poo-poo'ing the Turing test ... but then again most of them are much greater poseurs than Hutter. However flawed you think his idea is at least he isn't getting success in AI research by moving the goal posts.

  128. Re:Artificial Intelligence? by Baldrson · · Score: 1
    So am I to take it you disagree with Mahoney's statement:

    He gives a formal proof, but it basically says that the only possible distribution of the infinite set of programs (or strings) with nonzero probability is one which favors shorter programs over longer ones. Given any string of length n with probability p > 0, there are an infinite set of strings longer than n, but only a finite number of these can have probability higher than p.
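
    The counting step behind that last sentence is short enough to write out. This is only a sketch of the standard argument, assuming nothing beyond the probabilities summing to one; the symbol S_p is introduced here and is not from Mahoney's text.

```latex
Let $P$ be a probability distribution over the (countably infinite) set of
strings, so that $\sum_x P(x) = 1$. For a fixed $p > 0$ define
\[
  S_p = \{\, x : P(x) > p \,\}.
\]
Every member of $S_p$ contributes more than $p$ to the total, hence for
non-empty $S_p$
\[
  1 \;\ge\; \sum_{x \in S_p} P(x) \;>\; p\,\lvert S_p \rvert
  \quad\Longrightarrow\quad
  \lvert S_p \rvert \;<\; \frac{1}{p}.
\]
So only finitely many strings can have probability above $p$. Since there are
infinitely many strings longer than $n$, all but finitely many of them must
have probability at most $p$.
```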
  129. Re:Artificial Intelligence? by Monsieur_F · · Score: 1

    My very own algorithm, which I just tried on the first 100 thousand lines of the provided Wikipedia file, while only based on word proximity (within the file), returned four results as words similar to "the" (together with the associated scores - yes, the last one is sqrt(2) ) :

    1.10496 my
    1.26491 lunch
    1.29099 multiplying
    1.41421 hovercraft

    This is clearly AI !

    Now I just wonder what "my lunch multiplying hovercraft" may actually mean.

    --
    McCartney fans pay bus tickets. [...] Lennon fans too, with discretion.
  130. Video by LeadSongDog · · Score: 1

    This suggests a more interesting meta-algorithm. Develop code. Compress old movies. Decompress. Diff. Lost anything but grain and scratches? Try again. Turing Test Variation: Can you tell AI's colourized movies from Turner's? Better variations: Compress movies via object understanding and 3D modeling. Regenerate from different perspective viewpoint. Automate facial mood recognition. Synthesize acting from first principles and a script. Improve on 'Citizen Kane'.

    --
    Oh, I'm sorry sir, I thought you were referring to me, Mr. Wensleydale.
  131. To beat that AI threshold.... by Anonymous Coward · · Score: 0

    ...I think they need to call in ZeoSync!

    Surely they've perfected their compression technology by now, huh?

  132. In a word; iphone by myowntrueself · · Score: 1

    but why not use the internet capabilities of your phone?

    I guess with the rise of the iphone it's because it would take far too long to wait for the results of a wikipedia search slowly coming in over the EDGE network?

    Hence the need for a hugely compressed edition of Wikipedia to be stored on the iphone itself. Faster to do the decompression than to wait for the data connection.

    --
    In the free world the media isn't government run; the government is media run.
    1. Re:In a word; iphone by Danny+Rathjens · · Score: 1
  133. Re:Artificial Intelligence? by nasch · · Score: 1

    It sounds like the contest required the resulting file to include the decompression program, so maybe the organizers thought of that too. :-)

  134. Re:Artificial Intelligence? by smallfries · · Score: 1

    So am I to take it you disagree with Mahoney's statement:
    Yes.
    --
    Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
  135. Fixed even more by Snaller · · Score: 1

    42

    --
    If Google really cared they would fix Android Chrome to reflow text, instead of discriminating
  136. Re:Artificial Intelligence? by Anonymous Coward · · Score: 0

    Shannon "guessed" based on his experiments with volunteers to predict letters in a message or passage of text, that the human brain requires about 1.3 bits of information to correctly make a prediction.

    It's believed that a computer would require language skills equivalent to our own in order to compress and decompress text at the Shannon limit. As one of the great cornerstones of AI research is natural language processing, any advance we make in compression theory that brings us closer to the Shannon limit (or beats the Shannon limit) should help us advance our AI language abilities.

    Or at least that's the theory.
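
    For the curious, the guessing experiment turns into numbers roughly like this. The sketch assumes the bound formulas from Shannon's 1951 prediction paper as I understand them: the upper bound is the entropy of the "which guess was right" distribution, and the lower bound is a weighted sum over the drops between successive guess frequencies. The guess counts below are invented, not Shannon's data.

```python
import math

def shannon_bounds(guess_counts):
    """Entropy bounds (bits per character) from a letter-guessing experiment.

    guess_counts[i] is how often the correct letter was found on guess i + 1.
    The counts are assumed to be non-increasing, as they would be for a
    guesser who always tries the likeliest letter first.
    """
    total = sum(guess_counts)
    q = [c / total for c in guess_counts] + [0.0]
    upper = -sum(p * math.log2(p) for p in q if p > 0)
    lower = sum((i + 1) * (q[i] - q[i + 1]) * math.log2(i + 1)
                for i in range(len(q) - 1))
    return lower, upper

# Hypothetical tallies: right on the first guess 79 times out of 100, etc.
invented_counts = [79, 8, 3, 2, 2, 1, 1, 1, 1, 1, 1]
low, high = shannon_bounds(invented_counts)
print(f"between {low:.2f} and {high:.2f} bits per character")
```

    The gap between the two bounds is why the estimate comes as a range rather than a single figure.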

  137. this is all fine and dandy but... by Danzigism · · Score: 1

    when can i put all that on a flash drive and insert it in my brain?

    --
    *plays the Apogee theme song music*
  138. I've got Wikipedia compressed to by mysidia · · Score: 1

    Into a 948 Kb file named /usr/bin/wget.

    To decompress, you need an internet connection and the original URL to the document. You do wget (ORIGINAL URL HERE)

    Alternatively, on Windows "C:\Program Files\Internet Explorer\iexplore.exe"

  139. Does this guy understand approximate numbers? by tgl · · Score: 1

    Let's see ... Shannon estimated human performance as being between 0.6 and 1.3 bits per character. How many significant digits are there in those numbers, do you think? Barely 1, else he'd have given a tighter range. To claim that 1.319 bits/char is "outside" the range of human performance while 1.299 bits/char will be "inside" it betrays an absolutely stupefying lack of understanding of statistics and experimental measurement.

    This thread's various criticisms of the purpose of the Hutter Prize seem valid to me, but even if you accept the premise as sound, the original post is the silliest sort of meaningless hype-creation.
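
    The arithmetic is easy to check from the two byte counts in the story. A small sketch (the variable and function names are mine):

```python
ORIGINAL_BYTES = 100_000_000    # size of the Wikipedia sample quoted in the story
COMPRESSED_BYTES = 16_481_655   # archive plus decompressor, as quoted

def bits_per_character(compressed_bytes, original_bytes):
    """Compressed size in bits divided by the number of input characters."""
    return compressed_bytes * 8 / original_bytes

bpc = bits_per_character(COMPRESSED_BYTES, ORIGINAL_BYTES)
print(f"{bpc:.3f} bits per character")

# Size at which the same sample would sit at exactly 1.3 bits per character.
target_bytes = 13 * ORIGINAL_BYTES // 80
print(f"{COMPRESSED_BYTES - target_bytes:,} bytes above the 1.3 bpc mark")
print(f"{(bpc - 1.3) / 1.3:.1%} above 1.3 bpc")
```

    The gap is on the order of a percent of a figure that was only ever quoted to two digits, which is the point.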

  140. Re:Artificial Intelligence? by nanosquid · · Score: 1

    Are you saying NLP model perplexity has never proven to be very useful in evaluating AI systems?

    Power consumption and price are also very useful in evaluating AI systems; they just have nothing to do with AI. Perplexity is a measure of how well a statistical natural language model matches word frequencies, not of AI.
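
    For reference, perplexity and compressed size are the same measurement on different scales: perplexity is two raised to the average number of bits the model needs per token. A minimal sketch, assuming you already have the probability the model assigned to each token that actually occurred (the function name is mine):

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model gave to the observed tokens."""
    bits_per_token = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** bits_per_token

# A model that is always torn evenly between four options has perplexity 4,
# i.e. it needs 2 bits per token.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

    Whether either number says anything about intelligence is a separate argument; the code only shows they are interchangeable.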

  141. SDHC works with Treo? by Travoltus · · Score: 1

    I thought SDHC cards didn't work in legacy SD systems (though I understand SD cards do work in SDHC enabled systems).

    If SDHC cards work for the Treo then why in heck did I even think about looking at the iPhone? (Sorry, Palm, I thought about looking at it, but did not actually look at it! Honest!)

    --
    --- Grow a pair, liberals... stop letting the Republicans bully you!
    1. Re:SDHC works with Treo? by Cato · · Score: 1

      The Treo 680 is quite recent - shipping since Nov 06, so not such a surprise that it supports SDHC. The Palm OS is long in the tooth, and first Treo was a long time ago, but there are some recent Treo models, including quite a few CDMA PalmOS ones and Windows Mobile ones. UMTS PalmOS Treos don't exist, which is a pain for us Europeans...

  142. Re:Artificial Intelligence? by JohnFluxx · · Score: 1

    If they start at the same point, and have the same input, then they will learn the same thing and come to the same conclusion.

  143. My point is exactly... by hummassa · · Score: 1

    that describing things without quantifying them is _not_ science. As I posted before in this thread, science (at least as it was taught to me in college) is measuring, deriving a formula from your measurements, extrapolating your formula to something else, measuring this something else, rinse, repeat.

    --
    It's better to be the foot on the boot than the face on the pavement. ~~ tkx Kadin2048
    1. Re:My point is exactly... by HiThere · · Score: 1

      It's been said before, and there are prominent scientists that agree with you.

      But I believe that you are wrong. I believe that science must pre-exist before you can merge math with it (unless you are counting primitive concepts such as intuitive geometry as math). Chess players don't need to even know the existence of game theory, even though game theory can, in theory, predict the proper move in any position. Note that in practice game theory is too unwieldy for a human to use in such a way, and I suspect that it's too unwieldy for a computer to use it for such a task, also. That's why all good chess-playing programs come with a library of openings. And they have specialized rules that can be mapped to pieces of game theory that they invoke at characteristic places in the game. E.g., knight forks are generally worth avoiding...except when they aren't.

      Tennis players don't need to know physics, certainly not mathematical physics, except at an intuitive level.

      Similarly, science and scientists existed before math was mapped onto science...except at an intuitive level. Currently that's about the only kind of science that's taught and recognized, but it was not always so. There are many reasons why the mathematical versions of the various sciences came to dominate. One in particular is that mathematics represents generalized knowledge of how patterns interact without needing any substance. This means that when you can map some part of knowledge or action onto math, you can easily make a large number of predictions that would not have been easy to create. You must then CHECK those predictions, to discover whether your mapping has been valid. This interplay between separated theory and experiment has largely replaced an older and less powerful interplay between theory and experiment. The older theory was based around physical models rather than mathematical models (and is, I believe, still the dominant informal mode). The benefit of the older theoretical models is that our brains are adapted to manipulating physical models, and are at best clumsy at manipulating mathematical models, so if you want to be at all creative you first generate your models in the older framework, and only then translate them into math. As I recall Einstein spent YEARS searching for the right mathematical model for his mental pictures before he was introduced to tensor calculus. But he was already making predictions for how the experiments would turn out before he could make them sufficiently precise to show that the new model was an improvement on the old. (I read the book that contained that information decades ago, and I no longer recall its title. The same information is probably elsewhere if you go searching.)

      I assert that it is the interplay between theory and experiment that determines science. The mathematics is an add-on, and one that is un-natural to the human thought process. It is, however, an add-on of such power that once it occurs it quickly dominates...IF enough of the people you are trying to communicate to have the appropriate background knowledge of math.

      --

      I think we've pushed this "anyone can grow up to be president" thing too far.
  144. Erratum: 1.3 May Be Too High by Baldrson · · Score: 2, Informative
    Matt Mahoney has communicated his concern to me that the 1.3 bits per character entropy measured by Shannon is likely a smaller number with the enwik8 corpus due to regularities from embedded markup. He has already compressed enwik9 (1,000,000,000 bytes) to less than 1.3 bits per character and his analysis shows that this is largely due to a large section of data tables present in that larger sample -- which entails a large amount of embedded markup. While the entropy of enwik8 is unlikely to be as low as enwik9, this difference does evidence the lower entropy of embedded markup.

    Until Shannon-type experiments, involving humans doing next-character predictions of enwik8, are performed, the bounds of enwik8's entropy range must remain unknown but are likely lower than 0.6 to 1.3. As such an experiment would be expensive, it is going to be difficult to say with any simple bpc measure when the Hutter Prize is breaching the threshold of AI. What the Hutter Prize's bpc metric gives us, however, is a clear measure of progress.

    My apologies to the other members of the Hutter Prize Committee and the /. community for this error.

    PS: Another area of concern raised by Mahoney is that enwik8, at 10^8 characters, is only as much verbal information as a 2 or 3 year old has encountered. So although it is sufficient to demonstrate AI capabilities well beyond the current state of the art, his preference is for a much larger contest with fewer resource restrictions, focusing on the 10^9 character enwik9, which is more likely to produce the AI equivalent of an adult with encyclopedic knowledge.

  145. Re:Artificial Intelligence? by Synonymous+Dastard · · Score: 1

    Mine is better: the he in lt and a of this was an to other is amp quot his or their time from its name for most by many have more on any way s they be at these years as new with 1 de alexander can used into it http has no which had him are very well after several people when 2 000 2005 august 2006 but american writer d 2004 isbn 0 million 3 di austu 5 april augustus also known that human history would become one b gt x were called them such who made there been considered some countries I guess it can be used to generate spam.

  146. Re:Artificial Intelligence? by Synonymous+Dastard · · Score: 1

    Wow, this is really bad formatted.

    I must be new here.

  147. Re:Artificial Intelligence? by Estanislao+Martínez · · Score: 1

    You're assuming that humans aren't machines - in this context, that's actually a matter of faith.

    Yup, just like I assume that airplanes fly, but submarines don't swim. There is no substantive issue of whether a machine can think. There is just an ideological and cosmological dispute about what people are, where too many participants like to cast their positions as science.

    Human intelligence may be a result of "machine processes" - ie, direct physical processes.

    You're equivocating over the term "machine." The range of meanings it has in our language is much richer than that; you can only continue this argument by arbitrarily focusing on some of them. To do this while claiming the authority of science is no more and no less than what I described above: framing as a scientific hypothesis a position in an ideological dispute.

    If we assume that humans are intelligent - otherwise, the term seems sort of useless [...]

    Actually, I don't allow this assumption to be taken for granted. Not 100 years ago, it was still common among educated people and scientists in the Western world to believe that non-"white" people, women and children were not in fact "intelligent."

    Usage of the word has changed since (without the sort of belief in question dying out, as evidenced by The Bell Curve; it's just reformulated in other terms). Strong AI folks want the general usage to change again. The problem is that (a) nobody appointed them as general arbiters of our culture, (b) they like to disguise their preference for a certain way of using the term as scientific conclusions.

    (And I believe such a definition would probably be counter-productive when it comes to the matter of defining intelligence.)

    You can go ahead and "define" intelligence as much as you like. What do you think that accomplishes? You don't have a power to decide how other people are going to use the term "intelligence," nor the term "machine."

  148. Re:Artificial Intelligence? by Eivind+Eklund · · Score: 1
    No, I can't force how they use them. However, I can say that your claim that "machines by definition cannot be intelligent" will most likely, if you use a definition that allows humans to be intelligent, fall. Also, notice for yourself: You refused to answer the question. You refused to do the definitions.

    Ask yourself: Is this fair towards yourself and knowledge? Can you be sure that you are right unless you actually go in and look at the question *with hard edges*? Do you feel reasonable when you refuse to look at the question in depth? And how would your thoughts be if you just dropped the assertion and looked carefully at the actual behaviour here? Might it not be just as true that intelligence may occur in machines, if you just try this on properly, losing that single assumption?

    As for "arbiter of culture", the culture we have uses intelligence as a term to refer to humans. I'm referring to that, and I say that a definition of "intelligence" that disagrees with this is, in my estimation, unlikely to match anything, and it is unlikely to be useful. You are the one that tries to go away from the common culture in this usage. Are you brave enough to see that you're projecting? Or will you just be angry, because somebody disagrees with you?

    Eivind.

    --
    Doubting the existence of evolution is like doubting the existence of China: It just shows that you're uninformed.
  149. Re:Artificial Intelligence? by Estanislao+Martínez · · Score: 1

    However, I can say that your claim that "machines by definition cannot be intelligent" will most likely, if you use a definition that allows humans to be intelligent, fall. Also, notice for yourself: You refused to answer the question. You refused to do the definitions.

    Of course I refused to define the terms. Terms from ordinary language, like "think," "intelligent" and "machine," do not derive their meaning from definitions. They derive their meaning from the role they play in various interactions in our culture.

    If you ask an ordinary language question like "Can a machine think?", you can't claim to have answered that question if you provide a technical answer that relies on technical definitions of the terms.

    Of course I don't think that something supernatural goes on in people's heads. But my point is that the whole debate about whether a machine can think isn't an empirical one about whether you can build a machine that can do anything that a person can. It's a cosmological and cultural debate about what kind of things people and machines are, and what sort of relationship they stand in to each other.

    Ask yourself: Is this fair towards yourself and knowledge? Can you be sure that you are right unless you actually go in and look at the question *with hard edges*? Do you feel reasonable when you refuse to look at the question in depth? And how would your thoughts be if you just dropped the assertion and looked carefully at the actual behaviour here? Might it not be just as true that intelligence may occur in machines, if you just try this on properly, losing that single assumption?

    There's two problems here:

    1. Once you've provided a definition of the terms, you've changed the question.
    2. Understanding requires abandoning old ways of looking at things. Just because one particular question motivated a course of investigation, that doesn't mean that the result of the investigation must be an answer to the question that prompted it. The research may well outgrow the original question. (My favorite example: the whole "What is Pluto?" controversy is stupid, because all it shows is that astronomy has outgrown the distinction between "star" and "planet," inherited from antiquity.)

    As for "arbiter of culture", the culture we have uses intelligence as a term to refer to humans. I'm referring to that, and I say that a definition of "intelligence" that disagrees with this is, in my estimation, unlikely to match anything, and it is unlikely to be useful. You are the one that tries to go away from the common culture in this usage.

    Yes, but the point was that how our contemporary culture uses the term is a historically contingent fact. "Humans are intelligent" is not an obvious, timeless truth, regardless of whatever attitude you and I may take toward it.

    Another thing that the AI discussion misses: the moral dimension of "intelligence." 20th century psychology has framed the term "intelligence" in terms of cognition and cognitive skills. However, another traditional component of the concept is moral agency; being "intelligent" means that you are entitled to full enjoyment of the rights that befit your membership in your community, with all the concomitant responsibilities. (It is no accident that the historical periods where people denied that some ethnicities, women and children were "intelligent" also denied some corresponding rights to them, like the right to vote, or to own property; or likewise, exempted them from certain responsibilities, by holding them not to be criminally or civilly liable for some acts.)

    This sort of thing, if you ask me, is way more important than the silly question whether a "machine" can "think" (framed in terms of cognitive abilities). Why so? Because it gets to what I think is the real, cosmological heart of the matter: what are people,

  150. Re:Artificial Intelligence? by Eivind+Eklund · · Score: 1
    Those are interesting and lucid points.

    While I do not fully agree - I believe we'll keep using intelligence for cognitive abilities, and distinguishing "right" from "wrong" inside a particular framework of "right" and "wrong" is just one cognitive ability among many - it was definite food for thought. Thanks.

    Eivind.

    --
    Doubting the existence of evolution is like doubting the existence of China: It just shows that you're uninformed.