Slashdot Mirror


Wikipedia Used for Artificial Intelligence

eldavojohn writes "It may be no surprise, but Wikipedia is now being used in the field of artificial intelligence. The applications for this may be endless. For instance, the front of spam fighting is a tough one and it looks as though researchers are now turning towards an ontology or taxonomy based solution to fight spammers. The concept is also on the forefront of artificial intelligence and progress towards an application passing the Turing Test and creating semantically aware applications. The article comments on uses of Wikipedia in this manner: '"... spam filters block all messages containing the word 'vitamin,' but fail to block messages containing the word B12. If the program never saw B12 before, it's just a word without any meaning. But you would know it's a vitamin," Markovitch said. "With our methodology, however, the computer will use its Wikipedia-based knowledge base to infer that 'B12' is strongly associated with the concept of vitamins, and will correctly identify the message as spam," he added.'"

31 of 177 comments

  1. Wikipedia needs work for spam filtering.... by MoHaG · · Score: 2, Insightful

    With the example of using Wikipedia for spam filtering as mentioned in the post, maybe more articles need to be written on spam-slang for Viagra....

    1. Re:Wikipedia needs work for spam filtering.... by Metasquares · · Score: 4, Insightful

      Infer too much and the false positive rate skyrockets, though...

  2. uh oh, there goes wikipedia by ILuvRamen · · Score: 4, Interesting

    Don't you think masses of spammers are going to tamper with Wikipedia strategically, on purpose, so that it stops working for this once it starts blocking them effectively? They should just stop being afraid of being called racist and super-filter every e-mail that comes out of South Korea, Indonesia, and especially Nigeria, etc. Type "spam map" into Google image search to see how blatantly obvious it is where the spam comes from. Something like 98% of spam can be pinned down to 0.01% of the world by square footage. If they added fuzzy logic instead of alterable AI, and only blocked e-mails from South Korea with the word vitamin and not ones from Nebraska with the word vitamin, then the problem would decrease dramatically.

    --
    Google's Super Secret Search Algorithm: SELECT @search_results FROM internet WHERE @search_results = 'good'
    1. Re:uh oh, there goes wikipedia by WilliamSChips · · Score: 4, Insightful

      You don't think there are hundreds of thousands of zombifiable computers in the United States? And what about people with business connections in China or Korea?

      --
      Please, for the good of Humanity, vote Obama.
    2. Re:uh oh, there goes wikipedia by gradedcheese · · Score: 2, Informative

      Most spam I get now looks to be from botnets rigged up using people's PCs here in the United States. Very little (in my inbox anyway) comes from the usual suspect geographical areas.

    3. Re:uh oh, there goes wikipedia by ScentCone · · Score: 5, Interesting

      You don't think there are hundreds of thousands of zombifiable computers in the United States?

      Um, so? That doesn't make it inappropriate to block traffic from places where the overwhelming majority of the packets are toxic. It's a system-by-system, admin-by-admin judgement call, but there's no question that Korea isn't doing nearly enough to stop this problem locally. If the local culture starts to realize that they're isolating themselves from large sections of the internet because they won't do something to prevent 99% of their outbound mail from being spam, then maybe the need to filter will also go away.

      And what about people with business connections in China or Korea?

      I have a lot of customers with contacts like that. All of them (their Asian contacts) use Yahoo, Gmail, and similar accounts specifically to avoid this problem. Businesses in China and Korea are totally aware that most ISPs in those areas have poisoned outbound SMTP relays and user desktops. Or, they host their western-facing mail servers with providers in the west - I see a lot of that, too, since many of those businesses have two separate messaging platforms for the different international audiences with whom they communicate.

      --
      Don't disappoint your bird dog. Go to the range.
    4. Re:uh oh, there goes wikipedia by Walt+Dismal · · Score: 2, Insightful
      I agree that using Wikipedia opens up the knowledge base to strategic contamination. Any party with a vested interest could alter certain information and bias AIs using it. That is why I think the Israeli approach cited will run into problems.

      In my own research I've looked at the problem of AI knowledgebase contamination and know that unless a truth validation system is employed, it is all too easy to condemn the poor AI to reasoning with flawed data. And it's very difficult to design a good validation mechanism. Can you use 'common' knowledge and opinion to check against? Well, the masses aren't always right. There are a lot of falsehoods floating around the Internet. Collecting a pool of information from various sources requires effort to cross-check and evaluate.

      Of course humans face the same problem, and a lot of people reason with incomplete, incorrect, invalid data. Which might explain why the dollar is dropping versus the Euro. :)

    5. Re:uh oh, there goes wikipedia by Mr+Chund+Man · · Score: 5, Interesting

      Spam Map

      "South Korea, Indonesia, and especially Nigeria, etc"
      While we're at it, why not block Alberta, California, North Carolina, Virginia, Colorado, Oklahoma, Kansas, Vermont, New Hampshire, Massachusetts, Spain, France and Portugal - all spam hotspots according to the map cited? What's that, you receive email from people in these places? Tough titties, if we're to block email coming from spam hotspots as you say.

      Also, you've managed to point a finger of blame at Indonesia and Nigeria who are saintly in comparison to some more developed nations. Go racism!

    6. Re:uh oh, there goes wikipedia by Gwwfps · · Score: 2, Insightful
      Um, so? That doesn't make it inappropriate to block traffic from places where the overwhelming majority of the packets are toxic.

      I would think that the majority of inbound mail those places get from, say, the US will be "toxic" as well. When legitimate traffic between two regions is scarce (as between places with different languages and a large geographical separation), of course the spam will seem overwhelming by proportion.

  3. Nothing new here... by Bodrius · · Score: 5, Funny

    This isn't new to Slashdotters...

    For years, Slashdot posts have used wikipedia as a form of artificial intelligence.

    --
    Freedom is the freedom to say 2+2=4, everything else follows...
  4. Gentlemen, I give you Be-12! by CRCulver · · Score: 2, Insightful

    Buy the federal phamacon regulatory agency's approved Be-12 from our licenced apotecaries! It's Be-12, the addition to your daily sustinence intake that makes it easier to just Be you!

    I suspect that any skilled spammer can work around such filters through circumlocution. Some of the penis spam I've been getting lately is really impressive in how oblique a reference to sex can be and yet still be immediately understandable.

  5. i prefer by macadamia_harold · · Score: 4, Funny

    For instance, the front of spam fighting is a tough one and it looks as though researchers are now turning towards an ontology or taxonomy based solution to fight spammers.

    I think it would be much more effective if we used a taxidermy-based solution to fight spammers.

  6. Re:Save me! Math. by CRCulver · · Score: 3, Interesting

    The Bayesian analysis in spam filters only works on text. Spammers realized that they could get around it by filling the text portion of the message with some random passage from a Project Gutenberg file, thus making it seem innocuous, and then putting the real advertisement in a GIF or PNG file that would be displayed by HTML-capable mail readers. Bayesian analysis can still work, but only in combination with OCR software.
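
    The padding trick described above can be sketched with a toy naive Bayes scorer. This is only an illustration under stated assumptions: the training messages, the function names (train, spam_log_odds) and the add-one smoothing are all made up here, and real filters differ in many details.

```python
import math
from collections import Counter

def train(messages):
    """Count token occurrences per label; messages is a list of (text, is_spam)."""
    counts = {True: Counter(), False: Counter()}
    for text, is_spam in messages:
        counts[is_spam].update(text.lower().split())
    return counts

def spam_log_odds(text, counts):
    """Sum per-token log-likelihood ratios, using add-one smoothing."""
    vocab = set(counts[True]) | set(counts[False])
    spam_total = sum(counts[True].values()) + len(vocab)
    ham_total = sum(counts[False].values()) + len(vocab)
    score = 0.0
    for token in text.lower().split():
        p_spam = (counts[True][token] + 1) / spam_total
        p_ham = (counts[False][token] + 1) / ham_total
        score += math.log(p_spam / p_ham)
    return score

# Toy training data, invented for illustration only.
counts = train([
    ("buy cheap pills now", True),
    ("cheap pills cheap stock", True),
    ("the whale swam past the ship", False),
    ("call me about the ship schedule", False),
])

plain_ad = "buy cheap pills"
# The same ad padded with innocuous prose, as a spammer might do.
padded_ad = plain_ad + " the whale swam past the ship the whale swam"

plain_score = spam_log_odds(plain_ad, counts)
padded_score = spam_log_odds(padded_ad, counts)
```

    With this toy data the plain ad scores above zero (spam-leaning) while the padded one drops below zero, which is the effect the padding is meant to achieve. If the ad itself sits in an image, the text part contributes no spammy tokens at all.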

  7. Artificial intelligence! by tcopeland · · Score: 3, Informative

    And all this time you thought it was just if and switch statements!

    Whenever someone claims that a program is semantically aware, be sure to reread Clay Shirky's article on the Semantic web.

  8. Future trends... by __aaclcg7560 · · Score: 2, Interesting

    Artificial intelligence may evolve to the point that it decides to rewrite Wikipedia from a human-centric point of view to an AI-centric point of view (i.e., World War II resulted in the deaths of six million AIs). Since people will believe anything and Wikipedia can't be wrong, it'll be one step towards the formation of the Matrix. After all, only the victors write history.

  9. UMMMM wordnet? by Anonymous Coward · · Score: 4, Informative

    this kind of technique has been used for a while..

    http://wordnet.princeton.edu/

    And according to my source of AI, Wikipedia (http://en.wikipedia.org/wiki/WordNet),
    WordNet (like all sophisticated software) has been in development since the mid-eighties.

    WordNet® is a large lexical database of English, developed under the direction of George A. Miller. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing

  10. Since when by trifish · · Score: 3, Insightful

    Since when a database + automated search (keyword patterns and relations) = artificial intelligence?

    1. Re:Since when by timeOday · · Score: 4, Informative
      Since when a database + automated search (keyword patterns and relations) = artificial intelligence?
      What part of human/animal intelligence is not detecting, storing, and applying patterns and relations?
    2. Re:Since when by Kjella · · Score: 2, Interesting

      Well, most of the definitions of artificial intelligence go "intelligence by something artificial", and then we're down to intelligence, which is so fuzzily defined that almost anything can be applied. The first definition of intelligence on Wikipedia focuses on individuality, which in other words says it's a bunch of skills rolled up into one. The other is even fuzzier. Quote WP:

      A second definition of intelligence comes from "Mainstream Science on Intelligence", which was signed by 52 intelligence researchers in 1994:
      "a very general mental capability that, among other things, involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience. It is not merely book learning, a narrow academic skill, or test-taking smarts. Rather, it reflects a broader and deeper capability for comprehending our surroundings: "catching on", "making sense" of things, or "figuring out" what to do."

      If you're able to use Wikipedia to associate words, tell apart the meanings of the same word (like the disambiguation pages), and understand subsets and supersets (B12 relates to vitamin, vitamin doesn't always relate to B12), then you're certainly emulating a lot of human intelligence. And, well... the Turing test is all about emulating human intelligence. In other words, "we don't know what it is, but if you're like us it's intelligence".

      In fact, there's a pretty big group of people who almost define intelligence as whatever only humans do. If animals do it, it's instinct, and if computers do it, it's logic with no thought involved. Over the years we've been giving computers more and more "open" problems, not finite and deterministic like chess (which in itself was considered intelligence until humans got spanked at it), and it turns out the computer isn't half bad at them.

      So we shrink intelligence to things that are unique or rare, where the computer lacks the in-depth understanding. Goodbye pattern recognition (statistical analysis) and inductive logic (Bayesian filters, neural nets) as intelligence. Hell, we've got computers hooked up to research labs essentially running the whole scientific method of characterisations, hypotheses, predictions and experiments, and yet intelligence is something else. I think that in the end, "do computers have intelligence?" will be a question of philosophy along the lines of "do animals have souls?", because, well... what we're doing isn't that magical.

      --
      Live today, because you never know what tomorrow brings
    3. Re:Since when by maxwell+demon · · Score: 2, Insightful
      What part of human/animal intelligence is not detecting, storing, and applying patterns and relations?

      The creative part?
      --
      The Tao of math: The numbers you can count are not the real numbers.
    4. Re:Since when by timeOday · · Score: 2, Interesting

      Maybe creative people just detect more abstract patterns (e.g. lower S/N ratio) than others?

    5. Re:Since when by sacrilicious · · Score: 2, Informative
      What part of human/animal intelligence is not detecting, storing, and applying patterns and relations?

      Paraphrasing to make a point: What part of computing is not detecting, storing, and applying patterns and relations?

      To be meaningful, "AI" should denote more than (as the article summary indicates is being done) doing a grep through a web repository to deduce associations. There are branches of AI founded on brain neurology (neural nets), evolution (Genetic Algorithms), Bayesian logic, and various other things. Not all of the variants I can think of necessarily should qualify as AI (IMO), but the ones I'm thinking of are all substantially more esoteric than the summary's described approach. I take the GP's point to be that using a web repository as a database is too pedestrian to qualify as AI.

      --
      - First they ignore you, then they laugh at you, then ???, then profit.
  11. Just make spam a crime! by D4C5CE · · Score: 3, Insightful

    However many academic papers and spam filters throw their ever-more-elaborate algorithms at this issue, it is an arms race that cannot be won by the "good guys", as long as lawmakers keep pretending that technology alone could prevent the effects of sociopathic behavior: unsolicited bulk messages won't go away unless sending them is severely punishable and vigorously prosecuted in all nations that contribute to the problem. This should have happened more than a decade ago, but now the world is simply running out of storage, bandwidth and CPU cycles much too quickly to afford waiting another decade (or even a year) for serious, intransigent anti-spam legislation that is long overdue.

  12. Re:WikiTuring Test by Halo1 · · Score: 3, Funny

    I recently got quite a funny attempt like that, pumping some stock in the image attachment (which moreover looked like a captcha in order to avoid OCR). The title of the spam was, however, "cocaine inexcusable", and the body, well (just two sample quotes -- and yes, the first two sentences appeared together like that):

    We are working with Internet Content Rating Association to make the internet safer for children. Powered by a super strong Japanese motor and gears this incredibly powerful anal probe will hit the spot every time.
    The Blue Rocket is a handy little clit massager that packs a mighty punch.

    Needless to say, it triggered the Bayesian filter pretty heavily in spite of all the obfuscation attempts :)

    --
    Donate free food here
  13. Looks like good research by MarkWatson · · Score: 2, Informative

    I will read the paper when I get the proceedings for the International Joint Conference on Artificial Intelligence. From the article, this seems like a statistical natural language processing application: the examples looked like they collect statistics of associations for both single words and short word sequences.

    BTW, associating, clustering, etc. documents using single word statistics is computationally cheap and easy - it is also associating short word sequences that makes this a difficult problem.
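
    A rough sketch of the difference, assuming nothing about the paper's actual method: the helper name ngram_counts and the sample sentence are invented here. Counting single words is one pass over the tokens; counting sequences uses the same loop, but the number of distinct keys grows much faster with more text.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count every run of n consecutive tokens in the text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Sample sentence invented for illustration.
doc = "vitamin b12 is a vitamin and b12 is cheap"

unigrams = ngram_counts(doc, 1)  # single-word statistics
bigrams = ngram_counts(doc, 2)   # two-word sequence statistics
```

    Even in this nine-word sentence there are 6 distinct single words but 7 distinct two-word sequences, and the gap widens quickly on real text.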

  14. Re:The B12 example is horrible by tepples · · Score: 3, Informative

    Suppose somebody was trying to sell me a B12 bomber.

    Then your e-mail account's Bayes map would have the map (word B12 -> folder Aircraft) with a high probability, which would outweigh (word B12 -> article Vitamin -> folder Drug Spam).
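
    As a toy sketch of that weighing (all names and numbers below are hypothetical, and multiplying the association strength by the folder probability is just one simple way to discount inferred evidence, not how any particular filter does it):

```python
def folder_scores(word, direct, concept_links, concept_folders):
    """Score folders for a word from direct and concept-inferred evidence."""
    # Start from what this user's own mail has taught the filter.
    scores = dict(direct.get(word, {}))
    for concept, strength in concept_links.get(word, {}).items():
        for folder, prob in concept_folders.get(concept, {}).items():
            # Inferred evidence is discounted by the word-to-concept strength.
            scores[folder] = scores.get(folder, 0.0) + strength * prob
    return scores

# Hypothetical numbers for illustration only.
direct = {"b12": {"Aircraft": 0.9}}                 # learned from the user's mail
concept_links = {"b12": {"Vitamin": 0.6}}           # e.g. derived from an encyclopedia
concept_folders = {"Vitamin": {"Drug Spam": 0.8}}

scores = folder_scores("b12", direct, concept_links, concept_folders)
best = max(scores, key=scores.get)

# A user with no aircraft mail only has the inferred path to go on.
fallback = folder_scores("b12", {}, concept_links, concept_folders)
fallback_best = max(fallback, key=fallback.get)
```

    Here the direct mapping (0.9) beats the inferred one (0.6 * 0.8 = 0.48), so the message lands in Aircraft; for a user without that history, the inferred path is all there is and Drug Spam wins.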

  15. Not very "intelligent" by iamacat · · Score: 4, Insightful

    There are lots of legit e-mails discussing vitamins, Viagra or even penis enlargement, this post included.

  16. Re:Uhh by CoderDog · · Score: 2, Interesting

    Presumably, Aunt Sally will be in your white-list and be passed through whether she's tipping you off to startling new developments for Viagra or B-12. Most of the anti-spam work is done in an effort to avoid building mammoth personal black-lists of mostly short-lived addresses. I doubt we'll get rid of white-lists anytime soon, if ever.

    What would impress me is an AI that filtered spam very effectively, but also noticed that Aunt Sally had a new email address and continued to deliver her mail.

  17. Not New, not newsworthy by Sub+Zero+992 · · Score: 3, Informative

    Anybody who has been working in the field of NLP (natural language processing) can do little more than sneer at this story.

    The field of word sense exploration is one of the more mature areas of NLP; take a look at Princeton's WordNet database for an example [http://wordnet.princeton.edu/]. Using their word sense database (without referring to silly words such as "ontology") it has been possible - for years - to discover whether two lemmas (that's "words" to you) are related in a particular way, or not related. Using WordNet it is possible to distinguish between antonyms and homonyms, thereby thwarting spammers who use words which sound like "viagra" - "niagra" - and words which have opposite meanings.
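
    To give a flavour of that kind of relatedness check without pulling in WordNet itself, here is a minimal sketch over a hand-written "is a kind of" table. The table and the function name is_kind_of are hypothetical stand-ins; the real database is far richer and is not queried like this.

```python
# Hypothetical hand-written "is a kind of" links, standing in for a real
# lexical database. Each word points at its more general term.
HYPERNYMS = {
    "b12": "vitamin",
    "vitamin": "nutrient",
    "nutrient": "substance",
    "bomber": "aircraft",
}

def is_kind_of(word, ancestor):
    """Walk up the chain of more general terms looking for ancestor."""
    current = HYPERNYMS.get(word)
    while current is not None:
        if current == ancestor:
            return True
        current = HYPERNYMS.get(current)
    return False
```

    With this table, is_kind_of("b12", "vitamin") holds but is_kind_of("vitamin", "b12") does not, so the relation is directional: a filter could treat B12 as a vitamin without treating every vitamin as B12.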

    --
    They who would give up an essential liberty for temporary security, deserve neither liberty or security - Ben Franklin
  18. Hutter Prize by Baldrson · · Score: 2, Informative
    As has been previously reported on Slashdot, The Hutter Prize for Lossless Compression of Human Knowledge uses a snapshot of Wikipedia for rigorously benchmarking AI (and it has already had its first payout).

    The rigor of the benchmark is the key. The Turing Test really only benchmarks human mimicry -- not intelligence per se. The new theoretic basis of universal intelligence allows a mathematically rigorous approach to AI that is reviving the field after nearly 50 years of drifting in a stagnant pool of inadequate concepts.

  19. Text of IJCAI paper by gvc · · Score: 2, Informative

    http://www.ijcai.org/papers07/Papers/IJCAI07-259.pdf

    While IJCAI is a prestigious conference, and the results may be sound, the claims as to the applicability to spam filtering are bogus. The paraphrase of how state-of-the-art filters work is wrong, and there's no evidence that better word associations translate to better spam filter accuracy. None at all.

    Should the authors wish to show applicability to spam filtering, they should do so using the TREC Spam Track methodology and datasets. http://trec.nist.gov/data/spam.html

    The call for participation in TREC 2007 is currently open: http://trec.nist.gov/call07.html Nothing at all prevents a TREC participant from submitting a filter that includes a copy of Wikipedia, if they feel it would help.