Wikipedia Used for Artificial Intelligence
eldavojohn writes "It may be no surprise but Wikipedia is now being used in the field of artificial intelligence. The applications for this may be endless. For instance, the front of spam fighting is a tough one and it looks as though researchers are now turning towards an ontology or taxonomy based solution to fight spammers. The concept is also on the forefront of artificial intelligence and progress towards an application passing the Turing Test and creating semantically aware applications. The article comments on uses of Wikipedia in this manner: '"... spam filters block all messages containing the word 'vitamin,' but fail to block messages containing the word B12. If the program never saw B12 before, it's just a word without any meaning. But you would know it's a vitamin," Markovitch said. "With our methodology, however, the computer will use its Wikipedia-based knowledge base to infer that 'B12' is strongly associated with the concept of vitamins, and will correctly identify the message as spam," he added.'"
Won't this pose a problem for today's semantically challenged "geek"?
With the example of using Wikipedia for spam filtering as mentioned in the post, maybe more articles need to be written on spam-slang for Viagra....
I wouldn't be surprised if Mossad's been using this for a while.
Don't you think masses of spammers are going to screw with Wikipedia strategically, on purpose, so that it stops working properly once it starts to block them well? They should just stop being afraid of being called racist and super-filter every e-mail that comes out of South Korea, Indonesia, and especially Nigeria, etc. Type "spam map" into Google image search to see how blatantly obvious it is where the spam comes from. Something like 98% of spam can be pinned down to 0.01% of the world by square footage. If they added fuzzy logic instead of alterable AI, and only blocked e-mails from South Korea with the word vitamin while leaving alone ones from Nebraska with the word vitamin, then the problem would be decreased dramatically.
Google's Super Secret Search Algorithm: SELECT @search_results FROM internet WHERE @search_results = 'good'
As explained above, it's entirely too simple and will flag way too many false positives. For example, all the emails my dad sent me last week about vitamins would have been sent directly to my spam box... maybe I'm missing something here.
This isn't new to Slashdotters...
For years, Slashdot posts have used wikipedia as a form of artificial intelligence.
Freedom is the freedom to say 2+2=4, everything else follows...
Buy the federal phamacon regulatory agency's approved Be-12 from our licenced apotecaries! It's Be-12, the addition to your daily sustinence intake that makes it easier to just Be you!
I suspect that any skilled spammer can work around such filters through circumlocution. Some of the penis spam I've been getting lately is really impressive in how oblique a reference to sex can be and yet still be immediately understandable.
Suppose somebody was trying to sell me a B12 bomber.
That wouldn't be spam to me, but an exclusive offer that would cause me to act now.
For instance, the front of spam fighting is a tough one and it looks as though researchers are now turning towards an ontology or taxonomy based solution to fight spammers.
I think it would be much more effective if we used a taxidermy-based solution to fight spammers.
Push Button, Receive Bacon
"The applications for this may be endless. For instance, the front of spam fighting is a tough one and it looks as though researchers are now turning towards an ontology or taxonomy based solution to fight spammers. "
So what happened to bayesian filters as our saviour?
All the trolls and spammers on W.P. will F-up this AI, and Skynet will be trolling and spamming mankind forever.
Table-ized A.I.
However, since Wikipedia is not the model of truth hopefully they are going to perform crosschecks with other sources? Or maybe they will just use peer reviewed pages or "feature articles"? Still, cross-checks with additional online encyclopedias would be a good idea.
It's not the words that the spam filter can't recognize that let spam get through, it's the increasing use of image spam. OCR and existing filters would do more to solve spam than would wiki-AI intelligent filters.
Of course, the minute anti-spam software/services use OCR is the minute that spam images start looking like captchas.
Two wrongs don't make a right, but three lefts do.
And all this time you thought it was just if and switch statements!
Whenever someone claims that a program is semantically aware, be sure to reread Clay Shirky's article on the Semantic web.
The Army reading list
Artificial Intelligence may evolve to the point that it may decide to rewrite Wikipedia from a human-centric point of view to an AI-centric point of view (i.e., World War II resulted in the deaths of six million AIs). Since people will believe anything and Wikipedia can't be wrong, it'll be one step towards the formation of the Matrix. After all, only the victors write history.
This might be an interesting area of research, but I for one do not want my ISP deciding what is legitimate email. E.g., what if I WANT to email someone about vitamins??? I do not want to have the uncertainty that my email will be deleted as spam. That would destroy the usefulness of email as a major form of business and personal communication. If I configure a SPAM filter, or the filter is "advisory", that is fine. But using AI to decide and delete is not advisable IMHO. Going down the AI path seems to me like someone is going to start assuming that an AI filter can be smart enough to make guesses that I do not specifically configure. I do not want that. The real reason for SPAM is that email systems do not verify the sender. Sender verification is essential so that senders who spam can be blacklisted. Another problem is that people have global email addresses. What is needed is a unique address for each pair of sender and recipient. That way, if you give out your email address, it is unique to both you and the person you give it to (the person who you "invite" to contact you). This is similar to the concept of a "disposable" email address, except that there is no reason that it has to be disposable: it can be permanent. In effect, it creates a permanent way for an individual to reach you. E.g., you can create an address for person A to reach you as 'personA@mydomain.com'. If your email client then requires such unique sender/receiver addresses for all invited senders and requires sender verification for uninvited senders you have a very effective total anti-spam system.
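For what it's worth, the per-correspondent address idea can be sketched in a few lines of Python. This is only an illustration under assumptions of mine, not anything from the article: the secret key, the hashing scheme and the function names are all made up, and it presumes the sender identity has already been verified by some other means.

```python
import hashlib
import hmac

SECRET = b"change-me"      # hypothetical per-user secret, not from the post
DOMAIN = "mydomain.com"    # domain borrowed from the example address above


def alias_for(sender: str) -> str:
    """Derive the one address that `sender` has been invited to use."""
    tag = hmac.new(SECRET, sender.lower().encode(), hashlib.sha256).hexdigest()[:10]
    return f"{tag}@{DOMAIN}"


def accept(sender: str, recipient: str) -> bool:
    """Accept mail only if it arrives at the alias issued to this sender."""
    expected = alias_for(sender).encode()
    return hmac.compare_digest(expected, recipient.lower().encode())
```

If an alias leaks to a spammer, mail sent to it from any other sender fails the check, and you know which correspondent the leak came from.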
B12, which is a vitamin, which is also known to increase your health, which your Aunt Sally regularly sends you messages about. So great, all messages from Aunt Sally are now blocked.
"You had this look that of an angel, it was such a bad disguise" --Dishwalla
this kind of technique has been used for a while..
http://wordnet.princeton.edu/
and according to my source of AI, wikipedia http://en.wikipedia.org/wiki/WordNet
(like all sophisticated software) has been in development since the mid eighties..
WordNet® is a large lexical database of English, developed under the direction of George A. Miller. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing
Since when does a database + automated search (keyword patterns and relations) = artificial intelligence?
However many academic papers and spam filters throw their ever-more-elaborate algorithms at this issue, it is an arms race that cannot be won by the "good guys", as long as lawmakers keep pretending that technology alone could prevent the effects of sociopathic behavior: unsolicited bulk messages won't go away unless sending them is severely punishable and vigorously prosecuted in all nations that contribute to the problem. This should have happened more than a decade ago, but now the world is simply running out of storage, bandwidth and CPU cycles much too quickly to afford waiting another decade (or even a year) for serious, intransigent anti-spam legislation that is long overdue.
All these word relation AIs make me laugh. We could have real AI if you wanted to put effort into it.
God spoke to me.
Do they substitute numbers for letters in their filtering?
I don't see how this is getting us anywhere except moving closer to having a spam filter that just returns "true" to anything that isn't white-listed.
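On the digit-for-letter question: no idea what the researchers do, but the usual trick is a normalisation pass before matching. A rough Python sketch, with a substitution table that is purely my own guess:

```python
# Hypothetical substitution table; real filters would need a far larger one.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})


def normalise(word: str) -> str:
    """Lower-case a token and undo common digit/symbol substitutions."""
    return word.lower().translate(LEET)
```

The catch is that the same pass mangles legitimate tokens that really do contain digits: "B12" comes out as "bi2", which is exactly the kind of word the article says the filter should understand.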
Am I part of the core demographic for Swedish Fish?
I will read the paper when I get the proceedings for the International Joint Conference on Artificial Intelligence. From the article, this seems like a statistical natural language processing application: the examples looked like they collect statistics of associations for both single word and short word sequences.
BTW, associating, clustering, etc. documents using single word statistics is computationally cheap and easy - it is also associating short word sequences that makes this a difficult problem.
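To make the single-word versus word-sequence distinction concrete, here is a small Python sketch. It is my own illustration, not the paper's code; the whitespace tokenisation is an assumption.

```python
from collections import Counter


def unigram_counts(text: str) -> Counter:
    """Single-word statistics: one counter entry per distinct word."""
    return Counter(text.lower().split())


def bigram_counts(text: str) -> Counter:
    """Two-word sequence statistics: one entry per distinct adjacent pair."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))
```

With a vocabulary of V words there are at most V unigram entries but up to V*V possible bigram entries, which is where the cost (and the sparsity) comes from.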
There are lots of legit e-mails discussing vitamins, Viagra or even penis enlargement, this post included.
Substitute "piracy" for "spam" and reread your post.
: Some of the penis spam I've been getting lately is really impressive in how oblique a reference to sex
:)
: can be and yet still be immediately understandable.
Hmm.... the Carry On films and Up Pompeii were doing that in the 60s and 70s...
Anybody who has been working in the field of NLP (natural language processing) can do little more than sneer at this story.
The field of word sense exploration is one of the more mature areas of NLP; take a look at Princeton's WordNet database for an example [http://wordnet.princeton.edu/]. Using their word sense database (without referring to silly words such as "ontology") it has been possible - for years - to discover if two lemmas (that's "words" to you) are related in a particular way, or not related. Using WordNet it is possible to distinguish between antonyms and homonyms, thereby thwarting spammers who use words which sound like "viagra" - "niagra" and words which have opposite meanings.
They who would give up an essential liberty for temporary security, deserve neither liberty or security - Ben Franklin
Using Wikipedia for artificial intelligence makes far more sense than using it for actual intelligence.
It's not so much a collection of facts as it is a collection of widely believed notions.
God Be Gone
Spam blocking isn't rocket science.
Block: Anything from, through or similar to a gateway that has previously been marked spam.
Block: Anything that appears to be a price list.
Block: Anything that includes an attachment unless the sender is on your whitelist.
Block: Anything with URLs that are not from domains in the user's contact list or domain whitelist.
Or (preferred):
Block: Anything not on the user's whitelist.
If you want to get on my whitelist, find some other way to contact me first.
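A rough Python sketch of a few of those rules, for anyone who wants to try them. The `Message` fields and the `strict` switch are names I made up; the gateway and price-list rules are left out because I can't think of a one-line test for them.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: str
    has_attachment: bool = False
    url_domains: list = field(default_factory=list)


def should_block(msg, sender_whitelist, domain_whitelist, strict=True):
    """Return True if the message should be blocked.

    strict=True is the "preferred" rule: block anything not on the whitelist.
    strict=False applies the attachment and URL rules instead.
    """
    sender_ok = msg.sender.lower() in sender_whitelist
    if strict:
        return not sender_ok
    if msg.has_attachment and not sender_ok:
        return True
    if any(d.lower() not in domain_whitelist for d in msg.url_domains):
        return True
    return False
```

Both whitelists are assumed to hold lower-case entries.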
This is a little off-topic, but I guess the only way to take out this menace of spam is to make the average joe accountable.
If the spam originated from a botnet in his machine, make him accountable too.
If he has installed the latest updates from Microsoft and still the botnet could get in, then it is not an issue. But, if he has not taken the effort to download the patches for say, the last 6 months, and a botnet operated from his machine, causing discomfiture to all and sundry, then he is accountable for it.
Push forward legal actions against the 'joe' and we would see real increase in the understanding of computers' fallibility and a real decrease in the amount of spam.
rajmohan_h@yahoo.com
http://threeseas.net/abstraction_physics.html
considering the article is from physorg......
and to think they plan to patent it? Abstraction Physics?
I don't think so...
A knowledge base with associative retrieval capability has eluded researchers but they have one in Wikipedia. Now if only they can get AI to successfully [and hopefully, correctly] modify the knowledge base...
(It's never too late to join the Renaissance)
You seem to be under the impression that the AI is designed to figure out whether a given e-mail is reliable or accurate. It is not. It is designed to figure out what the subject of a spam actually is. If a letter is titled "Hi! It's your Uncle Harold!" and inside is a Markov-chain generated letter on the subject of "v1aqra", a conventional spam filter may have trouble understanding that the letter is selling pills. Bayesian approaches come close, but they're in the hands of the spammers too... spammers just check their algorithms against the filter and try to get a low score.
What these researchers need is a large number of articles on a variety of subjects a human being would not describe as "nonsense." It doesn't matter whether the wikipedia article claims the common cold is caused by a virus or by swamp gas, the AI will still learn that the common cold is often associated with coughing, sneezing, sniffles or a mild fever. Viagra is associated with sex, ladies, satisfaction and inversely associated with penile pumps, spanish fly and oysters. A program that understands this is more likely to catch a cleverly generated spam.
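A toy Python sketch of that kind of association, to show that accuracy of the articles doesn't matter, only co-occurrence. The three "articles" are invented stand-ins, and the weighting (term frequency times inverse document frequency, compared by cosine similarity) is my assumption about a reasonable way to do it, not a description of the researchers' actual system.

```python
import math
from collections import Counter

ARTICLES = {  # made-up stand-ins for encyclopedia articles
    "Vitamin": "vitamin b12 and vitamin c are nutrients sold as vitamin pills",
    "Nutrition": "a diet needs each vitamin such as b12 from food or pills",
    "Bingo": "bingo players mark numbers on cards when a number is called",
}
COUNTS = {title: Counter(text.split()) for title, text in ARTICLES.items()}


def concept_vector(term: str) -> dict:
    """Weight of `term` in each article (term frequency times IDF)."""
    hits = {t: c[term] for t, c in COUNTS.items() if c[term] > 0}
    if not hits:
        return {}
    idf = math.log(len(COUNTS) / len(hits)) + 1.0
    return {t: n * idf for t, n in hits.items()}


def relatedness(a: str, b: str) -> float:
    """Cosine similarity between the concept vectors of two terms."""
    va, vb = concept_vector(a), concept_vector(b)
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Here "b12" scores high against "vitamin" and zero against "bingo", purely because of which articles the words show up in.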
My question is whether this program will associate the acronym "AI" with the adjective "burgeoning." The association with this cliche is so strong in my mind I was sure I saw it in the summary, but it seems I was wrong. That's how human brains work.
Google is burgeoning too.
The rigor of the benchmark is the key. The Turing Test really only benchmarks human mimicry -- not intelligence per se. The new theoretic basis of universal intelligence allows a mathematically rigorous approach to AI that is reviving the field after nearly 50 years of drifting in a stagnant pool of inadequate concepts.
Seastead this.
One would think that AI and Wikipedia is a great combination. Isn't it comforting to know, that our future metal overlords will have a profound understanding of 'goatse' and understand every single Simpsons reference?
This is the biggest threat to Wikipedia I've heard in a long time.
If Wikipedia content is used to determine whether a message is spam, suddenly there is a direct incentive to spammers to add spam-related content to Wikipedia.
"How to Do Nothing," kids activities, back in print!
This was first pioneered by Princeton (http://wordnet.princeton.edu/) and MIT (http://www.conceptnet.org/). People are building 'conceptnets' all over the place
Not only are they not the first to build a conceptnet, they are also not the first to build one using Wikipedia as their source.
I will contest this personally if they try to patent it.
Conceptual processing is the ONLY way to deal with these issues.
For example, what if I'm getting information sent to me from acquaintances about life extension - references to vitamins and nutrients would abound. But it wouldn't be spam.
An AI spam blocker has to know what I'm interested in, what material I've received before that was cleared, AND has to be able to, in some sense, UNDERSTAND the content rather than just correlating it to other terms atomically in terms of frequency of occurrence. Otherwise, how can it weed out material that correlates BOTH with spam and non-spam?
Without some decent implementation of conceptual processing, this just isn't possible.
Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
Who needs AI for spam filters? Just use the lameness filter: if lameness(new_mail) > 94 { bounce(new_mail, bill.gates@microsoft.com); delete(new_mail); }
the computer will use its Wikipedia-based knowledge base to infer that 'B12' is strongly associated with the concept of vitamins, and will [] identify the message as spam
Ha Ha! Blocked!
You didn't sink my battleship!
-- You can't take something off the Internet! That's like trying to take pee out of a swimming pool.
Personally, I think spammers are already much smarter than this. It may be my imagination, but if so it's surely coming, that spammers are grabbing text from places they harvest my name and just including that text in messages rather than trying to make up things from scratch. Since they want to sell me something related to things I do, doing this gives them natural camouflage since the text tends to be on topic anyway.
Also, filling Wikipedia with spam is the least of our problems. The more subtle problem is the apparent assumption in all the replies here that the spammers won't use the same technique. That is, if they know Wikipedia is being consulted to tell what words mean, then all they have to do is consult Wikipedia to find misunderstandings they can associate. e.g., it might find that B12 was a possible Bingo number, or perhaps it would find that Boeing once made an airplane named the B12, or they might find it's an isotope of Boron, or...
Knowledge is not a cure for anything. Learning, and improving technique, are ways of staying ahead, but also ways of upping the stakes. When everyone is on an even playing field knowledgewise, that knowledge is no longer a tactical advantage.
The front on which spammers could easily be brought down is not knowledge but money. The spammers don't have the money to pay for all that spam: they just penalize the rest of us for having made it free by abusing our good will. If email were made to be pay-only, it could destroy the economy of scale that spammers enjoy. And perhaps if Wikipedia becomes an important resource, making Wikipedia's use be pay-only could fix the problem. Not that it's likely to happen--I'm just observing the opportunity.
The same has been noted about the "recreational" drug trade, though: legalizing such drugs (whatever you think of the issue of use), would likely drive the price down. Speculation has it that they remain illegal in part because the illegal drug trade likes the price advantage of having things be illegal, and that they are some of the loudest to remind us that it would be immoral to legalize them. So it's hardly surprising that spammers are some of the first among us to scream about the immorality of pay-per-message email. In both cases, we continue to pay anyway: we just pay for spam removal and fighting the drug war. As long as we don't count those activities as a cost, we continue to think the price would be high to change the way things work.
Direct physmail, by contrast to email, is a minor irritation because it's paid for by the sender (even if at a discount that I might not agree with). And the availability of World Book or Compton's Encyclopedia in hardcopy has never been a way of overcoming that issue. The fact that money is charged for physmail postage is the thing that wins out. It means the sender must give thought to whether the recipient really cares, and must target mail in a way that's a win-win. No such thought is required in email because the cost is entirely negligible.
Kent M Pitman
Philosopher, Technologist, Writer
http://www.ijcai.org/papers07/Papers/IJCAI07-259.pdf
While IJCAI is a prestigious conference, and the results may be sound, the claims as to the applicability to spam filtering are bogus. The paraphrase of how state-of-the-art filters work is wrong, and there's no evidence that better word associations translate to better spam filter accuracy. None at all.
Should the authors wish to show applicability to spam filtering, they should do so using the TREC Spam Track methodology and datasets. http://trec.nist.gov/data/spam.html
The call for participation in TREC 2007 is currently open: http://trec.nist.gov/call07.html Nothing at all prevents a TREC participant from submitting a filter that includes a copy of Wikipedia, if they feel it would help.
Bayesian filters (and other statistical filters colloquially known as Bayesian) can work on any features at all; not necessarily text. In particular they can use the markup in the header of the message, the message encoding, and so on. Some of the best-performing filters don't use 'text' at all and simply treat the entire message, images and all, as a bit string; for example, compression-based filters. Another well performing filter, OSBF-Lua, uses orthogonal sparse binomial bigrams rather than individual tokens.
Recent standardized testing shows that these methods work just fine on image spam, without any OCR component.
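For readers unfamiliar with the compression-based idea, a minimal Python sketch follows. It uses zlib from the standard library as a stand-in compressor, which is my choice for illustration; the filters evaluated at TREC use other compression models, and the function names here are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def extra_cost(corpus: bytes, message: bytes) -> int:
    """Extra compressed bytes needed to append `message` to `corpus`."""
    return compressed_size(corpus + message) - compressed_size(corpus)


def classify(message: bytes, spam_corpus: bytes, ham_corpus: bytes) -> str:
    """Label the message by whichever corpus compresses it more cheaply."""
    if extra_cost(spam_corpus, message) < extra_cost(ham_corpus, message):
        return "spam"
    return "ham"
```

Nothing here tokenises or even decodes the message: it is treated as raw bytes, so the same code applies to headers, markup or image data.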
Seriously. FWIW, I am for the most part a Google fanboy.
I have had my GMail account for what, two years or so, and I really don't think google's spamfilter has ever missed a beat. That is to say that all the real spam I receive every day (~40 to 100 spams depending on the day) ends up in the spam folder, not my inbox. Spam is a total non-issue for me. OTOH, my hotmail inbox is so atrocious and the spamfilter so bad that I can't use the account for anything important. I don't know what kind of black magic they have going on at the Googleplex, but it WORKS! Maybe they do use some form of AI, but I assume they don't use what I'd call 'smart' AI. If that assumption is correct, then the spam problem doesn't seem to need AI as its solution.
What could possibly hurt the security of the American people more than giving our own government the ability to hide its
I don't see what the problem is about.
Whitelist friends, family and information sources you know about.
Block non-English character sets. Block anything with attachments, especially images. Block anything with l33t or misspelled words. Block HTML mail. Block mail which has arrived at myaddr1 and myaddr3 as well as myaddr2. Block anything over a certain size. Strip any multi-sentence fragment which shows up in Gutenberg and recalculate.
Some people might say that this would block email from management, newbies, grandma, and those 'friends' who spam everyone with lame jokes and Youtube videos. I say "and your point is?"
This article got it right first, now everybody is playing catch up, including Wikipedia's founder with his newly announced Wikia semantic search engine project. http://evolvingtrends.wordpress.com/2006/06/26/wikipedia-30-the-end-of-google/
One might as well call it crap....
A common tactic to defeat spam filters is to misspell words. The filters should look at the output of the Slashdot editors over the past decade to see what the common mistakes are.
-- Ed Avis ed@membled.com
If the A.I. works like many editors on Wikipedia, then the end result will be another intolerant fundie:
ERROR! Your original research is not welcome!
We cannot use that data as the citation format has changed. REVERTED
anata sekai o kakumei surush ga nai deshou? Anata no susumu michi wa yoi shite arimasu.
Most certainly I'm missing something, but what does this mean for people whose work actually involves talking about B12 or other vitamins?
--- "The idea is to die young, as late as possible." -- Ashley Montague
So with a wealth of knowledge from people around the globe, the best use of AI derived from this pool of information is to create a spam filter? Oh yeah, this AI deserves some government grants. Can't anyone think of a better way to use this?
But whatever the object, you must keep him praying to it. To the thing he has made, not to the person that has made him.
The issue isn't understanding how AI "works", it's understanding how to make AI work. AI isn't a moving target, we just keep assuming we're closer to it than we really are.
Obviously, Judgement Day will be triggered by Skynet in a final, frustrated attempt to eliminate spammers.
And just because your Aunt Sally doesn't want to receive spam about vitamins doesn't mean she wants to miss her weekly Bingo e-mails.
Oh, say does that Star-Spangled Banner entwine / The myrtle of Venus with Bacchus's vine?
I think the point is that many, if not most email users find themselves wading through a sea of spam despite the multiple layers of content filtering that happen between the point of origin and their inbox. The AC is partly right. Content filtering has merely delayed the death of email.
College students these days are often heard to say, "I have an email address but I never use it." They prefer their cell phones because voice and SMS text messages are not yet flooded with spam. Email may not be dead, but it's definitely gasping for air.
If you mod me down, I shall become more powerful than you could possibly imagine.
Could we take this one step further and use Wikipedia in something from Cyberdyne Systems, programmed to seek out and apply napalm directly to spammers?
Ask me about my sig!
The word "vitamin" in a message means it is spam? Methinks that the intelligence should be applied to better test for what is spam rather than simple minded associated term collecting for hot words from various online sources. Bayesian filters are much better than this already and do not require wikipedia reading to do their jobs with 99% accuracy after fairly minimal training.
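For reference, a bare-bones naive Bayes filter in Python looks something like the sketch below. It is a textbook version with add-one smoothing, not any particular product's implementation, and the class and method names are mine.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        """Record the words of one labelled message."""
        self.words[label].update(text.lower().split())
        self.docs[label] += 1

    def spam_probability(self, text: str) -> float:
        """Posterior probability that `text` is spam."""
        vocab = len(set(self.words["spam"]) | set(self.words["ham"]))
        total_docs = self.docs["spam"] + self.docs["ham"]
        log_score = {}
        for label in ("spam", "ham"):
            total = sum(self.words[label].values())
            # Smoothed prior, then smoothed per-word likelihoods.
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            for w in text.lower().split():
                score += math.log((self.words[label][w] + 1) / (total + vocab + 1))
            log_score[label] = score
        # Clamp the log-odds so math.exp cannot overflow on long messages.
        diff = max(-700.0, min(700.0, log_score["ham"] - log_score["spam"]))
        return 1.0 / (1.0 + math.exp(diff))
```

Note what happens with a word the filter has never seen, such as "b12" after training on two short messages: both classes give it the same smoothed likelihood, so the probability stays at 0.5. That is the gap the article claims the Wikipedia-based approach fills, whatever one thinks of the claim.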
I think this is described in more detail in a paper that was presented at IJCAI 07 this morning. It was fascinating. Here is the paper, and http://www.eml-research.de/english/homes/strube/papers/aaai06.pdf is another paper on the same subject.
The stronger the filters get the more needs to be done to improve how we are shown spam.
The filter can say that it's 40% sure that an email is spam, but I can tell 100%. Major clients need better interfaces for manually monitoring what the filter has deemed to be spam.
My preferred solution is to expose how certain the filter is that a message is spam via a colour coding system, and enable users to filter and sort by this certainty. You can then review only the top nth percent of your spam to make sure that it is definitely spam.
Combined with the hopeful move in improving interfaces I welcome ideas like this that should, fingers crossed, help catch more spam.
It also has interesting applications in other areas, as the article mentions, and AI as a whole.
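A quick Python sketch of the colour-coding idea, in case it helps. The thresholds, colour names and the choice to show the least certain messages first are all assumptions of mine; a real client would let the user tune them.

```python
def colour_for(spam_probability: float) -> str:
    """Map the filter's certainty to a display colour (arbitrary cut-offs)."""
    if spam_probability >= 0.95:
        return "red"
    if spam_probability >= 0.60:
        return "orange"
    return "yellow"


def review_queue(scored_messages, fraction=0.25):
    """Return the least certain `fraction` of the spam folder, least certain first.

    `scored_messages` is a list of (subject, spam_probability) pairs.
    """
    ranked = sorted(scored_messages, key=lambda pair: pair[1])
    keep = max(1, round(len(ranked) * fraction)) if ranked else 0
    return [(subject, colour_for(p)) for subject, p in ranked[:keep]]
```

The user then only has to eyeball the yellow and orange entries instead of the whole spam folder.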