Wikipedia Used for Artificial Intelligence

Confusion by camelrider · 2007-01-07 06:31 · Score: 1

Won't this pose a problem for today's semantically challenged "geek"?

Wikipedia needs work for spam filtering.... by MoHaG · 2007-01-07 06:31 · Score: 2, Insightful

With the example of using Wikipedia for spam filtering as mentioned in the post, maybe more articles need to be written on spam-slang for Viagra....

Re:Wikipedia needs work for spam filtering.... by Anonymous Coward · 2007-01-07 07:01 · Score: 1, Insightful

by spam slang, do you mean stuff like V1AGRA or V14GR4 or V1I1A1G1R1A?
If so, I'm pretty sure that's a pattern recognition problem.
As long as the AI knew the correct spelling of viagra, it would be able to recognise the characters of the word viagra in V1I1A1G1R1A.
Also you could train an AI to recognise 1 as I or L so that when the text V14GRA appears, it knows what viagra is, and realises it looks like V14GR4 so it raises the probability of the text being spam.
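
As a rough illustration of the pattern-matching idea above, here is a minimal Python sketch. The look-alike table and the function names (looks_like, obfuscated_hits) are assumptions made up for this example, and it only covers the two tricks mentioned: digits standing in for letters, and digits used as padding between letters.

```python
# Assumed substitution table: common look-alike characters mapped to letters.
# "1" is mapped to "i" only; treating it as "l" would need a second table.
LOOKALIKES = str.maketrans({"1": "i", "4": "a", "3": "e", "0": "o", "@": "a", "$": "s"})


def looks_like(token, keyword):
    """Return True if token reads as keyword once simple obfuscation is undone."""
    lowered = token.lower()
    # Case 1: digits stand in for letters, e.g. V14GR4 -> viagra.
    substituted = lowered.translate(LOOKALIKES)
    # Case 2: digits are padding between letters, e.g. V1I1A1G1R1A -> viagra.
    letters_only = "".join(ch for ch in lowered if ch.isalpha())
    return keyword in (substituted, letters_only)


def obfuscated_hits(text, keyword):
    """List tokens that match keyword only after normalisation."""
    hits = []
    for raw in text.split():
        token = raw.strip(".,!?\"'")
        # A plain, correctly spelled keyword is not counted as obfuscation.
        if token.lower() != keyword and looks_like(token, keyword):
            hits.append(token)
    return hits
```

A filter could then raise the spam probability for every hit, since a legitimate sender has little reason to disguise the word.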

More abstract phrases would be harder to classify, but there is a link to slang words for stuff like http://en.wiktionary.org/wiki/Wikisaurus:penis#English

so stuff like "got wood?" etc could in theory be classified.
Re:Wikipedia needs work for spam filtering.... by Metasquares · 2007-01-07 08:31 · Score: 4, Insightful

Infer too much and the false positive rate skyrockets, though...

Probably in use for a while. by CookieOfFortune · 2007-01-07 06:32 · Score: 1

I wouldn't be surprised if Mossad's been using this for a while.

uh oh, there goes wikipedia by ILuvRamen · 2007-01-07 06:32 · Score: 4, Interesting

Don't you think masses of spammers are going to deliberately screw with Wikipedia so that it stops working for this purpose once it starts blocking them well? They should just stop being afraid of being called racist and super-filter every e-mail that comes out of South Korea, Indonesia, and especially Nigeria, etc. Type spam map into Google image search to see how blatantly obvious it is where the spam comes from. Something like 98% of spam can be pinned down to 0.01% of the world by square footage. If they added fuzzy logic instead of alterable AI and only blocked e-mails from South Korea containing the word vitamin, but not ones from Nebraska containing the word vitamin, then the problem would decrease dramatically.

--
Google's Super Secret Search Algorithm: SELECT @search_results FROM internet WHERE @search_results = 'good'

Re:uh oh, there goes wikipedia by WilliamSChips · 2007-01-07 06:50 · Score: 4, Insightful

You don't think there are hundreds of thousands of zombifiable computers in the United States? And what about people with business connections in China or Korea?

--
Please, for the good of Humanity, vote Obama.
Re:uh oh, there goes wikipedia by gradedcheese · 2007-01-07 06:52 · Score: 2, Informative

most spam I get now looks to be from botnets rigged up using people's PCs here in the United States. Very little (in my inbox anyway) comes from the usual suspect geographical areas.
Re:uh oh, there goes wikipedia by ScentCone · 2007-01-07 07:04 · Score: 5, Interesting

You don't think there are hundreds of thousands of zombifiable computers in the United States?

Um, so? That doesn't make it inappropriate to block traffic from places where the overwhelming majority of the packets are toxic. It's a system-by-system, admin-by-admin judgement call, but there's no question that Korea isn't doing nearly enough to stop this problem locally. If the local culture starts to realize that they're isolating themselves from large sections of the internet because they won't do something to prevent 99% of their outbound mail from being spam, then maybe the need to filter will also go away.

And what about people with business connections in China or Korea?

I have a lot of customers with contacts like that. All of them (their Asian contacts) use Yahoo, Gmail, and similar accounts specifically to avoid this problem. Businesses in China and Korea are totally aware that most ISPs in those areas have poisoned outbound SMTP relays and user desktops. Or, they host their western-facing mail servers with providers in the west - I see a lot of that, too, since many of those businesses have two separate messaging platforms for the different international audiences with whom they communicate.

--
Don't disappoint your bird dog. Go to the range.
Re:uh oh, there goes wikipedia by DavidShor · 2007-01-07 07:05 · Score: 1

I did as you said, it seems like Nigeria is rather insignificant, the problem is more one of population density.
Re:uh oh, there goes wikipedia by Walt+Dismal · 2007-01-07 07:17 · Score: 2, Insightful

I agree that using Wikipedia opens up the knowledge base to strategic contamination. Any party with a vested interest could alter certain information and bias AIs using it. That is why I think the Israeli approach cited will run into problems.
In my own research I've looked at the problem of AI knowledgebase contamination and know that unless a truth validation system is employed, it is all too easy to condemn the poor AI to reasoning with flawed data. And it's very difficult to design a good validation mechanism. Can you use 'common' knowledge and opinion to check against? Well, the masses aren't always right. There are a lot of falsehoods floating around the Internet. Collecting a pool of information from various sources requires effort to cross-check and evaluate.
Of course humans face the same problem, and a lot of people reason with incomplete, incorrect, invalid data. Which might explain why the dollar is dropping versus the Euro. :)
Re:uh oh, there goes wikipedia by NeutronCowboy · 2007-01-07 07:22 · Score: 1

Damn - that first sentence of yours took the words right out of my mouth. Unfortunately, I don't agree one iota with the rest of your post. But I'll just deal with the first point....

I sure as hell hope that this approach fails miserably, because I can guarantee you that the next development will be the bot-based modification of all articles in the Wikipedia. There might be some development after that of captcha interstitials before posting or modifying anything, combined with some attempt at developing a more permanent community around posters. Personally, I'd like to see some moderation and meta-moderation system similar to what Slashdot uses, along with a decay that requires posters to be ranked higher and higher to modify articles as they get older and older. I think that'll contribute a great deal towards making Wikipedia more stable and more useful, even if it comes at the expense of its lightning-quick response time.

--
Those who can, do. Those who can't, sue.
Re:uh oh, there goes wikipedia by Mr+Chund+Man · 2007-01-07 07:47 · Score: 5, Interesting

Spam Map

"South Korea, Indonesia, and especially Nigeria, etc"
While we're at it, why not block Alberta, California, North Carolina, Virginia, Colorado, Oklahoma, Kansas, Vermont, New Hampshire, Massachusetts, Spain, France and Portugal - all spam hotspots according to the map cited? What's that, you receive email from people in these places? Tough titties, if we're to block email coming from spam hotspots as you say.

Also, you've managed to point a finger of blame at Indonesia and Nigeria who are saintly in comparison to some more developed nations. Go racism!
Re:uh oh, there goes wikipedia by timeOday · 2007-01-07 07:48 · Score: 1

I sure as hell hope that this approach fails miserably, because I can guarantee you that the next development will be the bot-based modification of all articles in the Wikipedia. There might be some development after that of captcha interstitials before posting or modifying anything, combined with some attempt at developing a more permanent community around posters.
What this argument boils down to is "I don't want computers to get smarter because I don't like some of the applications." Of course there's some truth to that; we're not going to make software that "understands" Wikipedia (in a slightly less weak sense than before) without also making spambots smarter, it's all the same. But focusing on Spam is very shortsighted. Different parties have always had an interest in skewing information sources to their own ends. The whole essence of Wikipedia is coping with that through mass participation. So now AI is fighting over the same info territory as people? Sounds like progress to me, the AI must be getting smarter.
Re:uh oh, there goes wikipedia by orkysoft · 2007-01-07 07:56 · Score: 1

Or, they host their western-facing mail servers with providers in the west - I see a lot of that, too, since many of those businesses have two separate messaging platforms for the different international audiences with whom they communicate.

So those servers aren't being abused by spammers, as their normal mail servers are? But who does receive mail from the abused servers, then? Since it's almost all spam, it seems like everybody would want to block them.

--

I suffer from attention surplus disorder.
Re:uh oh, there goes wikipedia by poopdeville · 2007-01-07 08:08 · Score: 1

Yes, this approach will fail miserably.

It works well in principle, and the algorithms to implement the approach are well understood. But they are phenomenally computationally expensive. And Wikipedia is a very large dataset to mine from.

This is pretty much true of all AI, unfortunately.

--
After all, I am strangely colored.
Re:uh oh, there goes wikipedia by maxwell+demon · 2007-01-07 09:14 · Score: 1

Maybe the AI just has to understand how to use the Wikipedia page history. And maybe run the Wikipedia pages through some spam filter ...

--
The Tao of math: The numbers you can count are not the real numbers.
Re:uh oh, there goes wikipedia by maxwell+demon · 2007-01-07 09:20 · Score: 1

It will get interesting when AIs start edit wars at Wikipedia :-)

--
The Tao of math: The numbers you can count are not the real numbers.
Re:uh oh, there goes wikipedia by maxwell+demon · 2007-01-07 09:24 · Score: 1

Maybe the spammers will find a new use for their botnets ... imagine all Windows computers on the net turning into a single, gigantic distributed AI!

--
The Tao of math: The numbers you can count are not the real numbers.
Re:uh oh, there goes wikipedia by NeutronCowboy · 2007-01-07 09:56 · Score: 1

What this argument boils down to is "I don't want computers to get smarter because I don't like some of the applications."

Err, no. I have no idea where you got this idea from. What I actually don't like is weak attempts at improving the intelligence of computers. Furthermore, I like even less weak attempts at improving the intelligence of computers whose direct and inevitable consequence is the corruption of an incredibly useful resource, which in turn will lead to the corruption of the AI - the initial goal of the project.

I don't have a problem with AI edit wars. I have a problem with edit wars whose sole purpose is to destroy useful information. And that's where this approach is going to lead.

--
Those who can, do. Those who can't, sue.
Re:uh oh, there goes wikipedia by Gwwfps · 2007-01-07 10:55 · Score: 2, Insightful

Um, so? That doesn't make it inappropriate to block traffic from places where the overwhelming majority of the packets are toxic.

I would think that the majority of inbound mail those places get from say the US will be "toxic" as well. When legitimate traffic between two regions is scarce (like between places with differing languages and a large geographical separation), of course the spam will seem overwhelming by proportion.
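
To make the proportion argument concrete, here is a small Python sketch. The volumes are invented purely for illustration, and spam_fraction is a made-up helper name: both links carry the same absolute amount of spam, but one carries far less legitimate mail.

```python
def spam_fraction(spam, legitimate):
    """Share of all mail on a link that is spam."""
    return spam / (spam + legitimate)


# Made-up volumes: identical spam counts, very different legitimate traffic.
busy_link = spam_fraction(spam=1000, legitimate=9000)
quiet_link = spam_fraction(spam=1000, legitimate=10)

print(f"busy link:  {busy_link:.1%} spam")
print(f"quiet link: {quiet_link:.1%} spam")
```

With these numbers the busy link comes out at about 10% spam and the quiet one at about 99%, even though neither side is sending more junk than the other.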
Re:uh oh, there goes wikipedia by FooAtWFU · 2007-01-07 11:05 · Score: 1

Maybe the AI is working from a local copy of the Wikipedia database that isn't vulnerable to live vandalism or anything silly like that. And maybe Wikipedia spammers are more interested in a) putting links to their sites at the bottom of articles to boost PageRank and to capture the attention of random viewers or b) putting in biased promotional material and other advertisements in a relevant page. And maybe this is likely to be far more attractive of an option than spamming Wikipedia in irrelevant places in the vague hope to poison a Bayesian filter which may or may not exist and probably is unlikely to ever see a revision of the article with this irrelevant information before someone reverts it. (Remember, it's obvious, systematic vandalism that attracts the most attention).

--
The World Wide Web is dying. Soon, we shall have only the Internet.
Re:uh oh, there goes wikipedia by syousef · 2007-01-07 12:51 · Score: 1

That doesn't make it inappropriate to block traffic from places where the overwhelming majority of the packets are toxic.

By turning the INTERnet into the HINDERnet, your effort will eventually make the Internet useless. You therefore destroy what you're trying to facilitate use of. Not clever.

--
These posts express my own personal views, not those of my employer
Re:uh oh, there goes wikipedia by Incadenza · 2007-01-07 13:02 · Score: 1

Type spam map into Google image search to see how blatantly obvious it is where the spam comes from.
Since you were modded 'interesting', I did exactly like you told and found this page: http://mailinator.com/mailinator/map.html. Refreshed it 3 times now, and every time at least 4 balloons are pointing at the US, one at Canada and 2 or 3 at European countries. Interesting indeed.
Re:uh oh, there goes wikipedia by ScentCone · 2007-01-07 13:03 · Score: 1

I would think that the majority of inbound mail those places get from say the US will be "toxic" as well. When legitimate traffic between two regions is scarce (like between places with differing languages and a large geographical separation), of course the spam will seem overwhelming by proportion.

Yup, good point. Which is why the same thing seems to be true to/from, say... Romania, etc. also.

--
Don't disappoint your bird dog. Go to the range.
Re:uh oh, there goes wikipedia by ScentCone · 2007-01-07 13:05 · Score: 1

By turning the INTERnet into the HINDERnet, your effort will eventually make the Internet useless. You therefore destroy what you're trying to facilitate use of. Not clever.

You're missing the point. When the packets from entire Class B address ranges are, by empirical testing, almost entirely crap, the people who own those addresses have already broken their little corner of the internet. Preserving the non-poisoned portion of the wider network isn't "destroying the village to save it," it's just sort of like putting up those highway sound walls - unpleasant, but necessary.

--
Don't disappoint your bird dog. Go to the range.
Re:uh oh, there goes wikipedia by syousef · 2007-01-07 13:15 · Score: 1

Sorry but what a terrible analogy. Sound walls don't redirect traffic, they fix the problem of sound affecting nearby homes. You're mixing a traffic metaphor with a sound metaphor in a way that makes so little sense it's worse than bad - it's confusing.

You definitely do destroy not only the village but a connected community of villages with your solution. What should be happening is bringing pressure to bear against those who have had the address space allocated to them, then moving up the supply chain. Ultimately those addresses should be allocated elsewhere to others willing to play by the rules and block violators. There is bound to be someone who can see advantage to being given control over the address space and who's willing to use it correctly. You therefore CURE the village - that's how you prevent the fire-bombing.

--
These posts express my own personal views, not those of my employer
Re:uh oh, there goes wikipedia by ILuvRamen · 2007-01-07 13:30 · Score: 1

there's a percentage-based one from 2004 with the top countries listed with their % of spam sent but I couldn't re-find it. Keep looking, it's incredibly accurate. About 14% of worldwide spam comes from South Korea. Only like 5% from China. I think 8% from the US. Nigeria's scam rating according to DNSSTUFF.com is 12 and I'm pretty sure it's on a scale of 1-10. Indonesia is 6, Israel is 3 as well as Pakistan and India if I remember correctly.

--
Google's Super Secret Search Algorithm: SELECT @search_results FROM internet WHERE @search_results = 'good'
Re:uh oh, there goes wikipedia by ozmanjusri · 2007-01-07 13:56 · Score: 1

Something like 98% of spam can be pinned down to 0.01% of the world by square footage.
A rough assessment of the last 30 days spam stored on my server suggests more than 75% comes from the USA.
A quick look at http://www.mailinator.com/mailinator/map.html shows clusters in the south (Memphis seems to be a hotspot) and on the east coast.
I don't know about Korea, but blocking Tennessee, Missouri and Florida would cut my spam in half. Blocking the rest of the USA would reduce it by 75%.

--
"I've got more toys than Teruhisa Kitahara."
Re:uh oh, there goes wikipedia by ScentCone · 2007-01-07 13:58 · Score: 1

Sorry but what a terrible analogy. Sound walls don't redirect traffic, they fix the problem of sound affecting nearby homes. You're mixing a traffic metaphor with a sound metaphor in a way that makes so little sense it's worse than bad - it's confusing.

You're working too hard at this. The sound walls are an undesirable but nevertheless somewhat effective treatment for the symptom of a larger problem. The analogy is apt.

What should be happening is bringing pressure to bear against those who have had the address space allocated to them, then moving up the supply chain.

Yes. And the people to DO that are the people using those addresses that find it doesn't really HELP them to have those addresses because the owners/administrators aren't taking care of things. If my ISP wasn't doing anything about crappy fellow users, I'd put financial pressure on them by taking my business elsewhere or (in the case of a state-run provider) take my vote elsewhere. I'm helping those people apply that pressure, right now.

--
Don't disappoint your bird dog. Go to the range.
Re:uh oh, there goes wikipedia by Urza9814 · 2007-01-07 14:42 · Score: 1

Hit the first result when you google 'spam map'. The 'mailinator' one. What's it show? Most of the locations pinpointed are within the US. 10 out of 15. The others are in China, Spain, Germany, and Brazil.
Re:uh oh, there goes wikipedia by Anonymous Coward · 2007-01-07 17:30 · Score: 0

Excuse me for interrupting this flame-fest with facts, but I found a Spam map online:

http://postini.com/stats/world-spam-2048.jpg

Most spam comes from California, France, southern England, Japan, and, oh yes, Korea and China. And, no, I do not advocate blocking email from, say, California just because a lot of spam comes from there.
Re:uh oh, there goes wikipedia by syousef · 2007-01-07 18:37 · Score: 1

The analogy wasn't apt at all. It was awful. What you're advocating diminishes the internet. I'm suggesting you punish the administrators not just the end users. Take away their IP address allocation and give them to someone else who's willing to make proper use of them. Don't block IPs.

--
These posts express my own personal views, not those of my employer
Re:uh oh, there goes wikipedia by Anonymous Coward · 2007-01-07 19:45 · Score: 0

then what the fuck you waiting on bitch? your fucking blocker broke, pussy?
Re:uh oh, there goes wikipedia by the_digitalmouse · 2007-01-07 21:11 · Score: 1

interestingly, most of the Nigerian scam email i receive use Yahoo accounts, and Yahoo certainly hasn't done much to police them, so I think your point is kinda silly.

also, having looked at enough email headers from spammers, while they may originate from some of those countries you mentioned, i notice many use accounts like Yahoo and gmail from U.S. servers, which shoots your whole theory down.

--
http://about.me/jimm.pratt
Re:uh oh, there goes wikipedia by cyphercell · 2007-01-07 21:47 · Score: 1

http://tecfa.unige.ch/~nova/img/spam-map.jpg -- Spam map
http://www.skills-1st.co.uk/papers/brunel-mirror/world-on-your-desktop/internet-map.gif -- Internet map (circa '94)

You'll notice that every nation on the Internet produces spam. Interestingly, in the US the cities that would be most affected by a "block by geography" policy would be LA, Seattle, Dallas, and New York. Spam originates in all nations that have Internet and is most densely located in areas that are most populated. Your policy may work on a case by case basis, but if someone were to block all emails I get from New York I think my boss would be pretty pissed. Korea I can live without, but none of my spam seems to come from there, it comes from client addresses that are scraped by US companies.

--
Under the influence of Post-Cyberpunk Gonzo Journalism
Re:uh oh, there goes wikipedia by mixenmaxen · 2007-01-08 00:21 · Score: 1

I just typed spam map into Google Images, and judging by the results the entire East coast of the U.S should promptly be disconnected from the Internet to solve the problem
Re:uh oh, there goes wikipedia by ScentCone · 2007-01-08 01:11 · Score: 1

interestingly, most of the Nigerian scam email i receive use Yahoo accounts, and Yahoo certainly hasn't done much to police them, so I think your point is kinda silly.

also, having looked at enough email headers from spammers, while they may originate from some of those countries you mentioned, i notice many use accounts like Yahoo and gmail from U.S. servers, which shoots your whole theory down.

But, it's not a theory. I'm talking about what I actually see in logs and message queues, especially on receiving servers that don't have particularly sophisticated spam blocking. Of course a more industrious, dedicated Nigerian-style spammer can get things out using a Yahoo or Gmail account, but he can't send thousands (let alone millions) that way - there's no way to do that through their web interfaces. Neither Yahoo nor Gmail will tolerate large-volume floods the way that some open relay or bot army running an outbound SMTP engine will. You see the occasional spam coming from places like Yahoo because some of those scammers are now getting desperate enough to do them one at a time. That's nothing compared to the ocean of stuff that's getting blocked before you see it. When I review what's being filtered, the huge majority of it is coming from places like Korea, China, and eastern Europe. You're talking about the leftovers.

--
Don't disappoint your bird dog. Go to the range.
Re:uh oh, there goes wikipedia by SheeEttin · 2007-01-08 03:41 · Score: 1

New Hampshire
'Scuse me, I'ma go get my shotgun and go out for some hunting.
Re:uh oh, there goes wikipedia by syousef · 2007-01-08 10:02 · Score: 1

You shouldn't be targeting geography at all. NY or Korea, it makes no difference, some businesses may have a legitimate need to communicate with someone at a particular geography. The Internet's beauty is that with few exceptions (shipping costs, time zones, legislation) you don't even need to worry about someone's physical location.

I'm not suggesting you block a nation. I'm suggesting you strike a deal with someone else in that country to provide the same addresses, on pain of losing them if they can't control the spam.

--
These posts express my own personal views, not those of my employer
Re:uh oh, there goes wikipedia by Anonymous Coward · 2007-01-08 12:40 · Score: 0

Nigeria and Indonesia are countries. Racism is about races, not nations. It's not my fault that ethnic majorities or outright monocultures exist in some countries. Foreign relations is not about race, unless you make it so. You're really reaching.
Re:uh oh, there goes wikipedia by vuo · 2007-01-08 12:52 · Score: 1

"Something like 98% of spam can be pinned down to 0.01% of the world"

No, you got this wrong. 99% of demand for spam can be pinned down to 0.5% of singular countries in the world. And that isn't Nigeria, South Korea or Indonesia.
Actually, I could block 100% of spam with only a handful of possible false-positive sources that can be easily whitelisted by blocking all messages in English.
Re:uh oh, there goes wikipedia by Mr+Chund+Man · 2007-01-08 15:10 · Score: 1

You've done a sterling job of offering nothing at all to the discussion. Nice one!
Re:uh oh, there goes wikipedia by Gordonjcp · 2007-01-08 21:14 · Score: 1

They should just stop being afraid of being called racist and super-filter every e-mail that comes out of South Korea, Indonesia, and especially Nigeria, etc

Except that all the spam (bar maybe one or two a day) that hits my mail server comes from the US. At that, mostly from 0wned Windows XP machines in broadband IP pools.
Re:uh oh, there goes wikipedia by presidentbeef · 2007-01-10 16:59 · Score: 1

super-filter every e-mail that comes out of South Korea, Indonesia, and especially Nigeria, etc.

That's funny, I did as you suggested. I see no spam coming out of Nigeria. Just to be sure, I took a trip to Postini and checked. Nope. There is a tiny bit from Egypt, but otherwise I don't see any spam originating from the entire African continent.
That doesn't mean you are wrong about spam being from very specific areas, but Nigeria just isn't supported by the evidence you mention.

--
Everything I need to know about copyrights I learned from Slashdot.

vitamins? by Anonymous Coward · 2007-01-07 06:33 · Score: 0

As explained above it's entirely too simple and will flag way too many false positives. For example, all the emails my dad sent me last week about vitamins would have been sent directly to my spam box... maybe I'm missing something here.

Re:vitamins? by asavage · 2007-01-07 06:42 · Score: 1

I think it just wasn't explained well. What it is supposed to do is recognize that an unseen word has the same meaning as a word the spam filter already knows and adjust the score of the email in the same way. Any email filter that filtered out emails based on the occurrence of any single word would have an unacceptable amount of legitimate email filtered.
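
A minimal Python sketch of that idea, under heavy assumptions: CONCEPT_OF stands in for whatever word-to-concept index would be derived from Wikipedia, LEARNED_SCORE for the per-word scores a filter has already learned, and all the numbers and names are invented for illustration.

```python
# Hypothetical word -> concept table standing in for a Wikipedia-derived index.
CONCEPT_OF = {"b12": "vitamin", "cobalamin": "vitamin"}

# Hypothetical per-word spam scores the filter has already learned (0 = ham, 1 = spam).
LEARNED_SCORE = {"vitamin": 0.8, "meeting": 0.1}

# Score used when nothing at all is known about a word.
NEUTRAL = 0.5


def word_score(word):
    """Score a word directly if known, otherwise through its related concept."""
    word = word.lower()
    if word in LEARNED_SCORE:
        return LEARNED_SCORE[word]
    # Unseen word: borrow the score of the known word it is related to, if any.
    concept = CONCEPT_OF.get(word)
    if concept in LEARNED_SCORE:
        return LEARNED_SCORE[concept]
    return NEUTRAL
```

Here word_score("B12") borrows the 0.8 learned for "vitamin". That is still only one word's contribution to the overall score of the mail, not a verdict on its own.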

Nothing new here... by Bodrius · 2007-01-07 06:35 · Score: 5, Funny

This isn't new to Slashdotters...

For years, Slashdot posts have used wikipedia as a form of artificial intelligence.

--
Freedom is the freedom to say 2+2=4, everything else follows...

Re:Nothing new here... by Anonymous Coward · 2007-01-07 06:52 · Score: 0

For years, Slashdot posts have used wikipedia as a form of artificial intelligence.

Which you must admit is a nice change from natural stupidity.
Re:Nothing new here... by Anonymous Coward · 2007-01-07 07:55 · Score: 0

For years, Slashdot posts have used wikipedia as a form of artificial intelligence.

For those who don't know what AI is:

Artificial intelligence (AI) can be defined as intelligence exhibited by an artificial (non-natural, manufactured) entity. AI is studied in overlapping fields of computer science, psychology and engineering, dealing with intelligent behavior, learning and adaptation in machines, generally assumed to be computers.

Gentlemen, I give you Be-12! by CRCulver · 2007-01-07 06:35 · Score: 2, Insightful

Buy the federal phamacon regulatory agency's approved Be-12 from our licenced apotecaries! It's Be-12, the addition to your daily sustinence intake that makes it easier to just Be you!

I suspect that any skilled spammer can work around such filters through circumlocution. Some of the penis spam I've been getting lately is really impressive in how oblique a reference to sex can be and yet still be immediately understandable.

Re:Gentlemen, I give you Be-12! by dangitman · 2007-01-07 12:45 · Score: 1

Your penis gets spam? Damn it must hurt if you put it through a filter.

--
... and then they built the supercollider.
Re:Gentlemen, I give you Be-12! by Watson+Ladd · 2007-01-07 14:47 · Score: 1

Well, just filter out all Bloodhound Gang lyrics and we're ok.

--
Inventions have long since reached their limit, and I see no hope for further development.-- Frontinus, 1st cent. AD

WikiTuring Test by MillionthMonkey · 2007-01-07 06:35 · Score: 1

wife at the devil, and the wife certainly cuckolds her husband. Whereas, house of Austria acquired the seventeen provinces, and by the latter, his from Leipsig, to which he refers in a subsequent one, and which I upon, than 'la pluie et le beau tens'.

So which is it, Wikipedia? Should I open the big image attachment?

Re:WikiTuring Test by Halo1 · 2007-01-07 07:02 · Score: 3, Funny

I recently got a quite funny attempt like that, pumping some stock in the image attachment (which moreover looked like a captcha in order to avoid OCR). The title of the spam was however "cocaine inexcusable", and the body, well (just two sample quotes -- and yes, the first two sentences appeared together like that):

We are working with Internet Content Rating Association to make the internet safer for children. Powered by a super strong Japanese motor and gears this incredibly powerful anal probe will hit the spot every time.
The Blue Rocket is a handy little clit massager that packs a mighty punch.

Needless to say, it triggered the Bayesian filter pretty heavily in spite of all the obfuscation attempts :)

--
Donate free food here
Re:WikiTuring Test by MillionthMonkey · 2007-01-07 07:56 · Score: 1

We are working with Internet Content Rating Association to make the internet safer for children. Powered by a super strong Japanese motor and gears this incredibly powerful anal probe will hit the spot every time.
The Blue Rocket is a handy little clit massager that packs a mighty punch.
Want to see where their spider got this stuff?

The safe for children crap (since reworded)
The Intimate Intruder Anal Probe
The Wrist Rocket
Re:WikiTuring Test by mandelbr0t · 2007-01-08 08:16 · Score: 1

A good sample of the fake content that spam engines create. It seems intuitively obvious to me that this text is completely meaningless, but getting an AI to understand why is much trickier. Clues come from the fact that "latter" is used incorrectly (there being no "former" to distinguish it from), the pronoun "his" refers to no subject, the comparative "than" doesn't compare two subjects, etc.

Unfortunately, humans make these sorts of semantic errors all the time. We're just extending a bayesian filter to make a statement about the spam probability based on the "makes-sense" factor of the message. Tagging doesn't really help much (tagging beta: Austria) either since we're just guessing based on word density and prominence.

I can believe that better language processors and increased electronic availability of information will help in syntactically and semantically deconstructing a message, but until humans in general are capable of more detailed conversations with their computer, all the semantic and syntactic analysis is not very useful to an end-user. Ultimately, you need to be able to tell your spam filter about what has been misidentified. Something like "you identified term1 in message0 as classification3, but it's really junk. Please update your understanding of classification3 with this new information." Most people can't update their understanding this easily, let alone know how term1 relates to classification3.

mandelbr0t

--
"Please describe the scientific nature of the 'whammy'" - Agent Scully

The B12 example is horrible by Anonymous Coward · 2007-01-07 06:36 · Score: 1, Funny

Suppose somebody was trying to sell me a B12 bomber.

That wouldn't be spam to me, but an exclusive offer that would cause me to act now.

Re:The B12 example is horrible by tepples · 2007-01-07 07:21 · Score: 3, Informative

Suppose somebody was trying to sell me a B12 bomber.
Then your e-mail account's Bayes map would have the map (word B12 -> folder Aircraft) with a high probability, which would outweigh (word B12 -> article Vitamin -> folder Drug Spam).
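
One way to picture that precedence, as a Python sketch under stated assumptions: the two tables, the probabilities, and the INFERENCE_WEIGHT discount are all invented for illustration, and nothing here claims to be how any real filter stores its Bayes map.

```python
# Hypothetical probabilities for one user's mailbox.
# Learned directly from mail the user filed by hand:
DIRECT = {"b12": {"Aircraft": 0.9}}
# Inferred indirectly through the word's encyclopedia article (B12 -> Vitamin):
VIA_ARTICLE = {"b12": {"Drug Spam": 0.7}}
# Assumed discount: an inferred link counts for less than a learned one.
INFERENCE_WEIGHT = 0.5


def folder_scores(word):
    """Merge direct and article-derived evidence for a word into one score table."""
    scores = dict(DIRECT.get(word, {}))
    for folder, p in VIA_ARTICLE.get(word, {}).items():
        scores[folder] = max(scores.get(folder, 0.0), p * INFERENCE_WEIGHT)
    return scores


def best_folder(word):
    """Return the highest-scoring folder, or None if the word is unknown."""
    scores = folder_scores(word)
    return max(scores, key=scores.get) if scores else None
```

With these made-up numbers, best_folder("b12") picks Aircraft, because the user's own filing history beats the generic vitamin association.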
Re:The B12 example is horrible by maxwell+demon · 2007-01-07 09:37 · Score: 1

Of course, someone you want to meet in Germany could send you a mail describing how to get to him, containing the words: "Then you have to take the B12" (B12 here means Bundesstraße 12, i.e. federal street 12). Unless you get lots of mail with way descriptions from Germany, it's quite unlikely that "B12->german street" will have a high probability in your spam filter. OTOH this is the type of mail which you certainly don't want to get filtered out.

--
The Tao of math: The numbers you can count are not the real numbers.
Re:The B12 example is horrible by dkf · 2007-01-08 01:48 · Score: 1

Plus, a message discussing a B12 bomber would be likely to have other high-ham words, especially in the context of an ongoing discussion on the topic. Bayesian filters (or at least the ones that are any good) pick up on this sort of thing too, and it is part and parcel of what makes real content filtering so effective. But effective content filtering has to be done on actual mailboxes; it depends on the fact that individual people don't discuss that many different topics on a normal basis...
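
A rough sketch of why the surrounding words matter, assuming the common naive-Bayes combination in which per-token spam probabilities are multiplied as if independent. The probabilities below are invented for illustration:

```python
import math


def combine(token_spam_probs):
    """Combine per-token spam probabilities (each strictly between 0 and 1)
    under the naive independence assumption, working in log space."""
    log_spam = sum(math.log(p) for p in token_spam_probs)
    log_ham = sum(math.log(1.0 - p) for p in token_spam_probs)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


# Hypothetical values: "b12" looks spammy on its own, but the
# aircraft-related words in the same message are strongly ham.
b12_alone = combine([0.9])
b12_in_context = combine([0.9, 0.05, 0.05, 0.1])
```

One spammy token is easily outvoted by a handful of hammy ones, which only works when the probabilities come from a single person's mailbox.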

--
"Little does he know, but there is no 'I' in 'Idiot'!"

i prefer by macadamia_harold · 2007-01-07 06:38 · Score: 4, Funny

For instance, the front of spam fighting is a tough one and it looks as though researchers are now turning towards an ontology or taxonomy based solution to fight spammers.

I think it would be much more effective if we used a taxidermy-based solution to fight spammers.

--
Push Button, Receive Bacon

Re:i prefer by RobertLTux · 2007-01-07 12:01 · Score: 1

or could you say they need to hear a few high caliber sermons by the Quartet of James, John, Horace Smith and Daniel B Wesson?

Thus saith the Load STFU and get off my internet!

--
Any person using FTFY or editing my postings agrees to a US$50.00 charge
Re:i prefer by Anonymous Coward · 2007-01-07 13:30 · Score: 0

Your post advocates a

(X) technical ( ) legislative ( ) market-based (X) vigilante

approach to fighting spam. Your idea will not work. Here is why it won't work. (One or more of the following may apply to your particular idea, and it may have other flaws which used to vary from state to state before a bad federal law was passed.)

( ) Spammers can easily use it to harvest email addresses
( ) Mailing lists and other legitimate email uses would be affected
(X) No one will be able to find the guy or collect the money
(X) It is defenseless against brute force attacks
( ) It will stop spam for two weeks and then we'll be stuck with it
( ) Users of email will not put up with it
( ) Microsoft will not put up with it
(X) The police will not put up with it
( ) Requires too much cooperation from spammers
( ) Requires immediate total cooperation from everybody at once
( ) Many email users cannot afford to lose business or alienate potential employers
( ) Spammers don't care about invalid addresses in their lists
(X) Anyone could anonymously destroy anyone else's career or business

Specifically, your plan fails to account for

(X) Laws expressly prohibiting it
( ) Lack of centrally controlling authority for email
(X) Open relays in foreign countries
( ) Ease of searching tiny alphanumeric address space of all email addresses
(X) Asshats
(X) Jurisdictional problems
( ) Unpopularity of weird new taxes
( ) Public reluctance to accept weird new forms of money
( ) Huge existing software investment in SMTP
( ) Susceptibility of protocols other than SMTP to attack
( ) Willingness of users to install OS patches received by email
(X) Armies of worm riddled broadband-connected Windows boxes
( ) Eternal arms race involved in all filtering approaches
(X) Extreme profitability of spam
(X) Joe jobs and/or identity theft
( ) Technically illiterate politicians
( ) Extreme stupidity on the part of people who do business with spammers
( ) Dishonesty on the part of spammers themselves
( ) Bandwidth costs that are unaffected by client filtering
( ) Outlook

and the following philosophical objections may also apply:

(X) Ideas similar to yours are easy to come up with, yet none have ever been shown practical
( ) Any scheme based on opt-out is unacceptable
( ) SMTP headers should not be the subject of legislation
( ) Blacklists suck
( ) Whitelists suck
( ) We should be able to talk about Viagra without being censored
( ) Countermeasures should not involve wire fraud or credit card fraud
( ) Countermeasures should not involve sabotage of public networks
( ) Countermeasures must work if phased in gradually
( ) Sending email should be free
( ) Why should we have to trust you and your servers?
( ) Incompatibility with open source or open source licenses
(X) Feel-good measures do nothing to solve the problem
( ) Temporary/one-time email addresses are cumbersome
(X) I don't want the government reading my email
(X) Killing them that way is not slow and painful enough

Furthermore, this is what I think about you:

(X) Sorry dude, but I don't think it would work.
( ) This is a stupid idea, and you're a stupid person for suggesting it.
( ) Nice try, asshole! I'm going to find out where you live and burn your house down!
Re:i prefer by Anonymous Coward · 2007-01-08 00:45 · Score: 0

Your post advocates a

(X) technical ( ) legislative ( ) market-based (X) vigilante

approach to fighting spam. Your idea will not work. Here is why it won't work. (One or more of the following may apply to your particular idea, and it may have other flaws which used to vary from state to state before a bad federal law was passed.)

( ) Spammers can easily use it to harvest email addresses
( ) Mailing lists and other legitimate email uses would be affected
(X) No one will be able to find the guy or collect the money
(X) It is defenseless against brute force attacks
( ) It will stop spam for two weeks and then we'll be stuck with it
( ) Users of email will not put up with it
( ) Microsoft will not put up with it
(X) The police will not put up with it
( ) Requires too much cooperation from spammers
( ) Requires immediate total cooperation from everybody at once
( ) Many email users cannot afford to lose business or alienate potential employers
( ) Spammers don't care about invalid addresses in their lists
(X) Anyone could anonymously destroy anyone else's career or business

Specifically, your plan fails to account for

(X) Laws expressly prohibiting it
( ) Lack of centrally controlling authority for email
(X) Open relays in foreign countries
( ) Ease of searching tiny alphanumeric address space of all email addresses
(X) Asshats
(X) Jurisdictional problems
( ) Unpopularity of weird new taxes
( ) Public reluctance to accept weird new forms of money
( ) Huge existing software investment in SMTP
( ) Susceptibility of protocols other than SMTP to attack
( ) Willingness of users to install OS patches received by email
(X) Armies of worm riddled broadband-connected Windows boxes
( ) Eternal arms race involved in all filtering approaches
(X) Extreme profitability of spam
(X) Joe jobs and/or identity theft
( ) Technically illiterate politicians
( ) Extreme stupidity on the part of people who do business with spammers
( ) Dishonesty on the part of spammers themselves
( ) Bandwidth costs that are unaffected by client filtering
( ) Outlook

and the following philosophical objections may also apply:

(X) Ideas similar to yours are easy to come up with, yet none have ever been shown practical
( ) Any scheme based on opt-out is unacceptable
( ) SMTP headers should not be the subject of legislation
( ) Blacklists suck
( ) Whitelists suck
( ) We should be able to talk about Viagra without being censored
( ) Countermeasures should not involve wire fraud or credit card fraud
( ) Countermeasures should not involve sabotage of public networks
( ) Countermeasures must work if phased in gradually
( ) Sending email should be free
( ) Why should we have to trust you and your servers?
( ) Incompatibility with open source or open source licenses
(X) Feel-good measures do nothing to solve the problem
( ) Temporary/one-time email addresses are cumbersome
(X) I don't want the government reading my email
(X) Killing them that way is not slow and painful enough

Furthermore, this is what I think about you:

(X) Sorry dude, but I don't think it would work.
( ) This is a stupid idea, and you're a stupid person for suggesting it.
( ) Nice try, asshole! I'm going to find out where you live and burn your house down!

Save me! Math. by Anonymous Coward · 2007-01-07 06:38 · Score: 1, Insightful

"The applications for this may be endless. For instance, the front of spam fighting is a tough one and it looks as though researchers are now turning towards an ontology or taxonomy based solution to fight spammers. "

So what happened to bayesian filters as our saviour?

Re:Save me! Math. by CRCulver · 2007-01-07 06:43 · Score: 3, Interesting

The Bayesian analysis in spam filters only works on text. Spammers realized that they could get around it by filling the text portion of the message with some random passage from a Project Gutenberg file, thus making it seem innocuous, and then putting the real advertisement in a GIF or PNG file that would be displayed by HTML-capable mail readers. Bayesian analysis can still work, but only in combination with OCR software.
Re:Save me! Math. by Danny+Rathjens · 2007-01-07 07:07 · Score: 1

Bayesian analysis can still work, but only in combination with OCR software.
That is not entirely correct. Bayesian filters work with *all* textual tokens in a message, not just the visible text in the body of the message. E.g. if your image spam all has various combinations of debora@somerandomdomain in the mail headers, as a recent spambot was doing, or if your spam all used the same relays and consequently has the same Received: headers, then a Bayesian filter will still rank it higher than non-spam. I have yet to install the OCR plugin for SpamAssassin and yet the majority of image spam my company receives is still correctly marked spam with a high Bayes score.

Obviously SpamAssassin has other tests to combine the Bayesian test with, but that is the whole idea and the reason it works well, since no test is perfect but each can give good indications. Using OCR to examine the images gives you one more test. So it improves the accuracy - at the cost of more intensive resource usage - but normal Bayesian analysis is by no means completely ineffective without OCR.
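
To illustrate the point about header tokens, here is a small sketch using Python's standard `email` module. It is not how SpamAssassin tokenizes. The regex, the `header:token` prefix scheme, and the sample message are assumptions made up for the example:

```python
import re
from email import message_from_string

# Assumed token pattern: runs of letters, digits and a few address characters.
TOKEN_RE = re.compile(r"[A-Za-z0-9@.\-]+")


def _split(text):
    """Lower-case tokens with stray leading/trailing dots and dashes removed."""
    found = (tok.strip(".-").lower() for tok in TOKEN_RE.findall(text))
    return [tok for tok in found if tok]


def tokens(raw_message):
    """Return header tokens (prefixed with the header name) plus body tokens."""
    msg = message_from_string(raw_message)
    out = []
    for name, value in msg.items():
        out.extend(f"{name.lower()}:{tok}" for tok in _split(str(value)))
    if not msg.is_multipart():
        out.extend(_split(msg.get_payload()))
    return out


# Hypothetical image-spam message: innocuous body, telltale headers.
SAMPLE = (
    "Received: from relay.example.net by mx.example.org\n"
    "From: debora@somerandomdomain.example\n"
    "Subject: hello\n"
    "\n"
    "A quiet passage of borrowed prose.\n"
)
SAMPLE_TOKENS = tokens(SAMPLE)
```

Even though the body here is harmless filler, the relay and sender tokens are still available for the filter to learn from.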
Re:Save me! Math. by rjshields · 2007-01-07 07:41 · Score: 1

Yes, but OCR is too slow to actually be useful. Plus spammers are using slanted, wobbly, coloured text, random backgrounds and all manner of methods to prevent OCR from working effectively.

--
In this world nothing is certain but death, taxes and flawed car analogies.
Re:Save me! Math. by Anonymous Coward · 2007-01-07 08:50 · Score: 0

Bayesian analysis can still work, but only in combination with OCR software.

No... it can't! When will people wake up and smell the coffee?? Content based filtering is NOT working and will NEVER work!

There is no way that software will be able to decide what is legit and what is spam based purely on the content of the message, not reliably enough anyway. And meanwhile you are allowing the spammers to eat up massive amounts of bandwidth and server resources by passing their messages in full all the way to your users' desktop clients.

No, the way to stop spam HAS to be connection based filtering! Things like SPF and DNS RBLs. Refuse the connection BEFORE the spammers have a chance to waste your server bandwidth and resources! Refuse the email BEFORE it ends up cluttering the inbox of your users! We all need to focus on a stronger effort (by the email server admin community) to weed out what IPs are legit and what IPs are being used by asshole spammers!!

The advantages of connection based filtering far outweigh those of content based filtering. As I previously mentioned, if you refuse the connection you save bandwidth. But it goes beyond this. If you accidentally refuse a legit connection the sending party will at least get a legit bounce notice back from their server. This is something you cannot get from content based filtering, as it is not safe, and is considered extremely rude, to have your system send out responses to every message flagged as spam based on content. So with content based filtering the sender has NO IDEA if their message ends up getting filtered by the recipient! With connection based filtering they would definitely get a bounce notice if their server was being accidentally and/or incorrectly flagged as a source of spam. And the final advantage of connection based filtering? It provides negative feedback to these spammer assholes!!! Even if you do the most incredible job in the world of using content filtering, the spammers are still able to send messages to your address and have NO idea you are not reading their crap. So they just keep sending more! If all of a sudden they find themselves unable to send messages to the majority of SMTP servers on the Net they would KNOW they are being blocked!

I say it's time we all started moving SPF and DNS RBL checking to the connection stage of our SMTP servers! I for one am already working on this, writing a proxy that can sit in front of any SMTP server and just drop unwanted connections using a MySQL DB to track SPF and DNS RBL tagged IPs...
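
For reference, a DNS RBL check at connection time boils down to one DNS lookup: reverse the client's IPv4 octets, append the list's zone, and see whether the name resolves. The sketch below is not the proxy described above. The zone name is a placeholder, and a real deployment would also inspect the returned 127.0.0.x code and set resolver timeouts:

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the lookup name for an IPv4 address, e.g.
    192.0.2.99 on zone dnsbl.example.org -> 99.2.0.192.dnsbl.example.org."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone):
    """True if the list publishes an A record for this address.
    Resolution failure (including NXDOMAIN) is treated as 'not listed'."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


# "dnsbl.example.org" is a hypothetical zone used only for illustration.
EXAMPLE_NAME = dnsbl_query_name("192.0.2.99", "dnsbl.example.org")
```

A front-end proxy would call something like `is_listed` on the peer address right after accepting the TCP connection and before relaying anything to the real SMTP server.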
Re:Save me! Math. by maxwell+demon · 2007-01-07 09:41 · Score: 1

Of course, a program which detects images consisting of slanted, wobbly, colored text with random background wouldn't have to OCR that text anyway: Any such image has almost 100% spam probability.

--
The Tao of math: The numbers you can count are not the real numbers.
Re:Save me! Math. by DavidShor · 2007-01-07 13:25 · Score: 1

we've been trying to do that to them for years, we have now set the spammers against themselves!
Re:Save me! Math. by gvc · 2007-01-07 14:02 · Score: 1

Content based filtering is NOT working and will NEVER work!

I don't usually respond to ACs, but this particular belief is common enough that I feel I should say a few words. The overall goal of spam abatement is to enhance the probability that legitimate email will be delivered in a timely and efficient manner to its intended recipient. Content-based filtering is widely deployed in this context and it is fairly effective for its intended purpose. Demonstrably more effective, and less intrusive, than forcing the recipient to wade through spam and triage mail manually. And demonstrably more effective, and less intrusive, than refusing or challenging unfamiliar email as a matter of course.

To the extent that we measure these aspects -- risk of non-delivery, delay, intrusiveness of solutions -- the world will be better off. Unsubstantiated dismissal of a particular approach -- especially one for which there is extensive evidence of its efficacy -- is unhelpful.
Re:Save me! Math. by Anonymous Coward · 2007-01-07 18:16 · Score: 0

To the extent that we measure these aspects -- risk of non-delivery, delay, intrusiveness of solutions -- the world will be better off. Unsubstantiated dismissal of a particular approach -- especially one for which there is extensive evidence of its efficacy -- is unhelpful.

Hey, I am all for improving the situation in any way we can! However, I have yet to see a single content based filtering system that actually solves the problem in the manner you describe:

Demonstrably more effective, and less intrusive, than forcing the recipient to wade through spam and triage mail manually

The problem with content based filtering is it either increases the amount of wading due to quality control needs or decreases the amount of wading at the expense of lost messages. With connection blocking it is at least known to the sender that the message did not get through. With content filtering you easily get a situation where the sender doesn't know the recipient's email client has filtered the message, and the recipient is unaware of the message. I would think that in most cases this would create a longer delay in the communication process between these two users. Now if the sender gets a bounce notice they know fairly quickly that their attempt to communicate failed and can take some other action, like a phone call. So to that end:

overall goal of spam abatement is to enhance the probability that legitimate email will be delivered in a timely and efficient manner

Is probably accomplished much better by a system that actually gives one side of the communication channel some clear indication of what is going on.

more effective, and less intrusive, than refusing or challenging unfamiliar email as a matter of course.

I am not talking about refusing unfamiliar email as a matter of course. I am talking about building better IP black lists and improving and inventing more technologies similar to SPF. A community effort, by all the legit email server admins, to track sources of spam on the Net and block them / shut them down. Some people think a community built black list cannot work; to that I say take a look at the damn Internet itself! The only reason it works right (most of the time) is that all the legit networks that are interconnected have agreed to follow certain rules regarding protocol. We all agree to use the same root servers, we all agree to pass each other's IPv4 packets, why can't we all start agreeing on who is a legit part of our community? It is possible. Granted, I am sure the process will be painful at first, but with how bad the spam problem has become I for one am willing to go through the startup hassles!

I wouldn't bash content filtering if I thought it worked, but I have tried various software in various network environments for our clients and I have yet to find something that works well. So if you can show me a content based filtering solution that actually works, go ahead and point me in the right direction, I am not unwilling to try new things. However, my experience up to this point has been that connection based rules seem to work much better than content filtering.
Re:Save me! Math. by APOLAUF · 2007-01-07 21:46 · Score: 0

Bayesian filtering on its own doesn't derive much from semantical information - it is based purely on probabilistic and deterministic data. That being said, it is quite possible, if not likely, that Bayesian learning techniques can be added to a system utilizing semantical inference, as this article suggests. For example, while a wiki article might contain references to a particular related subject, the number of references to the subject from other subjects can be used in a Bayesian methodology to predict the likelihood that the subject being analyzed is relevant to a semantical context. One could say that the field of AI is now much more concentrated on hybridized approaches, merging various techniques of AI as well as machine learning into more sophisticated methods. Indeed, it is quite possible that the system described here may represent yet another "plugin" in an AI researcher's and/or developer's toolbox of techniques - and a publicly endorsed method of relational analysis at that? Sounds like a winner.
Re:Save me! Math. by rjshields · 2007-01-07 23:45 · Score: 1

Oh, so you just need a slanted, wobbly, colored text and random background detector that doesn't FP like crazy ;)

--
In this world nothing is certain but death, taxes and flawed car analogies.
Re:Save me! Math. by gvc · 2007-01-08 02:26 · Score: 1

The problem with content based filtering is it either increases the amount of wading due to quality control needs or decreases the amount of wading at the expense of lost messages.

There's no evidence that the statement above is true. A user who has to wade through a mixture of spam and non-spam will overlook some of the non-spam. The question is whether the human or the machine will overlook more. A subsidiary question is, once overlooked, how likely is the message to be retrieved using some subsidiary mechanism (second look, scanning the quarantine, whatever). There *is* evidence that content filters are better than humans at the initial separation of good email and spam, *and* that separate good and quarantine folders improve performance on the second task.

Here are two content-based filters that work very well: OSBF-Lua and Bogofilter. SpamAssassin's "Bayes filter" works well, too, but you have to configure it a bit differently: http://plg.uwaterloo.ca/~gvcormac/spamassassin

I wouldn't bash content filtering if I thought it worked

You go a lot further than saying I don't think it works. You pronounce from great heights that it cannot possibly work. Such dismissive statements are without merit.
Re:Save me! Math. by T.E.D. · 2007-01-08 07:55 · Score: 1

The Bayesian analysis in spam filters only works on text. Spammers realized that they could get around it by filling the text portion of the message with some random passage from a Project Gutenberg fil

I'm using Thunderbird 1.5.0.9, and it seems to work great on those "book attack" spams. I haven't seen one get through yet, so they appear to be less likely to get through than normal spams.

On a guess, I'd say that a random chunk of literature is far more likely to contain words never used in valid correspondence to me than any other kind of message. Just looking over the last one it caught, it uses literary words nobody would use in a one-to-one conversation like "modester", "idolatry", "mankind". It also goes on about the Pyrenees, Saxony, the Duchy of Brittany, etc. There's even a smattering of Latin in it. It would be tough to purposely make up something more likely to trip filters.

The root problem is that correspondence and literature are two *very* different styles of writing. Putting random literature in what is trying to pass as correspondence is doomed to failure. Please don't tell the spammers this though.

Err...yeah. There's no way to filter this stuff. We're all doomed!

Welcome your troll overlord by Tablizer · 2007-01-07 06:40 · Score: 1

All the trolls and spammers on W.P. will F-up this AI, and Skynet will be trolling and spamming mankind forever.

--
Table-ized A.I.

Seems like a concept. by Assassin+bug · 2007-01-07 06:43 · Score: 1

However, since Wikipedia is not the model of truth hopefully they are going to perform crosschecks with other sources? Or maybe they will just use peer reviewed pages or "feature articles"? Still, cross-checks with additional online encyclopedias would be a good idea.

Re:Seems like a concept. by Grismar · 2007-01-07 08:27 · Score: 1

Frankly, I'm not too worried about that. I'd worry myself about them being able to do anything useful with it in the first place.

It doesn't say a whole lot about how they plan to actually parse this information (that's not exactly in a standard format) or how that will translate into something that'll make sense for searching.

Of course they might be using techniques that have been around for ages, for analyzing corpora and inferring contextual information. Picking Wikipedia to do this is just a clever way to attract attention to a project that's not all that new after all.

I'm not saying we won't be seeing results before long, but I doubt these guys will beat others to the punch just because they use Wikipedia as base material.

Cool solution to yesterday's problem by G4from128k · 2007-01-07 06:43 · Score: 1

It's not the words that the spam filter can't recognize that let spam get through, it's the increasing use of image spam. OCR and existing filters would do more to solve spam than would wiki-AI intelligent filters.

Of course, the minute anti-spam software/services use OCR is the minute that spam images start looking like captchas.

--
Two wrongs don't make a right, but three lefts do.

Re:Cool solution to yesterday's problem by Anonymous Coward · 2007-01-07 07:25 · Score: 0

For f*cks sake now I'll have to try to decipher the damn spam as well?!

Why don't you just fill the image with black color and force me to use Acrobat reader to get the text while you're at it?!!

Can't even get decent spam these days...
Re:Cool solution to yesterday's problem by NoOneInParticular · 2007-01-07 08:37 · Score: 1

Hmm, so what's actually happening is that the spammers are coercing the spam-filter writers to create good enough OCR so that the spammers can turn around and use that to circumvent the captchas on the www. Talk about a devious ploy! We're fucked.

Artificial intelligence! by tcopeland · 2007-01-07 06:44 · Score: 3, Informative

And all this time you thought it was just if and switch statements!

Whenever someone claims that a program is semantically aware, be sure to reread Clay Shirky's article on the Semantic web.

--
The Army reading list

Re:Artificial intelligence! by starkravingmad · 2007-01-07 16:48 · Score: 1

That article is full of errors and omissions, e.g. from
"- Count Dracula is a Vampire
- Count Dracula lives in Transylvania
- Transylvania is a region of Romania
- Vampires are not real"

he concludes that "You can draw only one non-clashing conclusion from such a set of assertions -- Romania isn't real.". That isn't true - the only conclusion you can draw is that Count Dracula is not real. (just because A implies B doesn't mean B implies A).

He uses examples like 'People who live in Brooklyn have a Brooklyn accent' is false because he lives in Brooklyn but doesn't have a Brooklyn accent. In that case a more accurate representation would be '(some percent) of people in Brooklyn have a Brooklyn accent'. It's possible to have probability associated with your beliefs (see 'An Introduction to Probability and Inductive Logic' by Ian Hacking). It's also possible to have strength associated with your beliefs (see any book on the field of Belief Change dynamics), choose between contradictory beliefs, etc. It's an interesting field with many problems but this was one of the more ill-informed opinions I have seen on it.

And just because I like nitpicking, here is another error of omission from that opinion. He says that from:
- US citizens are people
- The First Amendment covers the rights of US citizens
- Nike is protected by the First Amendment

we can conclude that Nike is a person because the First Amendment covers the rights of Nike.

This is untrue. The set of statements is incomplete because it doesn't define what Nike is or that the First Amendment covers people and corporations.

Take this example:

John likes dogs
Dogs are animals
Books are not animals.

Would you conclude that John doesn't like books?

The true problems with the semantic web are the volume of meta data and its credibility. I guess I should add to that list people who criticise over 2000 years of logic without understanding its fundamentals.
Re:Artificial intelligence! by rentmej · 2007-01-08 04:29 · Score: 1

The main problem that we run into with AI isn't just the ability to understand the semantic meaning of the word, but to also understand the syntax of a group of words and the context of a sentence.
Take a look at Shakespeare's famous quote of "To be or not to be, that is the question."
Breaking "To be or not to be, that is the question" into its constituent parts, the words themselves are very simple, with only one word of two syllables. Looking at it semantically, we can understand the individual words, but even these have a plethora of meanings. "Be" itself presents us with a list of meanings. Dictionary.com gives us eleven different meanings, and the top one actually uses the quote itself as an example. Now if we look at the syntax of the words and begin to look at them in more meaningful groupings we extract even more meaning from them. Semantics gives us the meaning of "to" and "be", and syntax lets us understand that this combination gives us a meaning of life; "to be" is "to be alive". Syntax also gives us more meaning of "the question" as "the most important question" or "the only question that truly matters". Now if we take everything together and place it in the proper context, we are finally looking at this as a question of whether or not Hamlet should continue to go on with life, or if he should just end it all by committing suicide.
This is where the idea of using Wikipedia to understand any text becomes useful. It gives programmers a huge database of information for an AI to draw upon. But it would also be required to draw its own conclusions to actually be considered "intelligent".
So, by using Wikipedia to understand the meaning of individual words to build a library of syntactic meaning, relationships like those found on WordNet http://wordnet.princeton.edu/ would be created. From there it should have a better understanding of the context of a message. This would prevent spammers from using the technique of grouping together a half dozen sentences that individually make sense and get past a filter. The AI would learn that the context for each individual sentence is out of range for each preceding sentence and block it.

--
0100001001100101011010010110111001100111 0100100001110101011011010110000101101110

Future trends... by __aaclcg7560 · 2007-01-07 06:46 · Score: 2, Interesting

Artificial Intelligence may evolve to the point that it may decide to rewrite Wikipedia from a human-centric point of view to an AI-centric point of view (i.e., World War II resulted in the deaths of six million AIs). Since people will believe anything and Wikipedia can't be wrong, it'll be one step towards the formation of the Matrix. After all, only the victors write history.

Great - it computers deciding what email I get by cjonslashdot · 2007-01-07 06:48 · Score: 1

This might be an interesting area of research, but I for one do not want my ISP deciding what is legitimate email. E.g., what if I WANT to email someone about vitamins??? I do not want to have the uncertainty that my email will be deleted as spam. That would destroy the usefulness of email as a major form of business and personal communication. If I configure a SPAM filter, or the filter is "advisory", that is fine. But using AI to decide and delete is not advisable IMHO. Going down the AI path seems to me like someone is going to start assuming that an AI filter can be smart enough to make guesses that I do not specifically configure. I do not want that.

The real reason for SPAM is that email systems do not verify the sender. Sender verification is essential so that senders who spam can be blacklisted.

Another problem is that people have global email addresses. What is needed is a unique address for each pair of sender and recipient. That way, if you give out your email address, it is unique to both you and the person you give it to (the person who you "invite" to contact you). This is similar to the concept of a "disposable" email address, except that there is no reason that it has to be disposable: it can be permanent. In effect, it creates a permanent way for an individual to reach you. E.g., you can create an address for person A to reach you as 'personA@mydomain.com'. If your email client then requires such unique sender/receiver addresses for all invited senders and requires sender verification for uninvited senders you have a very effective total anti-spam system.
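
One way the per-sender address idea could be sketched, assuming the recipient derives each address from a private key with an HMAC so that no table of issued addresses is needed. The `+tag` layout, the secret, and the function names are all hypothetical. Note that it binds an address to a *claimed* sender, so it still depends on the sender verification mentioned above:

```python
import hashlib
import hmac

SECRET = b"replace-with-a-private-key"  # hypothetical secret held by the recipient
DOMAIN = "mydomain.com"                 # domain borrowed from the example above


def _tag(sender):
    """Short keyed digest of the (case-insensitive) sender address."""
    digest = hmac.new(SECRET, sender.lower().encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:10]


def address_for(local, sender):
    """Address to hand out to one specific correspondent."""
    return f"{local}+{_tag(sender)}@{DOMAIN}"


def accepts(recipient_address, sender):
    """True only if the address was issued for this sender."""
    local_part, _, domain = recipient_address.partition("@")
    _, _, tag = local_part.partition("+")
    expected = _tag(sender)
    return domain == DOMAIN and hmac.compare_digest(
        tag.encode("utf-8"), expected.encode("utf-8")
    )


issued = address_for("me", "alice@example.org")
```

An address that leaks to a spammer is then useless to them unless they also forge the original correspondent's address, and it can be revoked individually without touching anyone else's.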

Re:Great - it computers deciding what email I get by MoHaG · 2007-01-07 06:58 · Score: 1

This might be an interesting area of research, but I for one do not want my ISP deciding what is legitimate email. E.g., what if I WANT to email someone about vitamins??? I do not want to have the uncertainty that my email will be deleted as spam. That would destroy the usefulness of email as a major form of business and personal communication. If I configure a SPAM filter, or the filter is "advisory", that is fine. But using AI to decide and delete is not advisable IMHO. Going down the AI path seems to me like someone is going to start assuming that an AI filter can be smart enough to make guesses that I do not specifically configure. I do not want that.

The real reason for SPAM is that email systems do not verify the sender. Sender verification is essential so that senders who spam can be blacklisted.

Another problem is that people have global email addresses. What is needed is a unique address for each pair of sender and recipient. That way, if you give out your email address, it is unique to both you and the person you give it to (the person who you "invite" to contact you). This is similar to the concept of a "disposable" email address, except that there is no reason that it has to be disposable: it can be permanent. In effect, it creates a permanent way for an individual to reach you. E.g., you can create an address for person A to reach you as 'personA@mydomain.com'. If your email client then requires such unique sender/receiver addresses for all invited senders and requires sender verification for uninvited senders you have a very effective total anti-spam system.

A lot of ISPs and webmail providers already filter spam; making the filtering more effective will probably not change which ISPs filter users' email without giving the user an option to turn it off.

Of course nothing prevents you from changing ISPs if your ISP forces unreasonable policies onto you...

Even with sender verification anyone can still register a domain and run their own mail servers. Sender verification will probably require all email servers on the internet to be replaced with different software using different protocols. Manual verification will not work either; I will just choose to communicate with companies that do not require manual verification.
Re:Great - it computers deciding what email I get by tepples · 2007-01-07 07:27 · Score: 1

Of course nothing prevents you from changing ISPs if your ISP forces unreasonable policies onto you...
Unless you live in Qatar. Or more practically for residents of countries with an anglophonic majority, unless you live in an area where both the local cable company and the local DSL company have policies that you consider unreasonable.
Re:Great - it computers deciding what email I get by MoHaG · 2007-01-08 05:47 · Score: 1

For email you can always host your own servers. HTTP is not so easy. In South Africa where I live, AFAIK, all the ISPs except at least one cellphone company transparently proxy HTTP traffic ("in order to save bandwidth"). So you can choose ADSL / dialup / iBurst / other wireless service with proxy, or GPRS without proxy.
Re:Great - it computers deciding what email I get by tepples · 2007-01-08 08:24 · Score: 1

For email you can always host your own servers.
You mean "smarthosting" through an e-mail provider in North America or Europe, right? Otherwise, your cable or DSL connection is on the "dynamic IP" list as well as a "spam haven country" list.
Re:Great - it computers deciding what email I get by MoHaG · 2007-01-09 06:08 · Score: 1

For email you can always host your own servers.
You mean "smarthosting" through an e-mail provider in North America or Europe, right? Otherwise, your cable or DSL connection is on the "dynamic IP" list as well as a "spam haven country" list.
I mean hiring a dedicated server somewhere and installing an SMTP and POP server on it. The server can be located almost anywhere where there are hosting companies.

Otherwise, just using Gmail for domains is an option (but they have spam filtering).

Uhh by unborracho · 2007-01-07 06:49 · Score: 1

B12 which is a vitamin which is also known to increase your health which your aunt sally sends you messages regularly on, so great, all messages from aunt sally are now blocked.

--
"You had this look that of an angel, it was such a bad disguise" --Dishwalla

Re:Uhh by DavidLeblond · 2007-01-07 07:28 · Score: 1

Please excuse my dear aunt sally.
Re:Uhh by tepples · 2007-01-07 07:30 · Score: 1

B12 which is a vitamin which is also known to increase your health which your aunt sally sends you messages regularly on, so great, all messages from aunt sally are now blocked.
In an e-mail system with sane defaults, wouldn't your aunt sally get whitelisted rawther quickly if you regularly reply to her e-mails?
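A minimal sketch of that kind of default, assuming a hypothetical `ReplyWhitelist` class and an arbitrary threshold of two replies (neither comes from any real mail client):

```python
from collections import Counter


class ReplyWhitelist:
    """Whitelists a correspondent once the user has replied to them often enough."""

    def __init__(self, replies_needed=2):
        # ASSUMPTION: the threshold is arbitrary and chosen only for illustration.
        self.replies_needed = replies_needed
        self._replies = Counter()

    def record_reply(self, to_address):
        # Called whenever the user sends a reply to `to_address`.
        self._replies[to_address.lower()] += 1

    def is_whitelisted(self, sender):
        # A Counter returns 0 for addresses the user has never replied to.
        return self._replies[sender.lower()] >= self.replies_needed
```

Once whitelisted, mail from that address would skip the content filter entirely, so a message about B12 gets through no matter what the classifier thinks of vitamins.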
Re:Uhh by CoderDog · 2007-01-07 07:32 · Score: 2, Interesting

Presumably, Aunt Sally will be in your white-list and be passed through whether she's tipping you to startling new developments for viagra, or B-12. Most of the anti-spam work is done in an effort to avoid building mammoth personal black-lists of mostly short-lived addresses. I doubt we'll get rid of white-lists anytime soon, if ever.

What would impress me is an AI that filtered spam very effectively, but also noticed that Aunt Sally had a new email address and continued to deliver her mail.
Re:Uhh by Anonymous Coward · 2007-01-07 07:53 · Score: 0

Unless Aunt Sally is a freakin' South Korean spam operator.
Re:Uhh by nelsonal · 2007-01-07 09:35 · Score: 1

Ha ha, I guess that's a pretty effective mnemonic (the firefox spell checker is the bees knees). I remembered that it was one, and remembered it, but had to google what it was supposed to be reminding me (even though I apply the order of operations nearly every day).

--
Degaussing scares the bad magnetism out of the monitor and fills it with good karma.

UMMMM wordnet? by Anonymous Coward · 2007-01-07 06:50 · Score: 4, Informative

this kind of technique has been used for a while..

http://wordnet.princeton.edu/

and according to my source of AI, wikipedia http://en.wikipedia.org/wiki/WordNet
(like all sophisticated software) has been in development since the mid eighties..

WordNet® is a large lexical database of English, developed under the direction of George A. Miller. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing

Re:UMMMM wordnet? by Anonymous Coward · 2007-01-07 07:26 · Score: 0

yes, I can't imagine that wikipedia would be better for this than wordnet or even a similar ontology in the same vein.

I work for wordnet, and I'm really getting a kick out of these replies...

seriously. I did. I even wrote a few APIs for it, in java and python. Good stuff.
Re:UMMMM wordnet? by modeless · 2007-01-07 08:43 · Score: 1, Interesting

I can't imagine that wikipedia would be better for this than wordnet

You must not have a very good imagination. Wikipedia articles are far larger than wordnet definitions, with much more potential to hold useful information. Wikipedia has a much larger scope than wordnet, including huge amounts of cultural, historical, and scientific data that wordnet ignores. Wikipedia has a larger team of contributors. Wikipedia has data in several other languages besides English. Wikipedia is constantly updated with the latest information in all of its articles.

Wordnet is more structured and carefully maintained, but that is its sole advantage over Wikipedia as far as I can see. And IMHO, that's not really an advantage when talking about real-world AI problems like detecting spam. Spam is not structured or carefully maintained. A successful real-world AI needs to deal with unstructured, ambiguous, even malicious data. An AI that can't tolerate these things will undoubtedly fail.

--
Firebug. It will make your jaw hit the floor.
Re:UMMMM wordnet? by Anonymous Coward · 2007-01-08 03:24 · Score: 0

as author of GP, I'd agree that wikipedia probably would store more context, have
more languages available, etc..

but TFA (and it appears the so-called 'researchers') are trying to pass the
general semantic/contextual analysis of words and concepts as a new technique,
when it absolutely is not...

I'd probably use wikipedia to augment wordnet using a big hashtabley thing like framerd,
but then again, what do I know about 'new techniques'

Since when by trifish · 2007-01-07 06:54 · Score: 3, Insightful

Since when a database + automated search (keyword patterns and relations) = artificial intelligence?

Re:Since when by Flamesplash · 2007-01-07 07:20 · Score: 1

You have just described Data Mining.

--
"Not knowing when the dawn will come, I open every door." - Emily Dickinson
Re:Since when by timeOday · 2007-01-07 07:54 · Score: 4, Informative

Since when a database + automated search (keyword patterns and relations) = artificial intelligence?
What part of human/animal intelligence is not detecting, storing, and applying patterns and relations?
Re:Since when by Kjella · 2007-01-07 08:03 · Score: 2, Interesting

Well, most of the definitions of artificial intelligence go "intelligence by something artificial", then we're down to intelligence which is so fuzzily defined almost anything can be applied. The first definition of intelligence on wikipedia focuses on individuality, which in other words says it's a bunch of skills rolled up into one. The other is even fuzzier. Quote WP:

A second definition of intelligence comes from "Mainstream Science on Intelligence", which was signed by 52 intelligence researchers in 1994:
"a very general mental capability that, among other things, involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience. It is not merely book learning, a narrow academic skill, or test-taking smarts. Rather, it reflects a broader and deeper capability for comprehending our surroundings - "catching on", "making sense" of things, or "figuring out" what to do"

If you're able to use wikipedia to associate words, disassociate meanings of the same words (like the disambiguation pages), understand subsets and supersets (B12 relates to vitamin, vitamin doesn't always relate to B12) then you're certainly emulating a lot of human intelligence. And well... the Turing test is all about emulating human intelligence. In other words "we don't know what it is, but if you're like us it's intelligence".

In fact, there's a pretty big group of people which almost define intelligence as whatever only humans do. If animals do it, it's instinct and if computers do it, it's logic with no thought involved. Over the years we've been giving computers more and more "open" problems, not finite and deterministic as chess (which in itself was considered intelligence until humans got spanked in it) and it turns out, the computer isn't half bad at it.

So we shrink intelligence to things that are unique or rare, and the computer lacks the in-depth understanding. Goodbye pattern recognition (statistical analysis) and inductive logic (bayesian filters, neural nets) as intelligence. Hell, we got computers hooked up to research labs essentially running the whole scientific method of characterisations, hypotheses, predictions and experiments and yet, intelligence is something else. I think that in the end, "do computers have intelligence?" will be a question of philosophy along the lines of "do animals have souls?", because well... what we're doing isn't that magical.

--
Live today, because you never know what tomorrow brings
Re:Since when by Luminus · 2007-01-07 08:17 · Score: 0

The part of intelligence that involves semantic content, or actually understanding what a symbol means. That part that, per The Chinese Room Argument, explains why there will never be such a thing as what most would consider "AI."

If all intelligence amounts to is pattern-manipulation (syntax), then the weather (for example) is intelligent. But if intelligence amounts to more than just syntax, and it does, then no collection of keywords/databases/searches/processor speed will ever amount to intelligence.
Re:Since when by blank+axolotl · 2007-01-07 08:41 · Score: 1

The intelligence is in the search part. The program has to figure out what word relates to the topic and in what way. EG, in this post I use the word 'figure', but it does not relate to the topic of intelligence and is a verb.

That is what artificial intelligence is about: getting info out of a big mash of data with no calculable pattern. For example, to solve chess, you have to figure out which moves are good out of all the possible moves (the big mash of data). Chess is currently not calculable because there are too many possible moves, so the program has to 'guess' or reason to some degree.

Your average database search is not AI because the data is organized.
Re:Since when by Alef · 2007-01-07 08:56 · Score: 1

That is the thing with artificial intelligence research. So long as the concepts are understood only by researchers, people call it AI and regard it as something mysterious, but as soon as it gets useful applications and reaches the public it becomes "just statistics" or "business rule engines" or something similar. What you describe is data mining, a concept on the verge of entering the public mind.
Re:Since when by timeOday · 2007-01-07 09:11 · Score: 1

I hope we do have a spirit that makes us innately different from machines, but I'll just point out that an AI that can exhibit human-level intelligence would revolutionize the world, whether "weak" or "strong." In fact I'd prefer they were "weak" so we wouldn't have to give them rights or feel guilty about making them work for us.
Re:Since when by maxwell+demon · 2007-01-07 10:06 · Score: 2, Insightful

What part of human/animal intelligence is not detecting, storing, and applying patterns and relations?

The creative part?

--
The Tao of math: The numbers you can count are not the real numbers.
Re:Since when by timeOday · 2007-01-07 10:12 · Score: 2, Interesting

Maybe creative people just detect more abstract patterns (e.g. lower S/N ratio) than others?
Re:Since when by sacrilicious · 2007-01-07 10:38 · Score: 2, Informative

What part of human/animal intelligence is not detecting, storing, and applying patterns and relations?
Paraphrasing to make a point: What part of computing is not detecting, storing, and applying patterns and relations?
To be meaningful, "AI" should denote more than (as the article summary indicates is being done) doing a grep through a web repository to deduce associations. There are branches of AI founded on brain neurology (neural nets), evolution (Genetic Algorithms), Bayesian logic, and various other things. Not all of the variants I can think of necessarily should qualify as AI (IMO), but the ones I'm thinking of are all substantially more esoteric than the summary's described approach. I take the GP's point to be that using a web repository as a database is too pedestrian to qualify as AI.

--
- First they ignore you, then they laugh at you, then ???, then profit.
Re:Since when by coaxial · 2007-01-07 12:16 · Score: 1

That's not how wikipedia is being used. It's being used as a reservoir for semantic information. You want to know if these two consecutive tokens are a name? Check wikipedia. Biographies are clearly labeled. Want to know if this token is a country? Check wikipedia. Want to know terms associated with the War of 1812? Check wikipedia. It's a data corpus made up of human annotated terms, and that's why it's valuable.
Re:Since when by petermgreen · 2007-01-07 12:37 · Score: 1

what intelligence is is a difficult question to answer.

personally i'd say it's the ability to solve problems WITHOUT having been designed to solve those problems and the ability to see opportunities for improvement in the current way of doing things.

cats live in our homes, foxes roam in our cities. neither of those animals were designed for those environments nor have they had time for significant biological evolution, yet they find ways to manage in those environments.

and we have in a couple of centuries gone from farmers who lived off the land in relatively primitive houses to living in suburbs with no fields for miles around and commuting into mega-cities with huge tower blocks without being reprogrammed along the way.

current "AI" research is still a LONG LONG way from achieving that kind of adaptability.

--
note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
Re:Since when by naoursla · 2007-01-07 17:46 · Score: 1

It's funny how AI is a moving target. Once we are able to reduce, explain, and understand how some aspect of AI works, many people no longer consider it AI.
Re:Since when by trifish · 2007-01-07 21:01 · Score: 1

> What part of human/animal intelligence is not detecting, storing, and applying patterns and relations?

A red herring comment modded +5 Insightful? *Shakes head*

The keyword is part of intelligence. For instance, storing data is only a part of the "ability" called intelligence. By your logic anyone who is capable of storing is capable of artificial intelligence. However, the system advertised in this "article" has only parts of artificial intelligence. And those parts are considered rather trivial in CS.

Sheesh.
Re:Since when by trifish · 2007-01-08 02:25 · Score: 1

You call that "artificial intelligence", I call that a database. I don't think we should continue this discussion. Do your homework on AI first. Bye.
Re:Since when by coaxial · 2007-01-08 12:52 · Score: 1

You have no idea what you're talking about. If you did, you wouldn't be trying to conflate a data corpus and an algorithm. Also, if you had done the least bit of research into AI, and in this case information retrieval, you'd know just how simple real AI really is. I hate to tell you this. But AI is pretty much just simple search and table lookups. There's no magic dude. None whatsoever. So I guess in that sense, it is like magic. It looks cool and amazing when you don't know how it's done, but when you know how it works, you're left thinking, "That's all there is?"

Perhaps you would like to actually read about reinforcement learning. Because as it stands, you have all the arrogance and knowledge of the sophomore cs major that you are.
Re:Since when by trifish · 2007-01-09 08:28 · Score: 1

Dude, you have clearly no idea what you're talking about. You don't even know what intelligence is. If you believe that "AI is pretty much just simple search and table lookups" then you should ask yourself why we don't have any artificial intelligence around. Maybe you've seen some intelligent robots? I mean really intelligent.

AI does not exist in this world. It's all just AI-like-looking algorithms. Intelligence is much more than "simple search and table lookups". It's creativity, abstraction, insight and lots more things which you apparently haven't heard of and which are unique to human beings.

Don't reply to this message, I don't have time to talk to ignorants like you, who can't comprehend and appreciate the complexity of humans.

Just make spam a crime! by D4C5CE · 2007-01-07 06:58 · Score: 3, Insightful

However many academic papers and spam filters throw their ever-more-elaborate algorithms at this issue, it is an arms race that cannot be won by the "good guys", as long as lawmakers keep pretending that technology alone could prevent the effects of sociopathic behavior: unsolicited bulk messages won't go away unless sending them is severely punishable and vigorously prosecuted in all nations that contribute to the problem. This should have happened more than a decade ago, but now the world is simply running out of storage, bandwidth and CPU cycles much too quickly to afford waiting another decade (or even a year) for serious, intransigent anti-spam legislation that is long overdue.

Re:Just make spam a crime! by Neoprofin · 2007-01-07 07:46 · Score: 1

For me, drug addiction, poverty, world hunger, nuclear proliferation, racism, sexual harassment, and rising energy concerns have all been solved. Whew! Glad we got that out of the way.

Just because a problem is not having an obvious and overt effect on you personally doesn't erase your knowledge that something exists. Administrators are having a problem, they're telling you with their actions. If there was no spam there'd be no spam filters, if it wasn't getting worse they wouldn't need better ones. You clearly read /. you can't claim ignorance.
Re:Just make spam a crime! by Anonymous Coward · 2007-01-07 09:02 · Score: 0

bulk messages won't go away unless sending them is severely punishable and vigorously prosecuted in all nations that contribute to the problem.
Sending spam without marking it as such is already a crime. Much as outlawing drugs has done nothing to solve drug problems, making spam a crime has done absolutely nothing to solve the spam problem. Most spammers are hosted in the United States. If this issue bothers you so much, you can certainly write to your President and ask him to enforce the CAN-SPAM Act of 2003. That, however will also not make a difference. It will simply further fill up our jails with petty criminals which is already the reason the United States has the highest incarceration rate in the world.
Writing stricter laws isn't going to go over very well with people who support the first amendment. The CAN-SPAM Act is already pushing the limits of regulation over speech. You may note the CAN-SPAM Act only limits commercial speech. If the law had gone further, the ACLU, EFF, et al would have been all over it.
If you want to cut down on spam, you will have to solve the problem on your end with automated filters or by simply not giving out your email address to those who can't be trusted.
In summary, your post advocates a
( ) technical (x) legislative ( ) market-based ( ) vigilante
approach to fighting spam. Your idea will not work. Here is why it won't work. (One or more of the following may apply to your particular idea, and it may have other flaws which used to vary from state to state before a bad federal law was passed.)

( ) Spammers can easily use it to harvest email addresses
( ) Mailing lists and other legitimate email uses would be affected
( ) No one will be able to find the guy or collect the money
( ) It is defenseless against brute force attacks
(x) It will stop spam for two weeks and then we'll be stuck with it
( ) Users of email will not put up with it
( ) Microsoft will not put up with it
( ) The police will not put up with it
(x) Requires too much cooperation from spammers
( ) Requires immediate total cooperation from everybody at once
( ) Many email users cannot afford to lose business or alienate potential employers
( ) Spammers don't care about invalid addresses in their lists
( ) Anyone could anonymously destroy anyone else's career or business
Specifically, your plan fails to account for

(x) Laws expressly prohibiting it
( ) Lack of centrally controlling authority for email
(x) Open relays in foreign countries
( ) Ease of searching tiny alphanumeric address space of all email addresses
(x) Asshats
( ) Jurisdictional problems
( ) Unpopularity of weird new taxes
( ) Public reluctance to accept weird new forms of money
( ) Huge existing software investment in SMTP
( ) Susceptibility of protocols other than SMTP to attack
( ) Willingness of users to install OS patches received by email
(x) Armies of worm riddled broadband-connected Windows boxes
( ) Eternal arms race involved in all filtering approaches
( ) Extreme profitability of spam
(x) Joe jobs and/or identity theft
(x) Technically illiterate politicians
( ) Extreme stupidity on the part of people who do business with spammers
(x) Dishonesty on the part of spammers themselves
( ) Bandwidth costs that are unaffected by client filtering
( ) Outlook
and the following philosophical objections may also apply:
(x) Ideas similar to yours are easy to come up with, yet none have ever been shown practical
( ) Any scheme based on opt-out is unacceptable
( ) SMTP headers should not be the subject of legislation
( ) Blacklists suck
( ) Whitelists suck
(x) We should be able to talk about Viagra without being censored
( ) Countermeasures should not involve wire fraud or credit card fraud
( ) Countermeasures should not involve sabotage of public networks
( ) Countermeasures must work if phased in gradually

For true AI, you need 3d spacial recognition by CrazyJim1 · 2007-01-07 06:58 · Score: 1

All these word relation AI's make me laugh. We could have real AI if you wanted to put effort into it. Link

--
God spoke to me.

Re:For true AI, you need 3d spacial recognition by smallfries · 2007-01-07 10:01 · Score: 1

The ironic part is that when I went to click on the link, the Geocities account was already dead. And yet I didn't need to read the page to understand that the author was a crank. That's the thing about intelligence that nobody has ever managed to capture in a formal system.

--
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php

how about pen1s en1argement? by gamer4Life · 2007-01-07 07:06 · Score: 1

Do they substitute numbers for letters in their filtering?
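For what it's worth, matching that kind of substitution is more pattern matching than AI. Here is a rough sketch, assuming a hand-picked look-alike table (the `LOOKALIKES` mapping and the `keyword_pattern` helper are invented for this example and do not describe any real filter):

```python
import re

# ASSUMPTION: a tiny, hand-picked table of characters spammers swap in.
# Note that '1' can stand for either 'i' or 'l', so both entries list it.
LOOKALIKES = {
    "a": "a4@",
    "e": "e3",
    "i": "i1!|",
    "l": "l1|",
    "o": "o0",
    "s": "s5$",
}


def keyword_pattern(keyword):
    """Build a regex that matches `keyword` with look-alike characters swapped in."""
    parts = []
    for ch in keyword.lower():
        chars = LOOKALIKES.get(ch, ch)
        # Each letter becomes a character class of everything that may replace it.
        parts.append("[" + re.escape(chars) + "]")
    return re.compile("".join(parts), re.IGNORECASE)
```

With this, `keyword_pattern("enlargement").search("pen1s en1argement")` finds a match because the class for 'l' also accepts '1'. It does not handle filler characters inserted between letters, and a hit should only raise a spam score rather than delete the mail, since plenty of legitimate text will match too.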

associations... by pedantic+bore · 2007-01-07 07:20 · Score: 1

Given that the link distance between randomly chosen wikipedia articles is about five (sorry, don't have a link to where I saw this... and it was a while ago so maybe it's changed...) practically everything is going to be strongly associated with spam keywords.

I don't see how this is getting us anywhere except moving closer to having a spam filter that just returns "true" to anything that isn't white-listed.
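To make "link distance" concrete: it is just the shortest path between two articles in the link graph, which a breadth-first search finds. The sketch below assumes the links are already available as a plain dictionary; the `link_distance` function and any graph you feed it are illustrative, not real Wikipedia data.

```python
from collections import deque


def link_distance(links, start, goal):
    """Return the fewest link hops from `start` to `goal`, or None if unreachable.

    `links` maps an article title to the titles it links to.
    """
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, dist = queue.popleft()
        for nxt in links.get(page, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

If nearly every pair of articles comes back with a small distance, then raw reachability says very little, and a useful association measure would have to weight links somehow rather than just count hops.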

--
Am I part of the core demographic for Swedish Fish?

Looks like good research by MarkWatson · 2007-01-07 07:21 · Score: 2, Informative

I will read the paper when I get the proceedings for the International Joint Conference on Artificial Intelligence. From the article, this seems like a statistical natural language processing application: the examples looked like they collect statistics of associations for both single words and short word sequences.

BTW, associating, clustering, etc. documents using single word statistics is computationally cheap and easy - it is also associating short word sequences that makes this a difficult problem.
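A small sketch of why the sequence case gets expensive, assuming naive whitespace tokenization (the `ngram_counts` helper is made up for illustration and is not what the paper does):

```python
from collections import Counter


def ngram_counts(text, max_n=3):
    """Count every word sequence of length 1..max_n in `text`."""
    tokens = text.lower().split()  # ASSUMPTION: whitespace tokenization is enough here
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts
```

Single words repeat a lot, so the table of unigrams stays small. Longer sequences repeat far less often, so the number of distinct keys grows quickly with `max_n` and with corpus size, and that table is what you then have to store and associate.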

Re:Looks like good research by saddino · 2007-01-07 07:33 · Score: 1

The computational effort for short word sequences is no longer much of an issue. For example, the web clustering algorithm in the free application CQ web computes clusters in corpus phrases up to seven words in length, and it runs without a hiccup on your standard Windows or Mac desktop.

Not very "intelligent" by iamacat · 2007-01-07 07:28 · Score: 4, Insightful

There are lots of legit e-mails discussing vitamins, Viagra or even penis enlargement, this post included.

Just make word substitution a crime! by Anonymous Coward · 2007-01-07 07:34 · Score: 0

Substitute "piracy" for "spam" and reread your post.

Titter Ye Not... by Anonymous Coward · 2007-01-07 07:41 · Score: 0

: Some of the penis spam I've been getting lately is really impressive in how oblique a reference to sex
: can be and yet still be immediately understandable.

Hmm.... the Carry On films and Up Pompeii were doing that in the 60s and 70s... :)

Not New, not newsworthy by Sub+Zero+992 · 2007-01-07 07:42 · Score: 3, Informative

Anybody who has been working in the field of NLP (natural language processing) can do little more than sneer at this story.

The field of word sense exploration is one of the more mature areas of NLP, take a look at Princeton's WordNet database for an example [http://wordnet.princeton.edu/]. Using their word sense database (without referring to silly words such as "ontology") it has been possible - for years - to discover if two lemmas (that's "words" to you) are related in a particular way, or not related. Using wordnet it is possible to distinguish between antonyms and homonyms, thereby thwarting spammers who use words which sound like "viagra" - "niagra" and words which have opposite meanings.

--
They who would give up an essential liberty for temporary security, deserve neither liberty or security - Ben Franklin

Re:Not New, not newsworthy by Virtual_Raider · 2007-01-07 13:11 · Score: 1

Anybody who has been working in the field of NLP (natural language processing) can do little more than sneer at this story.
Along with the title, that is one of the most useless comments one finds in /.
It is news to many of us —the great majority of readers I dare say— because we are nerds that come from different fields. I bet I could come up with common knowledge from cellular telephony that you haven't heard about and it would be news to you. If it was sufficiently interesting, it would even be newsworthy even if it's been kicked around base stations for 4 years.
You make it sound like you have deeper knowledge of the subject and it would serve us much better (us being both you and we the aliens to the field) if you expanded with insider comments rather than saying "phew, I knew this all along".
And to try to get back on topic, I would be very interested in hearing about how are they going to use the general knowledge of the wiki to filter out advertisement. For instance, let's say that an email that contains B12 is talking about a plane and not the vitamin, what other elements should the program take into account to distinguish this? And what if a B12 vitamin mail is not an advertisement but rather a general-interest article that one of my co-workers is mailing to me and I don't have him on a white list because he hasn't mailed me before? Would a program be aware today (not in 4 to 35 years) that the content is informative rather than commercial in nature? Are they even aiming for that?

--
+Raider of the lost BBS
Re:Not New, not newsworthy by dodongo · 2007-01-07 15:34 · Score: 1

I would be very interested in hearing about how are they going to use the general knowledge of the wiki to filter out advertisement. For instance, let's say that an email that contains B12 is talking about a plane and not the vitamin, what other elements should the program take into account to distinguish this?

I do think you may have been a bit harsh on grandparent; I for one, having done some work in NLP, was wondering whether anyone else was really questioning the newsworthiness of the post. So you can, of course, imagine my relief to find someone with some NLP experience to be sneering. Now I feel a little validated.

Both you and grandparent bring up good points about searching databases WRT ontological / semantic automated processing. I have exactly the same questions you do about the potential for disambiguation.

Maybe I'm wrong -- and please, correct me, especially with sources for more information if you have them -- but it seems like there is an important separation between lexical (or dictionary, if you will) knowledge and encyclopedic knowledge. Just as in the two types of reference book, you look in lexical knowledge to find basic information about words and their senses. It's the encyclopedic knowledge that gives you detailed, background information about what's happening. Presumably, this is the most useful information in fine-tuning automated disambiguation, right?

But it should only be in extremely sophisticated instances of ambiguous words in text that a system has to go back to the encyclopedia to find the correct sense of a word. If your ontological implementation is reasonable, shouldn't you already be searching for "mouse" in computer-sense when dealing with a technology article? And shouldn't, likewise, the system "get" from deciphering a biology text that "mouse" there refers to a rodent?

But the problem, as parent points out, is how you actually implement such disambiguation, even with a large, agile encyclopedia to work off of. Specifically in the realms of jargon, etc., there are all sorts of instances where people come up with relatively novel abbreviations, coding systems, etc. Even rich encyclopedic knowledge or the best currently-functional ontological knowledge base wouldn't help this.

And then I also have to ask "well, what the hell, aren't we just playing the 'gotcha' game?" Props to the researchers for finding a new way of using Wikipedia. IMHO, NLP researchers are all too quick to fall back on the "world knowledge" excuse, half-knowing that this is a gaping problem with many implementations, while still confident in their response because it sufficiently distracts from the issue at hand. Nice work, folks -- I'm looking forward to seeing more!

Artificial is best by kbox · 2007-01-07 07:49 · Score: 1

Using Wikipedia for artificial intelligence makes far more sense than using it for actual intelligence.

It's not so much a collection of facts as it is a collection of widely believed notions.

--
God Be Gone

Hold the phone by Anonymous Coward · 2007-01-07 07:56 · Score: 0

Spam blocking isn't rocket science.

Block: Anything from, through or similar to a gateway that has previously been marked spam.
Block: Anything that appears to be a price list.
Block: Anything that includes an attachment unless the sender is on your whitelist.
Block: Anything with URLs that are not from domains in the user's contact list or domain whitelist.

Or (preferred):

Block: Anything not on the user's whitelist.

If you want to get on my whitelist, find some other way to contact me first.
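A rough sketch of the rules above, assuming a message is just a sender, a body, and an attachment flag. The names Message, Policy and should_block are made up for illustration, and the gateway and price-list rules are left out because they need data the post doesn't describe.

```python
import re
from dataclasses import dataclass, field
from urllib.parse import urlparse

@dataclass
class Message:
    sender: str
    body: str
    has_attachment: bool = False

@dataclass
class Policy:
    whitelist: set = field(default_factory=set)        # trusted sender addresses
    trusted_domains: set = field(default_factory=set)  # domains allowed in URLs

URL_RE = re.compile(r"https?://[^\s>\"']+")

def should_block(msg, policy, strict=False):
    """Return True if the message should be blocked.

    strict=True is the "preferred" rule: block anything not on the whitelist.
    """
    sender = msg.sender.lower()
    if strict:
        return sender not in policy.whitelist
    # Attachments are only accepted from whitelisted senders.
    if msg.has_attachment and sender not in policy.whitelist:
        return True
    # Every URL in the body must point at a trusted domain.
    for url in URL_RE.findall(msg.body):
        host = (urlparse(url).hostname or "").lower()
        if host not in policy.trusted_domains:
            return True
    return False
```

With strict=True the whole thing collapses to a single set lookup, which is presumably why it's the preferred option.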

Make the people accountable by thePig · 2007-01-07 08:00 · Score: 1

This is a little off-topic, but I guess the only way to take out this menace of spam is to make the average joe accountable.
If the spam originated from a botnet in his machine, make him accountable too.

If he has installed the latest updates from Microsoft and still the botnet could get in, then it is not an issue. But, if he has not taken the effort to download the patches for say, the last 6 months, and a botnet operated from his machine, causing discomfiture to all and sundry, then he is accountable for it.

Push forward legal actions against the 'joe' and we would see a real increase in the understanding of computers' fallibility and a real decrease in the amount of spam.

--
rajmohan_h@yahoo.com

Look up Abstraction Physics by 3seas · 2007-01-07 08:07 · Score: 1

http://threeseas.net/abstraction_physics.html

considering the article is from physorg......

and to think they plan to patent it? Abstraction Physics?

I don't think so...

Perhaps this is all that we were missing for AI by alexwcovington · 2007-01-07 08:14 · Score: 1

A knowledge base with associative retrieval capability has eluded researchers but they have one in Wikipedia. Now if only they can get AI to successfully [and hopefully, correctly] modify the knowledge base...

--
(It's never too late to join the Renaissance)

Re:Perhaps this is all that we were missing for AI by kalirion · 2007-01-08 05:05 · Score: 1

Something like wikipedia will definitely be needed for people attempting to create true AI. The best part is that it can be easily gotten on CD (or is it DVD?), so the computer with the AI can be completely isolated from the outside world. You know, to avoid the Skynet scenario.

They don't need truth. by Anonymous Coward · 2007-01-07 08:16 · Score: 0

You seem to be under the impression that the AI is designed to figure out whether a given e-mail is reliable or accurate. It is not. It is designed to figure out what the subject of a spam actually is. If a letter is titled "Hi! It's your Uncle Harold!" and inside is a Markov-chain generated letter on the subject of "v1aqra", a conventional spam filter may have trouble understanding that the letter is selling pills. Bayesian approaches come close, but they're in the hands of the spammers too... spammers just check their algorithms against the filter and try to get a low score.

What these researchers need is a large number of articles on a variety of subjects a human being would not describe as "nonsense." It doesn't matter whether the wikipedia article claims the common cold is caused by a virus or by swamp gas, the AI will still learn that the common cold is often associated with coughing, sneezing, sniffles or a mild fever. Viagra is associated with sex, ladies, satisfaction and inversely associated with penile pumps, spanish fly and oysters. A program that understands this is more likely to catch a cleverly generated spam.
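To make the "truth doesn't matter" point concrete, here is a toy sketch that scores how strongly two words are associated by how often they turn up in the same article. This is not the researchers' method, just plain co-occurrence (Jaccard overlap), and the three "articles" are invented for the example.

```python
import re
from collections import defaultdict

def build_index(articles):
    """Map each lowercase word to the set of article ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(articles):
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            index[word].add(doc_id)
    return index

def association(index, a, b):
    """Jaccard overlap of the article sets for two words (0.0 to 1.0)."""
    docs_a, docs_b = index.get(a, set()), index.get(b, set())
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0

# Invented snippets: the two "cold" articles disagree about the cause.
articles = [
    "The common cold is caused by a virus and brings sneezing and coughing.",
    "Some say the cold comes from swamp gas; either way expect sneezing.",
    "Vitamin B12 is a vitamin found in meat and dairy.",
]
index = build_index(articles)
```

Both cold articles mention sneezing, so association(index, "cold", "sneezing") is high even though they disagree about virus versus swamp gas, while "cold" and "b12" never share an article.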

My question is whether this program will associate the acronym "AI" with the adjective "burgeoning." The association with this cliche is so strong in my mind I was sure I saw it in the summary, but it seems I was wrong. That's how human brains work.

Google is burgeoning too.

Re:They don't need truth. by Assassin+bug · 2007-01-07 09:15 · Score: 1

Good point. I suppose the AI is learning habit more than word definitions.

Hutter Prize by Baldrson · 2007-01-07 08:26 · Score: 2, Informative

As has been previously reported on slashdot, The Hutter Prize for Lossless Compression of Human Knowledge uses a snapshot of Wikipedia for rigorously benchmarking AI (and it has already had its first payout).

The rigor of the benchmark is the key. The Turing Test really only benchmarks human mimicry -- not intelligence per se. The new theoretic basis of universal intelligence allows a mathematically rigorous approach to AI that is reviving the field after nearly 50 years of drifting in a stagnant pool of inadequate concepts.

--
Seastead this.

I, for one, welcome our ... by bentrop · 2007-01-07 08:54 · Score: 1

One would think that AI and Wikipedia is a great combination. Isn't it comforting to know that our future metal overlords will have a profound understanding of 'goatse' and understand every single Simpsons reference?

But spammers can add content to Wikipedia by dpbsmith · 2007-01-07 08:59 · Score: 1

This is the biggest threat to Wikipedia I've heard in a long time.

If Wikipedia content is used to determine whether a message is spam, suddenly there is a direct incentive to spammers to add spam-related content to Wikipedia.

--

"How to Do Nothing," kids activities, back in print!

Re:But spammers can add content to Wikipedia by sbaker · 2007-01-07 11:00 · Score: 1

In the particular example given, a spammer trying to sell Vitamins using the word 'B12' would have a strong incentive to scan Wikipedia and remove all instances of the word 'B12' wherever it was found - and perhaps even to insert it spuriously in a few places where the end user might be white-listing words too.

This would be very bad indeed for Wikipedia because it gives a motive to vandals - and not just to the stupid vandals we have right now - but to the annoyingly inventive ones too.

Urgh!

--
www.sjbaker.org

This is not new by Anonymous Coward · 2007-01-07 09:08 · Score: 0

This was first pioneered by Princeton (http://wordnet.princeton.edu/) and MIT (http://www.conceptnet.org/). People are building 'conceptnets' all over the place

Not only are they not the first to build a conceptnet, they are also not the first to build one using Wikipedia as their source.

I will contest this personally if they try to patent it.

As I've Said Many Times Before by Master+of+Transhuman · 2007-01-07 09:35 · Score: 1

Conceptual processing is the ONLY way to deal with these issues.

For example, what if I'm getting information sent to me from acquaintances about life extension - references to vitamins and nutrients would abound. But it wouldn't be spam.

An AI spam blocker has to know what I'm interested in, what material I've received before that was cleared, AND has to be able to, in some sense, UNDERSTAND the content rather than just correlating it to other terms atomically in terms of frequency of occurrence. Otherwise, how can it weed out material that correlates BOTH with spam and non-spam?

Without some decent implementation of conceptual processing, this just isn't possible.

--
Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!

Hutter Prize - a little realism is in order by Anonymous Coward · 2007-01-07 10:35 · Score: 1, Interesting

"The new theoretic basis of universal intelligence allows a mathematically rigorous approach to AI that is reviving the field after nearly 50 years of drifting in a stagnant pool of inadequate concepts."

That is a gross overstatement of both Hutter's success at solving useful AI problems and his influence in the AI community, to say the least. Just because it happens to be your favorite theory doesn't mean it has actually revolutionized the whole field of AI.

Lameness filter by compandsci · 2007-01-07 10:41 · Score: 1

Who needs AI for spam filters? Just use the lameness filter: if lameness(new_mail) > 94 { bounce(new_mail, bill.gates@microsoft.com); delete(new_mail); }

PBEM by Alsee · 2007-01-07 11:17 · Score: 1

the computer will use its Wikipedia-based knowledge base to infer that 'B12' is strongly associated with the concept of vitamins, and will [] identify the message as spam

Ha Ha! Blocked!

You didn't sink my battleship!

-

--
- - You can't take something off the Internet! That's like trying to take pee out of a swimming pool.

The double-edged sword that is knowledge by NetSettler · 2007-01-07 11:55 · Score: 1

This is the biggest threat to Wikipedia I've heard in a long time. If Wikipedia content is used to determine whether a message is spam, suddenly there is a direct incentive to spammers to add spam-related content to Wikipedia.

Personally, I think spammers are already much smarter than this. It may be my imagination, but if so it's surely coming, that spammers are grabbing text from places they harvest my name and just including that text in messages rather than trying to make up things from scratch. Since they want to sell me something related to things I do, doing this gives them natural camouflage since the text tends to be on topic anyway.

Also, filling Wikipedia with spam is the least of our problems. The more subtle problem is the apparent assumption in all the replies here that the spammers won't use the same technique. That is, if they know Wikipedia is being consulted to tell what words mean, then all they have to do is consult Wikipedia to find misunderstandings they can associate. e.g., it might find that B12 was a possible Bingo number, or perhaps it would find that Boeing once made an airplane named the B12, or they might find it's an isotope of Boron, or...

Knowledge is not a cure for anything. Learning, and improving technique, are ways of staying ahead, but also ways of upping the stakes. When everyone is on an even playing field knowledgewise, that knowledge is no longer a tactical advantage.

The front on which spammers could easily be brought down is not knowledge but money. The spammers don't have the money to pay for all that spam: they just penalize the rest of us for having made it free by abusing our good will. If email were made to be pay-only, it could destroy the economy of scale that spammers enjoy. And perhaps if Wikipedia becomes an important resource, making Wikipedia's use be pay-only could fix the problem. Not that it's likely to happen--I'm just observing the opportunity.

The same has been noted about the "recreational" drug trade, though: legalizing such drugs (whatever you think of the issue of use), would likely drive the price down. Speculation has it that they remain illegal in part because the illegal drug trade likes the price advantage of having things be illegal, and that they are some of the loudest to remind us that it would be immoral to legalize them. So it's hardly surprising that spammers are some of the first among us to scream about the immorality of pay-per-message email. In both cases, we continue to pay anyway: we just pay for spam removal and fighting the drug war. As long as we don't count those activities as a cost, we continue to think the price would be high to change the way things work.

Direct physmail, by contrast to email, is a minor irritation because it's paid for by the sender (even if at a discount that I might not agree with). And the availability of World Book or Compton's Encyclopedia in hardcopy has never been a way of overcoming that issue. The fact that money is charged for physmail postage is the thing that wins out. It means the sender must give thought to whether the recipient really cares, and must target mail in a way that's a win-win. No such thought is required in email because the cost is entirely negligible.

--

Kent M Pitman
Philosopher, Technologist, Writer

Text of IJCAI paper by gvc · 2007-01-07 13:26 · Score: 2, Informative

http://www.ijcai.org/papers07/Papers/IJCAI07-259.pdf

While IJCAI is a prestigious conference, and the results may be sound, the claims as to the applicability to spam filtering are bogus. The paraphrase of how state-of-the-art filters work is wrong, and there's no evidence that better word associations translate to better spam filter accuracy. None at all.

Should the authors wish to show applicability to spam filtering, they should do so using the TREC Spam Track methodology and datasets. http://trec.nist.gov/data/spam.html

The call for participation in TREC 2007 is currently open: http://trec.nist.gov/call07.html Nothing at all prevents a TREC participant from submitting a filter that includes a copy of Wikipedia, if they feel it would help.

OCR unnecessary by gvc · 2007-01-07 13:43 · Score: 1

The Bayesian analysis in spam filters only works on text. Spammers realized that they could get around it by filling the text portion of the message with some random passage from a Project Gutenberg file, thus making it seem innocuous, and then putting the real advertisement in a GIF or PNG file that would be displayed by HTML-capable mail readers. Bayesian analysis can still work, but only in combination with OCR software.

Bayesian filters (and other statistical filters colloquially known as Bayesian) can work on any features at all; not necessarily text. In particular they can use the markup in the header of the message, the message encoding, and so on. Some of the best-performing filters don't use 'text' at all and simply treat the entire message, images and all, as a bit string; for example, compression-based filters. Another well-performing filter, OSBF-Lua, uses orthogonal sparse binomial bigrams rather than individual tokens.

Recent standardized testing shows that these methods work just fine on image spam, without any OCR component.
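For anyone wondering what a compression-based filter looks like, here is a minimal sketch of the general idea, assuming zlib as the compressor and two tiny made-up corpora. It is not the algorithm of any particular TREC entry. The message is treated as raw bytes and assigned to whichever class's corpus it compresses best against.

```python
import zlib

def extra_bytes(corpus: bytes, message: bytes) -> int:
    """Compressed-size growth when the message is appended to the corpus."""
    return len(zlib.compress(corpus + message)) - len(zlib.compress(corpus))

def classify(message: bytes, corpora: dict) -> str:
    """Return the label whose corpus 'explains' the message most cheaply."""
    return min(corpora, key=lambda label: extra_bytes(corpora[label], message))

# Made-up training data; a real filter would keep a rolling corpus per class.
corpora = {
    "spam": b"cheap pills buy now limited offer click here to order today\n" * 5,
    "ham": b"meeting notes attached, see you at lunch tomorrow afternoon\n" * 5,
}

label = classify(b"buy now limited offer click here", corpora)
```

Nothing here tokenizes or even decodes the message, which is the point: headers, encodings and image bytes all go through the same path.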

Who needs AI? by DarkProphet · 2007-01-07 14:03 · Score: 1

Seriously. FWIW, I am for the most part a Google fanboy.

I have had my GMail account for what, two years or so, and I really don't think google's spamfilter has ever missed a beat. That is to say that all the real spam I receive every day (~40 to 100 spams depending on the day) ends up in the spam folder, not my inbox. Spam is a total non-issue for me. OTOH, my hotmail inbox is so atrocious and the spamfilter so bad that I can't use the account for anything important. I don't know what kind of black magic they have going on at the Googleplex, but it WORKS! Maybe they do use some form of AI, but I assume they don't use what I'd call 'smart' AI. If that assumption is correct, then the spam problem doesn't seem to need AI as its solution.

--
What could possibly hurt the security of the American people more than giving our own government the ability to hide its

What problem? by Anonymous Coward · 2007-01-07 19:01 · Score: 0

I don't see what the problem is about.

Whitelist friends, family and information sources you know about.

Block non-English character sets. Block anything with attachments, especially images. Block anything with l33t or misspelled words. Block HTML mail. Block mail which has arrived at myaddr1 and myaddr3 as well as myaddr2. Block anything over a certain size. Strip any multi-sentence fragment which shows up in Gutenberg and recalculate.

Some people might say that this would block email from management, newbies, grandma, and those 'friends' who spam everyone with lame jokes and Youtube videos. I say "and your point is?"

Wikipedia 3.0: The End of Google? by Anonymous Coward · 2007-01-07 19:47 · Score: 0

This article got it right first, now everybody is playing catch up including Wikipedia's founder with his newly announced Wikia semantic search engine project. http://evolvingtrends.wordpress.com/2006/06/26/wikipedia-30-the-end-of-google/

Intelligence? by Anonymous Coward · 2007-01-07 20:34 · Score: 0

One might as well call it crap....

Mine Slashdot headlines by Ed+Avis · 2007-01-07 22:38 · Score: 1

A common tactic to defeat spam filters is to misspell words. The filters should look at the output of the Slashdot editors over the past decade to see what the common mistakes are.

--
-- Ed Avis ed@membled.com

on that subject by Ossadagowah · 2007-01-08 00:09 · Score: 1

If the A.I. works like many editors on Wikipedia, then the end result will be another intolerant fundie:

ERROR! Your original research is not welcome!
We cannot use that data as the citation format has changed. REVERTED

--
anata sekai o kakumei surush ga nai deshou? Anata no susumu michi wa yoi shite arimasu.

Pretty helpful for people in the pharma business by bpadinha · 2007-01-08 01:51 · Score: 1

Most certainly I'm missing something, but what does this mean for people whose work actually involves talking about B12 or other vitamins?

--
--- "The idea is to die young, as late as possible." -- Ashley Montague

The best we can do? by oftencloudy · 2007-01-08 02:45 · Score: 1

So with a wealth of knowledge from people around the globe, the best use of AI derived from this pool of information is to create a spam filter? Oh yeah, this AI deserves some government grants. Can't anyone think of a better way to use this?

--
But whatever the object, you must keep him praying to it. To the thing he has made, not to the person that has made him.

The target hasn't moved by ClosedSource · 2007-01-08 03:16 · Score: 1

The issue isn't understanding how AI "works", it's understanding how to make AI work. AI isn't a moving target, we just keep assuming we're closer to it than we really are.

Skynet by mikeee · 2007-01-08 03:48 · Score: 1

Obviously, Judgement Day will be triggered by Skynet in a final, frustrated attempt to eliminate spammers.

I dinna think it means what the AI thinks it means by HTH+NE1 · 2007-01-08 06:24 · Score: 1

And just because your Aunt Sally doesn't want to receive spam about vitamins doesn't mean she wants to miss her weekly Bingo e-mails.

--
Oh, say does that Star-Spangled Banner entwine / The myrtle of Venus with Bacchus's vine?

Uhm... what color is the sky in your world? by Gary+W.+Longsine · 2007-01-08 06:31 · Score: 1

I think the point is that many, if not most email users find themselves wading through a sea of spam despite the multiple layers of content filtering that happen between the point of origin and their inbox. The AC is partly right. Content filtering has merely delayed the death of email.

College students these days are often heard to say, "I have an email address but I never use it." They prefer their cell phones because voice and SMS text messages are not yet flooded with spam. Email may not be dead, but it's definitely gasping for air.

--
If you mod me down, I shall become more powerful than you could possibly imagine.

Using AI to stop spam by InsertCleverUsername · 2007-01-08 07:02 · Score: 1

Could we take this one step further and use Wikipedia in something from Cyberdyne Systems, programmed to seek out and apply napalm directly to spammers?

--
Ask me about my sig!

intelligence, artificial or otherwise? by samantha · 2007-01-08 16:26 · Score: 1

The word "vitamin" in a message means it is spam? Methinks that the intelligence should be applied to better test for what is spam rather than simple minded associated term collecting for hot words from various online sources. Bayesian filters are much better than this already and do not require wikipedia reading to do their jobs with 99% accuracy after fairly minimal training.
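For reference, a bare-bones sketch of the kind of Bayesian filter meant here, assuming word-level tokens, Laplace smoothing, and a handful of invented training messages. Real filters add a lot on top of this (header features, token pairs, and so on).

```python
import math
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

class NaiveBayes:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.word_counts[label].update(tokens(text))
        self.doc_counts[label] += 1

    def spam_probability(self, text):
        """P(spam | text); needs at least one training message per class."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_score = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokens(text):
                # Laplace smoothing so unseen words don't zero the product.
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            log_score[label] = score
        # Convert the two log scores into a probability for "spam".
        return 1.0 / (1.0 + math.exp(log_score["ham"] - log_score["spam"]))

nb = NaiveBayes()
nb.train("buy cheap vitamin pills now", "spam")
nb.train("cheap pills special offer", "spam")
nb.train("vitamin b12 results from the lab", "ham")
nb.train("lunch meeting tomorrow", "ham")
```

With this toy training set, "cheap pills" scores well above 0.5 while "vitamin" on its own lands at even odds, because it shows up equally often in both classes. A single hot word doesn't decide anything.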

paper by mountain_penguin · 2007-01-08 20:05 · Score: 1

I think this is described in more detail in a paper that was presented at IJCAI 07 this morning. It was fascinating. here is the paper and http://www.eml-research.de/english/homes/strube/papers/aaai06.pdf is another paper on the same subject

Must improve interfaces by AJanuary · 2007-01-11 02:37 · Score: 1

The stronger the filters get the more needs to be done to improve how we are shown spam.
The filter can say that it's 40% sure that that email is spam, but I can tell 100%. Major clients need better interfaces for manually monitoring what the filter has deemed to be spam.
My preferred solution is to expose how certain the filter is that a message is spam via a colour coding system, and enable users to filter and sort via this certainty. You can then review only the top nth percent of your spam to make sure that it is definitely spam.
Combined with the hopeful move in improving interfaces I welcome ideas like this that should, fingers crossed, help catch more spam.
It also has interesting applications in other areas, as the article mentions, and AI as a whole.
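A quick sketch of the colour-coding idea, assuming the filter hands the client a spam probability between 0 and 1 for each message. The thresholds, colour names and function names are arbitrary choices for illustration, and "top nth percent" is read here as the messages the filter is least sure about.

```python
import math

def colour_for(score):
    """Map a spam probability (0.0 to 1.0) to a display colour."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= 0.9:
        return "red"      # filter is nearly certain
    if score >= 0.5:
        return "orange"   # probably spam, worth a glance
    return "green"        # filter is doubtful

def review_queue(messages, fraction=0.2):
    """Return the least-certain fraction of the spam folder, most doubtful first.

    messages is a list of (subject, score) pairs.
    """
    ranked = sorted(messages, key=lambda item: item[1])
    count = math.ceil(len(ranked) * fraction)
    return ranked[:count]

spam_folder = [
    ("Cheap pills", 0.99),
    ("Re: lunch", 0.41),
    ("Bingo night B12", 0.55),
    ("You won", 0.97),
    ("Invoice", 0.62),
]
to_review = review_queue(spam_folder, fraction=0.4)
```

Here to_review holds only the two most doubtful messages, so the user checks those by hand and leaves the red ones alone.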
