Using AI for Spam Filtering (w/ Source Code)

Posted by CmdrTaco on Sunday July 11, 2004 @01:15AM from the i-can't-do-that-dave dept.

jarhead4067 writes "Article snippet: "Up until recently, most researchers in the fight against spam have failed to classify it as an artificial living organism, hindering the development of effective tools and techniques to kill it. While this classification may sound strange, consider the following..." A novel approach to filtering spam, and hey, there's free source included."

11 of 197 comments

Buzzword Bingo! by Anonymous Coward · 2004-07-11 01:18 · Score: 0, Informative

"artificial living organism"
BINGO!
Google cache by cs02rm0 · 2004-07-11 01:21 · Score: 5, Informative

Google cache
Re:already slashdotted :( by Anonymous Coward · 2004-07-11 01:31 · Score: 1, Informative

Even more mysteriously, who is Dave and what can't Taco do to him?

2001. Duh.
The Article by Maddog Batty · 2004-07-11 01:36 · Score: 4, Informative

Introduction

Spam has become the first great plague of the 21st century. Over 60% of all e-mails are spam, costing U.S. corporations more than $10 billion annually, on top of the productivity lost from scanning through e-mail and deleting spam. In addition, an estimated 5% of spam campaigns are outright scams, and most of the rest pitch products that are dubious at best. Parents used to worry about their kids surfing the web and finding pornographic websites; now we have to worry more about our kids opening an e-mail client and finding a pornographic spam message. Spam must be stopped before it cripples the infrastructure of the internet and drives users away from one of the greatest forms of communication: e-mail.

Can Laws Defeat Spam?

No. This is one of the most common misconceptions among users. The internet is just that, an "INTERnational NETwork" that cannot be governed by one country's laws. Spammers can exist anywhere on the internet, meaning they can sling their wares from anywhere in the world, making the laws of any one country irrelevant. Also, the decentralized, self-organizing design of the internet makes it nearly impossible to regulate by external means. It would be easier to regulate the weather than to regulate the internet.

Spam as a Living Organism

Up until recently, most researchers in the fight against spam have failed to classify it as an artificial living organism, hindering the development of effective tools and techniques to kill it. While this classification may sound strange, consider the following:
1. Spam evolves and adapts according to the rules of natural selection
  Throughout the fight against it, spam has demonstrated an uncanny ability to adapt to the conditions of its environment, namely the internet. When one barrier against a strain of spam is put up, another, resistant strain appears. This is similar to how bacteria build immunity to antibiotics: the strains that are not immune die, while the ones that are immune take over and become the dominant, drug-resistant strain. This leads to the belief that spam will not die until the barriers in its environment evolve faster than it does.
2. Spam lives within an eco-system, and we're its food
  The internet is a complex chain of systems that all rely on each other for survival. Without an internet protocol, a web browser couldn't exist. Without web servers, the web wouldn't exist. Without ... (you get the picture). This chain of systems can be likened to an eco-system, with spam existing as a parasitic species within it. It consumes resources (bandwidth, servers, time) in its attempt to reach its primary host: us. Once spam reaches its target, its sole purpose is to solicit its "food" from us, primarily money. If it is effective, that strain of spam lives and continues to propagate; otherwise it dies. Can the internet eco-system be modified so spam can't feed?
3. Spam has genetic traits and markers
  Just like any organism, spam contains certain traits that uniquely identify it. These can be a combination of words, information inside the header of the e-mail, the format of the message (HTML, plain text, RTF), the message encoding (base64), whether it contains image links, the number of links, whether it contains hidden text, and so on. Up until recently, spam filters have primarily focused on just one of these traits, the wording of the e-mail. Spam, being an organism, evolved so this marker was hidden within its code, making it difficult at best to filter. It did this by including random, non-spam words in hidden areas of the e-mail, by rewriting words like Viagra as V1@gr@, by sending spam as image links, and by encoding the message in a format that filters could not read. The good news is this "gene" is still present, and can be unlocked by identifying the defensive genes wi
--
wot no sig
How is this news ? by janoc · 2004-07-11 01:45 · Score: 5, Informative

How exactly is this news? It seems that the author of the neural network idea didn't do his homework: DSPAM, for example, already includes a neural network as an experimental classifier. And compared to the proposed C# solution, DSPAM is a widely used and mature product.

Regards, Jan
Entirely bogus by Anonymous Coward · 2004-07-11 01:57 · Score: 3, Informative

The entire concept is quite ridiculous.

The guy proposes picking nine well-known indicators of spam, ones that could be (and often are) implemented in rule-based spam checkers, then proposes we use a neural network to evaluate a message based on these metrics.

Problems:

1) If you detected spam indicators, this is indicative of spam, no? The whole "fancy" bit of this technique is thus needless.

2) These indicators are not inherent to spam; they just represent the most current bypass and obfuscation techniques. If you filter on them, spam will evolve away from them. There is nothing that makes his spam filter keep up with the arms race.
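
To make point 1 concrete, here is a minimal sketch of what "a neural network over spam indicators" reduces to in its simplest form: a single sigmoid unit that weights each indicator and squashes the sum into a score between 0 and 1. The indicator names, weights and bias below are hypothetical values picked for illustration, not taken from the article's source code; in the article's scheme they would presumably be learned from labelled mail.

```python
import math

# Hypothetical, hand-picked weights for illustration only.
WEIGHTS = {"is_html": 1.0, "is_base64": 1.5, "has_hidden_text": 2.0}
BIAS = -2.5


def spam_score(indicators):
    """Combine 0/1 (or fractional) indicator values into a score in (0, 1)."""
    z = BIAS + sum(
        weight * float(indicators.get(name, 0))
        for name, weight in WEIGHTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) activation


if __name__ == "__main__":
    print(spam_score({"is_html": 1, "is_base64": 1, "has_hidden_text": 1}))
    print(spam_score({"is_html": 1}))
```

With fixed weights like these the unit behaves much like a weighted rule-based checker, which is the objection raised above; any advantage would have to come from retraining the weights as spam changes.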
This Guy's an Idiot by magefile · 2004-07-11 02:51 · Score: 3, Informative

For starters, he thinks Internet is short for "INTERnational NETwork" as opposed to a NETwork between entities (vs. network within an entity: intranet).

Then, his criteria:
Is the format of the e-mail HTML?
This is not a bad criterion.

Is the e-mail formatted in valid HTML?
Have you ever seen a commercial program (esp. Word, used by Outlook) generate good, 100% valid HTML?

Is the e-mail encoding base64?
No argument here. Unless base64 could be confused with Unicode - don't think so, but not sure.

Does the e-mail contain image links?
Does the e-mail contain "hidden" text that the user cannot see?
Heck, yeah, block it.

Does this e-mail have a large number of recipients?
Most of the spam I get has fewer than 5 recipients, and a lot of my mail is from a listserv with more than 5 recips.

What's the ratio of links to words in this e-mail?
I generally see only one or two links in my spam. Although I do see zero links in most of my ham.

What's the ratio of misspelled words to words in this e-mail?
Dear lord, no. This is a worthless criterion. Maybe if you looked for a ratio of non-letters (@, |, etc) to letters, but not spelling.

What's the Bayesian spam probability of this e-mail?
WTF does this have to do with AI?

Basically, he's stated the obvious, then made some really idiotic assumptions. Plus a shitload of spelling and grammar errors.
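
For anyone who wants to try a few of these criteria against their own mailbox, below is a rough sketch that computes some of them with Python's standard `email` module. It covers only the easy ones (HTML format, base64 encoding, recipient count, link-to-word ratio) plus the non-letter-to-letter ratio suggested above in place of spell checking. The function name, the dictionary keys and the link regex are assumptions made for this example, and the HTML-validity and hidden-text checks are left out because they need a real HTML parser.

```python
import re
from email import message_from_string
from email.utils import getaddresses

# Simplistic link matcher (assumption): anything starting with http:// or https://
LINK_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def extract_indicators(raw_message):
    """Return a dict of simple spam indicators for one raw message."""
    msg = message_from_string(raw_message)
    is_html = False
    is_base64 = False
    chunks = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        if part.get_content_type() == "text/html":
            is_html = True
        encoding = (part.get("Content-Transfer-Encoding") or "").strip().lower()
        if encoding == "base64":
            is_base64 = True
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            # Charset handling is simplified: everything is read as UTF-8.
            chunks.append(payload.decode("utf-8", errors="replace"))
    body = "\n".join(chunks)

    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    words = body.split()
    links = LINK_RE.findall(body)
    letters = sum(1 for c in body if c.isalpha())
    non_letters = sum(1 for c in body if not c.isalpha() and not c.isspace())

    return {
        "is_html": is_html,
        "is_base64": is_base64,
        "recipient_count": len(recipients),
        "link_ratio": len(links) / max(len(words), 1),
        "non_letter_ratio": non_letters / max(letters, 1),
    }


SAMPLE = (
    "From: a@example.com\n"
    "To: b@example.com, c@example.com\n"
    "Content-Type: text/html\n"
    "\n"
    "<b>Buy V1@gr@ now</b> http://example.com/x\n"
)

if __name__ == "__main__":
    print(extract_indicators(SAMPLE))
```

Note that HTML tags and URLs inflate the non-letter ratio, so a serious version would strip markup before counting.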
Re:Ham filtering by david.given · 2004-07-11 03:53 · Score: 2, Informative

Basically the present thinking is based on attempting to filter spam out - I would argue that given the number of variables involved, it is a method doomed to failure. Current methods also assume that the incoming mail is mostly valid, and are attempting to remove the undesirable parts - spam.
The problem with this approach is that you run the risk of throwing away ham. Because you're starting with mixed spam and ham, and you're picking out the ham, you don't know for sure that what's left is pure spam. Traditional approaches are safer, because they take mixed spam and ham and throw away only what is known to be spam. Therefore (unless the spam selection process is overeager) they won't throw away ham.
(I feel hungry now...)
I use a greylister. It's brilliant. It reduces the amount of spam I get from about 100 to 150 messages per day to about 5 --- and because it does this before the messages are transferred to my machine, I don't even get the overhead of running them through spamassassin or even my MTA.
Greylisting implements the old sender-pays spam filtering system by exploiting the SMTP system. It requires messages to be sent twice: the first time it's rejected with a try-again-later reply. This makes it the sender's responsibility to store the message and resend it --- this is the cost. As most spam engines aren't real SMTP servers, they usually don't bother to retry. Real messages, however, will arrive about half an hour late. (You then implement lots of optimisation so that you don't bother greylisting messages from known good senders, etc.)
Advantages? It's highly effective. It's completely standards-compliant. It's 100% safe; it won't lose ham unless an upstream mail server goes wrong. It can work before the message body is transmitted. It works against a lot of Outlook Express email viruses too. And, best of all, it's completely invisible to both sender and recipient: set it up, get it going, and it Just Works.
If you're interested, I strongly recommend the one wot I wrote<BLATANT ADVERTISING/>, because it's simple to set up and works on any MTA, but there are lots more around --- the earlier link is a major resource.
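
The core of greylisting as described above fits in a few lines. What follows is a minimal sketch, not the code of the greylister mentioned in this comment: it assumes the usual (client IP, envelope sender, envelope recipient) triplet as the key, a hypothetical 300-second minimum retry delay, and an in-memory store. The class and method names are made up for the example; a real deployment would hook this into the MTA at RCPT TO time and persist the data.

```python
import time


class Greylist:
    """Temporarily reject the first delivery attempt for each unseen triplet."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay   # seconds a sender must wait before retrying
        self.clock = clock           # injectable clock, handy for testing
        self.first_seen = {}         # triplet -> time of first attempt
        self.passed = set()          # triplets that have already retried properly

    def check(self, client_ip, mail_from, rcpt_to):
        """Return the SMTP reply to give at RCPT TO, before any body is sent."""
        triplet = (client_ip, mail_from.lower(), rcpt_to.lower())
        if triplet in self.passed:
            return "250 OK"
        now = self.clock()
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.min_delay:
            self.passed.add(triplet)  # known good from now on, no more delays
            return "250 OK"
        # 4xx is a temporary failure: a real SMTP server queues and retries.
        return "451 Temporary failure, please try again later"


if __name__ == "__main__":
    fake_now = [0.0]
    greylist = Greylist(min_delay=300, clock=lambda: fake_now[0])
    print(greylist.check("192.0.2.1", "a@example.com", "me@example.org"))
    fake_now[0] = 400.0
    print(greylist.check("192.0.2.1", "a@example.com", "me@example.org"))
```

The `passed` set is the simplest form of the "don't bother greylisting known good senders" optimisation: once a triplet has retried correctly, later mail from it is accepted immediately.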
yawn. Bayesian by itself doesn't work and isn't AI by CFD339 · 2004-07-11 04:17 · Score: 2, Informative

Bayesian filters are bypassed just like any other. I'd bet most of us here have tried some form of adaptive filtering with varying results.

He's right in one key respect though -- spam is cheap to send, but spam DESTINATIONS (the links they try to get you to go to) are relatively expensive. You can't register a hundred thousand domains a day. While it's cheap to get one or two, massive domain registration is an expensive proposition. That's currently, IMO, the best way to catch spam once you've gone through the bonehead catch of faked headers.

Personally, I do two stages: First, I catch the obvious stuff -- it says it's from AOL.COM but didn't come from their published servers. duh.

Then, I take those "known spams" and search for the call to action link -- what url are they trying to send me to. Take the primary part of that (the domain, plus a little more) and make a list of "probable spam destinations".

I do the same thing with known good mail (mail from people I have sent mail to).

Now I have good Bayesian fodder -- actual destination lists both good and bad.

Making a Bayesian list out of those results in a fairly accurate secondary filter.

Email inbound to me now goes through three checks:
1) have I sent you mail before (whitelist)
2) is this obvious bonehead spam
3) how many links in the message are to the same place as the ones in the bonehead spam?

This works to stop 98% of the 400+ spams a day that get sent at me with a very very low false positive ratio.
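
As a rough illustration of those three checks (not the actual filter described here), the sketch below whitelists known correspondents, harvests link hosts from mail already flagged as forged, and then flags later mail that links to the same hosts. The names `classify`, `link_hosts`, `correspondents`, `looks_forged` and `spam_hosts` are hypothetical, the forged-header test is left to the caller, only the bare hostname is compared rather than "the domain, plus a little more", and the Bayesian weighting of good versus bad destinations is replaced by a simple overlap test.

```python
import re
from urllib.parse import urlsplit

URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def link_hosts(body):
    """Return the set of hostnames that the links in a message body point to."""
    hosts = set()
    for url in URL_RE.findall(body):
        host = urlsplit(url).hostname
        if host:
            hosts.add(host.lower())
    return hosts


def classify(sender, body, correspondents, looks_forged, spam_hosts):
    """Apply the three checks in order; spam_hosts is updated in place."""
    # 1) Whitelist: people I have sent mail to before.
    if sender.lower() in correspondents:
        return "ham"
    # 2) Obvious spam (e.g. forged headers): remember where it points.
    if looks_forged:
        spam_hosts.update(link_hosts(body))
        return "spam"
    # 3) Does this message link to a known spam destination?
    if link_hosts(body) & spam_hosts:
        return "spam"
    return "unknown"


if __name__ == "__main__":
    known_bad = set()
    friends = {"friend@example.org"}
    print(classify("x@aol.com", "Buy at http://pills.example/buy", friends, True, known_bad))
    print(classify("y@example.net", "cheap http://pills.example/now", friends, False, known_bad))
```

The design leans on the observation above that destinations are expensive to change: the sender address and wording can vary freely, but the second message is still caught because it points at the same host.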

--
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
Re:Is it any wonder it mimics humans??? by Jeremi · 2004-07-11 05:38 · Score: 2, Informative

Spam does not evolve like an organism. Organisms evolve slowly, while spam content makes the occasional wild shift in both how and what is used to throw filters off the scent.

Actually, "occasional wild shifts" are exactly how organisms evolve.

--

I don't care if it's 90,000 hectares. That lake was not my doing.
Cui Bono? Sue the ass off the profiteer by crovira · 2004-07-11 08:28 · Score: 2, Informative

Go after spammers' customers. If they have to pay $10,000 for every spam sent on their behalf, they'll soon stop.

Fuck the spammers. They are merely supplying in response to a demand.

Dry up the demand with an internationally backed law (I know of NO gov't who'd turn down money) making it illegal to have spam sent on your behalf.

The response to spam is NOT going to be technical.

--
MSBPodcast.com The opinions expressed here are my own. If you don't like 'em... Think up your own stuff.