Using AI for Spam Filtering (w/ Source Code)

← Back to Stories (view on slashdot.org)

Using AI for Spam Filtering (w/ Source Code)

Posted by CmdrTaco on Sunday July 11, 2004 @01:15AM from the i-can't-do-that-dave dept.

jarhead4067 writes "Article snippet: "Up until recently, most researchers in the fight against spam have failed to classify it as an artificial living organism, hindering the development of effective tools and techniques to kill it. While this classification may sound strange, consider the following..." A novel approach to filtering spam, and hey, there's free source included."

4 of 197 comments (clear)

Min score:

Reason:

Sort:

Not new, not genetic, not A.I. -- it's Bayesian by orthogonal · 2004-07-11 01:46 · Score: 5, Interesting

Is Slashdot trying to jump the shark?

We already saw a plagiarized article green-lighted, and now this? Cmdr Taco, Slashdot was a brilliant idea of yours, and I love your site -- but that's because I have reasonably high expectations for it.

First, the submitter of this article has he email address jarhead4067@hotmail.com -- and so does the article's author.

Second, what is presented is not a genetic algorithm. The characteristics of the email to be considered to discover if the email is spam are finite and hard-core -- and even the threshold some characteristics must reach to qualify as spam are hard-core:

// This can be adjusted... Calculating the misspelled word ratio and // any Bayesian probability is time consuming if (stats.SpamProbability < .66)

A genetic algorithm is one in which the goal is hard-core, different means of reaching that goal are generated, and the characteristics of the most successful are used to generate the next "generation"; this is repeated until the goal is reached.

But in this model, each "chromosome" contains statistics about one email. The heart of this model is to train a neural network with known emails ("chromosomes") and then tests unknown emails ("chromosomes") against the network.

Neural networks have a checkered history in Artificial Intelligence research. A (very much simplified) model of biologic neurons, neural networks were for a time seen as a great hope for Artificial Intelligence. A neural network basically starts out with an array of input nodes and an array of output nodes, with each input node connected to each output. Each input corresponds to some characteristic of the items the network is trained with: for classifying animals, the inputs would be characteristic of animals, e.g., "furry", "bipedal", "feathered"; each output a classification, e.g., "mammal", "bird", "human".

To train the network, the input nodes are set to the characteristics of an item, and then the strength of the connection of those inputs to the correct outputs is increased (or that of other connections is decreased -- it's the same thing). With enough training, it's possible to isolate the salient characteristics from the ambiguous one sin a mechanistic way.

This is useful, but it was soon discovered that these simple neural networks, for certain sets of inputs, failed, because of overlapping categories: both birds and humans are bipedal, but only humans are also mammals. In a single layer neural network, the connection strength between input "bipedal" and output "mammal" would fluctuate, unable to describe humans or birds well. These problems can be alleviated by adding additional "hidden" layers of nodes between input and outputs, and by allowing "back-propagation" from output or hidden nodes to layers "previous" to them.

But even with these enhancements, it's been conclusively shown that some problems are intractable for neural networks. In any case, neural networks are no new thing.

Of course I have no idea if classifying spam is intractable or not, but I have to question whether using a neural network reliably can outperform Bayesian (or quasi-Bayesian) filtering. My guess is that since Bayesian filtering can judge email by the occurrence of single tokens ("words"), and not just "chromosome" statistics, and given that this "new" method also uses Bayesian filtering to generate one of those "chromosome" statistics anyway (and for only the most difficult to characterize emails to boot), this method itself probably mostly relies on its Bayesian sub-component.

So I'm a bit at a loss to see why this method is in any way revolutionary or even particularly interesting, or why it was green-lighted for Slashdot. Of course, I only gave the linke

--
Opinions on the Twiddler2 hand-held keyboard?
How is this any different... by Fooby · 2004-07-11 01:46 · Score: 5, Interesting

from SpamAssassin? It takes a bunch of rules, applies them, and uses a neural net to classify the message. Seems to me SpamAssassin does the same thing, only is more mature and extensible and uses a genetic algorithm rather than a back-propagation neural net.
Ham filtering by skinfitz · 2004-07-11 02:18 · Score: 4, Interesting

I've given up on Spam filtering and concentrating my efforts on Ham filtering.

Basically the present thinking is based on attempting to filter spam out - I would argue that given the amount of variables involved, it it a method doomed to failure. Current methods also assume that the incoming mail is mostly valid, and are attempting to remove the undesirable parts - spam.

What I am having success with is turning this on it's head and assuming that the bulk of incoming mail is bad, and filtering in messages that I want.

The way I am doing this is to use my address book as a whitelist - if an incoming message originates from someone in my address book, then it's delivered into the inbox. If not, then they are moved into a "not in address book" sub folder. Anything my ISP spam assassin based filtering marks, is sent into the "Spam" folder. Doing it this way means that I am only notified of incoming mail that is confirmed from someone in my address book. Periodically I check the other folders (obviously).

We have come to the point I think where the number of variables involved makes filtering in a less intensive process than attempting to deal with the myriad of underhanded techniques that spammers use. By limiting the mail I want to people in my address book, I make it so that spammers are the ones having to deal with the variables as they would have to guess addresses in my address book. If lots of people started filtering like this when we would see spammers using known bulk mail addresses (such as the address iTunes receipts are mailed from) however we can simply alter the filter to include the originating IP / mailer and so on.

Think of it like fishing - you wouldn't attempt to control an entire ocean and remove the water to leave the fish - you accept that the water is there and develop techniques to get the fish out.
Re:Spam really needs to be done away with. by slashname3 · 2004-07-11 04:13 · Score: 3, Interesting

I agree. I implemented spamassassin and it has worked wonders. We were seeing anywhere from 3000 to 7000 spam messages a day. Virtual all were tagged as spam by spamassassin.

This past week I implemented another tool called greylisting in the fight against spam.

Over a typical weekend for two days I would see something like 5000 to 8000 spam messages. Since implementing greylisting in the last two days we have seen 7 (yes seven) spam messages that were subsquently tagged as spam by spamassassin.

I never expected it to work that well but it has.

Highly recommended in this fight against spam.