Using AI for Spam Filtering (w/ Source Code)
jarhead4067 writes "Article snippet: "Up until recently, most researchers in the fight against spam have failed to classify it as an artificial living organism, hindering the development of effective tools and techniques to kill it. While this classification may sound strange, consider the following..." A novel approach to filtering spam, and hey, there's free source included."
I mean - hello, humans create it.
We're not up against a new being - it's the same type of beings that create scripts for the hell of it that wreak havoc on computer networks because 1) "We can" or 2) "To show them their weaknesses".
It was a very interesting read for sure - the genetic marker bit was quite interesting. Admittedly though I got about 2/3rds the way through it and lost interest.
Blame the spammers I say. ^_^
1. While the author proposes some marvelous cure based on treating spam as an organism, he just lists traits that any spam filter can use, and which most probably do, though he would suggest that most don't. I fail to see how the artificial-life observation improves spam non-spam determination from the list of traits he proposes filtering on.
2. The article reads like a sales pitch for the author's spam filter.
3. If 2 is true, and it is a sales pitch, then you have the irony of a very effect form of spam that makes it past the slashdot editors.
It's ALIVE!!!!
Letter To Iran
If I were to sum up this approach, it would be SpamAssassin with a multi-layer neural network. I should mention that I maintain the tool that SpamAssassin is useing to train its single-layer neural network for version 3.0, so I can honestly say that have a fair amount of experience in this area.
I'm not too keen on Evans' use of the biological metaphors. I think that they only confuse the issue of what he is doing. I will use the standard terminology, features, from here on out.
What he is doing is finding a nonlinear decision surface between two classes using a universal function approximator. I will explain this in layman's terms.
Imagine a sheet of paper filled with multi-coloured dots where these dots are arranged in clusters and each cluster contains mostly the same number of dots. Starting with a simple example, imagine two clusters of dots, one blue and one red. Assume that you can draw a line that separates the two clusters. That line is called the decision surface. You would say that any new dot that would appear on one side of the line will be called red and the other blue. Any blue dot that appears on the red side of the line would be misclassified as red. This is referred to as a linearly separable problem.
Now, imagine a more complex arrangement of clusters where you can't draw a straight line to separate the red from the blue, but you can separate them using a curved line. This is called a nonlinearly separable problem.
Artificial neural networks are very good for representing these decision surfaces. They are constructed of one or more perceptrons. A perceptron uses an activation function and a transfer function to take a set of inputs and produce a single output. The most popular form of neuron uses a linear activation function and a sigmoid transfer function. The linear activation function is the sum of a set of weighted inputs, i.e. f(X) = sum w_i *x_i. The logarithmic sigmoid transfer function is g(x) = 1/(1+exp(-x)). The output of the perceptron for any given input is O(X) = g(f(x)).
These perceptrons can be chained together in many different ways. One popular method is the multi-layer perceptron, where a set of neurons in the hidden layer process the inputs and pass on their outputs to the output layer where the final output is formed. I don't have a source for you, but it has been proven that, given a large enough hidden layer, the multi-layer perceptron is a universal function approximator.
As long as all of the transfer functions are differentiable, you can train a neural network using error backpropagation by gradient descent. I will leave it as an exercise to the reader to learn how it works, but I assure you that it is very simple. Machine Learning by Tom Mitchell has a good section on the subject, as does Fundamentals of Computational Neuroscience by Thomas Trappenberg.
Evans has identified a large set of features of e-mails, some of whom on their own convey little or no information about whether an e-mail is spam. He trains the neural network to recognize the combinations of these features which can lead towards the conclusion that a message is or is not spam. While his approach is a good idea, I would hesitate to call it novel. Massey, Thomure, Budrevich and Long did a very similar experiment [3] where they used a multi-layer neural network with SpamAssassin.
While his approach is good, there are some downsides for widespread deployment that need to be addressed first. With a large feature set like he is using, you will probably need a lot of training data to find a good fit with a multi-layer perceptron. To train the single layer neural network for SpamAssassin 3.0, I'm using 160000 messages.
Also, as his own arguments show, spam adapts to spam filter technology. Most of the features that he presents in his whitepaper can be easily fooled by a spammer. They can deliberately manipulate these features to evade the spam filter b