Using AI for Spam Filtering (w/ Source Code)
jarhead4067 writes "Article snippet: "Up until recently, most researchers in the fight against spam have failed to classify it as an artificial living organism, hindering the development of effective tools and techniques to kill it. While this classification may sound strange, consider the following..." A novel approach to filtering spam, and hey, there's free source included."
I mean - hello, humans create it.
We're not up against a new being - it's the same type of beings that create scripts for the hell of it that wreak havoc on computer networks because 1) "We can" or 2) "To show them their weaknesses".
It was a very interesting read for sure - the genetic marker bit was quite interesting. Admittedly though I got about 2/3rds the way through it and lost interest.
Blame the spammers I say. ^_^
This is really a testament of strength of yet another MS product.
No, more likely it's some guy trying to use Windows 2000 Pro as a webserver. It has a ten-connection limit; you're supposed to use a server version of Windows for live webservers. I've never seen that error from a server version of Windows.
Because we've realized that we don't have to read the article or understand the topic to post something here and get modded "Informative"?
The emperor is naked.
locomotion, respiration, ingestion, self-reproduction
Yeah, fire is alive.
"Huh? Since when do these three criteria determine if something is alive? As far as I remember from high school the criteria were: locomotion, respiration, ingestion, self-reproduction."
I believe you are missing the point the creator is trying to make. Spam imitates a living organism by adapting to its surroundings in order to survive. Why does spam do this? Because it is sent by HUMANS who learn to "mutate" and change their message to bypass current spam filters in order for them to survive.
I think this is a very interesting approach and may serve as an effective spam-blocking tool until an improved mail protocol is accepted.
1. While the author proposes some marvelous cure based on treating spam as an organism, he just lists traits that any spam filter can use, and which most probably do, though he would suggest that most don't. I fail to see how the artificial-life observation improves spam non-spam determination from the list of traits he proposes filtering on.
2. The article reads like a sales pitch for the author's spam filter.
3. If 2 is true, and it is a sales pitch, then you have the irony of a very effective form of spam that makes it past the Slashdot editors.
It's ALIVE!!!!
If I were to sum up this approach, it would be SpamAssassin with a multi-layer neural network. I should mention that I maintain the tool that SpamAssassin is using to train its single-layer neural network for version 3.0, so I can honestly say that I have a fair amount of experience in this area.
I'm not too keen on Evans' use of the biological metaphors. I think that they only confuse the issue of what he is doing. I will use the standard terminology, features, from here on out.
What he is doing is finding a nonlinear decision surface between two classes using a universal function approximator. I will explain this in layman's terms.
Imagine a sheet of paper filled with multi-coloured dots where these dots are arranged in clusters and each cluster contains dots of mostly the same colour. Starting with a simple example, imagine two clusters of dots, one blue and one red. Assume that you can draw a straight line that separates the two clusters. That line is called the decision surface. You would say that any new dot that appears on one side of the line will be called red and any on the other side blue. Any blue dot that appears on the red side of the line would be misclassified as red. This is referred to as a linearly separable problem.
Now, imagine a more complex arrangement of clusters where you can't draw a straight line to separate the red from the blue, but you can separate them using a curved line. This is called a nonlinearly separable problem.
Artificial neural networks are very good for representing these decision surfaces. They are constructed of one or more perceptrons. A perceptron uses an activation function and a transfer function to take a set of inputs and produce a single output. The most popular form of neuron uses a linear activation function and a sigmoid transfer function. The linear activation function is the sum of a set of weighted inputs, i.e. f(X) = sum_i w_i * x_i. The logarithmic sigmoid transfer function is g(x) = 1/(1+exp(-x)). The output of the perceptron for any given input X is O(X) = g(f(X)).
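To make the two formulas concrete, here is a minimal Python sketch of a single perceptron as described above. The function names and the example weights are my own choices for illustration, not anything from Evans' code or SpamAssassin, and the bias is handled as a weight on a constant input of 1.0.

```python
import math

def activation(weights, inputs):
    """Linear activation: f(X) = sum_i w_i * x_i."""
    return sum(w * x for w, x in zip(weights, inputs))

def sigmoid(x):
    """Logarithmic sigmoid transfer function: g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def perceptron_output(weights, inputs):
    """Output of the perceptron: O(X) = g(f(X))."""
    return sigmoid(activation(weights, inputs))

# Hypothetical weights, chosen by hand so the unit behaves like a soft AND.
# The first weight is the bias; it multiplies the constant input 1.0.
weights = [-1.5, 1.0, 1.0]

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    out = perceptron_output(weights, [1.0, a, b])
    print(a, b, round(out, 3))
```

Only the input (1, 1) pushes the output above 0.5, so thresholding at 0.5 gives a straight-line decision surface, which is all a single perceptron can represent.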
These perceptrons can be chained together in many different ways. One popular method is the multi-layer perceptron, where a set of neurons in the hidden layer process the inputs and pass on their outputs to the output layer where the final output is formed. I don't have a source for you, but it has been proven that, given a large enough hidden layer, the multi-layer perceptron is a universal function approximator.
As long as all of the transfer functions are differentiable, you can train a neural network using error backpropagation by gradient descent. I will leave it as an exercise to the reader to learn how it works, but I assure you that it is very simple. Machine Learning by Tom Mitchell has a good section on the subject, as does Fundamentals of Computational Neuroscience by Thomas Trappenberg.
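For readers who would rather see it than look it up, here is a rough sketch of backpropagation for a multi-layer perceptron with one hidden layer and sigmoid units, trained on XOR (a standard example of a problem that is not linearly separable). The layer size, learning rate, epoch count and seed are arbitrary assumptions of mine, and this is a teaching sketch, not the training tool I use for SpamAssassin.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(hidden_w, output_w, x):
    """Return (inputs with bias, hidden outputs with bias, network output)."""
    xb = [1.0] + list(x)  # constant 1.0 acts as the bias input
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, xb))) for ws in hidden_w]
    hb = [1.0] + hidden   # bias input for the output unit
    out = sigmoid(sum(w * v for w, v in zip(output_w, hb)))
    return xb, hb, out

def total_error(hidden_w, output_w, data):
    """Sum of squared errors over the data set."""
    return sum((t - forward(hidden_w, output_w, x)[2]) ** 2 for x, t in data)

def train(data, n_hidden=4, rate=0.5, epochs=5000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    hidden_w = [[rng.uniform(-1, 1) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    output_w = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for x, t in data:
            xb, hb, out = forward(hidden_w, output_w, x)
            # Output delta: error times the sigmoid derivative out * (1 - out).
            d_out = (t - out) * out * (1.0 - out)
            # Hidden deltas: push d_out back through the output weights
            # (index 0 is the bias weight, so hidden unit j uses j + 1).
            d_hid = [d_out * output_w[j + 1] * hb[j + 1] * (1.0 - hb[j + 1])
                     for j in range(n_hidden)]
            # Gradient descent step on both layers.
            for j in range(n_hidden + 1):
                output_w[j] += rate * d_out * hb[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    hidden_w[j][i] += rate * d_hid[j] * xb[i]
    return hidden_w, output_w

XOR = [((0, 0), 0.0), ((0, 1), 1.0), ((1, 0), 1.0), ((1, 1), 0.0)]
hidden_w, output_w = train(XOR)
for x, t in XOR:
    print(x, t, round(forward(hidden_w, output_w, x)[2], 3))
```

The outputs should move towards the targets as training proceeds, but how close they get depends on the random initial weights and the settings above; gradient descent can stall in a poor local minimum, which is one reason real training needs plenty of data and some care.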
Evans has identified a large set of features of e-mails, some of which on their own convey little or no information about whether an e-mail is spam. He trains the neural network to recognize the combinations of these features which can lead towards the conclusion that a message is or is not spam. While his approach is a good idea, I would hesitate to call it novel. Massey, Thomure, Budrevich and Long did a very similar experiment [3] where they used a multi-layer neural network with SpamAssassin.
While his approach is good, there are some downsides for widespread deployment that need to be addressed first. With a large feature set like he is using, you will probably need a lot of training data to find a good fit with a multi-layer perceptron. To train the single layer neural network for SpamAssassin 3.0, I'm using 160000 messages.
Also, as his own arguments show, spam adapts to spam filter technology. Most of the features that he presents in his whitepaper can be easily fooled by a spammer. They can deliberately manipulate these features to evade the spam filter b
actually spam is very analogous to bugs (bacteria)..
spam filters kill spam,
antibiotics kill bacteria.
we have spam filters,
we have antibiotics.
the selection pressure that spam filters put on spam makes it harder to filter.
the selection pressure that antibiotics put on bacteria makes them harder to kill.
we then have to develop other spam filters,
and likewise other antibiotics.
too much of a spam filter will result in an adverse effect because you filter ham out.
too much of an antibiotic will result in an adverse drug effect because of toxicity to human cells (e.g. nephrotoxicity, ototoxicity etc.)
"Of course, this won't solve the bandwidth/resource theft problem..."
No, it won't.
Obviously, to solve that problem you need to act earlier in the spam path.
Spammers abuse systems because they look for vulnerable systems and can find them, and can distinguish them from secure systems. Think about that - it's true.
Securing systems (as a solution to spam) is based on the ridiculous notion that enough can be secured so that the spammers can't find them. Won't happen. But "distinguish them from secure systems" is still left. What can be done with that?
Well, if secure systems didn't look secure to the spammers they'd not be able to distinguish them and they'd try to abuse systems that can't be abused. That would mean they'd send the spam to traps and that the traps would not deliver any spam other than to what can be determined to be the spammers' own addresses, used to test whether the spam sent gets through (in other words, to re-test to see whether the system is or isn't vulnerable to abuse).
That's easy to understand, isn't it? If you want to stop the bandwidth theft you're almost surely going to have to act against the bandwidth theft. What's described above is a way to make bandwidth theft not work as well. Break bandwidth theft sufficiently and the spammers won't get enough return on the spam to pay for sending it (or the ones paying the spammers won't get sufficient return - it's the same idea either way).
With a single ancient Vaxstation and an obsolete MTA I stopped spam to millions of recipients elsewhere: AOL, Hotmail, a large number of destinations. To top it off that Vaxstation was a real email server, so it did two things (and it was slightly harder to stop the spam). Set up a fake server and everything that comes to it is some form of abuse: none need be delivered as though it is valid email (it isn't valid email. Of course you'd want to deliver the spammers' own test messages: that's what lets them fool themselves into thinking they've found an open relay). Nowadays this idea works better if you fake an open proxy: open relay abuse is finally on the decline.
If you're an ISP with IP addresses that the spammers check for abusability or with IP addresses that have been abused you can do more than shut off the IP address (and please, I beg of you, do more). Find out where the abuse packets originate that come into the abused system and do whatever you can to get that abuse stopped. If you, for instance, disconnected the abused system and set up something that accepted the incoming abuse packets but sent out no spam, that would be helpful. What you can do depends on the abuse and on the spammer - but the main point is that you don't have to only shut off access, you can do more. Why not do more? You are against spam, and doing more stops some spam. That's in the right direction.
And an oak tree isn't.
Up until recently, most researchers in the fight against spam have failed to classify it as an artificial living organism, hindering the development of effective tools and techniques to kill it.
That is not true. I have been using POPFile for 1 1/2 years now, and spam is no longer a problem for me. I see maybe 1 spam per week. I think that the "Bayesian part" of all filters is about equally effective; the differences come from the tokenizer. The more data you can extract from the message, the more data the Bayesian classifier has to work with.
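To show what I mean by the split between tokenizer and classifier, here is a toy naive Bayes filter in Python. It is only a sketch under my own assumptions (a crude regex tokenizer, add-one smoothing, two classes that have both been trained on at least one message) and is not how POPFile is actually implemented.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Crude tokenizer: lowercase runs of letters, digits, '$' and apostrophes.
    A better tokenizer (headers, URLs, HTML tricks) is where filters differ."""
    return re.findall(r"[a-z0-9$']+", text.lower())

class NaiveBayes:
    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.msg_counts = Counter()

    def train(self, label, text):
        self.msg_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def score(self, label, text):
        """Log of P(label) * product of P(token | label), add-one smoothed.
        Assumes every label has been trained on at least one message."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(self.word_counts[label].values())
        logp = math.log(self.msg_counts[label] / sum(self.msg_counts.values()))
        for tok in tokenize(text):
            count = self.word_counts[label][tok]
            logp += math.log((count + 1) / (total + len(vocab)))
        return logp

    def classify(self, text):
        return max(self.LABELS, key=lambda label: self.score(label, text))

nb = NaiveBayes()
nb.train("spam", "buy cheap pills now")
nb.train("spam", "cheap mortgage offer now")
nb.train("ham", "meeting agenda for tomorrow")
nb.train("ham", "lunch tomorrow with the team")

print(nb.classify("cheap pills"))
print(nb.classify("agenda tomorrow"))
```

The classifier itself is a few lines of counting and logarithms; everything interesting happens in tokenize(), which is why I say the tokenizer is what separates one filter from another.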
The article sounds like the author had just learned about neural nets and decided that they would be the best solution to spam without doing any real research on existing systems.