Using AI for Spam Filtering (w/ Source Code)
jarhead4067 writes "Article snippet: "Up until recently, most researchers in the fight against spam have failed to classify it as an artificial living organism, hindering the development of effective tools and techniques to kill it. While this classification may sound strange, consider the following..." A novel approach to filtering spam, and hey, there's free source included."
I dislike spam, in the same way only more than I dislike all the billboards along the highways. They get in the way of what I really want to see, and essentially make me feel inadequate. Billboards make me feel poor, because I can't afford a new home, or a meal at that expensive restaurant. Spam makes me worry that my penis is too small, my breasts are too small, I'm too fat, I don't send enough money to Nigeria. That said, it's illegal to saw down billboards, but it's not illegal to filter spam so I don't have to see it. The article is slashdotted, so I can't read it, but I think we already have good (free and open, no less) spam filtering available. I use Spam Assassin on my server, plus my mail client has a spam filter for double protection. Both have been learning more and more what constitutes spam, and it's rare that I even see spam anymore. If everyone would use these filters, spam would no longer be as profitable.
The emperor is naked.
Isn't Bayesian filtering system used in, Eg, Mozilla Mail classified as an AI?
Also, it ties up email servers meaning yours can take a little longer. I once got a spam message 2 weeks after it was sent, so what happened to legit email is a mystery.
I think for the damage it does both to servers (slowdown) and to people (moneydown), it could be called a plague
Get paid to search..It's geniune and
We already saw a plagiarized article green-lighted, and now this? Cmdr Taco, Slashdot was a brilliant idea of yours, and I love your site -- but that's because I have reasonably high expectations for it.
First, the submitter of this article has he email address jarhead4067@hotmail.com -- and so does the article's author.
Second, what is presented is not a genetic algorithm. The characteristics of the email to be considered to discover if the email is spam are finite and hard-core -- and even the threshold some characteristics must reach to qualify as spam are hard-core:
A genetic algorithm is one in which the goal is hard-core, different means of reaching that goal are generated, and the characteristics of the most successful are used to generate the next "generation"; this is repeated until the goal is reached.
But in this model, each "chromosome" contains statistics about one email. The heart of this model is to train a neural network with known emails ("chromosomes") and then tests unknown emails ("chromosomes") against the network.
Neural networks have a checkered history in Artificial Intelligence research. A (very much simplified) model of biologic neurons, neural networks were for a time seen as a great hope for Artificial Intelligence. A neural network basically starts out with an array of input nodes and an array of output nodes, with each input node connected to each output. Each input corresponds to some characteristic of the items the network is trained with: for classifying animals, the inputs would be characteristic of animals, e.g., "furry", "bipedal", "feathered"; each output a classification, e.g., "mammal", "bird", "human".
To train the network, the input nodes are set to the characteristics of an item, and then the strength of the connection of those inputs to the correct outputs is increased (or that of other connections is decreased -- it's the same thing). With enough training, it's possible to isolate the salient characteristics from the ambiguous one sin a mechanistic way.
This is useful, but it was soon discovered that these simple neural networks, for certain sets of inputs, failed, because of overlapping categories: both birds and humans are bipedal, but only humans are also mammals. In a single layer neural network, the connection strength between input "bipedal" and output "mammal" would fluctuate, unable to describe humans or birds well. These problems can be alleviated by adding additional "hidden" layers of nodes between input and outputs, and by allowing "back-propagation" from output or hidden nodes to layers "previous" to them.
But even with these enhancements, it's been conclusively shown that some problems are intractable for neural networks. In any case, neural networks are no new thing.
Of course I have no idea if classifying spam is intractable or not, but I have to question whether using a neural network reliably can outperform Bayesian (or quasi-Bayesian) filtering. My guess is that since Bayesian filtering can judge email by the occurrence of single tokens ("words"), and not just "chromosome" statistics, and given that this "new" method also uses Bayesian filtering to generate one of those "chromosome" statistics anyway (and for only the most difficult to characterize emails to boot), this method itself probably mostly relies on its Bayesian sub-component.
So I'm a bit at a loss to see why this method is in any way revolutionary or even particularly interesting, or why it was green-lighted for Slashdot. Of course, I only gave the linke
Opinions on the Twiddler2 hand-held keyboard?
from SpamAssassin? It takes a bunch of rules, applies them, and uses a neural net to classify the message. Seems to me SpamAssassin does the same thing, only is more mature and extensible and uses a genetic algorithm rather than a back-propagation neural net.
I've given up on Spam filtering and concentrating my efforts on Ham filtering.
Basically the present thinking is based on attempting to filter spam out - I would argue that given the amount of variables involved, it it a method doomed to failure. Current methods also assume that the incoming mail is mostly valid, and are attempting to remove the undesirable parts - spam.
What I am having success with is turning this on it's head and assuming that the bulk of incoming mail is bad, and filtering in messages that I want.
The way I am doing this is to use my address book as a whitelist - if an incoming message originates from someone in my address book, then it's delivered into the inbox. If not, then they are moved into a "not in address book" sub folder. Anything my ISP spam assassin based filtering marks, is sent into the "Spam" folder. Doing it this way means that I am only notified of incoming mail that is confirmed from someone in my address book. Periodically I check the other folders (obviously).
We have come to the point I think where the number of variables involved makes filtering in a less intensive process than attempting to deal with the myriad of underhanded techniques that spammers use. By limiting the mail I want to people in my address book, I make it so that spammers are the ones having to deal with the variables as they would have to guess addresses in my address book. If lots of people started filtering like this when we would see spammers using known bulk mail addresses (such as the address iTunes receipts are mailed from) however we can simply alter the filter to include the originating IP / mailer and so on.
Think of it like fishing - you wouldn't attempt to control an entire ocean and remove the water to leave the fish - you accept that the water is there and develop techniques to get the fish out.
I agree, however, that your concern about constructed attacks against detection of specific features is a killer, as it stands. But given a large enough set of features to look for in both form and contents the task becomes increasingly more difficult (hence SpamAssassin's success), would that problem tend to eleminate itself?
I'm using SpamAssassin now, and I think its primary weakness is lack of combinatorial weighing. Feature X is worth n point independently of the presence of other features in the message (or not? I might just have never found how).
-- MG
[1] Where "large enough" is the usual hard problem.
What this really points to is the need to have a common framework that a variety of classifiers can operate within. Consensus classification, using diverse techniques, creates a statistical highwire for the would-be spammer to walk. Significant computation can be engaged to calculate email contents that have higher probabilities of fooling bayesian classifiers; fooling two radically different techniques with a single message is pretty hard.
I want to be able to think up a new trait or technique, push it into the framework on a "trial" basis and be able to see the results of it.
Having a domain that's been out there for some time now, I receive about 7k to 12k spam messages a day. Most of these are from zombied PCs broadcasting mail to a random name at an email address. Recently my bayesian classifier has been giving spam scores on these as low as 40%. I have my threshold set at 50%, I think, and I may be lowering it again.
These messages hold hundreds of non-words, together with creatively "uglified" versions of common spam words. The trait I'd like to check for is "ratio of words never seen in ham"; seems like a nice and sensible thing to look for.
Without having a ton of history available and a framework, it's difficult to proceed.
To be honest, I also live in fear of losing my current, finely-tuned bayesian filter...which hasn't given me a false negative in months, and only delivers a few false positives a day.
Neural networks probably represent a better way of combining probabilities gained from multiple techniques. Bayesian stuff works pretty damn well, but we may need to give it a little more "traction" into the problem...
"But even with these enhancements, it's been conclusively shown that some problems are intractable for neural networks. In any case, neural networks are no new thing."
Not so. Maybe you're still thinking about extremely simple neural nets, because no such proof of intractability exists for larger more complex networks.
Here's proof: Neural Networks can emulate a Universal Turing Machine. Since they can also be emulated by a UTM their limitations are no greater or less than those of any UTM. One citation if this isn't obviously true.
This is exactly why Marvin Minsky has been accused of slandering neural nets unfairly, and hindering AI research. In his book _Perceptrons_ he demonstrated a simple problem that a trivial (one or two layers with no feedback) NN can't solve. A lot of scientists wrote off Neural Nets just as you have, because a toy was the only tool used. Never mind the fact that an only slightly more complex NN can solve such a problem easily. I find it telling that for a human to solve the same problem, one has to construct a strategy to do it. Not the sort of thing I'd assume any extremely simple machine could do. These days Minsky complains that AI isn't trying to build human brains. He's a brilliant man, but in some cases (as with many famous people) his chutzpah occasionally outstrips his judgement. I only wish that great scientists were immune to this.
Lots of less qualified people complain that neural nets aren't useful because they have some unpleasant experience with them. They have no idea of the variety of neural nets. It's like using a Playstation and complaining that computers are not useful.
As for spam filtering with AI, unless you have the narrow definition of AI, the Bayesian techniques of SpamAssassin are AI, as is the Latent Semantic Analysis done by OSX mail.app for spam filtering. LSA, while computationally expensive on a PC, is regarded as equivalent to a particular type of 3 layer neural net, (see Kohonen self-organizing maps.)
One thing you have right. Neural nets are "no new thing." They're as old as biological brains. Novelty is not a criterion for usefulness.
Assembly is the reverse of disassembly.