Using AI for Spam Filtering (w/ Source Code)

Spam really needs to be done away with. by ODD97 · 2004-07-11 01:25 · Score: 2, Interesting

I dislike spam in the same way, only more so, that I dislike all the billboards along the highways. They get in the way of what I really want to see, and essentially make me feel inadequate. Billboards make me feel poor, because I can't afford a new home, or a meal at that expensive restaurant. Spam makes me worry that my penis is too small, my breasts are too small, I'm too fat, I don't send enough money to Nigeria. That said, it's illegal to saw down billboards, but it's not illegal to filter spam so I don't have to see it. The article is slashdotted, so I can't read it, but I think we already have good (free and open, no less) spam filtering available. I use SpamAssassin on my server, plus my mail client has a spam filter for double protection. Both have been learning more and more about what constitutes spam, and it's rare that I even see spam anymore. If everyone used these filters, spam would no longer be as profitable.

--
The emperor is naked.

Re:Spam really needs to be done away with. by slashname3 · 2004-07-11 04:13 · Score: 3, Interesting

I agree. I implemented spamassassin and it has worked wonders. We were seeing anywhere from 3000 to 7000 spam messages a day. Virtually all were tagged as spam by spamassassin.

This past week I implemented another tool called greylisting in the fight against spam.

Over a typical two-day weekend I would see something like 5000 to 8000 spam messages. In the two days since implementing greylisting we have seen 7 (yes, seven) spam messages, which were subsequently tagged as spam by spamassassin.

I never expected it to work that well but it has.

Highly recommended in this fight against spam.

Bayesian filtering by sctprog · 2004-07-11 01:27 · Score: 2, Interesting

Isn't the Bayesian filtering system used in, e.g., Mozilla Mail classified as AI?

Re:This guy may take spam a little too seriously.. by bairy · 2004-07-11 01:41 · Score: 2, Interesting

Compared to AIDS there's no real contest. But spam is a real bastard to everyone on the net, not just because it's seriously annoying, but because some people fall for the scams (419 scam etc.) and actually lose money.

Also, it ties up email servers meaning yours can take a little longer. I once got a spam message 2 weeks after it was sent, so what happened to legit email is a mystery.

I think for the damage it does both to servers (slowdown) and to people (moneydown), it could be called a plague.

--

Get paid to search.. It's genuine and

Not new, not genetic, not A.I. -- it's Bayesian by orthogonal · 2004-07-11 01:46 · Score: 5, Interesting

Is Slashdot trying to jump the shark?

We already saw a plagiarized article green-lighted, and now this? Cmdr Taco, Slashdot was a brilliant idea of yours, and I love your site -- but that's because I have reasonably high expectations for it.

First, the submitter of this article has the email address jarhead4067@hotmail.com -- and so does the article's author.

Second, what is presented is not a genetic algorithm. The characteristics of the email to be considered to discover if the email is spam are finite and hard-coded -- and even the thresholds some characteristics must reach to qualify as spam are hard-coded:

    // This can be adjusted... Calculating the misspelled word ratio and
    // any Bayesian probability is time consuming
    if (stats.SpamProbability < .66)

A genetic algorithm is one in which the goal is hard-coded, different means of reaching that goal are generated, and the characteristics of the most successful are used to generate the next "generation"; this is repeated until the goal is reached.
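To make that definition concrete, here is a minimal sketch of the loop it describes, assuming a toy goal (a bit string of all ones) that has nothing to do with the article's code. The names `TARGET`, `fitness` and `evolve`, and all the parameter values, are illustrative assumptions.

```python
import random

# Hypothetical, hard-coded goal: a string of twenty 1-bits.
TARGET = [1] * 20


def fitness(candidate):
    """Count how many positions already match the goal."""
    return sum(1 for got, want in zip(candidate, TARGET) if got == want)


def evolve(pop_size=30, mutation_rate=0.05, max_generations=500, seed=0):
    """Return (best candidate, generations used) for the toy goal."""
    rng = random.Random(seed)
    # Generate different candidate "means of reaching the goal" at random.
    population = [[rng.randint(0, 1) for _ in TARGET] for _ in range(pop_size)]
    for generation in range(max_generations):
        population.sort(key=fitness, reverse=True)
        best = population[0]
        if fitness(best) == len(TARGET):
            return best, generation  # goal reached
        # The most successful half survives and breeds the next generation.
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, len(TARGET))
            child = mother[:cut] + father[cut:]  # single-point crossover
            # Flip each bit with a small probability (mutation).
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = parents + children
    population.sort(key=fitness, reverse=True)
    return population[0], max_generations
```

The point of contrast with the article is that candidates here compete and recombine across generations; nothing like that happens when each "chromosome" is just a fixed record of one email's statistics.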

But in this model, each "chromosome" contains statistics about one email. The heart of this model is to train a neural network with known emails ("chromosomes") and then test unknown emails ("chromosomes") against the network.

Neural networks have a checkered history in Artificial Intelligence research. A (very much simplified) model of biological neurons, neural networks were for a time seen as a great hope for Artificial Intelligence. A neural network basically starts out with an array of input nodes and an array of output nodes, with each input node connected to each output. Each input corresponds to some characteristic of the items the network is trained with: for classifying animals, the inputs would be characteristics of animals, e.g., "furry", "bipedal", "feathered"; each output a classification, e.g., "mammal", "bird", "human".

To train the network, the input nodes are set to the characteristics of an item, and then the strength of the connection of those inputs to the correct outputs is increased (or that of other connections is decreased -- it's the same thing). With enough training, it's possible to isolate the salient characteristics from the ambiguous ones in a mechanistic way.
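A rough sketch of that training rule, assuming the simplest possible setup: one weight per input/output connection, adjusted only when the network guesses wrong. The feature names come from the example above; the four training animals, the function names and the learning rate are illustrative assumptions.

```python
FEATURES = ["furry", "bipedal", "feathered"]
CLASSES = ["mammal", "bird"]


def predict(weights, inputs):
    """Pick the output node with the strongest total input."""
    scores = {c: sum(w * x for w, x in zip(weights[c], inputs))
              for c in CLASSES}
    return max(CLASSES, key=lambda c: scores[c])


def train(examples, epochs=10, rate=1.0):
    """Single-layer training: strengthen connections to the correct
    output and weaken those to the wrongly chosen one."""
    weights = {c: [0.0] * len(FEATURES) for c in CLASSES}
    for _ in range(epochs):
        for inputs, label in examples:
            guess = predict(weights, inputs)
            if guess != label:
                for i, x in enumerate(inputs):
                    weights[label][i] += rate * x
                    weights[guess][i] -= rate * x
    return weights


# Hypothetical training set: (furry, bipedal, feathered) -> class
EXAMPLES = [
    ([1, 0, 0], "mammal"),  # dog
    ([0, 1, 1], "bird"),    # sparrow
    ([1, 1, 0], "mammal"),  # kangaroo
    ([0, 1, 1], "bird"),    # ostrich
]
```

On this toy data the weights settle after two passes, with "furry" pulling toward mammal, "feathered" pulling toward bird, and "bipedal" ending at zero for both outputs: the ambiguous characteristic has been separated from the salient ones.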

This is useful, but it was soon discovered that these simple neural networks, for certain sets of inputs, failed, because of overlapping categories: both birds and humans are bipedal, but only humans are also mammals. In a single layer neural network, the connection strength between input "bipedal" and output "mammal" would fluctuate, unable to describe humans or birds well. These problems can be alleviated by adding additional "hidden" layers of nodes between input and outputs, and by allowing "back-propagation" from output or hidden nodes to layers "previous" to them.
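The usual textbook illustration of this limitation is XOR rather than the animal example, so the sketch below swaps that in. It assumes simple threshold units, searches a coarse grid of weights to show that no single unit reproduces XOR, and then shows a hand-wired network with one hidden layer that does. The weights are chosen by hand, not learned by back-propagation, and all names are illustrative.

```python
from itertools import product


def step(x):
    """Threshold activation: the node fires only on positive input."""
    return 1 if x > 0 else 0


def single_unit(w1, w2, bias, a, b):
    """One output node wired directly to the two inputs."""
    return step(w1 * a + w2 * b + bias)


def two_layer_xor(a, b):
    """Two hidden nodes (OR and AND) feeding one output node."""
    h_or = step(a + b - 0.5)    # fires if at least one input is on
    h_and = step(a + b - 1.5)   # fires only if both inputs are on
    return step(h_or - h_and - 0.5)  # "OR but not AND"


XOR_TABLE = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
OR_TABLE = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}

# Candidate weights from -4.0 to 4.0 in steps of 0.5.
GRID = [x / 2 for x in range(-8, 9)]


def single_unit_can_fit(table, grid):
    """Brute-force search for weights that reproduce the whole table."""
    for w1, w2, bias in product(grid, repeat=3):
        if all(single_unit(w1, w2, bias, a, b) == out
               for (a, b), out in table.items()):
            return True
    return False
```

The grid search finds weights for OR but none for XOR, which is expected: XOR is not linearly separable, so no choice of weights for a single threshold unit can work, on or off the grid.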

But even with these enhancements, it's been conclusively shown that some problems are intractable for neural networks. In any case, neural networks are no new thing.

Of course I have no idea if classifying spam is intractable or not, but I have to question whether using a neural network reliably can outperform Bayesian (or quasi-Bayesian) filtering. My guess is that since Bayesian filtering can judge email by the occurrence of single tokens ("words"), and not just "chromosome" statistics, and given that this "new" method also uses Bayesian filtering to generate one of those "chromosome" statistics anyway (and for only the most difficult to characterize emails to boot), this method itself probably mostly relies on its Bayesian sub-component.
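For comparison, here is a minimal sketch of what judging email by single tokens can look like, assuming a plain naive Bayes model with add-one smoothing. It is not the article's Bayesian component, nor the filter in Mozilla Mail or SpamAssassin; the function names and the four-message corpus are made up for illustration.

```python
import math
from collections import Counter


def train(messages):
    """messages: iterable of (text, is_spam) pairs."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        counts[is_spam].update(text.lower().split())
        totals[is_spam] += 1
    return counts, totals


def spam_probability(text, counts, totals):
    """Combine per-token evidence into one spam probability."""
    vocab_size = len(set(counts[True]) | set(counts[False]))
    spam_tokens = sum(counts[True].values())
    ham_tokens = sum(counts[False].values())
    # Start from the prior odds of spam versus ham.
    log_odds = math.log((totals[True] + 1) / (totals[False] + 1))
    for token in text.lower().split():
        # Add-one smoothing so unseen tokens never give a zero probability.
        p_spam = (counts[True][token] + 1) / (spam_tokens + vocab_size + 1)
        p_ham = (counts[False][token] + 1) / (ham_tokens + vocab_size + 1)
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical training corpus.
CORPUS = [
    ("cheap pills online", True),
    ("buy cheap watches", True),
    ("meeting agenda attached", False),
    ("lunch meeting tomorrow", False),
]
```

Every token contributes its own evidence, which is the contrast with a handful of per-message statistics: with this corpus, "cheap pills" scores well above 0.5 and "meeting agenda" well below it.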

So I'm a bit at a loss to see why this method is in any way revolutionary or even particularly interesting, or why it was green-lighted for Slashdot. Of course, I only gave the linke

--
Opinions on the Twiddler2 hand-held keyboard?

Re:Not new, not genetic, not A.I. -- it's Bayesian by Epistax · 2004-07-11 03:26 · Score: 2, Interesting

You had a good piece on neural networks in there so I thought I'd reply about my own experiences. I've made a few networks from scratch in C++ and tried to train them on a few things. From the problems I was having I came to the conclusion that we're training these analog thinkers to solve digital problems, and it's not working so well. Is this a mammal? That's a yes or no question and it is hard to teach a network to answer it. I think neural networks are much better at answering "which" questions: which animal has the most "mammal essence"? One thing I am thinking about doing is giving a cross-sectional view of a city and asking which building is the tallest. I think a network would be much better at answering that.

Another problem is the physical aspect: how many neurons does it take, how should they be linked, and can new ones be grown to solve the problem? I think the 2nd problem is very important. Will every problem be a straight shot input to output 2d map of neurons or will there be backwards traversal? Will these systems settle on a given output or be constantly slightly changing? If you look at an object and decide what it is, your mind will start making things out of it. If you ask a neural network what animal something is and show it a house cat, it's not at all incorrect for it to come up with "Lion" after selecting "Cat". The network is simply thinking about what it is seeing. Again, this implies feedback. I remember seeing one basic model of a neural network where every output node was also an input node. This is a good start but it assumes that no internal thoughts loop back which I believe is incorrect.
As for other issues.. how many neurons? may more grow? I suppose if we truly want a system to be completely organic then we want to start with just the input and output nodes. Let the network figure out that it can't figure it out, and try to guess at the best places to add neurons. I don't know if this has already been done, but I think it's safe to say it hasn't been done well.

I am very interested in this subject and being a computer engineer (er, in school) I am really looking forward to the hardware that can be designed using neural networks for processing.

How is this any different... by Fooby · 2004-07-11 01:46 · Score: 5, Interesting

from SpamAssassin? It takes a bunch of rules, applies them, and uses a neural net to classify the message. Seems to me SpamAssassin does the same thing, only is more mature and extensible and uses a genetic algorithm rather than a back-propagation neural net.

Ham filtering by skinfitz · 2004-07-11 02:18 · Score: 4, Interesting

I've given up on Spam filtering and am concentrating my efforts on Ham filtering.

Basically the present thinking is based on attempting to filter spam out - I would argue that given the number of variables involved, it is a method doomed to failure. Current methods also assume that the incoming mail is mostly valid, and attempt to remove the undesirable parts - spam.

What I am having success with is turning this on its head and assuming that the bulk of incoming mail is bad, and filtering in messages that I want.

The way I am doing this is to use my address book as a whitelist - if an incoming message originates from someone in my address book, then it's delivered into the inbox. If not, then it is moved into a "not in address book" subfolder. Anything my ISP's SpamAssassin-based filtering marks is sent into the "Spam" folder. Doing it this way means that I am only notified of incoming mail that is confirmed to be from someone in my address book. Periodically I check the other folders (obviously).
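The routing described above could be sketched roughly as follows. This assumes messages arrive as a plain dict of headers, that the ISP's filter marks spam with an `X-Spam-Flag: YES` header, and that the address-book check takes precedence over the spam flag; all three are assumptions for illustration, as are the function and folder names.

```python
from email.utils import parseaddr


def route(headers, address_book, spam_header="X-Spam-Flag"):
    """Return the folder a message should be filed in.

    headers: dict of header name -> value (assumed message format).
    address_book: set of lowercase addresses acting as the whitelist.
    """
    _, sender = parseaddr(headers.get("From", ""))
    if sender.lower() in address_book:
        return "Inbox"  # known correspondent: filter it in
    if headers.get(spam_header, "").strip().upper() == "YES":
        return "Spam"  # marked by the upstream filter (assumed header)
    return "Not in address book"  # unknown sender: review periodically
```

For example, with `address_book = {"alice@example.com"}`, a message from `Alice <Alice@Example.com>` lands in "Inbox", while an unflagged message from any other sender lands in "Not in address book".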

We have come to the point, I think, where the number of variables involved makes filtering in a less intensive process than attempting to deal with the myriad of underhanded techniques that spammers use. By limiting the mail I want to people in my address book, I make it so that spammers are the ones having to deal with the variables, as they would have to guess addresses in my address book. If lots of people started filtering like this, then we would see spammers using known bulk mail addresses (such as the address iTunes receipts are mailed from); however, we can simply alter the filter to include the originating IP / mailer and so on.

Think of it like fishing - you wouldn't attempt to control an entire ocean and remove the water to leave the fish - you accept that the water is there and develop techniques to get the fish out.

Re:Some comments by Montreal+Geek · 2004-07-11 05:58 · Score: 2, Interesting

I think you make a very good point, but given a large enough[1] training corpus, and being very conservative on the weight to assign to error backpropagation, wouldn't it be interesting to see if the decision hyperplane would be able to reshape itself quickly enough to include freshly "evolved" forms of spam as they appear? (Provided, of course, that those consist of variants on previous forms).

I agree, however, that your concern about constructed attacks against detection of specific features is a killer, as it stands. But given a large enough set of features to look for in both form and content, the attacker's task becomes increasingly difficult (hence SpamAssassin's success); wouldn't that problem tend to eliminate itself?

I'm using SpamAssassin now, and I think its primary weakness is lack of combinatorial weighing. Feature X is worth n points independently of the presence of other features in the message (or not? I might just have never found how).
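To illustrate the distinction being drawn, here is a small sketch contrasting purely additive scoring with a score that also weighs feature combinations. The rule names, point values and pair bonus are hypothetical, and this is not SpamAssassin's rule format or scoring code.

```python
from itertools import combinations

# Hypothetical rules and point values.
RULE_POINTS = {
    "all_caps_subject": 1.5,
    "html_only": 1.0,
    "mentions_pills": 2.0,
}

# Hypothetical extra weight when two features occur together.
PAIR_BONUS = {
    frozenset({"html_only", "mentions_pills"}): 1.5,
}


def additive_score(hits):
    """Each matched rule contributes its points independently."""
    return sum(RULE_POINTS[rule] for rule in hits)


def combined_score(hits):
    """Additive score plus a bonus for each known feature pair."""
    score = additive_score(hits)
    for pair in combinations(sorted(hits), 2):
        score += PAIR_BONUS.get(frozenset(pair), 0.0)
    return score
```

With both `html_only` and `mentions_pills` matched, the additive score is 3.0 and the combined score is 4.5, because the pair is treated as stronger evidence than its parts.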

-- MG

[1] Where "large enough" is the usual hard problem.

Re:Some comments by rossjudson · 2004-07-11 06:04 · Score: 2, Interesting

What this really points to is the need to have a common framework that a variety of classifiers can operate within. Consensus classification, using diverse techniques, creates a statistical highwire for the would-be spammer to walk. Significant computation can be engaged to calculate email contents that have higher probabilities of fooling Bayesian classifiers; fooling two radically different techniques with a single message is pretty hard.
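One way such a framework might look, as a rough sketch: every classifier is just a function from a message to a spam probability, and a message counts as spam only if enough of them agree. The interface, the majority rule and the two toy classifiers are all assumptions made for illustration.

```python
def consensus_is_spam(message, classifiers, threshold=0.5, min_votes=None):
    """Return True if enough independent classifiers call it spam.

    classifiers: list of callables, message -> probability in [0, 1].
    min_votes: votes required; defaults to a simple majority.
    """
    votes = sum(1 for classify in classifiers
                if classify(message) >= threshold)
    needed = min_votes if min_votes is not None else len(classifiers) // 2 + 1
    return votes >= needed


# Two deliberately different toy techniques (hypothetical).
def keyword_classifier(message):
    return 0.9 if "pills" in message.lower() else 0.1


def shouting_classifier(message):
    letters = [ch for ch in message if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if ch.isupper()) / len(letters)
```

A new trait can be tried out by appending another function to the list; a message has to fool a majority of them at once to get through.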

I want to be able to think up a new trait or technique, push it into the framework on a "trial" basis and be able to see the results of it.

Having a domain that's been out there for some time now, I receive about 7k to 12k spam messages a day. Most of these are from zombied PCs broadcasting mail to a random name at an email address. Recently my Bayesian classifier has been giving spam scores on these as low as 40%. I have my threshold set at 50%, I think, and I may be lowering it again.

These messages hold hundreds of non-words, together with creatively "uglified" versions of common spam words. The trait I'd like to check for is "ratio of words never seen in ham"; seems like a nice and sensible thing to look for.
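The proposed trait is simple to state in code. A minimal sketch, assuming whitespace tokenization and a ham vocabulary collected beforehand from known-good mail (both are assumptions, as is the function name):

```python
def unseen_in_ham_ratio(text, ham_vocabulary):
    """Fraction of the message's tokens that never appeared in ham.

    ham_vocabulary: set of lowercase tokens seen in known-good mail.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    unseen = sum(1 for token in tokens if token not in ham_vocabulary)
    return unseen / len(tokens)
```

A message stuffed with non-words and "uglified" spellings should score close to 1.0, while ordinary correspondence should score low once the ham vocabulary is large enough; how large that needs to be is exactly the history problem mentioned next.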

Without having a ton of history available and a framework, it's difficult to proceed.

To be honest, I also live in fear of losing my current, finely-tuned Bayesian filter...which hasn't given me a false negative in months, and only delivers a few false positives a day.

Neural networks probably represent a better way of combining probabilities gained from multiple techniques. Bayesian stuff works pretty damn well, but we may need to give it a little more "traction" into the problem...

You underestimate Neural Nets by obtuse · 2004-07-11 20:34 · Score: 2, Interesting

"But even with these enhancements, it's been conclusively shown that some problems are intractable for neural networks. In any case, neural networks are no new thing."

Not so. Maybe you're still thinking about extremely simple neural nets, because no such proof of intractability exists for larger more complex networks.

Here's proof: Neural Networks can emulate a Universal Turing Machine. Since they can also be emulated by a UTM their limitations are no greater or less than those of any UTM. One citation if this isn't obviously true.

This is exactly why Marvin Minsky has been accused of slandering neural nets unfairly, and hindering AI research. In his book _Perceptrons_ he demonstrated a simple problem that a trivial (one or two layers with no feedback) NN can't solve. A lot of scientists wrote off Neural Nets just as you have, because a toy was the only tool used. Never mind the fact that an only slightly more complex NN can solve such a problem easily. I find it telling that for a human to solve the same problem, one has to construct a strategy to do it. Not the sort of thing I'd assume any extremely simple machine could do. These days Minsky complains that AI isn't trying to build human brains. He's a brilliant man, but (as with many famous people) his chutzpah occasionally outstrips his judgement. I only wish that great scientists were immune to this.

Lots of less qualified people complain that neural nets aren't useful because they have some unpleasant experience with them. They have no idea of the variety of neural nets. It's like using a Playstation and complaining that computers are not useful.

As for spam filtering with AI, unless you have the narrow definition of AI, the Bayesian techniques of SpamAssassin are AI, as is the Latent Semantic Analysis done by OSX mail.app for spam filtering. LSA, while computationally expensive on a PC, is regarded as equivalent to a particular type of 3 layer neural net, (see Kohonen self-organizing maps.)

One thing you have right. Neural nets are "no new thing." They're as old as biological brains. Novelty is not a criterion for usefulness.

--
Assembly is the reverse of disassembly.
