Spam Detection Using an Artificial Immune System

Posted by ryuzaki0 on Monday July 10, 2006 @09:04AM from the lymp0cty3z-narf-poit!-claire-said-the-laundry-wheel dept.

rangeva writes "As anti-spam solutions evolve to limit junk email, the senders quickly adapt to make sure their messages are seen. An interesting article describes the application of an artificial immune system model to effectively protect email users from unwanted messages. In particular, it tests a spam immune system against the publicly available SpamAssassin corpus of spam and non-spam. It does so by classifying email messages with the detectors produced by the immune system. The resulting system classifies the messages with accuracy similar to that of other spam filters, but it does so with fewer detectors."

The utility of newer systems by CRCulver · 2006-07-10 09:08 · Score: 3, Informative

I have to admit, I don't see the need for these recent whizbang additions to the spam-fighting repertoire. Sure, they might be ingenious, but on a practical level they don't do anything more than a properly configured SpamAssassin system. I used to get a lot of spam coming through a default installation of SpamAssassin, but after spending some time with O'Reilly's book (the free docs may already be up to this level of reader-friendliness, it's been a couple of years) and tweaking my installation, I get spam once in a blue moon. There's just no need for anything more.
Not much by jfengel · 2006-07-10 09:28 · Score: 5, Informative

Ultimately, very little. At core, they're probably identical techniques, and if I were reviewing this as a scientific paper I'd ding them for not answering exactly that question. There are such strong parallels between the two (train them on known data, add up probabilities, cut stuff on a threshold) that I strongly suspect that they're identical.

There are useful things to be gained from a change of metaphor. For example, one difference between this and most Bayesian spam filter implementations is that this explicitly incorporates a decay function. That could be useful if a word that used to be common in spam no longer is (e.g. if I actually decided to buy a Rolex, it's no longer a strong spam indicator, whereas right now any email mentioning "Rolex" is 99.9999% certain to be spam).

You could easily modify a Bayesian filter to have time-decaying weights, but if the change in metaphor leads somebody to come up with a good insight, then perhaps this is useful. Mathematically, though, the equations look very similar.
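
As a rough illustration of the time-decaying weights mentioned above, here is a minimal Python sketch. It is not the article's decay function or SpamAssassin's implementation; the class name, the `half_life_days` parameter, and the add-one smoothing are all assumptions made for the example.

```python
import math


class DecayingTokenStats:
    """Per-token spam/ham counts that fade exponentially with age (illustrative)."""

    def __init__(self, half_life_days=30.0):
        # half_life_days is an assumed tuning knob, not a value from the article.
        self.decay_rate = math.log(2) / half_life_days
        self.counts = {}  # token -> [spam_count, ham_count, last_update_day]

    def _decayed(self, token, now):
        """Return (spam, ham) counts for token, decayed to time `now` (in days)."""
        spam, ham, last = self.counts.get(token, [0.0, 0.0, now])
        factor = math.exp(-self.decay_rate * max(0.0, now - last))
        return spam * factor, ham * factor

    def train(self, token, is_spam, now):
        """Record one sighting of token in a spam or non-spam message."""
        spam, ham = self._decayed(token, now)
        if is_spam:
            spam += 1.0
        else:
            ham += 1.0
        self.counts[token] = [spam, ham, now]

    def spam_probability(self, token, now):
        """Smoothed estimate of P(spam | token); 0.5 means no evidence."""
        spam, ham = self._decayed(token, now)
        return (spam + 1.0) / (spam + ham + 2.0)


stats = DecayingTokenStats(half_life_days=30.0)
for _ in range(10):
    stats.train("rolex", is_spam=True, now=0.0)

p_fresh = stats.spam_probability("rolex", now=0.0)    # strong spam indicator
p_stale = stats.spam_probability("rolex", now=300.0)  # evidence has mostly faded

stats.train("rolex", is_spam=False, now=300.0)        # one legitimate Rolex email
p_after_ham = stats.spam_probability("rolex", now=300.0)
```

With these assumed settings, ten old spam sightings of "rolex" carry almost no weight after ten half-lives, so a single recent legitimate message is enough to pull the estimate below 0.5.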
1. Re:Not much by Anonymous Coward · 2006-07-10 18:57 · Score: 1, Informative
  
  Ah, commenting on things we don't know like we are an expert.... just like telling the doctor that a cold and flu must be the same because they have similar symptoms....
  
  Here, let me clarify the differences for you. The primary difference is in the nature of the tokens used to classify a message. The Bayesian system has words/tokens that are either predefined by a human or taken from messages verbatim. The artificial immune system has tokens that are randomly and automatically generated by the system using some method. In this case, the method is to build a regular expression from a pre-defined list of stubs. Random generation of genes (tokens) is one of the basic tenets of AIS. This random generation has some positives (generating antibodies for unseen pathogens) and negatives (generating antibodies that attack self). In comparison, a Bayesian classifier's tokens are all pre-existing: either from human input or from the message. The AIS is a more dynamic and adaptive approach because it can contain tokens that match unseen values.
  
  Secondly, you're confusing the intended use of the collection of statistics. Both have a similar method for collecting statistics (mark how many spam and non-spam messages the token matches), but for entirely different purposes. For a Bayesian classifier, the statistics are used to assign a probability that the word indicates a spam message. The overall spam weight for the message is the sum of the probabilities of each individual word (and some also use the position of the word in the message to affect the probability as well). For AIS, the statistics are used to measure the worthiness of the detector. If a detector is detecting too many normal messages, it is not a good detector (think auto-immune diseases in humans) and should be given very little credibility. An AIS usually includes a very vital stage called negative selection where detectors that react to "self" (non-spam messages) are eliminated. This work seems to not include a full negative selection algorithm, where the bad detector is thrown out and replaced with a new one, and instead does something more akin to Bayesian classifiers, i.e., sum probabilities and update the statistics based on live data (classic AIS has static detectors once they leave the training phase). They also seem to rely on chance to create memory cells rather than an affinity maturation process using genetic algorithms or other techniques. But the training phase is still intended to judge the worthiness of the candidate detectors in their AIS. They just don't follow through completely on this AIS technique.
  
  I would say this work is a very simplified AIS as it does not include all of the hallmarks of other AIS. As such, it does have some superficial resemblance to a Bayesian classifier, particularly the modified classifier algorithms used in the spam tools. But it would not take much modification to include more AIS features in their system to make it more distinct from (and hopefully more accurate than) Bayesian classifiers. It's unclear if the authors skipped the more advanced features due to a desire for low computational costs. It is, however, distinct from classifiers in token generation at the very least and could be converted to a more traditional AIS with a little work.
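
The two ideas in the reply above, random detector generation from stubs and classic negative selection, can be sketched in a few lines of Python. This is a toy illustration only: the stub list, the wildcard gap, and the function names are hypothetical, and per the reply the article's system apparently does not perform full negative selection.

```python
import random
import re

# Hypothetical stub list; the article's actual stubs are not reproduced here.
STUBS = ["viagra", "rolex", "free", "offer", "click", "meeting", "cheap"]


def random_detector(rng, stubs=STUBS, parts=2):
    """Build a regex detector by joining randomly chosen stubs with a short gap."""
    chosen = rng.sample(stubs, parts)
    pattern = r".{0,20}".join(re.escape(stub) for stub in chosen)
    return re.compile(pattern, re.IGNORECASE | re.DOTALL)


def negative_selection(candidates, ham_messages, max_ham_hits=0):
    """Discard detectors that match more than max_ham_hits non-spam ("self") messages."""
    survivors = []
    for detector in candidates:
        ham_hits = sum(1 for msg in ham_messages if detector.search(msg))
        if ham_hits <= max_ham_hits:
            survivors.append(detector)
    return survivors


def is_spam(message, detectors):
    """Flag a message if any surviving detector matches it."""
    return any(detector.search(message) for detector in detectors)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
candidates = [random_detector(rng) for _ in range(50)]
ham = ["Are you free for a meeting today?", "Click reply to confirm the offer letter."]
detectors = negative_selection(candidates, ham)
```

A Bayesian filter would instead keep every token and weight it by its spam/ham ratio; here a detector that reacts to "self" is simply thrown away, which is the distinction the reply draws.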
SpamAssassin does "decay" them. by khasim · 2006-07-10 09:40 · Score: 2, Informative

Look up "bayes_expiry_max_db_size". If your database gets larger than the limit you set then the lesser used tokens are deleted.
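
For reference, a configuration fragment along these lines would go in SpamAssassin's `local.cf`. The value shown is the documented default as far as I recall, so treat it as illustrative and check the `Mail::SpamAssassin::Conf` documentation for your version.

```
# local.cf (sketch; value is illustrative, verify against your version's docs)
bayes_auto_expire        1        # let SpamAssassin expire old tokens automatically
bayes_expiry_max_db_size 150000   # expiry is considered once the token count passes this
```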