Fighting Spam with DNA Sequencing Algorithms

Posted by ryuzaki0 on Sunday August 22, 2004 @01:05AM from the crushing-the-mouse-with-a-mallet dept.

Christopher Cashell writes "According to this article from NewScientist, IBM's Anti-Spam Filtering Research Project has started testing a new spam filtering algorithm, an algorithm originally designed for DNA sequence analysis. The algorithm has been named Chung-Kwei (after a feng-shui talisman that protects the home against evil spirits). Justin Mason, of SpamAssassin, is quoted as saying that it looks promising. A paper is available on the algorithm, too (PDF)."

16 of 142 comments

Wordfilter by bert.cl · 2004-08-22 01:18 · Score: 3, Insightful

While the numbers are impressive, doesn't this just look like a filter that does combined word searches?
Even with training, isn't this just some regexps searching for particular strings?

And what about short messages that don't use as many words: is the spam score relative or absolute? The article is a little low on details. Can anybody point to some more informative articles?
Mozilla Firefox by nycsubway · 2004-08-22 01:21 · Score: 2, Insightful

I have to say the adaptive spam filter in Firefox works pretty darn well. I have tried other adaptive spam filters as plugins in Outlook and they work pretty darn well too.

With the nature of new spam messages that look like real emails, the only person who can really tell if something is spam is the recipient.

--
http://github.com/gbook/nidb
1. Re:Mozilla Firefox by rokzy · 2004-08-22 01:30 · Score: 2, Insightful
  
  I've had mixed results with Thunderbird. In the beginning it seemed to work great, then I noticed it was junking all my legitimate email too. Then I fixed that, but it started letting through blatantly obvious stuff.
  
  The newest version has been doing better so far.
  
  I think my problem is that my rate of email is quite low, so it's difficult to train. I'd like it if there were a database where, if a subject header is reported as spam by one user, it affects other users' scoring.
Misnomer, it's not "fighting spam"... by argent · 2004-08-22 01:30 · Score: 1, Insightful

This isn't "fighting spam", it's "adapting to spam".
1. Re:Misnomer, it's not "fighting spam"... by argent · 2004-08-22 02:28 · Score: 5, Insightful
  
  As more and more people begin to use spam filtering (especially on the server level), spam's effectiveness will decrease.
  
  People have been improving filtering, and the spammers just pump up the volume. As filtering improves, the delivery rate goes down, but so does the complaint rate so they end up being able to pump more spam before they're detected.
  
  I've been watching this arms race for almost a decade, and the advantage is still on the spammers' side. At the moment I'm blocking between 10,000 and 20,000 connections a day just on the basis of their IP address (including blocks against entire countries), another 3,000-5,000 using a greylist/honeypot app I'm working on, and I'm still getting one or two hundred messages per day hitting my procmailrc. A few years back, when I was getting a few hundred spams a day without all those RBLs and personal blacklists, people were all excited about how Bayesian filters were gonna make spam uneconomical... and I made the same comment back then. Now I'm filtering a couple of hundred times more efficiently and effectively and I'm still getting almost the same volume.
  
  I don't see anything different this time. You can't fight spam with filters, all you can do is adapt to it.
Re:hm by Pigbot · 2004-08-22 01:35 · Score: 5, Insightful

wonder what the spammers will come up with to get around this...

Of course. Spam is a moving target. Given that it is cheaper to create spam than to block spam, it will always be an uphill battle.

Lately, much of the spam I have been getting in my Inbox (squirrelmail/spamassassin) has been email that has no typos, no random text, no blatant "click here" lines and looks like normal mail. Except they are trying to sell me something.

--
print "Oink!\n" if ( $tail =~ "pull" );
Works until the Spammers get a copy of it by G4from128k · 2004-08-22 01:53 · Score: 4, Insightful

This is interesting and promising technology. But like all antispam techniques, spammers will find a way around it. Once spammers get a copy of the software, they can create and test countermeasures in the comfort of their own sleazy lairs.

For example, the article mentions the software accepts a message that is long but has a few "spammy" sequences. This suggests an immediate countermeasure of adding bulk to spam -- appending a copy of some news article to the spammy payload (some already do this).
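
As a rough illustration of why padding helps the spammer, here is a minimal Python sketch. It assumes the filter's score is relative to message length, which the article only hints at. The pattern list `SPAMMY` and both scoring functions are hypothetical and are not Chung-Kwei's actual patterns or scoring.

```python
# Hypothetical pattern list; NOT Chung-Kwei's actual patterns or scoring.
SPAMMY = {"viagra", "lottery", "winner"}

def absolute_score(text):
    """Count spammy tokens, regardless of message length."""
    return sum(1 for word in text.lower().split() if word in SPAMMY)

def relative_score(text):
    """Spammy tokens as a fraction of all tokens in the message."""
    words = text.lower().split()
    return absolute_score(text) / len(words) if words else 0.0

payload = "viagra lottery winner"
# Bulk the message up to 100 tokens with innocuous filler.
padded = payload + " " + " ".join(["news"] * 97)

print(absolute_score(payload), relative_score(payload))  # 3 1.0
print(absolute_score(padded), relative_score(padded))    # 3 0.03
```

The absolute count is unchanged by the padding, while the relative score drops from 1.0 to 0.03, so any threshold on the relative score can be slipped under by adding bulk.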

Personally, I've always thought that a simple spell check would do a good job as another layer of filtering. It would place spammers in a no-win situation -- either the keyword filter or the spell check filter would get them.

--
Two wrongs don't make a right, but three lefts do.
1. Re:Works until the Spammers get a copy of it by Tim+C · 2004-08-22 02:54 · Score: 2, Insightful
  
  in theory, closed-source software that isn't available for free download and in open-source version should be more effective against spam.
  
  How so?
  
  1) install software
  2) treat as black box
  3) spam spam spam
  4) see what gets through
  5) study, enhance
  6) goto 3)
  
  Just because you can't see how it works, doesn't mean you can't teach yourself how to get around it.
  
  --
  It's official. Most of you are morons.
2. Re:Works until the Spammers get a copy of it by Tablizer · 2004-08-22 10:12 · Score: 2, Insightful
  
  Personally, I've always thought that a simple spell check would do a good job as another layer of filtering.
  
  Then 3/4 of slashdotters wouldn't be able to get their messages through to anybody :-)
  
  --
  Table-ized A.I.
Interesting... Electronic evolution... by dnaboy · 2004-08-22 02:09 · Score: 5, Insightful

I think it's really interesting to watch the literal evolution of spam and spam filters. There are really amazing parallels to biological evolution.
First, there's a constant tuning of both predator and prey (Anti-spam tools and spam).
Second, there seems to be some sort of equilibrium which is inevitably achieved, and
Third, there are occasional discreet major developments which change the game. This would be an example. Now, spam is going to be forced to majorly adapt.
I could see the 'Quality' of spam improving a lot as a result of tools like this. No more letters from my long lost benefactors in Nigeria, and no one-liners about 'Gushing like a firehose' (My coworkers and I got a good chuckle out of that one), but, as the story said, if you have keywords in a long email, it gets penalized far less. OK. Attach verses from Dante's Inferno, or Joyce's Dubliners to the email. Problem solved. You can't block words like Viagra altogether or Pfizer researchers are going to have a hell of a time getting anything through.
Another concern is that if this forces spammers to make up new and compelling spam, people will be more likely to check it out. While my parents are probably pretty confident they didn't win a secret lottery 3 or 4 times last week, they might possibly believe new and creative stories.
Perhaps evolution of email readers is just plain going to be a necessary part of the solution...
1. Re:Interesting... Electronic evolution... by devphil · 2004-08-22 08:56 · Score: 2, Insightful
  
  First, there's a constant tuning of both predator and prey
  
  Absolutely. Unfortunately, as most predator-prey models will tell you, neither population ever goes to zero unless something catastrophic happens. And in this case, catastrophe is precisely what we want to happen to the prey.
  
  (If they'd simply implement my proposed scheme of a bullet to the head of every spammer, no mercy, no appeal, it'd be easy. But noooo, "spammers are human beings no matter how useless and harmful they are," waaaaah.)
  
  there are occasional discreet major developments
  
  Um. "Discrete" is the word you want. Spammers are anything but discreet. :-)
  
  --
  You cannot apply a technological solution to a sociological problem. (Edwards' Law)
Corrections... by littlewild · 2004-08-22 02:26 · Score: 3, Insightful

Chung-Kwei is a Chinese semi-deity that wards off evil. He isn't some kind of talisman.
Re:Feng Shui hardware by DNS-and-BIND · 2004-08-22 02:40 · Score: 1, Insightful

It's hardly appropriate that such superstition should be given encouragement in this day and age. Penn & Teller did a great bit on "feng shui" on their show, "Bullshit!". They had 3 different feng shui consultants come in to a house, and each one recommended different changes for different reasons. Some discipline.

--
Shutting down free speech with violence isn't fighting fascism. It IS fascism!
Re:Stop This B\/llsh!t Filtering Crap by mikael · 2004-08-22 03:50 · Score: 2, Insightful

Hell, spam has gotten so sophisticated that sometimes even after reading the whole message I still don't know if the e-mail is a legitimate one from my bank, stock broker, etc.

If, after reading the E-mail, you still don't know what product the spam is advertising, then the spammers are losing, since those E-mails will not lead to a sale, and the spammers are simply wasting their own bandwidth.

--
Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
Re:This is all bull -- Change the law by koreth · 2004-08-22 06:54 · Score: 2, Insightful

This isn't going to work -- you simply can't solve a social / legal problem with technology.

You'll be buying all your doors without locks from now on, I take it, since burglary is a social/legal problem and the government has passed laws against it. Let us know how that goes.
Serious methodological flaws by YU+Nicks+NE+Way · 2004-08-22 12:56 · Score: 3, Insightful

It sounds like a great paper until you get down into the guts of their materials and methods. They trained their system on half of their total data, and did not then test on separate data. That captures the two classic no-nos of data-driven techniques: they inflate their results by including their training data in the results, and, worse, their training data comprises a larger sample of their total data than would be seen in the real world.
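
For contrast, a held-out evaluation keeps the test messages disjoint from the training messages. The Python sketch below is only an illustration under that assumption: `split_train_test` and `false_negative_rate` are hypothetical helper names, not anything taken from the paper.

```python
import random

def split_train_test(messages, train_fraction=0.5, seed=0):
    """Shuffle a copy of the messages and split it into disjoint train/test lists."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def false_negative_rate(predicts_spam, spam_messages):
    """Fraction of known-spam messages that the classifier lets through."""
    if not spam_messages:
        return 0.0
    missed = sum(1 for msg in spam_messages if not predicts_spam(msg))
    return missed / len(spam_messages)

# Train only on `train`; report error rates only on `test`.
spam = [f"spam message {i}" for i in range(10)]
train, test = split_train_test(spam)
assert set(train).isdisjoint(test)
```

Reporting `false_negative_rate` on `test` alone avoids the inflation described above, because no scored message was ever seen during training.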

The first of these calls their sensitivity result into question. If they classify their training data perfectly, then the 4.4% false negative rate they quote needs to be doubled to 8.8% -- almost one false negative in every eleven messages scanned.
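
The doubling can be checked with a few lines of Python. This is a sketch under the stated assumptions only: the pooled rate is 4.4%, half of the scored data was training data, and the training half is classified perfectly. The function name `held_out_fn_rate` is made up for the illustration.

```python
def held_out_fn_rate(pooled_rate, train_fraction, train_rate=0.0):
    """False-negative rate on the unseen data implied by a pooled rate.

    pooled_rate = train_fraction * train_rate
                  + (1 - train_fraction) * unseen_rate
    """
    unseen_fraction = 1.0 - train_fraction
    return (pooled_rate - train_fraction * train_rate) / unseen_fraction

rate = held_out_fn_rate(pooled_rate=0.044, train_fraction=0.5)
print(f"{rate:.1%}")       # 8.8%
print(round(1 / rate, 1))  # 11.4, i.e. roughly one miss per eleven messages
```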

The second of these calls their false positive rate into question: training with an unrealistically thorough set leads to better categorization, ceteris paribus. They need to show the trend with a variety of different training set sizes to support any claims about performance.
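
A trend over training set sizes is usually shown as a learning curve. The Python sketch below illustrates the idea with a deliberately naive keyword classifier and made-up messages. None of it reflects the paper's algorithm or data, and all function names are hypothetical.

```python
def learning_curve(train, test, fit, error_rate, fractions=(0.25, 0.5, 1.0)):
    """Train on growing prefixes of `train`; always score on the same `test`."""
    results = []
    for fraction in fractions:
        n = max(1, int(len(train) * fraction))
        model = fit(train[:n])
        results.append((n, error_rate(model, test)))
    return results

def fit_keywords(examples):
    """Toy model: the words seen in spam but never in legitimate mail."""
    spam_words, ham_words = set(), set()
    for text, is_spam in examples:
        (spam_words if is_spam else ham_words).update(text.split())
    return spam_words - ham_words

def keyword_error_rate(keywords, examples):
    """Fraction of examples whose predicted label differs from the true one."""
    wrong = sum(
        1 for text, is_spam in examples
        if any(word in keywords for word in text.split()) != is_spam
    )
    return wrong / len(examples)

# Made-up (text, is_spam) pairs, purely for illustration.
train = [("cheap pills", True), ("meeting notes", False),
         ("win lottery", True), ("lunch today", False)]
test = [("win big", True), ("cheap pills", True), ("notes today", False)]

for size, err in learning_curve(train, test, fit_keywords, keyword_error_rate):
    print(size, round(err, 2))
```

If the error on the fixed test set is still falling at the largest training size, a result quoted at that size says little about performance with a smaller, more realistic training set.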

This sounds like a fully buzzword compliant non-result to me.