Fighting Spam with DNA Sequencing Algorithms
Christopher Cashell writes "According to this article from NewScientist, IBM's Anti-Spam Filtering Research Project has started testing a new spam filtering algorithm, an algorithm originally designed for DNA sequence analysis. The algorithm has been named Chung-Kwei (after a feng-shui talisman that protects the home against evil spirits). Justin Mason, of SpamAssassin, is quoted as saying that it looks promising. A paper is available on the algorithm, too (PDF)."
Even with training, isn't this just some regexp and searchting after particular strings.
And what about short messages, that don't use as much words, is the spamscore relative or absolute? The article is a little low on details, anybody who can point to some more informative articles?
I have to say the adaptive spam filter in Firefox works pretty darn well. I have tried other adaptive spam filters as plugins in Outlook and they work pretty darn well too.
With the nature of new spam messages that look like real emails, the only person who can really tell if something is spam is the recipient.
http://github.com/gbook/nidb
This isn't "fighting spam", it's "adapting to spam".
wonder what the spammers will come up with to get around this...
Of course. Spam is a moving target. Given that it is cheaper to create spam than to block spam, it will always be an uphill battle.
Lately, much of the spam I have been getting in my Inbox (squirrelmail/spamassassin) has been email that has no typos, no random text, no blatent "click here" lines and looks like normal mail. Except they are trying to sell me something.
print "Oink!\n" if ( $tail =~ "pull" );
This is interesting and promising technology. But like all antispam techniques, spammers will find a way around it. Once spammers get a copy of the software, they can create and test countermeasures in the comfort of their own sleazy lairs.
For example, the article mentions the software accepts a message that is long but has a few "spammy" sequences. This suggests an immediate countermeasure of adding bulk to spam -- appending a copy of some news article to the spammy payload (some already do this).
Personally, I've always thought that a simple spell check would do a good job as another layer filtering. It would place spammers in a no-win situation -- either the keyword filter or the spell check filter would get them.
Two wrongs don't make a right, but three lefts do.
First, there's a constant tuning of both preditor and prey (Anti-spam tools and spam).
Second, there seems to be some sort of equilibrium which is inevitably achieved, and
Third, there are occasional discreet major developments which change the game. This would be an example. Now, spam is going to be forced to majorly adapt.
I could see the 'Quality' of spam improving a lot as a result of tools like this. No more letters from my long lost benefactors in nigeria, and no one liners about 'Gushing like a firehose' (My coworkers and I got a good chuckle out of that one), but, as the story said, if you have keywords in a long email, it gets far less penalized. OK. Attach verses from Dante's Inferno, or Joyce's Dubliners to the email. Problem solved. You can't block words like viagra altogether or Pfizer researchers are going to have a hell of a time getting anything through.
Another concern is that if this forces spammers to make up new and compelling spam, people will be more likely to check it out. While my parents are probably pretty confident they didn't win a secret lottery 3 or 4 times last week, they might possibly believe new and creative stories.
Perhaps evolution of email readers is just plain going to be a neccessary part of the solution...
Chung-Kwei is a Chinese semi-deity that wards of evil. He isn't some kind of tailsman.
It's hardly appropriate that such superstition should be given encouragement in this day and age. Penn & Teller did a great bit on "feng shui" on their show, "Bullshit!". They had 3 different feng shui consultants come in to a house, and each one recommended different changes for different reasons. Some discipline.
Shutting down free speech with violence isn't fighting fascism. It IS fascism!
Hell, spam has gotten so sophisticated that sometimes even after reading the whole message I still don't know if the e-mail is a legitimiate one from my bank, stock broker, etc.
If after reading the E-mail, you still don't know what product the spam is advertising, then the spammers are losing, since those E-mail's will not lead to a sale, and the spammers are simply wasting their own bandwidth.
Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
You'll be buying all your doors without locks from now on, I take it, since burglary is a social/legal problem and the government has passed laws against it. Let us know how that goes.
It sounds like a great paper until you get down into the guts of their materials and methods. They trained their system on half of their total data, and did not then test on separate data. That captures the two classic no-nos of data driven techniques: they inflate their results by including their training data in the results, and, worse, their training data comprises a larger sample of their total data than would be seen in the real world.
The first of these calls their sensitivity result into quesiton. If they classify their training data perfectly, then the 4.4% false negative rate they quote needs to be doubled to 8.8% -- almost one false negative in every eleven messages scanned.
The second of these calls their false positive rate into question: training with an unrealistically thorough set leads to better catergorization, ceteris paribus. They need to show the trend with a variety of different training set sizes to support any claims about performance.
This sounds like a fully buzzword compliant non-result to me.