Fighting Spam with DNA Sequencing Algorithms

Feng Shui hardware by simp · 2004-08-22 01:12 · Score: 5, Funny

Excellent! This will go well with my Feng Shui compliant wall of rocks that I use as a firewall.

Re:Feng Shui hardware by Pigbot · 2004-08-22 01:54 · Score: 4, Funny

Considering how much spam I get trying to sell me Viagra or porn, I have reservations about using someone's DNA to fight spam. It just sounds dirty. And sticky. Like someone should at least buy me dinner first.

--
print "Oink!\n" if ( $tail =~ "pull" );
Re:Feng Shui hardware by BJH · 2004-08-22 02:02 · Score: 5, Informative

If I'm not mistaken, Chung Kwei is the figure known as Shouki in Japanese. He's usually described in English as the "Demon Queller", which seems a suitable-enough symbol for an anti-spam program.

I mean, come on - don't anti-spam programs have the coolest names? SpamAssassin, Vipul's Razor...

Wordfilter by bert.cl · 2004-08-22 01:18 · Score: 3, Insightful

While the numbers are impressive, this just looks like a filter that does combined wordsearches?

Even with training, isn't this just some regexps searching for particular strings?

And what about short messages that don't use as many words: is the spam score relative or absolute? The article is a little low on details; can anybody point to some more informative articles?

Re:Wordfilter by rokzy · 2004-08-22 01:23 · Score: 2, Interesting

91% detection is far from impressive. AFAIK the better filters today are 99.9% successful. the benefit of this one is its low false-positive rate.

personally I'd prefer a much better set of filter tools e.g. being able to say "I only speak English, I NEVER use this account for commerce, and the people I email are professionals so score spelling mistakes much higher as probable spam".

can someone point me in the direction of such a filter?
Re:Wordfilter by Incadenza · 2004-08-22 01:33 · Score: 4, Informative

personally I'd prefer a much better set of filter tools e.g. being able to say "I only speak English, I NEVER use this account for commerce, and the people I email are professionals so score spelling mistakes much higher as probable spam".

can someone point me in the direction of such a filter?

How about spamassassin?
Just add the following to /etc/mail/spamassassin/local.cf:
ok_languages en
And increase the score for BIZ_TLD and other tests you find more important than others. Scoring per test is fully configurable, complete list of tests here.

Mozilla Firefox by nycsubway · 2004-08-22 01:21 · Score: 2, Insightful

I have to say the adaptive spam filter in Firefox works pretty darn well. I have tried other adaptive spam filters as plugins in Outlook and they work pretty darn well too.

With the nature of new spam messages that look like real emails, the only person who can really tell if something is spam is the recipient.

--
http://github.com/gbook/nidb

Re:Mozilla Firefox by rokzy · 2004-08-22 01:30 · Score: 2, Insightful

I've had mixed results with Thunderbird. in the beginning it seemed to work great, then I noticed it was junking all my legitimate email too. then I fixed that but it started letting through blatantly obvious stuff.

the newest version has been doing better so far.

I think my problem is my rate of email is quite low so it's difficult to train. I'd like it if there could be a database where if a subject header is reported as spam by one user it affects other users' scoring.
Re:Mozilla Firefox by danharan · 2004-08-22 01:33 · Score: 2, Interesting

I think you mean Thunderbird.

My experience with it has been rather disappointing. Why I need to tag as spam two messages from the same sender or with the exact same subject is a mystery to me. After the 10th "Make $\d+ in XX days" type message one has to wonder just how effective this thing is.

This method is promising because it uses spell-checking and a better way to identify spammy string sequences, something neither of the two main camps of spam filters has seemed keen to do until now.

--
Information: "I want to be anthropomorphized"
Re:Mozilla Firefox by littlem · 2004-08-22 02:21 · Score: 3, Interesting

My experience with it has been rather disappointing. Why I need to tag as spam two messages from the same sender or with the exact same subject is a mystery to me. After the 10th "Make $\d+ in XX days" type message one has to wonder just how effective this thing is.

This shouldn't be all that surprising - Bayesian filtering is all based on probabilities. The reason "Outlook message rules" is so bad is because a friend of mine might send me a joke about Viagra, which I don't want to have deleted indiscriminately as spam. False positives are infinitely more annoying than false negatives, so I'd much rather have conservative filtering that let a bit of spam through.

I'm not saying Bayesian algorithms are perfect yet (though they'll improve) - my personal experience has been SpamAssassin, which got 97% of spam, and I've been experimenting with Thunderbird for a week, which gets 85%-90% and will no doubt get much much better as I train it in the next couple of weeks - but ultimately Bayesian filtering is enough to beat enough spam to make spamming not worthwhile (if everyone did it...)
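
To make "based on probabilities" concrete, here is a minimal naive Bayes sketch in Python. It is not how SpamAssassin or Thunderbird actually implement their filters; the tiny training sets, the helper names `train` and `spam_probability`, the smoothing constants and the equal-priors assumption are all made up for illustration.

```python
import math
from collections import Counter

def train(messages):
    """Count in how many messages each lowercase token appears."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts

def spam_probability(text, spam_counts, ham_counts, n_spam, n_ham):
    """Combine per-token evidence in log-odds space (naive Bayes, equal priors)."""
    log_odds = 0.0
    for token in set(text.lower().split()):
        # Add-one smoothing so unseen tokens never produce a zero probability.
        p_spam = (spam_counts[token] + 1) / (n_spam + 2)
        p_ham = (ham_counts[token] + 1) / (n_ham + 2)
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical training mail: note that "viagra" also appears in one ham message.
spam = ["buy viagra now", "cheap viagra online", "buy cheap pills now"]
ham = ["lunch tomorrow at noon", "heard a good viagra joke at lunch", "meeting notes attached"]
spam_counts, ham_counts = train(spam), train(ham)

joke = spam_probability("a viagra joke for lunch", spam_counts, ham_counts, len(spam), len(ham))
pitch = spam_probability("buy cheap viagra now", spam_counts, ham_counts, len(spam), len(ham))
print(f"joke from a friend: {joke:.2f}, sales pitch: {pitch:.2f}")
```

The point of the example is the one made above: a single spammy word does not condemn a message, because the surrounding tokens pull the combined score back down, which a hard keyword rule cannot do.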
Re:Mozilla Firefox by aussie_a · 2004-08-22 02:31 · Score: 3, Funny

I agree. The Mozilla Firefox spam filter works great for me. I no longer go to all those goatse sites that people link to thanks to the plugin :) But I have to keep uninstalling and reinstalling it, because after 2 days it says slashdot is spam.
Re:Mozilla Firefox by toxic666 · 2004-08-22 02:39 · Score: 2, Interesting

"I" being the key word in your assessment. Fine for the home user, not so good for a business.

Maintaining an enterprise mail system based upon user-controlled spam filtering software is not practical. That small percentage of users with consistent ID 10T errors adds up fast. Try correcting false positives for a user-configured filter. It's time-consuming.

The better approach from an administrative standpoint is controlling spam at the MTA- and MDA- levels of the mail server. I use postfix checks with MDA-level Bayesian filtering with reasonable success. The spam mbox is comprised of user-submitted and administratively approved mail. The user submits it, and the admin checks for things like filter poisoning text before moving it to the real spam mbox.

Most importantly, my false-positive rate is extremely low -- probably 10's of thousandths of a percent.
Re:Mozilla Firefox by Technonotice_Dom · 2004-08-22 03:43 · Score: 3, Informative

I'd like it if there could be a database where if a subject header is reported as spam by one user it affects other users' scoring.

There are a few databases out there that take hashes of spam e-mails (either sent to spam traps or reported) and use them for spam tagging. SpamAssassin can use their client programs to help tag messages also - I don't know if there's an extension or anything for Thunderbird, I don't use it.

The three that come to mind are DCC, Razor and Pyzor.

All have their advantages or disadvantages, but you have to remember that you're relying on somebody else's judgement. I think it's DCC that you can easily configure to say that you need x reports of the message before you class the message as spam, which gives you more control. But you only need one person who doesn't use it correctly to ruin the system and introduce lots of false positives.
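
A rough sketch of the "need x reports before classing it as spam" idea, in Python. This is not the DCC, Razor or Pyzor protocol (those use their own fuzzy checksums and network clients); the `ReportDatabase` class, the whitespace/case normalisation and the SHA-256 fingerprint are assumptions chosen only to show how a report threshold limits the damage one careless reporter can do.

```python
import hashlib

class ReportDatabase:
    """Toy shared database: tracks distinct reporters per message fingerprint."""

    def __init__(self, min_reports):
        self.min_reports = min_reports
        self.reporters = {}  # fingerprint -> set of reporter ids

    @staticmethod
    def fingerprint(body):
        # Normalise case and whitespace so trivially reformatted copies collide.
        normalised = " ".join(body.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def report(self, body, reporter):
        self.reporters.setdefault(self.fingerprint(body), set()).add(reporter)

    def is_spam(self, body):
        seen = self.reporters.get(self.fingerprint(body), set())
        return len(seen) >= self.min_reports

db = ReportDatabase(min_reports=3)

# One user wrongly reports a legitimate newsletter: below the threshold, no effect.
db.report("Monthly newsletter: project updates", "careless_user")

# Three different users report the same spam, with minor formatting differences.
db.report("Make money FAST", "alice")
db.report("make   money fast", "bob")
db.report("Make money fast\n", "carol")

print(db.is_spam("Monthly newsletter: project updates"))  # below threshold
print(db.is_spam("MAKE MONEY FAST"))                      # threshold reached
```

Counting distinct reporters rather than raw reports is a design choice here, so that one user reporting the same message repeatedly cannot reach the threshold alone.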

You could always set up SpamAssassin on your local machine and proxy messages through that.

High tech for what ? by Ozh · 2004-08-22 01:21 · Score: 3, Interesting

Funny how some people develop more and more sophisticated stuff to fight against something that is just as simple as sending out emails to random addresses... and so simple that it will never stop :/

Thunderbird by bert.cl · 2004-08-22 01:23 · Score: 2, Informative

I think you mean Mozilla Thunderbird?

Bayesian Still Works by Admiral+Justin · 2004-08-22 01:30 · Score: 4, Funny

For now, Bayesian filtering still gets the job done most of the time, so I think we shouldn't get too excited.

Besides, you have to ask yourself some questions...

"What happens if you try to filter spam with RNA?"

"Just how good can ACT and G manage spam?"

and, most important of all...

"Are you sure this spam filter uses no portion of Keanu Reeves' genetic code?"

--
You will be baked, and there will be cake.

Re:hm by Pigbot · 2004-08-22 01:35 · Score: 5, Insightful

wonder what the spammers will come up with to get around this...

Of course. Spam is a moving target. Given that it is cheaper to create spam than to block spam, it will always be an uphill battle.

Lately, much of the spam I have been getting in my Inbox (squirrelmail/spamassassin) has been email that has no typos, no random text, no blatant "click here" lines and looks like normal mail. Except they are trying to sell me something.

--
print "Oink!\n" if ( $tail =~ "pull" );

Love SA... by ajs · 2004-08-22 01:41 · Score: 5, Informative

You have to love SpamAssassin for its very Perlish approach to spam filtering... "hey, there's a cool new way to filter spam... throw it in!"

I love this mostly because it means that SA is a moving target. Spammers can figure out how to defeat pieces of it, but it deploys a wide range of static, dynamic, network-based and user-driven tests that change so much that spammers simply can't afford to keep up.

Re:hm by Proud+like+a+god · 2004-08-22 01:45 · Score: 2, Informative

Lately, much of the spam I have been getting in my Inbox (squirrelmail/spamassassin) has been email that has no typos, no random text, no blatant "click here" lines and looks like normal mail. Except they are trying to sell me something.

You lucky g*t! :-P

The biggest problem I see, at the moment.... by Rahga · 2004-08-22 01:46 · Score: 3, Interesting

It looks like much of the spam I'm receiving today consists of either nearly blank e-mails or e-mails containing news articles that seem designed to pass through content filters just so users will send them back to their admins as spam, essentially making it easier for Bayesian filters and such to mark legitimate e-mail as spam.... though honestly, it's more of an annoyance for me, as it makes it easier for users to say "The spam filter isn't working, what are you doing wrong?"

Wrong title, I guess by stm2 · 2004-08-22 01:47 · Score: 5, Interesting

According to the /. title, it seems they used an algorithm used for DNA sequencing, when in fact they used an algorithm used for DNA analysis (or DNA sequence analysis, which is the same), more specifically, gene finding techniques. As you may know, most DNA in a genome is not translated into protein (some people still call it junk, but most of it is not junk at all). So there are programs to sort genes out from the rest of the DNA.
I think we will see more and more applications like this with the growing cross-pollination between Biology and CS.

--
DNA in your Linux: DNALinux

What could we do... by d3ity · 2004-08-22 01:49 · Score: 2, Funny

I'd love to meet the scientist that thought this up. It probably went something like this:

Boss: Well, we've made promising gains in the DNA research project. Now what applications could this be used for?
Engineer: The possibilities are limitless! We could cure cancer! We could invent a super puppy that combines the abilities of a lovable puppy and Tux, the friendly Linux penguin! We could use it to regenerate limbs for amputees!
Marketing: Let's use it to get rid of spam emails!
Boss: Great idea! Let's go with that one.

Works until the Spammers get a copy of it by G4from128k · 2004-08-22 01:53 · Score: 4, Insightful

This is interesting and promising technology. But like all antispam techniques, spammers will find a way around it. Once spammers get a copy of the software, they can create and test countermeasures in the comfort of their own sleazy lairs.

For example, the article mentions the software accepts a message that is long but has a few "spammy" sequences. This suggests an immediate countermeasure of adding bulk to spam -- appending a copy of some news article to the spammy payload (some already do this).

Personally, I've always thought that a simple spell check would do a good job as another layer of filtering. It would place spammers in a no-win situation -- either the keyword filter or the spell check filter would get them.

--
Two wrongs don't make a right, but three lefts do.

Re:Works until the Spammers get a copy of it by Donny+Smith · 2004-08-22 02:01 · Score: 2, Interesting

Good point - that's why, in theory, closed-source software that isn't available for free download or in an open-source version should be more effective against spam.

Spell checker as anti-spam filter - that would create huge problems for most Americans :-)
Otherwise it's a good idea.
Re:Works until the Spammers get a copy of it by Tim+C · 2004-08-22 02:54 · Score: 2, Insightful

in theory, closed-source software that isn't available for free download or in an open-source version should be more effective against spam.

How so?

1) install software
2) treat as black box
3) spam spam spam
4) see what gets through
5) study, enhance
6) goto 3)

Just because you can't see how it works, doesn't mean you can't teach yourself how to get around it.

--
It's official. Most of you are morons.
Re:Works until the Spammers get a copy of it by Tablizer · 2004-08-22 10:12 · Score: 2, Insightful

Personally, I've always thought that a simple spell check would do a good job as another layer of filtering.

Then 3/4 of slashdotters wouldn't be able to get their messages through to anybody :-)

--
Table-ized A.I.

Interesting... Electronic evolution... by dnaboy · 2004-08-22 02:09 · Score: 5, Insightful

I think it's really interesting to watch the literal evolution of spam and spam filters. There are really amazing parallels to biological evolution.

First, there's a constant tuning of both predator and prey (Anti-spam tools and spam).

Second, there seems to be some sort of equilibrium which is inevitably achieved, and

Third, there are occasional discreet major developments which change the game. This would be an example. Now, spam is going to be forced to majorly adapt.

I could see the 'Quality' of spam improving a lot as a result of tools like this. No more letters from my long lost benefactors in Nigeria, and no one-liners about 'Gushing like a firehose' (My coworkers and I got a good chuckle out of that one), but, as the story said, if you have keywords in a long email, it gets far less penalized. OK. Attach verses from Dante's Inferno, or Joyce's Dubliners to the email. Problem solved. You can't block words like viagra altogether or Pfizer researchers are going to have a hell of a time getting anything through.

Another concern is that if this forces spammers to make up new and compelling spam, people will be more likely to check it out. While my parents are probably pretty confident they didn't win a secret lottery 3 or 4 times last week, they might possibly believe new and creative stories.

Perhaps evolution of email readers is just plain going to be a necessary part of the solution...

Re:Interesting... Electronic evolution... by devphil · 2004-08-22 08:56 · Score: 2, Insightful

First, there's a constant tuning of both predator and prey

Absolutely. Unfortunately, as most predator-prey models will tell you, neither population ever goes to zero unless something catastrophic happens. And in this case, catastrophe is precisely what we want to happen to the prey.

(If they'd simply implement my proposed scheme of a bullet to the head of every spammer, no mercy, no appeal, it'd be easy. But noooo, "spammers are human beings no matter how useless and harmful they are," waaaaah.)

there are occasional discreet major developments

Um. "Discrete" is the word you want. Spammers are anything but discreet. :-)

--
You cannot apply a technological solution to a sociological problem. (Edwards' Law)

It is difficult to beat statistical spam filters by gvc · 2004-08-22 02:18 · Score: 2, Informative

Notwithstanding accepted wisdom espoused above, random words cannot defeat current statistical spam filters, and it is difficult to defeat such filters even if you have access to the algorithm and the recipient's mailbox.

John Graham-Cumming presented a talk Beating Bayesian Filters at the 2004 Spam Conference detailing these results. A video recording is available; alas, no paper.

In conducting a recent spam filter evaluation I observed (but did not report) that the statistical filter attacks were not particularly effective. The only attack that worked sometimes was to make the entire body of the message a current news item or joke, with only a URL linking to the spam payload.

Re:hm by great_snoopy · 2004-08-22 02:20 · Score: 3, Informative

In fact, they did. The latest spam I've received is composed of two parts: the spammy part, and a longer part that is usually a news paragraph from a public news site like news.google.com or cnn. The second part usually has a very small spammy fingerprint or none at all, cloaking the first spammy part.

They'll.. by aussie_a · 2004-08-22 02:24 · Score: 2, Interesting

To get around this spammers will use DNA algorithms to create spam that gets around the blockers ;)

Corrections... by littlewild · 2004-08-22 02:26 · Score: 3, Insightful

Chung-Kwei is a Chinese semi-deity that wards off evil. He isn't some kind of talisman.

Re:Misnomer, it's not "fighting spam"... by argent · 2004-08-22 02:28 · Score: 5, Insightful

As more and more people begin to use spam filtering (especially on the server level), spam's effectiveness will decrease.

People have been improving filtering, and the spammers just pump up the volume. As filtering improves, the delivery rate goes down, but so does the complaint rate so they end up being able to pump more spam before they're detected.

I've been watching this arms race for almost a decade, and the advantage is still on the spammer's side. At the moment I'm blocking between 10,000 and 20,000 connections a day just on the basis of their IP address (including blocks against entire countries), another 3-5,000 using a greylist/honeypot app I'm working on, and I'm still getting one or two hundred messages per day hitting my procmailrc. A few years back, when I was getting a few hundred spams a day without all those RBLs and personal blacklists, people were all excited about how bayesian filters were gonna make spam uneconomical... and I made the same comment back then. Now I'm filtering a couple of hundred times more efficiently and effectively and I'm still getting almost the same volume.

I don't see anything different this time. You can't fight spam with filters, all you can do is adapt to it.

Get the Feng Shui Motherboard by Kozz · 2004-08-22 03:04 · Score: 2, Funny

"We put the CPU in the center, because that is the chi, or life force for the entire board. A centered chi provides better performance." Now don't you want one?

--
I only post comments when someone on the internet is wrong.

Nice tool but greylisting does more right now! by slashname3 · 2004-08-22 03:26 · Score: 2, Interesting

This will make another nice tool to identify spam. But why not use greylisting at all the ISPs' MTAs to simply refuse 99% of the spam that is being sent right now?

Seriously, greylisting implemented on all the ISPs' MTAs would overnight block 99% of the spam being sent. Most spam at the moment is being sent from armies of bots run on unsuspecting users' systems connected to cable and DSL service. The programs used are unsophisticated: they churn through a list of addresses spewing messages out by the thousands. They do not queue messages or retry them if they get an error. Greylisting uses this to great effect and blocks spam while letting legitimate MTAs deliver messages.
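
The mechanism described above can be sketched in a few lines of Python. This is only a toy model of the decision logic, not any real MTA's or policy daemon's implementation; the `Greylist` class, the 300-second delay and the example addresses are assumptions, and a real deployment would also expire old entries and whitelist known senders.

```python
class Greylist:
    """Toy greylist keyed on (client IP, envelope sender, envelope recipient)."""

    def __init__(self, min_delay=300):
        self.min_delay = min_delay  # seconds a sender must wait before a retry counts
        self.first_seen = {}        # triplet -> time of first delivery attempt

    def check(self, client_ip, sender, recipient, now):
        """Return an SMTP-style reply code: 451 (try again later) or 250 (accept)."""
        triplet = (client_ip, sender, recipient)
        seen = self.first_seen.get(triplet)
        if seen is None:
            # Unknown triplet: remember it and ask the client to retry later.
            self.first_seen[triplet] = now
            return 451
        if now - seen < self.min_delay:
            # Retried too quickly: keep deferring.
            return 451
        # A queueing MTA came back after the delay: accept the message.
        return 250

greylist = Greylist(min_delay=300)
attempt = ("203.0.113.7", "sender@example.org", "user@example.net")

first = greylist.check(*attempt, now=0)     # first contact is deferred
hasty = greylist.check(*attempt, now=60)    # retry inside the delay window
retry = greylist.check(*attempt, now=600)   # a proper retry after the delay
print(first, hasty, retry)
```

A bot that fires once and never retries only ever sees the temporary failure, while a legitimate MTA that queues and retries gets through on a later attempt.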

True, it is not 100% effective; some small number of spam messages get through, since some spam goes through legitimate MTAs and the message is retried. But once you remove the bulk of spam those can be tracked down and shut down or blocked at the firewalls.

If the ISPs would implement this, spam would become a non-issue overnight. Email would once again become a mostly useful tool. But I guess the problem is that the ISPs have no vested interest in solving this problem. None of them will listen or implement this simple solution which does not block any legitimate email. With 70% of the email on the network being spam (number may be higher than that at this time) I would think they would jump at a solution that would reduce the loads on their servers. But I guess they make too much money from spammers to implement such a simple solution.

Re:Stop This B\/llsh!t Filtering Crap by mikael · 2004-08-22 03:50 · Score: 2, Insightful

Hell, spam has gotten so sophisticated that sometimes even after reading the whole message I still don't know if the e-mail is a legitimate one from my bank, stock broker, etc.

If after reading the E-mail, you still don't know what product the spam is advertising, then the spammers are losing, since those E-mails will not lead to a sale, and the spammers are simply wasting their own bandwidth.

--
Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads

For those who don't want to RTFA by Frankie70 · 2004-08-22 04:08 · Score: 2, Funny

Summary
1) Make your PC face the North, whenever you are checking Email.
2) Hang a metal windchime above your workstation.
It is important that the rods of the windchime be hollow, so that the auspicious Chi can rise up the chimes.
3) Add a user account for the Dragon Turtle & make him the admin.

More correct than you know by Hao+Wu · 2004-08-22 04:09 · Score: 2, Interesting

Funny how some people develop more and more sophisticated stuff to fight against something that is just as simple as sending out emails to random addresses...

This is just like your own immune system, which uses such things as "V-D-J" recombination (and other tricks) to create billions of somewhat random, different epitopes to attack potential unknown pathogens. It must then further educate those cells not to attack "self" in your own body.

If only computer geeks took some lessons from biologists, perhaps they could get a grip on principles to stop SPAM.

--
I suggest you read Slashdot

Giving birth to Artificial Intelligence... by mcrbids · 2004-08-22 04:16 · Score: 3, Interesting

It's my belief that the most likely source of the birth of Artificial Intelligence will be the SPAM filter.

Think about it - we now have software that "learns" what you like.

Sorry, but anything that "learns" fits a definition of intelligence - using past results to predict future outcomes. Note that I'm not saying "self aware" or "conscious", simply "intelligence".

As we move forward, we'll see more and more intelligence on the part of the spammers, and the warring factions of intelligence will likely provide massive financial and political impetus to build ever more intelligent solutions - thus AI is born.

The problem with other vehicles for developing AI is simply the budget. With SPAM, everybody has a direct, financial incentive to develop it, so development will definitely happen!

--
I have no problem with your religion until you decide it's reason to deprive others of the truth.

Nothing new here, move along... by po8 · 2004-08-22 05:08 · Score: 4, Informative

As someone who's done some research on machine learning for spam filtering, this sure looks to me from their 8-page paper like yet another simplistic ML algorithm advocated by folks who don't know the field and tested using techniques of questionable sensitivity. Their "novel" method sounds an awful lot like feature set construction by clustering, a method that is widely used in the spam filtering literature, but with a somewhat novel clustering technique from biology.

Message filtering starts by throwing away line breaks for no obvious reason, then optionally removing the known ham from the training set for no obvious reason. Message headers are then thrown away, for no obvious reason.

No general method is given for corpus allocation. In the experiment reported later, the original corpus appears to have been split roughly in half. (For unreported reasons, none of these splits are exact. No rationale is given for the various corpus allocations.) The training corpus is then split into ham and spam, and the ham portion is split in half. The spam training corpus is used for "positive training": determining a complex feature set as described below. One half of the ham training corpus is then used for "negative training": filtering out complex features that are common in ham. The remainder of the ham corpus is used as a validation set to select thresholds described below. No justification is given as to the failure of the validation set to include spam messages, and the procedure is vague on this point.

The description of the key "positive training" phase is difficult to follow: it seems to assume the pre-existence of the "SPAM vocabulary" [sic] being constructed. The key idea seems to be to use positional index of words within the body as base features, and construct complex features by using a pattern recognition algorithm to find correspondences between sets of base features across spam messages. Patterns that appear across many spam messages are treated as indicating spam.

The final training step is to set thresholds for (1) minimum number of complex features in the spam message and (2) fraction of the message text covered by the complex features. One would expect these two criteria to be highly correlated: no effort appears to have been made to enforce or explore their orthogonality.

The classification phase proceeds by simply counting the number of patterns in a given test message and the percent coverage of the message by the patterns. If the result exceeds both thresholds, the message is classified as spam.
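
As I read that description, the decision rule looks roughly like the Python sketch below. This is emphatically not the paper's code: the patterns here are plain substrings rather than the discovered patterns the paper uses, and `classify`, the pattern list and both threshold values are placeholders I made up.

```python
def classify(message, patterns, min_patterns, min_coverage):
    """Return (is_spam, hits, coverage) using the two-threshold rule."""
    text = message.lower()
    covered = [False] * len(text)
    hits = 0
    for pattern in patterns:
        start = text.find(pattern)
        if start == -1:
            continue
        hits += 1  # count each distinct pattern once
        while start != -1:
            # Mark every character this occurrence covers.
            for i in range(start, start + len(pattern)):
                covered[i] = True
            start = text.find(pattern, start + 1)
    coverage = sum(covered) / len(text) if text else 0.0
    # Spam only if BOTH the pattern count and the coverage reach their thresholds.
    return hits >= min_patterns and coverage >= min_coverage, hits, coverage

PATTERNS = ["viagra", "click here", "free"]          # hypothetical spam vocabulary
short = "Free viagra, click here"
padded = short + " " + "lorem ipsum dolor sit amet " * 20

print(classify(short, PATTERNS, min_patterns=2, min_coverage=0.3))
print(classify(padded, PATTERNS, min_patterns=2, min_coverage=0.3))
```

The padded message contains exactly the same three patterns but its coverage collapses, which is why appending a news article to the payload is the obvious countermeasure, and also why the two thresholds are so strongly correlated for short messages.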

For the empirical evaluation, the corpus used seems to have consisted of approximately 130,000 messages, roughly 1/4 ham and 3/4 spam. No details of the construction or acquisition of this large corpus were given. Because of its volume, one would suspect a synthetic corpus from high volume sources. The details of this corpus construction are critical to the evaluation of the method, so no useful conclusions can really be drawn from the empirical evaluation other than that, like most machine learning methods, this method works well on some problem set.

The claimed accuracies from the technique are at a level that is highly suspect from previous experience: there are fundamental bounds on how well any ML algorithm can do in real situations that don't appear to be met here. Indeed, messages found to be misclassified as spam in the test corpus were manually reclassified, but no effort seems to have been made to identify messages that were "correctly" classified by the algorithm but misclassified in the corpus. The accuracy before manual manipulation of the results (!) appears to be about 97%, which is well within the normal expected range. Computational efficiency appears to be good.

The vocabulary used in the paper is not particularly consistent with the vocabulary normally used in the spam filtering or machine learning literature. A few spam filtering and machine learning papers are cited, but not many: citations are primarily from the

Or... by sean.peters · 2004-08-22 05:35 · Score: 2, Funny

1) Acquire software
2) Decompile
3) Study code
4) Develop countermeasure
5) spam spam spam

It's not like spammers care about the EULA that says they can't look at the code. Oh, and before I forget...

6) ???
7) Profit!

Sean

Virus and worm detection! by Ungrounded+Lightning · 2004-08-22 05:54 · Score: 2, Interesting

That should work for virus and worm detection, too!

Even more so, since viruses are much more a compilation of a set of previous constructions with a few mods than a new composition not necessarily based on the wording of old scams.

And viruses and worms (especially worms) are more constrained by their environment, requiring an exploit of a vulnerability and the installation of work-doing code. Though gene-shuffling techniques might be able to bury much of the code, the basic exploit must continue to be some sort of match to the vulnerability's "receptor".

--
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way

Re:hm by ca1v1n · 2004-08-22 06:16 · Score: 2, Interesting

The great thing about the similarity matching algorithms is that they read with noise filtering the same way that humans do. They also allow for like-character matching without any added computational overhead. This means that you can make a table of unicode characters that are similar to certain ascii characters that gets incorporated into the similarity matrix. By the power of these properties combined, your spam filter can recognize that c;al_is is intended to look like cialis, without a lot of expensive extra computations.
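
A small Python sketch of what folding a look-alike table into the similarity scoring can look like. It uses an ordinary weighted edit distance, not the pattern-discovery algorithm from the article; the `LOOKALIKES` table, the 0.1 costs and the function names are all invented for the example.

```python
# Hypothetical table of visually similar character pairs (ASCII and Unicode).
LOOKALIKES = {
    ("i", ";"), ("i", "1"), ("i", "!"), ("o", "0"),
    ("a", "@"), ("a", "\u0430"),  # U+0430 is the Cyrillic small letter a
    ("s", "$"),
}

def substitution_cost(a, b):
    if a == b:
        return 0.0
    if (a, b) in LOOKALIKES or (b, a) in LOOKALIKES:
        return 0.1  # look-alike characters are nearly free to swap
    return 1.0

def deletion_cost(ch):
    # Punctuation and underscores sprinkled into a word are treated as cheap noise.
    return 1.0 if ch.isalnum() else 0.1

def distance(observed, target):
    """Weighted edit distance from the observed token to a known spam word."""
    rows, cols = len(observed) + 1, len(target) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + deletion_cost(observed[i - 1])
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + 1.0  # a missing target character costs a full unit
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + deletion_cost(observed[i - 1]),
                d[i][j - 1] + 1.0,
                d[i - 1][j - 1] + substitution_cost(observed[i - 1], target[j - 1]),
            )
    return d[-1][-1]

print(distance("c;al_is", "cialis"))  # small: one look-alike swap plus one noise char
print(distance("casual", "cialis"))   # large: a genuinely different word
```

The table lookup replaces the usual equality test inside the same dynamic-programming loop, so the look-alike handling adds no extra passes over the text.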

Now that we've neutralized that form of message garbling, we're left to dealing with bayes filter poisoning. This is something that entropy-based filtering deals with quite well.

All spam filtering techniques have weaknesses, but if you use a few different methods in concert, preferably within the same package to spare the poor user from having to set up a whole lot, you can get just about all of it.

Even using a few of these different methods together, I still get a few ads from companies I've done business with that have screwed up my communication preferences. This sucks, but most of these companies are clueless rather than malicious. Threatening to take my business elsewhere has never failed to correct these problems.

--
WARNING: there is a trojan on your

Re:This is all bull -- Change the law by koreth · 2004-08-22 06:54 · Score: 2, Insightful

This isn't going to work -- you simply can't solve a social / legal problem with technology.

You'll be buying all your doors without locks from now on, I take it, since burglary is a social/legal problem and the government has passed laws against it. Let us know how that goes.

Serious methodological flaws by YU+Nicks+NE+Way · 2004-08-22 12:56 · Score: 3, Insightful

It sounds like a great paper until you get down into the guts of their materials and methods. They trained their system on half of their total data, and did not then test on separate data. That captures the two classic no-nos of data driven techniques: they inflate their results by including their training data in the results, and, worse, their training data comprises a larger sample of their total data than would be seen in the real world.

The first of these calls their sensitivity result into question. If they classify their training data perfectly, then the 4.4% false negative rate they quote needs to be doubled to 8.8% -- almost one false negative in every eleven messages scanned.
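
The arithmetic behind that doubling, as a quick Python check. It rests on the two assumptions stated above (half the evaluated data was training data, and training data is classified perfectly); the function name is just a label for the calculation.

```python
def unseen_false_negative_rate(reported_rate, training_fraction):
    """False-negative rate on unseen data if all errors come from the unseen part."""
    return reported_rate / (1 - training_fraction)

rate = unseen_false_negative_rate(reported_rate=0.044, training_fraction=0.5)
messages_per_miss = 1 / rate
print(f"{rate:.1%} false negatives, about one in every {messages_per_miss:.1f} messages")
```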

The second of these calls their false positive rate into question: training with an unrealistically thorough set leads to better categorization, ceteris paribus. They need to show the trend with a variety of different training set sizes to support any claims about performance.

This sounds like a fully buzzword compliant non-result to me.
