Seven Spam Filters Compared

Unadvertised by Anonymous Coward · 2003-08-23 06:56 · Score: 5, Funny

Sounds great, but until I hear about software products like these in my morning mailbox, I don't really trust that they're any good.

Re:Unadvertised by WIAKywbfatw · 2003-08-23 07:52 · Score: 1

Hmmm, perhaps we should send this story to everyone we know, everyone on usenet and everyone listed in all those online directories? I'm sure they'll all appreciate an article that'll help them cut down on the amount of spam that they receive.

Yeah, I think I'll be a good samaritan and do that ASAP. Now, where's that open relay...

--

"Accept that some days you are the pigeon, and some days you are the statue." - David Brent, Wernham Hogg
Re:Unadvertised by Timbo · 2003-08-23 08:11 · Score: 1

You look very silly.

Re:Link Please by Neophytus · 2003-08-23 06:56 · Score: 3, Informative

people/editors need to learn the a tag

Re:Link Please by woodhouse · 2003-08-23 06:56 · Score: 2, Informative

clicky

The Link. by AndyFewt · 2003-08-23 06:56 · Score: 2, Informative

Spam Filters

Re:Link Please by Ivan+the+Terrible · 2003-08-23 06:56 · Score: 1

http://freshmeat.net/articles/view/964/

Here's the link. by jpaz · 2003-08-23 06:56 · Score: 1

Click here.

Re:Link Please by skt · 2003-08-23 06:57 · Score: 1

silly editors. Yay for SpamAssassin..

Re:Link Please by rmohr02 · 2003-08-23 06:57 · Score: 1

Ahh--I have to copy the URL to the clipboard and paste it in the URL bar. Seems to me like I'm browsing text files.

Good testing, but not enough samples by TexTex · 2003-08-23 07:00 · Score: 4, Informative

The author makes a good attempt at comparing these products, but I don't think his samples are in-depth enough to come up with real-world results.

For Bayes testing, he used 68 spam and 68 ham messages. Spamassassin for one won't even activate bayes until it's learned from 200 messages; it's not uncommon for those who regularly deal with spam management on the server side to use 5000-10,000 message corpuses to test new rule additions and to train spam.

The low number might have a slight effect if most of your mail contains similar characteristics, but I'd much rather have seen bigger numbers of samples.

--
-Barkeep, a draft of your most hazardous brew, for the world is slowly stepping into focus, and I don't like what I see.

Re:Good testing, but not enough samples by cly · 2003-08-23 07:13 · Score: 5, Informative

I guess you wrote this after reading the first two experiments.

In the third he used 1200.

Nice way to jump the gun.
Re:Good testing, but not enough samples by Sanctuary · 2003-08-23 07:18 · Score: 4, Insightful

They didn't train Spamassassin to use the bayes filter once during the test, and they used it without all the other scoring tools for Spamassassin. This review really didn't completely test Spamassassin's full potential.
Re:Good testing, but not enough samples by bigberk · 2003-08-23 07:50 · Score: 1

For comparison; I am trying out a fresh spamprobe install and I'm finding that after training with about 150 messages (about 70% of which are spam) I'm seeing great results.
Re:Good testing, but not enough samples by arth1 · 2003-08-23 08:03 · Score: 4, Informative

I guess you wrote this after reading the first two experiments.

In the third he used 1200.

1273, out of which 1073 were spam. That leaves 200 non-spam messages, which isn't enough for Spamassassin's bayesian filtering to kick in, even if all messages were to be classified as ham or spam, and not just let through.

To quote sa-learn's man page:
Another thing to be aware of, is that typically you should aim to train with at least 1000 messages of spam, and 1000 ham messages, if possible. More is better, but anything over about 5000 messages does not improve accuracy significantly in our tests.
The low number of emails, combined with no apparent manual reading on part of the author, makes me want to disregard this whole survey as pure drivel.

Regards,
--
*Art
Re:Good testing, but not enough samples by BrookHarty · 2003-08-23 08:04 · Score: 1

After training, SA does a good job, but really, it catches only around 90% of my spam. Every once in a while I do run sa-learn spam on my spam folder and ham on my mailing lists/work folders, to keep it updated. I've also put the blacklists in the global prefs to keep it updated. Still, a 90% reduction is nice.

But Thunderbird catches the rest; hardly any spam makes it through now.
Re:Good testing, but not enough samples by skookum · 2003-08-23 08:53 · Score: 3, Insightful

Agreed. The author made up the artificial constraint that "no program is allowed to contact the network" which means that SpamAssassin wasn't able to check the DNS blacklists for things like exploited open proxies/relays in the Received chain, or to check with distributed signature services like RAZOR/DCC, etc.

If you're not going to let the program use its full capabilities, why test it?

Analogously, what kind of hardware review site would do a review along the lines of "This motherboard supports this extra feature that will improve CPU speed noticeably, but we're going to disable it for our tests (even though most of you would want to use it.)"
Re:Good testing, but not enough samples by hamster+foo · 2003-08-23 09:03 · Score: 5, Insightful

"Also, SpamAssassin has a Bayesian classifier built in, but it wasn't used in these tests, since having five was enough."

While I'm sure the recommendations set forth in Spam Assassin's man page are probably a good idea for all Bayesian training sets, he wasn't using the Bayesian filtering included in Spam Assassin, so you can't really fault him for not reading a section of the man page for a feature he was choosing to leave out.

It would have been nice to see him turn on Spam Assassin's Bayesian filtering at least in some of the tests. I don't think test results with a feature turned off that I would imagine the vast majority of users would use make for a very good comparison of the different packages' abilities.

--
- b
Re:Good testing, but not enough samples by timeOday · 2003-08-23 09:17 · Score: 1

Being so data-hungry is a potentially crippling disadvantage to the bayesian approach. Anything that requires 1000 messages of each type just to get started is useless to lots of people. It would take me half a year to prime the thing on my home email address.
Re:Good testing, but not enough samples by Fred+Ferrigno · 2003-08-23 10:04 · Score: 1

Later on, he criticized sa-learn's manual for indicating that you should have an even ratio of spam to "ham", which was not borne out by his tests. Obviously, he did read the manual.
Re:Good testing, but not enough samples by Fred+Ferrigno · 2003-08-23 10:15 · Score: 2, Interesting

Seems to me like it isn't an artificial constraint, but merely a practical one. It sounds like he scripted the programs to run through his data all at once, so querying the online resources a thousand times an hour would not be feasible. The Bayesian filters were at a similar disadvantage because of the automated testing: normally, each false negative gets added to the spam corpus, which would have improved their accuracy over time.
Re:Good testing, but not enough samples by skookum · 2003-08-23 11:54 · Score: 1

In the case of the RBL (realtime block-lists), they use the existing infrastructure of DNS and so the load is fully distributed and cached. The first time you make a query for the status of a given IP address, you'll probably end up getting a response from one of the authoritative nameservers, but all subsequent queries for the same name will be cached without any extra burden on the nameserver. Additionally, there are usually many slave/secondary nameservers for the main RBLs, so load is not too much of an issue.
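
For anyone who hasn't seen what such a lookup looks like, here is a minimal sketch of how the query name is usually formed. It assumes the common convention of reversing the IPv4 octets and appending the list's zone; `dnsbl.example.org` is a placeholder zone rather than a real list, and the code only builds the name without touching the network.

```python
import ipaddress


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the hostname a client would resolve to check `ip` against `zone`."""
    # Validate the address first; malformed input raises ValueError.
    addr = ipaddress.IPv4Address(ip)
    # 192.0.2.99 becomes 99.2.0.192, then the blocklist zone is appended.
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


# Hypothetical zone name, used only for illustration.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

Because the result is an ordinary hostname, any resolver along the way can cache the answer like any other DNS record, which is the behavior described above.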

I'm not sure how DCC or Razor handles this, but remember that these services are used by high-volume mailservers all over the globe, so there's bound to be enough redundancy that a mere batch of 1000 messages would be pretty insignificant.

I think the author did it to be "fair", in that he wanted to test the innate ability of each software piece without falling back on any sort of "ask someone else" functionality. That sentiment makes a certain amount of sense (and it's a good idea to try to keep all your variables under control) but in this setting it makes no sense. If querying DNS blocklists and checksum clearinghouses significantly improves SpamAssassin's scores, and most people would be using it with this enabled, then it really seems unfair to test it with this disabled.
Re:Good testing, but not enough samples by sholden · 2003-08-23 17:58 · Score: 1

The testing was done a month after the actual emails were received. Using such resources would allow the filter the benefit of hindsight. As in foo sends lots of spam and ends up on a blacklist, but I received a bunch of spam from foo before it got on the list.

So it wasn't artificial. I mentioned in the article why I made that constraint.

I also didn't retrain bayesian filters on false-negatives before giving them later emails, which isn't normal use of them either.
Re:Good testing, but not enough samples by sholden · 2003-08-23 18:29 · Score: 1

I quoted from a section a bit further down in that manpage - which indicates I just might have read the damn thing.

That spamassassin has a limit that my sample data didn't reach isn't of real concern to me. I can't just magically create some emails, I only have the emails that I have received.

A bayesian filter should work reasonably well with unbalanced training data. A Paul Graham style "let's ignore the huge amount of research in the field and make stuff up" filter will have problems because it ignores the term in a Naive Bayesian Classifier that deals with the ratios of each type of item.
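
To make the "term that deals with the ratios" concrete, here is a rough sketch of a textbook multinomial naive Bayes classifier with an explicit class prior. It is not the code of any filter tested in the article; the function names, the add-one smoothing, and the tiny training set are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def train(docs):
    """docs: iterable of (label, tokens). Returns per-label message and token counts."""
    doc_counts = Counter()
    token_counts = {}
    for label, tokens in docs:
        doc_counts[label] += 1
        token_counts.setdefault(label, Counter()).update(tokens)
    return doc_counts, token_counts


def log_posterior(tokens, label, doc_counts, token_counts, vocab_size):
    """Unnormalised log P(label | tokens) under the naive independence assumption."""
    # Prior term: the share of training messages carrying this label.
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    counts = token_counts[label]
    total = sum(counts.values())
    for tok in tokens:
        # Add-one smoothing so unseen tokens do not zero out the product.
        score += math.log((counts[tok] + 1) / (total + vocab_size))
    return score


def classify(tokens, doc_counts, token_counts):
    vocab = set()
    for counts in token_counts.values():
        vocab.update(counts)
    return max(
        doc_counts,
        key=lambda lab: log_posterior(tokens, lab, doc_counts, token_counts, len(vocab)),
    )


# Deliberately unbalanced toy data: three spam messages, one ham message.
model = train([
    ("spam", ["cheap", "pills"]),
    ("spam", ["cheap", "loans"]),
    ("spam", ["free", "pills"]),
    ("ham", ["meeting", "notes"]),
])
print(classify(["cheap", "pills"], *model))
print(classify(["meeting", "notes"], *model))
```

With no tokens at all the decision falls back to the prior alone, so the majority class wins; dropping that term is the simplification being criticised above.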
Re:Good testing, but not enough samples by juhaz · 2003-08-24 00:04 · Score: 1

Sorry, but from a viewpoint of someone who receives crapload of spam but not that much real mail, the test with 1000 ham messages for pre-training would be the one that is pure drivel, as it wouldn't have any real-world relevance.

I won't probably receive 1000 messages in a YEAR (unless counting mailing lists), should that mean I shouldn't be able to filter spam?

If a bayesian implementation requires that much to even get started it's crap and should be dumped for anyone except perhaps a mail gateway or something.
Re:Good testing, but not enough samples by eggnet · 2003-08-24 08:29 · Score: 1

The TTL values of the blacklists are low to allow for timely removal from the list.

Obligitory "here's my perfect spam solution" by ceswiedler · 2003-08-23 07:02 · Score: 2, Informative

IMO, the best way to go with spam is to combine a heuristic filter with a text/Bayesian filter, in my case SpamAssassin and SpamProbe. I run them both, and it does a noticeably better job than either running alone.

SpamProbe can be fooled by clever spammers who insert lots of common words in non-visible html. A Bayesian filter can't really catch that, but a heuristic filter can be written to notice the pattern.

Also, set up your Bayesian filter to re-learn regularly from your spam folder. SpamProbe adds a unique ID to each message, so it won't process a message twice. Therefore, you can just manually move any false negative spams into the folder, and they'll be learned from.

Re:Obligitory "here's my perfect spam solution" by FyRE666 · 2003-08-23 11:17 · Score: 1

SpamProbe can be fooled by clever spammers who insert lots of common words in non-visible html.

Well it's not that clever, I've configured SA to mark obfuscated mail +20, so it's always caught immediately. The only people using this feeble trick are spammers, so there's no likelihood of a false positive...

--
Code, Hardware, stuff like that.
Re:Obligitory "here's my perfect spam solution" by dsheeks · 2003-08-24 01:31 · Score: 1

A simple, effective but admittedly imperfect filter for a subset of spam is just matching HTML comments. The only problem I typically see with that is e-mail newsletters, but if they have a non-HTML version (preferred from my perspective) that isn't a problem.
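
A minimal sketch of that kind of check, assuming the message body is already available as a decoded string. The function names and the `max_comments` parameter are made up for this example; the default of zero flags any message containing an HTML comment, which is the blunt rule described above.

```python
import re

# Non-greedy match so each <!-- ... --> is counted separately, even across lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def count_html_comments(body: str) -> int:
    return len(HTML_COMMENT.findall(body))


def strip_html_comments(body: str) -> str:
    """Remove comments so a word split up by them can be read (or tokenised) whole."""
    return HTML_COMMENT.sub("", body)


def looks_obfuscated(body: str, max_comments: int = 0) -> bool:
    """Flag the message if it holds more HTML comments than `max_comments`."""
    return count_html_comments(body) > max_comments


sample = "V<!--x-->i<!--y-->a<!--z-->g<!--q-->ra"
print(count_html_comments(sample), strip_html_comments(sample), looks_obfuscated(sample))
```

Raising `max_comments` is one way to let HTML newsletters with the odd comment through, at the cost of missing lightly obfuscated spam.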

Mozilla? by HBI · 2003-08-23 07:02 · Score: 4, Insightful

I have seen at least two of these comparisons and no one seems to want to roll Mozilla's spam filter into the mix and compare it. Therefore, the comparisons are kind of useless to me. I am guessing I am not the only person using Moz either, for specifically this reason (ease of use for Bayesian filtering).

What's up with that? I know it's not a proxy, so the methodology is different than most of the products in the comparison. I'm very interested in how well the filter works however, compared to these other products.

--
HBI's Law: Frequency of calling others Nazis is directly correlated with the likelihood of the accuser being Communist.

Re:Mozilla? by DarkSarin · 2003-08-23 07:06 · Score: 1

mod parent up. I love the mail filters in mozilla, and have completely switched from outlook to mozilla. I have even ditched evolution on the linux side.

--
"We don't know what we are doing, but we are doing it very carefully,..." Wherry, R.J. Personnel Psychology (1995)
Re:Mozilla? by bobintetley · 2003-08-23 07:12 · Score: 2, Insightful

Sensible people filter their email at the server and try to waste as little bandwidth as possible.

Mozilla is no good for this, as you have to download the mail via POP3/IMAP to filter it.

Don't get me wrong - Moz' spam filter is good at the user level, but you really would want to try and ditch the spam before then (particularly if you run a server for a number of users).
Re:Mozilla? by thinkninja · 2003-08-23 07:17 · Score: 2, Informative

Very true. I downloaded 1600 messages with Thunderbird today (backlog) and only about 30 weren't spam. That's a huge waste of bandwidth.

--
"The number of Unix installations has grown to ten, with more expected." (Unix Programmer's Manual, 2nd ed.; june 1972)
Re:Mozilla? by wilfie · 2003-08-23 07:25 · Score: 5, Insightful

The loss of bandwidth is not the main cost of spam these days. Certainly not internal bandwidth between our mail server and desktops. The excellent features of doing it on my desktop are that the filter is learning about what _I_ consider to be spam and ham, and that I have the stuff that's classified as spam to hand and can check it through once in a while. So far for me it's only thrown false positives when colleagues have sent stuff that was spammy in content. I have a presentiment that our CEO's habit of writing in red HTML (full of ff0000) will cause a false hit one day.
Re:Mozilla? by Anonymous Coward · 2003-08-23 07:32 · Score: 1, Informative

I've been using Mailfilter for a while now and I've built a pretty comprehensive list of keywords in the subjects of spam. It seems to just pull the message headers from the server without downloading the body.

One example rule:
DENY = ^Subject:.*v[i1l!|][a4@][g8]e?r[a4@]

Then I filter whatever gets through that with SpamAssassin.
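
To see what that rule does and doesn't catch, here is a quick sketch that runs the same pattern over a few made-up subject lines. Case-insensitive matching is an assumption about how the rule gets applied, and the sample subjects are invented.

```python
import re

# Pattern copied from the DENY rule above. IGNORECASE is an assumption;
# drop the flag if your filter matches case-sensitively.
DENY_SUBJECT = re.compile(r"^Subject:.*v[i1l!|][a4@][g8]e?r[a4@]", re.IGNORECASE)


def is_denied(header_line: str) -> bool:
    """Return True if the header line would be rejected by the rule."""
    return DENY_SUBJECT.search(header_line) is not None


samples = [
    "Subject: Cheap V1agra here",
    "Subject: buy v!@8r@ today",
    "Subject: Minutes from Tuesday's meeting",
]
for line in samples:
    print(is_denied(line), line)
```

The character classes cover the usual letter substitutions (1, l, ! or | for i; 4 or @ for a; 8 for g), and the optional `e` picks up the "viagera" misspelling.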
Re:Mozilla? by Popocatepetl · 2003-08-23 07:33 · Score: 1

I like your idea of comparing client-based filters.

SpamKiller is another filter that operates on the user's machine.
Re:Mozilla? by hdw · 2003-08-23 07:39 · Score: 5, Insightful

Most people can't filter their email at the server, since most people don't have access to a server to filter at.

So the majority has to filter locally, either in the client or with a local pop/imap proxy (like PopFile).

// hdw

--
Executive Pope (small) Kallisti Engineering
Re:Mozilla? by thinkninja · 2003-08-23 08:51 · Score: 1

Cool, I'll check it out.

--
"The number of Unix installations has grown to ten, with more expected." (Unix Programmer's Manual, 2nd ed.; june 1972)
Re:Mozilla? by Sparr0 · 2003-08-23 09:47 · Score: 1

Mozilla's bayesian spam classification is a direct implementation of the original "Plan for Spam" algorithm. However, it is broken in a number of ways, both in terms of classification and problems with training. I have filed bug reports for all of the problems and some are being worked on. As it stands right now Mozilla performs worse than any other Bayesian filter and probably on par with the 'crippled' SpamAssassin used in this test.
Re:Mozilla? by Blain · 2003-08-23 11:36 · Score: 2, Informative
I have been using POPFile for months now, with a fairly complex setup, one of the things I like about POPFile versus the others I've seen (which are two or three bucket systems). It's classifying more than 99% accurately every month for the past three or four months (I reset my statistics around the first of every month) and has never been less than 95% accurate in a month (including its training month). For an idea of what my loads and buckets are like, this list of my buckets and the number of messages classified into them since the first of the month will help:
- ads -- 25 (0.58%)
- bounces -- 2 (0.04%)
- business -- 18 (0.42%)
- family -- 10 (0.23%)
- forwards -- 8 (0.18%)
- list -- 3,242 (75.72%)
- personal -- 68 (1.58%)
- politics -- 11 (0.25%)
- pornspam -- 136 (3.17%)
- scams -- 24 (0.56%)
- spam -- 678 (15.83%)
- webgenerated -- 57 (1.33%)
- website -- 2 (0.04%)
I've been using TB for a couple months now, and very much like it. I've used the built-in junk filtering since I first got it, and have found that it is only getting about 1/3 to 1/2 of the things already categorized for my spam buckets, with a higher rate of false-positives than POPFile. I would like to see something more reliable, and hope updating the algorithm will help.

As complicated as my buckets may look, this system works very well for me -- with the addition of a "misc" folder that anything not classified goes into, and some filters based on the X-Classified line, almost nothing that gets into my inbox is anything other than personal email.

Re:Link Please by Arker · 2003-08-23 07:04 · Score: 1

Don't complain, they're trying to keep from slashdotting the server.

Ok, probably not, never attribute to benevolence what can be explained as well by stupidity.

The address is there though, before you complain about them being too stupid to make a link you might ask yourself if you're really too stupid to cut and paste. Cuts both ways.

Article isn't really anything new, but a decent quick rundown on the current state of the field I think.

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Friends don't let friends enable ecmascript.

OT: Disturbing? by Lead+Butthead · 2003-08-23 07:08 · Score: 4, Insightful

Does anyone find it disturbing that --

a. Spam Filter software company is now a "viable business."
b. Spam Filter is needed AT ALL?

--
ELOI, ELOI, LAMA SABACHTHANI!?

Re:OT: Disturbing? by phear_the_penguin · 2003-08-23 09:51 · Score: 1

Everyone knows that Spams actually originate from the anti-spam companies themselves, so that they can keep themselves in business... ;)

What I find disturbing is that they completely stole the idea from the Anti-Virus companies. ;)

I Smell a Lawsuit! :P
Re:OT: Disturbing? by otisaardvark · 2003-08-23 11:38 · Score: 1

The great thing about free (beer) spam filters is that there is no possible profit motive whatsoever... and unlike the AV industry which needs to be up to date in a matter of minutes (requiring expensive labs etc) spam filtering requires relatively few resources, especially Bayesian type filters which only run locally.
Re:OT: Disturbing? by BuilderBob · 2003-08-23 23:03 · Score: 1

Does anyone find it disturbing that --

a. Intruder Alarm company is now a "viable business".
b. Intruder Alarms are needed AT ALL?

The world is full of people who will do bad things to you in order to `improve' their life, (they're assuming it's a zero sum game I guess). Sometimes we have to take defensive action against them until it stops, or until it falls below our radars.

Spam numbers have been increasing exponentially for a year now (probably more), major ISP(s) are starting to provide spam filtering services in their packages just like 50 years ago some television programs were actually broadcast in colour(!). Eventually, spam will start to decrease and the spam-filter will be turned down with them (to allow some false positives and negatives through) until an acceptable threshold is reached.

Spam filter companies are viable now, just as vacuum tube manufacturers were 50 years ago, then somebody invented the transistor, it was sleek, small and stopped spam (hmm...). Some of your replies allude to Anti-Virus software. How many of the recent viruses were `proper' viruses, and not DOS attacks on windows services or VBA worms in Outlook? They aren't really viruses, they're just malicious code exploiting some bug in MS code.

Don't worry, the sun will rise tomorrow, and you'll get emails telling you how small your penis is, how much the Nigerian bankers make in 7 days and how to get expensive electronics for free, that can't be bad, can it?

Flawed Tests by Plix · 2003-08-23 07:14 · Score: 3, Informative

As was noted earlier, the set of messages given to the filters for learning was terribly small. Furthermore, SpamAssassin wasn't tested in a way useful to most as the tests in this article didn't take into account SA's Bayesian filter nor its network-based tests (Razor, etc).

Re:Flawed Tests by CaptBubba · 2003-08-23 07:19 · Score: 1

"For the third test, the 1,273 pieces of July's mail were used as the training set." And that's a small training set? Sure, it isn't 10,000 messages, but you have to draw the line somewhere.

Spamassassin and Bayes? by menscher · 2003-08-23 07:21 · Score: 3, Interesting

Spamassassin > v2.50 supports Bayes, right? But TFA seems to imply that it's just heuristic. I'd be interested in seeing how spamassassin improves with a good training set.

Also, what's with keeping the spam threshold score secret?

Re:Spamassassin and Bayes? by -tji · 2003-08-23 07:34 · Score: 1

He mentioned in the article that he disabled the Bayesian capabilities in SpamAssassin because there were already five other Bayesian based tools in the comparison.

I think he should have at least included a "full powered" spam assassin into the testing.. Which technology is best is an interesting test to perform. But, I'm really only interested in which application to install to kill spam.

I have been using Spam Assassin for a few months now, and find it to be excellent.

For my corporate mail, where I can't install tools on the server, Mozilla's spam filtering has also been excellent.
Re:Spamassassin and Bayes? by numbski · 2003-08-23 07:36 · Score: 4, Informative

Yup. I use it all the time. Save up spam and ham in separate folders. Then do this:

sa-learn --spam --mbox ~/mail/myspamfolder
sa-learn --ham --mbox ~/mail/myhamfolder

As I get more spam, I set it aside into a folder, and in tcsh I have this alias set:

alias spamadd 'sa-learn --spam --mbox ~/mail/got-through && rm ~/mail/got-through && touch ~/mail/got-through'

--
Karma: Chameleon (mostly due to the fact that you come and go).
Re:Spamassassin and Bayes? by arth1 · 2003-08-23 16:34 · Score: 2, Informative

As I get more spam, I set it aside into a folder, and in tcsh I have this alias set:

alias spamadd 'sa-learn --spam --mbox ~/mail/got-through && rm ~/mail/got-through && touch ~/mail/got-through'

In addition to the above, it might be smart to create three files called "ham", "spam" and "forget":
#!/bin/sh
# ham
/usr/bin/sa-learn --ham --no-rebuild --single

#!/bin/sh
# spam
/usr/bin/sa-learn --spam --no-rebuild --single

#!/bin/sh
# forget
/usr/bin/sa-learn --forget --single
Complement with a cron job that runs sa-learn --rebuild every night.

Then, if you read your mail on the same box, and the headers don't say it was auto-learned, simply pipe the email to either ham or spam. If it was wrongly auto-learned as spam, pipe it to forget. If using pine, it's really easy:
| ham

Of course, if you use razor or other online services that let you report spam, you might want to pipe some of the spam mails that weren't recognized to "spamassassin -r".

Regards,
--
*Art

Active Spam Killer by Admiral+Llama · 2003-08-23 07:24 · Score: 2, Informative

How the heck could Active Spam Killer be left out? I used to get about 150 spams a day and now I get ZERO. No false positives, no false negatives.
It is an autoresponder that checks the sender against a whitelist and a blacklist. If a new e-mail is in neither, then it bounces back an e-mail asking for a confirmation that the sender is a human. Simple!
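
The decision logic described above can be sketched roughly like this. It is not Active Spam Killer's actual code: the names are invented, and queueing the held message and sending the confirmation request are left out.

```python
from enum import Enum


class Action(Enum):
    DELIVER = "deliver"
    DISCARD = "discard"
    CHALLENGE = "challenge"


def triage(sender: str, whitelist: set, blacklist: set) -> Action:
    """Decide what to do with a message based only on the sender address."""
    address = sender.strip().lower()
    if address in whitelist:
        return Action.DELIVER
    if address in blacklist:
        return Action.DISCARD
    # Unknown sender: hold the message and ask for confirmation.
    return Action.CHALLENGE


def confirm(sender: str, whitelist: set) -> None:
    """Called when the sender answers the challenge; future mail is delivered."""
    whitelist.add(sender.strip().lower())


# Hypothetical addresses, used only for illustration.
whitelist = {"friend@example.org"}
blacklist = {"bulk@spam.example"}
print(triage("Friend@Example.org", whitelist, blacklist))
print(triage("stranger@example.net", whitelist, blacklist))
```

Note that the check trusts the sender address as given, so a forged address is handled exactly like a genuine one.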

Re:Active Spam Killer by Anonymous Coward · 2003-08-23 07:44 · Score: 1, Interesting

Oh yeah just two problems:
1: If I sent someone a mail and got a request to first prove myself, I'd just write that person off.
2: Just wait for a spammer to fake your address in a spam to another person using that software, you get a nice ping-pong game.
Simple!
Re:Active Spam Killer by Admiral+Llama · 2003-08-23 07:51 · Score: 2, Insightful

1. If you thought it was worthwhile to send me an e-mail in the first place, then you'll probably click the respond button for the bounce message. If not, then I probably don't want to hear from you anyway.

2. If someone spoofs an e-mail to me from a spam victim, the spam victim will get an e-mail asking them to prove they're real. Fat chance of them ever doing that. Who knows? Maybe the spam victim will be so impressed with the sheer brutality of Active Spam Killer, they'll try it too.
Re:Active Spam Killer by wheany · 2003-08-23 08:55 · Score: 1

2. And what if they are using Active Spam Killer as well? Or some other program that uses the same principle?
Re:Active Spam Killer by macshit · 2003-08-23 10:21 · Score: 1

If you thought it was worthwhile to send me an e-mail in the first place, then you'll probably click the respond button for the bounce message.

That's the theory, but in fact this sort of thing annoys many people, to the extent that they'll just give up on the idea of sending you mail (even if it was easy to `click the button').

If not, then I probably don't want to hear from you anyway.

Well I suppose that works if you only ever correspond with a small circle of friends...

Good luck if you ever actually try to interact with a wider audience though.

--
We live, as we dream -- alone....
Re:Active Spam Killer by Gunfighter · 2003-08-23 10:29 · Score: 1

Sounds a lot like TMDA:

http://tmda.net/

-- Gun

--
-- Stu

/. ID under 2,000. I feel old now.
Re:Active Spam Killer by schnuf · 2003-08-24 07:01 · Score: 1

It is clear from reading the article why Active Spam Killer and systems like it were left out.

The tests involved running multiple different systems against the same body of email that the author had already received. Given that Active Spam Killer requires an interaction between the unknown sender of the email and the system it can't be tested this way.

Re:Link Please by Anonymous Coward · 2003-08-23 07:25 · Score: 1, Funny

I think OSDN can handle being slashdotted because they ARE slashdot

SpamAssasin had Bayesnian turned off?! by SuperBanana · 2003-08-23 07:25 · Score: 4, Insightful

I noticed immediately that the author turned off SpamAssassin's Bayesian filter, claiming "it already has 5 points, that's enough". WTF does that mean? The whole point of SpamAssassin is to do a lot of tests, add the scores together, and then set the threshold you want (something he also doesn't modify; I changed my threshold after looking at the scores spams were getting and such).

I trained SA's Bayesian filter off of about 3 years of spam and legitimate email sent directly to me. SA as a whole is working nearly flawlessly: the only messages it has tagged as spam were those from users with improperly configured email clients AND suspicious email addresses AND using only HTML, i.e., a message that would damn well look like spam. However, like I said, I lowered SA's threshold by 2 points because I was having too many false positives (that was before I had properly trained the Bayesian filter, so perhaps I'll kick it up a point now).

One important note: when you get a falsely classified message, it's REALLY important to tell SpamAssassin's Bayesian filter about it. It's as easy as cut+paste: run sa-learn --spam --single (or --ham --single), hit enter, paste the message, hit control-D. Done!

--
Please help metamoderate.

Re:SpamAssasin had Bayesnian turned off?! by Anonymous Coward · 2003-08-23 07:44 · Score: 2, Interesting

However, like I said, I lowered SA's threshold by 2 points because I was having too many false positives (that was before I had properly trained the Bayesian filter, so perhaps I'll kick it up a point now).

I use SpamAssassin with the flag threshold set at 5, the default. I have procmail send any message from 5-10 into a spam mailbox which I clean out occasionally, and messages at 10+ straight to /dev/null (after a couple of months of also keeping those in the spam mailbox).

Having a properly trained Bayes database makes a huge difference, not just for flagging spam but for not flagging mail. This is because messages which get a low Bayes probability receive a negative score (from the Bayes test, which offsets any heuristic tests that the message may happen to trip). I now find that nearly all legitimate mail comes in below zero, and nearly all spam comes in above 15. I have never once seen a false positive - either in my testing period, or since I started trashing spam (I occasionally look through the procmail log just to make sure). I see a false negative once every couple of weeks, which is just fine (it's remarkable how inoffensive spam becomes when it's an occasional thing ;).

So yes, now that you've trained it, you should be able to move the threshold again (I assume by "lowered" you actually mean you raised it, ie. had it flag messages as spam only when they scored 7.0 or higher).
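
For illustration, the two-tier routing described above (a spam mailbox from 5 to 10, the bit bucket at 10 and over) boils down to something like the following. The poster does this in procmail; this is only a sketch of the same logic, and the function name, return strings, and handling of scores exactly on a boundary are assumptions.

```python
def route_by_score(score: float, flag_at: float = 5.0, discard_at: float = 10.0) -> str:
    """Map a SpamAssassin-style score to a destination."""
    if score >= discard_at:
        return "discard"      # the /dev/null case
    if score >= flag_at:
        return "spam-folder"  # kept for occasional review
    return "inbox"


for score in (-2.4, 7.3, 15.0):
    print(score, route_by_score(score))
```

A trained Bayes database pushes legitimate mail below zero and spam well above the upper cutoff, so few messages should land near either boundary.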
Re:SpamAssasin had Bayesnian turned off?! by mindriot · 2003-08-23 12:17 · Score: 1
In my case, SpamAssassin run at my University's CS department has been working extremely well for me, even better since they updated to use Bayesian filtering. My statistics since 2003-03-05, i.e. for the last 174 days:
- 3324 True positives
- 88 False negatives
- 0 False positives
- Somewhere around 2500 "True negatives"; though some mailing lists I receive are effectively whitelisted since mails are sorted in their respective IMAP folders by their mailing list affiliation before being filtered by their Spam status.
That gives me a 97.42% ratio of correctly classified Spam, i.e. true positives vs. false negatives.
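
For anyone who wants to check the arithmetic or plug in their own counts, a small sketch follows. The function names are made up; the numbers are the ones listed above, with "around 2500" taken as 2500.

```python
def catch_rate(true_positives: int, false_negatives: int) -> float:
    """Share of all spam that was correctly flagged."""
    return true_positives / (true_positives + false_negatives)


def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """Share of all legitimate mail that was wrongly flagged."""
    return false_positives / (false_positives + true_negatives)


print(f"{catch_rate(3324, 88):.2%}")
print(f"{false_positive_rate(0, 2500):.2%}")
```

The two figures need to be read together, since a filter can reach a high catch rate by flagging aggressively and paying for it in false positives.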

Maybe, if such a comparison is done, they should rather configure and train each filter to the best of its abilities, using the same reference data, and then compare the results... that would be a little more real-world usable. Additionally, one could of course also look at different use cases, such as the Office worker case or the Programmer case, where the latter one is subscribed to more mailing lists, for example.

My own spam solution.... by Kane+Skalter · 2003-08-23 07:26 · Score: 1, Redundant

I usually don't accept any email from people I don't know, so I simply set up my filter with a whitelist. That means that I filter a message out unless at least one of the following conditions is true:

Contains my initials. I simply ask my friends to insert my initials in the subject line. They're all happy to comply.
If I opt-in to something, like /. updates, I allow *that* domain (*@slashdot.org, for example). No third party co-brands are accepted.

Fair enough?
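
A rough sketch of those two rules, for the curious. The initials "KS", the addresses, and the choice to match the initials as a whole word in the subject are all assumptions made for the example.

```python
def accept(sender: str, subject: str, initials: str, allowed_domains: set) -> bool:
    """Accept mail from an opted-in domain, or with the agreed initials in the subject."""
    domain = sender.rsplit("@", 1)[-1].lower()
    if domain in allowed_domains:
        return True
    # Whole-word match, so the initials are not found by accident inside another word.
    return initials in subject.split()


allowed = {"slashdot.org"}
print(accept("news@slashdot.org", "Daily update", "KS", allowed))
print(accept("bob@example.com", "KS lunch friday?", "KS", allowed))
print(accept("offers@spam.example", "Cheap pills", "KS", allowed))
```

Anything that fails both checks is dropped, so first contact from a stranger never arrives.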

Going after the wrong people... by Anonymous Coward · 2003-08-23 07:26 · Score: 1, Insightful

Instead of going after Spammers, why not go after the companies that hire them to send us Viagra/Penis Enlargement/etc mails? Without them, no Spam. Also, I'd like to know who the fucktards are that respond to these mails and buy their products.

Re:Sad by Moth7 · 2003-08-23 07:27 · Score: 2, Funny

Maybe the site is too valuable to DoS?

What? No PopFile? by MrEnigma · 2003-08-23 07:28 · Score: 4, Interesting

They started off by quoting John Graham-Cumming, yet they didn't include his brainchild PopFile.

Check it out Here.

--
GeekWares - Buy and Download Today!

What About PopFile by MBCook · 2003-08-23 07:29 · Score: 4, Informative

What about PopFile? I've tried SpamAssassin and a few others, and I like PopFile the best. After a little training it's EXTREMELY accurate. It survived the deluge of mail I've gotten in the last few days (due to virii) with flying colors.

According to its internal statistics, it has classified 2821 messages as of the time I type this. It has made only 95 errors (often close calls, so I don't blame it). That puts it at an accuracy of 96.63%. For the record, of the e-mail I've gotten, it's 308 messages of ham, 2513 spam.

I have only been using PopFile since June 7th of this year, but it's working fantastic. The only thing I've used that's this good was Cloudmark's SpamNet, who stabbed the community in the back, so I switched to something else. I'm glad I've found PopFile, and I suggest you try it too if you're looking for something good.

--
Comment forecast: Bits of genius surrounded by a sea of mediocrity.

Re:What About PopFile by Natal+VC · 2003-08-23 08:51 · Score: 1

POPFile is excellent! Check out my stats:

Classification Accuracy
Messages classified: 8,291
Classification errors: 59
Accuracy: 99.28%

Messages Classified
Bucket    Classification Count   False Positives   False Negatives
general   8,374 (79.42%)         17                26
news        738 ( 6.99%)          8                 7
spam      1,431 (13.57%)         25                17
Re:What About PopFile by jedrek · 2003-08-23 08:51 · Score: 2, Interesting

I use PopFile as well and am equally satisfied. I make sure to reclassify all false negatives and positives. Accuracy is at 97.65%, and 2,802 of the 5,432 mails I've gotten since I installed it were spam.

When me and my friend had a site featured on Yahoo, USA Today, NYT, etc. the spam just went THROUGH THE ROOF. But, thanks to PopFile I didn't have to see any of it.

PSAM by po8 · 2003-08-23 07:29 · Score: 4, Informative

See our PSAM project site for a refereed paper evaluating several machine learning spam filtering techniques (although not specific filters). This site also contains large standardized corpora for evaluation. The paper contains a number of tips on evaluating ML spam filters.

The /.-referenced article has some good ideas about evaluation. I particularly liked the explicit discussion of the false positives. The recommendations at the end are excellent. On the other hand, the evaluation isn't across a broad or obviously representative corpus, many of the tests are a bit odd, and the ROC tradeoffs are not discussed. In particular, the evaluation set for the tests did not include enough ham to be able to accurately estimate the false positive rate: consider what would happen to the precision estimates if 0.5 were added to each of the numbers in the false positive table.
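
To make the 0.5 thought experiment concrete, here is a small Python sketch. The filter in it is hypothetical: I am assuming it flags all 68 spam messages of a 68-spam test set (the size mentioned elsewhere in this thread) and no ham, which is not a result taken from the article.

```python
def precision(true_pos: float, false_pos: float) -> float:
    """Fraction of messages flagged as spam that really were spam."""
    return true_pos / (true_pos + false_pos)

# Hypothetical filter: flags all 68 spam messages and no ham.
flagged_spam = 68

observed = precision(flagged_spam, 0)      # zero false positives observed
smoothed = precision(flagged_spam, 0.5)    # add 0.5 to the false positive count

print(f"observed {observed:.4f}, smoothed {smoothed:.4f}")
```

With so few messages, half a phantom false positive already moves the estimate noticeably, which is the sense in which a zero in the table is weak evidence of a zero rate.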

Overall, though, this was an interesting evaluation, and I'm glad that the author published it.

Re:What? No PopFile? by MrEnigma · 2003-08-23 07:31 · Score: 1

Whoops! The real url is

http://popfile.sourceforge.net/

--
GeekWares - Buy and Download Today!

Re:So weird by brokencomputer · 2003-08-23 07:32 · Score: 1

forgot to mention the last part of my address is @notsohotmail.com and that isn't hard to remember. Haven't ever gotten spam.

--
The Television Wiki

Use Spam Filters To Enlarge Your Penis by Tablizer · 2003-08-23 07:37 · Score: 5, Funny

That's right! Our company has found a high-tech way to use various anti-spam tools to enlarge your penis. My pennis is noww sso lrage that i Cannnot type curretcly. Itt gtes in teh way.

Please visit www.spamfilters2enlarge.com

Act before midnight and get a $30 discount.

--
Table-ized A.I.

WRONG. by imsabbel · 2003-08-23 07:38 · Score: 5, Informative

Of course your Bayesian filter will QUICKLY learn that html tags that create invisible text are VERY common in spam and nowhere else -> problem solved
Don't forget that the filter sees more than the eye...
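
A sketch of what "sees more than the eye" could look like in a tokenizer, using Python's standard html.parser. The `tag:` and `attr:` token prefixes are my own naming, not the format of any particular filter.

```python
from html.parser import HTMLParser

class MarkupTokenizer(HTMLParser):
    """Collect visible words plus tokens describing the markup itself."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"tag:{tag}")
        for name, value in attrs:
            # Attribute values such as color=#ffffff become features too.
            self.tokens.append(f"attr:{name}={(value or '').lower()}")

    def handle_data(self, data):
        self.tokens.extend(word.lower() for word in data.split())

def tokenize(html_body):
    parser = MarkupTokenizer()
    parser.feed(html_body)
    parser.close()  # flush any text still buffered after the last tag
    return parser.tokens
```

Fed `<font color="#FFFFFF">hello</font>`, this yields a token for the white font colour alongside the word itself, so the trick used to hide text becomes one more thing the filter can learn from.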

--
HI O WISE PRINCE. WHT TOOK U SO DAM LONG?

Web interface for spamprobe by bigberk · 2003-08-23 07:42 · Score: 2, Informative

If you decide to try out spamprobe or another Bayesian filter, try this web interface which lets you easily reclassify mail, even those marked as spam. I found that "training" the Bayesian filters was the hardest part; this definitely simplifies the process.

Re:Web interface for spamprobe by Fred+Ferrigno · 2003-08-23 10:34 · Score: 1

Why not simply use POPFile? It has a very nice web interface that makes it very easy to reclassify false positives or false negatives. It also supports multiple email categories; unlike most other Bayesian filters, you can filter email as more than just "spam" or "not-spam". I myself have five classification groups: spam, mail from my university, two mailing lists I'm on but don't read actively, and "real" email. It works very well in concert with Mozilla, and my email is automatically directed to the appropriate folder based on POPFile's extra headers.

Off topic but... by CGP314 · 2003-08-23 07:44 · Score: 2, Informative

It wasn't mentioned in the article, but I really must plug popfile. It filters out my spam yes, but it is also a general mail categorizer. It sorts ten yahoo groups for me, personal, work, and school related emails. I know you think you could do this with rules for the emails, but for example, I get several hundred emails a day from the Harry Potter for Grownups List. Popfile can sort them into 'probably interesting' and 'probably not' for me. Very nice.

Re:Off topic but... by lederhosen · 2003-08-23 08:03 · Score: 1

Why not redirect them all to /dev/null ;-)
Re:Off topic but... by Nucleon500 · 2003-08-23 14:44 · Score: 1

I use POPFile, and I really like it, but it's not quite ideal. Its very best feature is its configurable buckets, which aren't just limited to ham and spam. I have buckets for "personal," "mailing list," "automatic," and "spam." One could get even more creative, with something like an "interesting" bucket.
What I really want is something with a more generic interface. POPFile's POP3 proxy and webserver interface mostly limits it to email. I'm thinking of starting a project to make a generic text-classifier in Ruby. A standard class interface, with "message" and "bucket" abstractions, whose functions could be used from a command line filter, as a plugin, and yes, as a POP3 proxy. In other words, following the unix philosophies of doing one thing well and being modular. But it could include black- and white-listing, neural nets, or any other way to decide which bucket a token string belongs in. (And of course the tokenizer would be similarly modular.) Sound interesting?
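
To give an idea of how small the core of such a classifier could be, here is a sketch of a multi-bucket naive Bayes in Python (not Ruby, and not POPFile's actual code). The class name, the whitespace tokenizer, and the add-one smoothing are all my own assumptions.

```python
import math
from collections import Counter, defaultdict

class BucketClassifier:
    """Tiny multinomial naive Bayes over arbitrary named buckets."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # bucket -> token -> count
        self.message_counts = Counter()           # bucket -> messages trained

    @staticmethod
    def tokenize(text):
        # Deliberately naive; a real tokenizer would be a pluggable module.
        return text.lower().split()

    def train(self, bucket, text):
        self.token_counts[bucket].update(self.tokenize(text))
        self.message_counts[bucket] += 1

    def classify(self, text):
        vocabulary = {t for counts in self.token_counts.values() for t in counts}
        total_messages = sum(self.message_counts.values())
        best_bucket, best_score = None, -math.inf
        for bucket, counts in self.token_counts.items():
            # Log prior plus log likelihood with add-one smoothing.
            score = math.log(self.message_counts[bucket] / total_messages)
            denominator = sum(counts.values()) + len(vocabulary)
            for token in self.tokenize(text):
                score += math.log((counts[token] + 1) / denominator)
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket
```

Nothing here knows about POP3 or email, so the same object could sit behind a command line filter, a plugin, or a proxy.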

--
Litigious bastards
Re:Off topic but... by brettw · 2003-08-25 15:26 · Score: 1

Yes, but does it only work with POP3? Looking over the page makes that appear to be the case...

C/R and Bayesian filtering by pongo000 · 2003-08-23 07:46 · Score: 3, Interesting

An interesting thread here about how TMDA, a C/R filter, used in conjunction with SpamAssassin, can provide the best of both worlds. While TMDA is by itself effective, there seem to be some humanistic issues involving the assumption that all e-mailers are spammers unless they prove otherwise. The thread explains how Bayesian filtering can be improved by using a decent C/R filter like TMDA without alienating people that send legitimate e-mail.

Personally, I figure anyone thin-skinned enough to be insulted by my C/R filter probably isn't worth talking to anyways, but I digress...

Stop spam the low-tech way. by Futurepower(R) · 2003-08-23 07:49 · Score: 2, Insightful

The quickest way to stop spam in the U.S. would be to have a respected person such as the Surgeon General of the United States say that

1) There is no way to increase the size of your body parts,

2) The cheap Viagra is not Viagra,

3) and so on.

We can help by telling everyone we know not to buy anything from spam. Next time you are at a party or family gathering, make that point.

Spam would disappear if there were no buyers. We need to make it culturally unacceptable to buy anything that is advertised through spam.

Re:Stop spam the low-tech way. by Zocalo · 2003-08-23 08:00 · Score: 1

Nice idea in theory. Unfortunately I suspect it would have even less effect on the spam situation than the "Cigarettes may damage your health" warnings on cigarette packs. Let's face it, given the rate of reduction in smoking when your health is at risk, perhaps even your life as a result of Surgeon General warnings, what effect do you think this is going to have on the typical male with adequacy issues?

--
UNIX? They're not even circumcised! Savages!
Re:Stop spam the low-tech way. by hankwang · 2003-08-23 08:04 · Score: 3, Funny

> The quickest way to stop spam [...] say that [...] 1) There is no way to increase the size of your body parts, 2) The cheap Viagra is not Viagra,
Unfortunately, you risk that people just remember "cheap viagra" and "increase the size", with the opposite effect as a result.
In the Netherlands, there is or maybe was an urban legend that a big tea brand will donate a wheelchair to whoever gathers one million tea bag labels of that brand. Presumably, the tea brand tried informing the world through advertisements in the newspapers, but that turned out to only increase the number of people requesting more information.

--
Avantslash: low-bandwidth mobile slashdot.
Re:Stop spam the low-tech way. by Dark+Lord+Seth · 2003-08-23 08:39 · Score: 1

Never overestimate stupid people.

--
Hate me!

Re:Mozilla Thunderbird by CGP314 · 2003-08-23 07:49 · Score: 1

I don't know if thunderbird uses the same filter as mozilla, but for me, thunderbird is horrible at spam recognition. I have an account that gets about 50 spams a day, and one legit email from 'word a day'. It consistently screws up even after weeks of training. Thunderbird couldn't find a spam labeled 'young virgin sluts selling herbal viagra from the Congo'. But then again, it's only 0.1, so I'm more than willing to cut it some slack.

A message from a spammer by Anonymous Coward · 2003-08-23 07:58 · Score: 5, Insightful

As a professional sender of UCE, I just want to tell you slashdotters to keep on playing with your spam filters. As long as you use spam filters on your e-mail, I can continue to reach my real intended targets, those non-slashdotters who do not know better and will buy my products or click through to my client's websites. Your filters really help cut down on the complaints to the internet service providers I do business with, and as long as not too many complaints come in, their marketing people assure me we can do business. Of course, I still waste your bandwidth and mailbox capacity, but you no longer complain to uce@ftc.gov, my access providers, or anyone else who might cause me problems. My yahoo and hotmail and other accounts for replies are lasting much longer before getting shut down because someone complained to these service providers. And my clients are even reporting that they can start mailing out 800 numbers like 1-800-901-3719 again and they will not have you damn spammers set up their modems to keep autodialing them, since you spend your own time and effort to filter the e-mail and only clueless users who might actually call see the numbers.

Please don't bother your Congressmen or Senators proposing legislation that might not work 100%. Just keep on filtering the spam I send you, I know you would have never bought from me anyway. That you can filter legitimizes my business and my waste of your bandwidth.

P.S. To be sure of not getting a false positive, be sure to send all filtered mail to a special folder. Waste your storage space storing the mail until you manually go through every piece to be sure you didn't accidentally filter something important. Of course, this will take exactly as much effort as it would have to just check the e-mail when it first came in, not to mention the extra effort spent in setting up the filters and the extra space for storing your incoming spam folder, but what the heck. You geeks enjoy wasting time this way, and I certainly appreciate it. It makes the work of all us spammers much easier.

Re:A message from a spammer by Anonymous Coward · 2003-08-23 08:34 · Score: 1, Interesting

"this will take exactly as much effort as it would have to just check the e-mail when it first came in"

Not so. It's much easier to manually filter when you have a good idea what to expect. Since the content of the probable-spam mailbox is, er, probably spam, going through it is vastly quicker and more reliable than trying to sift out the randomly distributed real mail from a single unfiltered mailbox. Likewise, the few false negatives in one's inbox stick out much sorer when most spam has been diverted. Doing a best-automated-guess sort into separate piles beforehand really capitalises on the way a human brain distinguishes items in a set. The relative distribution within a pile is important.

Your earlier points are interesting, though.
Re:A message from a spammer by jpetts · 2003-08-23 11:05 · Score: 2, Insightful

This might be considered interesting, but I think it is really just a troll.

However, one interesting point that trollboy makes, is that the 1-800 numbers end up in the spam, and we don't see them: why not modify the filter so it automagically pulls out all such numbers from the spam, so that they can be easily on hand for those people who want to set up autodialers? In a way this is poetic justice, being analogous to the way the scumbag spammers harvest email addresses from web pages. So yet again, the classification allows an easy way to harvest spam 1-800 numbers from genuine ones.

Thanks for the suggestion, spammer or troll, whatever you are!!

PS Googled for the 1-800 number the idiot mentioned in his email, but nothing came up. Did anybody dial it? I'm nowhere near a public telephone at the moment. I'll try when I get back to civilisation if nobody else has already done it...

--
Call me old fashioned, but I like a dump to be as memorable as it is devastating - Bender
Re:A message from a spammer by mce · 2003-08-23 11:21 · Score: 3, Insightful

There's more to the time-spent-on-spam comparison than what you wrote. If you filter all the spam and quickly check it once a day or once a week, you only look at it whenever you "want" to: i.e. probably during a dead moment inbetween meetings or some such. But if you let it get into your inbox, whatever you're doing may needlessly get interrupted every so many minutes/hours. After all, each e-mail that reaches your inbox might (for instance) be that one important reply you're waiting for and have to process asap...

--
Linux user since early January 1992.
Re:A message from a spammer by kirkjobsluder · 2003-08-23 11:48 · Score: 1

Just spent about 6 minutes going through my spambox in search of false positives. The great thing about spam filtering is that the false positives stick out like a sore thumb in the midst of subject lines like "Can you satisfy her" and "&%IDU&*!". And 90% of the false positives are low-priority mail, like announcements for conferences I can't afford to go to.
Re:A message from a spammer by kyz · 2003-08-23 15:03 · Score: 1

why not modify the filter so it automagically pulls out all such numbers from the spam, so that they can be easily on hand for those people who want to set up autodialers

Because this is unthinking vigilantism, real pitchforks and torches stuff, and spammers will just use your wrath to launch joe jobs against anti-spam companies and individuals.

--
Does my bum look big in this?
Re:A message from a spammer by Czmyt · 2003-08-23 15:22 · Score: 1

Also, having really good filters or blocklists in place makes it possible to receive/review your e-mail on a wireless device or a low-bandwidth dialup connection from a remote city. Without good technical solutions to the spam problem, spam would make using wireless devices very expensive and a major waste of time.
Re:A message from a spammer by Czmyt · 2003-08-23 15:25 · Score: 1

Another point, with a good filter like SpamAssassin that scores the spamminess of e-mail, you can pretty safely throw away anything with a high score, so you really only have to review the messages that are very likely spam, not the stuff that is most definitely spam.
Re:A message from a spammer by Nucleon500 · 2003-08-23 15:30 · Score: 1

So, you're saying that filtering spam helps spammers? I don't buy it. For one, it makes it much easier to complain, if you're so inclined. Filtering spam doesn't legitimize it any more than locking your house legitimizes stealing. Spam filtering also has the effect of minimizing the unbridled rage spam causes, which will cut down on reactionary legislation. I think that's a good thing, because governments in general don't understand the Internet but aren't afraid to meddle with it.

--
Litigious bastards
Re:A message from a spammer by pueywei · 2003-08-23 15:44 · Score: 1

What is to stop us from setting a rule where classified spam is automatically forwarded to uce@ftc.gov? ;)
Re:A message from a spammer by Blain · 2003-08-23 19:11 · Score: 1

Indeed, spam filtering and monitoring the filtering is still much (much) faster than just reading through your inbox at anything that comes by. Setting up POPFile didn't take all that long, and requires no initial training -- you just train on errors. Setting up Mozilla/TB to work with it took me probably another hour or so, but I've got quite a few buckets and am subscribed to quite a few mail lists. The TB filtering system works quite well -- I can have a general "list" filter, and then specific filters and folders for individual lists that I can put in front of the general filter, so I don't have to generate folders and filters for my very low volume lists right away.

Checking my spam folders has gotten quite a lot faster too, using the junk mail settings on TB. I have it set so that anything I manually designate as "junk" is deleted. So I uncheck anything listed as junk, and then put my mouse on the "junk" setting for the first message. It takes a fraction of a second to verify that the message is spam and click the mouse button -- I can blow through a dozen or more messages in half a minute, maybe.

Of course, the point of the parent was that we should not be protecting ourselves from spam, but, rather, should be anti-spam activists trying to stamp out the spammers. I think there's a point in there, but it's pretty scant. Effectively fighting spam is something that takes a fair amount of time and effort -- it's mostly "for professionals only." Back in the day when spam was much more rare, I did my part to report spammers to their postmasters. I remember a particularly nasty exchange with a spammer who was threatening to have hundreds of people mail bomb me. These days, it's darn hard to figure out who it is you should be yelling at and how to reach them. This is not something that every user should try -- their time will be better spent using spam filtering tools. Setting up automagic forwarding to the ftc might not be a bad idea -- anybody want to make a TB extension along those lines? But my life is far too short to spend minutes and hours a day every day devoted to dealing with spammers.
Re:A message from a spammer by clambake · 2003-08-24 01:23 · Score: 1

Of course, I still waste your bandwidth and mailbox capacity, but you no longer complain to uce@ftc.gov, my access providers, or anyone else who might cause me problems.

Actually, my email is set up to do exactly the opposite. Everything that SA tags as spam *automatically* gets forwarded to uce@ftc.gov without any effort by me.
Re:A message from a spammer by Pharmboy · 2003-08-24 14:02 · Score: 1

Because this is unthinking vigilantism, real pitchforks and torches stuff, and spammers will just use your wrath to launch joe jobs against anti-spam companies and individuals.

So can't we all just get along? This is the kind of excuse that just gets me. Sometimes a little vigilantism isn't such a bad thing. The way several people snail mail spammed a notorious spammer, by making so many requests for catalogs to be sent to his home address, that he couldn't read his mail because he was getting bags and bags every day? Yes, this is a proper application of vigilantism.

Most companies that give an 800 number AND spam deserve to be autodialed until they quit spamming. Their actions make spam seem legitimate, and also make autodialing/spamming them seem just as legitimate.

--
Tequila: It's not just for breakfast anymore!

Mozillas Filters + SA = Kick ass solution! by BrookHarty · 2003-08-23 07:59 · Score: 3, Informative

Don't know why we didn't see Mozilla's filters (Maybe that's covered under Bayesian filters?)

I'm using the standalone Thunderbird and it catches everything that passes by Spamassassin. Spam is marked but never deleted, so I can go back and check. Some spam programs will delete email, which could delete a good email, unacceptable.

Basically, I'm using a mandrake linux box, imap, procmail, fetchmail and spamassassin. Easy, and I can send/receive email from my linux box, and port 25 is blocked from the Net so nobody can use me as a bouncer.

Only problem I had was, there was no complete document to set this up, I had to piece each part together.

So for anyone who wants to know, here are the quick steps.

1. I'm using mandrake, but had to update SA for the sa-learn utils. (Gotta train SpamAssassin)
2. Setup fetchmail in your personal account.
3. Setup .procmailrc in your home dir

DROPPRIVS=YES
VERBOSE=ON
LOGFILE=/home/useraccount/procmail.log

:0fw
| /usr/bin/spamc
4. Setup your user_prefs in your local directory for SA. (mine, but im no SA expert, but it works)
required_hits 5
rewrite_subject 0
use_terse_report 1
report_safe 1
use_bayes 1
auto_learn 1
ok_locales en
use_pyzor 1
pyzor_max 9
pyzor_add_header 1
use_razor2 1
always_add_headers 1
always_add_report 1
spam_level_stars 1
pyzor_add_header 1
skip_rbl_checks 0
#timelog_path /home/useraccount/.spamassassin/timelog

5. As root make sure IMAP and SpamAssassin are running.
6. Load Thunderbird, use Imap, use filters on x-headers.
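
For step 6, the mail client only has to look at the headers SpamAssassin added. Here is a rough Python sketch of that check; it assumes the usual X-Spam-Flag and X-Spam-Level header names, and the two helper functions are my own, not part of Thunderbird or SpamAssassin.

```python
from email import message_from_string

def is_tagged_spam(raw_message: str) -> bool:
    """True when the message carries an X-Spam-Flag: YES header."""
    msg = message_from_string(raw_message)
    return msg.get("X-Spam-Flag", "").strip().upper() == "YES"

def star_count(raw_message: str) -> int:
    """Number of stars in X-Spam-Level, a rough integer score."""
    msg = message_from_string(raw_message)
    return msg.get("X-Spam-Level", "").count("*")
```

A client-side filter on these headers only moves or marks mail, so nothing is deleted before you have had a chance to check it.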

Re:So weird by lederhosen · 2003-08-23 08:00 · Score: 1

Neither have I, but then I do not publish my address
on the net. When I create "register accounts"
on hotmail.com or mail.com they get flooded right away.

Recommendations by fluxrad · 2003-08-23 08:17 · Score: 1

Anyone care to point out a decent way to use SA's bayesian filter with this setup:

I have a linux box running as my web/mail server that has spamassassin on it for anyone who wants to use it (setup .forward and .procmailrc to do this). I'm currently deleting spam (score = 5)

The problem is how to get spam and ham from Outlook back to the linux box correctly. To my knowledge, outlook doesn't export mail in any way that's readable by the sa-learn script. I'd like to setup a bayesian filter, but it seems like a lot of effort to get rid of the 4 or 5 spams that SA actually does let through each day.

--
"It is seldom that liberty of any kind is lost all at once." -David Hume

Re:Recommendations by dagarath · 2003-08-23 08:39 · Score: 1

imap is the simple solution, setup imap and sort your spam / ham messages into imap folders. Then use sa-learn against the imap folders on the server.

SpamBayes works really well for Outlook. by RNLockwood · 2003-08-23 08:21 · Score: 5, Interesting

I use SpamBayes (free) with Outlook on my W2K machine. I trained it with over 400 SPAM and over 1000 non-SPAM emails. I get about 45 SPAM each day and my ISP, attglobal, filters out about 40 of them. The SPAM that gets to my mailbox are the ones that pass through the attglobal filter and that filter has NEVER given me a false positive for more than 2000 SPAM. Those SPAM are put in a special folder on the server for inspection but I now just delete them en masse every week or so.

That means that SpamBayes is filtering only the hardest emails to classify and so far it has only given me one false positive. I got one false negative after training it for the first time. SpamBayes also has a folder for messages that it is not sure of and so far they have all been SPAM. I seldom have to do more than inspect the sender and subject to confirm that they are SPAM.

Each time a message is automatically moved to the SPAM folder (or moved back to the Incoming folder) the training set is adjusted for that email so I don't have to re-train.

To sum up I'm really impressed by well designed Bayesian filters and this one in particular. I think it's worth while to take the time to build up a corpus of SPAM and "good" messages as I can then evaluate competing filters.

--
Nate

Re:SpamBayes works really well for Outlook. by Anonymous Coward · 2003-08-23 08:40 · Score: 1, Interesting

that filter has NEVER given me a false positive for more than 2000 SPAM.

Actually the false positive rate should be measured against the pool of non-spam messages (from which false positives are drawn). The spam incidence is irrelevant.

Spam isn't an acronym btw (doesn't need to be IN CAPS ;).
Re:SpamBayes works really well for Outlook. by howhardcanitbetocrea · 2003-08-23 10:40 · Score: 2, Interesting

Totally agree. I have tried Spam Pal - which was good. Spam Assassin which was OK and have now been using SpamBayes since finding it via another story on /.
Spambayes is excellent.

--

President ISES
(International Society for Elimination of Sigs)
Re:SpamBayes works really well for Outlook. by RNLockwood · 2003-08-23 10:57 · Score: 1

>Spam isn't an acronym btw (doesn't need to be IN CAPS ;).

Damn, I knew that!

Thanks

--
Nate
Re:SpamBayes works really well for Outlook. by jpetts · 2003-08-23 11:14 · Score: 2, Insightful

I think it's worth while to take the time to build up a corpus of SPAM and "good" messages as I can then evaluate competing filters.

Anybody looking for a can of spam might want to check out the Ling Spam corpus created by Ion Androutsopoulos, also available here.

--
Call me old fashioned, but I like a dump to be as memorable as it is devastating - Bender
Re:SpamBayes works really well for Outlook. by RNLockwood · 2003-08-23 13:37 · Score: 1

I started using the uppercase form, without engaging my brain, when some spellchecker suggested it. Well, it's not as bad as ATM Machine...

--
Nate

Re:So weird by arcanumas · 2003-08-23 08:32 · Score: 2, Insightful

I am not sure about getting spam with such an address as saf4502@E8Hkl3.biz. I AM certain, however, that I would not receive regular mail.
You cannot put it on a business card, people will always type it wrong. You definitely cannot pronounce it over the phone.
In fact, most would give up on contacting me through e-mail just looking at this monster.

--
Slashdot Sig. version 0.1alpha. Use at your own risk.

Re:So weird by frovingslosh · 2003-08-23 08:32 · Score: 3, Insightful

Don't spend time trying to filter-- get an obscure email adress like saf4502@E8Hkl3.biz

This is a pretty bogus "fix". It might work if you set up such an account and never use it, but if it's used and gets into a spam database the computers can propagate this e-mail address just like they can any other. The spam database computers simply don't care if the name is "joe" or "saf4502", they deal with both exactly the same. All you'll really do is make it harder for you to pass along an e-mail address verbally to someone.

Spammers get these addresses any number of ways. Many are harvested tens of thousands at a time. If you ever use that e-mail address in a usenet news group, for example, it will get harvested. Of course, you can munge it and give instructions in the post for how someone wanting to reply should unmunge it (replace the number in my name with the square root of the number) but realistically few are going to bother to go to extra work to unmunge an e-mail address, so if you made a post to really try to get some information back rather than to just hear yourself talk, that's a big waste.

Same if you want to post a contact e-mail on your website.

Businesses you deal with are even less likely to unmunge your e-mail address, and if they do you certainly have no protection that they are not the ones about to sell their mailing list database to a spammer.

And even if you just keep your e-mail address for close personal contacts, one of them may eventually come across what they think is a "cute" electronic greeting card site on the web and give them your address to send some damn picture of a dancing bunny, or use your e-mail address on some site with an "e-mail to a friend" link for a story they think you would be interested in, or even just let their computer get infested with some worm that goes through address books, and your address is in some spam database, soon to be in thousands. Having a hard to remember e-mail address is no more protection than having an easy to use one is.

I even created a dummy e-mail address one time on Mindspring, with a very uncommon name and numbers. Never used it. It started getting spam after a while. Either Mindspring sold the names, or they had a bad security system and some employee sold the names, or they had a really bad security system and someone hacked in and harvested the names.

--
I'm an American. I love this country and the freedoms that we used to have.

massing spam for training purposes. by herrd0kt0r · 2003-08-23 08:38 · Score: 3, Interesting

since the filters do better after being trained with lots of spam, anyone think of gathering up a huge collection of spam to give to other people? i mean exporting a corpus of spam from outlook, sticking it up for download somewhere, and letting other people import it into a spam folder. then other people could run their filter of choice and train it!

you could even make it all official-like, and somehow guarantee that the spam that's up for downloading is "official" and "virus-free" and "safe for your computer." you know, do geek stuff like check hashes or whatever it takes to verify that the spam collection is legit. whatever it takes to ensure that someone else hasn't filled it with a ton of virus/trojan/etc. attachments. or whatever. i dunno. you know, somehow guarantee it's safe.
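
the "check hashes" part could be as small as this python sketch. the function names are made up for illustration, and it assumes the corpus maintainers publish a SHA-256 hex digest next to the download.

```python
import hashlib

def sha256_of(path: str) -> str:
    """Hash a file in chunks so a large corpus does not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def corpus_is_intact(path: str, published_hex: str) -> bool:
    """Compare the local file's digest with the published one."""
    return sha256_of(path) == published_hex.strip().lower()
```

a matching hash only shows the file is the one that was published, not that its contents are harmless, so the archive itself would still need vetting.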

imagine it! download spambayes, get spambayes to connect to the official spambayes spamcorpus server, and download the latest 2000 spams! instant training.

anyway. just an idea. mod me down as -1, herrd0kt0r. 8P

Re:massing spam for training purposes. by bobbozzo · 2003-08-23 09:55 · Score: 3, Informative

YES: http://spamarchive.org/
Also remember you need to feed nonspams to bayesian filters also.

--
Nothing to see here; Move along.
Re:massing spam for training purposes. by jpetts · 2003-08-23 11:16 · Score: 1

Also remember you need to feed nonspams to bayesian filters also.

Yes indeed, but don't forget: ham is personal, while spam is universal, so you need your own corpus of ham.

--
Call me old fashioned, but I like a dump to be as memorable as it is devastating - Bender

Consumer Reports did an article on that too by Stavr0 · 2003-08-23 08:48 · Score: 3, Informative

Ratings - Spam-blocking software

SAProxy for Windows (Based on SpamAssassin) got the highest marks.

Re:Consumer Reports did an article on that too by cyways · 2003-08-24 07:01 · Score: 1

While I'm not surprised to see no reviews of filters for OS's other than Windows and MacOS, I am a bit surprised that there was only one free filter reviewed. Is it really the case that none of the dozens of open source filters has been ported to windows?

And why not Mozilla? Outlook Express and Apple Mail are included.

Isn't part of consumer "protection" protecting consumers from having to spend money when they don't need to?

It is learning the words. by Population · 2003-08-23 08:51 · Score: 1

Simply adding random text to a message is not enough to get it past SpamAssassin.

I run SpamAssassin, I know that it catches that stuff.

The reason it does catch it is that it uses a WEIGHTED system for classification. If the message has the characteristics of spam, but has random words in it, it will still be considered spam UNLESS those random words have been used previously in ham messages that it has learned.

Now, the odds of the spammer hitting upon words that my version of SpamAssassin has learned as ham are very slim.

And if he did manage it, those same words would most likely not be in someone else's ham list.

So spam that can get through to me will not get through to 90% of the other SpamAssassin users.
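To make the "random words don't help" argument concrete, here is a minimal sketch. It assumes a Graham-style naive Bayes combination, which is NOT SpamAssassin's actual code; the names `combine`, `score`, and the `learned` table are made up for illustration, and token probabilities are assumed to lie strictly between 0 and 1.

```python
def combine(probs):
    """Combine per-token spam probabilities, Graham-style (assumed formula)."""
    prod_spam = 1.0
    prod_ham = 1.0
    for p in probs:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)


def score(tokens, learned, unknown=0.5):
    """Score a message; tokens never seen in training get a neutral weight."""
    return combine([learned.get(t, unknown) for t in tokens])


# Hypothetical per-user table: probability that a token indicates spam.
learned = {"viagra": 0.99, "meeting": 0.01}
```

With a neutral weight for unseen tokens, padding a spam with gibberish leaves the score essentially where it was. Only a word this particular user has learned as ham ("meeting" here) pulls it down, and that table differs from user to user.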

I'm running SpamAssassin at work and it is catching over 1,000 spam messages for every false positive or false negative that it lets through. Despite the spammers including random words and random text and all of their other tricks.

mail.app by b17bmbr · 2003-08-23 08:52 · Score: 1

i use apple's mail.app with bayesian filtering. i have received maybe 4 or 5 true spam emails in over a year. i haven't yet missed any real emails either. i would have to say that's pretty good. otoh, our groupwise system at work is fscking horrible. i get tons of fscking spam. i have had to set dozens of rules, and it still doesn't matter.

--
My problem? I was perfectly gruntled, until some numbnuts came by and dissed me.

Pine by fluxrad · 2003-08-23 08:58 · Score: 1

I use Pine when I'm at work (ssh into my box at home), but generally speaking, I use Outlook when I'm in winders or Moz when I'm booted into linux on my desktop.

I suppose, as the other poster mentioned, that I could turn on IMAP, but like I said before, it sure seems like a gigantic pain in the ass to do nothing more than filter out a few extra emails a day.

--
"It is seldom that liberty of any kind is lost all at once." -David Hume

Re:So weird by brokencomputer · 2003-08-23 09:01 · Score: 1

I don't think you get the point. Notsohotmail.com is easy to pronounce and remember. You are purposely ignoring my second post to prove your point. I am surprised that this was modded up as insightful when I had clarified my point already. I was kind of joking and made a mistake about that.

--
The Television Wiki

Re:So weird by maxume · 2003-08-23 09:03 · Score: 1

If you don't worship your tinfoil hat, you should check out mailinator.com. You don't need to set anything up, you just check the mailbox that you just made up.

--
Nerd rage is the funniest rage.

Five Bayesian filters were enough by Sits · 2003-08-23 09:09 · Score: 4, Informative

Here's a quote from the article:

Also, SpamAssassin has a Bayesian classifier built in, but it wasn't used in these tests, since having five was enough.

If you reread the slightly ambiguous sentence in context you will realise he meant he had evaluated five Bayesian filters and felt that was enough. Nothing to do with SpamAssassin's point system...

Bandwidth is cheap, disk is cheap, CPU is cheap by Moderation+abuser · 2003-08-23 09:10 · Score: 1

My time and the time of 100,000 users is not.

And since the stuff like the spam filters are getting pretty generic, they can be configured and replicated to numpty users reducing spamming effectiveness by several orders of magnitude.

Poor attempt at irony BTW.

--
Government of the people, by corporate executives, for corporate profits.

Some comments by zaad · 2003-08-23 09:11 · Score: 2, Interesting

I'm not disagreeing with the posters who stated that he has a low sample size. It might be one of the reasons why he doesn't have a higher catch or recall rate.

The main problem I see with bayesian filters is that they are complicated and nontrivial to set up. I've been playing with Bogofilter for several months. And even with sub 1000 corpuses, I get a very high catch rate (greater than 90-some %, though I don't have exact numbers).

The method that I've employed is start with a small set of three hundred or so ham and spam corpuses, then to train on error over time. It's a pain in the ass because I still have to continually inspect the results and tweak the databases.

In addition to that, there are at least a half a dozen parameters that contribute to the success or error rates. So much so that bogofilter actually comes with bogotune to analyze the corpuses to suggest optimal parameters.

So give the guy a break. I wouldn't say his results are robust enough for an academic publication, but it isn't worthless. It's interesting enough for a read. It's more work than many of us are willing to do.

Also an interesting read is Comparing Bayes Chain Rule with Fisher's Method for Combining Probabilities.

Re:Some comments by Entropy_ah · 2003-08-23 16:11 · Score: 1

So much so that bogofilter actually comes with bogotune to analyze the corpuses to suggest optimal parameters.

Correct me if i'm wrong but I think bogotune has nothing to do with success/error rates. It deals with the Berkeley DB backend for speed.

--
my other penis is a vagina

IMAP by Sits · 2003-08-23 09:16 · Score: 1

I second the comment about using IMAP. I have been using it very successfully and it makes it easy to move spam inbox messages from whatever email program I'm using into a spam mailbox. I then have a script called learn-spam.sh that could be set to run each night to reclassify spam / ham.

Bayesian filters vs. IMAP by LauraW · 2003-08-23 09:33 · Score: 1

I've got a related question that doesn't rate "Ask Slashdot" status, so I'll ask it here...

I use IMAP to read my mail, mostly because that makes it easy to read from both work and home, and occasionally when I'm on the road. Right now I'm using the bayesian filter in Mozilla. It's great, but since it's client-based that means I have three separate filters I need to train. Sometimes I'll run into weird problems where two of the filters think an email is good but the third thinks it's spam. If I accidentally left the third one running at home when I went to work, it will sometimes decide to re-classify my inbox and make messages "magically" pop into the junk mail folder behind my back. Not good.

What I'd love is a filter that I could run on my server box at home and point at the IMAP mailboxes at my ISP. I'd want it to filter the messages and move the spam to the Junk IMAP folder rather than a local one. That way all of my mail clients would be seeing the same thing and using the same training data. I'm not sure what the UI to this would be -- there would need to be some way to train the filter in both bulk (this folder is all spam) and individual (this one message is spam) modes.

I've done a bit of looking for a tool like this, but I haven't found anything that looks ideal yet. Some of the filters mention that they support IMAP, but it's unclear whether they're optimized for a multiple-client setup like this. For example, the IMAP-aware Outlook plugins (in SpamBayes?) wouldn't do the trick.

Does anyone know if such a thing exists? I'd prefer one that ran on Windows, since that's what my server runs right now. (I know, I know. But it was very easy to set up, and I'd rather spend time improving my programming skills instead of learning to be a Linux admin. I was a 4.3BSD admin way back in the day, but it's been a while.) If there were a great solution that only ran on Linux, that might motivate me to switch, though.

Any advice?

Re:Bayesian filters vs. IMAP by Gaza · 2003-08-23 09:50 · Score: 1

Check out SpamBayes's IMAPfilter, might do the trick for you.
Re:Bayesian filters vs. IMAP by LauraW · 2003-08-24 06:50 · Score: 1

Hmm. I'll have to try installing SuSE on a spare machine and see what happens. How hard is it to keep the thing up to date with security patches? With M$, it's easy to keep it patched, but there's so damn many patches that you never know when you'll get caught in the "window" between an exploit being found and the patch being released.
>It's what we're recommending to anyone that would be stupid enough to buy that M$ crap.
There's no way I'd pay full price for it. I have a couple of copies of XP Pro that a friend who works at M$ got for me. (I'm running a copy of XP Corporate from a friend, though, so I don't have to deal with activation.) And my copy of Win2K server cost $2. On a street corner in Malaysia. :-) I don't feel particularly guilty since I have assorted copies of XP and Win2K pro laying around uninstalled.

POPFile is a MOST EXCELLENT Classifier by OverlordQ · 2003-08-23 09:36 · Score: 1

Messages classified: 6,116
Classification errors: 88

Accuracy: 98.56%

And THAT is with 8, yes EIGHT, different buckets for sorting my mail. Of course 79% of my mail is spam so :)

--
Your hair look like poop, Bob! - Wanker.

Really? Are you sure? by djkitsch · 2003-08-23 09:37 · Score: 1

Surely this article should have been written by Spam Holden?

--
sig:- (wit >= sarcasm)

Re:How about Spam Filter + Authentication? by DavidTC · 2003-08-23 09:43 · Score: 2, Insightful

So, you're taking a message you suspect might be spam, and sending a message to the 'sender'.

When, of course, most spam has forged senders.

Whee, looks like another idiotic pattern I have to block.

--
If corporations are people, aren't stockholders guilty of slavery?

Interesting article but unsound methodology by Henry+Stern · 2003-08-23 09:59 · Score: 2, Insightful

Sam's article was a very interesting read, but his results need to be taken with a grain of salt.

To show that one piece of software outperforms another, you need to prove statistical significance. This can be done in two ways:

The first method is called the pairwise t-test. What you need to do is to run k tests using different training and test data. For each of these tests, you find the accuracy of the classifier (#success/#trials). Then, you form the "t-statistic," t = d/sqrt(sigma_d^2 / k), where d is the difference of the means of the two classifiers, sigma_d^2 is the variance of the difference samples and k is the number of samples. Then, you compare your t-statistic to the Student's distribution with k-1 degrees of freedom. Typically, you want a confidence level of 90% or 95% so you find the number of standard deviations away from the mean for the specific t-test (e.g. the 90% statistic 9-degree of freedom t-test is 1.38). If your t-statistic is greater than the number of standard deviations, then the difference between the two classifiers is statistically significant with X% confidence. Read more about this in Witten and Frank's Data Mining book.
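For anyone who wants to try the t-statistic above, here is a minimal sketch in Python. The function name `paired_t_statistic` is made up for illustration, and the sample variance uses the usual k-1 denominator, which is an assumption about what sigma_d^2 means here.

```python
import math
from statistics import mean, variance


def paired_t_statistic(acc_a, acc_b):
    """Paired t-statistic for two equal-length lists of per-run accuracies.

    Implements t = d / sqrt(sigma_d^2 / k), where d is the mean of the
    per-run differences and sigma_d^2 is their sample variance.
    """
    if len(acc_a) != len(acc_b) or len(acc_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 runs")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    k = len(diffs)
    var_d = variance(diffs)  # sample variance, k-1 denominator (assumption)
    if var_d == 0:
        raise ValueError("differences are constant; t is undefined")
    return mean(diffs) / math.sqrt(var_d / k)
```

With ten runs per filter you would compare the result against the 9-degree-of-freedom figure quoted above (1.38 at 90%). The critical values themselves come from a table or a stats package; this sketch does not compute them.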

The other method is called Analysis of Variance (ANOVA). I'm not familiar enough with this method to explain it here, but it allows you to choose from a set of experiments which ones really are above the average. Dig around in your statistics books or on the web for more information.

Sam should have made use of either of these techniques when doing his analysis. Since he only ran one experiment per configuration of his classifier, you can draw no real conclusions from the data presented (it's a Student's distribution with 0-degree of freedom... essentially flat!).

Since most of us only have a small number of corpora kicking around (maybe even only one!), you can use a method called "cross validation" to give yourself a larger number of data sets than you actually have. When doing a cross validation, you divide your corpus up into k "folds" and then perform k experiments. In each experiment, you set aside one fold of your data for testing and train on the other k-1 folds. Since you're using different test data each time, each experiment can be considered to be different and then you can use a pairwise t-test to prove statistical significance. There are other methods that you can use such as "leave one out" where you have as many folds as you do pieces of training data and "bootstrapping" where you sample your training data with replacement and test with whatever wasn't sampled for training.
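A rough sketch of the k-fold split described above, in Python. The name `k_fold_splits` is hypothetical, and the folds are built by simple striding with no shuffling, which is just one possible choice.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs; each item lands in exactly one test fold."""
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    # Fold i takes items i, i+k, i+2k, ... (striding, no shuffling).
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Each of the k (train, test) pairs gives one accuracy figure per filter, and those k figures are what you would feed into the pairwise t-test. Setting k equal to the number of messages gives the "leave one out" variant.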

However, cross validation may not be appropriate for incremental learning algorithms if your data is on a timeline (such as e-mail). You can break your corpus up into pieces and do your evaluation on that.

Proving statistical significance is very easy and allows you to be confident in the conclusions that you make in your publications. It's the scientific method!

Good luck!

Henry

Re:Interesting article but unsound methodology by sholden · 2003-08-23 17:05 · Score: 1

I did a ten fold cross validation.

I even did some stats stuff and found that there was a significant performance difference between some of the filters - but I don't trust my stats knowledge enough to publish such things without getting them checked. Since I didn't get them checked, I didn't include them.

If the article was meant for a machine learning journal then obviously it's a joke. But it wasn't; it was meant for freshmeat, where the requirements are much lower.

Missing Programs by Goo.cc · 2003-08-23 10:09 · Score: 1

Personally, I wish that he had included DSPAM and CRM114 in his testing. Otherwise, I thought that it was an enjoyable review.

Re:Missing Programs by Goo.cc · 2003-08-23 14:23 · Score: 1

I have not heard of this program before. Thanks for the link.

Re:Mozilla Thunderbird by kpansky · 2003-08-23 10:23 · Score: 1

Could you please forward that email to me... I have a friend from Nigeria that would probably like to become a business partner with this group from the Congo.

--

--Kevin

overtraining? by MacJedi · 2003-08-23 10:40 · Score: 1

Does anyone know if it is possible to overtrain a bayesian spam-filter? It would seem that this could potentially be a problem...

/joeyo

--
2^5

Automatic Spam Training by Stinky+Cheese+Man · 2003-08-23 10:40 · Score: 5, Interesting

I use bogofilter, and it seems to me it would take far too much of my time to manually feed my own spam to it for training purposes. What I do instead is this:

We have several spamtrap addresses on our sendmail server. They were not intentionally set up as spamtraps, but in looking at my mail logs I noticed that there were many email addresses receiving spam attempts that are not and never were valid addresses on our system. These invalid addresses somehow got into spammers' email databases and they receive nothing but spam. So I set up entries in my aliases file to automatically redirect all mail for these accounts to bogofilter's spam database. Here is a sample...

nikola: "|/usr/local/bin/bogofilter -s "
cal: "|/usr/local/bin/bogofilter -s "
bwilson: "|/usr/local/bin/bogofilter -s "
fayre: "|/usr/local/bin/bogofilter -s "

(If you are also using sendmail's access.db to filter mail based on the source IP address, you may want to set up the spamtrap addresses as "spam friends" so that spam directed to them is not filtered out by your IP address filters.)

To keep the spam database fresh and to keep it from growing to an excessive size, I use a daily cron job that automatically deletes spam entries older than 30 days...

# remove records older than 30 days from spamlist.db
/usr/local/bin/bogoutil -a30 -m /home/bogofilter/spamlist.db

This gives me an 8 Megabyte spamlist.db with about 14,000 emails in it which is constantly refreshed to keep up with the latest spam trends.

Maintaining the non-spam database isn't quite as easy. I use bogofilter's -u option on my own incoming email, which tells Bogofilter to update its databases with my incoming mail based on its classification of the message as spam or non-spam. I never get a false positive, but I do occasionally get a false negative which requires me to make a correcting entry in the database.

Re:So weird by arcanumas · 2003-08-23 10:42 · Score: 1

I ignored your second post because it does not compute. You mean your address is: saf4502@E8Hkl3.biz@notsohotmail.com ?
especially when you put two "@" characters. Postfix interprets this as
saf4502@E8Hkl3.biz.notsohotmail.com
(replaces all @ after the first one with dots.)
So there is something wrong with this.

--
Slashdot Sig. version 0.1alpha. Use at your own risk.

Re:So weird by SCHecklerX · 2003-08-23 11:18 · Score: 1

With something so obfuscated, you are more likely to end up in your friends' and family members' address books. I'm sure you have a great time when worms like SoBig, Klez, etc hit.

Maybe you don't have this problem. I never did until becoming the list manager for my cycling team, then the bounces and spam (to a list alias that is NOT advertised anywhere) started flooding in.

Mod parent up by frovingslosh · 2003-08-23 11:18 · Score: 1

Mod parent up

--
I'm an American. I love this country and the freedoms that we used to have.

Re:Thank you from a Spammer by jpetts · 2003-08-23 11:23 · Score: 1

Jeez, what a saddo. He or she just set up a /. account to post this crap again...

However, I need to take exception with this bit:

Just go through lots more work to set up special filers on your computer

I have just set up a Network Appliance F840 filer for NAS on our network, and it was very easy indeed!

--
Call me old fashioned, but I like a dump to be as memorable as it is devastating - Bender

delete key is tied to your ISP's abuse box by wayne · 2003-08-23 12:16 · Score: 1

As a professional sender of spam, I just want to tell you slashdotters to keep on playing with your spam filters. As long as you use spam filters on your e-mail, I can continue to reach my real intended targets, those non-slashdotters who do not know better and will buy my products or click through to my client's websites.

Complete BS.

Geeks are the ones that set up the spam filters for everyone else. End users will no more have to install spam filters than they have to install DNS entries, multi-peered lines to the backbone, etc. (In fact, the problem is that often ISPs don't tell you they are filtering, or give you the chance to turn it off.)

Your filters really help cut down on the complaints to the Internet service providers I do business with, and as long as not too many complaints come in their marketing people assure me we can do business.

Sorry, but my delete key is tied to your ISP's abuse box.

Ok, I actually have a separate "this is spam" key that sends the spam off to spamcop. I also use the following procmail script to report anything that scores too high on spamassassin:

:0 fw
| spamc

:0 cw:
* ^X-Spam-Flag: Yes
* ^X-Spam-Level: \*\*\*\*\*\*\*\*
* !RAZOR
| spamassassin -r

:0 cw:
* ^X-Spam-Flag: Yes
* ^X-Spam-Level: \*\*\*\*\*\*\*\*\*
* !VIRUS[0-9]
| spamassassin -d | head -c 25000 | spamcop_report

The spamcop_report script is very simple; it just encodes the spam and sends it off to spamcop. It can be found on http://spamcop.net/reporter.pl. I modify the number of stars (spamassassin score) depending on how much time I have on my hands right now. If too many reports get sent to spamcop for me to deal with, I increase the number of stars; when a spammer pisses me off, I decrease the score.

Even a small number of vindictive anti-spammers reporting spam will get the spammer's IP address onto spamcop's DNSBL, which feeds back into things like spamassassin.

The amount of spam that has reached my inbox in the last 6 months has been far lower than at any time since the mid 1990s. Even with the reporting to spamcop, I'm spending less time dealing with spam now than two or three years ago. Over the last year or so, I've come to believe that spammers' days are numbered.

Oh, one final note. The original article complained about the fact that spamassassin MIME-defangs the spam and then says that it is hard to get the original email back. This isn't true at all. On older versions, you just run it through "spamassassin -d". While you can still do that with newer versions (as per my scripts above), they now create an attachment so you can just click on it if you want to see it.

--
SPF support for most open source mail servers can be found at libspf2.

SpamBayes by ewn · 2003-08-23 12:40 · Score: 1

I recently switched from bogofilter to SpamBayes. While it still shows the minor issues a young project always has (incompatibility with the dumbdbm in Python 2.2.2 of SuSE 8.2 so you have to use gdbm as the internal db driver etc), i consider it one of the most promising spamfilters around.

It has several frontends. There's the outlook filter, there's a hammiefilter.py program that can be used with procmail as well as on the server side and there's a pop3proxy that does just that and can be used with a HTML interface to retrain the filter in a quick and easy way.
It has three categories: Spam, Ham, and Unsure. This actually helps a lot. I don't redirect spam to /dev/null, so i have to take out the trash manually. This becomes a lot easier if the filter presents you 50+ mails "I'm sure this is spam" and 10 more "Better double-check these".
It has fewer false negatives than bogofilter and works with smaller training sets. In my experience bogofilter is inefficient when trained with less than 400 spam mails. Even with larger sets, bogofilter used to filter only about 80% of the actual spam i receive. SpamBayes on the other hand runs on my machine trained with currently 262 spam and 288 ham, all collected in the past few months. Last week, i received 71 emails, 16 ham and 55 spam. SpamBayes classified all 16 hams as ham, 12 as unsure and 43 as spam. False Positives: 0, false Negatives: 0.
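The Spam/Ham/Unsure split boils down to two cutoffs on a 0-to-1 spam score. A tiny sketch, assuming illustrative cutoff values (0.20 and 0.90 here are my own picks for the example, not taken from any SpamBayes configuration) and a made-up function name:

```python
HAM_CUTOFF = 0.20   # assumed value, for illustration only
SPAM_CUTOFF = 0.90  # assumed value, for illustration only


def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a spam score in [0, 1] to 'ham', 'unsure', or 'spam'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"
```

Widening the gap between the two cutoffs trades more "unsure" mail to double-check for fewer outright misclassifications, which matches the behaviour described above.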

Sure, it's only one data point, and next week will be different, but i think i'll stick with SpamBayes for now.

Apple's Mail.app Filtering by ZackSchil · 2003-08-23 13:10 · Score: 1

It's a combination of Bayesian filtering and whitelists based on your address book. When you first start the application, it goes into pure training mode in which junk mail is flagged but not filtered out of your inbox automatically. You train it for a while, labeling junk yourself and correcting false positives. After the training is sufficient (no more false positives at all for a set period of time, though there usually aren't any as anyone in your address book is whitelisted and everyone you hold a correspondence with is added automatically) the filter then prompts you to go into automatic mode, in which it separates junk into its own box. After 10 days or so in the junk box (you can set the exact time, including never), the messages get deleted. And for those annoying people who forward jokes to you but are whitelisted anyway, enough training can actually selectively overcome the whitelist, it's really very cool. For the occasional piece of SPAM that makes its way into your mailbox, you can select it and press the junk button and it immediately banishes it to the junk box and learns from the mistake.

I understand that it wouldn't have worked out considering the methodology behind the tests but I'd be interested to see how Apple's Mail.app compares.

Re:free SPAM filters are not good enough by Yorkshire · 2003-08-23 13:24 · Score: 1

SpamAssassin is used in quite a few of the commercial offerings, it can also filter before passing on for internal delivery. I'd guess that one or two of the others can too, it's not difficult

--
Custom Rules For SpamAssassin

The true solution to spam. by ModernGeek · 2003-08-23 13:53 · Score: 1

It seems that the worst part about spam is wasted bandwidth and processing power. Wasted electricity from undesired messages being shoved through fiber optic cable seems like a waste, then even more power to process and discover if it is spam or not, then you have false positives. I think a better solution would be to weed out all the spammers, maybe take the internet away from countries that allow spammers or something?

--
Sig: I stole this sig.

SpamAssassin now has a bayesian filter by Coppit · 2003-08-23 14:00 · Score: 1

So the results aren't quite up to date. I've trained it on a couple months of spam and non-spam and it seems to significantly improve its classification.

Since we're posting stats... by Call+Me+Black+Cloud · 2003-08-23 14:58 · Score: 1

Messages classified: 3,545
Classification errors: 110

Accuracy: 96.89%

This is with 4 buckets. My spam bucket received 2,561 ( 72.24%) of those e-mails, with 7 false positives and 9 false negatives.

Oh yeah, POPFile is cross-platform...Windows, Linux, anything that will run Perl (Windows users, don't be afraid. The installer installs an interpreter for you - you'll never know it's there!)

SA holds its own even when crippled by Yorkshire · 2003-08-23 15:30 · Score: 1

the crippled SpamAssassin did pretty damn good though.

I don't think it's totally unfair to run SpamAssassin with the bayes disabled in these tests, a lot of people run it that way in the real world, especially on mail gateways where no provision has been made for training & retraining.

We just need to remember that on every score for SpamAssassin in those tests, it can do a lot better. I've heard good things about a few of the others, but SpamAssassin's nothing short of a miracle here, 2 false negatives and one false positive last month on approx 54K messages.

--
Custom Rules For SpamAssassin

Re:SA holds its own even when crippled by leviramsey · 2003-08-23 19:14 · Score: 1

SpamAssassin can also auto-learn; a message that scores sufficiently high will be fed to the Bayesian system as a spam and something that scores sufficiently low will be fed to the Bayesian system as ham. This in turn allows SA to develop other tests.
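Roughly, the auto-learn idea is a pair of thresholds on the rule score. The sketch below is not SpamAssassin's code; the function name is made up and the threshold values are assumptions picked for the example, so check your own configuration before relying on them.

```python
def autolearn_decision(rule_score, ham_threshold=0.1, spam_threshold=12.0):
    """Decide whether a scored message should be fed back to the Bayes db.

    Thresholds are illustrative assumptions, not verified defaults.
    """
    if rule_score >= spam_threshold:
        return "learn_spam"
    if rule_score <= ham_threshold:
        return "learn_ham"
    return "skip"  # too ambiguous to train on automatically
```

Only messages at the extremes get learned, so a borderline misclassification should not poison the database.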

tell us something we dont already know by RouterSlayer · 2003-08-23 16:33 · Score: 1

How many times are we going to re-review the same old crap over and over again?

btw I agree with most readers here, the comparison is useless.

this is aside from the fact it's pointless for windoze users (the generators of most spam). Where is a review of Popfile ?

Ohh, I love the BS line about (Paraphrased!) "we turned bayesian filters off for spamassassin because 5 other filters were good enough" - wtf ?

With low data-sets like that, the article is useless, plus this is not a valid method of dealing with spam anyhow.

Has anyone else noticed how this topic keeps getting regurgitated over and over ad nauseam?

blatant plug - anyone who wants to discuss anti-spam in real terms contact me (I'm in the process of setting up a sourceforge page too!) :)

Re:Thank you from a Spammer by Skapare · 2003-08-23 17:05 · Score: 1

Actually, unless he/she really is a spammer (which I doubt), he/she is just role playing. But the analysis has a very good point. Spam filters really do have the effect of helping spammers focus their mailings to those people who aren't going to complain, and especially might even buy something as a result of the spam. The trouble is that this help is paid for not by the spammer, but the victims of the spammer. I'll dismiss, for the moment, that the buyers of spamvertized products/services are victims. The rest of us incur costs as a result of spam, ranging from the time it takes to press delete, write complaints to whoever, use more disk space because the spam folder doesn't get cleaned out, process each message with the latest craze in content analysis, and just handle each incoming message, or connection, or SYN packet on our bandwidth. Spam filters don't decrease these costs; they just shift them around. They're doing nothing more than sweeping spam under the rug ... for us. And they will help justify even more people and businesses becoming spammers. The oft heard argument is that if spammers have no market to sell to, they will quit and find something else to do. But that presumes 100% coverage of spam filters. It will never happen, and for the most part, the targets of the spammers won't be covered at all, while it will still be cheaper to not clean the lists.

--
now we need to go OSS in diesel cars

SpamAssassin the best Bayesian filter? by leviramsey · 2003-08-23 19:21 · Score: 1

I've been thinking lately that SpamAssassin might have the best Bayesian implementation, with only a slight change.

AFAIK, most/all Bayesian scanners out there simply tokenize the mail and then use the tokens as the basis of the rating system.

However, SpamAssassin adds an X-Spam-Status header to all mails (by default), which contains a list of the various tests (regex, network, or Bayesian) that the mail triggered. If SA were to move the Bayesian scan to after all other tests have completed, then this list of tests passed could be (or might already be) considered by the tokenizer for the Bayesian algorithm.

The benefit to this is that regex's can discern more patterns in the code (or more correctly, equate patterns) and the network tests are fairly reliable. In a large sense, this is using Bayesian techniques to develop a self-adjusting rating scheme for the tests. Using this, one could assess, for instance, how much having a host in the relay chain in an RBL influences the spamminess of an email (for instance, a large amount of email originating from SPEWS-listed IPs is not spam; this would imply that SPEWS would have a lower confidence rating in picking out a spam).

I guess you've never used... by Ayanami+Rei · 2003-08-23 19:44 · Score: 1

the external filtering stages of Sendmail, postfix, qmail or Exchange's SMTP engine. You know, the place where you can run an external program on the email message. Since ALL of the reviewed spam classifiers were chosen because they run from the command line with only the message as standard input and a classification as the output, I'm sure you can write a quick perl script to use it in that context and achieve the mail accept/reject feature you need.

Maybe you add in an extra header: (X-Int-Spam: Yes) to let downstream clients deal with delivery options.
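Something along these lines would do the header tagging. This is a sketch in Python rather than perl, using the standard `email` module; `tag_message` is a made-up name, and the spam/not-spam verdict is assumed to come from whichever classifier you wrap.

```python
from email import message_from_string


def tag_message(raw_text, is_spam):
    """Return the message text with an X-Int-Spam header reflecting is_spam."""
    msg = message_from_string(raw_text)
    # Drop any copy already present so a sender cannot pre-set the verdict.
    del msg["X-Int-Spam"]
    msg["X-Int-Spam"] = "Yes" if is_spam else "No"
    return msg.as_string()
```

A wrapper for the MTA's filter stage would read the message from standard input, ask the classifier for a verdict, and write the result of `tag_message` to standard output for downstream delivery rules to act on.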

--
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON

Statistical approach is inferior by tuxlove · 2003-08-23 20:04 · Score: 1

The only spam filter that is acceptable is one that utilizes challenge-response. There are NO false positives with filters based on this. I cannot tolerate even one false positive with my email. And the only spam that gets through are those with actual real-life return addresses. These are rare, and since there's a live address on the other end I have the luxury of sending the spammer a bitch-o-gram.

Must add my voice to this... by caitsith01 · 2003-08-23 20:45 · Score: 1

POPFile rocks. It is incredibly easy to use, and very, very accurate. I initially started using it for spam reduction purposes but now I find its best use is actually sorting my mail... waaaaay better than pre-defined mail filters.

I strongly recommend people check it out if they want a very effective solution that is easy to use and configure.

--
Read Pynchon.

Re:Unadvertised (OT) by WIAKywbfatw · 2003-08-23 21:51 · Score: 1

Hey, it's not my fault that you can't appreciate sarcasm.

You're not the same Timbo of "Timbo's goals" fame are you? If so, any predictions for the season? I can't believe we didn't try to get Mendieta if he was available for free.

--

"Accept that some days you are the pigeon, and some days you are the statue." - David Brent, Wernham Hogg

Re:Mozilla Thunderbird by juhaz · 2003-08-24 00:08 · Score: 1

Try cleaning your training file (\training.dat) and retraining it, there's probably something wrong with it, and it can't "unlearn" whatever is screwed. It should be able to do a LOT better than that.

And yes, it uses the same filter.

Re:Mozilla Thunderbird by juhaz · 2003-08-24 00:12 · Score: 1

that was supposed to be \training.dat, but /. ate it.

SpamOracle is missing by Götz · 2003-08-24 05:41 · Score: 1

My current spam filter is SpamOracle. It's a simple procmail filter based on Bayes' formula. It's really efficient, I haven't had a spam mail in my inbox for a week. The only bad thing is that it's written in ocaml which might not be on everybody's machine. Mandrake users can install a contribs package and don't need ocaml at all.

Re:Link Please by jo42 · 2003-08-25 09:00 · Score: 1

...and when is FreshDeadMeat going to use fonts that are so boogeringly fsckin' BIG...? Must be composed and viewed under Linux or somethin'...
