Sorting the Spam from the Ham

← Back to Stories (view on slashdot.org)

Sorting the Spam from the Ham

Posted by CmdrTaco on Thursday June 26, 2003 @05:27AM from the for-the-beginners dept.

MrClever writes "The Sydney Morning Herald (Aust) is running an article about the merits of Bayesian filtering and a good plain-english description of how it works. Might be handy if you need to explain it to non-technophiles. The main thing that may be useful is a Bayesian spam filter written to drop straight into Outlook 2k/XP available here and written in Python by Mark Hammond." Math buffs might enjoy reading these pages or browsing this writeup and its many links.

7 of 249 comments (clear)

Min score:

Reason:

Sort:

Why not here? by Anonymous Coward · 2003-06-26 05:31 · Score: 5, Interesting

What happens if Slashdot runs a Bayesian filter which runs a day after the stories are posted and programs itself with all the -1 comments as "Spam" and all the +5 comments as "Ham". Then let the Bayesian filter adjust all incoming messages by up to 2 points.

I bet it'd work - and imagine if we did it to stories too! Maybe it'd reject all Taco's dupe submissions.
1. Re:Why not here? by bmongar · 2003-06-26 05:59 · Score: 4, Interesting
  
  Very interesting but I think it wouldn't work well, since most of the trolls and flamethrowers are talking about the same topics the same words will show up in both ham and spam posts. But if someone could come up with a word pattern algorithm that could differentiate that would rock.
  
  --
  As x approaches total apathy I couldn't care less.
What I want by Nate+Fox · 2003-06-26 05:32 · Score: 5, Interesting

is a scalable popfile for larger organizations. If I could get popfile (with its super-easy-to-train/use-web-interface) that would run on my linux server, scan my IMAP mail server (well, incoming mail would actually work fine, too. I've heard they have a smtp plugin for it in cvs), and then have a popfile config page for each person, or mayby tie it into the imap/smtp server's login. THAT would rock. I've heard spamasassin does Bayesian, but I couldnt see how it was trainable (and I dont want other people on my server to read each others mail, obviously).
SpamAssassin works for me (even on Exchange) by AssFace · 2003-06-26 05:35 · Score: 5, Interesting

My own personal account is on a shared server at pair.com, and I run SpamAssassin (the perl script, can't put the spamc/d on there since I'm not root).
I have written on here before how I have saved myself a lot of hassle over the last few months by installing SA. I now stop 100+ messages a day (usually more like 140 now).
My stats tell me that since Feb, I've stopped over 15K Spam messages. Hot damn.

Where I currently work now we have Exchange and I wanted SpamAssassin on there, but we weren't getting the money approved to put it on.
So I hacked in SpamAssassin via an Exchange 2000/2003 EventSink.
If you want the code for it, feel free to grab it from http://www.cardboardutopia.com/ExchangeSpamFilter. zip

But do note that if you have many users on your machine, you aren't going to want to use this - an EventSink on Exchange runs in serial, so SpamAssassain's Perl script (the spamc/d doesn't work under Win32) will get executed on every incoming mail, and it will have to wait until it is done before it gets the next one.

We process about 2000-5000 incoming messages a day and it does okay, but we have a very light load.

--

There are some odd things afoot now, in the Villa Straylight.
Spam filtering altogether by ToadMan8 · 2003-06-26 05:40 · Score: 5, Interesting

I sat on the E-Mail policy team (a branch of the Strategic Planning team) for Miami University (Oxford, OH, not Florida) this last year (as a technical advisor, student and support desk employee. We looked at all sorts of spam solutions, as the president decided this should be a main focus (apparently the Viagra adds hit a bit too close to home for comfort ;)).

The problem in the educational market, though, is that, not being a business that can make rules and force people to live by them, educational establishments have annoyed customers (students and faculty) sometimes if any spam is blocked. (research, etc) False positives absolutely can't be tolerated. So a ranked system (spam assasian) that suggests the possibility of spam is not on the best but the only solution we have avalible. Mail will be ranked and users can make rules that trash everything but a guarenteed perfect mail, if they so desired. Or they can leave them all alone. So intelligent filtering is a necessity, not just a bennefit.

On another page, I had an odd place during this discussion of the team. I do not receive spam. (Please, don't start now). My MUOhio.edu address simply doesn't get a single piece of spam e-mail. I have had the account for two years. I have over 3000 messages in various folders. And none are spam at all. I just haven't signed up for anything with it. I put the e-mail addy on webpages too (that I author) and haven't gotten a single thing. But oh my the trash "spam" account gets 60 a day. On AOL. That blocks 80% of incoming mail. Ironically, they had MUOhio.edu blocked weeks back.

--
I haven't posted in so long, my sig is out of date.
The spam I do see by steveha · 2003-06-26 06:02 · Score: 4, Interesting

I'm using SpamProbe, and it blocks almost all spam I get.

Much of the spam that gets past it is so minimalist it cannot be blocked by a Bayesian filter. I get messages like this:

Subject: a nice lady wants to talk to you

see the pictures

no more mail

It's like someone is trying to put so little in the message, that there is nothing to filter. If only they would use the stock "We are sending you this because you opted-in on it. Click on this link to remove your address." If they used that, I'll never see the message; SpamProbe will grab it. But how could I train SpamProbe to detect the minimalist ones, without blocking everything forever?

So far I don't get too many of the minimalist ones, and I just hit delete. If it becomes widespread, I'll have to start using Vipul's Razor or something.

The other kinds of spam that get past SpamProbe are the ones that have rampant misspellings. Since none of the words are in the database, they don't match as spam terms:

Subject: make moneey on EBAYxbbid

Want to make moneyzseqw? Click here...

I really think that I should write a filter that spell-checks an email, and rejects it if over 50% of the words with 5 or more letters are misspelled.

steveha

--
lf(1): it's like ls(1) but sorts filenames by extension, tersely
What I don't like by Boyceterous · 2003-06-26 06:06 · Score: 5, Interesting

about this kind of filtering is that it has to download the email content - not always as good idea, especialy in a Windows environment. Besides, I can identify spam just by looking at message header information. Sender, recipient, and subject line are nearly always enough. Plus I don't need to waste time, bandwidth, or get subjected to offensive graphics, or risk 1-pixel confirmations or getting hacked by the latest security issue. My homespun message header analysis program drops nearly all spam, and results in few legit email rejections. I score the headers based on missing recipient, sender info, keywords in subject, string match in sender email or name, punctuation count in subject line, number of contiuous spaces in subject line, plus a few other things that seem to run common in the spam I get. I can also permit certain email addresses to pass no matter the score. It's not fancy, but it works, and I never have to waste time drawing the whole content down to my local machine. What I do may not work for everyone, but it seems that in most cases it should, unless you get a lot of email from unknown (non-spam) sources - not typical for the average email user.