Sorting the Spam from the Ham

Why not here? by Anonymous Coward · 2003-06-26 05:31 · Score: 5, Interesting

What happens if Slashdot runs a Bayesian filter a day after the stories are posted and trains it with all the -1 comments as "Spam" and all the +5 comments as "Ham"? Then let the Bayesian filter adjust all incoming messages by up to 2 points.
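The idea above can be sketched roughly as follows. This is only an illustration, not anything Slashdot runs: the class name `CommentFilter`, the tokenizer, the add-one smoothing, and the linear mapping from spam probability to a score adjustment are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Crude tokenizer (an assumption): lowercase words, each counted once per comment.
    return set(re.findall(r"[a-z0-9$']+", text.lower()))


class CommentFilter:
    """Hypothetical naive Bayes scorer trained on moderated comments."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, moderated_score):
        # Only the extremes are used: -1 comments are spam, +5 comments are ham.
        if moderated_score == -1:
            self.spam_counts.update(tokenize(text))
            self.n_spam += 1
        elif moderated_score == 5:
            self.ham_counts.update(tokenize(text))
            self.n_ham += 1

    def spam_probability(self, text):
        # Sum per-token log-odds with add-one smoothing and equal priors.
        log_odds = 0.0
        for tok in tokenize(text):
            p_spam = (self.spam_counts[tok] + 1) / (self.n_spam + 2)
            p_ham = (self.ham_counts[tok] + 1) / (self.n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so math.exp cannot overflow on long comments.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1 / (1 + math.exp(-log_odds))

    def adjustment(self, text, max_points=2):
        # Map probability 0..1 linearly onto +max_points..-max_points.
        p = self.spam_probability(text)
        return round(max_points * (1 - 2 * p))
```

Trained on a day's worth of moderated comments, `adjustment()` would return a whole number from -2 to +2 to add to a new comment's starting score. The objection in the reply below (trolls and good posters use the same words) would show up here as per-token odds close to 1, so the adjustment would hover around 0.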

I bet it'd work - and imagine if we did it to stories too! Maybe it'd reject all Taco's dupe submissions.

Re:Why not here? by bmongar · 2003-06-26 05:59 · Score: 4, Interesting

Very interesting, but I think it wouldn't work well: since most of the trolls and flamethrowers are talking about the same topics, the same words will show up in both ham and spam posts. But if someone could come up with a word pattern algorithm that could differentiate, that would rock.

--
As x approaches total apathy I couldn't care less.
Re:Why not here? by Anonymous Coward · 2003-06-26 06:03 · Score: 0, Interesting

It would do a great job of catching:
Penis Bird
First Post!
$OS is dying
goatse.cx
It might make a better lameness filter than the one they have now

What I want by Nate+Fox · 2003-06-26 05:32 · Score: 5, Interesting

is a scalable popfile for larger organizations. If I could get popfile (with its super-easy-to-train/use-web-interface) that would run on my linux server, scan my IMAP mail server (well, incoming mail would actually work fine, too. I've heard they have a smtp plugin for it in cvs), and then have a popfile config page for each person, or maybe tie it into the imap/smtp server's login, THAT would rock. I've heard SpamAssassin does Bayesian, but I couldn't see how it was trainable (and I don't want other people on my server to read each other's mail, obviously).

Re:What I want by slagdogg · 2003-06-26 07:45 · Score: 2, Interesting

I read mail with Mutt, and I've remapped the 'd'elete key to instead throw the message into a 'ham' mbox, and added a 'S'pam mapping to throw the message into a 'spam' mbox.

Would you mind sharing your .muttrc for this?

--
(Score:-1, Wrong)
Re:What I want by leshert · 2003-06-26 11:48 · Score: 3, Interesting

Not at all. The macros are short and sweet:

macro index d ~/Mail/bham^my
macro pager d ~/Mail/bham^my
macro index S ~/Mail/bspam^my
macro pager S ~/Mail/bspam^my

Then the relevant sections of my crontab look like this:

0 2 * * * /usr/bin/sa-learn --spam --mbox /home/tim/Mail/bspam
15 2 * * * /usr/bin/sa-learn --ham --mbox /home/tim/Mail/bham

In another post (as well as on several sites on the web), it's recommended to bind a key to pipe the message directly to sa-learn. I read my mail on the server, which is an embarrassingly old machine, and sa-learn takes on the order of 30 seconds per email--not fun when you're just doing 'that last check of email before heading home'. Copying the mail to a file is just about instantaneous, and sa-learn can do its dirty work while I'm sleeping (or watching The Office, as the case may be).
Re:What I want by thogard · 2003-06-26 12:23 · Score: 2, Interesting

Some of them are dealing with the pain. A guy I met recently paid about AU$5000 to a spam house to send his ad out to a million people in an opt-in mail list. His web server got 40 hits that day compared to the daily average of 13 and none of them bought his book. He was taught a $5000 lesson that spamming doesn't work. What was interesting is that the "demo run" got more hits on his web site than the real run.

SpamAssassin works for me (even on Exchange) by AssFace · 2003-06-26 05:35 · Score: 5, Interesting

My own personal account is on a shared server at pair.com, and I run SpamAssassin (the perl script, can't put the spamc/d on there since I'm not root).
I have written on here before how I have saved myself a lot of hassle over the last few months by installing SA. I now stop 100+ messages a day (usually more like 140 now).
My stats tell me that since Feb, I've stopped over 15K Spam messages. Hot damn.

Where I currently work we have Exchange and I wanted SpamAssassin on there, but we weren't getting the money approved to put it on.
So I hacked in SpamAssassin via an Exchange 2000/2003 EventSink.
If you want the code for it, feel free to grab it from http://www.cardboardutopia.com/ExchangeSpamFilter.zip

But do note that if you have many users on your machine, you aren't going to want to use this - an EventSink on Exchange runs in serial, so SpamAssassin's Perl script (the spamc/d doesn't work under Win32) will get executed on every incoming mail, and it will have to wait until it is done before it gets the next one.

We process about 2000-5000 incoming messages a day and it does okay, but we have a very light load.

--

There are some odd things afoot now, in the Villa Straylight.

Re:SpamAssassin works for me (even on Exchange) by mpieters · 2003-06-26 06:25 · Score: 2, Interesting

We ran SpamAssassin on Python.org and Zope.org for a considerable length of time. We had, however, many false-positives to deal with (we manually checked everything that scored anywhere between 5 and 10 points on the SpamAssassin scale). Usually, we had to review between 10 and 15 messages a day like this.

We recently switched to SpamBayes, and our false-positive rate so far is 6 out of 2200+ spams (almost 12 days of traffic, with certain foreign character sets, malformed email headers and blacklisted email bounced and not included in this number), mostly because we are still in training mode.

On top of that, because SpamBayes is written in Python, we can integrate it directly into Exim with Greg Ward's elspy, whereas we had to run SpamAssassin in a separate process, which occasionally bombed out. Way much faster this way!

Way more hot damn!

--
"The truth shall make ye fret" -- The Truth, Terry Pratchett

An interesting way to deal with spam. by Meat+Blaster · 2003-06-26 05:37 · Score: 2, Interesting

I've tried a number of different ways to filter spam, from whitelisting to Bayesian filtering, and Bayesian seems to offer a good balance between not eating too much of the ham and not letting the spam through. Not too shabby, especially given that it comes with Mozilla now, and I think it's an excellent way of allowing clients to determine what they want to see without infringing free speech.

I don't know if I'd want it in Python, though... it does seem to be a good deal slower already than other spam filtering methods without putting it in a scripting language. Getting it in Outlook can only be good for the net (can Bayesian be applied to things like spam from Internet virii as well?)

Re:An interesting way to deal with spam. by kefoo · 2003-06-26 06:15 · Score: 2, Interesting

can Bayesian be applied to things like spam from Internet virii as well?

What if the filtering programs had a feature that would allow somebody to send out the "signature" of an email virus that the filter could use to block the virus before it had ever actually seen one, by adding its characteristics to the list of things that weigh heavily toward spam so it would be filtered out before ever reaching Exchange/Outlook?

Spam filtering altogether by ToadMan8 · 2003-06-26 05:40 · Score: 5, Interesting

I sat on the E-Mail policy team (a branch of the Strategic Planning team) for Miami University (Oxford, OH, not Florida) this last year (as a technical advisor, student and support desk employee). We looked at all sorts of spam solutions, as the president decided this should be a main focus (apparently the Viagra ads hit a bit too close to home for comfort ;)).

The problem in the educational market, though, is that, not being a business that can make rules and force people to live by them, educational establishments sometimes have annoyed customers (students and faculty) if any spam is blocked (research, etc.). False positives absolutely can't be tolerated. So a ranked system (SpamAssassin) that suggests the possibility of spam is not only the best but the only solution we have available. Mail will be ranked and users can make rules that trash everything but a guaranteed perfect mail, if they so desire. Or they can leave them all alone. So intelligent filtering is a necessity, not just a benefit.

On another note, I was in an odd position during the team's discussions. I do not receive spam. (Please, don't start now). My MUOhio.edu address simply doesn't get a single piece of spam e-mail. I have had the account for two years. I have over 3000 messages in various folders. And none are spam at all. I just haven't signed up for anything with it. I put the e-mail addy on webpages too (that I author) and haven't gotten a single thing. But oh my the trash "spam" account gets 60 a day. On AOL. That blocks 80% of incoming mail. Ironically, they had MUOhio.edu blocked weeks back.

--
I haven't posted in so long, my sig is out of date.

Remote Images in spam... by dioxn · 2003-06-26 05:49 · Score: 3, Interesting

I've noticed that the spam that has been getting through my Mozilla filter are the ones with innocuous sounding subjects and an embedded image.
Could this be the future of spam?
Does anyone know if any spam filters pick up on this pattern or lack of pattern (after all, there are usually no words in the body)?

Re:Mozilla Mail by drinkypoo · 2003-06-26 05:51 · Score: 2, Interesting

They work pretty well for me, but nowhere near flawlessly. Some days I get 25 messages that go into the spam folder and only 3 in my inbox, some days I get about 10 in the spam folder and 5 in the inbox... It's a lot better than nothing. The real reason I run Mozilla for mail is the HTML rendering, which is better than any other mail client I'm aware of; the secondary reason is the bayesian filtering, and the tertiary is Enigmail, though no one I know bothers to use encryption anyway.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"

Re:Written by more than hammond by Wakko+Warner · 2003-06-26 05:58 · Score: 3, Interesting

The number of false positives is almost nil, and the ones that do get hit are spammy looking autogenerated receipts from purchases I've made.

This is quite possibly the only complaint I have about spambayes, too, and it's not even that big a deal to me. After about a month of collecting spam in its own folder (named SHIT, oddly enough), it had learned enough that I was able to dial down my SpamAssassin settings (I use an old version of SA still, too, without the bayesian stuff built in -- too lazy to switch; spambayes works well enough that it's not worth it.) I check my incoming spam folder once or twice a week now, as opposed to once or twice a day when I only ran SpamAssassin at a relatively forgiving (4.5-5.5) setting.

There are a few thousand spams in SB's crap folder now; it's gotten so good that I can't really remember the last time I've had something miscategorized as spam, and of the 50-60 spams I get per day, usually only one or two make it through to my inbox, if that. Half of the time, I don't get any at all.

If you didn't have a reason for installing a Python interpreter before, now you do.

- A.P.

--
"Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"

The spam I do see by steveha · 2003-06-26 06:02 · Score: 4, Interesting

I'm using SpamProbe, and it blocks almost all spam I get.

Much of the spam that gets past it is so minimalist it cannot be blocked by a Bayesian filter. I get messages like this:

Subject: a nice lady wants to talk to you

see the pictures

no more mail

It's like someone is trying to put so little in the message that there is nothing to filter. If only they would use the stock "We are sending you this because you opted-in on it. Click on this link to remove your address." If they used that, I'd never see the message; SpamProbe would grab it. But how could I train SpamProbe to detect the minimalist ones, without blocking everything forever?

So far I don't get too many of the minimalist ones, and I just hit delete. If it becomes widespread, I'll have to start using Vipul's Razor or something.

The other kinds of spam that get past SpamProbe are the ones that have rampant misspellings. Since none of the words are in the database, they don't match as spam terms:

Subject: make moneey on EBAYxbbid

Want to make moneyzseqw? Click here...

I really think that I should write a filter that spell-checks an email, and rejects it if over 50% of the words with 5 or more letters are misspelled.
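A minimal sketch of that spell-check rule, assuming a plain Python set stands in for the dictionary (a real word list would have to come from somewhere; the function names and the tiny dictionary in the usage note are made up for illustration):

```python
import re


def misspelled_fraction(text, dictionary, min_len=5):
    # Consider only alphabetic words with at least min_len letters.
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text) if len(w) >= min_len]
    if not words:
        # Nothing long enough to judge; report no misspellings.
        return 0.0
    unknown = sum(1 for w in words if w not in dictionary)
    return unknown / len(words)


def looks_misspelled_spam(text, dictionary, threshold=0.5):
    # Reject when more than half of the long words are not in the dictionary.
    return misspelled_fraction(text, dictionary) > threshold
```

With a dictionary containing "money" and "click", the sample above ("moneey", "EBAYxbbid", "moneyzseqw", "Click") has three unknown words out of four, so it would be rejected. The minimalist messages would still sail through, since they contain few long words at all, and so might legitimate mail full of names or jargon missing from the dictionary.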

steveha

--
lf(1): it's like ls(1) but sorts filenames by extension, tersely

Battle of the network Bayesian allstars by dubStylee · 2003-06-26 06:06 · Score: 3, Interesting

Suppose

1. I have a friend who uses the same kinds of words as I do and who uses Outlook (ok, an acquaintance, because friends don't let friends ...)

2. An email virus attacks this person, snarfs up his Ham, runs a Bayesian filter on it and comes up with Spam specifically tailored for this person's acquaintances.

There's a science fiction book waiting to happen in here somewhere. If so, I own the SCOpyright on it.

What I don't like by Boyceterous · 2003-06-26 06:06 · Score: 5, Interesting

about this kind of filtering is that it has to download the email content - not always a good idea, especially in a Windows environment. Besides, I can identify spam just by looking at message header information. Sender, recipient, and subject line are nearly always enough. Plus I don't need to waste time or bandwidth, get subjected to offensive graphics, or risk 1-pixel confirmations or getting hacked by the latest security issue. My homespun message header analysis program drops nearly all spam, and results in few legit email rejections. I score the headers based on missing recipient, sender info, keywords in subject, string match in sender email or name, punctuation count in subject line, number of continuous spaces in subject line, plus a few other things that seem to run common in the spam I get. I can also permit certain email addresses to pass no matter the score. It's not fancy, but it works, and I never have to waste time drawing the whole content down to my local machine. What I do may not work for everyone, but it seems that in most cases it should, unless you get a lot of email from unknown (non-spam) sources - not typical for the average email user.
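A header-only scorer along those lines might look something like the sketch below. It is not the poster's program: the keyword list, the whitelist, the weights, and the cutoffs are all invented for illustration, and how the headers get fetched without the body is left out.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Hypothetical lists; a real setup would tune these to the spam actually received.
SUBJECT_KEYWORDS = {"viagra", "mortgage", "free"}
ALLOWED_SENDERS = {"friend@example.org"}


def header_score(raw_headers):
    """Score a message from its header text alone (higher means more spam-like)."""
    msg = message_from_string(raw_headers)
    sender = parseaddr(msg.get("From", ""))[1].lower()

    # Whitelisted addresses pass no matter what else is in the headers.
    if sender in ALLOWED_SENDERS:
        return 0

    subject = msg.get("Subject", "")
    score = 0
    if not msg.get("To"):
        score += 2  # missing recipient
    if not sender:
        score += 2  # missing sender info
    subject_words = re.findall(r"[a-z]+", subject.lower())
    score += 2 * sum(1 for w in subject_words if w in SUBJECT_KEYWORDS)
    if len(re.findall(r"[!?$*]", subject)) >= 3:
        score += 1  # heavy punctuation in the subject
    if re.search(r" {3,}", subject):
        score += 1  # run of three or more spaces in the subject
    return score
```

Anything scoring above some cutoff would be dropped before the body is ever downloaded. Picking that cutoff is the same false-positive tradeoff the other filters face.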

Spam is a poor use. by Lord+Bitman · 2003-06-26 06:07 · Score: 2, Interesting

this is like inventing something as useful as the Knife, and using it only to attack salesmen. Why bother stopping with spam? Why not apply this filter to, say, absolutely everything? Since I just said "absolutely everything", I won't bother giving examples.
Training something to know how likely something is to be true, that sounds too useful to waste any time with on spam at all.

--
-- 'The' Lord and Master Bitman On High, Master Of All

Soundex to work around intentional misspellings? by GGardner · 2003-06-26 06:25 · Score: 3, Interesting

For the spammers who are trying to use misspellings to get around filters, I wonder if soundex could fix that problem quickly. That is, instead of doing the Bayesian calculations on the raw tokens, calculate probabilities based on the soundex values of the tokens. You might need to teach soundex that the number one sounds like I, and other leet-speak-like things, but this might solve the problem quickly and easily.
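For reference, here is a sketch of standard American Soundex with a small leet-speak substitution applied first. The substitution table is an assumption (only "1 sounds like I" comes from the post); the Soundex rules themselves are the usual ones.

```python
# Soundex digit for each consonant; vowels, y, h and w have no code.
CODES = {}
for digit, letters in (("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                       ("4", "l"), ("5", "mn"), ("6", "r")):
    for ch in letters:
        CODES[ch] = digit

# Hypothetical leet-speak table, applied before encoding.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})


def soundex(token):
    """Return the 4-character Soundex code of a token, or '' if it has no letters."""
    word = "".join(ch for ch in token.lower().translate(LEET) if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (result + "000")[:4]
```

Feeding `soundex(token)` to the Bayesian filter instead of the raw token would make "moneey" and "money" land on the same code, and likewise "v1agra" and "viagra". Random letters tacked onto the end of a word, as in "moneyzseqw", still change the code, so this would not catch everything.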

Re:This is bad news!!! by aborchers · 2003-06-26 06:31 · Score: 2, Interesting

First, I'd refer you to my /. Moderation Aphorism #1. Second, I'll give a serious answer to your serious observation:

I use MS Office under Crossover Office because it gives me the features I want (admittedly, one of them is the ability to share identically functional documents with Windows users) so I definitely agree with your perspective. In the case of Mozilla, there has been a great ruckus around here about spam, and I kept telling people it didn't affect me because I used Mozilla w/ Bayesian filters. Additionally, Outlook's rotten record for relaying mail worms has been a problem to me as a sys admin. Independent of the calendar/groupware features, in my immediate area, most people use Outlook as a mail client out of inertia because it came with Office and refuse to switch because of fear of the unknown rather than out of a choice based on features.

--
Trouble making decisions? Just flip for it.

Great, but my problem is a bit more complicated .. by slagdogg · 2003-06-26 06:48 · Score: 3, Interesting

Bayes rocks, been using it with spamassassin and it kills 99% of my spam. The problem is when some asshole spammer uses my email address in the 'From' header of his spam ... then I get scores of 'user not found' or 'virus detected' emails from legitimate mail servers ... it's not spam, but it's just as annoying. How do you guys deal with this problem?

--
(Score:-1, Wrong)

Re:This is bad news!!! by H310iSe · 2003-06-26 06:48 · Score: 2, Interesting

I use outlook because my clients use outlook (though mostly I just use the awesome web interface that fastmail.fm provides). My clients use outlook because it has great, integrated calendaring and it syncs with their various PDAs. Such is life.

I recently reviewed 7 client-side spam filters and ended up picking Spambully. It's not free and it's not perfect but for our environment (Win/outlook 2k2 w/ a weird mirapoint IMAP server and multiple PCs per user (so email needs to stay on the server)) it was the best. Very tight outlook integration (I'm a little worried about instabilities but so far it's smooth) and Bayesian.

But it's really just the best of a bad lot. It's great to see someone working on an open source filter that might work w/ IMAP - we can't have enough of these since right now, well, we have almost none.

--
closed minded is as closed minded does

Re: Too resource hungry by Anonymous Coward · 2003-06-26 07:10 · Score: 1, Interesting

Have you tried running a Bayesian filter on many messages at once? The Mozilla implementation hangs the mail app for a few seconds on a 1.4GHz Athlon when going through a hundred or so messages. Assuming Slashcode would implement it through Perl, it would be even slower. For reference, running Spamassassin with Bayes filtering (Perl scripts, not spamc) isn't exactly speedy. Going through several messages brings CPU usage close to maximum.

Bayesian filtering on comments would be too resource costly. A more plausible application would be to run stories through a Bayes-style filter, creating a profile for each story that checks each new story with previous profiles so that dupes could be reduced. But that would not be as good as having the editor looking at the current front page (as SCO stories would look similar).

Admit it, Slashdot. You love spam. by jeduthun · 2003-06-26 07:21 · Score: 2, Interesting

You guys are a bunch of hypocrites. You don't really want spam to stop. You love spam.

Every spam thread is the same: I use X, and it blocks 98% of my spam, with no false positives! I use Y, and it blocks 99.9% -- take that! Here, I use Z + Y with these custom Perl scripts I wrote that interface with procmail and stop 101% of spam! It doesn't matter, because I never get ANY spam! Spam is only because people buy things in spam! What morons! Bow before me, for I am 1337!

Spam gives you something to fight. Spam gives you an excuse to solve an interesting technical problem (i.e. separating spam from ham). Spam gives you a reason to boast. Spam gives you people to dislike.

Admit it.

You love spam.

Re:Why stop at classifying spam? Why not all e-mai by Anonymous Coward · 2003-06-26 09:14 · Score: 1, Interesting

Makes sense. Consider classifying with a binary tree (e.g., first divide into spam and non-spam, then divide the non-spam into personal and business, and then divide personal into two groups, and so on). If each step can be done with 99% accuracy (something my experience with Bayesian spam filtering would indicate), then you could go 5 levels deep (32 buckets, if fully populated) and have roughly 95% accuracy. Not a "very slight" decrease, but still quite usable, and the cost of misclassification wouldn't be very high anyway.
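The arithmetic behind that estimate, as a quick check. It assumes the errors at each level are independent, which the post does not claim; the function names are just for illustration.

```python
def path_accuracy(per_step_accuracy, depth):
    # Chance that every one of `depth` binary decisions is right,
    # assuming the decisions err independently.
    return per_step_accuracy ** depth


def leaf_buckets(depth):
    # Number of leaves in a fully populated binary tree of this depth.
    return 2 ** depth


print(leaf_buckets(5), round(path_accuracy(0.99, 5), 3))
```

Five levels at 99% each works out to 0.99^5, a little over 0.95, across 32 buckets. At 95% per step the same tree would be right only about 77% of the time, so the per-step accuracy matters a lot.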

Anyone know of a Lotus Notes filter? by Moderation+abuser · 2003-06-26 09:43 · Score: 2, Interesting

I've just been migrated to Notes from Outlook. Not a happy bunny till I discovered how powerful it is with stuff like agents.

The only thing I'm missing now is a spam classification tool like popfile for notes.

--
Government of the people, by corporate executives, for corporate profits.

Slashdot Mirror
