Sorting the Spam from the Ham

But without spam... by pnix · 2003-06-26 05:29 · Score: 3, Funny

But without spam, I wouldn't get any email!

Re:But without spam... by IthnkImParanoid · 2003-06-26 05:38 · Score: 3, Insightful

You're getting modded as funny, but I just figured out exactly how true this is. My main email account is used primarily for work, so it was very easy to set up white lists for 30 or so email addresses, with a few family and friends thrown in, and route them to a special folder. I still check the default folder, of course, but I turned off notification for everything except the white folder.

I went from checking my email every 5-10 minutes to a handful of times a day.

--
It's nothing but crumpled porno and Ayn Rand.

Why not here? by Anonymous Coward · 2003-06-26 05:31 · Score: 5, Interesting

What happens if Slashdot runs a Bayesian filter a day after the stories are posted and trains it with all the -1 comments as "Spam" and all the +5 comments as "Ham"? Then let the Bayesian filter adjust all incoming messages by up to 2 points.
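A rough sketch of how such a moderation filter might look, assuming a plain naive Bayes model with add-one smoothing. The function names, the tokenizer, and the linear mapping from probability to points are all hypothetical choices for illustration, not anything Slashdot runs.

```python
import math
import re
from collections import Counter

def tokens(text):
    # Crude tokenizer: lowercase runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def train(spam_posts, ham_posts):
    # "Spam" = comments moderated to -1, "Ham" = comments moderated to +5.
    spam = Counter(t for p in spam_posts for t in tokens(p))
    ham = Counter(t for p in ham_posts for t in tokens(p))
    return spam, ham

def spam_probability(text, spam, ham):
    vocab = len(set(spam) | set(ham)) or 1
    n_spam, n_ham = sum(spam.values()), sum(ham.values())
    log_odds = 0.0
    for t in tokens(text):
        p_spam = (spam[t] + 1) / (n_spam + vocab)   # add-one smoothing
        p_ham = (ham[t] + 1) / (n_ham + vocab)
        log_odds += math.log(p_spam / p_ham)
    log_odds = max(-50.0, min(50.0, log_odds))      # avoid overflow in exp()
    return 1 / (1 + math.exp(-log_odds))

def adjustment(text, spam, ham, max_points=2):
    # Map probability 0..1 onto +max_points..-max_points (assumed linear).
    p = spam_probability(text, spam, ham)
    return round((0.5 - p) * 2 * max_points)
```

With equal priors, a comment made only of unseen words gets an adjustment of 0, so the filter stays neutral when it has nothing to go on.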

I bet it'd work - and imagine if we did it to stories too! Maybe it'd reject all Taco's dupe submissions.

Re:Why not here? by bmongar · 2003-06-26 05:59 · Score: 4, Interesting

Very interesting, but I think it wouldn't work well: since most of the trolls and flamethrowers are talking about the same topics, the same words will show up in both ham and spam posts. But if someone could come up with a word-pattern algorithm that could differentiate them, that would rock.

--
As x approaches total apathy I couldn't care less.

What I want by Nate+Fox · 2003-06-26 05:32 · Score: 5, Interesting

is a scalable popfile for larger organizations. If I could get popfile (with its super-easy-to-train/use-web-interface) that would run on my Linux server, scan my IMAP mail server (well, incoming mail would actually work fine, too; I've heard they have an SMTP plugin for it in CVS), and then have a popfile config page for each person, or maybe tie it into the IMAP/SMTP server's login, THAT would rock. I've heard SpamAssassin does Bayesian, but I couldn't see how it was trainable (and I don't want other people on my server to read each other's mail, obviously).

Re:What I want by franimal · 2003-06-26 05:41 · Score: 3, Informative

Personally, I really like Spambayes and Procmail for use with my IMAP server. It's easy to set up for each user and they can train their own SPAM database. You can even run the training script as a cron job and the users only need to shuffle unknowns to the spam folder. Works well, because users never even have to see the spam, if they don't want to.

Re:What I want by leshert · 2003-06-26 06:53 · Score: 5, Informative

Spamassassin learns in two ways:
1. Manual training: there is a tool called 'sa-learn'. You can pipe a message to it, or point it to a mailbox, and specify whether the mail is spam or ham.
2. Automatic training: if the score of the mail is significantly high (definitely spam) or significantly low (definitely ham), it will automatically train on the message. This may seem useless, but it's useful in that SA will then start to figure out patterns in spam or ham that don't trigger its rules.

I read mail with Mutt, and I've remapped the 'd'elete key to instead throw the message into a 'ham' mbox, and added a 'S'pam mapping to throw the message into a 'spam' mbox. Then I added a nightly cron job to run sa-learn over the two mboxes and truncate them. This has worked very, very well for me... I haven't had a single false positive since Bayes kicked in about two months ago, and today I got my first false negative in about two weeks. I typically trap 10-15 spams a day.

One thing to notice: even if you enable it, Bayesian filtering won't kick in until you've recognized at least 200 spam and 200 ham messages. Took me a long time to figure that out (I had plenty of spam, but I wasn't training it on ham at all, which is why I started remapping the mutt commands).
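The two conditions described above can be sketched roughly as follows. This assumes SpamAssassin's convention that a higher rule score means more spam-like; the threshold values are placeholder assumptions for illustration, and only the 200-message minimum comes from the post itself. The function names are hypothetical and are not part of SpamAssassin.

```python
def autolearn_label(score, spam_threshold=12.0, ham_threshold=0.1):
    """Return 'spam', 'ham', or None for a rule-based score.

    Threshold values are assumed for illustration; higher score = spammier.
    """
    if score >= spam_threshold:
        return "spam"      # confidently spam: train on it as spam
    if score <= ham_threshold:
        return "ham"       # confidently ham: train on it as ham
    return None            # in between: too uncertain to learn from

def bayes_active(n_spam_learned, n_ham_learned, minimum=200):
    # Bayes only starts contributing once BOTH corpora are large enough.
    return n_spam_learned >= minimum and n_ham_learned >= minimum
```

Note that `bayes_active(5000, 0)` is false, which matches the situation described: plenty of spam learned, but no ham.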

As far as installing it on a server, your users don't have to be able to read each other's mail. I have it installed so that my wife and I each have our own bayes dbs, so neither of us has to read the other's mail. Plus, different users will regard different mail as spam: anything about the Pittsburgh Steelers going to my mailbox is probably spam, but not hers; similarly, anything regarding Linux going to her mailbox is probably spam, but not mine.

Re:What I want by leshert · 2003-06-26 11:48 · Score: 3, Interesting

Not at all. The macros are short and sweet:

macro index d ~/Mail/bham^my
macro pager d ~/Mail/bham^my
macro index S ~/Mail/bspam^my
macro pager S ~/Mail/bspam^my

Then the relevant sections of my crontab look like this:

0 2 * * * /usr/bin/sa-learn --spam --mbox /home/tim/Mail/bspam
15 2 * * * /usr/bin/sa-learn --ham --mbox /home/tim/Mail/bham

In another post (as well as on several sites on the web), it's recommended to bind a key to pipe the message directly to sa-learn. I read my mail on the server, which is an embarrassingly old machine, and sa-learn takes on the order of 30 seconds per email--not fun when you're just doing 'that last check of email before heading home'. Copying the mail to a file is just about instantaneous, and then sa-learn can do its dirty work while I'm sleeping (or watching The Office, as the case may be).
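For anyone who would rather do the learn-then-truncate step in one script, here is a minimal sketch. It assumes the same sa-learn flags shown in the crontab lines (--spam, --ham, --mbox) and the bspam/bham file names from the macros. The function names and the injectable `run` parameter are hypothetical, added so the logic can be tried without sa-learn installed.

```python
import subprocess
from pathlib import Path

SA_LEARN = "/usr/bin/sa-learn"   # path taken from the crontab above

def learn_commands(mail_dir):
    # One sa-learn invocation per mbox, mirroring the two cron entries.
    mail_dir = Path(mail_dir)
    return [
        [SA_LEARN, "--spam", "--mbox", str(mail_dir / "bspam")],
        [SA_LEARN, "--ham", "--mbox", str(mail_dir / "bham")],
    ]

def learn_and_truncate(mail_dir, run=subprocess.run):
    for cmd in learn_commands(mail_dir):
        mbox = Path(cmd[-1])
        if not mbox.exists() or mbox.stat().st_size == 0:
            continue                 # nothing saved since the last run
        run(cmd, check=True)         # raises if sa-learn exits non-zero
        mbox.write_bytes(b"")        # truncate only after a successful learn
```

One caveat with this sketch: a message saved to the mbox between the learn and the truncate would be lost, so running it at night, as with the cron jobs, is the safer choice.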

This is bad news!!! by aborchers · 2003-06-26 05:34 · Score: 4, Funny

The main thing that may be useful is a Bayesian spam filter written to drop straight into Outlook 2k/XP

I've now lost one of my primary arguments for switching my colleagues to Mozilla!

--
Trouble making decisions? Just flip for it.

Re:This is bad news!!! by Mikey-San · 2003-06-26 06:24 · Score: 4, Insightful

I know your post was meant to be funny, but it brings up a point:

So what? If more computer products benefit, don't we all? Anything that makes Outlook better is good in my book. Perhaps this will eliminate some virus-and-worm-carrying spam--and that's good for /all/ of us on teh intarweb. ;-)

--
Mikey-San
Karma: +Eleventy billion (mostly affected by watching Celebrity Jeopardy)

SpamAssassin works for me (even on Exchange) by AssFace · 2003-06-26 05:35 · Score: 5, Interesting

My own personal account is on a shared server at pair.com, and I run SpamAssassin (the perl script, can't put the spamc/d on there since I'm not root).
I have written on here before how I have saved myself a lot of hassle over the last few months by installing SA. I now stop 100+ messages a day (usually more like 140 now).
My stats tell me that since Feb, I've stopped over 15K Spam messages. Hot damn.

Where I currently work we have Exchange, and I wanted SpamAssassin on there, but we weren't getting the money approved to put it on.
So I hacked in SpamAssassin via an Exchange 2000/2003 EventSink.
If you want the code for it, feel free to grab it from http://www.cardboardutopia.com/ExchangeSpamFilter.zip

But do note that if you have many users on your machine, you aren't going to want to use this - an EventSink on Exchange runs in serial, so SpamAssassin's Perl script (the spamc/d doesn't work under Win32) will get executed on every incoming mail, and each message has to wait until the previous one is done.

We process about 2000-5000 incoming messages a day and it does okay, but we have a very light load.

--

There are some odd things afoot now, in the Villa Straylight.

Re:SpamAssassin works for me (even on Exchange) by vanyel · 2003-06-26 07:12 · Score: 4, Informative

I run a small ISP with spamassassin installed, and I had to increase the default quota when I upgraded to the version with Bayesian filtering and its multi-megabyte databases per user. Combined with spamd bugs forcing me to switch back to running spamassassin individually and the fact that spamd still doesn't serialize processing, so the system still gets hammered by a flood of spam, I'm looking forward to greylisting to help take the load off spamassassin.

Spambayes by Chromodromic · 2003-06-26 05:35 · Score: 5, Informative

I use Spambayes with Outlook 2000, and it takes a little tweaking, but it works as advertised. Ahhh, the magic of mathematics. Just now, brought up Outlook, checked my mail and three little messages offering a free Sony headset, 70% off cell accessories, and a chance to take an IQ test just got tossed into my spam folder. Thanks anyway, but I think that means I just passed my IQ test.

Every so often I go in and take out some old, old spam, just to make sure my current preferences are being represented and that's all the maintenance that's required.

This is, however, the second time I've trained the filter. The first time, it incorrectly identified my FreeBSD status mails as spam, and from then on was throwing those into the Spam folder. My own fault, though, since I hadn't included any of these messages in my representative ham.

If you run Outlook, download this filter and use it. You'll be doing yourself, and a world that doesn't need fat-injected, herbally enhanced penises, a favor.

--
Chr0m0Dr0m!C

Re:Spambayes by AssFace · 2003-06-26 06:20 · Score: 4, Insightful

I have seen all of the local client software and I personally have never bothered with it.

I always felt that the whole point of spam being annoying was that it wasted bandwidth. It gets sent to my server, and then I have to download it all from my server, and then it gets sorted away from my eyes in my client.

It is fairly trivial if you get enough regular mail for it to matter, and you are on a fast connection.

But I can't tell you how annoying it is to be on a slow dial-up connection and download 50 messages and then see that they all got filtered into the spam folder and that there were no "real" messages.
While there is a nice feeling of seeing them all get caught, it is annoying to have to wait for a download (and pay for it) and then get no return on the investment.

That is why I always try to have the spam blocking on the server side. Although I now spend most of my time using ssh into my server and that way it isn't downloading all of the mail until I want to see something.

Perhaps if I combine the fact that I have SA on the server, and then if I also had a client side option, I would get everything properly blocked that way (the only reason stuff gets through my server setup right now is if the server is under a high load, then my SA script will time out and the mail gets through).

--

There are some odd things afoot now, in the Villa Straylight.

Written by more than hammond by adamhupp · 2003-06-26 05:37 · Score: 4, Informative

The Outlook plugin may have been written by Mark Hammond but spambayes is very much a group effort. The project can be found at spambayes.sf.net.

I've been using spambayes for months now and it really is quite amazing. Now, when I get the occasional spam in my mailbox it's actually interesting because I want to figure out why it made it in. The number of false positives is almost nil, and the ones that do get hit are spammy looking autogenerated receipts from purchases I've made. It's made reading email a much more enjoyable activity.

-Adam

Re:Written by more than hammond by Wakko+Warner · 2003-06-26 05:58 · Score: 3, Interesting

The number of false positives is almost nil, and the ones that do get hit are spammy looking autogenerated receipts from purchases I've made.

This is quite possibly the only complaint I have about spambayes, too, and it's not even that big a deal to me. After about a month of collecting spam in its own folder (named SHIT, oddly enough), it had learned enough that I was able to dial down my SpamAssassin settings (I use an old version of SA still, too, without the bayesian stuff built in -- too lazy to switch; spambayes works well enough that it's not worth it.) I check my incoming spam folder once or twice a week now, as opposed to once or twice a day when I only ran SpamAssassin at a relatively forgiving (4.5-5.5) setting.

There are a few thousand spams in SB's crap folder now; it's gotten so good that I can't really remember the last time I've had something miscategorized as spam, and of the 50-60 spams I get per day, usually only one or two make it through to my inbox, if that. Half of the time, I don't get any at all.

If you didn't have a reason for installing a Python interpreter before, now you do.

- A.P.

--
"Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"

News for Pervs, Stuff that Matters. by notque · 2003-06-26 05:40 · Score: 5, Funny

Would you use the phone if you had to listen to a 10-second brothel advertisement every time you made a call?

Yes.

Definitely Yes.

Is that a feature I can have added?

--
http://use.perl.org

Spam filtering altogether by ToadMan8 · 2003-06-26 05:40 · Score: 5, Interesting

I sat on the E-Mail policy team (a branch of the Strategic Planning team) for Miami University (Oxford, OH, not Florida) this last year (as a technical advisor, student and support desk employee). We looked at all sorts of spam solutions, as the president decided this should be a main focus (apparently the Viagra ads hit a bit too close to home for comfort ;)).

The problem in the educational market, though, is that educational establishments, not being businesses that can make rules and force people to live by them, sometimes have annoyed customers (students and faculty) if any spam is blocked (research, etc.). False positives absolutely can't be tolerated. So a ranked system (SpamAssassin) that suggests the possibility of spam is not only the best but the only solution we have available. Mail will be ranked and users can make rules that trash everything but a guaranteed perfect mail, if they so desire. Or they can leave them all alone. So intelligent filtering is a necessity, not just a benefit.

On another note, I was in an odd position during the team's discussions. I do not receive spam. (Please, don't start now.) My MUOhio.edu address simply doesn't get a single piece of spam e-mail. I have had the account for two years. I have over 3000 messages in various folders. And none are spam at all. I just haven't signed up for anything with it. I put the e-mail addy on webpages too (that I author) and haven't gotten a single thing. But oh my, the trash "spam" account gets 60 a day. On AOL. That blocks 80% of incoming mail. Ironically, they had MUOhio.edu blocked weeks back.

--
I haven't posted in so long, my sig is out of date.

Mozilla Mail by respite · 2003-06-26 05:42 · Score: 3, Informative

In case anyone hasn't tried it yet, the Bayesian filters in the mail client of the Mozilla suite are really impressive. They have worked close to flawlessly for me.

So-so article by scottme · 2003-06-26 05:48 · Score: 3, Insightful

For an article in an "IT tech" section of a paper, this is really very weak.

It really doesn't do much more than precis Paul Graham's arguments, then ends in a blatant plug for just one Outlook addon.

I suppose if there are still people in the column's audience who haven't heard this all before, and it gets the message out that spam can be effectively filtered, it's a minor goodness.

Remote Images in spam... by dioxn · 2003-06-26 05:49 · Score: 3, Interesting

I've noticed that the spam that has been getting through my Mozilla filter are the ones with innocuous sounding subjects and an embedded image.
Could this be the future of spam?
Does anyone know if any spam filters pick up on this pattern or lack of pattern (after all there are no words in the body usually)?

Re:Remote Images in spam... by zerocool^ · 2003-06-26 06:20 · Score: 3, Informative

Does anyone know if any spam filters pick up on this pattern or lack of pattern (after all there are no words in the body usually)?

Um, only read emails in plain text? Use mh.
inc; scan; show last
By the way, those images are baaaad. Usually they're something like img src="blahblah.jpg?userid=32898392" and then, when you open it, there's a log of the image with the userid 32898392 being fetched. Therefore, they know that your email address is valid. So, it's a good idea to filter out images anyway.

But, come on. Email is a medium for transmitting text. It's not supposed to have flowery backgrounds, blinking text, and embedded images. Maybe I'm a purist? But, it's another thing that used to be beautifully simple that the explosion of advertising on the internet has rendered unusable.
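To make the tracking-image point concrete, here is a small sketch that lists remote images whose URL carries a query string, like the userid example above. It uses only Python's standard html.parser and urllib.parse; the class and function names are made up for illustration, and treating "remote URL with a query string" as the tracking signal is an assumed heuristic that will miss trackers that encode the ID in the path.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class ImageFinder(HTMLParser):
    """Collects the src attribute of every <img> tag."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

def remote_tracking_images(html):
    finder = ImageFinder()
    finder.feed(html)
    suspicious = []
    for src in finder.sources:
        url = urlparse(src)
        # Remote fetch plus a query string: likely carries a per-recipient ID.
        if url.scheme in ("http", "https") and url.query:
            suspicious.append(src)
    return suspicious
```

A mail client or filter could use a list like this to strip the tags or to refuse to fetch them.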

--
sig?

Re:Better Bayesian Filtering by GoatEnigma · 2003-06-26 05:53 · Score: 4, Funny

Email is not just text; it has structure.

You've obviously never received email from an AOL user!

Why stop at classifying spam? Why not all e-mail? by Anonymous Coward · 2003-06-26 05:54 · Score: 5, Insightful

As I wrote only late last night, using Bayesian classification with only two categories (spam and "non-spam") is somewhat short-sighted, since if properly trained, a Bayes classifier can do a much better job than ordinary mail filtering (procmail, Mozilla or Mail.app filters, you name it).

In fact, if I had to bet on the next "killer apps", mail sorting and RSS filtering based on Bayesian classification would be right at the top of my list, based solely on the actual time-saving benefits for users. And I can't see any reason for Bayesian filtering not being included in Mozilla Mail and Apple's own (revamped) Mail.app.

I have to use Outlook at work, and after setting up Outclass (which requires POPfile) with several "buckets" to classify my corporate e-mail by project and field, I'm definitely not going back. Outlook, even with extensive use of Rules Wizard and categories, simply cannot cope with the diverse kinds of project-related e-mail I swap with colleagues, and Outclass is the only thing I could find that could deal with Exchange, PST folders and multiple Bayesian "bucket" categories.
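As a rough illustration of the multi-bucket idea, here is a minimal naive Bayes classifier that picks the most likely of any number of buckets rather than just spam/non-spam. This is a generic textbook sketch with add-one smoothing, not the algorithm POPFile or Outclass actually uses; the class name and bucket names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

class BucketClassifier:
    def __init__(self):
        self.word_counts = defaultdict(Counter)   # bucket -> word -> count
        self.message_counts = Counter()           # bucket -> messages trained

    @staticmethod
    def tokens(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def train(self, bucket, text):
        self.word_counts[bucket].update(self.tokens(text))
        self.message_counts[bucket] += 1

    def classify(self, text):
        vocab = {w for counts in self.word_counts.values() for w in counts}
        total_msgs = sum(self.message_counts.values())
        best, best_score = None, -math.inf
        for bucket, counts in self.word_counts.items():
            # log prior + sum of log likelihoods, with add-one smoothing
            score = math.log(self.message_counts[bucket] / total_msgs)
            n = sum(counts.values())
            for w in self.tokens(text):
                score += math.log((counts[w] + 1) / (n + len(vocab)))
            if score > best_score:
                best, best_score = bucket, score
        return best
```

Training is one `train(bucket, text)` call per filed message, so correcting a misfiled message is simply another training call with the right bucket.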

Come on, do the right thing and tell Apple and The Mozilla Project that you want configurable Bayesian filtering on their mail clients.

I hate spam too, but... by Daimaou · 2003-06-26 05:58 · Score: 3, Funny

I hate spam just as much as the next person, but I must admit, without it I wouldn't be the horse-sized love stud that I am. Thanks spam.

The spam I do see by steveha · 2003-06-26 06:02 · Score: 4, Interesting

I'm using SpamProbe, and it blocks almost all spam I get.

Much of the spam that gets past it is so minimalist it cannot be blocked by a Bayesian filter. I get messages like this:

Subject: a nice lady wants to talk to you

see the pictures

no more mail

It's like someone is trying to put so little in the message that there is nothing to filter. If only they would use the stock "We are sending you this because you opted-in on it. Click on this link to remove your address." If they used that, I'd never see the message; SpamProbe would grab it. But how could I train SpamProbe to detect the minimalist ones, without blocking everything forever?

So far I don't get too many of the minimalist ones, and I just hit delete. If it becomes widespread, I'll have to start using Vipul's Razor or something.

The other kinds of spam that get past SpamProbe are the ones that have rampant misspellings. Since none of the words are in the database, they don't match as spam terms:

Subject: make moneey on EBAYxbbid

Want to make moneyzseqw? Click here...

I really think that I should write a filter that spell-checks an email, and rejects it if over 50% of the words with 5 or more letters are misspelled.
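A quick sketch of that spell-check idea, assuming the dictionary is supplied as a plain set of lowercase words (on many Unix systems one could load /usr/share/dict/words, but that path is an assumption). The function names are hypothetical; the 5-letter minimum and 50% threshold come from the paragraph above.

```python
import re

def misspelled_fraction(text, dictionary, min_len=5):
    # Only words with min_len or more letters are considered.
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= min_len]
    if not words:
        return 0.0
    missing = sum(1 for w in words if w not in dictionary)
    return missing / len(words)

def looks_like_misspelled_spam(text, dictionary, threshold=0.5):
    # Reject when MORE than half of the long words are unknown.
    return misspelled_fraction(text, dictionary) > threshold
```

Mail full of jargon, names, or another language would also trip this check, so the dictionary would need to include the vocabulary of the mail you actually want.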

steveha

--
lf(1): it's like ls(1) but sorts filenames by extension, tersely

Battle of the network Bayesian allstars by dubStylee · 2003-06-26 06:06 · Score: 3, Interesting

Suppose

1. I have a friend who uses the same kinds of words as I do and who uses Outlook (ok, an acquaintance, because friends don't let friends ...)

2. An email virus attacks this person, snarfs up his Ham, runs a Bayesian filter on it and comes up with Spam specifically tailored for this person's acquaintances.

There's a science fiction book waiting to happen in here somewhere. If so, I own the SCOpyright on it.

What I don't like by Boyceterous · 2003-06-26 06:06 · Score: 5, Interesting

about this kind of filtering is that it has to download the email content - not always a good idea, especially in a Windows environment. Besides, I can identify spam just by looking at message header information. Sender, recipient, and subject line are nearly always enough. Plus I don't need to waste time or bandwidth, get subjected to offensive graphics, or risk 1-pixel confirmations or getting hacked by the latest security issue. My homespun message header analysis program drops nearly all spam, and results in few legit email rejections. I score the headers based on missing recipient, sender info, keywords in subject, string match in sender email or name, punctuation count in subject line, number of continuous spaces in subject line, plus a few other things that seem to run common in the spam I get. I can also permit certain email addresses to pass no matter the score. It's not fancy, but it works, and I never have to waste time drawing the whole content down to my local machine. What I do may not work for everyone, but it seems that in most cases it should, unless you get a lot of email from unknown (non-spam) sources - not typical for the average email user.
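A header-only scorer along those lines might look something like the sketch below. The keyword list, whitelist, weights, and thresholds are all invented for illustration and are not the poster's actual rules. It assumes the headers have already been fetched without the body (for example with the POP3 TOP command) and are passed in as a string.

```python
import re
from email import message_from_string

SUBJECT_KEYWORDS = {"free", "viagra", "mortgage"}   # hypothetical keyword list
WHITELIST = {"boss@example.com"}                    # hypothetical always-allow list

def header_score(raw_headers):
    msg = message_from_string(raw_headers)
    sender = (msg.get("From") or "").lower()
    subject = msg.get("Subject") or ""
    if any(addr in sender for addr in WHITELIST):
        return 0                                    # whitelisted: pass regardless
    score = 0
    if not msg.get("To"):
        score += 2                                  # missing recipient
    if not sender:
        score += 2                                  # missing sender info
    score += 2 * sum(1 for k in SUBJECT_KEYWORDS if k in subject.lower())
    if len(re.findall(r"[!?$*]", subject)) >= 3:
        score += 1                                  # heavy punctuation
    if re.search(r" {3,}", subject):
        score += 1                                  # long run of spaces
    return score
```

A message would then be dropped when its score passes some cutoff, which is left to the user to tune against their own mail.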

SpamBayes not Mark Hammond's work only by mpieters · 2003-06-26 06:13 · Score: 5, Informative

SpamBayes was originally conceived by Tim Peters and co at Python Labs, who improved on the original algorithm considerably. From there on out, many people helped tune and perfect the implementation, making it the most effective Bayesian-based spam filtering tool currently available (IMNSHO).

Mark Hammond then wrote the Outlook plugin, which, admittedly, is considerably more code than SpamBayes, but not SpamBayes itself.

For the complete background on why SpamBayes is so good at what it does, and its history, see:

SpamBayes Background

Mark's is not the only application frontend for SpamBayes; here is a list of others:

SpamBayes Applications

No apologies offered for my pedantry.

--
"The truth shall make ye fret" -- The Truth, Terry Pratchett

Re:This is totally useless. by serbanp · 2003-06-26 06:17 · Score: 3, Informative

No it's not.

At work I have Outlook always running with the excellent bayesian FREE filter Spammunition www.upserve.com. I also do check the mailbox from home over a dial-up connection.

If I didn't use Spammunition, then I would spend a lot of time downloading spam messages; as it is right now, I get just the ham (several messages instead of many).

Serban

Soundex to work around intentional misspellings? by GGardner · 2003-06-26 06:25 · Score: 3, Interesting

For the spammers who are trying to use misspellings to get around filters, I wonder if soundex could fix that problem quickly. That is, instead of doing the Bayesian calculations on the raw tokens, calculate probabilities based on the soundex values of the tokens. You might need to teach soundex that the number one sounds like I, and other leet-speak-like things, but this might solve the problem quickly and easily.
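Here is a small sketch of that idea: a standard American Soundex with a leet-speak substitution step in front. The substitution table is an assumed, incomplete mapping chosen for illustration. The resulting code would be used as the token fed to the Bayesian filter in place of the raw word.

```python
# Assumed leet-speak substitutions, applied before Soundex coding.
LEET = str.maketrans({"1": "i", "0": "o", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})

CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit

def soundex(token):
    # Normalise leet characters, then keep ASCII letters only.
    word = "".join(c for c in token.lower().translate(LEET) if "a" <= c <= "z")
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":          # h and w do not separate equal codes
            prev = code
    return (result + "000")[:4]     # pad or truncate to letter + 3 digits
```

With this, "v1agra" and "viagra" both map to V260, and "moneey" and "money" both map to M500. Random junk appended to a word (as in "moneyzseqw") still changes the code, so this would only catch the sound-alike style of misspelling.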

Great, but my problem is a bit more complicated .. by slagdogg · 2003-06-26 06:48 · Score: 3, Interesting

Bayes rocks, been using it with spamassassin and it kills 99% of my spam. The problem is when some asshole spammer uses my email address in the 'From' header of his spam ... then I get scores of 'user not found' or 'virus detected' emails from legitimate mail servers ... it's not spam, but it's just as annoying. How do you guys deal with this problem?

--
(Score:-1, Wrong)

Well... What would REALLY interest me is... by crazyphilman · 2003-06-26 07:19 · Score: 3, Funny

A Bayesian filter that reads personal ads, compares them to ads posted by women who are KNOWN to have been "easy" (on a sliding scale, configurable, ranging from "mildly slutty" to "dangerously psychotic nymphomaniac"), and returns a list of likely phone numbers.

Hell, I'd pay MONEY for a piece of software THAT good (Hmm, clickety-click, select "nymphomaniac", enter search site... Ah! This one has an oral fixation! Thank you, Mr. Bayes!).

--
Farewell! It's been a fine buncha years!

I use SpamAssassin with Hotmail by esanbock · 2003-06-26 07:56 · Score: 3, Informative

1. Use Debian
2. apt-get install spamassassin
3. apt-get install hotway
4. Add this to your /etc/inetd.conf: pop3 stream tcp nowait nobody /usr/sbin/tcpd /usr/bin/hotwayd
5. Switch to Kmail
6. Menu: Settings|Configure Filters
7. Add first filter.
a. Select Match Any of the following
b. Select size 250000
c. Filter action: PIPE THROUGH spamassassin
8. Add second filter
a. Select 'Match any of the following'
b. Type 'X-Spam-Flag' (no quotes)
c. Select equals. Type 'YES'
d. Filter action: Move to folder [your spam folder]
9. It's crucial that the second filter happens after the first (use the arrows to the left).

There you have it - a spam-free Hotmail account. Not quite setup.exe, but this is Linux after all.
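The second filter in the steps above boils down to a header check on the message that spamassassin has already annotated. A minimal sketch of that check using Python's standard email module follows; the function name and folder names are placeholders, and it assumes the message has already been piped through spamassassin so that an X-Spam-Flag header is present on flagged mail.

```python
from email import message_from_string

def target_folder(raw_message, spam_folder="spam", inbox="inbox"):
    # spamassassin marks flagged mail with "X-Spam-Flag: YES".
    msg = message_from_string(raw_message)
    flag = (msg.get("X-Spam-Flag") or "").strip().upper()
    return spam_folder if flag == "YES" else inbox
```

This also shows why the filter order matters: if the header check ran first, the header would not be there yet and everything would land in the inbox.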

I must have done something wrong... by chinton · 2003-06-26 08:35 · Score: 3, Funny

I tried a few months ago to write a Spam filter in Python, but no matter what I tried, this was the only output I could receive:

I DON'T LIKE SPAM! I DON'T LIKE SPAM! I DON'T LIKE SPAM!

No it's not... by Goonie · 2003-06-26 14:41 · Score: 4, Insightful

Client side filtering is not an ideal spam solution, but it's a good thing on both a micro and macro scale.

For the 99% of people who don't respond to spam, it makes no difference to the spammer whether they filter it or delete it manually. At an individual level, it reduces the amount of spam I have to deal with to manageable levels.
For the 1% that *do* respond to spam, having a filter might reduce the amount of spam they respond to and thus reduces the financial rewards for spammers. Anything that reduces the financial rewards for spammers is going to help reduce the spam levels.
If spammers are spending all their time and money figuring out how to beat filters, that's time and money that they're not using to send spam.

As for your indictment of spam filtering providers, could you please explain where the spamassassin devteam is making money?

My choices with regards to spam at the moment are simple. Use spamassassin or something like it, or wade through spam myself. I know which I'd prefer.

--

Any sufficiently advanced technology is indistinguishable from a rigged demo
--Andy Finkel (J. Klass?)

Slashdot Mirror
