Working Bayesian Mail Filter

Whas that? by cos(0) · 2002-11-03 06:08 · Score: 2, Interesting

Would anyone care to explain what is a "Bayesian" mail filter?

Re:Whas that? by Raul654 · 2002-11-03 06:11 · Score: 1

You took the words right out of my mouth (and I want them back!)

--

To make laws that man cannot, and will not obey, serves to bring all law into contempt.
--E.C. Stanton
Re:Whas that? by DalTech · 2002-11-03 06:16 · Score: 4, Informative

Bayesian is statistical theory and methods useful in the solution of theoretical and applied problems in science, industry and government. http://www.bayesian.org/
Re:Whas that? by Evil+Adrian · 2002-11-03 06:19 · Score: 5, Funny

If you had just clicked the POPFile link, you would see the explanation.

Initiative is your friend.

Hyperlinks are your friend.

Don't be afraid, just click.

--
evil adrian
Re:Whas that? by dvk · 2002-11-03 06:19 · Score: 5, Informative

From what I understand, it is a mail filter which determines what to filter out based on a statistics-based machine learning system called "Bayesian Learning".
A couple of URLs quickly found on Google:
http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-7.html
http://www.csse.monash.edu.au/courseware/cse5230/assets/images/week09.pdf
Also, any decent AI/machine learning textbook ought to cover the topic.
-DVK

--
"The right to figure things out for yourself is the only true freedom everyone shares. Go use it"-R.A.Heinlein
Re:Whas that? by sfe_software · 2002-11-03 07:02 · Score: 5, Informative

If you had just clicked the POPFile link, you would see the explanation.

I also highly recommend this link, as it goes into quite a lot of detail on this filtering technique. After reading it, I am going to give the Perl variation a shot.

--
NGWave - Fast Sound Editor for Windows
Re:Whas that? by Anonymous Coward · 2002-11-03 07:07 · Score: 0

Mod this up, it's not "Offtopic"!!!
Re:Whas that? by Anonymous Coward · 2002-11-03 09:02 · Score: 0

If I forget to pay my cable bill, are they going to send the FBI with guns drawn to my house too? So they can raid my house and take my checkbook?
Re:Whas that? by Anonymous Coward · 2002-11-03 22:21 · Score: 0

Bayes' theorem is used throughout statistics and probability. Most statisticians are frequentists, but they do not reject the theorem itself; what they reject is Bayesian inference built on subjective prior probabilities.

Those terrorists! by edb · 2002-11-03 06:08 · Score: 0, Offtopic

This is a clear and present threat to our society. Good thing the FBI acted quickly!

--
In theory, practice and theory are the same. In practice, they rarely are.

Re:Those terrorists! by edb · 2002-11-03 06:10 · Score: 1

Gaack! reply connected to wrong parent article!

Never mind...
- Emily Litella

--
In theory, practice and theory are the same. In practice, they rarely are.
Re:Those terrorists! by Anonymous Coward · 2002-11-03 07:16 · Score: 0

Nevermind, you still got an "insightful" out of it!

True "Bayesian" and do I care? by Dog+and+Pony · 2002-11-03 06:08 · Score: 1, Flamebait

From what I understand the new email client that comes with OS X Jaguar has a feature similar to this, but I don't know if it is true Bayesian.

Who cares? Whatever works best should be used, not the one with the coolest name or whitepaper, right?

Re:True "Bayesian" and do I care? by Evil+Adrian · 2002-11-03 06:11 · Score: 0

I guess if I was using a Mac, that would be a valid statement.

--
evil adrian
Re:True "Bayesian" and do I care? by Anonymous Coward · 2002-11-03 06:12 · Score: 0

ifile has been around for a LONG time and it uses "Naive Bayesian". It's functioned well enough for me.
Re:True "Bayesian" and do I care? by Anonymous Coward · 2002-11-03 08:26 · Score: 0

without a cool whitepaper, what geek is really going to know about it? they still do have some use imho.

what about decimal ? by Anonymous Coward · 2002-11-03 06:09 · Score: 0

Does it give hexadecimal output (like for messages blocked)? I hate decimal.

Re:what about decimal ? by Anonymous Coward · 2002-11-03 06:31 · Score: 0

good point. submit a patch immediately.
Re:what about decimal ? by Anonymous Coward · 2002-11-03 06:35 · Score: 0

I'm curious. When you developed your decimal v. hexadecimal comment generator script did you mostly use base 16 or base 10?

spambayes.sf.net by supton · 2002-11-03 06:10 · Score: 5, Informative

Saw this a few weeks back... Spam filter in Python using Naive Bayes.

Re:spambayes.sf.net by mpieters · 2002-11-03 10:54 · Score: 2, Informative

Note that the spambayes core has been developed by Tim Peters of the PythonLabs team, someone who has tons of experience with statistical schemes and the fine-tuning of them. The results from this filter so far have been fenomenal.

--
"The truth shall make ye fret" -- The Truth, Terry Pratchett
Re:spambayes.sf.net by Pig+Hogger · 2002-11-03 11:28 · Score: 1

Spam filter in Python using Naive Bayes.
You mean Monty Python???
Re:spambayes.sf.net by Weird+Dave · 2002-11-04 16:05 · Score: 2

I think you mean phenomenal. I don't mean to be the spelling nazi or anything, but your post was going so well, and that ending was just anticlimactic.

--

Grumble, Grumble

Honest to whom? by matth · 2002-11-03 06:11 · Score: 0, Offtopic

Honest to god? or God? Just which god/God is it honest to? Capital or lowercase G?

Re:Honest to whom? by Anonymous Coward · 2002-11-03 06:13 · Score: 0

Whom() -- deprecated, see who()
Re:Honest to whom? by jez9999 · 2002-11-03 06:37 · Score: 1

Whom's a perfectly valid word today, you moron. It's like saying 'dont say television when we can say tube'.

--
== Jez ==
Do you miss Firefox? Try Pale Moon.
Re:Honest to whom? by DrPascal · 2002-11-03 08:07 · Score: 1

It handles all cases:

src honesty.pl, line 349:

if($a =~ m/(not)? [hH]onest to ([Gg]od|[Aa]llah|[Ss]atan)/)
{ ...

--
DrPascal: Not the language, the mathematician.

Sure it's promising by bigberk · 2002-11-03 06:12 · Score: 4, Insightful

And I'm going to check it out right now :) But one long-standing fear I have with such solutions is spammers adapting to new environments (changing the wording used, making the emails look more professional). Sure, they're dumb shits but they're still humans with brains.

Re:Sure it's promising by outlier · 2002-11-03 06:26 · Score: 5, Informative

While spammers will undoubtedly continue to refine the content of their messages, one of the strengths of using a Bayesian filter like this is that it uses the user's own spam and non-spam (ham) as the basis for its calculations. This means that messages are categorized not only by whether they contain spammy words, but also by whether they contain the hammy words from your own messages. So, even if spammers could refrain from using words like "free", "mortgage", "sluts" and "spam", they probably wouldn't use the words that distinguish your own ham from everyone else's (e.g., if you are a computer scientist, your mail may include hammy words like "algorithm", "compile", "project" or "stargate" that would help distinguish ham from spam). The challenge to the spammer would then be to target you with spam that looks like *your* ham (which is probably different from the ham of others).
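
A rough sketch of the per-word idea described above, in Python. The function names, the tiny training sets, and the add-one smoothing are illustrative assumptions, not what POPFile or any other filter actually does; it also assumes spam and ham are equally likely a priori.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase word tokens; a real filter would also keep header and HTML tokens.
    return re.findall(r"[a-z0-9$'-]+", text.lower())

def word_spamminess(spam_msgs, ham_msgs):
    """Return {word: estimated P(spam | word)} learned from the user's own mail."""
    # Count each word once per message so long messages do not dominate.
    spam_counts = Counter(w for m in spam_msgs for w in set(tokenize(m)))
    ham_counts = Counter(w for m in ham_msgs for w in set(tokenize(m)))
    n_spam, n_ham = len(spam_msgs), len(ham_msgs)
    scores = {}
    for word in set(spam_counts) | set(ham_counts):
        # Add-one smoothing (an assumption) keeps scores away from exactly 0 or 1.
        p_word_spam = (spam_counts[word] + 1) / (n_spam + 2)
        p_word_ham = (ham_counts[word] + 1) / (n_ham + 2)
        scores[word] = p_word_spam / (p_word_spam + p_word_ham)
    return scores

# Hypothetical training mail for one user.
spam = ["free mortgage offer", "free sluts click now"]
ham = ["the algorithm will not compile", "project meeting about the compile errors"]
scores = word_spamminess(spam, ham)
print(scores["free"], scores["compile"])
```

Words that only show up in this user's ham ("compile") score well below 0.5, which is the "hammy words" effect: the spammer would have to guess them per recipient.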

Future systems (assuming faster processors and more HD space) could include semantic analysis (e.g., Latent Semantic Analysis) to do an even better job and go beyond the word level.
Re:Sure it's promising by bmwm3nut · 2002-11-03 06:26 · Score: 2, Interesting

that's the beauty of this approach. the filter learns all the time (or at least you can set it up that way). so if spammers get smart, it doesn't take long until the filter adjusts. what i'd love to see is this filter built into a mail client where you have two buttons for delete. one, just to delete the mail, the other to delete it and mark it as spam. when you press that button the filter would scan the email and update its rules.
Re:Sure it's promising by Theodore+Logan · 2002-11-03 06:27 · Score: 2

Well, as the man says in the article:

The Achilles heel of the spammers is their message. They can circumvent any other barrier you set up. They have so far, at least. But they have to deliver their message, whatever it is. If we can write software that recognizes their messages, there is no way they can get around that.

And I think that in this he is correct, almost even provably correct. That's theory, however. In practice no system, short of "real" AI, will be good enough to always recognize spam with a zero false positive rate. It may eventually be good enough, but it won't be perfect. Natural language is just too hard to parse in this way.

But don't despair. If it flunks, there's always spammotel and their likes.

--
"If you think education is expensive, try ignorance" - Derek Bok
Re:Sure it's promising by wheany · 2002-11-03 06:37 · Score: 1

I've already seen a couple of spams that had a bunch of nonsense and a picture attachment. I didn't open the picture, but it could've had an address...
Re:Sure it's promising by chrsbrwn · 2002-11-03 06:40 · Score: 1

Note that Bayesian mail filters use a probabilistic analysis of the word distribution in the email you feed it during the training process in order to classify email as spam, or nonspam (and in the case of popfile, any other category you want to create).

As long as the spam you receive remains sufficiently different from the nonspam email you receive, Bayesian filters should still flag it properly. To put it another way, it doesn't test for the presence of specific words to categorize as spam (like SpamAssassin does), instead it uses the probability database built up during the training process to determine how similar the prospective email is to either the spam you have received and trained upon or to your regular email that you have trained upon. Thus it is much less susceptible to spammers changing their wording in order to defeat the filter.
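
To make the "probability database" concrete, here is a minimal naive Bayes sketch in Python that supports arbitrary buckets. The class name, the tokenizer, and the Laplace smoothing are assumptions made for illustration; this is not POPFile's code.

```python
import math
import re
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.msg_counts = Counter()              # bucket -> training messages seen
        self.vocab = set()

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, bucket, text):
        words = self.tokenize(text)
        self.word_counts[bucket].update(words)
        self.msg_counts[bucket] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = self.tokenize(text)
        total_msgs = sum(self.msg_counts.values())
        best_bucket, best_score = None, -math.inf
        for bucket in self.msg_counts:
            # log P(bucket) + sum of log P(word | bucket), Laplace-smoothed.
            score = math.log(self.msg_counts[bucket] / total_msgs)
            total_words = sum(self.word_counts[bucket].values())
            for word in words:
                count = self.word_counts[bucket][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket

nb = NaiveBayes()
nb.train("spam", "free mortgage click now")
nb.train("spam", "free offer act now")
nb.train("nonspam", "the algorithm does not compile")
nb.train("nonspam", "project meeting notes attached")
print(nb.classify("free mortgage now"))
print(nb.classify("the project does not compile"))
```

Working in log space avoids underflow when many small probabilities are multiplied, and adding a third bucket only takes more `train` calls with a new bucket name.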
Re:Sure it's promising by rgmoore · 2002-11-03 06:47 · Score: 4, Informative

Another important point is that there are some things that they can't hide, at least not in their current working model. If they're trying to sell you something, they have to describe what that thing is and where you can get it, and those descriptions are unlikely to be in any legitimate email. If they want to advertise a web site, they have to include its URL in the message, and the filter can catch that. If they advertise a physical address or phone number, the system can catch those, too. If they don't repeat the message, it means that there's inherently less spam, because I'm only seeing each ad once.
It's also not possible to disguise everything in their headers, so things like their posting host (either the one they pay for legitimately or any open relay they're taking advantage of) will wind up being a pointer to who they are. They certainly can't change anything about the headers that's added downstream of their posting host, so as long as they keep using the same one it's likely that there will be characteristic stamps there that the spammers absolutely can't change. I know that analysis of the headers is part of bogofilter, another Bayesian filter that I've been using to good effect.

--
There's no point in questioning authority if you aren't going to listen to the answers.
Re:Sure it's promising by Brendan+Byrd · 2002-11-03 06:52 · Score: 2

Is there an application to this theory with SpamAssassin? Right now, it's more or less human-edited words and phrases, but applying a real Bayesian method to it would increase its accuracy. I've also considered making a filter that would change the scores of the different SA rules to reduce the false positives, but this would be a long project.

--
Zodiac Survey
Re:Sure it's promising by rgmoore · 2002-11-03 06:56 · Score: 4, Informative

Bogofilter comes close to this. It has an operating mode where each file that it filters is automatically added to the appropriate corpus, either of spam or non-spam. Since it's correct the vast majority of the time, that means that there's very little for the user to do. When it is wrong, you just take the messages that it miscategorized and feed them back into the system with the notation that they were originally marked incorrectly, and it backs out the changes to the wrong category and adds them to the correct category.
I'm using bogofilter with Evolution, and it works very well. I just have two extra folders, one for false negatives and one for false positives. When I notice mail that's been flagged incorrectly, I drag it into the appropriate folder and run a script that tells bogofilter to correct its mistake. Then I either flush the mail (if it was spam marked as non-spam) or process it normally (if it was non-spam marked as spam). I've only been using it for about two weeks and it already has a nearly zero false positive rate (i.e. incorrectly flagged as spam) and a usefully low false negative rate (i.e. incorrectly flagged as legitimate).
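
The register-then-correct workflow described above could be sketched like this in Python. The `Corpora` class and its method names are hypothetical and only illustrate the bookkeeping; this is not bogofilter's code or its command-line interface.

```python
from collections import Counter

class Corpora:
    """Word counts for two corpora, with a way to undo a wrong registration."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def register(self, label, words):
        # Auto-training: every filtered message is added under the label it got.
        self.counts[label].update(words)
        self.messages[label] += 1

    def unregister(self, label, words):
        self.counts[label].subtract(words)
        # Unary plus drops entries whose count fell to zero or below.
        self.counts[label] = +self.counts[label]
        self.messages[label] -= 1

    def correct(self, wrong_label, words):
        # Back the message out of the wrong corpus and add it to the right one.
        right_label = "ham" if wrong_label == "spam" else "spam"
        self.unregister(wrong_label, words)
        self.register(right_label, words)

corpora = Corpora()
isp_mail = ["welcome", "to", "your", "new", "isp"]
corpora.register("spam", isp_mail)   # the filter guessed wrong
corpora.correct("spam", isp_mail)    # the user drags it to the false-positive folder
print(corpora.messages)
```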

--
There's no point in questioning authority if you aren't going to listen to the answers.
Re:Sure it's promising by Helter · 2002-11-03 07:00 · Score: 1

Cloudmark does something very similar, but then goes one better. It adds your results to the results of every other user.
This would, in my opinion, add greatly to this new anti-spam tool. Have the filter customize itself to you, but at the same time have it upload your rules to a central server so that every new piece of spam doesn't have to be "discovered" by each person. Sure you'll still have your personalized rules that will take precedence, but you'll also have millions of other people helping out.
Re:Sure it's promising by Anonymous Coward · 2002-11-03 07:22 · Score: 0

Ham is good.

Long live Ham!
Re:Sure it's promising by dvdeug · 2002-11-03 07:24 · Score: 2

it already has a nearly zero false positive rate

True. I think I've only had one message falsely get pegged as spam.

a usefully low false negative rate

I haven't found this to be true. Maybe it's because I didn't save up all my spam to send through it, but I estimate that it's only catching half my spam. Nigerian spams still get through sometimes, which should have been very easy to catch.
Re:Sure it's promising by rgmoore · 2002-11-03 07:31 · Score: 1

Well, I guess it just depends on what you consider to be a usefully low false negative rate. I'd guess that mine flags 90+% of incoming spam correctly, and that's with just a few weeks worth of spam to learn from. I think that reducing spam by 90% without a serious problem with false positives (I've had one; a html message from my new ISP. After telling the filter that it was wanted, future messages made it through fine.) is useful. I have every reason to think that it will continue to get better, and that's good enough for me. I also suspect that if my spam load gets higher, the filter will improve quickly enough to catch up.

--
There's no point in questioning authority if you aren't going to listen to the answers.
Re:Sure it's promising by marmoset · 2002-11-03 07:47 · Score: 2, Informative

Over the last month or so, I've received a few really strangely worded porn spams that seem to be engineered so as not to trip ISP porn filters. They use lots of passive verbs, no exclamation points, no HTML, and dictionary definitions of whatever kink the spammer is selling.

Since I use Jaguar's mail client, I just told it that these were spam too and now it catches them by itself. :)
Re:Sure it's promising by tsg · 2002-11-03 08:46 · Score: 4, Insightful

Any solution that requires spammers to be more clever is going to reduce the number of spammers. And that is the end goal.

--
People's desire to believe they are right is much stronger than their desire to be right.
Re:Sure it's promising by Tim+Browse · 2002-11-03 08:51 · Score: 4, Interesting

One interesting fact that came out of these statistical analyses of spam was from one that was featured a while back on slashdot - the guy was doing word analysis, and was looking for good spam indicators/correlations, and expected "sex" or "teens" to be a good match, but the best word was, surprisingly, "ff0000". This was because so much spam uses HTML mail with red text.

So if nothing else, it will force spammers to stop using red text - that has to be some kind of victory :-)

Tim
Re:Sure it's promising by shayne321 · 2002-11-03 10:08 · Score: 2

Is there an application to this theory with SpamAssassin?
Yup, they've been working on getting this added to the tests they're already doing, so you get the benefits of bayes filtering PLUS all of the other tests SA does. I'm not attached to the project, but I follow the sa-devel list.. From what I've seen, most of the code is in CVS so you can get it by downloading one of the nightly test builds.. I imagine if it goes well it will be included in the next release.
Shayne

--
Today I didn't even have to use my AK; I got to say it was a good day -- Icecube
Re:Sure it's promising by Brendan+Byrd · 2002-11-03 12:32 · Score: 2

Any links to the archive of devel messages? I didn't see much with the search for "Bayesian" in the list archive.

--
Zodiac Survey
Re:Sure it's promising by Alsee · 2002-11-03 12:48 · Score: 5, Funny

(e.g., if you are a computer scientist, your mail may include hammy words like "algorithm" "compile" "project" or "stargate" that would help distinguish ham from spam.

I have a cousin that lives in Nigeria and we regularly discuss tips on penis enlargement. He works at a bank refinancing mortgages and his wife is a professor at an accredited university. I work in a Las Vegas casino producing shows featuring live nude showgirls. He offered to help me pay some bills and get out of debt (a generous offer, but I told him I just found a second part time job working from home earning thousands of dollars per week). My wife is a stock broker and I regularly let my cousin in on hot stock tips. I have an herb garden, I take viagra, and use rogaine. Since we both own the same brand of printer we've been working out the best way to refill the ink cartridges. I've been trying to lose weight, but it comes right back as soon as I quit smoking.

I don't quite understand this "beysian filter" stuff, but I can't wait to try it out!

-

--
- - You can't take something off the Internet! That's like trying to take pee out of a swimming pool.
Re:Sure it's promising by Anonymous Coward · 2002-11-03 14:07 · Score: 0

The marketing-speak for Apple's spam filter includes a line about "adaptive latent semantic analysis." In any event, I can confirm it works pretty well-- but since it's closed-source, we don't know exactly how.
Re:Sure it's promising by shayne321 · 2002-11-03 14:12 · Score: 2

Here's SourceForge's archive. The developers mostly refer to it as Bayes, so try that in your searching.. Also you can search their bugzilla for "bayes" to see some of the discussion there, too.
Shayne

--
Today I didn't even have to use my AK; I got to say it was a good day -- Icecube
Re:Sure it's promising by Anonymous Coward · 2002-11-03 14:18 · Score: 0

Bogofilter was written by Eric S. Raymond on August 19, 2002. It gained popularity in September 2002...

to all those FSF, "GNU/Linux" bashers... look, ESR does, in fact, kick ass
Re:Sure it's promising by SubtleNuance · 2002-11-03 14:21 · Score: 1

do you have a how-to on setting this up?
Re:Sure it's promising by tsg · 2002-11-03 14:35 · Score: 2

It may eventually be good enough, but it won't be perfect.

"Good enough" is usually all that's required. We could spend a ton of money and a bunch of time coming up with the perfect solution (maybe) to a problem that, for most people, is largely just a nuisance. Or we can go with a much cheaper, easier to implement, "good enough" solution that will bring the problem down to manageable levels.

If it flunks, there's always spammotel

I found this in their FAQ: "there is nothing to install on your computer. When you download the SpamMotel User Interface and place it on your desktop, it is ready to go."

OK, so which is it?

--
People's desire to believe they are right is much stronger than their desire to be right.
Re:Sure it's promising by Daytona955i · 2002-11-03 15:45 · Score: 1

Don't forget that adding a bigram or trigram model (using sequences of 2 or 3 adjacent words) would increase the probability of detecting spam a lot. Granted, it takes more computing to determine this, but on texts as small as an e-mail, a bigram model at least would be very helpful at catching things like "act now" or "limited offer" and such. This is a really good idea; let's hope most mail programs implement something like this.
-Chris

p.s. I've used Apple's e-mail program on Jaguar and I must say it's really good at catching spam. It has to learn at first but it was pretty good right out of the box. Incorrectly identified some of my mailing lists that I wanted but I just clicked on Not Junk and it fixes it. I'm not sure what they are using to identify spam but it's been pretty good so far. I just check it every once in a while to make sure.
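
For the bigram/trigram suggestion above, a small Python sketch of what the extra tokens would look like. The function name and tokenizer are assumptions; the resulting tokens would simply be counted like single words by whatever filter consumes them.

```python
import re

def ngram_tokens(text, max_n=2):
    """Return single words plus runs of up to max_n adjacent words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    tokens = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens

print(ngram_tokens("Act now, limited offer!"))
# ['act', 'now', 'limited', 'offer', 'act now', 'now limited', 'limited offer']
```

The cost mentioned above shows up as a bigger token table: a message of k words yields roughly k extra tokens per n-gram order.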
Re:Sure it's promising by stand · 2002-11-03 18:30 · Score: 2, Insightful

This is true, but remember, with computer-related ventures like a spam operation, all you need is one clever person to write the clever program that gets distributed to all the morons. This spam filter is a perfect example. I'm not clever enough to write something like that myself, but I'm certainly clever enough to download it and use it.

--
Four fifths of all our troubles in this life would disappear if we would just sit down and keep still. -C. Coolidge
Re:Sure it's promising by Anonymous Coward · 2002-11-03 18:50 · Score: 0

What's wrong mods??? +1, Hilarious!!!
Re:Sure it's promising by hfastedge · 2002-11-03 21:26 · Score: 1

this is the funniest thing i have ever seen on here!!

Not an ounce of negative humor.

--
-- -- --
Help my mini cause: My journal
Re:Sure it's promising by Anonymous Coward · 2002-11-03 23:40 · Score: 0

Avoiding spam with a Bayesian filter can be done so that you don't have to look for false positives: automatically reply to the original author explaining why his message was filtered. If it was spam, in most cases the reply just bounces (and the bounce can then easily be filtered away). If it wasn't spam and the original author exists, he will try to contact you again with a different message, avoiding the words he used. Then you can add him to your "approved From" list. Those are all the possibilities there are. regs, Berzelius
Re:Sure it's promising by Autonomous+Crowhard · 2002-11-06 06:09 · Score: 2

While you're right about all the things that are difficult to hide, you (and apparently everyone else) have missed one of the other neat aspects. Filtering of the "ham" quickly becomes an inadvertent whitelist.

Server-side solutions? by Quixote · 2002-11-03 06:12 · Score: 3, Interesting

Any server-side solutions (MTA==qmail, MDA==procmail) using this (Naive-Bayesian) technique out there?

Re:Server-side solutions? by rehannan · 2002-11-03 06:21 · Score: 2

I've been using PopTray (a POP3 email checker for Windows). You have the option of defining "rules" which allow you to delete emails server-side.

It's not a "smart filter" but it works fine for me.
Re:Server-side solutions? by cmeans · 2002-11-03 06:33 · Score: 4, Interesting

James is a 100% Java Email server (SMTP, POP3, NNTP, and IMAP soon) that supports mail-server extensions via the Mailet API. I developed a Java implementation of the Bayesian rules discussed, so that they could be used in any configuration, but also provided a mailet-wrapped implementation so that the filtering (or flagging) could be done at the server side.

--
Give a hand, not a hand-out.
Re:Server-side solutions? by ranmachan · 2002-11-03 06:34 · Score: 1

apt-get install bogofilter :-)

--
Tobias
Re:Server-side solutions? by Saint+Aardvark · 2002-11-03 06:41 · Score: 1

Oh man, that looks perfect...sorry, see my post below re: something for Windows users. We use SpamAssassin, and I'd love something that would let people filter by the score. OE doesn't let you filter on any header, but this does...sweet. Thanks for the tip.

--
Carousel is a lie!
Re:Server-side solutions? by koreth · 2002-11-03 06:44 · Score: 4, Interesting

I've been using SpamProbe (which gets invoked from procmail) with excellent results.
Re:Server-side solutions? by Anonymous Coward · 2002-11-03 07:16 · Score: 0

spamprobe is great, and there are other Perl, Python, and C versions.

Went from 200-400 spam a day to 0-1 a week. I shit you not.

The first couple days I had a few good emails slip through (because I forgot to train them, they are blank, subjectless emails with large mime attachments) but after I tweaked it I've had absolutely no false-positives for ~2 weeks.

I actually look forward to email now, something I haven't in several years.

I trained it with 8000+ spam and ~750 good emails.
Re:Server-side solutions? by ragnar · 2002-11-03 08:49 · Score: 2

Yes, my company provides an online service to do this sort of thing. We are in beta right now. email me (ragnar@spinweb.net) if you are interested in some more details, as the marketing stuff on the site is a bit lacking.

--
-- Solaris Central - http://w

Mozilla in Process of adding Bayesian filter by AT · 2002-11-03 06:14 · Score: 5, Interesting

The mozilla mail client is getting a Bayesian mail filter, too. See http://bugzilla.mozilla.org/show_bug.cgi?id=163188 . Unfortunately, it probably won't show up until after version 1.2 is released.

Re:Mozilla in Process of adding Bayesian filter by Jugalator · 2002-11-03 06:25 · Score: 2

And it seems likely the SpamBayes project will work as the foundation for their mail filter.

There are a few other applications that use this code as well, such as an Outlook 2000 add-in.

--
Beware: In C++, your friends can see your privates!
Re:Mozilla in Process of adding Bayesian filter by Anonymous Coward · 2002-11-03 12:46 · Score: 0

Is this enabled in the builds now? How do I get it working?
Re:Mozilla in Process of adding Bayesian filter by the_olo · 2002-11-05 02:19 · Score: 1

Look at bug 169638, comment #125.
http://bugzilla.mozilla.org/show_bug.cgi?id=169638#c125

That Google search... by Jugalator · 2002-11-03 06:15 · Score: 4, Insightful

Try searching for "bayesian email filter" instead of just "bayes email filter" (as in the news post). You'll get better results and more hits since Google doesn't match "*bayes*" (as one would think) when searching for "bayes", but only the actual word "bayes".

--
Beware: In C++, your friends can see your privates!

Re:That Google search... by Preposterous+Coward · 2002-11-03 07:45 · Score: 2

Google doesn't match "*bayes*" (as one would think) when searching for "bayes",

Just curious: Why at all would anyone think that "bayes" would match "*bayes*"? Imagine if searching for "cars" also got you "scars", "Johnny Carson", and so on...

It might make sense for a search engine to do limited stemming (cars -> car, eating/eats/ate -> eat), but that's something completely different...

--

"Biped! Good cranial development. Evidently considerable human ancestry."

Bayesian? Wow!!! I'm sooo excited. (Irony!) by davids-world.com · 2002-11-03 06:16 · Score: 5, Interesting

A true Bayesian filter, wow. Let's face it, statistical classifiers based von Bayes' formula are not really state of the art. They make false assumptions about the data (independence of features).

More intelligent classification algorithms can solve non-linear problems far better. Check out Kernel Machines and, somewhat older, Maximum Entropy models.

Enough nerd talk for today :-)

Re:Bayesian? Wow!!! I'm sooo excited. (Irony!) by wheany · 2002-11-03 06:41 · Score: 1

They make false assumptions about the data (independence of features).
If it works (and people say it does), who cares?
Re:Bayesian? Wow!!! I'm sooo excited. (Irony!) by Anonymous Coward · 2002-11-03 06:44 · Score: 1, Informative

I've been doing research into email filtering using AI, and SVMs/kernel machines seem to work well (statistically, they're correct more than the other methods), but they require massive tuning.

On the other hand, Naive Bayes is usually easier to implement, easier to tune, and only trails by a few percentage points.

One of the more promising bayes units is AutoClass, offered by Cheeseman et al.: a public domain classifier that's been around for years and years, and seems to perform quite nicely.
Re:Bayesian? Wow!!! I'm sooo excited. (Irony!) by Anonymous Coward · 2002-11-03 07:35 · Score: 1, Interesting

A True Jedi Nerd would use compression based classification. Make two zip/gz/bz2/lzw/whatever archives, one containing known-not-spam and one containing known-spam. For each incoming mail, add to both archives, see which compresses better, bingo, that's the category it's supposed to be in. Obviously needs some tweaking (blocksizes etc.) but that's the gist of it.
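
A bare-bones Python sketch of that compression trick, using the standard `zlib` module. The function names and the toy corpora are made up for illustration, and nothing here balances the sizes of the two corpora, so treat it as a demonstration of the idea rather than a usable filter.

```python
import zlib

def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def classify_by_compression(message: str, spam_corpus: str, ham_corpus: str) -> str:
    msg = message.encode("utf-8")
    spam = spam_corpus.encode("utf-8")
    ham = ham_corpus.encode("utf-8")
    # Extra bytes needed to encode the message when it is appended to each corpus.
    spam_cost = compressed_size(spam + msg) - compressed_size(spam)
    ham_cost = compressed_size(ham + msg) - compressed_size(ham)
    return "spam" if spam_cost < ham_cost else "ham"

# Hypothetical known-spam and known-not-spam archives.
spam_corpus = ("Buy cheap viagra now. Free mortgage refinancing offer. "
               "Click here to lose weight fast. ")
ham_corpus = ("The algorithm fails to compile on my machine. "
              "Please review the project patch before the meeting. ")

print(classify_by_compression("Free mortgage refinancing offer. Click here",
                              spam_corpus, ham_corpus))
print(classify_by_compression("Please review the project patch",
                              spam_corpus, ham_corpus))
```

The message is cheap to encode next to text it resembles because the compressor can point back at earlier matching strings instead of spelling them out again.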

Apparently, it does work, though I can't whip out the references just now.

Anyway, naive bayes is interesting mostly because it's so damn fast and only requires one pass through the data; and it works well, it just makes you feel stoopid because it's called "naive".
Re:Bayesian? Wow!!! I'm sooo excited. (Irony!) by Lenbok · 2002-11-03 07:41 · Score: 3, Interesting

Actually compression-based techniques don't work particularly well, mainly because they are very sensitive to the amount of training data. If you have a lot of non-spam mail, your non-spam compressor will compress better than your spam compressor.

In the long view, all compression is machine learning anyway :-)
Re:Bayesian? Wow!!! I'm sooo excited. (Irony!) by davids-world.com · 2002-11-03 07:47 · Score: 1

ur right - bayesian statistics can give a good approximation. but a system that works in practice but is flawed theoretically is far less exciting...
Re:Bayesian? Wow!!! I'm sooo excited. (Irony!) by davids-world.com · 2002-11-03 07:50 · Score: 1

A few percentage points may mean a reduction of the error by 50%!
I found the SVM libraries to be quite handy. However, learning probably requires too much time if there are many samples.
Re:Bayesian? Wow!!! I'm sooo excited. (Irony!) by g0at · 2002-11-03 07:54 · Score: 1

Um...

that's not irony, it's sarcasm.

--
myselfmusic
Re:Bayesian? Wow!!! I'm sooo excited. (Irony!) by Anonymous Coward · 2002-11-03 08:07 · Score: 0

You are describing the Minimum Message Length principle I think:

http://www.csse.monash.edu.au/hons/projects/2000/Edmund.Lam/thesis/node5.html
Re:Bayesian? Wow!!! I'm sooo excited. (Irony!) by Anonymous Coward · 2002-11-03 08:32 · Score: 1, Informative

They make false assumptions about the data (independence of features).

NOT TRUE! The Bayesian approach can use the full correlation matrix without diagonalization, e.g., you can write the algorithm to correctly account for the fact that the joint probability of words A and B appearing in the same email is not simply the product of the probabilities of A and B separately. The only downside is that the number of weights the database contains grows as N^2, so storage space and speed can suffer.
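
A small Python sketch of the pairwise bookkeeping this implies, under the assumption that "full correlation" means tracking co-occurrence counts for every word pair. The function names and sample messages are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_stats(messages):
    """Count in how many messages each word, and each word pair, appears."""
    single = Counter()
    pair = Counter()
    for msg in messages:
        words = sorted(set(msg.lower().split()))
        single.update(words)
        # One entry per unordered pair: this is the table that grows as N^2.
        pair.update(combinations(words, 2))
    return len(messages), single, pair

def joint_vs_independent(a, b, n, single, pair):
    key = tuple(sorted((a, b)))
    joint = pair[key] / n                              # estimated P(A and B)
    independent = (single[a] / n) * (single[b] / n)    # P(A) * P(B)
    return joint, independent

msgs = ["act now", "act now", "hello world", "hello there"]
n, single, pair = cooccurrence_stats(msgs)
print(joint_vs_independent("act", "now", n, single, pair))
```

Here "act" and "now" always travel together, so the observed joint frequency (0.5) is twice what the independence assumption would predict (0.25).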
Re:Bayesian? Wow!!! I'm sooo excited. (Irony!) by Anonymous Coward · 2002-11-03 10:01 · Score: 0

SVM libraries don't apply to commercial works.

Most of them (e.g., SVMlight and mySVM) are free for academic use, but NOT free for commercial applications.
Re:Bayesian? Wow!!! I'm sooo excited. (Irony!) by rueba · 2002-11-03 10:53 · Score: 1

I find your point of view interesting (I guess this is the great Bayesian controversy?)

In my opinion, a system that is wonderful in theory but fails in practice is not exciting at all. I am not against theory, it has its place. But when building a practical system the most important question is:- Does it work?

--
The only reason all cover-ups appear to fail is that you never hear about the ones that succeed.
Re:Bayesian? Wow!!! I'm sooo excited. (Irony!) by mysta · 2002-11-03 10:58 · Score: 1

First of all, it's not von Bayes. The guy was named Thomas Bayes.
Secondly, just because something is not state of the art does not mean it should be dismissed out of hand. You are right about Bayes classifiers making false assumptions about the independence of features, but they have been surprisingly successful in practice, even when these assumptions are violated. This paper shows that "...the accuracy of naive Bayes is not directly correlated with the degree of feature dependencies".
While kernel machines tend to be much more accurate (and quite cool theoretically), they are nowhere near as efficient (time and space-wise) to train. You want an intelligent spam filter to go easy on available resources and for this reason I don't think KMs are the way to go.
Another nice technology for text classification is Latent Semantic Analysis but once again, probably not the best tool for this particular job.

--

"Where is the wisdom we have lost in knowledge, and where is the knowledge we have lost in information?"-T.S.Eliot
Re:Bayesian? Wow!!! I'm sooo excited. (Irony!) by JPZ · 2002-11-03 11:03 · Score: 3, Informative

A true Bayesian filter, wow. Let's face it, statistical classifiers based von Bayes' formula are not really state of the art. They make false assumptions about the data (independence of features).

Bullshit. Bayes' formula is exact, and makes no assumption on independence whatsoever. Naive Bayesian approaches make independence assumptions, hence the use of the term naive.
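The distinction can be seen on a toy example. The sketch below uses four made-up "spam" messages (not real data) and compares the exact joint likelihood of two words with the naive estimate that multiplies their individual likelihoods; the helper name `p` is just for illustration.

```python
# Toy illustration: exact joint likelihood vs. the naive (independence) estimate.
# The four "spam" messages are invented for this sketch.
spam = [
    {"free", "money"},
    {"free", "money"},
    {"free"},
    {"meeting"},
]

def p(words):
    """Fraction of spam messages that contain every word in `words`."""
    return sum(1 for msg in spam if words <= msg) / len(spam)

exact_joint = p({"free", "money"})        # P(free, money | spam), counted directly
naive_joint = p({"free"}) * p({"money"})  # P(free | spam) * P(money | spam)

print(exact_joint, naive_joint)
```

Because "free" and "money" tend to occur together in this toy set, the two numbers differ; Bayes' rule itself is equally valid with either likelihood plugged in, and only the naive variant assumes the product form.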

The only inherent drawback in using Bayes' rule in classifiers is that you have to assume the number of classes to be known a priori.

JPZ

Forget Bayes by Evil+Adrian · 2002-11-03 06:16 · Score: 5, Funny

We need the Stalin Mail Filter (TM) -- it detects spam, hunts down the spammer, and exiles them to Siberia.

--
evil adrian

Re:Forget Bayes by CatWrangler · 2002-11-03 07:44 · Score: 1

How about a Baysian Filter, to filter out Bay and Pat Buchanan.

--
---
When you come to a fork in the road, take it! --Yogi Berra--
Re:Forget Bayes by frenetic3 · 2002-11-03 09:54 · Score: 1

They're probably already there, or places even worse. And wouldn't such a filter consider pyramid schemes to be "for the common good" and take a liking to insurance spams that offer Five Year Plans? :P

-fren

--
"Where are we going, and why am I in this handbasket?"
Re:Forget Bayes by Galahad2 · 2002-11-03 11:02 · Score: 3, Funny

I tried that, but it was constantly too paranoid about identifying spam. I can't even remember how many of my friends and family ended up in Vladivostok for sending me bad jokes. The problem sort of solved itself though, since the filter program eventually just barricaded itself in my second hard drive and refused to come out. The only drawback is that now I can't save anything on the drive, since the Stalin Filter instantly deletes everything it can.

*BUT* it's a Perl script... by pilot1 · 2002-11-03 06:17 · Score: 2, Redundant

Sure, it's great that someone made one, but it's a perl script. We might be able to use perl, but most "normal" people have never even heard of perl, let alone know how to run perl scripts. It would be great if someone ported this to an .exe file or something that everyone could run. It'll probably happen eventually.

Re:*BUT* it's a Perl script... by Niksie3 · 2002-11-03 06:27 · Score: 2, Funny

sure... an .exe file everyone could run... have you had your pills today? a perl script runs on many more platforms than any .exe file.

--
Sig you!
Re:*BUT* it's a Perl script... by pilot1 · 2002-11-03 06:41 · Score: 1

Last time I checked, half the world couldn't care less whether it ran on *NIX and Macs. Sure, we geeks can run a perl script, but most people can't. Most people also have Windows, so it makes sense to port it to something that almost everyone can run, not just geeks.
Re:*BUT* it's a Perl script... by Elias+Israel · 2002-11-03 06:49 · Score: 2

This is a very good point.

Truth is, to really tackle the problem of spam, a solution is needed that doesn't require the user to be a software engineer.

Plus, another problem with rolling out a Bayesian filter for a large collection of users is that each individual user needs their very own filter database. The statistical analysis of my mail would be nearly useless for anyone else.

OK, cards on the table: I am working on a new solution that will be useful for the general public and overcomes these problems.

Those who care to learn more can sign up to be notified when it becomes available.

Check out www.PureMessaging.com
Re:*BUT* it's a Perl script... by Fastolfe · 2002-11-03 06:52 · Score: 1

It's not uncommon for new technologies to be implemented with the languages and on the platforms used by those that frequently implement new technologies: geeks.

I read another comment that Mozilla is already trying to implement something similar.

Don't worry, these things will eventually end up suitable for the masses. In the meantime, they're suitable for geeks. Most geeks know what Perl is and how to set up an environment that Perl scripts can run in. Other geeks may choose to port it to a language or platform more familiar to them. I believe something similar is already out there for Python.

This is OpenSource, after all, not a commercial product. If you don't like it, don't use it.
Re:*BUT* it's a Perl script... by rgmoore · 2002-11-03 07:20 · Score: 2, Informative
But perl scripts are just as easy to run as .exe files, so long as you have the perl interpreter installed. So now it's just a two step process:
1. Install perl.
2. Install the perl script.
This is not exactly brain surgery. Perl can be installed on essentially any system you choose to name, with no more trouble than installing any other executable. For those people running Windows, there's an excellent port available from Activestate. As somebody else pointed out, this means that a perl script is actually available to more people than a .exe would be, because it's truly cross-platform.
--
There's no point in questioning authority if you aren't going to listen to the answers.
Re:*BUT* it's a Perl script... by crisco · 2002-11-03 07:22 · Score: 2

You're absolutely right. I've been closely following POPFile's development (and trying to help with docs) and it is a goal of the developer to create a brainless install that the masses can use, while still retaining the cross platform core that is useful for much more than spam detection. POPFile is under very active development and is only recently getting close to the point where it will be ready to stabilize on a release.

--
Bleh!
Re:*BUT* it's a Perl script... by rgmoore · 2002-11-03 07:26 · Score: 1

Free hint: perl runs under Windows, too. True, it's not pre-installed on every system, and it's a bit bigger than you'd probably want as part of a simple install, but there's no real barrier to installing perl on a Windows box. With a bit of cleverness, I'd bet that you could make an installer figure out if the system has perl and install it if it's not there already.
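A rough sketch of that detection step, written in Python rather than as a real installer: it only checks whether an interpreter is on the PATH. The function names and the messages are hypothetical, and an actual Windows installer would also need to handle the download or bundling itself.

```python
import shutil

def find_interpreter(name="perl"):
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)

def install_plan(name="perl"):
    """Describe what a (hypothetical) installer would do next."""
    path = find_interpreter(name)
    if path is None:
        return f"{name} not found: installer would bundle or download it"
    return f"{name} found at {path}: installer can skip that step"

print(install_plan())
```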

--
There's no point in questioning authority if you aren't going to listen to the answers.
Re:*BUT* it's a Perl script... by Jeremy+Erwin · 2002-11-03 07:26 · Score: 2

Lots of libraries allow you to embed a perl interpreter in a C program... I suspect that a number of linux email clients could be altered to run such a script as part of their "retrieve_mail()" functions.

What do you want? A hideous Visual Basic macro in Outlook? The mere fact that one OS is difficult to use with perl shouldn't be an obstacle to innovation.
Re:*BUT* it's a Perl script... by duncangough · 2002-11-03 07:54 · Score: 0

which is pretty much what we all have to do to get Java apps working, or Flash crap to whizz all over our screens.

So why not install Perl? Ever heard of a killer app? It tends to be a carrot and stick exercise for new technologies to become accepted, after all.

--
Suttree, a weblog about casual games development
Re:*BUT* it's a Perl script... by B'Trey · 2002-11-03 08:01 · Score: 2

Might want to check your own medicine cabinet. Sure, Perl runs on more platforms. So what? How many of the world's actual computers have Perl installed? Even better, how many of the world's computers that are used daily to read email have Perl installed? How many of them can run an .exe file? I'd suggest that the latter is orders of magnitude more than the former.

--
"The legitimate powers of government extend only to such acts as are injurious to others." Thomas Jefferson.
Re:*BUT* it's a Perl script... by donarb · 2002-11-03 09:06 · Score: 1

Just what does PureMessaging do? The website has no description of just what is so great about PureMessaging.

And if it's so great, why do you use 'NO@SPAM' in your PureMessaging email address? Anybody can do that without signing up for your service.
Re:*BUT* it's a Perl script... by crucini · 2002-11-03 09:11 · Score: 3, Informative

It would be great if someone ported this to an .exe file or something that everyone could run.

I don't think an .exe would help much - a Windows user doesn't need a standalone executable. He needs a filter (probably a .dll) coded to the specific filtering API of his mail client. Or does Microsoft have a generic mail filtering API? That way the filter seems to run "inside" the mail client.

In general this illuminates one of the advantages of Unix. Lots of programs are written as filters that read from STDIN (standard input) and write to STDOUT (standard output). My own mail filtering script, for example, does that. I didn't have to learn any mailer-specific API, and my script can be used in different contexts. (Actually my script doesn't write to STDOUT - it saves the message to the appropriate folder.)
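As a minimal sketch of the filter pattern described above (not the poster's actual script), the following reads one raw message from an input stream, adds a verdict header, and writes the message back out. The header name `X-Filter-Verdict` and the keyword rule are made up for illustration; a real filter would use a statistical classifier.

```python
import sys
from email import message_from_string

def looks_like_spam(msg):
    # Placeholder rule, purely illustrative; a real filter would be statistical.
    return "viagra" in (msg["Subject"] or "").lower()

def run(instream=sys.stdin, outstream=sys.stdout):
    """Read one raw message from instream, tag it, and write it to outstream."""
    msg = message_from_string(instream.read())
    # X-Filter-Verdict is a made-up header name for this sketch.
    msg["X-Filter-Verdict"] = "spam" if looks_like_spam(msg) else "ok"
    outstream.write(msg.as_string())

# As a script, call run() at the bottom and invoke it as:
#   python filter.py < message.txt
```

Since the function only touches the two streams it is given, it can be exercised from a shell pipeline or a unit test without any mail client involved.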

Windows does not lend itself to the everything-is-a-filter idea because, among other things, process creation is slow and expensive. When a filter is invoked, a process is launched. Unix has more efficient process creation, and Linux has especially efficient and light process creation. Therefore on Windows a mail filter should be implemented as a reusable software component (probably a COM object) that can be called by the mail client.

Also, most mail clients on Unix use the same mail folder format (mbox) which is basically just the literal messages from the network written to a file. Since it is the assumed common language of mail folders, it encourages software to interoperate on the file level, which my script does by writing messages to mail folders. (Unix is file-centric.) Windows mail clients, in contrast, seem to store mail folders in proprietary formats. That's because Windows philosophy is that an application serves as gatekeeper to "its" files - the file is not a unit of interoperability. In our case it means a standalone mail filter probably couldn't write messages to the mail folder.
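For illustration, file-level delivery into an mbox folder might look roughly like this, using Python's standard `mailbox` module. The function name `deliver` and the folder path are assumptions for the sketch, not anything from the poster's script.

```python
import mailbox
from email import message_from_string

def deliver(raw_message, folder_path):
    """Append one raw message to an mbox file, creating the file if needed."""
    box = mailbox.mbox(folder_path)
    box.lock()  # guard against a mail client writing at the same time
    try:
        box.add(message_from_string(raw_message))
        box.flush()
    finally:
        box.unlock()
        box.close()
```

Any mail reader that understands mbox can then open the folder, which is the file-level interoperability the comment is describing.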

Unix is a more friendly, efficient development environment because you can write a mail filter as a standalone program and test it without building a test harness.

I don't get any spam by Istealmymusic · 2002-11-03 06:17 · Score: 3, Funny

Can someone explain why this filter would be useful to me?

--
"The lesson to be learned is not to take the comments on slashdot too literally." --Vinnie Falco, BearShare

Re:I don't get any spam by moosesocks · 2002-11-03 06:49 · Score: 4, Funny

Just post your email address, and we'll be happy to tell you.

--
-- If you try to fail and succeed, which have you done? - Uli's moose

the decimal angle... by Anonymous Coward · 2002-11-03 06:17 · Score: 0

What about the use of decimal in these sites? Can I filter out sites that use the decimal cancer? Geeks hate decimal.

bogofilter by stype · 2002-11-03 06:18 · Score: 4, Informative

This isn't exactly the first bayesian mail filter out there. I've been using ESR's bogofilter for weeks now, and I must say it works better than I could have ever imagined. Bogofilter however is simply for sorting out spam, while it appears this filter can sort out other things. But honestly, I can set up some simple filters to separate personal emails from work emails, so I'm not entirely sure the extra stuff is that useful.

--
-Stype
Bus error -- driver executed.

Re:bogofilter by Theodore+Logan · 2002-11-03 06:44 · Score: 2
You quite obviously haven't checked out bogofilter's README. Let me quote:
- This package implements a fast Bayesian spam filter along the lines suggested
- by Paul Graham in his article "A Plan For Spam".
'Nuff said.
--
"If you think education is expensive, try ignorance" - Derek Bok
Re:bogofilter by SubtleNuance · 2002-11-03 14:45 · Score: 2

what is your point?
Re:bogofilter by Theodore+Logan · 2002-11-04 01:40 · Score: 2
The parent said:
- This isn't exactly the first bayesian mail filter out there.
But the README of bogofilter, which the poster uses as an example of an earlier filter, clearly explains that it is actually built according to the principles laid out in the very article it is claimed to have preceded.

Why is it that you always have to spell things out for your fellow slashdotters? What on earth is so difficult about just reading a post carefully before you reply? Sorry for the harsh tone, but I'm just so tired of this.
--
"If you think education is expensive, try ignorance" - Derek Bok

Bayes Explained by brw215 · 2002-11-03 06:18 · Score: 1, Informative

A naive Bayes classifier is an algorithm that is based on Bayes' theorem in mathematics. It is based on the following theorem

Pr(h|D) = Pr(D|h) * Pr(h)

where Pr is probability, h is the hypothesis and D is the data. In this case it would be

Pr("SPAM"|Email) = Pr(Email|"SPAM") * proportion of spam.

The trick is how to estimate the second term. This is a very popular machine learning algorithm due to its simplicity and elegance. For more info, check out this link Bayes

Re:Bayes Explained by Stonehand · 2002-11-03 06:24 · Score: 1

Don't forget the P(D) term.

--
Only the dead have seen the end of war.
Re:Bayes Explained by johnynek · 2002-11-03 06:36 · Score: 5, Informative

That's /. for you. You guys have modded up to 5 a post that is wrong in both of the equations it posts.

It should be:

Pr(h|D) = Pr(D|h) * Pr(h) / Pr(D)

and:

Pr("SPAM"|Email) = Pr(Email|"SPAM") * (proportion of spam) / (probability of getting this particular Email)
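A small numeric sketch of the full formula, with Pr(D) expanded over the two hypotheses (spam and not spam). All the probabilities below are hypothetical numbers chosen only to show the arithmetic.

```python
def posterior(p_d_given_h, p_h, p_d_given_not_h):
    """Pr(h|D) via Bayes' theorem, with Pr(D) expanded by total probability."""
    p_d = p_d_given_h * p_h + p_d_given_not_h * (1.0 - p_h)
    return p_d_given_h * p_h / p_d

# Hypothetical numbers: this email's wording is 20x likelier under spam than
# under legitimate mail, and 40% of incoming mail is spam.
p_spam = posterior(p_d_given_h=0.02, p_h=0.4, p_d_given_not_h=0.001)
print(round(p_spam, 4))
```

Dividing by Pr(D) is what turns the product into a proper probability between 0 and 1.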

--
jabber: johnynek@jabber.org
Re:Bayes Explained by capt.Hij · 2002-11-03 06:47 · Score: 2

Great, now the spammers will hire mathematicians to figure out how to best defeat the common algorithms used to calculate Pr(D|h). It is the same old story. In a war over information only the mathematicians win.
Re:Bayes Explained by B'Trey · 2002-11-03 07:07 · Score: 4, Informative

Read the referenced article. The only way to avoid the filter is to make your email sound like a normal message. In essence, the filter recognizes the sales pitch. If you remove the sales pitch to get your spam past the filter, you've removed the whole point of sending the spam.

--
"The legitimate powers of government extend only to such acts as are injurious to others." Thomas Jefferson.
Re:Bayes Explained by Jim+Nugent · 2002-11-03 07:55 · Score: 4, Informative

To put this in simpler terms, consider this scenario: 90% of all X-rays that have a certain feature are from women with breast cancer. That is an easy statistic to compute; you have the x-rays and you follow up with the patients.

The trick is to derive a statement like: "If an x-ray has this feature, the patient has an NN% chance of having breast cancer." THAT's useful for screening, but it doesn't follow from the first statement (without some serious statistical calculations).

Bayes' theorem has all sorts of applications in prediction. In the case of E-mail, we can greatly oversimplify and say "We found that X% of E-mails with this subject line are Spam." "We conclude that an E-mail with this subject line has Y% odds of being spam." Note that these are two very different statements. If we can find Y for the second statement and set a threshold we're comfortable with, say, 95%, then we can create a filter with 95% confidence of correctness; it may well be wrong 5% of the time.

Other responses have done a good job with the math so I won't repeat it here.
Re:Bayes Explained by Anonymous Coward · 2002-11-03 08:12 · Score: 1, Informative

I think that the original poster dropped the /Pr(D) term because, in the *cough* referenced articles *cough* they dropped it, since they were only comparing the different Pr(h|D)'s among various email folders ("buckets"), and the /Pr(D) term was the same in all of them.

Thanks for posting the (correct) general form of the equation, though.
Re:Bayes Explained by brw215 · 2002-11-03 10:59 · Score: 2, Informative

Actually I didn't forget it. Typically in Bayesian expressions the denominator Pr(D) is dropped, meaning no one email is treated as more probable than any other.
Re:Bayes Explained by brw215 · 2002-11-03 13:22 · Score: 1

In fact if you bothered to read how the project labels mails, you would have realized my original equation was in fact correct. Pr(D) is never used when the version space is too big. And if you think of trying to calculate the odds of all possible emails, you quickly realize that is not a workable approach. But I'll just quote the authors of the project.......
P(E) is the probability of that specific email occuring.
To calculate which bucket E should go in we need to calculate P(Bi|E) for each of the buckets and find the largest. Since each of those calculations involves the value P(E) we just ignore it and pretend that we need to calculate
P(Bi|E) = P(E|Bi) x P(Bi)

Hope that helps clear up confusion.
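The bucket comparison quoted above can be sketched as follows. This is not POPFile's code: the word-splitting, the add-one smoothing, and the function names `train` and `classify` are all assumptions made for the illustration. It scores each bucket by log P(E|Bi) + log P(Bi) and never computes P(E).

```python
import math
from collections import Counter

def train(examples):
    """examples: iterable of (bucket, text). Returns word counts and message counts per bucket."""
    word_counts, msg_counts = {}, Counter()
    for bucket, text in examples:
        word_counts.setdefault(bucket, Counter()).update(text.lower().split())
        msg_counts[bucket] += 1
    return word_counts, msg_counts

def classify(text, word_counts, msg_counts):
    """Return the bucket Bi with the largest log P(E|Bi) + log P(Bi)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_msgs = sum(msg_counts.values())
    best_bucket, best_score = None, -math.inf
    for bucket, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(msg_counts[bucket] / total_msgs)  # log P(Bi)
        for word in text.lower().split():
            # Add-one smoothing (an assumption) so unseen words do not zero the product.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_bucket, best_score = bucket, score
    return best_bucket

examples = [
    ("spam", "buy cheap pills now"),
    ("spam", "cheap pills cheap"),
    ("work", "meeting agenda for monday"),
    ("work", "project meeting notes"),
]
word_counts, msg_counts = train(examples)
print(classify("cheap pills", word_counts, msg_counts))
```

Because every bucket's score would be divided by the same P(E), leaving it out changes the scores but not which bucket comes out on top.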

The best spam filter. by Anonymous Coward · 2002-11-03 06:19 · Score: 0

If you don't want spam then DONT USE AOL OR HOTMAIL!

Keep your email private and only give it to friends and family. Set up a spamcop account to report any spam that does get through, and never 'remove' an email!

I've never received a single spam in my blueyonder email account and rightly so.

IMAP by Evil+Adrian · 2002-11-03 06:22 · Score: 2, Insightful

Does anyone know of any spam solutions for IMAP? Everything I've seen out there is POP3, but goddammit I like my IMAP folders!!! (Not to mention that the server on which my e-mail resides gets backed up nightly...)

--
evil adrian

Re:IMAP by LetterJ · 2002-11-03 07:05 · Score: 2

If you use SquirrelMail, you can use a Bayes spam filter from the Squirrelmail plugin page.

--

The Glass is Too Big: My Take on Things
Re:IMAP by uksv29 · 2002-11-03 07:14 · Score: 1

You are probably asking the wrong question. If you are using a *nix system then you can use procmail to check and sort your mail into folders which you can use IMAP to read. In my case I use Spamassassin in my .procmailrc to evaluate the mail, and if it exceeds a score of 10, it throws the mail into its own folder.

Here is an extract from my .procmailrc
------------- :0fw | spamassassin -P :0 * ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\* /home/XXXXXXXXX/mail/Spam10 --------------
You should create the mail folder using your mail reader software before installing the .procmailrc as specific information is often held in a dummy mail message at the start of the file.

Remember it is easily possible to lose mail completely if you get this script wrong so test it carefully.
Re:IMAP by RustyTaco · 2002-11-03 07:22 · Score: 1

Run any of the above mentioned filters from your .procmailrc.

- RustyTaco
Re:IMAP by uksv29 · 2002-11-03 07:27 · Score: 1

Oops, slashcode stripped my <pre> tags out...
-------------
:0fw
| spamassassin -P

:0
* ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*
/home/XXXXXXXXX/mail/Spam10
--------------
Re:IMAP by vondo · 2002-11-03 07:39 · Score: 2

Yep. I wrote IMAPAssassin (on sourceforge).
It's a perl script that uses SpamAssassin and runs on any machine as an IMAP client. Spam shows up in your INBOX and disappears shortly thereafter.
People are working on a Bayesian module for SpamAssassin, which will be promising. The great thing about SA (as many others have said) is that it uses a number of inputs to decide if a mail is spam-like, including auto-whitelists which keep track of the people who send you mail.
Re:IMAP by vondo · 2002-11-03 10:22 · Score: 2

The problem is that lots of us don't have root access to our mail servers to install these filters. And sending the mail to a home computer first for filtering is less than satisfactory.

Just guessing this is the poster's issue.
Re:IMAP by Drakonian · 2002-11-03 11:02 · Score: 1

Good question, I'm wondering that myself. Specifically, are there any IMAP filters for Windows so that I can use it at school? (Outlook)

--
Random is the New Order.

Does it use decimal radix ? by Anonymous Coward · 2002-11-03 06:22 · Score: 0

Perl can use hexadecimal. Is there decimal in the source? Then it is evil. Decimal is evil to geeks. Decimal is the Microsoft of radices.

Mozilla integration by Powerdog · 2002-11-03 06:24 · Score: 1

Mozilla has an open bug to integrate Bayesian spam filtering into the next release of the software. Most of the work is done. They're just waiting on incorporation of a message filtering plugin architecture.

Re:Mozilla integration by Anonymous Coward · 2002-11-04 02:41 · Score: 0

So let me get this right, this message filter is almost all done, all they need is architecture to actually write message filter plugins? What kind of microsoft double-speak is that?
I have designed a perfect window replacement for X11, all I need is the architecture for the window system.
Re:Mozilla integration by Anonymous Coward · 2002-11-04 17:59 · Score: 0

the guy said plug-in architecture not architecture. you can replace x11 without any good windowing toolkits. or you could wait to release it the right way. i guess we all know how you'd release it, jackass

Um. No. by 3-State+Bit · 2002-11-03 06:25 · Score: 1

Bad "spam"-like messages are bad. Good spamlike messages are not bad. A good spam-like message I consciously opted in to receive is indistinguishable from a welcome business proposal or newsletter.

Does this system know what businesses I've given my credit card to? Because EVERY ONE of those businesses has a right to e-mail me, so long as there is a clear opt-out link at the bottom of their e-mail.

If I trust a company enough to give it my credit card number, and I like it enough to do business with it, IT HAS A RIGHT TO SEND ME E-MAIL TO INFORM ME OF ITS PRODUCTS, as long as I choose to let it. Good businesses won't abuse the privilege, and I won't end up clicking the opt-out link.

The only thing this system is good for is filtering SOME penile-enlargement shady fly-by-night header-spoofing, open-relay-using shady shamster.

Oh, but that's the ONLY thing that the article defines as SPAM:

Let's take a quick look inside the mind of someone who responds to a spam [sic]. This person is either astonishingly credulous or deeply in denial about their sexual interests. In either case, repulsive or idiotic as the spam seems to us, it is exciting to them.

So this is not spam-filtering software; rather, it's software to filter pornographic messages that fit a certain low-level sales pitch. Lovely.

Robert.

product of marketrons by hfastedge · 2002-11-03 06:26 · Score: 2, Interesting

I don't know if it is true Bayesian

You know, on this issue, you really depress me. You are clearly not of the academic nature, so your stance toward something that's probably way above your head really frustrates part of me.

As long as you're not developing the idea, it shouldn't matter how it works as long as it works.

I read the original article here as you did too. After all the mumbo jumbo about learning, I picked out one effective tip from the article on filtering my email: filter out HTML.

With 1 line of regex I eliminate 95% of my spam:
match and throw it out.

--

-- -- --

Help my mini cause: My journal

Re:product of marketrons by jez9999 · 2002-11-03 06:50 · Score: 1

This may be great if you communicate 100% of the time with people using Unix systems. Unfortunately, quite a few rather stupid e-mail clients (Microsoft Outlook Express, Microsoft Outlook, Microsoft Word, notice a trend?) have HTML e-mail enabled by default, and your average user isn't about to turn it off. So if you're talking to average users, filtering HTML mail is not a good idea.

--
== Jez ==
Do you miss Firefox? Try Pale Moon.
Re:product of marketrons by hfastedge · 2002-11-03 07:06 · Score: 1

i only speak from experience.

There might come a time when I will have to increase the abilities of my filtering system.

My quirk was that he's marketroning on the bayesian buzzword.

--
-- -- --
Help my mini cause: My journal
Re:product of marketrons by Helter · 2002-11-03 07:27 · Score: 1

Of course you're also filtering out all of those outlook and outlook express users who like to "decorate" their email without realizing that it adds html to their mail.

Prepare for an angry phone call from mom/grandfather about why you aren't responding to their emails about the new computer they just got...
Re:product of marketrons by hfastedge · 2002-11-03 07:33 · Score: 1

actually...

take note to my reply to the reply.

But i have plenty of known people that send me html email, like my dad.

I simply filter those as well, but instead of throwing them out, i put them somewhere like "family".

--
-- -- --
Help my mini cause: My journal
Re:product of marketrons by Anonymous Coward · 2002-11-03 07:53 · Score: 0

Yeah, unfortunately I forget to refresh before replying, so I don't see comments that have been added in the past few minutes (or half hour in this case).

oops.
Re:product of marketrons by spitzak · 2002-11-03 09:05 · Score: 2

Filtering out "
Re:product of marketrons by crucini · 2002-11-03 09:41 · Score: 3, Insightful

You know, on this issue, you really depress me. You are clearly not of the academic nature, so your stance toward something that's probably way above your head really frustrates part of me.

I think you may have misunderstood that comment. Since Paul Graham started talking about Bayesian filtering, there's been some tendency here to refer to all learning spam filters as Bayesian. Which results in complaints, which results in the designation "pseudo-Bayesian" for the many independently-discovered learning algorithms that don't have a theoretical underpinning.

Put another way: if an algorithm outputs a dimensionless "score", and the author can't set an upper bound on the score, it's at most pseudo-Bayesian. If it outputs a probability that the message meets certain criteria, then it could be "true Bayesian". Additional implication: the "pseudo-Bayesian" filter may have a stack of rules in addition to its table of probabilities.

I don't think we're splitting hairs on some deep statistical issue. I think we're groping for very rough categories in a new field of application software. If you can establish clearer categories, that might help.

With 1 line of regex I eliminate 95% of my spam: match and throw it out.

Graham addresses this in the article. One can identify most spam with a simple rules-based engine. That tends to make one lazy in reading the spam folder, which means false positives can languish unread. Enhancing the rules-based engine becomes an ongoing project as the volume and cleverness of spam increase. Hopefully Bayesian filtering can automate this.

your brain by Anonymous Coward · 2002-11-03 06:28 · Score: 0

you don't seem to use your brain either asking such questions, why would it be useful to you anyway?

As effective as a well trained secretary by Gribflex · 2002-11-03 06:29 · Score: 1, Insightful

As I understand it, the Bayesian mail filtering system works by:
a) you receiving mail
b) designating where it should go
c) the filter tries to understand your reasoning
d) in the future, before step 1 occurs, the filter tries to interpret whether or not you want the mail based upon statistical analysis of what you have done in the past

Current mail filtering techniques, by contrast, cull your mail on exact specifications (the filter doesn't try to interpret; if it doesn't know, it does nothing).

I quite like the idea of my mail filtering software becoming intelligent over time; however, I can see a potential for email traffic being lost using this method. The Bayesian mail filter is essentially as effective as a (hopefully well trained) secretary. When you first get your secretary, she brings you everything. Then she starts culling the most obvious junk mail. Then she would start examining the normal letters... are they important? Relevant? Is this the person who should be dealing with it?

After time, you have your secretary very well trained, and she culls out everything which is not of immediate importance. In real life, this leads to the following problems:

a) you receive mail from an unknown source which could be important (some guy's discovered a new way to _________) but who isn't credible by your standards. His mail gets tossed aside, or redirected to someone else who probably doesn't care.

b) you receive mail from a trusted source at a bad address. i.e. your son is in Zimbabwe (sp?) on vacation. He sends you a letter postmarked from Zimbabwe, on museum letter head (couldn't find anything else handy). Knowing that you do not have dealings in Zimbabwe, and that this is most likely someone asking for charity, your secretary trashes it.

We've all heard stories of the first example, and it's not too hard to imagine the second. My worry is that, just like a good secretary, my mail filtering software will begin to filter for me. I will lose some control and, for the convenience of not having to hit the delete key a few extra times, I may miss potentially important email.

Chance is never a good thing to bring into your business.

Re:As effective as a well trained secretary by bmwm3nut · 2002-11-03 06:48 · Score: 2, Insightful

but, unlike your secretary not showing you things, you can just set up the filter to put the spam in a spam folder. you can then periodically look at it and see if there are any false positives. or you can tell the filter to delete things that are 95% spam, but put things that are still most likely spam in a special folder. that's what's great about learning algorithms, they can always adapt to what you want (if you teach them enough).
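A minimal sketch of that two-threshold routing, assuming the filter hands back a spam probability. The function name `route`, the folder names, and the 0.5 cutoff for "most likely spam" are illustrative choices, not part of any particular filter.

```python
def route(p_spam, delete_at=0.95, quarantine_at=0.5):
    """Map a spam probability to an action; both thresholds are illustrative."""
    if not 0.0 <= p_spam <= 1.0:
        raise ValueError("p_spam must be a probability")
    if p_spam >= delete_at:
        return "delete"
    if p_spam >= quarantine_at:
        return "spam-folder"
    return "inbox"

for p in (0.99, 0.70, 0.10):
    print(p, route(p))
```

Anyone nervous about false positives can raise `delete_at` to 1.0 so that mail is only ever moved to the folder, never deleted outright.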
Re:As effective as a well trained secretary by Anonymous Coward · 2002-11-03 06:50 · Score: 0

b) you receive mail from a trusted source at a bad address. i.e. your son is in Zimbabwe (sp?) on vacation. He sends you a letter postmarked from Zimbabwe, on museum letter head (couldn't find anything else handy). Knowing that you do not have dealings in Zimbabwe, and that this is most likely someone asking for charity, your secretary trashes it.
If you think your secretary would not know your son was in Zimbabwe you have never had a secretary.
If you ever get one you will experience that she/he quickly knows more about you than your wife does.
Re:As effective as a well trained secretary by jez9999 · 2002-11-03 06:54 · Score: 1

You just echoed exactly what I was thinking.

The problem I have with ANY e-mail filter is that there's always the chance that a genuine useful e-mail will accidentally be trashed. I'm not just saying this just for the sake of identifying a flaw; it's just the way I am that I would always feel twitchy about any e-mail going into a 'trash' folder without me looking at it to confirm it. And if I'm looking at it, I might as well not filter it at all.

And yes, that is how you spell Zimbabwe :-)

--
== Jez ==
Do you miss Firefox? Try Pale Moon.
Re:As effective as a well trained secretary by spitzak · 2002-11-03 09:09 · Score: 2

This is a problem for *any* email filter. I think they are trying to make better solutions for this very problem. Actually solving it is impossible, but this may approximate a solution better than any other filter.
Re:As effective as a well trained secretary by Anonymous Coward · 2002-11-03 09:50 · Score: 0

"Then she would start examining the normal letters... are they important? Relevant? Is this the person who should be dealing with it?"
You are giving way too much leeway to your secretary if you trust his/her judgment on what is "important." If your secretary can make that judgement, then exactly what is it that you do? I know that in law and medicine, shifting the responsibility of communications directed to you to the secretary eventually ends up being called *malpractice*.

Not integrated solution by unfortunateson · 2002-11-03 06:32 · Score: 2, Insightful

What will make this thing work is if it is integrated with the e-mail client.

With this tool, you unfortunately have to manually add a message of a certain classification (work, pr0n, spam, family...) to the program through the Perl script -- very awkward.

A tool like this needs to run as a daemon and 'notice' when a message is added to a folder. Unfortunately, with different formats for e-mail folders, it's a much tougher job.

As it stands, with something like Outlook, I'd have to export each message individually, then run the Perl script. I can probably add a macro to do that (with its own pains -- you add a VBA macro to Outlook and it gripes every time you start up), and possibly even one that responds to filing in a folder.... hmm... maybe I will try this out.

--
Design for Use, not Construction!

Re:Not integrated solution by crisco · 2002-11-03 07:28 · Score: 2

This tool also has a web interface to reclassify mail. Not as good as client integration but a little easier than the command line for the masses.

--
Bleh!

Why hex and binary? by Anonymous Coward · 2002-11-03 06:34 · Score: 0, Troll

I have five fingers on each hand, so I prefer decimal.

If I had four fingers on each hand, I'd prefer octal.

If I had one finger on each hand, I'd prefer binary, but I think I could manage without using my fingers

If we had eight fingers on each hand, we'd prefer hex, but then it wouldn't be hex, because we'd have used a different numerical system, that'd be base 16, but with 16 numbers instead of 10 numbers and 6 characters.

My conclusion: You're stupid, ignorant and not a geek.

Re:Why hex and binary? by Anonymous+DWord · 2002-11-03 06:59 · Score: 1, Offtopic

Wow, you're smart. How can I be like you? The early Mayans used a base 20 system; ancient Romans used a base 12 system. The Babylonian system was a positional base 60 system. How many seconds in a minute? How many minutes in an hour? So much for your decimal.

My conclusion: You're stupid, ignorant and not a geek.

--
"If he thinks he can hide and run from the United States and our allies, he's sorely mistaken." Bush on bin Laden

You know what I'd kill for? by Saint+Aardvark · 2002-11-03 06:34 · Score: 3, Interesting

A version of this for Outlook Express.

I work on the helpdesk of a small ISP; I also take care of the spam filtering, and answer abuse@. We recently added SpamAssassin, and God does it rock. (The big spike you see is me getting MRTG to graph what SA catches now; it's 6-10 times better than what we used to catch.)

But I still get complaints from our customers about spam that gets through. Just the other day a crapload got through because it was relatively subdued spam (no webbugs, NO LINE OF YELLING, etc); unfortunately, it also advertised pictures of young boys having sex. It's hard to explain why it's very, very hard to filter for this sort of thing, especially when I'm going through the talk for the nth time this week. (I need a good analogy that non-geeks can understand; I'm still looking.)

The good folks at DeerSoft have a version of SpamAssassin for Outlook, and are promising one for OE Real Soon Now. But I would loooooooooooooooooooooooove a good spam program -- this or SA or something else -- that I could point our customers to. Download, double-click, say yes, and bam it's installed. I can figure out how to install this on a Unix box; I could probably, eventually figure out how to do it on a Windows box; there's no way the customers could do it.

Or am I missing good, free spam filtering for Windows? Can anyone point me in the right direction?

Slightly OT: There has got to be a huge market for setting up spam filtering for small businesses. My idea: Tell 'em that if they provide the box -- an old Pentium or 486 will do -- I'll set up spam filtering and a firewall on it, set up some maintenance tools (whitelist this, firewall that). They get great mail service, I get $x00.

--
Carousel is a lie!

Re:You know what I'd kill for? by Anonymous Coward · 2002-11-03 06:51 · Score: 0

> A version of this for Outlook Express.

Available. Check out the PopFile manual at http://popfile.sourceforge.net/manual.html

It explains how to configure it for Outlook Express.
Re:You know what I'd kill for? by Anonymous Coward · 2002-11-03 06:52 · Score: 0

what i'd like is a program that identifies spam and bounces an error message back to the sender "recipient unknown" or whatever. that way, the address at some point in time gets deleted from the database.
i know, the reply-to address is often forged, but if this could be done, you could not only block spam, but also reduce it.
Re:You know what I'd kill for? by Saint+Aardvark · 2002-11-03 06:57 · Score: 1

Super-sweet! Thanks for the link.

--
Carousel is a lie!
Re:You know what I'd kill for? by bstadil · 2002-11-03 06:57 · Score: 3, Informative

You know what I'd kill for?
It might be smarter to read the article, than killing someone.
You could have installed the program for Outlook in the time it took you to type your rant, but then you would not get any Mod point would you.

--
Help fight continental drift.
Re:You know what I'd kill for? by Saint+Aardvark · 2002-11-03 07:02 · Score: 1

Well shet my mouth, you're right...my fault: I assumed that "Perl script" meant Unix-only (or at least -mostly). But that was hardly a rant; if I had ranted, you wouldn't be left standing. :-)

--
Carousel is a lie!
Re:You know what I'd kill for? by Saint+Aardvark · 2002-11-03 07:18 · Score: 1
Okay, so now that I've read the manual...close, *very* close to what I'm after. But for my customers it's not quite there yet:
- They'll have to install Perl as well
- and they have to run a command-line program (if I'm reading the manual right; haven't run Perl on Windows, so feel free to correct me) to train it.
Keep in mind that I'm after something I can recommend to retirees and soccer moms; the few times I have to send 'em to the command line are inevitably painful (and why not? it's not like they've ever needed to use it before, or that they have any idea what this big black window you can't click in is supposed to do -- it's completely outside what they know. And no, that's not a flame -- they're not bad people for not knowing what to do with a DOS prompt).
These folks will have kept their good mail (probably), but who the hell keeps spam? Which means they'll need to train it repeatedly as spam comes in (if I'm reading the man. right) and that means repeated command line work...and that's not gonna fly with most people I work for/help out.
--
Carousel is a lie!
Re:You know what I'd kill for? by crisco · 2002-11-03 07:19 · Score: 2

As others have pointed out, that's exactly what POPFile is. Unfortunately it is not yet a point and click kind of install but that is the direction it is heading.

--
Bleh!
Re:You know what I'd kill for? by Anonymous Coward · 2002-11-03 07:33 · Score: 0

It's hard to explain why it's very, very hard to filter for this sort of thing, especially when I'm going through the talk for the nth time this week. (I need a good analogy that non-geeks can understand; I'm still looking.)
How about comparing it to telemarketing calls or telephone scams except the caller can fake the caller id? Most of the calls are obviously "do you need new siding", but some are harder to figure out. I might be able to convince you I'm a legitimate caller and have you hand the phone to your spouse. ("Hi, I'm a old friend of $spouse, is he/she there?")
Re:You know what I'd kill for? by Helter · 2002-11-03 07:36 · Score: 1

Cloudmark is good, and in conjunction with ISP filtering would probably get pretty much everything.
I don't know if it's available for Outlook Express yet, but that's supposedly on the "to do" list.
Re:You know what I'd kill for? by bstadil · 2002-11-03 07:42 · Score: 1

I assumed that "Perl script" meant Unix-only (or at least -mostly)
That is interesting, as I did the same. I was surprised that the three examples of installation were Windows email clients. I think that a lot of people have the same mental link, Perl == Unix. What could (should?) be done to change this?

--
Help fight continental drift.
Re:You know what I'd kill for? by Anonymous Coward · 2002-11-03 08:01 · Score: 0

Why put it on your client? The proper place for spam filtering is on the server, which is where CanIt works.
For ISP's, we're coming out with CanIt-PRO, which allows all your customers to have their own settings and preferences, without installing software on their PC's. And it works regardless of which mail client they use.
Re:You know what I'd kill for? by Anonymous Coward · 2002-11-03 08:29 · Score: 0

Apple's Mail.app has a bounce to sender button, but in practice 90% of spams lack a legitimate sender address.
Re:You know what I'd kill for? by Webmonger · 2002-11-03 08:54 · Score: 2

This tool does work with windows. It's probably also possible to set it up as an alternate mail server for your users.
Re:You know what I'd kill for? by q2k · 2002-11-03 13:01 · Score: 2

You could try pointing customers to http://www.pocomail.com as their Windows email client. Not only does it solve all the MS Email Viri issues, but the built in spam filtering is pretty damn good. I'm not sure what the theoretical underpinnings are, but out of the box it will put most spam in the junk mail folder, and with some tweaking it will get 90% of it.
Re:You know what I'd kill for? by keithww · 2002-11-04 02:51 · Score: 1

I have played with Norton Internet Security 2003 and it has pretty good spam filtering routines. I also run Pocomail to set up for what Norton misses, but NIS will get about 95 percent of my spam. I have had the same email address for over 8 years and I get between 60 and 200 pieces of spam a day. It is nice to have to only deal with 3 to 10 pieces of spam. I am migrating to my own domain, and will use temporary email addresses. Have people found that this helps with the spam?

Re:Um. No. by judd · 2002-11-03 06:36 · Score: 2

I think you have failed to understand how the filter works.

It is "trained" on a corpus of spam, which is compared to a corpus of known good messages. The important part is that YOU, the user, supply the spam corpus and the good messages. Thus in your case, as long as your "good spamlike messages" are in your "known good pile", similar new ones from the same source will not be tagged as spam. This is where the statistical approach shines over simple keyword matching.

Go on, read about how it works. You might learn something.
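
For anyone who wants to see the "trained on your own two piles" idea in code, here is a minimal sketch in Python (not the Perl of the actual tool). The function names, the tiny corpora, and the add-one smoothing are all assumptions made for illustration; this is a generic naive Bayes toy, not POPFile's or Paul Graham's exact formula.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())

def train(messages):
    """Count how many messages each token appears in."""
    counts = Counter()
    for msg in messages:
        counts.update(set(tokenize(msg)))
    return counts

def spam_probability(text, spam_counts, good_counts, n_spam, n_good):
    """Naive Bayes with add-one smoothing, computed in log space."""
    log_spam = math.log(n_spam / (n_spam + n_good))
    log_good = math.log(n_good / (n_spam + n_good))
    for tok in set(tokenize(text)):
        log_spam += math.log((spam_counts[tok] + 1) / (n_spam + 2))
        log_good += math.log((good_counts[tok] + 1) / (n_good + 2))
    # Convert the two log scores back into P(spam | words)
    return 1 / (1 + math.exp(log_good - log_spam))

# Hypothetical user-supplied piles: YOU decide what goes where.
SPAM = ["Buy cheap pills now",
        "cheap mortgage offer click now",
        "click here for free pills"]
GOOD = ["Your Amazon order has shipped",
        "special offer on books from Amazon",
        "meeting notes for Monday"]

spam_counts, good_counts = train(SPAM), train(GOOD)
```

Note that "offer" shows up in both piles, so it carries no weight either way; a message is pushed toward spam or non-spam by the words that appear in only one of your piles.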

SquirrelMail has a Bayesian plug-in by ptbarnett · 2002-11-03 06:37 · Score: 4, Informative

Plugins - BayesSpam - Intelligent Spam Filter

SquirrelMail is a WebMail client implemented in PHP. I use the client, but not the plugin (I use Razor).

Re:Um. No. by jjo · 2002-11-03 06:38 · Score: 2

Well, if all spam is indistinguishable from the legitimate spamlike messages you want to see, then no filter will help you.

However, it seems more likely that a large proportion of spam is distinguishable from mail you want to see. It's quite plausible that you don't want to see messages about nympho sluts, or penis enlargement, or breast enlargement (or at least not all three), and that a naive Bayesian filter could easily distinguish these and other spams from mail you do want to see.

CRAP by gnillort · 2002-11-03 06:38 · Score: 0, Offtopic

www.goatse.cx is a bad site!
don't go there!

Normal people.. by egarland · 2002-11-03 06:40 · Score: 1

have the ability to learn new things.

--
set softtabstop=4 shiftwidth=4 expandtab nocp worlddomination

Re:Normal people.. by Ozymandias_KoK · 2002-11-03 06:56 · Score: 1

Heh...your point is debatable. :)

Then again maybe I am confusing normal with average. But they should be the same, dammit!

Uhmm.. like bogofilter? by Jamuraa · 2002-11-03 06:42 · Score: 3, Informative

Bogofilter has been out since August, and does this Bayesian spam-stuff in C, which probably will run a bit faster than the Perl or Python versions just because of its compiled-ness. I've never run it myself, but people on Debian lists say it works better than, or not as well as, SpamAssassin.

--
You can't see this if you have sigs turned off.

Re:Um. No. by Fastolfe · 2002-11-03 06:43 · Score: 1

Please read the article. Classification of messages is done by you. If you are routinely receiving pitches that you both solicit and arrive unsolicited, it might have a hard time differentiating, sure, but keep in mind that spam filtering is just one form of classification that can be performed here.

If you choose to set up a spam classification, and routinely file penis enlargement ads, the system will quickly learn that e-mails with words common to penis enlargement ads are generally going to always be classified as spam, and will file it as such. Other pieces of e-mail that share content with "legitimate" ads may be misfiled in your "legitimate pitches" folder.

You can set this up however you want it. It learns by remembering the words in messages you manually classify, so you are not taking their definition of "spam". You are setting up a classification that you call "spam" and it's keeping track of the types of things you put in there. It will then apply that to future messages.
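
To make the "it's just classification into whatever buckets you define" point concrete, here is a rough Python sketch. The `BucketClassifier` class and its methods are hypothetical names invented for this example, and the scoring is plain word counting with add-one smoothing (no prior on bucket size), not the tool's real algorithm.

```python
import math
from collections import Counter, defaultdict

class BucketClassifier:
    """Toy word-count classifier with user-defined buckets (hypothetical API)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.totals = Counter()                  # bucket -> total words seen

    def train(self, bucket, text):
        """Record the words of a message the user filed under `bucket`."""
        words = text.lower().split()
        self.word_counts[bucket].update(words)
        self.totals[bucket] += len(words)

    def classify(self, text):
        """Return the bucket whose word statistics best match `text`."""
        vocab = {w for c in self.word_counts.values() for w in c}
        best, best_score = None, -math.inf
        for bucket, counts in self.word_counts.items():
            score = 0.0
            for w in text.lower().split():
                # Add-one smoothing so unseen words do not zero out a bucket
                score += math.log((counts[w] + 1) / (self.totals[bucket] + len(vocab)))
            if score > best_score:
                best, best_score = bucket, score
        return best

clf = BucketClassifier()
clf.train("spam", "enlargement pills cheap offer")
clf.train("work", "quarterly report meeting agenda")
clf.train("family", "grandma birthday dinner sunday")
```

Nothing in there knows what "spam" means; it is just the name of one of the buckets you chose to fill.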

Professional Looking Spam May Be Impossible by Bob9113 · 2002-11-03 06:44 · Score: 4, Insightful

This may be self-regulating. Consider the Skinner box; if something is capable of perfectly emulating recognition of Chinese, then it can be said to recognize Chinese. Likewise, if a spammer becomes sufficiently skilled at writing undetectable prose, he or she will have reached a skill level at which he or she can pursue more profitable writing ventures. The margins in spam are pretty small. Those spams are being written by morons because morons are cheap.

--
Stop-Prism.org: Opt Out of Surveillance

Re:Professional Looking Spam May Be Impossible by ceswiedler · 2002-11-03 07:34 · Score: 3, Informative

I don't think you're talking about the Skinner box, which is a device used in the psychology of learning, but rather the Chinese room, which is John Searle's take on AI and the Turing test.
Re:Professional Looking Spam May Be Impossible by Anonymous Coward · 2002-11-03 12:09 · Score: 1, Interesting

Actually, in my experience, spam is written by very intelligent people to look a very specific way to reach a very specific audience.

There is nothing accidental or slap-dash about the layout, or use of colour, or any of the factors involved in laying out an email that will generate sales. I know this because it's my job to know about - I'm in the porn business.

You might hate spam - I know I do - but it works. It works very well. And the way the email looks makes it work best of all.
Re:Professional Looking Spam May Be Impossible by SubtleNuance · 2002-11-03 14:33 · Score: 1

Likewise, if a spammer becomes sufficiently skilled at writing undetectable prose, he or she will have reached a skill level at which he or she can pursue more profitable writing ventures. The margins in spam are pretty small.

Now join us in real close /. newbies, and see the rare wonder of a full-size troll at his best. Here in the wilds of the Slashdot forums, so much pure noise cannot possibly keep itself clean from ill-logic and Trolling.

Notice the pseudo-academic (self-regulating/skinner box/prose) mixed w/ common microeconomics (profitable ventures/margins..small). PURE GENIUS!

Now, look at the wonderful finish, the pièce de résistance, the last jibe and challenge "those spams are being written by morons because morons are cheap" playfully challenging you to disagree, but offering up the Spammers to insult and daring you to contradict; lest you too may be labeled a "moron".... as if to say "They sure are stupid -- don't you agree?"

lovely, excellent work.
Re:Professional Looking Spam May Be Impossible by Bob9113 · 2002-11-04 02:36 · Score: 2

Ahh yes, thank you for the correction!

--
Stop-Prism.org: Opt Out of Surveillance

Risk management by hansroy · 2002-11-03 06:44 · Score: 1

Finally, paying attention in those statistics & risk management courses pays off!

Statistics are cool. by Fuzzums · 2002-11-03 06:46 · Score: 1

I wrote a simple script to recognize languages by their letter frequencies. [http://www.fuzzums.nl/talenknobbel/].
this method isn't very strong, but with a fair amount of input the results get better. it even recognised the difference between dutch and a dutch dialect. the problem was that the alphabet only has 26 characters, so i came up with the idea of using letter pairs.

when i read the article it was really funny. the methods he uses are almost the same as my method. and when i read about using word pairs: LOL.

this will be a very cool spam-filter. i love it already.
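
a rough sketch of the letter-pair idea in Python, for the curious. this is not the script behind the link above; the function names, the overlap score and the two short reference sentences are assumptions picked for the example, and real profiles would need far more text.

```python
from collections import Counter

def bigram_profile(text):
    """Relative frequency of each adjacent letter pair in the text."""
    letters = [ch for ch in text.lower() if ch.isalpha() or ch == " "]
    pairs = Counter(a + b for a, b in zip(letters, letters[1:]))
    total = sum(pairs.values())
    return {pair: n / total for pair, n in pairs.items()}

def similarity(p, q):
    """Overlap of two profiles: sum of the smaller frequency for each shared pair."""
    return sum(min(f, q[pair]) for pair, f in p.items() if pair in q)

def guess_language(text, profiles):
    """Pick the language whose reference profile overlaps most with the text."""
    sample = bigram_profile(text)
    return max(profiles, key=lambda lang: similarity(sample, profiles[lang]))

# Tiny illustrative reference texts (assumed, far too small for real use)
PROFILES = {
    "english": bigram_profile(
        "the quick brown fox jumps over the lazy dog and then "
        "the cat sat on the mat with the other things"),
    "dutch": bigram_profile(
        "de snelle bruine vos springt over de luie hond en dan "
        "zit de kat op de mat met het andere ding"),
}
```

with single letters there are only 26 features to compare; with pairs (including the space) there are several hundred, which is why the extra input pays off.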

--
Privacy is terrorism.

Ximian Evolution? by Namtar · 2002-11-03 06:47 · Score: 1

This looks really good. Anyone out there know how/if it can be used with ximian evolution?

--
Linux. Because a 386 is a terrible thing to waste.

Re:Ximian Evolution? by rgmoore · 2002-11-03 07:40 · Score: 2, Informative

With some cleverness, you can use any outside filter with the most recent version (i.e. the development fork) of Evolution. They've added the ability to pipe incoming messages to an outside program and read back the exit code. So if the program is written using standard Unixisms- i.e. it reads on standard input and returns a different value depending on whether the incoming message is spam or not- it can be used with Evolution. I know that bogofilter can do this because I'm using it with Evolution and it works pretty well.
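
As a sketch of what "reads on standard input and returns a different value" looks like, here is a toy filter in Python. The word list and the two-hit rule are stand-ins for a real trained classifier, and the exit-status convention (0 for spam, 1 for not spam) is an assumption; check what your mail client and your filter actually expect.

```python
import sys

# Hypothetical word list; a real filter would consult its trained database.
SUSPICIOUS = {"viagra", "mortgage", "unsubscribe"}

def looks_like_spam(message):
    """Crude stand-in for a real classifier: flag two or more suspicious words."""
    words = set(message.lower().split())
    return len(words & SUSPICIOUS) >= 2

def main(stream=None):
    """Read one message from the stream and return the exit status to use."""
    stream = sys.stdin if stream is None else stream
    message = stream.read()
    # Assumed convention: 0 means "spam", 1 means "not spam".
    return 0 if looks_like_spam(message) else 1

# When saved as a script, finish the file with: sys.exit(main())
```

The mail client only ever sees the exit status, so the filter behind it can be swapped without touching the client-side rule.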

--
There's no point in questioning authority if you aren't going to listen to the answers.

Oops, screwed up the URL... by cmeans · 2002-11-03 06:47 · Score: 2

Apache Jakarta James is at http://jakarta.apache.org/james.

--
Give a hand, not a hand-out.

perlcc by Camel+Pilot · 2002-11-03 06:53 · Score: 3, Insightful

I just received the November edition of the TPJ which included a fine article "perlcc & Compiling Perl Script".

In short, the filter script could be compiled to C and built to a native binary for a variety of platforms eliminating the need for a Perl interpreter.

Spam will be spam by dazdaz · 2002-11-03 06:53 · Score: 1

I get tired of copying and pasting spam emails into spamcop from the same ISPs. I use The Bat! quite a lot, any suggestions?

Re:Spam will be spam by acceleriter · 2002-11-03 07:17 · Score: 2

I try the creative step of prepending common Chinese names, e.g. zhao@chinacenter.com, chen@chinacenter.com, lchen@chinacenter.com, chang@chinacenter.com. Along with a nice "Thank you" for the beautiful picture of the Dalai Lama they sent me, and good wishes that the freedom of information contrary to the PRC's politics continues.

--
CEE5210S The signal SIGHUP was received.
Re:Spam will be spam by Anonymous Coward · 2002-11-03 10:38 · Score: 0

http://www.silverstones.com/thebat/Library.html helps with submitting to spamcop, but I've found a few regexs can cut my spam down to pretty much zero.

Re:The decimal issue by Spock+the+Baptist · 2002-11-03 06:58 · Score: 2

One of my pet peeves is the obsession that folks have with zeros. An example is the year 2000. In base 10 you get beaucoup zeros whereas with hex you get 7D0, or 11A8 (base 12), or 3720 (octal), or 11111010000 (binary). Zeros are an artifice of both the base, and numeral system used to represent a pure number. Thus, the fact that most humans use the decimal Indo-Arabic numeral system to represent it is the only reason for all those zeros. Use another base, or numeral system to represent 2000, you don't get beaucoup zeros.

The real properties of pure numbers are the relationships that they have with other numbers, and not the symbology used to represent them.
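
If you want to check the representations yourself, a short Python sketch using repeated division does it. The helper name `to_base` and the digit alphabet (letters after 9) are choices made for this example only.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base(n, base):
    """Render a non-negative integer in the given base (2 to 36)."""
    if not 2 <= base <= 36:
        raise ValueError("base must be between 2 and 36")
    if n == 0:
        return "0"
    out = []
    while n:
        # Peel off the least significant digit each time round
        n, rem = divmod(n, base)
        out.append(DIGITS[rem])
    return "".join(reversed(out))

# Show the year 2000 in a handful of bases
for base in (2, 8, 10, 12, 16):
    print(base, to_base(2000, base))
```

The trailing zeros appear only when the base divides the number repeatedly, which is the whole point: they belong to the notation, not to the number.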

--
"Oh drat these computers, they're so naughty and so complex, I could pinch them." --Marvin the Martian

Re:But secretaries use decimal . by Anonymous Coward · 2002-11-03 06:59 · Score: 0

And nerds hate decimal. So, use hexadecimal, not decimal. Computers are good, they use hexadecimal.

46 75 63 6B 20 6F 66 66 2E

Staged Categories by irritating+environme · 2002-11-03 07:05 · Score: 2, Interesting

An advertised false positive rate of 0% is nice, but why not additional research into the spam, to attempt to categorize into blatant spam, probable spam, borderline, and non-spam, and see if false positives can be plopped into the borderline categories.

Also, from what I saw in the article, there will already be a next level that spam can take: image-based messages, misspellings of key words (klik, Clic, Clik, etc), using 0xfe0000 for almost-bright-red.
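
The staged categories could sit on top of whatever probability the filter already produces. A minimal Python sketch, assuming the filter hands back a number between 0 and 1; the function name and the cut-off values are invented for illustration and are not tuned.

```python
def stage(spam_probability, blatant=0.99, probable=0.90, borderline=0.50):
    """Map a spam probability to one of four categories.

    The cut-off values are illustrative assumptions, not tuned numbers.
    """
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if spam_probability >= blatant:
        return "blatant spam"
    if spam_probability >= probable:
        return "probable spam"
    if spam_probability >= borderline:
        return "borderline"
    return "non-spam"
```

Only the "blatant" bin would be safe to delete unseen; the middle bins are where you would hope any false positives end up.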

--

Hey, I'm just your average shit and piss factory.

Re:Um. No. by rgmoore · 2002-11-03 07:05 · Score: 2

You're wrong, though. The whole point of this kind of filter is that it develops its rules based on the information that you give it, not what somebody else thinks. If you tell it that mails from your legitimate business partners aren't spam, it learns to tell them apart. I use a Bayesian filter on my mail, and it has no trouble telling my legitimate business mail, like messages from Amazon about books I've been waiting for, from illegitimate ones. Some of that is that the legitimate mail is written with a very different style from the illegitimate stuff, but I assume that the filter has also learned that mail with amazon.com as the sender is OK. In any case, I find that it just plain works.

--

There's no point in questioning authority if you aren't going to listen to the answers.

Where's the news? by Roadmaster · 2002-11-03 07:09 · Score: 4, Informative

Just because it's the first one that actually makes the slashdot frontpage it doesn't mean it's the only one.

Do a freshmeat search for bayespam, bogofilter and spamprobe, they're all working and quite mature bayesian filters (or should we say "paulgrahamian" in order to appease the "true bayesian" crowd). Hell, even a search for "bayes" will turn out a few more hits, like ifilter, which aims to automatically classify mail in different folders, but could be easily tuned to filter out spam.

Of these, I think spamprobe is becoming the true "swiss army knife" of "bayesian" filtering; I did find both bogofilter and bayespam spartan, but they work well. spamprobe, on the other hand, is very actively maintained, is under constant improvement by the author, Brian Burton, and has given me excellent results getting rid of over 90% of my spam.

Good in combination with spamassassin? by Moritz+Moeller+-+Her · 2002-11-03 07:12 · Score: 2

I am just about to put bogofilter in my mail filtering system. I am thinking about combining this baby with spamassassin, as described here:
http://www.randomhacks.net/2002/09/23/#using-bogofilter-with-spam-assassin

I will use the pass through option and I can use spamassassin to protect against false positives and to adjust the sensitivity.

BTW: Does anyone know if the number of SPAM and nonSPAM have to be about equivalent or is this accounted for? I have 4000 spam mails in a folder, but just about 500 nonspam mails.

--
Moritz

Re:Good in combination with spamassassin? by Matts · 2002-11-03 08:37 · Score: 2

FWIW, SpamAssassin 2.50 will include a statistical filter that works like similar bayesian filters.

It should be pretty cool, in that it will automatically train on spamassassin results, as well as allowing you to add or remove spam and non-spams.

Matt (a spamassassin developer)

--

Matt. Want XML + Apache + Stylesheets? Get AxKit.
Re:Good in combination with spamassassin? by Anonymous Coward · 2002-11-04 04:08 · Score: 0

Yes, I'm curious whether any of these things actually work *better* than SpamAssassin -- though if SA is getting a Bayesian filter, I guess it doesn't matter. SA works great for me, requiring just a teeny bit of manual tuning (which would, admittedly, not be great for the masses).

Developers missed this... by bigberk · 2002-11-03 07:15 · Score: 3, Insightful

In my testing (over the last 30 mins) I discovered that filtering is employed when the POP3 "RETR" (retrieve entire message) command is used but no filtering is done when the equally useful "TOP" (show me the headers and X lines of the body) command is issued by a client.

A huge advantage of also doing the filtering for the TOP command would be that mail clients such as The Bat, Pimmy, JBMail and PocoMail will let you preview all headers while leaving mail on the server (or deleting it, whatever) but without actually downloading the full message bodies.
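
Roughly what I have in mind for the proxy, as a Python sketch. This is not POPFile's code (which is Perl); the function names and the `classify_top` switch are hypothetical, and only the command shapes (RETR with a message number, TOP with a message number and a line count) come from the POP3 spec.

```python
def parse_pop3_command(line):
    """Split a client line such as 'TOP 3 20' into ('TOP', ['3', '20'])."""
    parts = line.strip().split()
    if not parts:
        return "", []
    return parts[0].upper(), parts[1:]

def should_classify(line, classify_top=True):
    """Decide whether the proxy should run the filter on the server's reply."""
    verb, args = parse_pop3_command(line)
    if verb == "RETR" and len(args) == 1:
        return True
    # TOP takes a message number and a count of body lines (RFC 1939).
    if verb == "TOP" and len(args) == 2:
        return classify_top
    return False
```

With `classify_top=False` you get the behaviour I observed; flipping it on is the change I'm asking for, at the cost of classifying on less text.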

Re:Developers missed this... by crisco · 2002-11-03 07:58 · Score: 2

That's a good idea. Message classification would get less accurate on just the headers or headers+top of message but that might be enough to avoid downloading spam (biggest drawback to POPFile, you still download the spam, only to delete it).

--
Bleh!
Re:Developers missed this... by bigberk · 2002-11-03 08:23 · Score: 1

Message classification would get less accurate on just the headers or headers+top of message

Does POPFile do any caching of messages? If it does, then it might actually be worth RETRieveing the whole message even when the client sends just a TOP, presumably because they will want a RETR later anyway. But yeah, this gets a bit weird!
Re:Developers missed this... by crisco · 2002-11-03 08:33 · Score: 2

No, not presently. It seems the author wants to keep it as simple as possible. However, as it matures it might be great to look at all the different ways people use mail and mail clients and start making allowance for what people like to do.

--
Bleh!
Re:Developers missed this... by Anonymous Coward · 2002-11-03 11:14 · Score: 0

This would increase the incidence of false positives, because the Bayesian filters also look for the presence of "good" words. If for example your filter learned that things with X-RBL-Warning headers were "bad", that letter from your Aunt Thelma at AOL is gonna bounce unless the filter is also allowed to read the message body.

Easy filter by kraf · 2002-11-03 07:17 · Score: 1

filter everything that only has a text/html attachment
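
A sketch of that rule using Python's standard `email` package. The function name is made up, and "only text/html" is read here as "every non-container MIME part is text/html", which is one possible interpretation.

```python
from email import message_from_string

def is_html_only(raw_message):
    """True if every leaf part of the message is text/html."""
    msg = message_from_string(raw_message)
    # walk() yields the message itself plus all nested parts
    leaves = [part for part in msg.walk() if not part.is_multipart()]
    return bool(leaves) and all(p.get_content_type() == "text/html" for p in leaves)

# Hand-written sample messages for illustration
HTML_ONLY = "From: a@example.com\nContent-Type: text/html\n\n<b>Buy now</b>\n"
PLAIN = "From: b@example.com\nContent-Type: text/plain\n\nHello\n"
MIXED = (
    "From: c@example.com\nMIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XX"\n\n'
    "--XX\nContent-Type: text/plain\n\nHello\n"
    "--XX\nContent-Type: text/html\n\n<p>Hello</p>\n"
    "--XX--\n"
)
```

A message that carries both a plain and an HTML alternative passes, so people whose clients send both are not caught.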

Re:Um. No. by kirkjobsluder · 2002-11-03 07:19 · Score: 1

I don't think that you understand how this form of filtering works.

1: You decide what content is spam and what content is not spam because you train the filter. One of the things that I disliked about SpamAssassin was its tendency to mark conference announcements as spam. I don't have this problem with my pseudo-Bayesian filter because it recognizes that mail about education tends to be good while mail about mortgages, pot and penis enlargement tends to be bad.

2: Perhaps more importantly the filter not only checks for trusted content, but trusted sources and routes. If honestcorp.com never sends you spam, then honestcorp.com becomes a trusted route for email.

outlook integration by jdkane · 2002-11-03 07:21 · Score: 1

As far as Outlook integration goes, for the time it takes me to drag and drop all my messages to disk, I might as well delete the junk mail manually using a much-less statistical approach. :)

Of course it looks like Outlook is Outnumbered here.

Is this intended for server, client, or both? by Rooney444 · 2002-11-03 07:29 · Score: 3, Insightful

If this is only intended for client side use then it still doesn't address the issue of all the bandwidth that spam wastes. Wouldn't it just be a better project to help all the idiots close the open relays on their servers? Or maybe require authentication on all SMTP servers?

Re:Is this intended for server, client, or both? by dzym · 2002-11-03 07:38 · Score: 4, Informative

Yes, but remember, who runs the SMTP servers?
The very design of the whole system specifies that anyone can just turn on a machine, hook it up to a network somewhere, and start spewing out messages to smtp ports all over the world.
It doesn't have to be a sendmail, qmail, or exim server, remember. Some Windows viruses have taken advantage of that loophole to set up mini-SMTP servers in the network stack to continue propagating viruses without needing to connect to anything that provides authenticated external relay.
Re:Is this intended for server, client, or both? by tsg · 2002-11-03 09:09 · Score: 2

It's another tool to use. Just because it doesn't solve the entire spam problem doesn't make it useless.

The main issue of spam for most end users is that they have to waste time wading through the spam to get to their real email. This filter does most of that for them.

There isn't going to be one solution to end spam completely. It's going to take a lot of people nibbling away at the problem until it becomes more bother to send spam than it's worth. My father always said, "Do you know how to eat an elephant? One bite at a time." This is another bite out of the elephant.

--
People's desire to believe they are right is much stronger than their desire to be right.

Re:Um. No. by dvdeug · 2002-11-03 07:30 · Score: 2

Does this system know what businesses I've given my credit card to?

Do you understand what a bayesian filter does? It tries to figure out what you consider spam. I don't like dentists sending me advertising junk; bogofilter trashes it. Anything about Esperanto or Project Gutenberg or Linux could probably fly on through, as it's got a lot of words that actually appear in my good email in it. At worst, a couple messages from that business get caught, and then it will recognize that the messages are good based on sender and embedded URL's.

In any case, there tends to be a huge difference between the messages I've got from companies I've given my credit card to and the ones that are sending me spam. Usually, one is quietly informing me of new items for sale, and one is screaming about crap. A bayesian filter can often tell the difference.

what is the point then? by zogger · 2002-11-03 07:32 · Score: 1

--I understand wanting to filter spam, but with all these techniques, as near as I can see, you still have to read all the spam: A, to make sure the filter is learning to find spam, and B, to make sure that your automatic filter didn't throw out important mail. Umm, what's the point then? Isn't this an example of the department of redundancy department?

Re:what is the point then? by rgmoore · 2002-11-03 07:55 · Score: 2, Insightful

Well, there are potentially three points. One is that hopefully after a while the filter will work well enough that you can develop some real confidence in it and you won't have to check every time to see that it's working right. I'm pretty close to that point with bogofilter; I so rarely see any false positives that I can almost afford to flush the messages without checking. Actually, I assume that what I'll really do is to change the rules a bit so that alleged spam is sent to a waiting folder and doesn't even show up in my main inbox.
That gets to point two: now I'll be able to check for spam in batch mode. Instead of going through my inbox every time I look for messages, marking some as spam and reading others, I'll be able to read just about everything in my inbox without worrying about spam. Then once a week I can check my spam box and see if there's actually anything legitimate there. This is going to be faster than doing it every time a new message shows up in my inbox.
I'm not a compulsive mail reader, but for some people this would also be really useful because it would protect them from distractions. They are working on something and then their mailbox beeps them to let them know that a message has arrived. Unfortunately, when they check it out it turns out that their train of thought has been needlessly disrupted by another spam. If they can filter out the spam before the notification while still being alerted promptly when a real message shows up, that's a big win.

--
There's no point in questioning authority if you aren't going to listen to the answers.
Re:what is the point then? by marmoset · 2002-11-03 08:00 · Score: 1

Not really. My mail client automatically sorts suspected spam into a "Junk" box and, um, colors it brown [heh]. I just glance in the box once a day (at the Subject lines and Senders), which is almost always enough to tell me whether there are any false positives or not (i.e. are any of the senders known acquaintances, do any of the subject lines correspond to any active projects). Since the spam filter is the last one that runs against my incoming messages [earlier filters take care of legitimate mailing lists, etc.], false positives are extremely rare: on the order of one or two a week, and I get thousands of emails. Rarely do I actually need to read a message it's flagged to determine whether it's spam or not.
Re:what is the point then? by boots@work · 2002-11-05 10:42 · Score: 1

Here's the point: if I develop confidence in bogofilter and build a good database, then I can use it as an ingress filter on some large (>8000 members) mailing lists that I run. Every spam that gets through there costs us bandwidth, and wastes the time of every subscriber. In addition, spam messages which do get through to the list typically generate additional traffic, either by people complaining about spam, or in bounces from recipients whose servers reject it.

So the return from a good spamstopper is potentially enormous.

Any messages which are false positives can be handled by some other means; e.g. forwarding to a postmaster for review.

Use prisoners for your spam filter. by Anonymous Coward · 2002-11-03 07:38 · Score: 0

Many mail order firms use prisoners to answer phones and take your orders.

An intelligent ISP-level spam filter could consist of sending any message that hits multiple subscribers (set some reasonable threshold) to prison for evaluation.

The prisoner's station would be a screen with one button that approves the message for delivery and one which deletes it.

Re:Um. No. by Anonymous Coward · 2002-11-03 07:41 · Score: 0

Um, kind of.

SPAM is generally undesired email, often with forged header information and without a caring person on the sending end.

For example, I was getting a ton of SPAM email promoting a major credit card (not Visa, AE, or MC). The email that was sent didn't have a real return address. In fact, the email said "don't reply - any messages sent to this email address will be deleted" !!!

The false return address made it impossible for me to have a two-way conversation with the sender. That's not customer service, and that's not even a friendly form of marketing. I don't understand why anyone would expect people to tolerate such marketing nonsense.

I know the organization I work for would NEVER resort to such unsavory tactics.

This kind of SPAM may or may not be illegal. But I don't care about its legality - it's inappropriate behavior, and I refuse to tolerate it.

Image-based spam by Anonymous Coward · 2002-11-03 07:43 · Score: 1, Insightful

Why wouldn't spammers do something like this to circumvent the filter (i.e. simple image-based spam with text that doesn't raise any alarms):

Content-Type: multipart/related;
type="multipart/alternative";
boundary="----=_NextPart...."

This is a multi-part message in MIME format.

------=_NextPart_....
Content-Type: multipart/alternative;
boundary="----=_NextPart_...."

------=_NextPart_....
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

Hi

------=_NextPart_....
Content-Type: image/jpeg;
name="Spam goes here.jpg"
Content-Transfer-Encoding: base64
Content-ID: /9j/4AAQSkZJRgABAgEBLAEsAAD/7QlMUGhvdG9zaG9wIDMuMA A4QklNA+0KUmVzb2x1dGlvbgAA
etc...

Re:Image-based spam by marmoset · 2002-11-03 08:44 · Score: 1

Because if you get enough of these and flag them as spam, the filter will start to catch them, too -- "Hmm... multipart message with no text content and embedded images is usually spam." That's the beauty of the way these filters work.
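One way to picture this: a filter can emit tokens for a message's MIME structure alongside its words, so that "an image plus almost no text" becomes something it can learn. A minimal Python sketch, assuming a hypothetical `structure_tokens` helper, made-up token names, and an arbitrary 20-character cutoff; it is not how any filter named in this thread actually tokenizes:

```python
from email import message_from_string

def structure_tokens(raw_message, min_text_chars=20):
    """Return tokens describing a message's MIME layout (token names are hypothetical)."""
    msg = message_from_string(raw_message)
    tokens = []
    text_chars = 0
    for part in msg.walk():
        ctype = part.get_content_type()
        tokens.append("ctype:" + ctype)
        if ctype == "text/plain":
            # decode=True returns bytes for a non-multipart part
            payload = part.get_payload(decode=True) or b""
            text_chars += len(payload.strip())
    has_image = any(t.startswith("ctype:image/") for t in tokens)
    if has_image and text_chars < min_text_chars:
        # Assumed feature: images present but hardly any readable text
        tokens.append("struct:image-with-almost-no-text")
    return tokens
```

These tokens would be fed to the same training step as ordinary words, so flagging a few such messages as spam raises the spam probability of the structural token itself.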

Same approach works in Lotus Notes! by scottme · 2002-11-03 07:45 · Score: 1

I've been working on a spam tool in Lotus Notes (I know, but it's what we have to use where I work) that uses the same underlying methods. I've designed it to be outboard of the mail database, and it's "pure" Notes so should run on any supported platform.

I have it pretty much working now, and it is uncanny how well it sorts the spam from the rest of the stuff. Using even a very dumb tokenizer, the thing catches 95% or more of the spam, and so far the only false positives have been a result of miscategorized stuff in the input corpus -- i.e. I had filed something as spam that was not spam, and the filter started recognizing similar stuff as spam. That actually looks like one of the main possible failure modes for this approach.

Another of my concerns is that there are so many possible tweaks to these algorithms (mainly various ways of tuning the tokenizer, but also whether to focus on specific elements of messages, what to do with URLs, HTML comments, etc.) that could make a difference to the filter's performance.

I'm seeing a lot of interest from colleagues at work, and I'm starting to share it with them. If/when it feels mature enough, I may be able to get permission to release it to the outside world too. (Mine is a private, one-man project, but done on company time and with company resources, so they get first call on it.)

What about random misspellings? by archeopterix · 2002-11-03 07:46 · Score: 2, Interesting

Hm... what about an anti-anti spam filter that mangles the message inserting random misspellings into the spam-identifying words? The bayesian filter would perceive this as a message consisting of many 'unclassified' words, just like a message in some unknown language. Sure, the short words probably haven't got many possible misspellings (cock, c0ck, coock, cokc - hm... starts to look undecipherable ), so they would probably get classified after some time. And this would hopefully lower the spam success ratio. But the possibility still remains...

Re:What about random misspellings? by PigleT · 2002-11-03 08:21 · Score: 3, Interesting

Dual feedback loops. Every mail that matches spam gets fed back into the system so both the is-spam wordlist AND the is-good wordlists become more "concentrated" over time.
Ifile does this, bogofilter does this with some wangling in procmail, ...

That way, if someone sends something that's still mostly spam (one or two words in common with spam, enough to tip the balance) then all the neutral words will be tarnished as well.

--
~Tim
--
.|` Clouds cross the black moonlight,
Rushing on down to the circle of the turn
Re:What about random misspellings? by archeopterix · 2002-11-03 08:45 · Score: 2, Interesting

Dual feedback loops. Every mail that matches spam gets fed back into the system so both the is-spam wordlist AND the is-good wordlists become more "concentrated" over time. Ifile does this, bogofilter does this with some wangling in procmail, ... That way, if someone sends something that's still mostly spam (one or two words in common with spam, enough to tip the balance) then all the neutral words will be tarnished as well.
This is clever, but might have some undesirable side effects. Suppose a spammer attaches a long list of neutral words to his e-mail in order to 'dilute' the bad words. This way some innocent words might get assigned positive spam probability thus resulting in false positives later.
Re:What about random misspellings? by Dun+Malg · 2002-11-03 12:27 · Score: 2

This is clever, but might have some undesirable side effects. Suppose a spammer attaches a long list of neutral words to his e-mail in order to 'dilute' the bad words. This way some innocent words might get assigned positive spam probability thus resulting in false positives later.

This is a possibility, but if the words are "neutral" then they'll likely show up on the "notSPAM" side too, which keeps them neutral. For an otherwise "good" word to become "bad" it'll either have to never show up as "good" (which means it might as well be bad), or show up in a LOT of spam, and I don't think you'll ever see that level of co-operation between spam spewers.

--
If a job's not worth doing, it's not worth doing right.
Re:What about random misspellings? by wilhelm · 2002-11-04 10:57 · Score: 1

if someone sends something that's still mostly spam... then all the neutral words will be tarnished as well.

The way a filter such as this works is all about word frequency: if a single spam contains a neutralish word, then it becomes more spammy based on the total number of times that word has been seen in all mail. If it's been seen many times in innocent mails, the one-time appearance in a single spam won't taint it much.
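To make the frequency argument concrete, here is a simplified Python sketch of a per-word spam probability computed from corpus counts. The function name, the clamping bounds, and the "unseen word is neutral" rule are assumptions for illustration; this is not the exact formula from Graham's article or from any filter mentioned here.

```python
def word_spam_probability(spam_count, ham_count, n_spam, n_ham):
    """Estimate how strongly a word indicates spam, from training counts.

    spam_count / ham_count: number of times the word was seen in spam / good mail.
    n_spam / n_ham: number of spam / good messages in the training corpus.
    """
    spam_freq = min(1.0, spam_count / n_spam)
    ham_freq = min(1.0, ham_count / n_ham)
    if spam_freq + ham_freq == 0:
        return 0.5  # never seen anywhere: treat as neutral (an assumption)
    p = spam_freq / (spam_freq + ham_freq)
    # Clamp so no single word is ever treated as absolute proof
    return max(0.01, min(0.99, p))
```

With 1000 messages in each corpus, a word seen 400 times in good mail and once in spam still scores at the 0.01 floor, which is the "won't taint it much" effect described above.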

If a word is neutral (that is, it signals a ~0.50 probability of the mail being a spam), it's not going to count much toward the final probability anyway. Graham's prototype filter took only the 15 most interesting words into account, where interestingness is abs(word_probability - 0.50). Words that are truly neutral don't actually matter, just those which are a strong indicator one way or the other.
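The "most interesting words" selection can be sketched in a few lines of Python. The function name is hypothetical, and the combining rule shown (product of probabilities against product of complements) is the usual naive Bayes combination; treat it as an illustration of the idea rather than a copy of anyone's filter. It assumes every input probability is strictly between 0 and 1.

```python
from math import prod

def combine_most_interesting(word_probs, limit=15):
    """Combine the `limit` word probabilities farthest from 0.5 into one score."""
    # Interestingness is the distance from the neutral value 0.5
    interesting = sorted(word_probs, key=lambda p: abs(p - 0.5), reverse=True)[:limit]
    spam_product = prod(interesting)
    ham_product = prod(1.0 - p for p in interesting)
    return spam_product / (spam_product + ham_product)
```

Padding a message with neutral words changes nothing here: a 0.5 multiplies both products by the same factor, so it cancels out of the final ratio.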

I've written a filter of this type myself, and once it's been trained reasonably well, it's quite accurate. I've looked through my word probabilities list, and some of the results are surprising, but they work. My userid actually signals a 31% chance of the mail it appears in being spam, but by the filtering criteria, that's actually a pretty uninteresting word (i.e. close to 50%, which is absolutely neutral, and completely uninteresting). As the number of spams I use to train approaches the number of legitimate mails I use to train, I expect my userid's spam probability will more closely approach 50%.
Re:What about random misspellings? by boots@work · 2002-11-05 11:10 · Score: 1

Suppose a spammer attaches a long list of neutral words to his e-mail in order to 'dilute' the bad words. This way some innocent words might get assigned positive spam probability thus resulting in false positives later.

No, that shouldn't help them.

Broadly, these filters look for "interesting" words: either ones that are often found in spam but not in nonspam, or vice versa.

The word "the" often occurs in both types; the word "slartibartfast" rarely in either. Therefore both of them are neutral.

The word "stop-on-solib-event" has never occurred in my spam yet, but if they started adding it (through a web robot) then eventually the filter would come to think that it was neutral, rather than a nonspam indicator as at present.

Interestingly, misspellings can be a really good indicator of spamminess. Many spams are sent repeatedly, either through multiple addresses, or because they're chain letters, or just because the stupid spammers send them repeatedly. If there's a characteristic misspelling that hasn't occurred anywhere else, it becomes a good way to identify the spam.

Another nice thing about bogofilter (and possibly others) is that it considers origin IPs and domains along with body words. So things like seed.net.tw are likely to be dumped unless they have some other strongly nonspammy words. For me, AOL and Yahoo have slightly spammy smells -- more because they're often forged than because much spam originates there, I think.

Eventually these filters may be defeated, but I think they will work well for a while.

I have a theory that the reason many people want to filter spam is not just because of bandwidth or time, but rather because the moronic writing and presentation is an insult to a thinking reader. It is quite literally junk mail. If by trying to get past these filters spammers have to act more like reasonable humans it's probably a good thing.

This is less of a blunt instrument than DNS blacklists and therefore probably a good thing.

Missing the point? by crisco · 2002-11-03 07:46 · Score: 5, Informative

I think lots of people here are missing the point of POPFile. Everyone is happy to point out that there are already several assorted solutions to Bayesian mail filtering in many different languages. Nearly all of these work on the mail server. Now lots of us are qualified and interested in setting up our own mail server, customizing the mail processing to our own One True Way and happily enjoying an inbox free of spam. But the average Windows user has no idea how to set up a mail server. Others could easily do it but feel their time is better spent on other things, not administering a mail server.

This is what POPFile is for. It's a POP3 proxy server: it sits between your POP3 client and the server and simply adds a classification to the headers (or the subject line for braindead mail clients).
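The tagging step of such a proxy can be sketched as follows. This is a hedged Python illustration, not POPFile's code (POPFile itself is Perl): the `tag_message` function, the default header name, and the `[bucket]` subject format are assumptions, and the classifier that would choose `bucket` is left out entirely.

```python
from email import message_from_string

def tag_message(raw_message, bucket, header="X-Text-Classification", tag_subject=False):
    """Return the message text with a classification header (and optional subject tag)."""
    msg = message_from_string(raw_message)
    del msg[header]  # drop any existing copy; no error if the header is absent
    msg[header] = bucket
    if tag_subject:
        # Fallback for clients that can only filter on the Subject line
        subject = msg.get("Subject", "")
        del msg["Subject"]
        msg["Subject"] = "[%s] %s" % (bucket, subject)
    return msg.as_string()
```

The mail client then needs only an ordinary rule such as "if the header equals spam, move to Junk", which is why the approach works with clients that have no plugin hooks.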

Currently POPFile is a bit rough on computer newbies; it needs a Perl install and such. However, if you read the forums, it is intended to end up as an easily installed executable for Windows users and to remain a nifty little Perl script for the rest of the platforms where it might come in handy. So when those pesky friends and relatives come asking about all the viagra and farmyard spam they get (and you haven't already set them up on your tightly filtered mail server) set up POPFile for them.

Also, it's not just for spam filtering. Think of what you could do if you could go beyond simple rules for your inbox. Want email you think is important forwarded to your phone? Create a category for important email and go through your archives and feed POPFile email you would have wanted forwarded instantly. Create a new folder to receive those mails and watch it for a few days, retraining POPFile until it is getting reasonably good at putting important mail in there. Now set up your mail system to forward those to your phone. Will it work? I don't know, but based on the results I'm getting, it probably would. How about using it to filter help desk emails?

--

Bleh!

Re:Missing the point? by Zuke8675309 · 2002-11-03 11:09 · Score: 2, Informative

Exactly true. POPfile isn't just about filtering spam. It's about sorting email. Slightly different. One could think of the nuance this way - out of all the email you get you could teach POPfile to filter out the GOOD email and delete everything else. I've found POPfile extremely useful for bringing order to the clutter of my inbox. I have buckets for spam, fantasyfootball, personal, and several work related subject matters. I just pull up the web interface, classify the messages properly and POPfile works its magic.

Yahoo! Mail by sfe_software · 2002-11-03 07:48 · Score: 2

No one has mentioned it so far, but Yahoo mail has a Bulk Mail folder. SPAM is automatically sent there, and I have yet to see a single false positive (and false negatives are quite rare as well).

The system works surprisingly well. I checked the FAQ and it doesn't go into any detail about how it works, but I wouldn't doubt if something like this is being used.

I've been thinking, and it seems that this could potentially have a lot of use, aside from Spam filtering. Perhaps a mail client could let you categorize email in general (SPAM, Business-related, forwarded stuff from AOL users, etc), and learn how to spot and organize things.

I'm putting this (either the POPfile or bogofilter) into place with a modified SquirrelMail, just to give it a good run; I might try and modify it to also categorize other types of email, just to see if something like that could work.

I could easily see a mail client (web-based or otherwise) that lets you drag mail to specific folders, and eventually learns how to do this for you (and of course you can always correct it by simply dragging to another folder, which also contributes to the learning process)...

After reading this article my mind is just spinning with ideas... Bayesian search engines... perhaps speech/voice recognition applications... classifying text/html/doc files... organize songs (processing the lyrics)... ugh, I should stop now :)

--
NGWave - Fast Sound Editor for Windows

Re:Yahoo! Mail by q2k · 2002-11-03 13:06 · Score: 2

I think Yahoo is using Brightmail. Earthlink uses it too and I was quite impressed when I used Earthlink. I quickly got to a point where I never logged into the server to check the junk folder.

Bad idea by archeopterix · 2002-11-03 07:52 · Score: 1

Many mail order firms use prisoners to answer phones and take your orders. An intelligent ISP-level spam filter could consist of sending any message that hits multiple subscribers (set some reasonable threshold) to prison for evaluation. The prisoner's station would be a screen with one button that approves the message for delivery and one which deletes it.

This is a bad idea. Many people would commit crimes just to get to prison! Man, watching pr0n samples all day long, it's a dream job!

Re:Um. Yes by crisco · 2002-11-03 07:54 · Score: 2

Try it, you might be surprised.

You can separate the newsletters from the businesses you've opted in to from the penile-enlargement spam. That's one of the beautiful things about POPFile: it isn't just about spam vs useful mail. In fact, it seems to be more accurate and learn faster when you define categories for all the different types of mail you receive, not just spam vs inbox.

--

Bleh!

Re:New open source business-model? by Anonymous Coward · 2002-11-03 07:55 · Score: 0

Yet another business-model!

1: Write free software.
2: ?
3: Be a faggot.
4: Profit!

Bayes by John+Garvin · 2002-11-03 07:56 · Score: 5, Funny

Now we can tell spammers: "All your Bayes are belong to us."

SpamOracle by flockofseagulls · 2002-11-03 08:04 · Score: 1

I've been using SpamOracle with great success for a few weeks. Plays well with procmail. It's based on the Bayesian technique described by Paul Graham.

It's written in OCaml so getting it up and running takes a little work (though not much). Once it's installed the command-line learning interface is quite easy to use.

http://pauillac.inria.fr/~xleroy/software.html#spamoracle

Mail.app by Arker · 2002-11-03 08:05 · Score: 2

The Apple mail client, mentioned in the blurb, works very well with IMAP; that's what impressed me enough that I'm actually using it.

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Friends don't let friends enable ecmascript.

Re:Mail.app by Anonymous Coward · 2002-11-03 08:10 · Score: 0

It shouldn't impress you, .Mac is imap, it would be a horrible shame if their mail client didn't do imap well.

Multi-purpose tool by B'Trey · 2002-11-03 08:10 · Score: 3, Interesting

An interesting idea that I haven't seen discussed is using this concept for more general uses. If we can sort spam from non-spam, how about business from personal? Technical from administrative? All you'd need is multiple databases of word probabilities, the ability to assign emails to multiple categories and a hierarchical method of sorting.
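A rough Python sketch of the multi-category idea: one word-count table per bucket and a naive Bayes score to pick the most likely one. The class and method names are hypothetical, the tokenizer is a bare whitespace split, and the sorting is flat rather than hierarchical, so this only illustrates the "multiple databases of word probabilities" part.

```python
import math
from collections import Counter, defaultdict

class BucketClassifier:
    """Naive Bayes over any number of buckets (a sketch, not a tuned filter)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.message_counts = Counter()          # bucket -> training messages seen

    def train(self, bucket, text):
        self.word_counts[bucket].update(text.lower().split())
        self.message_counts[bucket] += 1

    def classify(self, text):
        words = text.lower().split()
        vocabulary = set()
        for counts in self.word_counts.values():
            vocabulary.update(counts)
        total_messages = sum(self.message_counts.values())
        best_bucket, best_score = None, -math.inf
        for bucket, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Log prior: how common this bucket was in training
            score = math.log(self.message_counts[bucket] / total_messages)
            for word in words:
                # Add-one smoothing so unseen words do not zero out a bucket
                score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket
```

Training is just repeated calls such as `train("work", text)`; correcting a misfiled message means calling `train` again with the right bucket.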

--

"The legitimate powers of government extend only to such acts as are injurious to others." Thomas Jefferson.

Re:Multi-purpose tool by Anonymous Coward · 2002-11-05 04:54 · Score: 0

POPFile does exactly that. It's not just for spam.

John.

Junkfilter is good enough by Anonymous Coward · 2002-11-03 08:12 · Score: 0

Wow, another overkill solution for a non-problem. junkfilter is good enough for me.

this battle cannot be won by mboedick · 2002-11-03 08:13 · Score: 4, Insightful

These technologies are interesting, but the problem of spam should be solved at the source. Why should we waste our time, money, CPU and drive space trying to outwit spam with clever software? As has been said before, if you filter spam at the inbox, a lot of resources have already been wasted by the time it arrives.

Spam is anti-social behavior - a perversion of technology to make a quick buck. It's a cancer, and we should try to kill it. If you try to fight it any other way, you will constantly be playing catch-up, as the spammers have technology on their side too.

Re:this battle cannot be won by shayne321 · 2002-11-03 10:48 · Score: 4, Insightful

These technologies are interesting, but the problem of spam should be solved at the source.
And how do you propose we solve the problem at its source? Make it illegal? They'll just find loopholes in the law and/or move to a country where it is legal. Hunt them down and murder their wife and kids in front of them then hang them from a tree? Satisfying though it may be, last I checked murder was illegal.
Techniques like this CAN eventually solve the problem. As others have pointed out, for someone to buy something from a spammer they have to READ the spam. If they send out 1 million spams and 500,000 read them and 20 of them buy something, they'll keep doing it. If they send out 1 million and only 500 people read it and 1 person buys something, they'll lose their source of income and have to find a new line of work.
Also, for each obstacle we put in their way (checksum databases, open relay databases, filters, etc) it costs them more time, effort and therefore, money to send their crap - all for less income.
Shayne

--
Today I didn't even have to use my AK; I got to say it was a good day -- Icecube
Re:this battle cannot be won by crucini · 2002-11-03 10:56 · Score: 3, Insightful

It's all very well to say that spam should be stopped at the source, but how do you plan to do that? Blocklists that pressure the ISP? SPEWS is pretty effective, but Verio, UUNet and Sprint are deeply committed to spam. They won't dislodge their pet spammers until they feel financial pain. Want the government to stop spam at the source? I see lots of problems with that. One of them is the creation of another eternal government responsibility like the war on drugs. They will forever need more funding for "the war on spam" because spammers are getting more clever. These federal agencies develop a symbiotic relationship with the "problems" they're trying to "solve".

In practice, a multipronged approach will work best, combining prosecution, litigation, blocklists, content-based filtering, complaints to upstream providers and education of new users. Graham's article, in fact, shows how attempts to avoid prosecution push spammers into the arms of content-based filtering.

I don't ask for a 100% solution to spam, because any such solution will have awful side effects.

Pedantry! by Tim+Browse · 2002-11-03 08:45 · Score: 3, Funny

that's not irony, it's sarcasm.

Actually, irony is generally considered to be "use of words to express something different from and often opposite to their literal meaning".

Sarcasm is often defined as a form of irony (but not necessarily), intended to be cutting/offensive etc.

So while his comment may have been sarcasm, it was also irony.

And I'm not pedantic, I'm pernickety. :-)

Tim

Re:Pedantry! by Spunk · 2002-11-03 15:04 · Score: 1

It is true that you are not pedantic; had you been, you would have spelled persnickety correctly.
Re:Pedantry! by Tim+Browse · 2002-11-03 15:21 · Score: 2

Ha, I was well aware of the alternative spellings - I just happened to choose one that wasn't your favourite :-)
Tim
Re:Pedantry! by Spunk · 2002-11-03 23:31 · Score: 1

Honestly, English.

Do you really need two different spellings of that word? Don't go blaming me when people have such a hard time using you properly.

CRM114? by Anonymous Coward · 2002-11-03 08:50 · Score: 0

I wonder why nobody's mentioned Bill Yerazunis' CRM114. It's even linked to from Paul Graham's article, and apparently "achieves 99.87% accuracy".

http://www.paulgraham.com/wsy.html
http://crm114.sourceforge.net/

(It's almost like folks around here just read the headlines, but don't ever bother to read the articles...)

What's the problem? by LS · 2002-11-03 09:20 · Score: 2

I don't understand why everyone has so much difficulty with spam. Ever since my yahoo mail got deluged, I abandoned it and set up another account. I gave it out ONLY to my friends/family (about 60 people in my address book right now), and no one else. I keep another mail alias for online purchases and other sites where I MUST give a real mail address. If my alias address starts getting spam, then I will simply redirect it to its own folder instead of my inbox, then start using a new one. But I'm very selective about whom I purchase from on the net (read: no porn).

I haven't received any spam in over a year.

Ellis

--
There is a fine line between being a cultivated citizen and being someone else's crop. - A. J. Patrick Liszkie

Bayesian Filter is a nice start but ... by Anonymous Coward · 2002-11-03 09:50 · Score: 0

Bayesian Filter | perl script > /etc/ipf.rules

Bayesian Filter | perl script > Cisco router config to route entire /24 to Null0 if spam is in any way associated with the address block

Bayesian Filter | perl script | mail to global spam corpus
global spam corpus | perl script > MAPS-RBL or similar scheme.

I own a regional ISP and one of my two BGP peers is a colocation firm. We're both Cisco shops and I handle their infrastructure as well as my network.

The colocation firm has recently taken on a 'bulk mailer' client and I'm worried - I've been writing route-maps that never should have seen the light of day to balance the traffic and in general doing a lot of futzing around for a low margin client that is eventually going to get a netblock banned. If the ban is just their /24 I guess it's their problem, but the space they're in is part of an aggregate that is also used on my network.

Not sure where I'm going with the previous paragraph - but I think the idea is that if the spam problem becomes a financial problem for the ISPs that support it, it'll cease to be a business model.

Welcome to the future by disarray · 2002-11-03 09:55 · Score: 3, Informative

Future systems (assuming faster processors and more HD space) could include semantic analysis (e.g., Latent Semantic Analysis) to do an even better job and go beyond the word level.

Welcome to the future: the mail client in Mac OS X 10.2 uses latent semantic analysis. (This isn't just marketingspeak--my mail folder includes "LSMMap"--LS as in "latent semantic".)

I was wrong by crucini · 2002-11-03 10:13 · Score: 2

I didn't read the POPFile link. Had I read it, I would have known that POPFile is a POP Proxy. Therefore it is a good candidate for conversion to a standalone executable. In other words, given the lack of standard email hooks on the Windows platform, POPFile cleverly avails itself of the one standard to which mail clients are pretty much forced to adhere - POP3.

However while the proxy itself can live as an .exe, integration with the mail client is still desirable if the user is to categorize mail and thus "teach" the system. I guess the alternative, for naive users, is to ship the proxy with a static table of probabilities which can be periodically updated like virus definitions.

Re:The decimal issue by Suppafly · 2002-11-03 10:39 · Score: 1

what does that have to do with bayesian mail filtering?

Re:Um. No. by Anonymous Coward · 2002-11-03 10:58 · Score: 0

It has a right to send you shit, you have a right to filter the shit. I, personally, don't really want to end up covered in shit, so I filter it.

Other applications... by Ed+Avis · 2002-11-03 11:29 · Score: 3, Funny

How long until we can set up Bayesian by-word filtering on Slashdot comments?

--
-- Ed Avis ed@membled.com

Re:Other applications... by WolfWithoutAClause · 2002-11-04 09:51 · Score: 2

What do you mean? Picking out keywords is how the moderation system actually works isn't it?

--
-WolfWithoutAClause
"Gravity is only a theory, not a fact!"

Growing a spam filter -- a firsthand experience by devphil · 2002-11-03 11:30 · Score: 4, Interesting

So, the graduate CS course I'm taking this quarter is Evolutionary Computing, which is all about the convoluted nonlinear multidimensional-search-space problems, and guess what our current homework is? That's right, taking statistics on spam data, and using genetic algorithms to evolve a working spam filter.

Due to one typo and two thinkos in my fitness evaluation function, my algorithm evolves -- within only a few dozen generations -- a solution which looks like this:

Ignore the actual contents of the message. 34% of the time, it's spam.

And it's right.

--
You cannot apply a technological solution to a sociological problem. (Edwards' Law)

Re:Growing a spam filter -- a firsthand experience by Anonymous Coward · 2002-11-03 14:11 · Score: 0

Someone please mod this clod down as a troll.

you are making my point.... by zogger · 2002-11-03 12:19 · Score: 1

....my point is made: you are still reading the headers and the from addy. THAT is my point. I do the same thing: delete the spam, done. You just get it moved into another folder; I skip that step, it's an unnecessary middleman process. It makes no difference to me if I filter it into 19 other folders or not, you are still eyeballing them. A "glance" is still reading, and if you purposely skip any, you'll never know if you missed something critical.

Stuff happens. Recent slashdot story about the missed email that was leading to the 60k job, granted, the isp blocked it, but still a missed email might be important. Maybe, maybe not. But from the technical viewpoint, it's like pregnant, you ARE or you AREN'T. A filter for email is not useful if you value your email unless it is no joke 100% effective, not 99.999%, because you still read the headers. If you are gonna do that, skip the extra program, delete all after extracting your gems.

Here's an easy analogy, this filter acts as a remote control to run the remote control on your tv. ya, nifty, but what's it good for? Skip the middleman.
I think this software is cute but unnecessary. I can also "glance" at my list of emails, pick out the verifiable ones, delete the rest; it takes no longer in one window than another. It's the same amount of time. A previous poster commented that it breaks train of thought. Well, umm, I do my email in bulk: it's turned on, then off. I don't leave it running with odd beeps and flashes, I'd rather not be bothered, but that is personal preference, no right or wrong to it. The point of the deal is, you are still checking. Whether you do it now or later makes no difference as long as it happens. The label of the folder makes no difference, the color of it, nada. If you are still reading it, it's not filtering except as cute busywork. If you can really trust it, then have it delete emails it considers spam and be happy with it, but if you check it, it's not useful: you are doing the same amount of work as before, just in new folders, i.e., no difference.

I applaud the attempts, I can see they got it down to very few false positives, but people are still reading the headers at least using their real cognitive human intelligence as opposed to AI, because real intelligence actually works and AI is still guessing.

For my loot, if ya want to filter, you have an "allow only" list, as in "these addresses only, period, no exceptions" and everything else isn't allowed. It has to be a from addy you entered manually; nothing else gets in. That will stop spam. Well, that and around a few thousand successful prosecutions of spammers including jail time and fines equal to triple what they profited from spamming. Once that gets around, most will cease. Overseas, yes, it would be harder, but there are steps that could be taken to make those nations' leaders deal with their own spammers. That's another topic entirely.
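For comparison with the statistical filters, an "allow only" check is only a few lines. A minimal Python sketch, assuming a hypothetical `is_allowed` helper that looks only at the From header:

```python
from email import message_from_string
from email.utils import parseaddr

def is_allowed(raw_message, allowed_addresses):
    """True only if the From address is on the manually entered allow list."""
    msg = message_from_string(raw_message)
    _, address = parseaddr(msg.get("From", ""))
    allowed = {a.lower() for a in allowed_addresses}
    return address.lower() in allowed
```

Since the From header is supplied by the sender, a check like this trusts whatever address the message claims; forged headers, mentioned elsewhere in this thread, would pass it if they happen to use a listed address.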

Re:you are making my point.... by WolfWithoutAClause · 2002-11-03 13:19 · Score: 2

No. I think you've missed at least one point. Quite a lot of spammers send mail messages that contain URLs that are unique to you. If you open an email message that contains a URL like that, your browser opens the URL and that tells the spammer that your email address is a good one, and they can sell it to other spammers for money, and then you're gonna get more spams.
With the filters you can go offline before checking through the list of suspected spams; that way the URLs don't resolve and the spammers don't know you are there; and you get less spam.

--
-WolfWithoutAClause
"Gravity is only a theory, not a fact!"
Re:you are making my point.... by Blkdeath · 2002-11-03 14:55 · Score: 2

No. I think you've missed at least one point. Quite a lot of spammers send mail messages that contain URLs that are unique to you. If you open an email message that contains a URL like that, your browser opens the URL ...
Whoa! What kind of browser and e-mail client are you using?!? I don't even think Outlook/IE are stoopid enough to automatically request URLs!
Don't tell me you're confusing human stupidity with intelligent SPAM filtering?

--
BD Phone Home!
Shameless plug. Like you weren't expecting it.
Re:you are making my point.... by WolfWithoutAClause · 2002-11-03 16:01 · Score: 2

I'm currently using Netscape 7.0 which does this; but I've used Outlook previously and it is stupid enough to request GIFs from a web server unless you explicitly turn it off; that misfeature is on by default in both.
Whoa! Don't tell me you assume that GIFs always come with the email? Hey I've got this bridge, wanna buy one? ;-)

--
-WolfWithoutAClause
"Gravity is only a theory, not a fact!"
Re:you are making my point.... by Blkdeath · 2002-11-03 16:32 · Score: 2

I'm currently using Netscape 7.0 which does this; but I've used Outlook previously and it is stupid enough to request GIFs from a web server unless you explicitly turn it off; that misfeature is on by default in both.
Ok; common-sense measures indicate you turn off remote images and plugins for your mail + news reading.
Weren't we talking about people intelligent enough to be proactive about their SPAM prevention measures?
Now then - what URLs are your e-mail client loading for you?

--
BD Phone Home!
Shameless plug. Like you weren't expecting it.

SpamAssassin by fireboy1919 · 2002-11-03 12:59 · Score: 3, Interesting

This seems to be about using strange approaches to spam filtering, but really... a Bayesian network seems to be a natural step for a system that hitherto was composed of a series of heuristics with no knowledge of which is more important.

(Why hasn't it been done? Bayesian networks are only taught in AI and statistics classes).

What really interests me is that SpamAssassin claims to use a genetic algorithm to rate how likely an e-mail is to be spam.

--
Mod me down and I will become more powerful than you can possibly imagine!

Privacy questions... by cerebrum · 2002-11-03 13:04 · Score: 1

One thing I thought when ESR first posted his implementation of the Bayesian spam filter was that he should also include the "accept-word" and "unacceptable-word" files.

Then that brought up one really interesting point (at least to me). One could learn a lot about a person by having their "accept-word" and "unacceptable-word" files. They seem to hold reasonably private information.

Did that hit anyone else?

I've already got an Outlook (VBA) version by Red+Herring · 2002-11-03 13:05 · Score: 1

I've gotten an Outlook version (using VBA) running, and it works very well. I'm working on tuning some of the a priori probabilities, but right now I'm getting very good success... with a much lower false positive rate than I've gotten on any other straight filter-based method. (Meaning it very rarely classifies good email as spam.)

The key to making this work is having a very large corpus of both "good" and "bad" email with which to generate the word probability lists... I have ~2000 spams and ~10,000 good mails in the training corpora. With 1000 messages, it still works well, but has occasional false positives and negatives.
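To show roughly how a word probability list comes out of the two corpora, here is a sketch in Python that follows the formula in Paul Graham's essay (good counts doubled, words seen fewer than five times skipped, results clamped to 0.01..0.99). It is not the VBA code described above; the tokenizer and the names `tokenize` and `word_probabilities` are assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase runs of letters,
    # digits, dollar signs, apostrophes and dashes.
    return re.findall(r"[a-z0-9$'-]+", text.lower())

def word_probabilities(good_msgs, bad_msgs, min_count=5):
    """Map each word to an estimated probability that a mail containing it is spam."""
    good, bad = Counter(), Counter()
    for message in good_msgs:
        good.update(tokenize(message))
    for message in bad_msgs:
        bad.update(tokenize(message))
    ngood = max(len(good_msgs), 1)
    nbad = max(len(bad_msgs), 1)
    probs = {}
    for word in set(good) | set(bad):
        g = 2 * good[word]   # good occurrences doubled to bias against false positives
        b = bad[word]
        if g + b < min_count:
            continue         # too rare to say anything about
        good_ratio = min(1.0, g / ngood)
        bad_ratio = min(1.0, b / nbad)
        p = bad_ratio / (good_ratio + bad_ratio)
        probs[word] = max(0.01, min(0.99, p))
    return probs
```

The `min_count` cutoff is one reason the corpus size matters: with only a few hundred messages, most words never reach five occurrences and simply drop out of the table.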

--
#include "standard_disclaimer.h"

Re:The decimal issue by Anonymous Coward · 2002-11-03 14:38 · Score: 0

Base 10 is used as a convenience. Not all areas of study should choose (rather arbitrarily) to use some other base. What would be the gain? Using base 10, and rarely using anything else, keeps you from having to agree on the base.

Tell me... what is the difference between 3720 base 8 and 3720 base 10?

The difference is I don't have to tell you 3720 is MOST LIKELY base 10.

I'll concede on that... by zogger · 2002-11-03 14:53 · Score: 1

..ok, on that one point it makes sense. If you use text-based mail though it won't matter, no URLs open anyway. I personally don't use HTML or script-enabled emails; that eliminates at least 90% of the problem there, that and deleting. I maybe get 6 spams a day now, whereas years ago I got as many as anyone, hundreds sometimes. All I do is text-based mail and deleting spam, and it seems to have worked admirably. I don't load remote images, etc. And I have more or less trained email senders not to send me bogus attachments or forwards with all the other recipients CC'ed, etc., by the simple matter of informing them once, and if it persists I stop reading their mail. It's a tough call, but I made it years ago and it's paid off: email is not unmanageable, I get hardly any spam and very few viruses, and those can't affect me anyway, at least as far as I know. No executables run from just text-based email, but perhaps I am wrong on that, I honestly do not know.

All in all I'll sum it up. IF you trust this thing after a suitable "learning period" for it, AND you never waste your time thereafter checking the saved spam, yes, it IS an email spam filter. BUT, IF you open the spam folder to check it (ever, after the learning period) and read all the headers, you've defeated the purpose. At best it's a middleman gee-whiz placebo effect, as any perceived "time" savings is now illusory, and that's the whole point of a filter, yes? To save time wading through the spam and to avoid getting internet cooties sent to you? The emphasis should be on never getting it or getting on spam lists in the first place; filters are locking the barn door after the horse got out.

Just showing that not all tech is useful to all people. Here's a prime example: for some folks it's apparently what they think they need, good for them. For others it's nice to read about it a little, but it's irrelevant, as the problem got solved (more or less, generally speaking) long ago by numbers of people the old-fashioned but practical and effective high-tech way of using human biochemical intelligence over primitive manufactured electronic-only artificial intelligence.

To each their own; no one is correct or wrong in this per se, it's just a matter of taste and priorities. I see this as an overly complicated way to solve a simple problem, for some people, not all. Some folks have no choice: unfortunately they have gotten on so many lists that they are deluged with spam. It happens, obviously. Whoops. Others avoid it in the first place and regulate it on an as-needed basis. Two paths to the same destination; they are different from each other, but it is the destination that is important, not the travel there. I could build a robot arm to open the fridge door, add blinkenneonlights, but I don't think I will at this juncture. See?

I'll leave it at that, have fun with it, hope it works for ya'all.

Spamvertised URL Tracker by herbierobinson · 2002-11-03 16:05 · Score: 2

The tool we really need to combat spam is a personal tracking database for spamvertised URLs. The idea would be to put every URL advertised by spam into the database and then send DAILY complaints to the level 1 ISP for the host until either

1. The URL no longer works.

2. The ISP responds with proof that the URL owner filed criminal complaints against the spammer.

I, for one, am thoroughly fed up with the amount of time I have to waste dealing with spam. It's time to make it really painful for any ISP that tolerates it.

--
An engineer who ran for Congress. http://herbrobinson.us

Re:Spamvertised URL Tracker by pne · 2002-11-05 00:19 · Score: 2

Um, hello? How long would it take to set up a filter that would file all of your complaints straight to /dev/null? I fail to see how your solution would be very painful to any ISP with half a clue. I also fail to see how it can be effective.

--
Esli epei etot cumprenan, shris soa Sfaha.
Re:Spamvertised URL Tracker by herbierobinson · 2002-11-05 07:58 · Score: 2

Then I would have a record of the ISP habitually not responding to spam complaints. I am running a business. I am suffering 5-10% losses due to having to deal with spam. If I can prove an ISP is not responding to spam complaints, then I can sue them for damages in small claims court. The logic there is that the ISP's action (not responding to spam complaints) constitutes negligence and makes them directly liable for my damages.

Not to mention that once you have proof that they are not responding to SPAM complaints, you also have some great material for press releases. I especially like the idea of sending press releases about publicly traded companies to the financial press. A press release about pink contracts with a pron spammer sent to local news media can also be quite effective -- especially if the pron was sent to children (I have lots of examples of that archived for future use). This kind of pressure got level3 to drop a huge pron outfit (or at least fix it so traceroutes from my machine to theirs don't go through level 3). In that case, I was also sending e-mail directly to some of their major investors... Like I said, I am being severely damaged by SPAM and I am just starting to fight back.

--
An engineer who ran for Congress. http://herbrobinson.us

Just switch to the Mac (troll?:) by Anonymous Coward · 2002-11-03 17:05 · Score: 0

Seriously, I don't know what algorithms the 10.2 mail client is using, but it's damn good and having a mail client that's really built for IMAP (with POP being more secondary) is awesome.

This was the least of their worries by edunbar93 · 2002-11-03 17:19 · Score: 2

Heh. People that annoyed Stalin were exiled to Siberia if they were lucky, and were too important to simply kill. :)

--
"No problem. I have the capacity to do infinite work so long as you don't mind that my quality approaches zero."-Dilbert

Re:This was the least of their worries by Evil+Adrian · 2002-11-03 19:47 · Score: 1

People that annoyed Stalin were exiled to Siberia if they were lucky

Hehehe... like half of my family on my mother's side... almost makes me feel guilty for making the joke in the first place. :-)

--
evil adrian

Bayesian for Qmail by Anonymous Coward · 2002-11-03 17:37 · Score: 0

Coincidentally, I just implemented a bayesian filter for Qmail, which installed quite easily via the .qmail files.

The corpus lives in a BerkeleyDB database, and, so far "looks" ok -- we'll see how smart the filter becomes.

One thing I've noticed is that for the filter to perform well, I have to leave email in my box which I would normally read and delete, just so the filter can scan it and know that I *want* it, albeit just for a short time.

Here's the link: http://www.garyarnold.com/projects.php#bayespam

Several points come to mind by CySurflex · 2002-11-03 18:42 · Score: 2

I just spent over two hours reading through Paul Graham's web pages, POPFile's web pages, this slashdot thread and most of the other links provided as well. Some points come to mind:

1. Yahoo Mail has an interesting way of dealing with spam - you can "report as spam" any message that comes into your inbox. I suspect that they don't have a human reading these, but instead try to match multiple copies of the same e-mail being reported as spam by multiple people. When you have millions of users, if 10,000 report the same e-mail message as spam, it's probably spam. It would be interesting to have an open source program using P2P technology to do the same thing.

2. Like somebody mentioned above, this could be very useful in categorizing helpdesk e-mails, and even providing some canned automatic responses for them. E-mails with the words "forgot", "password", "can't" and "login" would have a very high probability of being about a user who can't log in for some reason, and could be resolved by an automated "HOWTO" and save a company some man-hours.

3. I'm going to try to integrate this into our Exchange server at work tomorrow if the IT guys will let me mess with it, and if not I'll try to integrate it into my (gasp) Outlook Exchange client.
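The "report as spam" idea from point 1 can be sketched in a few lines of Python: count how many distinct users reported the same message and flag it past a threshold. How Yahoo actually does it is not stated anywhere in this thread; the class name `SpamReports`, the threshold, and the exact-match fingerprint are all assumptions for illustration.

```python
import hashlib
from collections import defaultdict

class SpamReports:
    """Flag a message once enough distinct users have reported it."""

    def __init__(self, threshold=3):
        self.threshold = threshold            # hypothetical cutoff
        self._reporters = defaultdict(set)    # fingerprint -> set of users

    @staticmethod
    def fingerprint(body):
        # Collapse case and whitespace so trivial reformatting still matches.
        normalized = " ".join(body.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def report(self, user, body):
        self._reporters[self.fingerprint(body)].add(user)

    def is_probably_spam(self, body):
        reporters = self._reporters.get(self.fingerprint(body), ())
        return len(reporters) >= self.threshold
```

An exact hash like this is easy to defeat with per-recipient text (the unique URLs mentioned earlier in the thread would do it), so a real system would presumably need some fuzzier notion of "the same message".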

spammers debugging the code right now by SystematicPsycho · 2002-11-03 19:07 · Score: 1

spammers are dl'ing and debugging the code as we speak to figure out loopholes. They've been tossing and turning over legal and technical loopholes ever since spamming^H^H^Hdirect marketing became popular.

--
Analytic & algebraic topology of locally Euclidean meterization of infinitely differentiable Riemmanian manifold

Donations? by Alari · 2002-11-03 19:16 · Score: 0

I get 10+ megs of spam a month on my oldest e-mail account, anyone need some samples? =)

Alari

--
I use Windows... like a two dollar wh.. why don't I just go ahead and not finish that sentence.

Re: OT: here's someone that provides what you're l by Anonymous Coward · 2002-11-03 19:52 · Score: 0

http://www.purifieddata.net

They don't offer an outlook plugin, but they do the site wide filtering, without even needing a box installed at the client location (though that's an option).
Really nice interface to whitelists/blacklists/virus scanner/spam actions/etc. Might be worth checking out.

Already patented by MicrosofT by barfy · 2002-11-03 20:29 · Score: 3, Informative

This whole methodology is already patented by Microsoft. ANY implementation not licensed by Microsoft is going to be a violation... And now that you know, it is treble damages...

patent 6,161,130

Re:Other applications...Prior art. by Anonymous Coward · 2002-11-03 21:22 · Score: 0

You got a funny rating. But I did suggest Slashdot use something similar for the moderators to use. Make their life easier, and maybe make the "point" system a lot fairer. I also suggested the same for categorizing incoming submissions for relevance and sorting purposes. Ah the joys of being an AC. :/

Bogofilter with IMAP integration by giggls · 2002-11-03 22:05 · Score: 1

I'm using Bogofilter (http://bogofilter.sf.net) and I would like to see an IMAP Server where this can be integrated. This way reclassification becomes a matter of moving to and from a junk Folder.

SpamAssassin and Bayesian by tangent3 · 2002-11-03 23:41 · Score: 2

BTW, the current unstable version 2.50 of SpamAssassin also utilises a Bayesian filter as one of the rulesets. Pretty cool.

Re:SpamOracle (There is even a debian package) by luther · 2002-11-04 00:56 · Score: 1

And i uploaded a debian package of it, it got accepted into the archive this morning, so it will probably be available starting tomorrow or even this afternoon.

Been posted before... by NNland · 2002-11-04 05:43 · Score: 2, Informative

I hate to mention this, but I will anyways.

Popfile was announced here in late August, shortly after the Paul Graham article came out. It was originally closed source, which prompted the creation of multiple other projects. Among them is Spambayes and even my own Pasp (both in python, both open source).

As well, Popfile was announced open source at the end of September...on Slashdot. I know this because it was released under such a license as I was finishing up Pasp.

So yeah. As for how well Popfile categorizes mail into multiple categories, I have not run many tests with multiple category bayesian filtering, though the Spambayes group has, and has discovered that filtering mail based on multiple categories is far less accurate (many false categorizations). In the minimal tests I have done, I find this to be the case as well (we are used to less than 2% FP and FN rates, and with >2 bin categorization, error rates spike easily into the 10% range).
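For readers wondering what ">2 bin" categorization means mechanically, here is a generic multi-category naive Bayes sketch in Python with add-one smoothing. It is not POPFile's, Spambayes' or Pasp's code; the class name `MultiBinClassifier` and the whitespace tokenizer are assumptions for the example.

```python
import math
from collections import Counter, defaultdict

class MultiBinClassifier:
    """Naive Bayes over any number of categories (bins)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> messages seen
        self.vocabulary = set()

    def train(self, category, text):
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the most likely category, or None if nothing was trained."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Sum log probabilities to avoid underflow on long messages.
            score = math.log(n_docs / total_docs)
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best, best_score = category, score
        return best
```

With two bins a message only has to land on the right side of one boundary; with several bins the same evidence is split across more competing categories, which is one plausible reason for the higher error rates reported above.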

So yeah. Popfile has been announced here no less than 3 times now. I've not seen Spambayes announced at all (they deserve it), and Pasp has also not been announced, though I could care less about that.

Re:Been posted before... by zonker · 2002-11-04 08:36 · Score: 0

yeah i discovered this after i posted it... however, regardless of the project, i think this is good for people to know about, at least so they know there are options. here's a few of the previous articles on the subject (in order of appearance):

The End Of The Paperclip
Slashback: Pop-Ups, Books, Qmail
Slashback: Google, Prince, Bayesian
More on Bayesian Spam Filtering
Working Bayesian Mail Filter (hehe recursion is fun)

so there ya go. with enough support these projects (or derivatives) will hopefully make it so that my grandmother doesn't have to read 20 porn emails on aol everyday. that is the ultimate goal of spam filtering right? =)

btw, i posted about popfile specifically because it was one of the first filters that i saw that didn't require me to know how to compile source code or have a linux system etc... i knew there were others (in various forms of completion) out there, but this one worked for me without mucking around (which explains the title of the article).

--
Large print giveth, and the small print taketh away

Not quite Re:Bayes Explained by WolfWithoutAClause · 2002-11-04 09:58 · Score: 2

Kind of. The way to avoid the filter is to use words that are most commonly used in non spam email messages. The Bayesian classification doesn't actually recognize any kind of sales pitch per se.

So it's really only sensitive to phraseology.

--

-WolfWithoutAClause

"Gravity is only a theory, not a fact!"

Re:Not quite Re:Bayes Explained by B'Trey · 2002-11-06 05:48 · Score: 2

Actually, you not only have to use words that are commonly used in non-spam but you also have to avoid words that are commonly used in spam. So how do you convince someone to buy something without using words like "buy" "purchase" "special" "deal" "unique" etc? How do you convince people to come to your porn site without describing the pics you'll find there? Care to post an example spiel?

And if someone does find a way, you'll correct the filter and it will start to recognize the new format.

I don't claim that the Bayesian filter is in any way intelligent and actually recognizes the sales pitch. I do claim that sales pitches have certain characteristic word usages that can be identified by statistical analysis, and that the filter in effect recognizes those characteristics.

--
"The legitimate powers of government extend only to such acts as are injurious to others." Thomas Jefferson.
Re:Not quite Re:Bayes Explained by WolfWithoutAClause · 2002-11-06 07:14 · Score: 2

So how do you convince someone to buy something without using words like "buy" "purchase" "special" "deal" "unique" etc? How do you convince people to come to your porn site without describing the pics you'll find there? Care to post an example spiel?
I don't know you from Adam; you may be a spammer for all I know, or a spammer may read this. Either would be bad. So no, I don't care to.

--
-WolfWithoutAClause
"Gravity is only a theory, not a fact!"

John Walker's Annoyance Filter by boustrophedon · 2002-11-04 15:04 · Score: 1

Annoyance Filter is another "paulgrahamian" mail filter written by John Walker, founder of Autodesk, co-author of AutoCAD, and creator of the Hacker's Diet*.

Annoyance filter has many tuning and reporting options. It can plot a histogram of junk words. In addition to scanning the message header and body, Annoyance Filter can pull text out of Flash, PDF, and other attachments.

It includes a 180-page PDF manual, mostly the source code presented in literate programming style. The TeX typesetting is beautiful, so turn to page 17 to see Paul Graham's LISP function presented in readable mathematics notation.

* Walker's Hacker's Diet has been discussed on Slashdot here, here, and here.

Aaaarrrggghhh!!! by upper · 2002-11-04 15:31 · Score: 2

The patent covers any method at all like Paul Graham's method. He discusses the patent here.

The patent claims boil down to using a probabilistic classifier to recognize spam. There are many claims, but they're mostly trivial elaborations. Probabilistic classifiers aren't new, and there's no claim they invented them. And it doesn't look like they had to solve any real technical hurdles to apply it. It's one of the most egregiously obvious patents I've seen in a while.

I say there's only one way to test whether an idea is obvious to people skilled in the field, and that's to pose the problem to people skilled in the field and see if they can find the solution. Anything less is a joke.

Not to diss Horvitz and Heckerman -- they're big names in Bayesian inference and Bayes nets. They've been behind a bunch of solid research.

bayesian modslapping by psamuels · 2002-11-04 17:52 · Score: 2

Great, next we'll see slashcode automoderating based on bayesian probability of a troll. Use leetspeak, go to -1, Offtopic.

Please someone tell me I didn't just give them the idea..

--
"How can you claim that you are anti-crack, while still writing a window manager?" — Metacity README

[OT] Aluminum foil high? by dirtyhippie · 2002-11-05 01:20 · Score: 0, Offtopic

Did you know that if you chew on a piece of aluminum foil for a couple of minutes, you'll get high?

No, are you serious? A claim like that needs to be backed up!

Re:[OT] Aluminum foil high? by wheany · 2002-11-05 05:30 · Score: 0, Offtopic

Pardon for stereotyping, but I thought someone with a nick like that would have known. I don't know how or why it works. It's easy to test, though. Just put a piece of aluminum foil in your mouth and chew for a few minutes. Try not to swallow little pieces of aluminum...

Spammers Counter Tactics by htaccess · 2002-11-06 16:52 · Score: 1

Having recently started collecting spam and ham mboxes to teach the Bayesian spam filter I am planning to install (I haven't decided which one to use yet), I was interested to receive a spam which appears to be using counter-tactics, using HTML comments. Observe the wily spammer:

* Increase energy and cardiac output
* Turn back your body's biological time clock 10-20 years in 6 months of usage !!!

You are receiving this email as a subscriber to the Opt-In America Mailing List. To remove yourself from all related maillists, just reply with off.

The comments are obviously inserted into high-scoring spam words and contain random non-spam words. Clearly, in this case, catlover and dogsbark (2 strings inserted as comments) are not found in many spam wordlists. This accomplishes 2 things: it reduces the number of high-scoring words and increases the number of low-scoring words. Pretty devious. Obviously the spammers who live at genemarketmanager.com read Slashdot. Looks like the arms race has begun!

---Arrrg - I can't seem to post the whole spam without triggering Slashdot's lameness filters (reason: too many junk characters), so I've posted the full message at: http://www.gamma.net.nz/spam.txt Note I've changed the email address, but the user is dns, hence all the  tags
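One obvious counter to the comment trick is to tokenize what the reader actually sees rather than the raw HTML. The Python sketch below strips comments first, so a word split by a comment is rejoined, and then drops the remaining tags. The sample string in the usage note is made up, since the original markup did not survive posting, and `visible_text` is a hypothetical helper, not part of any filter named in this thread.

```python
import re

HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
HTML_TAG = re.compile(r"<[^>]+>")

def visible_text(html):
    """Approximate the text a reader sees, for feeding to the tokenizer."""
    # Remove comments with nothing in their place, so that
    # 'Incr<!--catlover-->ease' becomes 'Increase' again.
    without_comments = HTML_COMMENT.sub("", html)
    # Replace remaining tags with a space so adjacent words stay separate.
    without_tags = HTML_TAG.sub(" ", without_comments)
    return " ".join(without_tags.split())
```

For example, `visible_text("Incr<!--catlover-->ease ener<!-- dogsbark -->gy")` gives back "Increase energy", so the decoy words never reach the word lists. The presence of comments inside words could also be treated as a spam token in its own right.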

Last Post! by alpg · 2002-11-17 05:36 · Score: 1

Does biff in bo work
coz it biffin doesn't beep
an if biff in bo is broke
then biff in bo I will delete

I've tried biff in bo with 'y'
I've tried biff in bo with '-y'
no biffin output does it show
so poor wee biff is gonna go.
-- John Spence on debian-user

- this post brought to you by the Automated Last Post Generator...

312 comments