Working Bayesian Mail Filter
zonker writes "A real, working honest to god Bayesian spam filter. I've been waiting for something like this for a while (since I first read Paul Graham's research paper on this very topic a few weeks ago). Well here's POPFile, a small but extremely effective Perl script that runs on just about any system Perl does. After just a little training I was able to get very effective filtering out of it. From what I understand the new email client that comes with OS X Jaguar has a feature similar to this, but I don't know if it is true Bayesian. Hopefully this kind of feature will become more prevalent in client software as I see the Google results for it are growing."
Would anyone care to explain what is a "Bayesian" mail filter?
This is a clear and present threat to our society. Good thing the FBI acted quickly!
In theory, practice and theory are the same. In practice, they rarely are.
From what I understand the new email client that comes with OS X Jaguar has a feature similar to this, but I don't know if it is true Bayesian.
Who cares? Whatever works best should be used, not the one with the coolest name or whitepaper, right?
Does it give hexadecimal output (like for messages blocked)? I hate decimal.
Saw this a few weeks back... Spam filter in Python using Naive Bayes.
Honest to god? or God? Just which god/God is it honest to? Capital or lowercase G?
And I'm going to check it out right now :) But one long-standing fear I have with such solutions is spammers adapting to new environments (changing the wording used, making the emails look more professional). Sure, they're dumb shits but they're still humans with brains.
Any server-side solutions (MTA==qmail, MDA==procmail) using this (Naive-Bayesian) technique out there?
The mozilla mail client is getting a Bayesian mail filter, too. See http://bugzilla.mozilla.org/show_bug.cgi?id=163188 . Unfortunately, it probably won't show up until after version 1.2 is released.
Try searching for "bayesian email filter" instead of just "bayes email filter" (as in the news post). You'll get better results and more hits since Google doesn't match "*bayes*" (as one would think) when searching for "bayes", but only the actual word "bayes".
Beware: In C++, your friends can see your privates!
More intelligent classification algorithms can solve non-linear problems far better. Check out Kernel Machines and, somewhat older, Maximum Entropy models.
Enough nerd talk for today :-)
We need the Stalin Mail Filter (TM) -- it detects spam, hunts down the spammer, and exiles them to Siberia.
evil adrian
Sure it's great that someone made one, but it's a Perl script. We might be able to use Perl, but most "normal" people have never even heard of Perl, let alone know how to run Perl scripts. It would be great if someone ported this to an .exe file or something that everyone could run. It'll probably happen eventually.
Can someone explain why this filter would be useful to me?
"The lesson to be learned is not to take the comments on slashdot too literally." --Vinnie Falco, BearShare
What about the use of decimal in these sites? Can I filter out sites that use the decimal cancer? Geeks hate decimal.
This isn't exactly the first bayesian mail filter out there. I've been using ESR's bogofilter for weeks now, and I must say it works better than I could have ever imagined. Bogofilter however is simply for sorting out spam, while it appears this filter can sort out other things. But honestly, I can set up some simple filters to separate personal emails from work emails, so I'm not entirely sure the extra stuff is that useful.
-Stype
Bus error -- driver executed.
Pr(h|D) = Pr(D|h) * Pr(h) / Pr(D)
where Pr is probability, h is the hypothesis and D is the data. In this case it would be
Pr("SPAM"|Email) = Pr(Email|"SPAM") * proportion of spam / Pr(Email).
The trick is how to estimate the likelihood term, Pr(Email|"SPAM"). This is a very popular machine learning algorithm due to its simplicity and elegance. For more info, check out this link Bayes
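To make the rule concrete, here is a minimal Python sketch for the two-class case (spam vs. everything else). The function name and the numbers are made up for illustration; the denominator Pr(Email) is expanded as the sum over both hypotheses.

```python
def posterior_spam(p_email_given_spam, p_email_given_ham, p_spam):
    """Return Pr(spam | email) by Bayes' rule for a two-class problem.

    All three inputs are assumed to be estimated elsewhere; the values
    used in the example call below are invented for illustration only.
    """
    p_ham = 1.0 - p_spam
    # Pr(Email) = Pr(Email|spam) * Pr(spam) + Pr(Email|ham) * Pr(ham)
    evidence = p_email_given_spam * p_spam + p_email_given_ham * p_ham
    return p_email_given_spam * p_spam / evidence


if __name__ == "__main__":
    # Hypothetical numbers: this email is 20x more likely under "spam",
    # and half of all incoming mail is spam.
    print(posterior_spam(0.02, 0.001, 0.5))
```

A real filter would not know Pr(Email|"SPAM") for a whole message directly; the usual "naive" shortcut is to multiply per-word estimates, which is where the training corpus comes in.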
If you don't want spam then DON'T USE AOL OR HOTMAIL!
Keep your email private and only give it to friends and family. Set up a spamcop account to report any spam that does get through, and never 'remove' an email!
I've never received a single spam in my blueyonder email account and rightly so.
Does anyone know of any spam solutions for IMAP? Everything I've seen out there is POP3, but goddammit I like my IMAP folders!!! (Not to mention that the server on which my e-mail resides gets backed up nightly...)
evil adrian
Perl can use hexadecimal. Is there decimal in the source? Then it is evil. Decimal is evil to geeks. Decimal is the Microsoft of radices.
Mozilla has an open bug to integrate Bayesian spam filtering into the next release of the software. Most of the work is done. They're just waiting on incorporation of a message filtering plugin architecture.
Bad "spam"-like messages are bad. Good spamlike messages are not bad. A good spam-like message I consciously opted in to receive is indistinguishable from a welcome business proposal or newsletter.
Does this system know what businesses I've given my credit card to? Because EVERY ONE of those businesses has a right to e-mail me, so long as there is a clear opt-out link at the bottom of their e-mail.
If I trust a company enough to give it my credit card number, and I like it enough to do business with it, IT HAS A RIGHT TO SEND ME E-MAIL TO INFORM ME OF ITS PRODUCTS, as long as I choose to let it. Good businesses won't abuse the privilege, and I won't end up clicking the opt-out link.
The only thing this system is good for is filtering SOME penile-enlargement shady fly-by-night header-spoofing, open-relay-using shady shamster.
Oh, but that's the ONLY thing that the article defines as SPAM:
Let's take a quick look inside the mind of someone who responds to a spam [sic]. This person is either astonishingly credulous or deeply in denial about their sexual interests. In either case, repulsive or idiotic as the spam seems to us, it is exciting to them.
So this is not spam-filtering software; rather, it's software to filter pornographic messages that fit a certain low-level sales pitch. Lovely.
Robert.
You know, on this issue, you really depress me. You are clearly not of the academic nature, so your stance toward something that's probably way above your head really frustrates part of me.
As long as you're not developing the idea, it shouldn't matter how it works as long as it works.
I read the original article here as you did too. After all the mumbo jumbo about learning, I picked out one effective tip from the article on filtering my email: filter out HTML.
With 1 line of regex I eliminate 95% of my spam:
match and throw it out.
-- -- --
Help my mini cause: My journal
you don't seem to use your brain either asking such questions, why would it be useful to you anyway?
As I understand it, the Bayesian mail filtering system works by:
a) you receiving mail
b) designating where it should go
c) the filter tries to understand your reasoning
d) in the future, before step 1 occurs, the filter tries to interpret whether or not you want the mail based upon statistical analysis of what you have done in the past
Whereas current mail filtering techniques work by culling your mail on exact specifications (they don't try to interpret; if the filter doesn't know, it does nothing).
I quite like the idea of my mail filtering software becoming intelligent over time, however I can see a potential for email traffic being lost using this method. The Bayesian mail filter is essentially as effective as a (hopefully well trained) secretary. When you first get your secretary, she brings you everything. Then she starts culling the most obvious junk mail. Then she would start examining the normal letters... are they important? Relevant? Is this the person who should be dealing with it?
After time, you have your secretary very well trained, and she culls out everything which is not of immediate importance. In real life, this leads to the following problems:
a) you receive mail from an unknown source which could be important (some guy's discovered a new way to _________) but who isn't credible by your standards. His mail gets tossed aside, or redirected to someone else who probably doesn't care.
b) you receive mail from a trusted source at a bad address. i.e. your son is in Zimbabwe (sp?) on vacation. He sends you a letter postmarked from Zimbabwe, on museum letter head (couldn't find anything else handy). Knowing that you do not have dealings in Zimbabwe, and that this is most likely someone asking for charity, your secretary trashes it.
We've all heard stories of the first example, and it's not too hard to imagine the second. My worry is that, just like a good secretary, my mail filtering software will begin to filter for me. I will lose some control and, for the convenience of not having to hit the delete key a few extra times, I may miss potentially important email.
Chance is never a good thing to bring into your business.
What will make this thing work is if it is integrated with the e-mail client.
With this tool, you unfortunately have to manually add a message of a certain classification (work, pr0n, spam, family...) to the program through the perl script -- very awkward.
A tool like this needs to run as a daemon and 'notice' when a message is added to a folder. Unfortunately, with different formats for e-mail folders, it's a much tougher job.
As it stands, with something like Outlook, I'd have to export each message individually, then run the Perl script. I can probably add a macro to do that (with its own pains -- you add a VBA macro to Outlook and it gripes every time you start up), and possibly even one that responds to filing in a folder.... hmm... maybe I will try this out.
Design for Use, not Construction!
I have five fingers on each hand, so I prefer decimal.
If I had four fingers on each hand, I'd prefer octal.
If I had one finger on each hand, I'd prefer binary, but I think I could manage without using my fingers
If we had eight fingers on each hand, we'd prefer hex, but then it wouldn't be hex, because we'd have used a different numerical system, that'd be base 16, but with 16 numbers instead of 10 numbers and 6 characters.
My conclusion: You're stupid, ignorant and not a geek.
I work on the helpdesk of a small ISP; I also take care of the spam filtering, and answer abuse@. We recently added SpamAssassin, and God does it rock. (The big spike you see is me getting MRTG to graph what SA catches now; it's 6-10 times better than what we used to catch.)
But I still get complaints from our customers about spam that gets through. Just the other day a crapload got through because it was relatively subdued spam (no webbugs, NO LINE OF YELLING, etc); unfortunately, it also advertised pictures of young boys having sex. It's hard to explain why it's very, very hard to filter for this sort of thing, especially when I'm going through the talk for the nth time this week. (I need a good analogy that non-geeks can understand; I'm still looking.)
The good folks at DeerSoft have a version of SpamAssassin for Outlook, and are promising one for OE Real Soon Now. But I would loooooooooooooooooooooooove a good spam program -- this or SA or something else -- that I could point our customers to. Download, double-click, say yes, and bam it's installed. I can figure out how to install this on a Unix box; I could probably, eventually figure out how to do it on a Windows box; there's no way the customers could do it.
Or am I missing good, free spam filtering for Windows? Can anyone point me in the right direction?
Slightly OT: There has got to be a huge market for setting up spam filtering for small businesses. My idea: Tell 'em that if they provide the box -- an old Pentium or 486 will do -- I'll set up spam filtering and a firewall on it, set up some maintenance tools (whitelist this, firewall that). They get great mail service, I get $x00.
Carousel is a lie!
I think you have failed to understand how the filter works.
It is "trained" on a corpus of spam, which is compared to a corpus of known good messages. The important part is that YOU, the user, supply the spam corpus and the good messages. Thus in your case, as long as your "good spamlike messages" are in your "known good pile", similar new ones from the same source will not be tagged as spam. This is where the statistical approach shines over simple keyword matching.
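A rough Python sketch of that training step, assuming a very simple tokenizer and add-one smoothing. This is a generic naive Bayes illustration, not the exact formula from Paul Graham's article or the code in POPFile; all names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Deliberately dumb tokenizer: lowercase runs of letters, digits, $ ' -
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def train(messages):
    """Count how often each token appears in one user-supplied corpus."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts


def spam_probability(text, spam_counts, ham_counts, prior_spam=0.5):
    """Naive Bayes estimate of Pr(spam | text) with add-one smoothing."""
    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values()) + len(vocab)
    ham_total = sum(ham_counts.values()) + len(vocab)
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for token in tokenize(text):
        if token not in vocab:
            continue  # tokens never seen in training carry no evidence
        log_spam += math.log((spam_counts[token] + 1) / spam_total)
        log_ham += math.log((ham_counts[token] + 1) / ham_total)
    diff = log_ham - log_spam
    if diff > 700:  # avoid overflow in exp() for very long, clearly-ham mail
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


if __name__ == "__main__":
    # The user supplies both piles, so "good spamlike messages" go in ham.
    spam = train(["cheap pills buy now", "buy cheap mortgage now"])
    ham = train(["your amazon order has shipped",
                 "new books from amazon this week"])
    print(spam_probability("buy cheap pills", spam, ham))
    print(spam_probability("amazon books order", spam, ham))
```

Because both corpora come from the user, a newsletter filed under "known good" pushes its own vocabulary toward the ham side, which is the point being made above.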
Go on, read about how it works. You might learn something.
SquirrelMail is a WebMail client implemented in PHP. I use the client, but not the plugin (I use Razor).
Well, if all spam is indistinguishable from the legitimate spamlike messages you want to see, then no filter will help you.
However, it seems more likely that a large proportion of spam is distinguishable from mail you want to see. It's quite plausible that you don't want to see messages about nympho sluts, or penis enlargement, or breast enlargement (or at least not all three), and that a naive Bayesian filter could easily distinguish these and other spams from mail you do want to see.
www.goatse.cx is a bad site!
don't go there!
have the ability to learn new things.
set softtabstop=4 shiftwidth=4 expandtab nocp worlddomination
Bogofilter has been out since August, and does this bayesian spam-stuff in C, which probably will run a bit faster than the perl or python versions just because of its compiled-ness. I've never run it myself, but people on debian lists say it works better or not as good as spamassassin.
You can't see this if you have sigs turned off.
Please read the article. Classification of messages is done by you. If you routinely receive the same kind of pitch both solicited and unsolicited, it might have a hard time differentiating, sure, but keep in mind that spam filtering is just one form of classification that can be performed here.
If you choose to set up a spam classification, and routinely file penis enlargement ads, the system will quickly learn that e-mails with words common to penis enlargement ads are generally going to always be classified as spam, and will file it as such. Other pieces of e-mail that share content with "legitimate" ads may be misfiled in your "legitimate pitches" folder.
You can set this up however you want it. It learns by remembering the words in messages you manually classify, so you are not taking their definition of "spam". You are setting up a classification that you call "spam" and it's keeping track of the types of things you put in there. It will then apply that to future messages.
This may be self-regulating. Consider the Chinese room; if something is capable of perfectly emulating recognition of Chinese, then it can be said to recognize Chinese. Likewise, if a spammer becomes sufficiently skilled at writing undetectable prose, he or she will have reached a skill level at which he or she can pursue more profitable writing ventures. The margins in spam are pretty small. Those spams are being written by morons because morons are cheap.
Stop-Prism.org: Opt Out of Surveillance
Finally, paying attention in those statistics & risk management courses pays off!
I wrote a simple script to recognize languages by their letter frequencies. [http://www.fuzzums.nl/talenknobbel/].
this method isn't very strong, but with a fair amount of input the results get better. it even recognised the difference between dutch and a dutch dialect. the problem was that the alphabet only had 26 characters, so i came up with the idea of using letter pairs.
when i read the article it was really funny. the methods he uses are almost the same as my method. and when i read about using word pairs: LOL.
this will be a very cool spam-filter. i love it already.
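For the curious, the letter-pair idea can be sketched in a few lines of Python. This is not the script linked above; the function names, the cosine-similarity comparison, and the profile format are all assumptions made for illustration.

```python
import math
from collections import Counter


def letter_pairs(text):
    """Count adjacent letter pairs, ignoring case and non-letters.

    Dropping non-letters means pairs can span word boundaries; a more
    careful version might keep spaces as a symbol.
    """
    letters = [c for c in text.lower() if c.isalpha()]
    return Counter(a + b for a, b in zip(letters, letters[1:]))


def cosine(p, q):
    """Cosine similarity between two pair-frequency profiles."""
    dot = sum(p[k] * q[k] for k in p if k in q)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0


def guess_language(text, profiles):
    """Return the name of the profile most similar to the sample text."""
    sample = letter_pairs(text)
    return max(profiles, key=lambda name: cosine(sample, profiles[name]))


if __name__ == "__main__":
    # Tiny made-up training samples; real profiles need far more text.
    profiles = {
        "english": letter_pairs("the quick brown fox jumps over the lazy dog"),
        "dutch": letter_pairs("de snelle bruine vos springt over de luie hond"),
    }
    print(guess_language("the dog jumps", profiles))
```

With 26 letters there are only 26 single-letter frequencies but up to 676 pair frequencies, which is why pairs separate similar languages better given enough input.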
Privacy is terrorism.
This looks really good. Anyone out there know how/if it can be used with ximian evolution?
Linux. Because a 386 is a terrible thing to waste.
Give a hand, not a hand-out.
I just received the November edition of the TPJ which included a fine article "perlcc & Compiling Perl Script".
In short, the filter script could be compiled to C and built to a native binary for a variety of platforms eliminating the need for a Perl interpreter.
I get tired of copy and pasting spam emails into spamcop from the same ISPs. I use The Bat! quite a lot, any suggestions?
One of my pet peeves is the obsession that folks have with zeros. An example is the year 2000. In base 10 you get beaucoup zeros whereas with hex you get 7D0, or 11A8 (base 12), or 3720 (octal), or 11111010000 (binary). Zeros are an artifice of both the base, and numeral system used to represent a pure number. Thus, the fact that most humans use the decimal Indo-Arabic numeral system to represent it is the only reason for all those zeros. Use another base, or numeral system to represent 2000, you don't get beaucoup zeros.
The real properties of pure numbers are the relationships that they have with other numbers, and not the symbology used to represent them.
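The representations are easy to check with a short Python helper (a hypothetical `to_base` function written for this post, using repeated division):

```python
def to_base(n, base):
    """Render a non-negative integer in the given base (2 to 16)."""
    digits = "0123456789ABCDEF"
    if not 2 <= base <= len(digits):
        raise ValueError("base must be between 2 and 16")
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, remainder = divmod(n, base)  # peel off the lowest digit
        out.append(digits[remainder])
    return "".join(reversed(out))


if __name__ == "__main__":
    for base in (2, 8, 10, 12, 16):
        print(base, to_base(2000, base))
```

Same number every time; only the string of symbols changes.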
"Oh drat these computers, they're so naughty and so complex, I could pinch them." --Marvin the Martian
46 75 63 6B 20 6F 66 66 2E
An advertised false positive rate of 0% is nice, but why not additional research into the spam, to attempt to categorize into blatant spam, probable spam, borderline, and non-spam, and see if false positives can be plopped into the borderline categories.
Also, from what I saw in the article, there will already be a next level that spam can take: image-based messages, misspellings of key words (klik, Clic, Clik, etc), using 0xfe0000 for almost-bright-red.
Hey, I'm just your average shit and piss factory.
You're wrong, though. The whole point of this kind of filter is that it develops its rules based on the information that you give it, not what somebody else thinks. If you tell it that mails from your legitimate business partners aren't spam, it learns to tell them apart. I use a Bayesian filter on my mail, and it has no trouble telling my legitimate business mail, like messages from Amazon about books I've been waiting for, from illegitimate ones. Some of that is that the legitimate mail is written with a very different style from the illegitimate stuff, but I assume that the filter has also learned that mail with amazon.com as the sender is OK. In any case, I find that it just plain works.
There's no point in questioning authority if you aren't going to listen to the answers.
Just because it's the first one that actually makes the slashdot frontpage it doesn't mean it's the only one.
Do a freshmeat search for bayespam, bogofilter and spamprobe, they're all working and quite mature bayesian filters (or should we say "paulgrahamian" in order to appease the "true bayesian" crowd). Hell, even a search for "bayes" will turn out a few more hits, like ifilter, which aims to automatically classify mail in different folders, but could be easily tuned to filter out spam.
Of these, I think spamprobe is becoming the true "swiss army knife" of "bayesian" filtering; I did find both bogofilter and bayespam spartan, but they work well. spamprobe, on the other hand, is very actively maintained, is under constant improvement by the author, Brian Burton, and has given me excellent results getting rid of over 90% of my spam.
I am just about to put bogofilter in my mail filtering system. I am thinking about combining this baby with spamassassin, as described here:
http://www.randomhacks.net/2002/09/23/#using-bogofilter-with-spam-assassin
I will use the pass through option and I can use spamassassin to protect against false positives and to adjust the sensitivity.
BTW: Does anyone know if the number of SPAM and nonSPAM have to be about equivalent or is this accounted for? I have 4000 spam mails in a folder, but just about 500 nonspam mails.
Moritz
In my testing (over the last 30 mins) I discovered that filtering is employed when the POP3 "RETR" (retrieve entire message) command is used but no filtering is done when the equally useful "TOP" (show me the headers and X lines of the body) command is issued by a client.
A huge advantage of also doing the filtering for the TOP command would be that mail clients such as The Bat, Pimmy, JBMail and PocoMail will let you preview all headers while leaving mail on the server (or deleting it, whatever) but without actually downloading the full message bodies.
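To show what classifying on TOP could look like, here is a small Python sketch. It assumes a connection object with a `top(which, howmuch)` method shaped like the one on Python's `poplib.POP3`; the function name `preview_via_top` is made up, and nothing here reflects how POPFile itself is implemented.

```python
from email.parser import BytesParser


def preview_via_top(conn, msg_num, body_lines=20):
    """Fetch headers plus a few body lines using POP3 TOP, not RETR.

    `conn` is assumed to behave like poplib.POP3: top() returns a tuple
    (response, list_of_byte_lines, octets). The parsed result could be
    handed to a classifier without downloading the whole message.
    """
    _response, lines, _octets = conn.top(msg_num, body_lines)
    raw = b"\r\n".join(lines)
    # headersonly=True parses the headers and leaves the rest as payload
    return BytesParser().parsebytes(raw, headersonly=True)


if __name__ == "__main__":
    class FakeConnection:
        """Stand-in for a real server so the sketch runs offline."""

        def top(self, which, howmuch):
            return (b"+OK", [b"Subject: Cheap pills",
                             b"From: someone@example.com",
                             b"",
                             b"Buy now"], 64)

    message = preview_via_top(FakeConnection(), 1, body_lines=1)
    print(message["Subject"])
```

A proxy that ran its classifier on this preview could tag the headers it returns for TOP, at the cost of judging on less text than RETR would give it.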
filter everything that only has a text/html attachment
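A minimal Python sketch of that rule, using the standard `email` package. The function name is hypothetical, and treating "every leaf part is text/html" as the definition of HTML-only is an assumption about what the rule means.

```python
from email import message_from_string


def is_html_only(raw_message):
    """True if every non-container MIME part of the message is text/html."""
    message = message_from_string(raw_message)
    leaf_types = [part.get_content_type()
                  for part in message.walk()
                  if not part.is_multipart()]
    return bool(leaf_types) and all(t == "text/html" for t in leaf_types)


if __name__ == "__main__":
    html_only = ("Subject: Great offer\n"
                 "Content-Type: text/html\n"
                 "\n"
                 "<html><body>Buy now</body></html>\n")
    print(is_html_only(html_only))
```

Mail that carries a text/plain alternative alongside the HTML would pass, so this is far blunter than a trained filter and will also drop legitimate HTML-only mail.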
I don't think that you understand how this form of filtering works.
1: You decide what content is spam and what content is not spam because you train the filter. One of the things that I disliked about SpamAssassin was its tendency to mark conference announcements as spam. I don't have this problem with my pseudo-Bayesian filter because it recognizes that mail about education tends to be good while mail about mortgages, pot and penis enlargement tends to be bad.
2: Perhaps more importantly the filter not only checks for trusted content, but trusted sources and routes. If honestcorp.com never sends you spam, then honestcorp.com becomes a trusted route for email.
Of course it looks like Outlook is Outnumbered here.
If this is only intended for client side use then it still doesn't address the issue of all the bandwidth that spam wastes. Wouldn't it just be a better project to help all the idiots close the open relays on their servers? Or maybe require authentication on all SMTP servers?
Does this system know what businesses I've given my credit card to?
Do you understand what a bayesian filter does? It tries to figure out what you consider spam. I don't like dentists sending me advertising junk; bogofilter trashes it. Anything about Esperanto or Project Gutenberg or Linux could probably fly on through, as it's got a lot of words that actually appear in my good email in it. At worst, a couple messages from that business get caught, and then it will recognize that the messages are good based on sender and embedded URLs.
In any case, there tends to be a huge difference between the messages I've got from companies I've given my credit card to and the ones that are sending me spam. Usually, one is quietly informing me of new items for sale, and one is screaming about crap. A bayesian filter can often tell the difference.
I understand wanting to filter spam, but with all these techniques, as near as I can see, you still have to read all the spam to (A) make sure the filter is learning to find spam and (B) make sure your automatic filter didn't throw out important mail. Umm, what's the point then? Isn't this an example of the department of redundancy department?
Many mail order firms use prisoners to answer phones and take your orders.
An intelligent ISP level spam filter could consist of sending any message that hits multiple subscribers (set some reasonable threshold) to prison for evaluation.
The prisoner's station would be a screen with one button that approves the message for delivery and one which deletes it.
Um, kind of.
SPAM is generally undesired email, often with forged header information and without a caring person on the sending end.
For example, I was getting a ton of SPAM email promoting a major credit card (not Visa, AE, or MC). The email that was sent didn't have a real return address. In fact, the email said "don't reply - any messages sent to this email address will be deleted" !!!
The false return address made it impossible for me to have a two-way conversation with the sender. That's not customer service, and that's not even a friendly form of marketing. I don't understand why anyone would expect people to tolerate such marketing nonsense.
I know the organization I work for would NEVER resort to such unsavory tactics.
This kind of SPAM may or may not be illegal. But I don't care about its legality - it's inappropriate behavior, and I refuse to tolerate it.
Why wouldn't spammers do something like this to circumvent the filter (i.e. simple image-based spam with text that doesn't raise any alarms):
/9j/4AAQSkZJRgABAgEBLAEsAAD/7QlMUGhvdG9zaG9wIDMuMA A4QklNA+0KUmVzb2x1dGlvbgAA
Content-Type: multipart/related;
type="multipart/alternative";
boundary="----=_NextPart...."
This is a multi-part message in MIME format.
------=_NextPart_....
Content-Type: multipart/alternative;
boundary="----=_NextPart_...."
------=_NextPart_....
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Hi
------=_NextPart_....
Content-Type: image/jpeg;
name="Spam goes here.jpg"
Content-Transfer-Encoding: base64
Content-ID:
etc...
I've been working on a spam tool in Lotus Notes (I know, but it's what we have to use where I work) that uses the same underlying methods. I've designed it to be outboard of the mail database, and it's "pure" Notes so should run on any supported platform.
I have it pretty much working now, and it is uncanny how well it sorts the spam from the rest of the stuff. Using even a very dumb tokenizer, the thing catches 95% or more of the spam, and so far the only false positives have been a result of miscategorized stuff in the input corpus -- i.e. I had filed something as spam that was not spam, and the filter started recognizing similar stuff as spam. That actually looks like one of the main possible failure modes for this approach.
Another of my concerns is that there are so many possible tweaks to these algorithms (mainly various ways of tuning the tokenizer, but also whether to focus on specific elements of messages, what to do with URLs, HTML comments, etc.) that could make a difference to the filter's performance.
I'm seeing a lot of interest from colleagues at work, and I'm starting to share it with them. If/when it feels mature enough, I may be able to get permission to release it to the outside world too. (Mine is a private, one-man project, but done on company time and with company resources, so they get first call on it.)
Hm... what about an anti-anti spam filter that mangles the message inserting random misspellings into the spam-identifying words? The bayesian filter would perceive this as a message consisting of many 'unclassified' words, just like a message in some unknown language. Sure, the short words probably haven't got many possible misspellings (cock, c0ck, coock, cokc - hm... starts to look undecipherable ), so they would probably get classified after some time. And this would hopefully lower the spam success ratio. But the possibility still remains...
This is what POPFile is for. It's a POP3 proxy server: it sits between your POP3 client and the server and simply adds a classification to the headers (or the subject line for braindead mail clients).
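As a rough illustration of the tagging step only (not the proxy, and not POPFile's actual code), here is a Python sketch. The header name `X-Example-Classification` and the function name are placeholders invented for this example.

```python
from email import message_from_string


def tag_message(raw_message, bucket,
                header="X-Example-Classification", tag_subject=False):
    """Return the message text with a classification header added.

    `header` is a made-up name for illustration. With tag_subject=True the
    bucket is also prefixed to the Subject line, for clients that cannot
    filter on arbitrary headers.
    """
    message = message_from_string(raw_message)
    del message[header]  # no error if the header is absent
    message[header] = bucket
    if tag_subject:
        subject = message.get("Subject", "")
        del message["Subject"]
        message["Subject"] = f"[{bucket}] {subject}"
    return message.as_string()


if __name__ == "__main__":
    raw = "From: someone@example.com\nSubject: Hello\n\nbody text\n"
    print(tag_message(raw, "spam", tag_subject=True))
```

The mail client then needs only an ordinary rule ("if this header equals spam, move to folder X"), which is why the approach works with any POP3 client.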
Currently POPFile is a bit rough on computer newbies; it needs a Perl install and such. However, if you read the forums, it is intended to end up as an easily installed executable for Windows users and to remain a nifty little Perl script for the rest of the platforms where it might come in handy. So when those pesky friends and relatives come asking about all the viagra and farmyard spam they get (and you haven't already set them up on your tightly filtered mail server) set up POPFile for them.
Also, it's not just for spam filtering. Think of what you could do if you could go beyond simple rules for your inbox. Want email you think is important forwarded to your phone? Create a category for important email and go through your archives and feed POPFile email you would have wanted forwarded instantly. Create a new folder to receive those mails and watch it for a few days, retraining POPFile until it is getting reasonably good at putting important mail in there. Now set up your mail system to forward those to your phone. Will it work? I don't know, but based on the results I'm getting, it probably would. How about using it to filter help desk emails?
Bleh!
No one has mentioned it so far, but Yahoo mail has a Bulk Mail folder. SPAM is automatically sent there, and I have yet to see a single false positive (and false negatives are quite rare as well).
:)
The system works surprisingly well. I checked the FAQ and it doesn't go into any detail about how it works, but I wouldn't doubt if something like this is being used.
I've been thinking, and it seems that this could potentially have a lot of use, aside from Spam filtering. Perhaps a mail client could let you categorize email in general (SPAM, Business-related, forwarded stuff from AOL users, etc), and learn how to spot and organize things.
I'm putting this (either the POPfile or bogofilter) into place with a modified SquirrelMail, just to give it a good run; I might try and modify it to also categorize other types of email, just to see if something like that could work.
I could easily see a mail client (web-based or otherwise) that lets you drag mail to specific folders, and eventually learns how to do this for you (and of course you can always correct it by simply dragging to another folder, which also contributes to the learning process)...
After reading this article my mind is just spinning with ideas... Bayesian search engines... perhaps speech/voice recognition applications... classifying text/html/doc files... organize songs (processing the lyrics)... ugh, I should stop now
NGWave - Fast Sound Editor for Windows
You can separate the newsletters from the businesses you've opted in to from the penile-enlargement spam. That's one of the beautiful things about POPFile: it isn't just about spam vs useful mail. In fact, it seems to be more accurate and learn faster when you define categories for all the different types of mail you receive, not just spam vs inbox.
Bleh!
Yet another business-model!
1: Write free software.
2: ?
3: Be a faggot.
4: Profit!
Now we can tell spammers: "All your Bayes are belong to us."
I've been using SpamOracle with great success for a few weeks. Plays well with procmail. It's based on the Bayesian technique described by Paul Graham.
It's written in OCaml so getting it up and running takes a little work (though not much). Once it's installed the command-line learning interface is quite easy to use.
http://pauillac.inria.fr/~xleroy/software.html#spamoracle
The apple mail client, mentioned in the blurb, works very well with IMAP, that's what impressed me enough that I'm actually using it.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Friends don't let friends enable ecmascript.
An interesting idea that I haven't seen discussed is using this concept for more general uses. If we can sort spam from non-spam, how about business from personal? Technical from administrative? All you'd need is multiple databases of word probabilities, the ability to assign emails to multiple categories and a hierarchical method of sorting.
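Something along these lines can be sketched in Python as one word-count table per category. The class name, the tokenizer, and the add-one smoothing are assumptions for illustration; this version picks a single best category and leaves out the hierarchical sorting and multi-category assignment.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())


class BucketClassifier:
    """Naive Bayes over any number of user-defined categories."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> token counts
        self.message_counts = Counter()          # bucket -> messages seen

    def train(self, bucket, text):
        self.word_counts[bucket].update(tokenize(text))
        self.message_counts[bucket] += 1

    def classify(self, text):
        """Return the most probable bucket, or None if untrained."""
        vocab = set().union(*self.word_counts.values())
        total_messages = sum(self.message_counts.values())
        tokens = [t for t in tokenize(text) if t in vocab]
        best_bucket, best_score = None, -math.inf
        for bucket, counts in self.word_counts.items():
            # log prior from how much mail was filed in this bucket
            score = math.log(self.message_counts[bucket] / total_messages)
            denominator = sum(counts.values()) + len(vocab)
            for token in tokens:
                score += math.log((counts[token] + 1) / denominator)
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket


if __name__ == "__main__":
    classifier = BucketClassifier()
    classifier.train("work", "meeting agenda for the quarterly budget review")
    classifier.train("personal", "dinner with mom on sunday")
    classifier.train("spam", "buy cheap pills now")
    print(classifier.classify("budget meeting"))
```

Assigning one message to several categories would need a per-bucket threshold instead of a single argmax, which is a design choice this sketch does not attempt.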
"The legitimate powers of government extend only to such acts as are injurious to others." Thomas Jefferson.
Wow, another overkill solution for a non-problem. junkfilter is good enough for me.
These technologies are interesting, but the problem of spam should be solved at the source. Why should we waste our time, money, CPU and drive space trying to outwit spam with clever software? As has been said before, if you filter spam at the inbox, a lot of resources have already been wasted by the time it arrives.
Spam is anti-social behavior - a perversion of technology to make a quick buck. It's a cancer, and we should try to kill it. If you try to fight it any other way, you will constantly be playing catch-up, as the spammers have technology on their side too.
Actually, irony is generally considered to be "use of words to express something different from and often opposite to their literal meaning".
Sarcasm is often defined as a form of irony (but not necessarily), intended to be cutting/offensive etc.
So while his comment may have been sarcasm, it was also irony.
And I'm not pedantic, I'm pernickety. :-)
Tim
I wonder why nobody's mentioned Bill Yerazunis' CRM114. It's even linked to from Paul Graham's article, and apparently "achieves 99.87% accuracy".
http://www.paulgraham.com/wsy.html
http://crm114.sourceforge.net/
(It's almost like folks around here just read the headlines, but don't ever bother to read the articles...)
I don't understand why everyone has so much difficulty with spam. Ever since my yahoo mail got deluged, I abandoned it and set up another account. I gave it out ONLY to my friends/family (about 60 people in my address book right now), and no one else. I keep another mail alias for online purchases and other sites where I MUST give a real mail address. If my alias address starts getting spam, then I will simply redirect it to its own folder instead of my inbox, then start using a new one. But I'm very selective about whom I purchase from on the net (read: no porn).
I haven't received any spam in over a year.
Ellis
There is a fine line between being a cultivated citizen and being someone else's crop. - A. J. Patrick Liszkie
Bayesian Filter | perl script >
Bayesian Filter | perl script > Cisco router config to route entire
Bayesian Filter | perl script | mail to global spam corpus
global spam corpus | perl script > MAPS-RBL or similar scheme.
I own a regional ISP and one of my two BGP peers is a colocation firm. We're both Cisco shops and I handle their infrastructure as well as my network.
The colocation firm has recently taken on a 'bulk mailer' client and I'm worried - I've been writing route-maps that never should have seen the light of day to balance the traffic and in general doing a lot of futzing around for a low margin client that is eventually going to get a netblock banned. If the ban is just their
Not sure where I'm going with the previous paragraph - but I think the idea is that if the spam problem becomes a financial problem for the ISPs that support it, it'll cease to be a business model.
Welcome to the future: the mail client in Mac OS X 10.2 uses latent semantic analysis. (This isn't just marketingspeak--my mail folder includes "LSMMap"--LS as in "latent semantic".)
I didn't read the POPFile link. Had I read it, I would have known that POPFile is a POP Proxy. Therefore it is a good candidate for conversion to a standalone executable. In other words, given the lack of standard email hooks on the Windows platform, POPFile cleverly avails itself of the one standard to which mail clients are pretty much forced to adhere - POP3.
However, while the proxy itself can live as an .exe, integration with the mail client is still desirable if the user is to categorize mail and thus "teach" the system. I guess the alternative, for naive users, is to ship the proxy with a static table of probabilities which can be periodically updated like virus definitions.
what does that have to do with bayesian mail filtering?
It has a right to send you shit, you have a right to filter the shit. I, personally, don't really want to end up covered in shit, so I filter it.
How long until we can set up Bayesian by-word filtering on Slashdot comments?
-- Ed Avis ed@membled.com
So, the graduate CS course I'm taking this quarter is Evolutionary Computing, which is all about the convoluted nonlinear multidimensional-search-space problems, and guess what our current homework is? That's right, taking statistics on spam data, and using genetic algorithms to evolve a working spam filter.
Due to one typo and two thinkos in my fitness evaluation function, my algorithm evolves -- within only a few dozen generations -- a solution which looks like this:
And it's right.
You cannot apply a technological solution to a sociological problem. (Edwards' Law)
....my point is made you are still reading the headers and the from addy. THAT is my point. I do the same thing, delete the spam, done. You just get it moved into another folder, I skip that step, it's an unnecessary middleman process. It makes no difference to me if I filter it into 19 other folders or not, you are still eyeballing them, a "glance" is still reading, unless you purposely skip any, then you'll never know if you missed something critical.
Stuff happens. Recent slashdot story about the missed email that was leading to the 60k job, granted, the isp blocked it, but still a missed email might be important. Maybe, maybe not. But from the technical viewpoint, it's like pregnant, you ARE or you AREN'T. A filter for email is not useful if you value your email unless it is no joke 100% effective, not 99.999%, because you still read the headers. If you are gonna do that, skip the extra program, delete all after extracting your gems.
Here's an easy analogy, this filter acts as a remote control to run the remote control on your tv. ya, nifty, but what's it good for? Skip the middleman.
I think this software is cute but unnecessary. I can also "glance" at my list of emails, pick out the verifiable ones, and delete the rest; it takes no longer in one window than another. It's the same amount of time. A previous poster commented that it breaks train of thought. Well, umm, I do my email in bulk: it's turned on, then off. I don't leave it running with odd beeps and flashes, I'd rather not be bothered, but that is personal preference, no right or wrong to it. The point of the deal is, you are still checking. Whether you do it now or later makes no difference as long as it happens. The label of the folder makes no difference, the color of it, nada. If you are still reading it, it's not filtering except as cute busywork. If you can really trust it, then have it delete emails it considers spam and be happy with it, but if you check it, it's not useful: you are doing the same amount of work as before, just in new folders, i.e., no difference.
I applaud the attempts, I can see they got it down to very few false positives, but people are still reading the headers at least using their real cognitive human intelligence as opposed to AI, because real intelligence actually works and AI is still guessing.
For my loot, if ya want to filter, you have an "allow only" list, as in "these addresses only, period, no exceptions", and everything else isn't allowed; it has to be a from addy you entered manually, nothing else gets in. That will stop spam. Well, that and around a few thousand successful prosecutions of spammers, including jail time and fines equal to triple what they profited from spamming. Once that gets around, most will cease. Overseas, yes, it would be harder, but there are steps that could be taken to make those nations' leaders deal with their own spammers. That's another topic entirely.
This seems to be about using strange approaches to spam filtering, but really...a bayesian network seems to be a natural step for a system that until now was composed of a series of heuristics with no knowledge of which is more important.
(Why hasn't it been done? Bayesian networks are only taught in AI and statistics classes).
What really interests me is that SpamAssassin claims to use a genetic algorithm to rate how likely an e-mail is to be spam.
Mod me down and I will become more powerful than you can possibly imagine!
When ESR first posted his implementation of the Bayesian spam filter, I thought he should also include the "accept-word" file and "unacceptable-word" file.
Then, that brought up one really interesting point (at least to me). One could learn a lot about a person by having their "accept-word" and "unacceptable-word" files. They seem to hold reasonably private information.
Did that hit anyone else?
I've gotten an Outlook version (using VBA) running, and it works very well. I'm working on tuning some of the a priori probabilities, but right now I'm getting very good results, with a much lower false positive rate than I've gotten with any other straight filter-based method. (Meaning it very rarely classifies good email as spam.)
The key to making this work is having a very large corpus of both "good" and "bad" email with which to generate the word probability lists... I have ~2000 spams and ~10,000 good mails in the training corpora. With 1000 messages, it still works well, but has occasional false positives and negatives.
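For anyone wondering what "generate the word probability lists" might look like in practice, here is a minimal sketch loosely in the style of the scheme in Paul Graham's article. It is not the VBA code described above: the function names, the whitespace tokenizer, and the default constants (the bias toward good mail, the 0.4 score for unknown words, the 15 most interesting tokens, the 0.01/0.99 clamps) are assumptions for illustration.

```python
from collections import Counter


def word_probabilities(spam_msgs, good_msgs, good_bias=2.0):
    """Estimate P(spam | word) for every word in two non-empty corpora."""
    bad = Counter(w for m in spam_msgs for w in m.lower().split())
    good = Counter(w for m in good_msgs for w in m.lower().split())
    nbad, ngood = len(spam_msgs), len(good_msgs)
    probs = {}
    for word in set(bad) | set(good):
        b = min(1.0, bad[word] / nbad)
        # Good occurrences are weighted up to bias against false positives.
        g = min(1.0, good_bias * good[word] / ngood)
        # Clamp so no single word is ever treated as absolute proof.
        probs[word] = max(0.01, min(0.99, b / (b + g)))
    return probs


def spam_score(text, probs, unknown=0.4, top_n=15):
    """Combine the most 'interesting' word probabilities into one score."""
    ps = [probs.get(w, unknown) for w in set(text.lower().split())]
    # Most interesting = farthest from the neutral value 0.5.
    ps.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod_spam = prod_good = 1.0
    for p in ps[:top_n]:
        prod_spam *= p
        prod_good *= 1.0 - p
    return prod_spam / (prod_spam + prod_good)


spam = ["buy viagra now", "cheap viagra offer"]
good = ["meeting at noon", "lunch at noon tomorrow"]
table = word_probabilities(spam, good)
print(spam_score("viagra offer", table))   # close to 1
print(spam_score("noon meeting", table))   # close to 0
```

With corpora this tiny every word ends up at one clamp or the other, which is exactly why a large training set matters.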
#include "standard_disclaimer.h"
Base 10 is used as a convenience. Not all areas of study should choose (rather arbitrarily) to use some other base. What would there be to gain? Using base 10, and rarely using anything else, keeps you from having to agree on the base.
.. what is the difference between 3720 base 8 and 3720 base 10?
Tell me
The difference is I don't have to tell you 3720 is MOST LIKELY base 10.
..ok, on that one point it makes sense. If you use text based mail though it won't matter, no urls open anyway. I personally don't use html or script enabled emails, that eliminates at least 90% of the problem there, that and deleting. I maybe get 6 spams a day now whereas years ago I got as many as anyone, hundreds sometimes. All I do is text based/delete spam, it seems to have worked admirably. I don't load remote images, etc. And I have more or less trained email senders to not send me bogus attachments or forwards with all the other recipients CC'ed, etc. by the simple matter of informing them once, and if it persists I stop reading their mail. It's a tough call, but I made it years ago and it's paid off: email is not unmanageable, I get hardly any spam and very few viruses sent to me, and those can't affect me anyway, at least as far as I know. No executables run from just text based email, but perhaps I am wrong on that, I honestly do not know.
All in all I'll sum it up. IF you trust this thing after a suitable "learning period" for it, AND you never waste your time thereafter checking the saved spam, yes, it IS an email spam filter. BUT, IF you open the spam folder to check it-ever, after the learning period- and read all the headers, you've defeated the purpose, at best it's a middle man gee- whizz placebo effect, as any perceived "time" savings is now illusory, and that's the whole point of a filter, yes? To save time wading through the spam and to avoid getting internet cooties sent to you? The emphasis should be on never getting it or getting on spam lists in the first place, filters are locking the barn door after the horse got out.
Just showing that not all tech is useful to all people, here's a prime example. For some folks it's apparently what they think they need, good for them. For others it's nice to read about it a little, but it's irrelevant, as the problem got solved (more or less, generally speaking) long ago by numbers of people the old fashioned but practical and effective high tech way: using human biochemical intelligence over primitive manufactured electronic-only artificial intelligence.
To each their own, no one is correct or wrong in this per se, it's just a matter of taste and priorities. I see this as an overly complicated way to solve a simple problem, for some people, not all. Some folks have no choice; unfortunately they have gotten on so many lists that they are deluged with spam, it happens obviously. Whoops. Others avoid it in the first place and regulate it on an as-needed basis. Two paths to the same destination: they are different from each other, but it is the destination that is important, not the travel there. I could build a robot arm to open the fridge door, add blinkenneonlights, but I don't think I will at this juncture. See?
I'll leave it at that, have fun with it, hope it works for ya'all.
The tool we really need to combat spam is a personal tracking database for spamvertised URLs. The idea would be to put every URL advertised by spam into the database and then send DAILY complaints to the level 1 ISP for the host until either
1. The URL no longer works.
2. The ISP responds with proof that the URL owner filed criminal complaints against the spammer.
I, for one, am thoroughly fed up with the amount of time I have to waste dealing with spam. It's time to make it really painful for any ISP that tolerates it.
An engineer who ran for Congress. http://herbrobinson.us
Seriously, I don't know what algorithms the 10.2 mail client is using, but it's damn good and having a mail client that's really built for IMAP (with POP being more secondary) is awesome.
Heh. People that annoyed Stalin were exiled to Siberia if they were lucky, and were too important to simply kill. :)
"No problem. I have the capacity to do infinite work so long as you don't mind that my quality approaches zero."-Dilbert
Coincidentally, I just implemented a bayesian filter for Qmail, which installed quite easily via the .qmail files.
The corpus lives in a BerkeleyDB database, and, so far "looks" ok -- we'll see how smart the filter becomes.
One thing I've noticed is that for the filter to perform well, I have to leave email in my box which I would normally read and delete, just so the filter can scan it and know that I *want* it, albeit just for a short time.
Here's the link: http://www.garyarnold.com/projects.php#bayespam
1. Yahoo Mail has an interesting way of dealing with spam - you can "report as spam" any message that comes into your inbox. I suspect that they don't have a human reading these, but instead try to match multiple copies of the same e-mail being reported as spam by multiple people. When you have millions of users, if 10,000 report the same e-mail message as spam, it's probably spam. It would be interesting to have an open source program using P2P technology to do the same thing.
2. Like somebody mentioned above, this could be very useful in categorizing helpdesk e-mails, and even providing some canned automatic responses for them. E-mails with the words "forgot", "password", "can't" and "login" would have a very high probability of being about a user who can't log in for some reason, and could be resolved by an automated "HOWTO", saving a company some man hours.
3. I'm going to try to integrate this into our exchange server at work tomorrow if the IT guys will let me mess with it, and if not I'll try to integrate it into my (gasp) outlook exchange client.
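The "many users report the same message" idea in point 1 can be sketched in a few lines. This is only a guess at the general approach, not anything Yahoo is known to do: the class name, the whitespace-and-case normalization, and the threshold are all assumptions, and exact hashing like this would be defeated by spam that varies per recipient.

```python
import hashlib
from collections import defaultdict


class ReportTally:
    """Count distinct users reporting the same message body (hypothetical)."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.reporters = defaultdict(set)  # fingerprint -> set of user names

    @staticmethod
    def fingerprint(body):
        # Collapse whitespace and case so trivial reformatting still matches.
        normalized = " ".join(body.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def report(self, user, body):
        """Record one report; return True once enough distinct users agree."""
        fp = self.fingerprint(body)
        self.reporters[fp].add(user)
        return len(self.reporters[fp]) >= self.threshold


tally = ReportTally(threshold=2)
print(tally.report("alice", "Buy   NOW"))
print(tally.report("alice", "Buy NOW"))    # same user again, does not count twice
print(tally.report("bob", "buy now"))
```

A P2P version would need some way to share the fingerprints and to stop spammers from flooding it with bogus reports, which is the hard part.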
spammers are dl'ing and debugging the code as we speak to figure out loopholes. They've been tossing and turning over legal and technical loopholes ever since spamming^H^H^Hdirect marketing became popular.
Analytic & algebraic topology of locally Euclidean meterization of infinitely differentiable Riemmanian manifold
I get 10+ megs of spam a month on my oldest e-mail account, anyone need some samples? =)
Alari
I use Windows... like a two dollar wh.. why don't I just go ahead and not finish that sentence.
http://www.purifieddata.net
They don't offer an outlook plugin, but they do the site wide filtering, without even needing a box installed at the client location (though that's an option).
Really nice interface to whitelists/blacklists/virus scanner/spam actions/etc. Might be worth checking out.
This whole methodology is already patented by Microsoft. ANY implementation not licensed by Microsoft is going to be a violation... And now that you know, it is treble damages...
patent 6,161,130
You got a funny rating. But I did suggest slashdot use something similar for the moderators to use. Make their life easier, and maybe make the "point" system a lot fairer. I also suggested the same for categorizing incoming submissions for relevance and sorting purposes. Ah the joys of being an AC. :/
I'm using Bogofilter (http://bogofilter.sf.net) and I would like to see an IMAP Server where this can be integrated. This way reclassification becomes a matter of moving to and from a junk Folder.
BTW, the current unstable version 2.50 of SpamAssassin also utilises a Bayesian filter as one of the rulesets. Pretty cool.
And I uploaded a Debian package of it; it got accepted into the archive this morning, so it will probably be available starting tomorrow or even this afternoon.
I hate to mention this, but I will anyways.
Popfile was announced here in late August, shortly after the Paul Graham article came out. It was originally closed source, which prompted the creation of multiple other projects. Among them is Spambayes and even my own Pasp (both in python, both open source).
As well, Popfile was announced open source at the end of September...on Slashdot. I know this because it was released under such a license as I was finishing up Pasp.
So yeah. As for how well Popfile categorizes mail into multiple categories, I have not run many tests with multiple category bayesian filtering, though the Spambayes group has, and has discovered that filtering mail based on multiple categories is far less accurate (many false categorizations). In the minimal tests I have done, I find this to be the case as well (we are used to less than 2% FP and FN rates, and with >2 bin categorization, error rates spike easily into the 10% range).
So yeah. Popfile has been announced here no less than 3 times now. I've not seen Spambayes announced at all (they deserve it), and Pasp has also not been announced, though I could care less about that.
So it's really only sensitive to phraseology.
-WolfWithoutAClause
"Gravity is only a theory, not a fact!"
Annoyance Filter has many tuning and reporting options. It can plot a histogram of junk words. In addition to scanning the message header and body, Annoyance Filter can pull text out of Flash, PDF, and other attachments.
It includes a 180-page PDF manual, mostly the source code presented in literate programming style. The TeX typesetting is beautiful, so turn to page 17 to see Paul Graham's LISP function presented in readable mathematics notation.
* Walker's Hacker's Diet has been discussed on Slashdot here, here, and here.
The patent claims boil down to using a probabilistic classifier to recognize spam. There are many claims, but they're mostly trivial elaborations. Probabilistic classifiers aren't new, and there's no claim they invented them. And it doesn't look like they had to solve any real technical hurdles to apply it. It's one of the most egregiously obvious patents I've seen in a while.
I say there's only one way to test whether an idea is obvious to people skilled in the field, and that's to pose the problem to people skilled in the field and see if they can find the solution. Anything less is a joke.
Not to diss Horvitz and Heckerman -- they're big names in Bayesian inference and Bayes nets. They've been behind a bunch of solid research.
Great, next we'll see slashcode automoderating based on bayesian probability of a troll. Use leetspeak, go to -1, Offtopic.
Please someone tell me I didn't just give them the idea..
"How can you claim that you are anti-crack, while still writing a window manager?" — Metacity README
I recently started collecting spam and ham mboxes to teach the Bayesian spam filter I am planning to install (I haven't decided which one to use yet), and I was interested to receive a spam which appears to be using counter-tactics based on HTML comments. Observe the wily spammer:
* Incre<!--dns-->ase ener<!---->gy and card<!---->iac output<br>
* Turn bac<!--dns-->k your body's biol<!---->ogical time cl<!---->ock 10-20 years<br>
in 6 months of usage !!!<br><br>
You are receiving this email as a subscr<!--catlover-->iber<br>
to the Opt<!--catlover-->-In Ameri<!---->ca Mailin<!---->g Lis<!---->t.
To remo<!--dogsbark-->ve your<!---->self from all related mailli<!--me-->sts,<br>just reply
with off.
The comments are obviously inserted into the middle of high-scoring spam words, and they contain random non-spam words; catlover and dogsbark (two of the strings inserted as comments) are clearly not found in many spam wordlists. This accomplishes two things: it reduces the number of high-scoring words and increases the number of low-scoring words. Pretty devious. Obviously the spammers who live at genemarketmanager.com read Slashdot.
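One possible countermeasure, sketched here under the assumption that the filter tokenizes raw message text: strip HTML comments before splitting into words, so the fragments rejoin into the original spam words. The function name and the token pattern are invented for illustration and are not taken from any of the filters discussed.

```python
import re

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
TOKEN_RE = re.compile(r"[a-z0-9'$-]+")


def visible_tokens(html):
    """Return the words a reader would actually see in an HTML mail body."""
    # Comments are removed with NO replacement so split words join back up.
    text = COMMENT_RE.sub("", html)
    # Real tags such as <br> separate words, so they become whitespace.
    text = TAG_RE.sub(" ", text)
    return TOKEN_RE.findall(text.lower())


line = "* Incre<!--dns-->ase ener<!---->gy and card<!---->iac output<br>"
print(visible_tokens(line))
```

The spammer's filler words (dns, catlover, dogsbark) disappear along with the comments, so they no longer dilute the score either.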
Looks like the arms race has begun!
---Arrrg - I can't seem to post the whole spam without triggering Slashdot's lameness filters (reason: too many junk characters), so I've posted the full message at:
http://www.gamma.net.nz/spam.txt
Note I've changed the email address, but the user is dns, hence all the <!--dns--> tags.
Does biff in bo work
coz it biffin doesn't beep
an if biff in bo is broke
then biff in bo I will delete
I've tried biff in bo with 'y'
I've tried biff in bo with '-y'
no biffin output does it show
so poor wee biff is gonna go.
-- John Spence on debian-user