Sorting the Spam from the Ham
MrClever writes "The Sydney Morning Herald (Aust) is running an article about the merits of Bayesian filtering and a good plain-english description of how it works. Might be handy if you need to explain it to non-technophiles. The main thing that may be useful is a Bayesian spam filter written to drop straight into Outlook 2k/XP available here and written in Python by Mark Hammond."
Math buffs might enjoy reading these pages or browsing this writeup and its many links.
But without spam, I wouldn't get any email!
I get to do something to stop my boss from enlarging his penis anymore... It's really starting to hurt.
I went to battle MC Escher but drew a blank
What happens if Slashdot runs a Bayesian filter which runs a day after the stories are posted and programs itself with all the -1 comments as "Spam" and all the +5 comments as "Ham". Then let the Bayesian filter adjust all incoming messages by up to 2 points.
I bet it'd work - and imagine if we did it to stories too! Maybe it'd reject all Taco's dupe submissions.
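A rough sketch of how that could work, assuming a plain naive Bayes model over comment words. The class name, the tokenizer, the smoothing, and the mapping from probability to a point adjustment are all made up for illustration; nothing here reflects how Slashdot's moderation actually works.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class ModerationFilter:
    """Hypothetical filter trained on -1 comments ("spam") and +5 comments ("ham")."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def train(self, text, label):
        tokens = tokenize(text)
        self.counts[label].update(tokens)
        self.totals[label] += len(tokens)

    def spam_probability(self, text):
        # Laplace smoothing so unseen words do not zero out the estimate.
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"])) or 1
        log_odds = 0.0
        for tok in tokenize(text):
            p_spam = (self.counts["spam"][tok] + 1) / (self.totals["spam"] + vocab)
            p_ham = (self.counts["ham"][tok] + 1) / (self.totals["ham"] + vocab)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-log_odds))

    def adjustment(self, text, max_points=2):
        # Map probability 0..1 onto +max_points..-max_points (assumed mapping).
        p = self.spam_probability(text)
        return round((0.5 - p) * 2 * max_points)
```

Train it a day later on whatever ended up at -1 and +5, then call `adjustment()` on each incoming comment and add the result to its starting score.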
is a scalable popfile for larger organizations. If I could get popfile (with its super-easy-to-train/use web interface) to run on my Linux server, scan my IMAP mail server (well, incoming mail would actually work fine, too; I've heard they have an SMTP plugin for it in CVS), and then have a popfile config page for each person, or maybe tie it into the IMAP/SMTP server's login, THAT would rock. I've heard SpamAssassin does Bayesian, but I couldn't see how it was trainable (and I don't want other people on my server to read each other's mail, obviously).
And if you could, would you really want to?
bloodninja: Baby, I been havin a tough night so treat me nice aight?
BritneySpears14: Aight.
bloodninja: Slip out of those pants baby, yeah.
BritneySpears14: I slip out of my pants, just for you, bloodninja.
bloodninja: Oh yeah, aight. Aight, I put on my robe and wizard hat.
BritneySpears14: Oh, I like to play dress up.
bloodninja: Me too baby.
BritneySpears14: I kiss you softly on your chest.
bloodninja: I cast Lvl 3 Eroticism. You turn into a real beautiful woman.
BritneySpears14: Hey...
bloodninja: I meditate to regain my mana, before casting Lvl 8 Penis of the Infinite.
BritneySpears14: Funny I still don't see it.
bloodninja: I spend my mana reserves to cast Mighty of the Beyondness.
BritneySpears14: You are the worst cyber partner ever. This is ridiculous.
bloodninja: Don't shit with me biznitch, I'm the mightiest sorcerer of the lands.
bloodninja: I steal yo soul and cast Lightning Lvl 1, 000, 000 Your body explodes into a fine bloody mist, because you are only a Lvl 2 Druid.
BritneySpears14: Don't ever message me again you piece.
bloodninja: Robots are trying to drill my brain but my lightning shield inflicts DOA attack, leaving the robots as flaming piles of metal.
bloodninja: King Arthur congratulates me for destroying Dr. Robotnik's evil army of Robot Socialist Republics. The cold war ends. Reagan steals my accomplishments and makes like it was cause of him.
bloodninja: You still there baby? I think it's getting hard now.
bloodninja: Baby?
--
bloodninja: Ok baby, we got to hurry, I don't know how long I can keep it ready for you.
j_gurli3: thats ok. ok i'm a japanese schoolgirl, what r u.
bloodninja: A Rhinocerus. Well, hung like one, thats for sure.
j_gurli3: haha, ok lets go.
j_gurli3: i put my hand through ur hair, and kiss u on the neck.
bloodninja: I stomp the ground, and snort, to alert you that you are in my breeding territory.
j_gurli3: haha, ok, u know that turns me on.
j_gurli3: i start unbuttoning ur shirt.
bloodninja: Rhinoceruses don't wear shirts.
j_gurli3: No, ur not really a Rhinocerus silly, it's just part of the game.
bloodninja: Rhinoceruses don't play games. They fucking charge your ass.
j_gurli3: stop, cmon be serious.
bloodninja: It doesn't get any more serious than a Rhinocerus about to charge your ass.
bloodninja: I stomp my feet, the dust stirs around my tough skinned feet.
j_gurli3: thats it.
bloodninja: Nostrils flaring, I lower my head. My horn, like some phallic symbol of my potent virility, is the last thing you see as skulls collide and mine remains the victor. You are now a bloody red ragdoll suspended in the air on my mighty horn.
bloodninja: Fuck am I hard now.
--
BritneySpears14: Ok, are you ready?
eminemBNJA: Aight, yeah I'm ready.
BritneySpears14: I like your music Em... Tee hee.
eminemBNJA: huh huh, yeah, I make it for the ladies.
BritneySpears14: Mmm, we like it a lot. Let me show you.
BritneySpears14: I take off your pants, slowly, and massage your muscular physique.
eminemBNJA: Oh I like that Baby. I put on my robe and wizard hat.
BritneySpears14: What the fuck, I told you not to message me again.
eminemBNJA:
BritneySpears14: I swear if you do it one more time I'm gonna report your ISP and say you were sending me kiddie porn you fuck up.
eminemBNJA: Oh
eminemBNJA: damn I gotta write down your names or something
"Study your math, kids. Key to the universe." -The Archangel Gabriel
I've now lost one of my primary arguments for switching my colleagues to Mozilla!
Trouble making decisions? Just flip for it.
My own personal account is on a shared server at pair.com, and I run SpamAssassin (the perl script, can't put the spamc/d on there since I'm not root).
I have written on here before how I have saved myself a lot of hassle over the last few months by installing SA. I now stop 100+ messages a day (usually more like 140 now).
My stats tell me that since Feb, I've stopped over 15K Spam messages. Hot damn.
Where I currently work now we have Exchange and I wanted SpamAssassin on there, but we weren't getting the money approved to put it on.
So I hacked in SpamAssassin via an Exchange 2000/2003 EventSink.
If you want the code for it, feel free to grab it from http://www.cardboardutopia.com/ExchangeSpamFilter
But do note that if you have many users on your machine, you aren't going to want to use this: an EventSink on Exchange runs in serial, so SpamAssassin's Perl script (the spamc/d doesn't work under Win32) will get executed on every incoming mail, and Exchange will have to wait until the script is done before it handles the next one.
We process about 2000-5000 incoming messages a day and it does okay, but we have a very light load.
There are some odd things afoot now, in the Villa Straylight.
I use Spambayes with Outlook 2000, and it takes a little tweaking, but it works as advertised. Ahhh, the magic of mathematics. Just now, brought up Outlook, checked my mail and three little messages offering a free Sony headset, 70% off cell accessories, and a chance to take an IQ test just got tossed into my spam folder. Thanks anyway, but I think that means I just passed my IQ test.
Every so often I go in and take out some old, old spam, just to make sure my current preferences are being represented and that's all the maintenance that's required.
This is, however, the second time I've trained the filter. The first time, it incorrectly identified my FreeBSD status mails as spam, and from then on was throwing those into the Spam folder. My own fault, though, since I hadn't included any of these messages in my representative ham.
If you run Outlook, download this filter and use it. You'll be doing yourself, and a world that doesn't need fat-injected, herbally enhanced penises, a favor.
Chr0m0Dr0m!C
I don't know if I'd want it in Python, though... it does seem to be a good deal slower already than other spam filtering methods without putting it in a scripting language. Getting it in Outlook can only be good for the net (can Bayesian be applied to things like spam from Internet virii as well?)
I've been using spambayes for months now and it really is quite amazing. Now, when I get the occasional spam in my mailbox it's actually interesting because I want to figure out why it made it in. The number of false positives is almost nil, and the ones that do get hit are spammy-looking autogenerated receipts from purchases I've made. It's made reading email a much more enjoyable activity.
-Adam
The first discovery I'd like to present here is an algorithm for lazy evaluation of research papers. Just write whatever you want and don't cite any previous work, and indignant readers will send you references to all the papers you should have cited. I discovered this algorithm after ``A Plan for Spam'' [1] was on Slashdot.
Spam filtering is a subset of text classification, which is a well established field, but the first papers about Bayesian spam filtering per se seem to have been two given at the same conference in 1998, one by Pantel and Lin [2], and another by a group from Microsoft Research [3].
When I heard about this work I was a bit surprised. If people had been onto Bayesian filtering four years ago, why wasn't everyone using it? When I read the papers I found out why. Pantel and Lin's filter was the more effective of the two, but it only caught 92% of spam, with 1.16% false positives.
When I tried writing a Bayesian spam filter, it caught 99.5% of spam with less than .03% false positives [4]. It's always alarming when two people trying the same experiment get widely divergent results. It's especially alarming here because those two sets of numbers might yield opposite conclusions. Different users have different requirements, but I think for many people a filtering rate of 92% with 1.16% false positives means that filtering is not an acceptable solution, whereas 99.5% with less than .03% false positives means that it is.
So why did we get such different numbers? I haven't tried to reproduce Pantel and Lin's results, but from reading the paper I see five things that probably account for the difference.
One is simply that they trained their filter on very little data: 160 spam and 466 nonspam mails. Filter performance should still be climbing with data sets that small. So their numbers may not even be an accurate measure of the performance of their algorithm, let alone of Bayesian spam filtering in general.
But I think the most important difference is probably that they ignored message headers. To anyone who has worked on spam filters, this will seem a perverse decision. And yet in the very first filters I tried writing, I ignored the headers too. Why? Because I wanted to keep the problem neat. I didn't know much about mail headers then, and they seemed to me full of random stuff. There is a lesson here for filter writers: don't ignore data. You'd think this lesson would be too obvious to mention, but I've had to learn it several times.
Third, Pantel and Lin stemmed the tokens, meaning they reduced e.g. both ``mailing'' and ``mailed'' to the root ``mail''. They may have felt they were forced to do this by the small size of their corpus, but if so this is a kind of premature optimization.
Fourth, they calculated probabilities differently. They used all the tokens, whereas I only use the 15 most significant. If you use all the tokens you'll tend to miss longer spams, the type where someone tells you their life story up to the point where they got rich from some multilevel marketing scheme. And such an algorithm would be easy for spammers to spoof: just add a big chunk of random text to counterbalance the spam terms.
Finally, they didn't bias against false positives. I think any spam filtering algorithm ought to have a convenient knob you can twist to decrease the false positive rate at the expense of the filtering rate. I do this by counting the occurrences of tokens in the nonspam corpus double.
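The last two points (use only the 15 most significant tokens, count nonspam occurrences double) can be sketched roughly as follows. This is a minimal illustration, not the author's actual code: the function names, the 0.4 default for rarely seen tokens, the minimum count of 5, the 0.01/0.99 clamps, and the way the per-token probabilities are combined are all assumptions made for the example.

```python
from collections import Counter


def token_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate the probability that a message containing `token` is spam."""
    g = 2 * ham_counts[token]  # nonspam occurrences counted double (the bias knob)
    b = spam_counts[token]
    if g + b < 5:
        return 0.4  # assumed default for tokens seen too rarely to trust
    spam_ratio = min(1.0, b / n_spam)
    ham_ratio = min(1.0, g / n_ham)
    p = spam_ratio / (spam_ratio + ham_ratio)
    return max(0.01, min(0.99, p))  # assumed clamps so no token is decisive alone


def spam_score(tokens, spam_counts, ham_counts, n_spam, n_ham, top_n=15):
    """Combine only the `top_n` tokens whose probability is farthest from 0.5."""
    probs = [
        token_probability(t, spam_counts, ham_counts, n_spam, n_ham)
        for t in set(tokens)
    ]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod_spam = 1.0
    prod_ham = 1.0
    for p in probs[:top_n]:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)
```

Here `spam_counts` and `ham_counts` are `Counter` objects of token occurrences in each corpus, and `n_spam` and `n_ham` are the number of messages in each (both must be nonzero). Because only the strongest tokens are combined, a block of neutral filler text added to a spam changes the score very little.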
I don't think it's a good idea to treat spam filtering as a straight text classification problem. You can use text classification techniques, but solutions can and should reflect the fact that the text is email, and spam in particular. Email is not just text; it has structure. Spam filtering is not just classification, because false positives are so much worse than false negatives that you should treat them as a different kind of error. And the source of error is not just random variation, but a live hum
Would you use the phone if you had to listen to a 10-second brothel advertisement every time you made a call?
Yes.
Definitely Yes.
Is that a feature I can have added?
http://use.perl.org
Eudora 6.0 beta has spam filtering which seems to be Bayesian. It's a little slower to learn than PopFile, but it's pretty good so far, and of course integrated with the Eudora UI.
http://eudora.com/betas
I sat on the E-Mail policy team (a branch of the Strategic Planning team) for Miami University (Oxford, OH, not Florida) this last year (as a technical advisor, student and support desk employee). We looked at all sorts of spam solutions, as the president decided this should be a main focus (apparently the Viagra ads hit a bit too close to home for comfort ;)).
The problem in the educational market, though, is that, not being a business that can make rules and force people to live by them, educational establishments sometimes end up with annoyed customers (students and faculty) if anything is blocked as spam (research, etc.). False positives absolutely can't be tolerated. So a ranked system (SpamAssassin) that suggests the possibility of spam is not just the best but the only solution we have available. Mail will be ranked, and users can make rules that trash everything but guaranteed-perfect mail, if they so desire. Or they can leave it all alone. So intelligent filtering is a necessity, not just a benefit.
On another page, I had an odd place during this discussion of the team. I do not receive spam. (Please, don't start now). My MUOhio.edu address simply doesn't get a single piece of spam e-mail. I have had the account for two years. I have over 3000 messages in various folders. And none are spam at all. I just haven't signed up for anything with it. I put the e-mail addy on webpages too (that I author) and haven't gotten a single thing. But oh my the trash "spam" account gets 60 a day. On AOL. That blocks 80% of incoming mail. Ironically, they had MUOhio.edu blocked weeks back.
I haven't posted in so long, my sig is out of date.
In case anyone hasn't tried it yet, the Bayesian filters in the mail client of the Mozilla suite are really impressive. They have worked close to flawlessly for me.
I wrote an article on how to set up SpamProbe on a server, and make it easy to train. You could also use Bogofilter or any other trainable spam filter, set up the same way.
I get at least 100 spam messages a day now, and I only see about a half-dozen or so. SpamProbe deals with the rest, and I don't have any problems with false positives. (SpamAssassin thinks that ads for LinuxWorld Expo are spam, but as I have it trained, SpamProbe doesn't.)
steveha
lf(1): it's like ls(1) but sorts filenames by extension, tersely
I use PopFile. What I like about it is that it easily lets me use multiple personalities in Eudora, Outlook or any other mail client. Nice web based interface and a very active development community.
You can run it locally on Windows or Linux. But, you can also set it up on a server and then use it to filter e-mail from multiple client machines. That's what I like about it. I have a home machine in my basement office but also upstairs in the TV room. Unlike plug-ins that only work locally, I can have my reclassification decisions apply to multiple client machines.
Right now, they do not have multiple user capabilities so that my wife and I can both use the same instance and not have our classifications interfere with each other. However, you can set up multiple instances bound to different ports. The developers list multi-user capability as a priority.
Worth checking out along with the other choices.
For an article in an "IT tech" section of a paper, this is really very weak.
It really doesn't do much more than precis Paul Graham's arguments, then ends in a blatant plug for just one Outlook addon.
I suppose if there are still people in the column's audience who haven't heard this all before, and it gets the message out that spam can be effectively filtered, it's a minor goodness.
I've noticed that the spam that has been getting through my Mozilla filter are the ones with innocuous sounding subjects and an embedded image.
Could this be the future of spam?
Does anyone know if any spam filters pick up on this pattern or lack of pattern? (After all, there are usually no words in the body.)
Having used the spam filtering built in to Mozilla for the last six months, I can testify to its effectiveness. In very little time at all, I'd trained it to send 95% of the filth to the spam directory and avoid doing the same for 95% of good mails. For me, not having to run a "middle man" piece of software was a real boon.
However, my life isn't totally spam free, as I find that I become neurotic about those 5% false positives that get unhelpfully moved to the spam directory, so still end up having to sift through the grot every once in a while. On the plus side, I now have a solution to my tiny cock problem, I've arranged cheaper home insurance and I have the email address of several horny co-eds who I'm assured are hungry for man juice.
http://www.davetansley.com - you proba
As I wrote only late last night, using Bayesian classification with only two categories (spam and "non-spam") is somewhat short-sighted, since if properly trained, a Bayes classifier can do a much better job than ordinary mail filtering (procmail, Mozilla or Mail.app filters, you name it).
In fact, if I had to bet on the next "killer apps", mail sorting and RSS filtering based on Bayesian classification would be right at the top of my list, based solely on the actual time-saving benefits for users. And I can't see any reason for Bayesian filtering not being included in Mozilla Mail and Apple's own (revamped) Mail.app.
I have to use Outlook at work, and after setting up Outclass (which requires POPfile) with several "buckets" to classify my corporate e-mail by project and field, I'm definitely not going back. Outlook, even with extensive use of Rules Wizard and categories, simply cannot cope with the diverse kinds of project-related e-mail I swap with colleagues, and Outclass is the only thing I could find that could deal with Exchange, PST folders and multiple Bayesian "bucket" categories.
Come on, do the right thing and tell Apple and The Mozilla Project that you want configurable Bayesian filtering on their mail clients.
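To make the "more than two categories" point concrete, here is a minimal sketch of a naive Bayes classifier with an arbitrary number of buckets. It is not how POPfile or Outclass is implemented; the class name, tokenizer, and smoothing are assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class BucketClassifier:
    """Hypothetical multi-bucket naive Bayes mail sorter."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> token counts
        self.message_counts = Counter()          # bucket -> messages trained

    def train(self, bucket, text):
        self.word_counts[bucket].update(tokenize(text))
        self.message_counts[bucket] += 1

    def classify(self, text):
        """Return the most likely bucket, or None if nothing has been trained."""
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        vocab_size = len(vocab) or 1
        total_messages = sum(self.message_counts.values())
        tokens = tokenize(text)
        best_bucket, best_score = None, -math.inf
        for bucket, n_messages in self.message_counts.items():
            counts = self.word_counts[bucket]
            total_words = sum(counts.values())
            # Log prior plus Laplace-smoothed log likelihood of each token.
            score = math.log(n_messages / total_messages)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket
```

Spam is then just one bucket among several, and reclassifying a misfiled message is a single `train()` call with the right bucket name.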
I hate spam just as much as the next person, but I must admit, without it I wouldn't be the horse-sized love stud that I am. Thanks spam.
I have been using the Mozilla junk mail filter for a couple of months now. One pop mail account is one that I started using in 1996. It is a spam magnet. In the time I have been using Mozilla, it has accumulated over 12,000 spam messages. That should be plenty of training for the Bayesian filter.
Mozilla's filter does a reasonably good job at catching spam, but I still get a handful of messages every day that slip through the filter. The ones that slip through seem to be messages that have intentionally munged the spammy words with spaces, numbers, and misspellings. The spammers know that people are filtering, and they are successfully getting through the filter with their dirty tricks. Another trick spammers use is to send a message with nothing but a graphic ad. The filter doesn't have enough words to judge the spam, so the message slips through.
I also had some 'ham' messages get filtered, so I still have the annoyance of having to check the 'junk' folder periodically for wanted messages. The filtering makes life easier, but it is still not an ideal solution to the spam problem.
Much of the spam that gets past it is so minimalist it cannot be blocked by a Bayesian filter. I get messages like this:
It's like someone is trying to put so little in the message that there is nothing to filter. If only they would use the stock "We are sending you this because you opted-in on it. Click on this link to remove your address." If they used that, I'd never see the message; SpamProbe would grab it. But how could I train SpamProbe to detect the minimalist ones, without blocking everything forever?
So far I don't get too many of the minimalist ones, and I just hit delete. If it becomes widespread, I'll have to start using Vipul's Razor or something.
The other kinds of spam that get past SpamProbe are the ones that have rampant misspellings. Since none of the words are in the database, they don't match as spam terms:
I really think that I should write a filter that spell-checks an email, and rejects it if over 50% of the words with 5 or more letters are misspelled.
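A minimal sketch of that idea, assuming the dictionary is just a set of lowercase known words supplied by the caller (a real filter would load a proper word list). The function names are hypothetical; the 5-letter minimum and the 50% threshold come from the description above.

```python
import re


def misspelled_fraction(text, dictionary, min_length=5):
    """Fraction of words with at least `min_length` characters not in `dictionary`."""
    # Keep letter/digit runs together so munged words like "m0rtgage" stay whole,
    # but skip tokens with no letters at all (plain numbers are not misspellings).
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    words = [
        t.lower()
        for t in tokens
        if len(t) >= min_length and re.search(r"[A-Za-z]", t)
    ]
    if not words:
        return 0.0
    misspelled = sum(1 for w in words if w not in dictionary)
    return misspelled / len(words)


def looks_munged(text, dictionary, threshold=0.5):
    """True when more than `threshold` of the longer words are misspelled."""
    return misspelled_fraction(text, dictionary) > threshold
```

One caveat: a message that is all proper names, jargon, or another language would also trip this, so it probably works better as one signal among several than as an outright reject rule.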
steveha
lf(1): it's like ls(1) but sorts filenames by extension, tersely
Suppose
1. I have a friend who uses the same kinds of words as I do and who uses Outlook (ok, an acquaintance, because friends don't let friends ...)
2. An email virus attacks this person, snarfs up his Ham, runs a Bayesian filter on it and comes up with Spam specifically tailored for this person's acquaintances.
There's a science fiction book waiting to happen in here somewhere. If so, I own the SCOpyright on it.
about this kind of filtering is that it has to download the email content - not always a good idea, especially in a Windows environment. Besides, I can identify spam just by looking at message header information. Sender, recipient, and subject line are nearly always enough. Plus I don't need to waste time or bandwidth, get subjected to offensive graphics, or risk 1-pixel confirmations or getting hacked by the latest security issue. My homespun message header analysis program drops nearly all spam, and results in few legit email rejections. I score the headers based on missing recipient, sender info, keywords in subject, string match in sender email or name, punctuation count in subject line, number of continuous spaces in subject line, plus a few other things that seem to run common in the spam I get. I can also permit certain email addresses to pass no matter the score. It's not fancy, but it works, and I never have to waste time drawing the whole content down to my local machine. What I do may not work for everyone, but it seems that in most cases it should, unless you get a lot of email from unknown (non-spam) sources - not typical for the average email user.
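A header-only scorer along those lines might look something like the sketch below. It is not the poster's program: the keyword list, the whitelist address, the weights, and the punctuation and spacing cutoffs are all invented for illustration, and it covers only some of the signals mentioned.

```python
import re
from email import message_from_string

# Hypothetical settings; tune these to the spam you actually receive.
SUBJECT_KEYWORDS = ("free", "viagra", "mortgage")
WHITELIST = {"boss@example.com"}


def header_score(raw_headers):
    """Score a message from its headers alone; higher means more spam-like."""
    msg = message_from_string(raw_headers)
    sender = (msg.get("From") or "").lower()
    recipient = msg.get("To") or ""
    subject = msg.get("Subject") or ""

    # Whitelisted senders pass no matter what else the headers say.
    if any(addr in sender for addr in WHITELIST):
        return 0

    score = 0
    if not recipient.strip():
        score += 2  # missing recipient
    if not sender.strip():
        score += 2  # missing sender info
    score += sum(2 for kw in SUBJECT_KEYWORDS if kw in subject.lower())
    if len(re.findall(r"[!?$*]", subject)) >= 3:
        score += 1  # heavy punctuation in the subject line
    if re.search(r" {3,}", subject):
        score += 1  # long run of spaces in the subject line
    return score
```

A caller would drop anything above some cutoff (say 4, also an assumption). With a POP3 account the headers can be fetched without the body using the TOP command, which Python's `poplib.POP3.top(which, 0)` exposes.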
this is like inventing something as useful as the Knife, and using it only to attack salesmen. Why bother stopping with spam? Why not apply this filter to, say, absolutely everything? Since I just said "absolutely everything", I won't bother giving examples.
Training something to know how likely something is to be true, that sounds too useful to waste any time with on spam at all.
-- 'The' Lord and Master Bitman On High, Master Of All
Mark Hammond then wrote the Outlook plugin, which, admittedly, is considerably more code than SpamBayes, but is not SpamBayes itself.
For the complete background on why SpamBayes is so good at what it does, and its history, see:
- SpamBayes Background
Mark's is not the only application frontend for SpamBayes; here is a list of others:
- SpamBayes Applications
No apologies for this my pedantry offered.
"The truth shall make ye fret" -- The Truth, Terry Pratchett
At work I have Outlook always running with the excellent bayesian FREE filter Spammunition www.upserve.com. I also do check the mailbox from home over a dial-up connection.
If I didn't use Spammunition, I would spend a lot of time downloading spam messages; as it is right now, I get just the ham (several messages instead of many).
Serban
I use spamnet by cloudmark. It catches everything. I can't remember the last time I had to click the "block" button. I'm very conscious of where my email ends up and I'm a hardcore advocate of email aliases. As a result, since September (last major crash), spamnet has blocked 4000 pieces while I've actively blocked only 11.
That's pretty f'n good in my book. So good, in fact, that I send all blocked messages to the "Delete" folder instead of the default "spam" folder and set outlook to permanently delete on close.
I have two concerns about this program:
--Money. They are now charging and pretty much deserve it from the average user.
--Reliability. This company could disappear tomorrow and sell off the server that has compiled spam data.
Since mathematics isn't going anywhere, I'm leaning towards switching to an open source Bayesian alternative but, as mentioned above, all my spam gets thrown out the door on contact.
What is the approximate training time of a Bayesian filter?
Laws are for people with no friends.
Bayes rocks, been using it with spamassassin and it kills 99% of my spam. The problem is when some asshole spammer uses my email address in the 'From' header of his spam ... then I get scores of 'user not found' or 'virus detected' emails from legitimate mail servers ... it's not spam, but it's just as annoying. How do you guys deal with this problem?
(Score:-1, Wrong)
I think Tom Mitchell did a good job in explaining the math in his book Machine Learning. It's a very pricy book, so maybe you can look for a used copy.
Turn off html mail for Outlook and help keep them from validating your address through this method.
Place these two keys in .reg files of their own and be able to quickly switch between viewing html and plain text mail. taah dahhh!
[HKEY_CURRENT_USER\Software\Microsoft\Office\10.0\Outlook\Options\Mail]
"ReadAsPlain"=dword:00000001
OR to turn it back on and view those pretty pictures
[HKEY_CURRENT_USER\Software\Microsoft\Office\10.0\Outlook\Options\Mail]
"ReadAsPlain"=dword:00000000
A Bayesian filter that reads personal ads, compares them to ads posted by women who are KNOWN to have been "easy" (on a sliding scale, configurable, ranging from "mildly slutty" to "dangerously psychotic nymphomaniac"), and returns a list of likely phone numbers.
Hell, I'd pay MONEY for a piece of software THAT good (Hmm, clickety-click, select "nymphomanic", enter search site... Ah! This one has an oral fixation! Thank you, Mr. Bayes!).
Farewell! It's been a fine buncha years!
You guys are a bunch of hypocrites. You don't really want spam to stop. You love spam.
Every spam thread is the same: I use X, and it blocks 98% of my spam, with no false positives! I use Y, and it blocks 99.9% -- take that! Here, I use Z + Y with these custom Perl scripts I wrote that interface with procmail and stop 101% percent of spam! It doesn't matter, because I never get ANY spam! Spam is only because people buy things in spam! What morons! Bow before me, for I am 1337!
Spam gives you something to fight. Spam gives you an excuse to solve an interesting technical problem (i.e. separating spam from ham). Spam gives you a reason to boast. Spam gives you people to dislike.
Admit it.
You love spam.
1. Use Debian
2. apt-get install spamassassin
3. apt-get install hotway
4. Add this to your /etc/inetd.conf: pop3 stream tcp nowait nobody /usr/sbin/tcpd /usr/bin/hotwayd
5. Switch to Kmail
6. Menu: Settings|Configure Filters
7. Add first filter.
a. Select Match Any of the following
b. Select size 250000
c. Filter action: PIPE THROUGH spamassassin
8. Add second filter
a. Select 'Match any of the following'
b. Type 'X-Spam-Flag' (no quotes)
c. Select equals. Type 'YES'
d. Filter action: Move to folder [your spam folder]
9. It's crucial that the second filter happens after the first (use the arrows to the left).
There you have it - a spam-free Hotmail account. Not quite setup.exe, but this is Linux after all.
I DON'T LIKE SPAM! I DON'T LIKE SPAM! I DON'T LIKE SPAM!
I've just been migrated to Notes from Outlook. Not a happy bunny till I discovered how powerful it is with stuff like agents.
The only thing I'm missing now is a spam classification tool like popfile for notes.
Government of the people, by corporate executives, for corporate profits.
As for your indictment of spam filtering providers, could you please explain where the spamassassin devteam is making money?
My choices with regards to spam at the moment are simple. Use spamassassin or something like it, or wade through spam myself. I know which I'd prefer.
Any sufficiently advanced technology is indistinguishable from a rigged demo
--Andy Finkel (J. Klass?)