How Apple's Mail.app Junk Filter Works
fmorgan writes "O'Reilly has now posted the second part of an article about Mac OS X Mail.app spam filtering with more details on what this technology is (and isn't): 'Many myths have emerged about Mail's junk mail filter. No, it's not an extremely complex set of rules, no it doesn't look for keywords, and no, it doesn't use white magic ... Interestingly enough, the technology that underlies the Junk Mail filter began its life as an information retrieval system.'"
Could Microsoft learn a lesson here? Especially in light of this hole, which lets a spammer clearly see that you have opened their messages and validate your address...
bash: rtfm: command not found
Each document is in turn represented by a long string of numbers, one for each word in the corpus. In mathematical terms, we would say that every document is a vector of n numbers or a point in a space with n dimensions. I know it sounds quite geeky but if you can visualize that, you're halfway there.
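To make the quoted idea concrete, here is a minimal Python sketch of turning documents into count vectors, one number per word in the corpus. The whitespace tokenization and the helper names are assumptions for illustration; the article does not show Mail's actual code.

```python
from collections import Counter

def build_vocabulary(documents):
    """Return a sorted list of every distinct word in the corpus."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def to_vector(document, vocabulary):
    """Map a document to a list of word counts, one entry per vocabulary word."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

# Toy corpus (made up for the example).
corpus = ["cheap pills cheap", "meeting notes attached"]
vocab = build_vocabulary(corpus)
vectors = [to_vector(doc, vocab) for doc in corpus]
```

With n distinct words in the corpus, every document becomes a point in n-dimensional space; most coordinates are zero because any one message uses only a few of the words.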
Ah, it uses vector math. With Altivec, no wonder Mail is so damned fast.
The other really interesting thing about Mail is that it implements clustering algorithms to rank and group messages, which makes me wonder why more GIS software is not running on OS X. Image classification would be a no-brainer for folks that spend their time examining images and multispectral datasets.
Visit Jonesblog and say hello.
Now we understand why Apple is so good at doing full text searches and filesystem wide searches. I wish we had the same type of search functionality in Mozilla that Mail.app boasts of.
That is the one feature that Mozilla's mail client really could use.
According to the FAQ of SpamBayes (I think), they're always getting suggestions of ways to tweak their algos that would "obviously" improve the result, but in almost every case it either makes no difference or hurts accuracy, when actually tested on real data.
Wow, the article just turned me on to the Summary Service. And I just used it to read a short and sweet summary of the article.
If you haven't played with it, select a bunch of text (in a Cocoa app) and select Summary from the Services menu.
Very cool...
but it's a whole lot better with junkmatcher central
Creationists are a lot like zombies. Slow, but powerful and numerous. And they all want to eat our brains.
I have marked every single announcement and special offer I've ever received from Apple as junk, and yet the filter still refuses to classify them as such automatically.
I wonder if there's a loophole here that spammers could take advantage of: masquerade as Apple using the hole they've left in their filter. Spam Mac users to your heart's content. Bundle a Mac virus along with it for extra damage.
Please don't mod this down just because you like Macs. I like Macs too, but it really looks like there is a back door in the spam filter and I'm just reporting it - not mac bashing.
Actually, from my understanding of it, it's fairly different.
I thought Mozilla used Bayesian filtering (which you've mentioned), where each word in the email gets assigned a probability factor of being spam. These factors are totaled at the end; if the total is greater than some predefined value, the message is flagged as spam.
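The scoring described above can be sketched roughly as follows. This is a sketch of the commenter's description, not Mozilla's code: the probability table is made up, and summing log-odds is used here because, under the naive independence assumption, it is equivalent to multiplying the per-word probabilities together.

```python
import math

# Hypothetical per-word spam probabilities, as if learned from training mail.
SPAM_PROBABILITY = {"viagra": 0.99, "free": 0.80, "meeting": 0.05, "notes": 0.10}

def spam_score(message, default=0.5):
    """Sum the log-odds of each word; unknown words use `default` (0.5 adds nothing)."""
    score = 0.0
    for word in message.lower().split():
        p = SPAM_PROBABILITY.get(word, default)
        score += math.log(p / (1.0 - p))
    return score

def is_spam(message, threshold=0.0):
    """Flag the message when the total exceeds the predefined threshold."""
    return spam_score(message) > threshold
```

Here `is_spam("free viagra")` is true and `is_spam("meeting notes")` is false; the threshold of 0.0 is an arbitrary choice for the example.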
What this does (in my understanding) is count the number of occurrences of each word in every email and store that in a huge table. Then it relates messages to each other based on these word counts. So it's like you get email clusters in N-dimensional space, where each axis is a word, and an email's position on that axis is the number of times the email uses that word. Then the clusters that have a lot of spam in them are marked as spam clusters. All the emails in such a cluster are then assumed to be spam.
The advantage of this method, I would suppose, is twofold:
A) When you reduce the N-dimensional space, you would start by eliminating noise words (i.e. words that only occur in a single email). Spam emails that put fake words in to lower their spam probability under the Bayesian method would not benefit with this method.
B) Messages are grouped by content, so it's possible that the client could group email by a common subject, kind of like automatic intelligent sorting. They do mention that this technology can be used to generate email summaries. So (in theory) not only could spam be sorted out, but so could any other key topics, like work, relatives, viagra purchases...
At least that's my understanding of it.
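The noise-word step from point A can be sketched like this. It is only an illustration of dropping words that appear in a single email; whether Mail reduces its dimensions this way is the commenter's guess, and the function name and sample emails are made up.

```python
from collections import Counter

def prune_singleton_words(emails):
    """Keep only words that occur in more than one email."""
    doc_freq = Counter()
    for email in emails:
        # Count each word once per email (document frequency, not raw count).
        doc_freq.update(set(email.lower().split()))
    return sorted(word for word, n in doc_freq.items() if n > 1)

emails = ["buy pills now xqzt", "buy pills today", "lunch today"]
kept = prune_singleton_words(emails)
```

The random filler token `xqzt` appears in only one message, so it never becomes an axis and cannot pull that message away from the other "buy pills" mail.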
Kickin it Apple Old School.
Where would we be if Wheel had hid her round rock in a cave instead of showing everyone how it rolls?
Here's the problem I have with Mail.app's spam filtering:
I have several Macs and an IMAP server. The simple fact is that Mail.app doesn't share the filtering database between them, so the training winds up being sort of haphazard.
I suppose I should designate a particular machine to be the spam filtering IMAP client and have the rest of them not participate, but then I can't train on those subservient machines.
It'd be much better if multiple Mail.app IMAP clients could store their database on the server and share it.
Sorry, but I couldn't let this one slide. You've obviously got a special interpretation of "fast", because I tried migrating my Eudora mailboxes to Mail, on a 1 GHz PowerBook G4.
Mail CHOKED on them. The early version of Mail chugged for two-something hours before I gave up and killed it. The latest version was slightly better; 1000 messages or so still took well over 10 minutes. It takes Eudora about 10 seconds to rebuild those big mailboxes (deleted messages aren't actually deleted until Eudora gets around to rebuilding the mailbox; you can set the limit based on percentage of the mailbox, raw MB, I think even % remaining disk space), or you can force it manually with one click in that mailbox's window. My inbox is 820, and several mailing list boxes are well over 5,000 if I forget to clean them out. I have hundreds of MB of mail, and Eudora handles most operations with little performance hit no matter how big the mailbox gets (there is a limit of around 32,000 messages however, which someone I know hit).
But that was just the importing; then it had to thread them or something, and THEN it had to index them all, both of which it did in the background, but it still took forever.
Searching? Well, ok, it's "better" than Eudora in that it gives relevancy and Eudora is an on/off sorta deal, but that's fine, and I prefer 1 second for an exact search in a 2,000 message mailbox over 5-10 seconds for a fuzzy search.
Sorry, but Eudora, despite being a lumbering dinosaur technology-wise (MIME support is broken, PGP-MIME just doesn't work right, and the lack of address book integration is another thing that really irritates me), is just plain hands-down the fastest mail client around.
The MBOX-with-index format also works exceedingly well, is portable (although some minor massaging with text-processing tools may be needed in some cases), and is hard to corrupt, unlike almost every other mail client's DB (especially Outlook). I've used Eudora for ten years, and never lost a single message except for one early beta version which munged a mailbox on me.
Please help metamoderate.
The big problem I see in spamland today isn't the classification technology. It's the word recognition problem. Sure, "VIAGRA" may be deeply embedded in a "spam" cluster, but what about "V1_4G ra"? If spammers weren't disguising their words, I think that Bayesian filtering and other techniques would work fine. I'm not really sure that more advanced techniques in word classification are really needed here.
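As a rough illustration of the word recognition problem, here is a naive normalizer that undoes a few common character swaps before matching. The substitution table and the watchlist are assumptions for the example, and squeezing out all non-letters will produce false positives on real mail, so treat it as a sketch rather than a working defense.

```python
import re

# Hypothetical look-alike substitutions; real spam uses many more.
LEET = str.maketrans({"1": "i", "4": "a", "0": "o", "3": "e", "@": "a", "$": "s"})

def normalize(text):
    """Lowercase, map look-alike characters to letters, then drop everything else."""
    return re.sub(r"[^a-z]", "", text.lower().translate(LEET))

def contains_disguised(text, watchlist=("viagra",)):
    """Return True if any watchlist term survives inside the squeezed text."""
    squeezed = normalize(text)
    return any(term in squeezed for term in watchlist)
```

With this table, `normalize("V1_4G ra")` gives `"viagra"`, so the disguised token lands back on the same word as the plain spelling.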
I had emails out to every link in the chain, but no one knew what was going on.
In Apple Mail, I had my 'reply to' names set to my email addresses. I changed them to short descriptive names and now they're not bouncing anymore. (Odd error, so I thought I'd post it.)
Why this started all of a sudden, and why no host or ISP had heard of this before, I don't know.
I do know that being on a blacklist and attempting to get off of it is nigh impossible, so I'd be all over Apple making spam filtering software so overzealous wizards of blacklists can be kicked to the curb. (Why is this in use anywhere..?)
You mean these "Vectors" (sounds foreign) are watching everything in my email?!!
Well, if that isn't a gross invasion of privacy then my name's not Liz Figueroa.
I'm drafting a letter to the Senate immediately... on a typewriter.
In short: a vector is the result of a calculation based on the number of times a term is used in a document and the terms in the other documents it is being compared with (the document set).
The angle between document (email) vectors is a representation of their likeness. For example, if the angle is very small, the documents have a lot in common.
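The angle comparison can be sketched in a few lines of Python. The vectors here are plain lists of term weights; the function names are made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def angle_degrees(a, b):
    """Angle between the vectors; the clamp guards against rounding just past 1.0."""
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.degrees(math.acos(cos))
```

Two documents with the same word proportions, such as `[1, 2]` and `[2, 4]`, come out at an angle of about zero regardless of length, while documents sharing no words sit at 90 degrees.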
This is how the Mail app works. It compares known junk emails (i.e. the query) to the incoming document set (new emails).
There are a number of weighting schemes, for example Term Frequency Weights (TF Weights) or Term Frequency Inverse Document Frequency (TF-IDF Weights).
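Here is a minimal sketch of one common TF-IDF variant (raw term count times the log of total documents over documents containing the term). Other variants exist, and nothing here says which weighting Mail uses; the sample documents are made up.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # each document counts once per word
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf})
    return weights

docs = ["the cheap pills", "the meeting notes", "the cheap flights"]
weights = tf_idf(docs)
```

A word that appears in every document ("the") gets weight zero, while a word unique to one document ("pills") gets the highest weight in it.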
There are a few laws particularly relevant to Information Retrieval. Heaps' Law: the larger a document gets, the fewer new words are added to it.
http://planetmath.org/encyclopedia/HeapsLaw.html
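For reference, Heaps' Law is usually written as the formula below, where V(n) is the number of distinct words seen after n words of text. K and β are empirical constants fitted to the corpus; the symbols are the conventional ones and are not taken from the comment above.

```latex
V(n) = K \, n^{\beta}, \qquad 0 < \beta < 1
```

Because β is below 1, the vocabulary grows more slowly than the text itself.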
Zipf's Law: more relevant to document weighting schemes. It states that a word's frequency is roughly inversely proportional to its frequency rank, so a handful of words account for most of the text. The most frequent ones, stop words such as "a, the, it, and, is", carry little meaning.
http://planetmath.org/encyclopedia/ZipfsLaw.html
Less frequently used words in a document are better at describing its content. For example, "pixel intensity mathematical concepts".
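A quick way to see the rank/frequency pattern is to build the table yourself. This sketch ranks words by count; the sample sentence is made up and far too short to show a real Zipf curve, but it shows the shape of the data.

```python
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(text.lower().split())
    # Sort by descending count, breaking ties alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]

sample = "the cat and the dog and the bird saw the pixel"
table = rank_frequency(sample)
```

The top ranks go to "the" and "and", which say nothing about the content, while the words that appear once ("pixel", "bird") are the descriptive ones.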
-- Agent
The sender would just receive a message from the mail server saying that their mail was marked as spam.
Sadly, if it is spam, then you'll be punishing thousands of innocent people whose email addresses have been forged by the spammers, by sending them the bounce messages. Very little actual spam gets past my Bayesian filters, but I do get a lot of bounces from other people's spam filters for messages and viruses that I never sent.