How Apple's Mail.app Junk Filter Works
fmorgan writes "O'Reilly has now posted the second part of an article about Mac OS X Mail.app spam filtering with more details on what this technology is (and isn't): 'Many myths have emerged about Mail's junk mail filter. No, it's not an extremely complex set of rules, no it doesn't look for keywords, and no, it doesn't use white magic ... Interestingly enough, the technology that underlies the Junk Mail filter began its life as an information retrieval system.'"
and no, it doesn't use white magic...
Black, then?
Or is that reserved exclusively for Microsoft?
The coolest voice ever.
Microsoft can learn a lesson here? Especially in the light of this hole, from which a spammer can clearly see that you have opened their messages and validate your address...
bash: rtfm: command not found
Each document is in turn represented by a long string of numbers, one for each word in the corpus. In mathematical terms, we would say that every document is a vector of n numbers or a point in a space with n dimensions. I know it sounds quite geeky but if you can visualize that, you're halfway there.
Ah, it uses vector math. With Altivec, no wonder Mail is so damned fast.
The other really interesting thing about mail is that it implements clustering algorithms to rank and group which makes me wonder why more GIS software is not running on OS X. Image classification would be a no brainer for folks that spend their time examining images and multispectral datasets.
Visit Jonesblog and say hello.
it's simple. it uses its extremely uninspired app name to scare away spam.
The "Insert Quote Here" line is almost as predictable as inserting an actual quote.
The article mentions...
"In mathematical terms, we would say that every document is a vector of n numbers or a point in a space with n dimensions."
Funny. When I took linear algebra I was wondering if there was a practical application for this, and I guess there is... to eliminate penis enlargement advertisements.
Yes! I listen to NYC Speedcore and do math at 3AM. I suggest you try it too.
I believe I remember reading somewhere that the same sort of vector/clustering calculations are used in face recognition software?
Just goes to show how solid math/calculations can have some useful applications!
Why wouldn't a similar algorithm work to provide automated moderation? It seems to me that you could certainly identify clusters of words that indicate low-value posts?
Now we understand why Apple is so good at doing full text searches and filesystem wide searches. I wish we had the same type of search functionality in Mozilla that Mail.app boasts of.
That is the one feature that Mozilla's mail client really could use.
--
Each document is in turn represented by a long string of numbers, one for each word in the corpus. In mathematical terms, we would say that every document is a vector of n numbers or a point in a space with n dimensions. This coordinate is then mapped onto a unique position in the goatse.cx photograph. If it lands in an objectionable region, the message is discarded as spam.
It's an interesting method, but not having Mail.app myself, what I'm wondering is how well it works on the border regions; that is, when it is just barely objectionable. Say, on his leg.
In fact I'd be willing to bet that it's just another Bayesian e-mail filter with maybe a few extra bells and whistles.
Actually data clustering algorithms are completely different beasts than a standard bayesian analysis. Do a search on k-means clustering or ISODATA clustering methods to see what I mean. However, if you are referring to a bayesian cluster analysis (like those implemented for genetic analysis of microarrays) then you might be correct. Only for reasons you might not intend.
Visit Jonesblog and say hello.
If you had read the article, you would know it uses vector representation and latent semantic analysis, not Bayesian filters, which in the words of the author, "are essentially weighted keyword systems."
According to the FAQ of SpamBayes (I think), they're always getting suggestions of ways to tweak their algos that would "obviously" improve the result, but in almost every case it either makes no difference or hurts accuracy, when actually tested on real data.
If the Junk Mail filter snagged a message the first time, it'll probably get it on subsequent tries too. If the message is legitimate, it probably can't be changed enough to make it through. It's a much better idea to check Junk Mail for legit messages and only empty it manually (or automatically for messages that are at least a week old).
Actually, if you read the article it specifically states that Mail's spam filtering is not like Mozilla Mail's. You use it in much the same manner, but the underlying technology is completely different.
Yaz.
and it's not really mail. it's more iCal. iCal + exchange. as in, let me talk to exchange with ical. i'd love to get rid of entourage, the slowest mail client ever.
-- Who is the bigger fool? The fool or the fool who follows him? --
Wow, the article just turned me on to the Summary Service. And I just used it to read a short and sweet summary of the article.
If you haven't played with it select a bunch of text (in a Cocoa app) and select Summary from the Services menu.
Very cool...
The author is awfully dismissive of bayesian filtering, which works extremely well for me and for lots of other people. See mozilla, spam assassin, others.
I'd be willing to bet that it's just another Bayesian e-mail filter with maybe a few extra bells and whistles.
Umm, how much would you want to bet? I'll take that action!
-jcr
The only title of honor that a tyrant can grant is "Enemy of the State."
but it's a whole lot better with junkmatcher central
Creationists are a lot like zombies. Slow, but powerful and numerous. And they all want to eat our brains.
I have marked every single announcement and special offer i've ever received from Apple as junk, and yet the filter still refuses to classify them as such automatically.
I wonder if there's a loophole here that spammers could take advantage of: masquerade as Apple using the hole they've left in their filter. Spam Mac users to your heart's content. Bundle a Mac virus along with it for extra damage.
Please don't mod this down just because you like Macs. I like Macs too, but it really looks like there is a back door in the spam filter and I'm just reporting it - not mac bashing.
Actually, from my understanding of it, it's fairly different.
I thought Mozilla used Bayesian filtering (which you've mentioned), where each word in the email gets assigned a probability factor of being spam. These factors are totaled at the end; if the total is greater than some predefined value, the message is flagged as spam.
What this does (in my understanding) is count the number of occurrences of each word in every email and store that in a huge table. Then it relates messages to each other based on these word counts. So it's like you get email clusters in N-dimensional space, where each axis is a word and an email's position on that axis is the number of times that email uses that word. Then the clusters that have a lot of spam in them are marked as spam clusters. All the emails in such a cluster are then assumed to be spam.
The advantage of this method, I would suppose, is twofold:
A) When you reduce the N-dimensional space, you would start by eliminating noise words (i.e. words that only occur in a single email). Spam emails that put fake words in to lower their spam probability under the Bayesian method would not benefit with this method.
B) Messages are grouped by content, so it's possible that the client could group email by a common subject, kind of like automatic intelligent sorting. They do mention that this technology can be used to generate email summaries. So (in theory) not only could spam be sorted out, but so could any other key topics, like work, relatives, viagra purchases...
At least that's my understanding of it.
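To make the word-count picture above concrete, here is a minimal Python sketch. It is only an illustration of the idea as described in this comment, not Apple's code: the function name `count_vectors`, the `min_docs` threshold, and the four sample emails are all made up for the example. Each email becomes a vector with one axis per word, and words seen in only one email are dropped as noise (point A).

```python
from collections import Counter

def count_vectors(emails, min_docs=2):
    """Turn each email into a vector of word counts.

    Words that appear in fewer than `min_docs` emails are dropped,
    mirroring the 'noise word' step described in point A.
    """
    tokenized = [email.lower().split() for email in emails]
    # Document frequency: in how many emails does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vocabulary = sorted(w for w, n in doc_freq.items() if n >= min_docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One coordinate per surviving word: how often this email uses it.
        vectors.append([counts[w] for w in vocabulary])
    return vocabulary, vectors

# Hypothetical sample data; "xqzvw" stands in for a spammer's filler word.
emails = [
    "cheap pills buy cheap pills now",
    "buy cheap pills xqzvw",
    "lunch meeting moved to noon",
    "lunch at noon works",
]
vocabulary, vectors = count_vectors(emails)
print(vocabulary)
for vector in vectors:
    print(vector)
```

The filler word never makes it into the vocabulary because it occurs in a single message, so it cannot move that message away from the other pill ads.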
Kickin it Apple Old School.
Where would we be if Wheel had hid her round rock in a cave instead of showing everyone how it rolls?
How does this technology compare to Bayesian filters such as PopFile ? PopFile was not made by Apple, so clearly it doesn't have the cult appeal, but it has been working flawlessly for me for about a year now. What really irks me about this article is how it implies that Apple invented trainable filters -- where, in reality, this is very far from the truth. Apple does the same thing with pretty much everything it sells... sort of like Soviet Russia, who claimed to have invented flight, radio, transistors, and probably elephants too.
>|<*:=
reading that has clearly shown me for the first time why my friends/family complain when i talk technical about chemistry to them.
And i thought i spoke english!
I wonder if that data is accessible by 3rd parties. You could make "mail maps" that let you visualize the clustering of your incoming messages, and you could actually see the spam...by looking at the outliers and noise.
In fact, you could do this with any large data set. How about the feds looking for anomalous chunks of data in the bitstream? Anomalous stuff would just pop out, literally. This would make the TSA's job much, much easier. How about that?
Ya, this is off topic trollish flame bait... and I am an OS X user...
;)
nevertheless, I still laughed at this
"Things are more moderner than before- bigger, and yet smaller- it's computers-- San Dimas High School football RULES!"
I thought it was those underpants gnomes all this time...
We apologise for the fault in this post. Those responsible have been sacked. -- Signed RICHARD M. NIXON
This spam filtering feature seems pretty similar to the one found in Mozilla Mail. In fact I'd be willing to bet that it's just another Bayesian e-mail filter with maybe a few extra bells and whistles.
Not exactly Bayesian, no. It's a different kind of document classification algorithm, which the article calls Latent Semantic Analysis. Basically they represent each message as a point in a high-dimensional space (based on the unordered words in the document), and figure out which parts of the space tend to be occupied by spam e-mails. This involves quite a lot of computation to determine a likely boundary between the parts of the space representing spam and non-spam messages, given only a collection of labeled points.
To make this train and run reasonably quickly, they have to do dimensionality reduction on the space: they collapse dimensions which tend to be correlated or redundant or useless. (If "teens" and "gushing" generally appear together in messages, they probably don't need two separate dimensions; if "hi" is equally likely to appear in spam and non-spam, it may not need a dimension at all.)
A naive-Bayes classifier is much simpler: Assuming that the probabilities of words in a document are all independent, it selects the document type (spam or non-spam) that maximizes the total probability of the observed words. There's no training beyond counting how often each word occurs with each document type.
Naive Bayes typically works nearly as well as more complex methods, and runs much faster. But presumably Apple feels their LSA implementation is fast enough, and sufficiently more accurate than simpler techniques to be worthwhile.
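For comparison, the naive-Bayes approach described above fits in a few lines of Python. This is a rough sketch, not the code of any particular filter: the labels "spam" and "ham", the tiny training set, and the add-one smoothing are assumptions made for the example. Training is nothing more than counting words per label, and classification picks the label with the larger total (log) probability.

```python
import math
from collections import Counter

def train(labeled_messages):
    """Count how often each word occurs with each label."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    message_counts = Counter()
    for text, label in labeled_messages:
        message_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, message_counts

def classify(text, word_counts, message_counts):
    """Pick the label with the highest log-probability for the words."""
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    total_messages = sum(message_counts.values())
    best_label, best_score = None, -math.inf
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        # Prior: fraction of training messages carrying this label.
        score = math.log(message_counts[label] / total_messages)
        for word in text.lower().split():
            # Add-one smoothing (an assumption of this sketch) keeps
            # unseen words from zeroing out the whole product.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data.
training = [
    ("buy cheap pills now", "spam"),
    ("cheap pills cheap prices", "spam"),
    ("lunch at noon tomorrow", "ham"),
    ("meeting notes from lunch", "ham"),
]
word_counts, message_counts = train(training)
print(classify("cheap pills", word_counts, message_counts))
print(classify("lunch tomorrow", word_counts, message_counts))
```

Summing logarithms instead of multiplying raw probabilities avoids underflow on long messages; the independence assumption is what makes the per-word terms simply add up.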
ok, got it - get a sparse point distribution, scrap the biggest common null subspace you find for the word matrices, then do some rotation to get meaningful combinations of these words
(further down
so, weighted keyword systems (in particular Bayesian filters) are not so cool. Erm
ok, maybe this vector approach is something entirely new and leaves existing methods in the dust. But this article seems to be doing a relatively poor job at explaining why.
Here's the problem I have with mail.app's spam filtering:
I have several macs, and an IMAP server. The simple fact is that Mail.app doesn't share the filtering database. So the training winds up being sort of haphazard.
I suppose I should designate a particular machine to be the spam filtering IMAP client and have the rest of them not participate, but then I can't train on those subservient machines.
It'd be much better if multiple Mail.app IMAP clients could store their database on the server and share it.
Sorry, but I couldn't let this one slide. You've obviously got a special interpretation of "fast", because I tried migrating my Eudora mailboxes to Mail, on a 1Ghz Powerbook G4.
Mail CHOKED on them. The early version of Mail chugged for 2 something hours and I gave up and killed it. The latest version was slightly better; 1000 messages or so still took well over 10 minutes. It takes Eudora about 10 seconds to rebuild those big mailboxes(deleted messages aren't actually deleted until Eudora gets around to rebuilding the mailbox; you can set the limit based on percentage of the mailbox, raw MB, I think even % remaining disk space), or force it manually with one click in that mailbox's window. My inbox is 820, and several mailing list boxes are well over 5,000 if I forget to clean them out. I have hundreds of MB of mail, and Eudora handles most operations with little performance hit no matter how big the mailbox gets(there is a limit of around 32,000 messages however, which someone I know hit).
But that was just the importing- then it had to thread them or something, and THEN it had to index them all, both of which it did in the background, but still took forever.
Searching? Well, ok, it's "better" than Eudora in that it gives relevancy and Eudora is an on/off sorta deal, but that's fine- and I prefer 1 second for an exact search in a 2,000 message mailbox over 5-10 seconds for a fuzzy search.
Sorry, but Eudora, despite being a lumbering dinosaur technology-wise(MIME support is broken- PGP-MIME just doesn't work right; no address book integration is another thing that really irritates me), it is just plain hands-down the fastest mail client around.
The MBOX-with-index format also works exceedingly well, is portable (although some minor massaging with text-processing tools may be needed in some cases), and hard to corrupt- unlike almost every other mail client's DB (especially outlook). I've used Eudora for ten years, and never lost a single message except for one early beta version which munged a mailbox on me.
Please help metamoderate.
I'd much prefer to use Mail over Entourage but I can't because it doesn't support multiple databases for accounts or have the ability to move my mail to a Junk folder on my IMAP server. I run my own server and space is not an issue so I do not delete mail. This makes it easier to train a new application if I need it - and makes sure that I never miss a message. Until Mail can handle multiple accounts better I won't be using it.
As far as I know, I am the first to make this up.
1. have several spam email accounts to gather spam
2. use all these emails as a filter(can be partial matching) to eliminate spam at the isp or server or client level.
Why can't they tell that pen1s and penis are essentially the same word?
I use Spam Assassin, and yeah it flags a lot of stuff, but the stuff that does get through is really obvious spam to a human being, yet it fools SA no sweat.
Actually, if you read the article it specifically states that Mail's spam filtering is not like Mozilla Mail's. You use it in much the same manner, but the underlying technology is completely different.
Wow, the article states it.. It's gotta be true..
It does a poor job of explaining the difference.
The big problem I see in spamland today isn't the classification technology. It's the word recognition problem. Sure, "VIAGRA" may be deeply embedded in a "spam" cluster, but what about "V1_4G ra"? If spammers weren't disguising their words, I think that Bayesian filtering and other techniques work fine. I'm not really sure that more advanced techniques in word classification are really needed here.
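One crude way to attack the disguised-word problem is to normalise fragments before they reach the classifier. The Python sketch below is only an illustration: the `LOOKALIKES` table and the `normalize` helper are invented for this example, and a real filter would need a far larger table and some care not to mangle legitimate text. It maps common look-alike characters back to letters and drops everything that is not a letter.

```python
# Hypothetical table of look-alike characters; real spam uses many more.
LOOKALIKES = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s",
})

def normalize(fragment):
    """Undo simple character substitutions and strip filler characters.

    Spaces are removed too, so this is meant for short suspicious
    fragments, not for whole messages.
    """
    substituted = fragment.lower().translate(LOOKALIKES)
    return "".join(ch for ch in substituted if "a" <= ch <= "z")

print(normalize("V1_4G ra"))
print(normalize("pen1s"))
```

Both the plain and the disguised spelling then land on the same axis (or the same keyword), whichever classification technique sits behind it.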
If an email is marked as junk, even if you go to look at it to see if it's really junk no images are loaded so this tracker does not work.
As others have mentioned you can also turn off images for all messages, which is what I would do if it ever started missing spam. So far only one miss in the last six months or so, and no false positives. I'm pretty impressed.
"There is more worth loving than we have strength to love." - Brian Jay Stanley
I mean, it's great and all that we've gotten pretty good at filtering spam. I use Opera quite a bit, and its spam filters work with 99% accuracy after sufficient training. But there's still a chance something can slip through. I still have to download all the spam, and occasionally go through it, deleting it all, while making sure something legit didn't accidentally get flagged as spam. It's rare, but it happens. The most annoying thing is just that I get it at all. I'd be more impressed to see something like this running on the mail server, turning back spam. I even wouldn't mind if the rare legitimate message got bounced. The sender would just receive a message from the mail server saying that their mail was marked as spam, and that they should try again, or let me know by some other means. Heck, I wouldn't mind missing the occasional e-mail if I never had to download another spam again. That's what would impress me at this point.
Mail's junk filter may be okay, but it's not nearly as good as the article makes it out to be. You'll get much better results using a combination of the built-in filter along with external filtering/tagging devices.
For example, I ran across JunkMatcher some time ago and have been enjoying 99-100% accuracy with less than 1% false positives. It's a huge improvement over the 80-90% accuracy I was getting with the built-in filter alone.
I had emails out to every link in the chain, but no one knew what was going on.
In Apple Mail, I had my 'reply to' names set to my email addys - I changed it to short descriptive names and now they're not bouncing anymore. (odd error, so I thought I'd post it)
Why this started all of a sudden, and why no host or ISP had heard of this before, I don't know.
I do know that being on a blacklist and attempting to get off of it is nigh impossible, so I'd be all over Apple making spam filtering software so overzealous wizards of blacklists can be kicked to the curb. (Why is this in use anywhere..?)
Latent Semantic Indexing has been around for a while, and I've forgotten many of the details. As some have mentioned it's a dimension reduction technique, and the result is a set of eigenvectors, each of which describes a set of terms which correlate well with each other (or anticorrelate, I think components can be negative too).
In English terms, the technique finds sets of words that occur together in different subject areas, and gives them weights which reflect how often they occur together. For instance, "baseball" and "bat" may emerge as common companions in some documents, so they might get weights of 1.0 for both (in one eigenvector/topic) if they always occur together - meaning a query for "bat" should always return hits for "baseball" too. However if "bat" gets diluted by documents about flying animals, then its weight in the "baseball"-"bat" vector will be reduced, say to 0.5. Then queries for "bat" will not necessarily map to baseball documents, but to both areas, represented by different eigenvectors.
That's confusing enough, but LSI gives a clean method for managing all of these relative probabilities in a global space of word occurrence vectors. The "latent" part is how it discovers these topic areas automatically, by clustering words which occur together. This process is similar to data mining for common subsets, but with LSI the members of the subsets are actually weighted for significance.
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
You need to first select the text to summarize, then you can go to the application menu, Services, and choose the Summarize option. This then launches the SummaryService, which then allows you to set the desired summary length and displays the summary accordingly.
"I like systems, their application excepted", George Sand (French)
I guess that's what the "load images" button in Mail.app is for. So you only load pictures from sources you trust.
I assume this is the same in Outlook? Anyway. I'll never stop blaming MS. Think of how much bloody extra traffic their introduction of HTML is causing (look at spam numbers).
look here! ;o)
I am NaN
"None are more hopelessly enslaved than those who falsely believe they are free." -- Goethe
You mean these "Vectors" (sounds foreign) are watching everything in my email?!!
Well, if that isn't a gross invasion of privacy then my name's not Liz Figueroa.
I'm drafting a letter to the Senate immediately... on a typewriter.
time to get new friends and disown your family
In short: a vector is the result of a calculation based on the number of times a term is used in a document and the terms in the other documents it is being compared with (the document set).
The angle between document (email) vectors is a representation of their likeness. For example if the angle is very small the documents have a lot in common.
This is how the mail app works. It compares known junk emails (ie the query) to the incoming document set (new emails)
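The angle comparison can be sketched in a few lines of Python. This is a minimal illustration using made-up term-count vectors; the function name `angle_degrees` is invented for the example.

```python
import math

def angle_degrees(u, v):
    """Angle between two term vectors, via the cosine formula."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # Clamp to guard against rounding pushing the value outside [-1, 1].
    cosine = max(-1.0, min(1.0, cosine))
    return math.degrees(math.acos(cosine))

# Hypothetical vectors over the same four terms.
known_junk = [2, 1, 0, 0]
similar_mail = [4, 2, 0, 0]    # same words, just a longer message
unrelated_mail = [0, 0, 3, 1]  # shares no terms with the junk mail

print(angle_degrees(known_junk, similar_mail))
print(angle_degrees(known_junk, unrelated_mail))
```

The first angle comes out at essentially zero even though one message is twice as long, which is the point of comparing angles rather than distances; the second is a right angle because the two messages share no terms.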
There are a number of weighting schemes, for example Term Frequency Weights (TF Weights) or Term Frequency Inverse Document Frequency (TF-IDF Weights).
There are a few laws particularly relevant to Information Retrieval. Heaps' Law: the larger a document gets, the fewer new words are added to it.
http://planetmath.org/encyclopedia/HeapsLaw.html
Zipf's Law: more relevant to document weighting schemes. It states that a word's frequency falls off roughly in inverse proportion to its frequency rank, so a handful of words account for most of the text. Those very frequent words are the least informative: stop words such as "a, the, it, and, is" all carry little meaning and are used frequently.
http://planetmath.org/encyclopedia/ZipfsLaw.html
Less frequently used words in a document are better at describing its content. For example, "pixel intensity mathematical concepts".
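A small Python sketch shows how a TF-IDF weighting turns that observation into numbers. There are many TF-IDF variants; this one (raw term frequency times the log of inverse document frequency) and the three sample documents are assumptions chosen for illustration only.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Weight each word by term frequency times inverse document frequency."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each word.
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical document set.
documents = [
    "the pixel intensity is the key",
    "the meeting is at noon",
    "the pixel is red",
]
weights = tf_idf(documents)
for word, weight in sorted(weights[0].items(), key=lambda item: -item[1]):
    print(f"{word:10s} {weight:.3f}")
```

Words that occur in every document ("the", "is") get a weight of zero, while a word unique to one document ("intensity") outweighs one shared with another document ("pixel").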
-- Agent
and this is different from mozilla mail's spam filter how?
oh and can we please get a mac software article that doesn't sound like a raving zealot doing the "hail mac/jobs" routine?
even linux reviewers are more critical...
comment first, facts later. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
..."Wow, the article just turned me on."
"Not exactly Bayesian, no. It's a different kind of document classification algorithm, which the article calls Latent Semantic Analysis. "
Which you'll see more of as DB Filesystems, and Meta this and Meta that, become popular.
Well, since you brought it up, yes, let's compare:
Apple method:
Open Prefs
Click Viewing Options
Uncheck 'Display images and embedded objects in HTML messages'
I'll stick with Apple's method thanks.
If Jesus wants me it knows where to find me.
After reading through the comments here, it is obvious that there are some misconceptions about what Apple is doing.
Latent Semantic Indexing (LSI) was invented by Deerwester et al. [1] as a method of reducing the dimensionality of a text corpus by finding a low-rank approximation of the term-document matrix.
The singular value decomposition (SVD) [2] factors a matrix A into the product of two orthogonal matrices and a diagonal matrix, A = U'SV. To find a rank-k approximation of A using this factorisation, create matrices U^, S^ and V^ where S^ contains the first k rows and columns of S, U^ contains the first k rows of U and likewise for V^. Then, let A^ = U^'S^V^. Of all rank-k matrices, A^ is the one that minimises the Frobenius norm [3] of the difference A - A^ (least squares).
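Written out in the same convention (A = U'SV, with ' denoting the transpose), the truncation looks as follows. The symbols sigma_i and r are introduced here for illustration: sigma_i are the diagonal entries of S, assumed sorted in descending order, r is the rank of A, and k < r.

```latex
\begin{aligned}
A &= U^{\top} S V, \qquad S = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \quad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0, \\
\hat{A} &= \hat{U}^{\top} \hat{S} \hat{V}, \qquad \hat{S} = \operatorname{diag}(\sigma_1, \dots, \sigma_k), \\
\lVert A - \hat{A} \rVert_F &= \sqrt{\sigma_{k+1}^2 + \dots + \sigma_r^2} \;=\; \min_{\operatorname{rank}(B) \le k} \lVert A - B \rVert_F .
\end{aligned}
```

The last line is the Eckart-Young result: the error of the best rank-k approximation is exactly the energy in the discarded singular values.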
Rather than storing the full matrix, A^, in practice it is much more common to save U^ and S^ and project the columns and rows of A into a k-dimensional space. This allows both terms and documents to be clustered together and helps to associate keywords with documents.
You can do many things with these approximated document vectors, clustering, classification, document retrieval. Apple is probably using a k-nearest neighbour classifier [4] to determine how a message is to be filed.
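A k-nearest-neighbour classifier over such reduced vectors can be sketched in Python as below. Whether Apple does anything like this is only the guess made above; the two-dimensional vectors, the labels "junk" and "inbox", and the choice of cosine similarity with k = 3 are all assumptions for the example.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_classify(query, labeled_vectors, k=3):
    """Label a query vector by majority vote of its k most similar neighbours."""
    ranked = sorted(
        labeled_vectors,
        key=lambda item: cosine(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical messages already projected into a 2-dimensional space.
labeled = [
    ([0.9, 0.1], "junk"), ([0.8, 0.3], "junk"), ([0.7, 0.2], "junk"),
    ([0.1, 0.9], "inbox"), ([0.2, 0.8], "inbox"), ([0.3, 0.7], "inbox"),
]
print(knn_classify([1.0, 0.2], labeled))
print(knn_classify([0.1, 1.0], labeled))
```

There is no training step beyond storing the labelled vectors, which is one reason the updating strategy for the decomposition itself is the interesting part.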
I would be most interested to see Apple's updating strategy. There are several algorithms that allow you to add new rows and columns to a matrix where you know the full SVD, but none that I know of for the truncated SVD.
For one of my graduate-level courses, I wrote a little search engine that uses LSI to cluster 1000 newspaper articles. You can play with it here. My favourite query is "Rowan Gorilla." The Rowan Gorilla is an oil rig that frequents Halifax harbour. The search engine returns articles on the oil and gas industry that contain neither the word "Rowan" nor "Gorilla" but are still topical.
[1] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 1990.
[2] Singular Value Decomposition -- from MathWorld. http://mathworld.wolfram.com/SingularValueDecompo
[3] Frobenius Norm -- from MathWorld. http://mathworld.wolfram.com/FrobeniusNorm.html
[4] Artificial Intelligence Wiki: NearestNeighbour. http://www.ifi.unizh.ch/ailab/aiwiki/aiw.cgi?Near
This is Information Retrieval not Information Dispersal...Information Transit got the wrong man. I got the right man. The wrong one was delivered to me as the right man, I accepted him on good faith as the right man. Was I wrong?
My name's Lowry. Sam Lowry. I've been told to report to Mr. Warrenn.
Thirtieth floor, sir. You're expected.
Um... don't you want to search me?
No sir.
Do you want to see my ID?
No need, sir.
But I could be anybody.
No you couldn't sir. This is Information Retrieval.
There you are, your own number on your very own door. And behind that door, your very own office! Welcome to the team, D7-105! Welcome to Information Retrieval
"Music is everybody's possession. It's only publishers who think that people own it." - John Lennon.
i mean "them" as in macs...
very sexy laptops. They just cost too much.
So I settle for the "discount girl" you know...dell.
-Grump
Is it true that more people vote for the winner of American Idol, than vote for the president? -Ali G.
From The Article: "What Should I Do with Spam Once It's Flagged? " :)
Why, send it on to all your friends of course!
The Bigger The Headache The Bigger the Pill
Isn't that how the MCP got started in "Tron"... something... small...?
:)
It's time to nip this one in the bud! Before it's too late!!!
"People" using "unnecessary" quotes should be "shot".
An underlying Bayesian model. Not necessarily the Bayesian model used by current "Bayesian" spam filters.
So, is any free software project working on this sort of thing? Given that Unix docs tend to be plain text, this kind of approach should work better here than with all those nasty proprietary binary formats. Reading the description, it doesn't sound that difficult to do, although I've not enough maths background to know for certain.
Enter JunkMatcher Central.
It uses rule-based filtering to complement Mail.app's methods. And, as a bonus, you can have it mark what it finds as junk mail, which trains Mail.app.
It requires some tweaking, but is great, updated often, and free!
Of course, having half of those accounts on an Xserve G5 running OS X Server 10.3.3 and referencing about a half-dozen blacklists helps, too. :)
MacTacToe - for every problem, an elegant solution
It's just nice that you can leave it on and have to explicitly load it for mail marked as junk.
I was using Thunderbird (well, really the earlier mozilla mail) for a few years, but I have to say it got really slow with a lot of messages and Mail.app has just the right set of features for me at the moment. I might move back to Thunderbird someday though.
"There is more worth loving than we have strength to love." - Brian Jay Stanley