How Apple's Mail.app Junk Filter Works
fmorgan writes "O'Reilly has now posted the second part of an article about Mac OS X Mail.app spam filtering, with more details on what this technology is (and isn't): 'Many myths have emerged about Mail's junk mail filter. No, it's not an extremely complex set of rules, no it doesn't look for keywords, and no, it doesn't use white magic ... Interestingly enough, the technology that underlies the Junk Mail filter began its life as an information retrieval system.'"
Microsoft can learn a lesson here? Especially in the light of this hole, from which a spammer can clearly see that you have opened their messages and validate your address...
bash: rtfm: command not found
It's simple. It uses its extremely uninspired app name to scare away spam.
The "Insert Quote Here" line is almost as predictable as inserting an actual quote.
The article mentions...
"In mathematical terms, we would say that every document is a vector of n numbers or a point in a space with n dimensions."
Funny. When I took linear algebra I was wondering if there was a practical application for this, and I guess there is... to eliminate penis enlargement advertisements.
Yes! I listen to NYC Speedcore and do math at 3AM. I suggest you try it too.
Why wouldn't a similar algorithm work to provide automated moderation? It seems to me that you could certainly identify clusters of words that indicate low-value posts?
Each document is in turn represented by a long string of numbers, one for each word in the corpus. In mathematical terms, we would say that every document is a vector of n numbers or a point in a space with n dimensions. This coordinate is then mapped onto a unique position in the goatse.cx photograph. If it lands in an objectionable region, the message is discarded as spam.
It's an interesting method, but not having Mail.app myself, what I'm wondering is how well it works on the border regions; that is, when it is just barely objectionable. Say, on his leg.
In fact, I'd be willing to bet that it's just another Bayesian e-mail filter with maybe a few extra bells and whistles.
Actually, data clustering algorithms are completely different beasts from a standard Bayesian analysis. Do a search on k-means clustering or ISODATA clustering methods to see what I mean. However, if you are referring to a Bayesian cluster analysis (like those implemented for genetic analysis of microarrays), then you might be correct, though for reasons you might not intend.
Visit Jonesblog and say hello.
If you had read the article, you would know it uses vector representation and latent semantic analysis, not Bayesian filters, which in the words of the author, "are essentially weighted keyword systems."
According to the FAQ of SpamBayes (I think), they're always getting suggestions of ways to tweak their algos that would "obviously" improve the result, but in almost every case it either makes no difference or hurts accuracy, when actually tested on real data.
Actually, from my understanding of it, it's fairly different.
I thought Mozilla used a Bayesian filter (which you mentioned), where each word in the email is assigned a probability of being spam. These factors are totaled at the end; if the total is greater than some predefined value, the message is flagged as spam.
What this does (in my understanding) is count the number of occurrences of each word in every email and store that in a huge table. Then it relates messages to one another based on these word counts. So you get email clusters in N-dimensional space, where each axis is a word and an email's position on that axis is the number of times that email uses the word. Then the clusters that have a lot of spam in them are marked as spam clusters, and all the emails in such a cluster are assumed to be spam.
The advantage of this method, I would suppose, is twofold:
A) When you reduce the N-dimensional space, you would start by eliminating noise words (i.e. words that only occur in a single email). Spam emails that add fake words to lower their spam probability under the Bayesian method would gain nothing under this method.
B) Messages are grouped by content, so it's possible that the client could group email by a common subject, kind of like automatic intelligent sorting. They do mention that this technology can be used to generate email summaries. So (in theory) not only could spam be sorted out, but so could any other key topics, like work, relatives, viagra purchases...
At least that's my understanding of it.
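For anyone who wants to poke at the idea, here is a minimal sketch of the word-count picture described above. It is an assumption-laden toy, not Apple's implementation: the nearest-centroid rule, the cosine similarity measure, and every function name are made up for illustration.

```python
from collections import Counter
import math

def word_counts(text):
    """Return a word -> count mapping for one message (lowercased, split on whitespace)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def centroid(messages):
    """Sum the count vectors of a group of messages into one cluster vector."""
    total = Counter()
    for m in messages:
        total.update(word_counts(m))
    return total

def classify(message, spam_centroid, ham_centroid):
    """Label a message by whichever cluster vector it points closer to."""
    vec = word_counts(message)
    if cosine_similarity(vec, spam_centroid) > cosine_similarity(vec, ham_centroid):
        return "spam"
    return "ham"

# Hypothetical training data, two messages per class.
spam = ["cheap pills buy now", "buy cheap pills online now"]
ham = ["lunch meeting moved to noon", "notes from the meeting attached"]
spam_c, ham_c = centroid(spam), centroid(ham)

print(classify("buy pills now", spam_c, ham_c))
print(classify("meeting notes", spam_c, ham_c))
```

Each word is one axis, each message is a point, and a new message gets the label of the cluster it sits closest to. A real system would discover the clusters rather than being handed two labeled piles.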
This behaviour is due to the rules set up in Apple Mail. To disable it, go to the Mail preferences, select Rules, and remove the entry "news from apple".
This spam filtering feature seems pretty similar to the one found in Mozilla Mail. In fact, I'd be willing to bet that it's just another Bayesian e-mail filter with maybe a few extra bells and whistles.
Not exactly Bayesian, no. It's a different kind of document classification algorithm, which the article calls Latent Semantic Analysis. Basically they represent each message as a point in a high-dimensional space (based on the unordered words in the document), and figure out which parts of the space tend to be occupied by spam e-mails. This involves quite a lot of computation to determine a likely boundary between the parts of the space representing spam and non-spam messages, given only a collection of labeled points.
To make this train and run reasonably quickly, they have to do dimensionality reduction on the space: they collapse dimensions which tend to be correlated or redundant or useless. (If "teens" and "gushing" generally appear together in messages, they probably don't need two separate dimensions; if "hi" is equally likely to appear in spam and non-spam, it may not need a dimension at all.)
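A rough way to see the "useless dimension" half of that idea in code: the sketch below simply drops words that are about equally common in spam and non-spam. This is a much cruder stand-in than the SVD-based reduction that LSA normally uses, and the function names and the `min_gap` threshold are assumptions of mine.

```python
def document_frequency(messages):
    """Map each word to the number of messages that contain it."""
    freq = {}
    for text in messages:
        for word in set(text.lower().split()):
            freq[word] = freq.get(word, 0) + 1
    return freq

def informative_words(spam, ham, min_gap=0.5):
    """Keep words whose spam rate and non-spam rate differ by at least min_gap."""
    spam_df = document_frequency(spam)
    ham_df = document_frequency(ham)
    kept = set()
    for word in set(spam_df) | set(ham_df):
        spam_rate = spam_df.get(word, 0) / len(spam)
        ham_rate = ham_df.get(word, 0) / len(ham)
        if abs(spam_rate - ham_rate) >= min_gap:
            kept.add(word)
    return kept

# Hypothetical messages: "hi" appears everywhere, so it carries no signal.
spam = ["hi teens gushing", "hi gushing teens now"]
ham = ["hi see you at lunch", "hi lunch was great"]
kept = informative_words(spam, ham)

print("hi" in kept, "teens" in kept, "lunch" in kept)
```

Merging correlated words (the "teens"/"gushing" case) is the part this sketch does not attempt; that is where a proper matrix decomposition earns its keep.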
A naive-Bayes classifier is much simpler: Assuming that the probabilities of words in a document are all independent, it selects the document type (spam or non-spam) that maximizes the total probability of the observed words. There's no training beyond counting how often each word occurs with each document type.
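To show how little there is to it, here is a bare-bones naive Bayes sketch. The add-one smoothing, the 50/50 prior, and the function names are my own assumptions for the example, not anything taken from Mozilla or SpamBayes.

```python
import math
from collections import Counter

def train(messages):
    """Count word occurrences across one class of messages."""
    counts = Counter()
    for text in messages:
        counts.update(text.lower().split())
    return counts

def log_likelihood(text, counts, vocab_size):
    """Sum of log P(word | class), with add-one smoothing for unseen words."""
    total = sum(counts.values())
    return sum(
        math.log((counts[word] + 1) / (total + vocab_size))
        for word in text.lower().split()
    )

def classify(text, spam_counts, ham_counts, spam_prior=0.5):
    """Pick the class that maximizes prior times the product of word probabilities."""
    vocab_size = len(set(spam_counts) | set(ham_counts))
    spam_score = math.log(spam_prior) + log_likelihood(text, spam_counts, vocab_size)
    ham_score = math.log(1 - spam_prior) + log_likelihood(text, ham_counts, vocab_size)
    return "spam" if spam_score > ham_score else "ham"

# Hypothetical training data.
spam_counts = train(["buy cheap pills", "cheap pills now"])
ham_counts = train(["meeting at noon", "lunch at noon"])

print(classify("cheap pills", spam_counts, ham_counts))
print(classify("lunch at noon", spam_counts, ham_counts))
```

Working in log space avoids multiplying many tiny probabilities together, and "training" really is just the counting done in `train`.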
Naive Bayes typically works nearly as well as more complex methods, and runs much faster. But presumably Apple feels their LSA implementation is fast enough, and sufficiently more accurate than simpler techniques to be worthwhile.
The magic doesn't come from vectors. Vectors are just how you throw the numbers around
And your point is?
The reason the classification apparently works well is their choice of representation of the document: a word histogram -- the occurrence count for each word. To measure the distance between two histograms, you usually use the chi-squared test.
For a univariate (or perhaps bivariate) space this will work, but now try implementing standard chi-square analysis in a multivariate (or hyperspectral) space. It starts to fall short rather quickly, hence the use of distance measures between clusters instead.
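For reference, one common form of the chi-squared distance between two histograms is half the sum of (x - y)^2 / (x + y) over all bins. The sketch below applies it to word histograms; the function name and the sample strings are assumptions for illustration only.

```python
from collections import Counter

def chi_squared_distance(a, b):
    """Chi-squared distance between two word histograms (0 means identical)."""
    total = 0.0
    for word in set(a) | set(b):
        x, y = a.get(word, 0), b.get(word, 0)
        if x + y > 0:  # skip bins that are empty in both histograms
            total += (x - y) ** 2 / (x + y)
    return 0.5 * total

# Hypothetical messages turned into occurrence counts.
h1 = Counter("buy pills buy".split())
h2 = Counter("buy lunch".split())

print(chi_squared_distance(h1, h1))
print(chi_squared_distance(h1, h2))
```

The measure is symmetric, and dividing by x + y keeps very frequent words from dominating the way they would under plain squared differences.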
Image clustering is hard, and the problem comes from picking a good representation of the image.
Yes, I do image clustering almost every day. Well, at least a couple times a week. With proper discriminands one can overcome "good image representation" problems.
Of course, a "word histogram" for an image makes no sense.
Actually, it does in a sense when you realize that images are simply matrices of numbers just like sentences or paragraphs can be identified as matrices after assigning lookup values to certain properties.
Just considering pixel intensity or pixel color doesn't work either.
Actually, yes it does. This is how many standard measures of image cluster analysis work.
You usually have to start looking at things like lines, curvatures, intersections, texture patterns, etc.
Actually, no. For many image classification algorithms that examine pixel value (oil bearing strata, concrete vs granite, types of aluminum in missiles etc...), structure or anatomy play absolutely no role in the identification of classes.
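As a toy illustration of clustering on pixel value alone, here is a one-dimensional k-means (Lloyd's algorithm) over a handful of intensities. The data, the starting centers, and the function name are made up for the sketch; real remote-sensing work would use many spectral bands rather than a single number per pixel.

```python
def kmeans_1d(values, k, iterations=20):
    """Cluster scalar values (e.g. pixel intensities) into k groups with Lloyd's algorithm."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted values.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

# Hypothetical intensities: a dark material and a bright one.
pixels = [12, 15, 10, 14, 200, 210, 205, 198]
centers = kmeans_1d(pixels, k=2)

print(centers)
```

No lines, curvatures, or textures are involved; the two classes fall out of the intensity values by themselves.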
Once you decide which tools you're going to use to describe an image and which algorithms will calculate them, you can start talking about how far away one image is from another, which then naturally leads to clustering techniques.
That is a very difficult approach to take for image classification that begins to rely on machine processing and image "interpretation" which is a much higher order problem.
But, the hard part about the clustering is getting them into a space in which they actually, nicely cluster.
Simply add more discriminands or filters and don't worry about "describing" the image. Other properties (like structure and anatomy) fall out after image clustering.
Dude, you seriously need to seek help for your mail-archiving condition :)
Or if nothing else move some of the mail to a backup directory so the poor little imap server doesn't have to deal with YOUR pack-rat habits!
If an email is marked as junk, no images are loaded, even if you open it to check whether it's really junk, so this tracker does not work.
As others have mentioned you can also turn off images for all messages, which is what I would do if it ever started missing spam. So far only one miss in the last six months or so, and no false positives. I'm pretty impressed.
"There is more worth loving than we have strength to love." - Brian Jay Stanley
Well, since you brought it up, yes, let's compare:
Apple method:
Open Prefs
Click Viewing Options
Uncheck 'Display images and embedded objects in HTML messages'
I'll stick with Apple's method thanks.
If Jesus wants me it knows where to find me.