How Apple's Mail.app Junk Filter Works
fmorgan writes "O'Reilly has now posted the second part of an article about Mac OS X Mail.app spam filtering with more details on what this technology is (and isn't): 'Many myths have emerged about Mail's junk mail filter. No, it's not an extremely complex set of rules, no it doesn't look for keywords, and no, it doesn't use white magic ... Interestingly enough, the technology that underlies the Junk Mail filter began its life as an information retrieval system.'"
I wonder if that data is accessible by 3rd parties. You could make "mail maps" that let you visualize the clustering of your incoming messages, and you could actually see the spam...by looking at the outliers and noise.
In fact, you could do this with any large data set. How about the feds looking for anomalous chunks of data in the bitstream? Anomalous stuff would just pop out, literally. This would make the TSA's job much, much easier. How about that?
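For what it's worth, the "look at the outliers" idea can be tried without any access to Mail's internal data. Below is a rough sketch, not how Mail.app does it: it assumes each message is reduced to plain word counts, and it ranks messages by how far they sit from the average of the whole mailbox using cosine similarity. The names `cosine_similarity` and `rank_by_outlierness` are made up for this example.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_outlierness(messages):
    """Return message indices, most unusual first.

    Assumption for this sketch: "unusual" means low cosine similarity
    to the summed word counts of the whole mailbox.
    """
    vectors = [Counter(m.lower().split()) for m in messages]
    centroid = Counter()
    for v in vectors:
        centroid.update(v)  # add this message's counts to the total
    scores = [1.0 - cosine_similarity(v, centroid) for v in vectors]
    return sorted(range(len(messages)), key=lambda i: scores[i], reverse=True)


mailbox = [
    "lunch on friday",
    "friday lunch menu",
    "lunch menu for friday",
    "win free money now",
]
ranking = rank_by_outlierness(mailbox)
print(mailbox[ranking[0]])
```

This only works when legitimate mail dominates the mailbox. If most of what arrives is spam, the spam is the cluster and the real mail is the outlier.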
ok, got it - get a sparse point distribution, scrap the biggest common null subspace you find for the word matrices, then do some rotation to get meaningful combinations of these words
(further down
so, weighted keyword systems (in particular Bayesian filters) are not so cool. Erm
ok, maybe this vector approach is something entirely new and leaves existing methods in the dust. But this article seems to be doing a relatively poor job at explaining why.
Ugh. The magic doesn't come from vectors. Vectors are just how you throw the numbers around. The reason the classification apparently works well is their choice of representation of the document: a word histogram -- the occurrence count for each word. To measure the distance between two histograms, a common choice is the chi-squared distance. So, forget all about "vectors", the real workhorse is the histogram. And we can talk about "clustering", but it's just as important to know how you're measuring the distance from one document to another.
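To make the histogram point concrete, here is a minimal sketch of a word histogram and a chi-squared distance between two of them. It is an illustration of the idea in this comment, not Apple's implementation. The tokenizer, the normalization to relative frequencies, and the function names `word_histogram` and `chi_squared_distance` are all assumptions made for the example.

```python
import re
from collections import Counter


def word_histogram(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def chi_squared_distance(h1, h2):
    """Symmetric chi-squared distance between two word histograms.

    Counts are first normalized to relative frequencies so that long
    and short documents are comparable. The result is 0 for identical
    word distributions and 1 when the documents share no words.
    """
    n1 = sum(h1.values())
    n2 = sum(h2.values())
    if n1 == 0 or n2 == 0:
        raise ValueError("both documents need at least one word")
    total = 0.0
    for word in set(h1) | set(h2):
        p = h1[word] / n1  # a Counter returns 0 for a missing word
        q = h2[word] / n2
        total += (p - q) ** 2 / (p + q)
    return 0.5 * total


ham = word_histogram("meeting moved to friday see agenda attached")
spam = word_histogram("cheap pills buy cheap pills now")
new = word_histogram("buy cheap pills today")

# The new message should sit closer to the spam example than to the ham one.
print(chi_squared_distance(new, spam) < chi_squared_distance(new, ham))
```

Swapping in a different distance (cosine, Euclidean, and so on) changes which documents count as "near" each other, which is why the choice of measure matters as much as the clustering step.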
Image clustering is hard, and the problem comes from picking a good representation of the image. Of course, a "word histogram" for an image makes no sense. Just considering pixel intensity or pixel color doesn't work either. You usually have to start looking at things like lines, curvatures, intersections, texture patterns, etc. Once you decide which tools you're going to use to describe an image and which algorithms will calculate them, you can start talking about how far away one image is from another, which then naturally leads to clustering techniques. But the hard part about the clustering is getting the images into a space in which they actually, nicely cluster.
I had to stop reading the article because it was so clearly written by someone who had no comfort with the mathematical concepts or techniques. (Sorry, but seriously, it's the blind leading the blind.)