Slashdot Mirror


How Apple's Mail.app Junk Filter Works

fmorgan writes "O'Reilly has now posted the second part of an article about Mac OS X Mail.app spam filtering, with more details on what this technology is (and isn't): 'Many myths have emerged about Mail's junk mail filter. No, it's not an extremely complex set of rules, no it doesn't look for keywords, and no, it doesn't use white magic ... Interestingly enough, the technology that underlies the Junk Mail filter began its life as an information retrieval system.'"

3 of 273 comments

  1. Hmmm. Document visualization by mveloso · · Score: 3, Insightful

    I wonder if that data is accessible by 3rd parties. You could make "mail maps" that let you visualize the clustering of your incoming messages, and you could actually see the spam...by looking at the outliers and noise.

    In fact, you could do this with any large data set. How about the feds looking for anomalous chunks of data in the bitstream? Anomalous stuff would just pop out, literally. This would make the TSA's job much, much easier. How about that?

  2. Crystal clear ... erm ... by Too+Much+Noise · · Score: 4, Insightful

    Then, we can do the Latent Semantic Analysis. In this new space, each axis is a weighted combination of all the words: documents and words coexist in the same space.


    ok, got it - get a sparse point distribution, scrap the biggest common null subspace you find for the word matrices, then do some rotation to get meaningful combinations of these words ... or something (lexical analysis).

    (further down ...)


    Of course, systems that rely on such keywords are continuously updated and refined. Nevertheless, they are never entirely satisfying, even when using sophisticated Bayesian filters that are essentially weighted keyword systems.


    so, weighted keyword systems (in particular Bayesian filters) are not so cool. Erm ... wait a minute, WTF???

    ok, maybe this vector approach is something entirely new and leaves existing methods in the dust. But this article seems to be doing a relatively poor job at explaining why.
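
    For readers puzzling over the quoted claim that Bayesian filters are "essentially weighted keyword systems", here is a minimal sketch of why that description fits. It assumes a multinomial naive Bayes model with add-one smoothing and whitespace tokenization; the function names and toy messages are hypothetical, and nothing here is taken from the article or from Mail.app. Under that model the spam score is just a sum of per-word weights.

```python
import math
from collections import Counter


def train_word_weights(spam_docs, ham_docs, smoothing=1.0):
    """Return one weight per word: log P(word|spam) - log P(word|ham).

    Assumption: multinomial naive Bayes with additive smoothing.
    """
    spam_counts = Counter(w for d in spam_docs for w in d.lower().split())
    ham_counts = Counter(w for d in ham_docs for w in d.lower().split())
    vocab = set(spam_counts) | set(ham_counts)

    # Smoothed totals so that unseen words never produce log(0).
    spam_total = sum(spam_counts.values()) + smoothing * len(vocab)
    ham_total = sum(ham_counts.values()) + smoothing * len(vocab)

    return {
        w: math.log((spam_counts[w] + smoothing) / spam_total)
        - math.log((ham_counts[w] + smoothing) / ham_total)
        for w in vocab
    }


def spam_score(text, weights, prior_log_odds=0.0):
    """Log-odds that text is spam: the prior plus a sum of keyword weights.

    Words never seen in training contribute nothing (weight 0).
    """
    return prior_log_odds + sum(weights.get(w, 0.0) for w in text.lower().split())


# Toy training data (hypothetical).
spam = ["cheap pills now", "cheap offer now"]
ham = ["meeting notes attached", "lunch meeting tomorrow"]
weights = train_word_weights(spam, ham)
```

    A positive `spam_score` leans spam and a negative one leans ham. Each word votes independently with a fixed weight, which is the sense in which such a filter is a weighted keyword system: it has no notion of how words relate to each other.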

  3. Re:Vectors..... by RovingSlug · · Score: 4, Insightful

    Ah, it uses vector math. ... Image classification would be a no brainer for folks that spend their time examining images and multispectral datasets.

    Ugh. The magic doesn't come from vectors. Vectors are just how you throw the numbers around. The reason the classification apparently works well is their choice of representation of the document: a word histogram -- the occurrence count for each word. To measure the distance between two histograms, you usually use the chi-squared test. So, forget all about "vectors", the real workhorse is the histogram. And, we can talk about "clustering", but it's just as important to know how you're measuring the distance from one document to another.
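
    A minimal sketch of the histogram idea, assuming whitespace tokenization and the symmetric chi-squared distance on normalized counts (one common form; the post does not say which variant is meant, and nothing here is claimed to be what Mail.app does). The function names are hypothetical.

```python
from collections import Counter


def word_histogram(text):
    """Count how often each lowercased, whitespace-separated word occurs."""
    return Counter(text.lower().split())


def chi_squared_distance(h1, h2):
    """Symmetric chi-squared distance between two word histograms.

    Counts are first normalized to relative frequencies so that message
    length does not dominate. The result lies between 0 and 1.
    """
    n1 = sum(h1.values())
    n2 = sum(h2.values())
    if n1 == 0 or n2 == 0:
        raise ValueError("both histograms must contain at least one word")

    dist = 0.0
    for word in set(h1) | set(h2):
        p = h1[word] / n1  # Counter returns 0 for words it has not seen
        q = h2[word] / n2
        dist += (p - q) ** 2 / (p + q)
    return 0.5 * dist


a = word_histogram("buy cheap pills now")
b = word_histogram("cheap pills buy now")
c = word_histogram("meeting notes attached")
```

    Here `chi_squared_distance(a, b)` is 0 because word order is discarded and the histograms are identical, while `chi_squared_distance(a, c)` is about 1, the maximum, because the two messages share no words. Any clustering built on top inherits whatever this distance considers "close".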

    Image clustering is hard, and the problem comes from picking a good representation of the image. Of course, a "word histogram" for an image makes no sense. Just considering pixel intensity or pixel color doesn't work either. You usually have to start looking at things like lines, curvatures, intersections, texture patterns, etc. Once you decide which tools you're going to use to describe an image and which algorithms will calculate them, you can start talking about how far away one image is from another, which then naturally leads to clustering techniques. But the hard part about the clustering is getting the images into a space in which they actually, nicely cluster.

    I had to stop reading the article because it was so clearly written by someone who had no comfort with the mathematical concepts or techniques. (Sorry, but seriously, it's the blind leading the blind.)