Slashdot Mirror


How Apple's Mail.app Junk Filter Works

fmorgan writes "O'Reilly has now posted the second part of an article about Mac OS X Mail.app spam filtering with more details on what this technology is (and isn't): 'Many myths have emerged about Mail's junk mail filter. No, it's not an extremely complex set of rules, no it doesn't look for keywords, and no, it doesn't use white magic ... Interestingly enough, the technology that underlies the Junk Mail filter began its life as an information retrieval system.'"

21 of 273 comments

  1. Maybe... by ErichTheWebGuy · · Score: 5, Interesting

    Microsoft can learn a lesson here? Especially in the light of this hole, from which a spammer can clearly see that you have opened their messages and validate your address...

    --
    bash: rtfm: command not found
    1. Re:Maybe... by Anonymous Coward · · Score: 5, Informative

      That's why, at our site, all incoming email goes through the Anomy Sanitizer. It removes unknown HTML tags, like <vframe> or <script>, and filters offsite images to eliminate so-called web bugs.

      Oh, and it's fast, too.
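
      A minimal sketch of that kind of filtering, in Python, using only the standard library. This is not the Anomy Sanitizer's actual code or API; the `sanitize` function, the `BLOCKED_TAGS` set, and the sample URL in the usage note are hypothetical names chosen for illustration. It drops script and iframe elements and removes images that would be fetched from a remote server.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumption: a tiny block list for illustration; a real sanitizer
# would more likely work from an allow list of known-safe tags.
BLOCKED_TAGS = {"script", "iframe"}


class _MailSanitizer(HTMLParser):
    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.removed_images = []
        self.skip_depth = 0  # > 0 while inside a blocked element

    def _is_remote_image(self, tag, attrs):
        if tag != "img":
            return False
        src = dict(attrs).get("src") or ""
        if urlparse(src).scheme in ("http", "https") or src.startswith("//"):
            self.removed_images.append(src)
            return True
        return False

    def handle_starttag(self, tag, attrs):
        if tag in BLOCKED_TAGS:
            self.skip_depth += 1
        elif not self.skip_depth and not self._is_remote_image(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img ... /> or <br/>.
        if (tag not in BLOCKED_TAGS and not self.skip_depth
                and not self._is_remote_image(tag, attrs)):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in BLOCKED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def sanitize(html):
    """Return (cleaned_html, removed_image_urls). Comments are dropped."""
    parser = _MailSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out), parser.removed_images
```

      Calling `sanitize('<p>Hi<img src="http://mailserver/get/abcde/image.gif"></p>')` would leave only the paragraph and report the image URL as removed, so the remote server never sees a request for it. Images embedded in the message itself (for example `cid:` references) are left alone.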

    2. Re:Maybe... by karmatic · · Score: 5, Informative

      Macs are vulnerable to the so-called "hole" as well. In fact, _any_ HTML-compliant email client with image support is.

      For example, I wrote some software which takes your email address and assigns it a 5-letter ID. The img tag loads an image with the URL http://mailserver/get/yourid/image.gif

      From this, it's possible to tell 1) whether the email address is valid, 2) whether you click the image (the URL contains your ID), 3) how long it takes before you click, and 4) whether you buy.

      So, if you're dumb enough to buy from spam you get on a sucker list.

      Quit blaming MS - they are unfortunately the ones who introduced HTML mail, but everyone else who follows suit has problems too.

    3. Re:Maybe... by tkokesh · · Score: 5, Informative
      Actually, Mail.app in Mac OS X 10.3 (Panther) has an option in the "Viewing" Preferences: "Display images and embedded objects in HTML messages".

      When this option is unchecked, the user has to click a specific "Load Images" button in order to see the images in an HTML email, which means that the GIF does not get loaded unless the user lets it. For obvious spam emails, of course, the user can just junk the email, and the spammer gets no confirmation of delivery.

      --

      A pride of lions.
      A gaggle of geese.
      A murder of crows.
      A vista of bugs.
  2. i know how by ShallowThroat · · Score: 5, Funny

    it's simple. it uses its extremely uninspired app name to scare away spam.

    --
    The "Insert Quote Here" line is almost as predictable as inserting an actual quote.
  3. subspaces? by thedogcow · · Score: 5, Funny

    The article mentions...

    "In mathematical terms, we would say that every document is a vector of n numbers or a point in a space with n dimensions."

    Funny. When I took linear algebra I was wondering if there was a practical application for this, and I guess there is... to eliminate penis enlargement advertisements.

    --
    Yes! I listen to NYC Speedcore and do math at 3AM. I suggest you try it too.
  4. ...moderation ideas.... by j3ll0 · · Score: 5, Funny

    Why wouldn't a similar algorithm work to provide automated moderation? It seems to me that you could certainly identify clusters of words that indicate low-value posts?

    1. Re:...moderation ideas.... by wheresdrew · · Score: 5, Funny
      Yes, but the combination of too many all too common terms could cause the system to implode.

      "In Soviet Russia imagine a beowulf cluster of insensitive clods who don't RTFA because they're using linux to beat the GNAA to the first post."

  5. n-space by Anonymous Coward · · Score: 5, Funny

    Each document is in turn represented by a long string of numbers, one for each word in the corpus. In mathematical terms, we would say that every document is a vector of n numbers or a point in a space with n dimensions. This coordinate is then mapped onto a unique position in the goatse.cx photograph. If it lands in an objectionable region, the message is discarded as spam.

    It's an interesting method, but not having Mail.app myself, what I'm wondering is how well it works on the border regions; that is, when it is just barely objectionable. Say, on his leg.

  6. Re:Kinda like Mozilla Mail? by BWJones · · Score: 5, Informative

    In fact I'd be willing to bet that it's just another Bayesian e-mail filter with maybe a few extra bells and whistles.

    Actually, data clustering algorithms are completely different beasts than a standard Bayesian analysis. Do a search on k-means clustering or ISODATA clustering methods to see what I mean. However, if you are referring to a Bayesian cluster analysis (like those implemented for genetic analysis of microarrays) then you might be correct, only for reasons you might not intend.

    --
    Visit Jonesblog and say hello.
  7. GD, RTFA! by Zen+Programmer · · Score: 5, Informative

    If you had read the article, you would know it uses vector representation and latent semantic analysis, not Bayesian filters, which in the words of the author, "are essentially weighted keyword systems."

  8. how does it compare to Bayesian? by the+quick+brown+fox · · Score: 5, Interesting
    Is there any hard data out there that shows the cluster analysis actually improves on the better Bayesian algos out there? After all, most of the good ones also achieve the 98%+ that this article cites.

    According to the FAQ of SpamBayes (I think), they're always getting suggestions of ways to tweak their algos that would "obviously" improve the result, but in almost every case it either makes no difference or hurts accuracy, when actually tested on real data.

    1. Re:how does it compare to Bayesian? by inburito · · Score: 5, Funny

      Wow. If your grandma is suggesting you viagra I think your problems go way deeper than Bayesian misfirings..

    2. Re:how does it compare to Bayesian? by SimplyCosmic · · Score: 5, Informative

      Bayesian spam filtering doesn't mark an email as spam simply because of the presence of a single word; it uses a mathematical equation based on the likelihood that each word in the message is a symptom of spam. What you're talking about is simply a spam filter based on a blacklist of words. Bayesian spam filtering uses mathematics to consider how those words are used in the context of the rest of the message, and does a surprisingly good job of it.

      Therefore, "viagra" in your grandmother's email might have a high indication of spamminess, but all the other words will lower the score below the rather high threshold needed to be considered spam.

      That's why training your Bayesian spam filter on the email you receive is so important: it learns what you consider spam from the type of email you receive.

  9. Sounds sufficiently different to me by Anonymous Coward · · Score: 5, Interesting

    Actually, from my understanding of it, it's fairly different.

    I thought Mozilla used Bayesian filtering (which you've mentioned), where words in the email get assigned a probability factor of being spam. These factors are totaled at the end; if the total factor is greater than some predefined value, the message is flagged as spam.

    What this does (in my understanding) is count the number of occurrences of each word in every email, and store that in a huge table. Then it relates messages together based on these word counts. So it's like you get email clusters in N-dimensional space, where each axis is a word, and an email's position on the axis is the number of times that email uses that word. Then the clusters that have a lot of spam mail in them are marked as spam clusters. All the emails in that cluster are then assumed to be spam.
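
    The word-count table described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Mail.app's code: the function names are made up, the tokenizer is deliberately crude, and cosine similarity is my own assumption for how two messages might be "related" by their counts.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase word occurrences in one message."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def to_vectors(texts):
    """Turn messages into equal-length count vectors, one axis per word."""
    counts = [word_counts(t) for t in texts]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[w] for w in vocab] for c in counts]


def cosine(u, v):
    """Cosine similarity: 1.0 for the same direction, 0.0 for no shared words."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

    With the three messages "cheap pills cheap", "cheap pills now" and "lunch on friday", the first two share most of their axes and score high against each other, while the third shares no words with either and scores zero. Grouping messages by such scores is what would produce the clusters.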

    The advantage of this method, I would suppose, is twofold:

    A) When you reduce the N-dimensional space, you would start by eliminating noise words (i.e. words that only occur in a single email). Spam emails that put fake words in to lower their spam probability under the Bayesian method would not benefit with this method.

    B) Messages are grouped by content, so it's possible that the client could group email by a common subject, kind of like automatic intelligent sorting. They do mention that this technology can be used to generate email summaries. So (in theory) not only could spam be sorted out, but so could any other key topics, like work, relatives, viagra purchases...

    At least that's my understanding of it.

  10. Re:Apple spam by timgoh0 · · Score: 5, Informative

    This behaviour is due to the rules set up in Apple Mail. To disable it, go to the Mail preferences, select Rules and remove the entry "news from apple".

  11. Re:Kinda like Mozilla Mail? by DrSchlock · · Score: 5, Informative

    This spam filtering feature seems pretty similar to the one found in Mozilla Mail. In fact I'd be willing to bet that it's just another Bayesian e-mail filter with maybe a few extra bells and whistles.

    Not exactly Bayesian, no. It's a different kind of document classification algorithm, which the article calls Latent Semantic Analysis. Basically they represent each message as a point in a high-dimensional space (based on the unordered words in the document), and figure out which parts of the space tend to be occupied by spam e-mails. This involves quite a lot of computation to determine a likely boundary between the parts of the space representing spam and non-spam messages, given only a collection of labeled points.

    To make this train and run reasonably quickly, they have to do dimensionality reduction on the space: they collapse dimensions which tend to be correlated or redundant or useless. (If "teens" and "gushing" generally appear together in messages, they probably don't need two separate dimensions; if "hi" is equally likely to appear in spam and non-spam, it may not need a dimension at all.)
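
    A rough sketch of the two kinds of collapsing mentioned above, in Python. To be clear about the assumptions: real LSA normally does its reduction with a singular value decomposition, which this does not attempt. This stand-in only drops words that show up in about the same share of spam and non-spam messages, and merges words whose counts are identical in every message. The function name and the `min_gap` threshold are hypothetical.

```python
def prune_dimensions(vocab, vectors, labels, min_gap=0.25):
    """Crude dimension pruning (a stand-in, not an SVD).

    vocab   : list of words, one per column
    vectors : list of count vectors, one per message
    labels  : list of booleans, True where the message is spam
    """
    spam = [v for v, is_spam in zip(vectors, labels) if is_spam]
    ham = [v for v, is_spam in zip(vectors, labels) if not is_spam]

    def doc_rate(rows, j):
        # Share of messages in `rows` that contain word j at least once.
        return sum(1 for r in rows if r[j] > 0) / len(rows) if rows else 0.0

    kept = {}  # column of counts -> words that share exactly that column
    for j, word in enumerate(vocab):
        if abs(doc_rate(spam, j) - doc_rate(ham, j)) < min_gap:
            continue  # about equally common in both classes: uninformative
        column = tuple(v[j] for v in vectors)
        kept.setdefault(column, []).append(word)  # merge redundant words

    new_vocab = ["+".join(words) for words in kept.values()]
    if not kept:
        return new_vocab, [[] for _ in vectors]
    return new_vocab, [list(row) for row in zip(*kept.keys())]
```

    On a toy corpus where "teens" and "gushing" always appear together in spam and "hi" appears in every message, this would leave one merged "gushing+teens" axis and no axis for "hi" at all.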

    A naive-Bayes classifier is much simpler: Assuming that the probabilities of words in a document are all independent, it selects the document type (spam or non-spam) that maximizes the total probability of the observed words. There's no training beyond counting how often each word occurs with each document type.
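
    For comparison, here is a minimal naive-Bayes sketch in Python. It follows the description above (count words per class, pick the class with the highest total probability), with two additions of my own that are assumptions, not anything from the article: add-one (Laplace) smoothing so unseen words don't zero out a class, and log probabilities to avoid underflow. The class and method names are hypothetical, and it needs at least one training message per label before scoring.

```python
import math
from collections import Counter


class NaiveBayes:
    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, words, label):
        """Training is nothing more than counting."""
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1

    def log_score(self, words, label):
        counts = self.word_counts[label]
        vocab_size = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        total = sum(counts.values())
        # Prior: share of training messages carrying this label.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for w in words:
            # Independence assumption: per-word log probabilities just add up.
            score += math.log((counts[w] + 1) / (total + vocab_size))
        return score

    def classify(self, words):
        return max(self.LABELS, key=lambda label: self.log_score(words, label))
```

    Trained on a handful of messages, a classifier like this can still file "grandma asked about viagra at lunch" as non-spam, because the other words outweigh the single spammy one.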

    Naive Bayes typically works nearly as well as more complex methods, and runs much faster. But presumably Apple feels their LSA implementation is fast enough, and sufficiently more accurate than simpler techniques to be worthwhile.

  12. Re:Vectors..... by BWJones · · Score: 5, Informative

    The magic doesn't come from vectors. Vectors are just how you throw the numbers around

    And your point is?

    The reason the classification apparently works well is their choice of representation of the document: a word histogram -- the occurrence count for each word. To measure the distance between two histograms, you usually use the chi-squared test.

    For a univariate (or perhaps bivariate) space this will work, but now try implementing standard chi-square analysis in a multivariate (or hyperspectral) space. It starts to fall short rather quickly, hence the measures of distance between clusters.
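
    For reference, the chi-squared histogram distance from the quoted text can be sketched as below. The function name is made up, and the factor of 0.5 is just one common convention; nothing here is taken from the article or from Mail.app.

```python
def chi_squared_distance(h1, h2):
    """Symmetric chi-squared distance between two equal-length histograms.

    Bins that are empty in both histograms contribute nothing.
    """
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    return 0.5 * sum(
        (a - b) ** 2 / (a + b)
        for a, b in zip(h1, h2)
        if a + b > 0
    )
```

    Identical histograms are at distance zero, and the distance grows as the counts in matching bins diverge. Each bin is treated independently of the others, so relationships between dimensions are not taken into account.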

    Image clustering is hard, and the problem comes from picking a good representation of the image.

    Yes, I do image clustering almost every day. Well, at least a couple times a week. With proper discriminands one can overcome "good image representation" problems.

    Of course, a "word histogram" for an image makes no sense.

    Actually, it does in a sense when you realize that images are simply matrices of numbers just like sentences or paragraphs can be identified as matrices after assigning lookup values to certain properties.

    Just considering pixel intensity or pixel color doesn't work either.

    Actually, yes it does. This is how many standard measures of image cluster analysis work.

    You usually have to start looking at things like lines, curvatures, intersections, texture patterns, etc.

    Actually, no. For many image classification algorithms that examine pixel value (oil bearing strata, concrete vs granite, types of aluminum in missiles etc...), structure or anatomy play absolutely no role in the identification of classes.

    Once you decide which tools you're going to use to describe an image and the algorithms to calculate them, you can start talking about how far away one image is from another, which then naturally leads to clustering techniques.

    That is a very difficult approach to take for image classification that begins to rely on machine processing and image "interpretation" which is a much higher order problem.

    But, the hard part about the clustering is getting them into a space in which they actually, nicely cluster.

    Simply add more discriminands or filters and don't worry about "describing" the image. Other properties (like structure and anatomy) fall out after image clustering.

    --
    Visit Jonesblog and say hello.
  13. Re:Fast?!? by Alan · · Score: 5, Funny

    Dude, you seriously need to seek help for your mail-archiving condition :)

    Or if nothing else move some of the mail to a backup directory so the poor little imap server doesn't have to deal with YOUR pack-rat habits!

  14. Not if email is marked as junk... by SuperKendall · · Score: 5, Informative

    If an email is marked as junk, no images are loaded even if you open it to check whether it's really junk, so this tracker does not work.

    As others have mentioned you can also turn off images for all messages, which is what I would do if it ever started missing spam. So far only one miss in the last six months or so, and no false positives. I'm pretty impressed.

    --
    "There is more worth loving than we have strength to love." - Brian Jay Stanley
  15. Good god, man by thatguywhoiam · · Score: 5, Informative
    Wow, a checkbox buried in the preferences options. Apple is unique and ahead of the curve. But wait! There is a fix for Outlook too [msnwar.com].

    Well, since you brought it up, yes, let's compare:

    Apple method:
    Open Prefs
    Click Viewing Options
    Uncheck 'Display images and embedded objects in HTML messages'

    ... or I can go hunting on the web for this weirdo, non-sanctioned 'patch' for Outlook, and install that. Oh yeah, and ZoneAlarm.

    I'll stick with Apple's method thanks.

    --
    If Jesus wants me it knows where to find me.