Slashdot Mirror


How Apple's Mail.app Junk Filter Works

fmorgan writes "O'Reilly has now posted the second part of an article about Mac OS X Mail.app spam filtering with more details on what this technology is (and isn't): 'Many myths have emerged about Mail's junk mail filter. No, it's not an extremely complex set of rules, no it doesn't look for keywords, and no, it doesn't use white magic ... Interestingly enough, the technology that underlies the Junk Mail filter began its life as an information retrieval system.'"

33 of 273 comments

  1. Maybe... by ErichTheWebGuy · · Score: 5, Interesting

    Microsoft can learn a lesson here? Especially in the light of this hole, from which a spammer can clearly see that you have opened their messages and validate your address...

    --
    bash: rtfm: command not found
    1. Re:Maybe... by nacturation · · Score: 4, Interesting

      I assume web bug images aren't filtered out if they are, for example:

      http://host.com/images/1F59C6EA.jpg

      A spammer could set up their server (mod_rewrite, I think?) so that this gets translated to:

      http://host.com/serve_image.php?email_id=1F59C6EA

      This would still verify the email address and would generally be transparent to the user. The filter could get smarter and search for numbers, but this is also easily overcome by dictionary words. If you used 5 letter words, you'd have about 10,000 of them to use. You could then represent 100,000,000 (10,000 ^ 2) email addresses using only two five letter words in succession in a URL, such as:

      http://host.com/img/abash/zymin/logo.jpg

      and rewriting it as before. Each user gets a unique combination of two words that uniquely identifies them. If abash is the 9th word and zymin is the 9914th word, then this is user id (9 * 10,000 + 9914) = 99,914.
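      The word-pair arithmetic above can be sketched in a few lines of Python. This is only an illustration: the four-word list, the function names, and the path layout are assumptions, and positions are zero-based here, whereas the comment counts words from the 9th and the 9914th.

```python
# Hypothetical sketch: encode a numeric user id as two dictionary words and back.
# WORDS is a tiny stand-in; the comment above assumes a list of about 10,000 words.
WORDS = ["abash", "cabin", "delta", "zymin"]


def capacity(words):
    """Number of distinct ids that two words in succession can represent."""
    return len(words) ** 2


def id_to_words(user_id, words):
    """Split a user id into (first word, second word) using zero-based positions."""
    if not 0 <= user_id < capacity(words):
        raise ValueError("user id does not fit in two words")
    first, second = divmod(user_id, len(words))
    return words[first], words[second]


def words_to_id(first, second, words):
    """Invert id_to_words: first position * list size + second position."""
    return words.index(first) * len(words) + words.index(second)


def tracking_path(user_id, words):
    """Build an image path shaped like the one in the comment (layout is illustrative)."""
    first, second = id_to_words(user_id, words)
    return f"/img/{first}/{second}/logo.jpg"
```

      With a 10,000-word list the same functions would cover 10,000 ^ 2 ids, which is why filtering on "numbers in the URL" alone does not help.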

      Really, the only solution to web bugs is to not load images from unknown senders. Make the user manually load images (mail.app has this feature as do many other clients) if they are not attached as files with the message.

      --
      Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
    2. Re:Maybe... by Merk · · Score: 2, Interesting

      Why leave any HTML? Does <blink> make a message more compelling? Do you really need someone to send a message with balloons in the background? If someone really likes the handwriting font, should I be forced to see that in their email?

      Sure, sometimes in a complex email it would be nice to be able to use headers or bulleted lists. But nobody should be able to force me to display the message with their ugly-ass markup.

      The only thing that makes any sense here is to use strict stylesheet-based markup. Someone can label things as 'headers' and 'bulleted lists'. Then, the receiver can have a stylesheet that properly renders these types of content markers so the information isn't lost. That way, 'chick who likes balloon backgrounds' can display all her incoming emails that way, and 'guy who likes unreadable fonts' can have all his incoming emails displayed in that font... but those of us who like black, 12pt Times New Roman text on white backgrounds can avoid being driven insane.

      Tags like <i> and <b> and <blink> and <font> shouldn't ever be part of email.
  2. Vectors..... by BWJones · · Score: 4, Interesting

    Each document is in turn represented by a long string of numbers, one for each word in the corpus. In mathematical terms, we would say that every document is a vector of n numbers or a point in a space with n dimensions. I know it sounds quite geeky but if you can visualize that, you're halfway there.

    Ah, it uses vector math. With Altivec, no wonder Mail is so damned fast.

    The other really interesting thing about mail is that it implements clustering algorithms to rank and group which makes me wonder why more GIS software is not running on OS X. Image classification would be a no brainer for folks that spend their time examining images and multispectral datasets.

    --
    Visit Jonesblog and say hello.
    1. Re:Vectors..... by RovingSlug · · Score: 2, Interesting
      The magic doesn't come from vectors. Vectors are just how you throw the numbers around

      And your point is?

      Ah, that's the main point. Both the article and your original post focus on the fact that vectors are being used. While true, this doesn't really impact the essence of the algorithm -- effectively addressing the lower-level data structures instead of the higher-level algorithms. Perhaps an analogy might be someone describing Google's search by explaining B-trees instead of getting into what process actually determines that one page is better than another for a given search.

      I'm not going to address the finer details of image classification further than that the techniques you describe require a significant amount of preparation, selection, and manipulation up-front by a human before a computer can produce useful results. Rather, I used image classification as a motivation to describe why discussing only the notions of "vectors" and "clusters" misses a huge part of the story of what actually makes these sorts of techniques work.

  3. Full text search goodness by vikman · · Score: 3, Interesting

    Now we understand why Apple is so good at doing full text searches and filesystem wide searches. I wish we had the same type of search functionality in Mozilla that Mail.app boasts of.
    That is the one feature that Mozilla's mail client really could use.

    --
    --
  4. how does it compare to Bayesian? by the+quick+brown+fox · · Score: 5, Interesting
    Is there any hard data out there that shows the cluster analysis actually improves on the better Bayesian algos out there? After all, most of the good ones also achieve the 98%+ that this article cites.

    According to the FAQ of SpamBayes (I think), they're always getting suggestions of ways to tweak their algos that would "obviously" improve the result, but in almost every case it either makes no difference or hurts accuracy, when actually tested on real data.

    1. Re:how does it compare to Bayesian? by turkmenistani · · Score: 2, Interesting

      But, like the article mentions, what happens when your grandma sends you an email mentioning viagra? Traditional Bayesian algorithms would automagically flag it as spam and delete it. The problem with traditional spam filters is that they might block all incoming spam, but they might also block something you might have wanted to read.

    2. Re:how does it compare to Bayesian? by jcr · · Score: 2, Interesting

      Bayesian filtering is a subset of what LSM can do. If you get to WWDC this year, find Kim Silverman and ask him to explain it to you.

      -jcr

      --
      The only title of honor that a tyrant can grant is "Enemy of the State."
    3. Re:how does it compare to Bayesian? by lupin_sansei · · Score: 2, Interesting

      No they wouldn't. Bayesian filters would see the word "viagra" and give that a high spam score, but all the other words that your Aunty used would probably have a very high ham score (not spam). Thus it would probably score the entire email as ham.

      That's the great thing about Bayesian filters, they score the entire email not just look for single keywords.
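      A minimal sketch of that whole-message scoring, assuming a naive Bayes combination with equal priors. The per-word probabilities, the default for unseen words, and the function name are made up for illustration and are not taken from any particular filter.

```python
import math


def combined_spam_probability(words, spam_prob, default=0.4):
    """Combine per-word spam probabilities into one score for the whole message.

    Assumes words are independent and spam/ham are equally likely a priori,
    so the per-word log-odds simply add up.
    """
    log_odds = 0.0
    for word in words:
        p = spam_prob.get(word, default)  # 'default' for unseen words is an assumption
        log_odds += math.log(p) - math.log(1.0 - p)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Made-up training results: one very spammy word, several very hammy ones.
SPAM_PROB = {"viagra": 0.99, "aunt": 0.05, "recipe": 0.05, "turkey": 0.10, "love": 0.20}

aunty_mail = ["viagra", "aunt", "recipe", "turkey", "love"]
score = combined_spam_probability(aunty_mail, SPAM_PROB)
```

      With these invented numbers the hammy words outweigh the single spammy one, so the score lands well below one half; real filters differ in how they pick and weight the words.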

    4. Re:how does it compare to Bayesian? by Ibanez · · Score: 2, Interesting

      Actually, I saw this article and figured I could rant a little. I really am not impressed by it. I get 200 or so junk mail every week, and about a quarter of that gets through. And some of these to me seem really obvious. It doesn't really seem to learn anymore either. I've never had a false positive, which is pretty good, but I'd still love to find a way to implement a Bayesian filter in Mail.

      Blake

    5. Re:how does it compare to Bayesian? by wirelessbuzzers · · Score: 2, Interesting

      It's pretty hard to compare algorithms, at least ones that might work, such as chi squared (SpamBayes) vs Bayesian (Plan for Spam, CRM114, lots more) vs point totals (SpamAssassin) vs cluster analysis (Mail.app).

      As for implementations, CRM114 kicks the shit out of Mail.app's filter, at least on my and my roommate's mixes. About the only thing that CRM114 hasn't caught for me is those 1-line virus spams with a .zip attached, and new classes of spam (last week I received my first stock spam). The false positive rate is very low and generally confined to advertisements that I don't want to read, but are from other students over the house lists, or the like. I've been considering retraining those as spam anyway.

      The author claims 99.984% filtering rate, which is higher than I get... but then, I don't get as much spam as he does, and I use whitelists, which are said to hurt the accuracy in favor of zero false positives from that segment.

      --
      I hereby place the above post in the public domain.
    6. Re:how does it compare to Bayesian? by Nuclear+Elephant · · Score: 3, Interesting

      98% is pretty pathetic - 1 error in 50. Most good Bayesian filters (SpamProbe, CRM114, DSPAM) can reach at least 99.9% (1 error in 1000) with ease. Others can grow far beyond this and reach as high as 99.985%, as a recent slashdot article covered (and this one). I reset my stats a few weeks ago, and out of 1800 spams so far, 0 have made it through. The only problem with Bayesian filtering is that it's mismarketed by companies who insist they have a better solution (although it's less accurate).

      And to answer your question - collaborative filtering, such as message inoculation, works quite well at boosting accuracy even beyond the high levels of accuracy Bayesian filters are really capable of, whereas things like shared groups and such hurt it.

  5. Summary Service by spankalee · · Score: 4, Interesting

    Wow, the article just turned me on to the Summary Service. And I just used it to read a short and sweet summary of the article.

    If you haven't played with it, select a bunch of text (in a Cocoa app) and select Summary from the Services menu.

    Very cool...

    1. Re:Summary Service by nikster · · Score: 2, Interesting
      below is the default output:
      In today's article of this three-part series, I'm going to fine-tune this strategy, plus take a closer look at Mail.app, so that you can more fully unleash its potential.

      ...Interestingly enough, the technology that underlies the Junk Mail filter began its life as an information retrieval system, developed in the Apple labs to help users who managed thousands or millions of large documents find the one they were looking for easily.

      ...The Apple data kit allows the user to find the single document that best represents each topic.

      ...The main advantage of vector representation is that this technology does not rely on word order to do its work -- you can have a look at our speech article to learn more about why this is important.

      ...So, a document that contains "Aunt Emma" and "cooking tips" at the beginning and the end of a page can well be in the same cluster as a text that talks precisely about "the time Aunt Emma sent you cooking tips."

      ...Imagine this: take the biggest issue you can find of the Mac Developer Journal and put it in your left hand, and put your favorite dictionary in your right hand.

      ...Let's say, for example, that your Aunt Emma, in her cooking tips, talks about a "hippopotamus" (as in "For the turkey to be tasty, it should be quite large but obviously, you don't want a hippopotamus-sized one.").

      ...If each document is a point in a X0,000-dimension space or so, we reduce its dimensionality into a small number of dimensions that capture the salient patterns and the majority of the variation in the corpus.

      ...Like we did before, you can perform a bit of cluster analysis and find clusters of documents that each represent a topic.

      ...Because words are distributed in the same space as documents, you can find the words that are closer to the center of a document cluster.

      ...Even though Apple is not the only company working on such technologies, they do seem to be the only ones to have made it so accessible to end users and powerful at the same time. In fact, they do it so well that it is now at the center of many system components as we have seen, requiring them to continuously refine the calculations and develop the formal mathematical representations -- all for your benefit.

      ...The other traditional approach is to look at the sender and not accept any message from any known junk-mail sender.
  6. os x's mail filter is great by squarefish · · Score: 3, Interesting

    but it's a whole lot better with junkmatcher central

    --
    Creationists are a lot like zombies. Slow, but powerful and numerous. And they all want to eat our brains.
  7. Apple spam by seanadams.com · · Score: 4, Interesting

    I have marked every single announcement and special offer I've ever received from Apple as junk, and yet the filter still refuses to classify them as such automatically.

    I wonder if there's a loophole here that spammers could take advantage of: masquerade as Apple using the hole they've left in their filter. Spam Mac users to your heart's content. Bundle a Mac virus along with it for extra damage.

    Please don't mod this down just because you like Macs. I like Macs too, but it really looks like there is a back door in the spam filter and I'm just reporting it - not mac bashing.

  8. Sounds sufficiently different to me by Anonymous Coward · · Score: 5, Interesting

    Actually, from my understanding of it, it's fairly different.

    I thought Mozilla used Bayesian filtering (which you've mentioned), where each word in the email is assigned a probability of being spam. These probabilities are combined at the end; if the combined score is greater than some predefined value, the message is flagged as spam.

    What this does (in my understanding) is count the number of occurrences of each word in every email, and store that in a huge table. Then it relates messages together based on these word counts. So it's like you get email clusters in N-dimensional space, where each axis is a word, and an email's position on the axis is the number of times that email uses that word. Then the clusters that have a lot of spam mail in them are marked as spam clusters. All the emails in that cluster are then assumed to be spam.

    The advantage of this method, I would suppose, is twofold:

    A) When you reduce the N-dimensional space, you would start by eliminating noise words (i.e. words that only occur in a single email). Spam emails that put fake words in to lower their spam probability in the Bayesian method would not benefit with this method.

    B) Messages are grouped by content, so it's possible that the client could group email by a common subject, kind of like automatic intelligent sorting. They do mention that this technology can be used to generate email summaries. So (in theory) not only could spam be sorted out, but so could any other key topics, like work, relatives, viagra purchases...

    At least that's my understanding of it.
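    As a rough sketch of the word-count table described above (my own illustration, not Apple's implementation): each email becomes a row of counts, and words that occur in only one email are dropped as noise. The whitespace tokenizer and the function name are assumptions.

```python
from collections import Counter


def build_count_table(emails):
    """Return (vocabulary, table) where table[i][j] counts word j in email i.

    Words that appear in only a single email are treated as noise and removed,
    which is the first dimensionality-reduction step suggested in the comment.
    """
    counts = [Counter(text.lower().split()) for text in emails]

    # Document frequency: in how many emails does each word appear?
    doc_freq = Counter()
    for email_counts in counts:
        doc_freq.update(email_counts.keys())

    vocabulary = sorted(word for word, df in doc_freq.items() if df > 1)
    table = [[email_counts[word] for word in vocabulary] for email_counts in counts]
    return vocabulary, table


EMAILS = [
    "cheap pills cheap xqzt",         # 'xqzt' plays the role of a random filler word
    "cheap pills now",
    "turkey recipe from aunt emma",
    "aunt emma turkey tips",
]
vocabulary, table = build_count_table(EMAILS)
```

    Each row is then one point in a space with one axis per remaining word; clustering those points is a separate step that this sketch leaves out.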

  9. It's Cyberdog! by Blackbrain · · Score: 2, Interesting
    Apple has finally brought Cyberdog back!

    Kickin it Apple Old School.

    --
    Where would we be if Wheel had hid her round rock in a cave instead of showing everyone how it rolls?
  10. Missing functionality by nsayer · · Score: 4, Interesting

    Here's the problem I have with mail.app's spam filtering:

    I have several macs, and an IMAP server. The simple fact is that Mail.app doesn't share the filtering database. So the training winds up being sort of haphazard.

    I suppose I should designate a particular machine to be the spam filtering IMAP client and have the rest of them not participate, but then I can't train on those subservient machines.

    It'd be much better if multiple Mail.app IMAP clients could store their database on the server and share it.

  11. Fast?!? by SuperBanana · · Score: 4, Interesting
    With Altivec, no wonder Mail is so damned fast.

    Sorry, but I couldn't let this one slide. You've obviously got a special interpretation of "fast", because I tried migrating my Eudora mailboxes to Mail, on a 1Ghz Powerbook G4.

    Mail CHOKED on them. The early version of Mail chugged for 2 something hours and I gave up and killed it. The latest version was slightly better; 1000 messages or so still took well over 10 minutes. It takes Eudora about 10 seconds to rebuild those big mailboxes (deleted messages aren't actually deleted until Eudora gets around to rebuilding the mailbox; you can set the limit based on percentage of the mailbox, raw MB, I think even % remaining disk space), or you can force it manually with one click in that mailbox's window. My inbox is 820, and several mailing list boxes are well over 5,000 if I forget to clean them out. I have hundreds of MB of mail, and Eudora handles most operations with little performance hit no matter how big the mailbox gets (there is a limit of around 32,000 messages however, which someone I know hit).

    But that was just the importing- then it had to thread them or something, and THEN it had to index them all, both of which it did in the background, but still took forever.

    Searching? Well, ok, it's "better" than Eudora in that it gives relevancy and Eudora is an on/off sorta deal, but that's fine- and I prefer 1 second for an exact search in a 2,000 message mailbox over 5-10 seconds for a fuzzy search.

    Sorry, but Eudora, despite being a lumbering dinosaur technology-wise (MIME support is broken, PGP-MIME just doesn't work right, and the lack of address book integration really irritates me), is just plain hands-down the fastest mail client around.

    The MBOX-with-index format also works exceedingly well, is portable (although some minor massaging with text-processing tools may be needed in some cases), and hard to corrupt- unlike almost every other mail client's DB (especially outlook). I've used Eudora for ten years, and never lost a single message except for one early beta version which munged a mailbox on me.

    1. Re:Fast?!? by pHDNgell · · Score: 4, Interesting

      Sorry, but I couldn't let this one slide. You've obviously got a special interpretation of "fast", because I tried migrating my Eudora mailboxes to Mail, on a 1Ghz Powerbook G4.

      Mail CHOKED on them.


      Everyone's got a story and a counter-story. I've got over 100,000 messages in IMAP (101,269 as of last night, but it goes up and down), fully synced to Mail.app (bodies and attachments) indexed for searching, and used every day. It's split over 250 mail boxes (one for each month I've sent or received email as long as I've been keeping stuff).

      It's amazingly fast. It makes my mail server seem fast (Sun IPX running SunOS 4.1.4 with a custom cyrus IMAPd that supports compressed mail stores and LDAP and some other stuff).

      (Sorry for all the parentheticals. :)

      --
      -- The world is watching America, and America is watching TV.
    2. Re:Fast?!? by Rosyna · · Score: 2, Interesting

      Uhm, I've got about 5 mailboxes that have hit this 32760 message limit (dunno why but they recently reduced it to 32000).

      My Mail folders contain 2.31 gigs of email. Mail cannot handle this and chokes on it horribly. Eudora handles it like a champ. Too bad its junk mail filter sucks.

    3. Re:Fast?!? by mandelbaum · · Score: 2, Interesting

      Yeah for Eudora, the mail client with option-click to automatically group messages by whatever you click on. It's the best thing...ever.

      I've been using it since 1994 and can't imagine switching to anything else.

      -aaron

    4. Re:Fast?!? by richie2000 · · Score: 2, Interesting
      Has anything good come out since SunOS 4.1.4?

      I don't think so. Considering the time it took to get 4.1.4 as the proverbial gift from the Gods, I wouldn't hold my breath. ;-)

      Damn, I actually miss SunOS, SunView and the 3/80s we had at school...

      --
      Money for nothing, pix for free
    5. Re:Fast?!? by nikster · · Score: 3, Interesting

      Mail CHOKED on them

      it helps to check Apple apps _again_ from time to time since they tend to make huge improvements with every release. Mail.app has not been slow for a while now. Apple seems to pretty consistently follow the strategy "make it work first, make it fast later". i am running the latest version on OS X 10.3.

      I have about 1G of mail and it doesn't really seem slow in any situation, even though it's running on an almost 3-year-old 667MHz powerbook (with a sloooow hard disk).

      I just did a test of search entire message in all mailboxes (all 1G of them). the first results appeared after 3 seconds, and it stopped after 40 secs, rebuilding some indexes along the way. the second search was done in about 15 seconds.

      Every single criticism i had since Mail 1.0 - and there were a lot, including performance - has since been addressed. It is now fast, no annoying modal dialogs, no indexing behind your back, no weird delays. It's just a beautiful mail client.

      i recommend you try it again.

      On topic: The junk mail filter seems to indeed work pretty well. i just checked my junk mail folder (2000 unread messages, heh): All except for 5 were spam, and those 5 were all mass mailings, too. Even clever(?) subject lines like v$a.g.r.a and such were filtered out.

      Oddly, 3 of the 5 false positives were from Apple, sent to my .mac account.

    6. Re:Fast?!? by EvilTwinSkippy · · Score: 2, Interesting
      As a network administrator I just have to do a paternalistic scowl at you.

      2.3 gig of email. Dear god, our server only has a 20 gig hard drive. I'd be camped out at your office (or send a co-op to camp in your office) and make disparaging remarks about "bloat" until you trimmed up a bit.

      If everything is important, nothing is important. 32,000 messages means you aren't real picky.

      --
      "Learning is not compulsory... neither is survival."
      --Dr.W.Edwards Deming
    7. Re: Fast?!? by teridon · · Score: 2, Interesting

      I had the same experience with Mail -- I let it chug away *overnight* to import my mail. The next day when I tried actually *using* Mail it was too slow compared to Eudora. What a waste of time :(

      FYI, Eudora 6.1 now has address book integration. See here

      --
      I hold it, that a little rebellion, now and then, is a good thing. -- Thomas Jefferson
  12. Word disguises? by Piquan · · Score: 2, Interesting

    The big problem I see in spamland today isn't the classification technology. It's the word recognition problem. Sure, "VIAGRA" may be deeply embedded in a "spam" cluster, but what about "V1_4G ra"? If spammers weren't disguising their words, I think that Bayesian filtering and other techniques would work fine. I'm not really sure that more advanced techniques in word classification are really needed here.

  13. This is probably off-topic by teamhasnoi · · Score: 4, Interesting
    All my emails to a couple of people suddenly started bouncing with a 550 'Administrative Prohibition' error last week - at first I blamed my ISP, then blamed my host, then the receiving host, all for naught. I then found I was on a couple of blacklists (probably because I apparently shared a virtual host with a scummy mortgage guy), but these had no bearing (I learned later).

    I had emails out to every link in the chain, but no one knew what was going on.

    In Apple Mail, I had my 'reply to' names set to my email addys - I changed it to short descriptive names and now they're not bouncing anymore. (odd error, so I thought I'd post it)

    Why this started all of a sudden, and why no host or ISP had heard of this before, I don't know.

    I do know that being on a blacklist and attempting to get off of it is nigh impossible, so I'd be all over Apple making spam filtering software so overzealous wizards of blacklists can be kicked to the curb. (Why is this in use anywhere..?)

  14. Privacy violation by michaeldot · · Score: 2, Interesting

    You mean these "Vectors" (sounds foreign) are watching everything in my email?!!

    Well, if that isn't a gross invasion of privacy then my name's not Liz Figueroa.

    I'm drafting a letter to the Senate immediately... on a typewriter.

  15. Document Vectors - Term Weights by agentofchange · · Score: 3, Interesting
    Forgetting about vectors is silly.

    In short: a vector is the result of a calculation based on the number of times a term is used in a document and the terms in the other documents it is being compared with (the document set).

    The angle between document (email) vectors is a representation of their likeness. For example if the angle is very small the documents have a lot in common.

    This is how the mail app works. It compares known junk emails (i.e. the query) to the incoming document set (new emails).

    There are a number of weighting schemes, for example Term Frequency Weights (TF Weights) or Term Frequency Inverse Document Frequency (TF-IDF Weights).
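    To make the weighting and the angle concrete, here is a small sketch assuming one common TF-IDF variant (term count times log of N over document frequency) and plain cosine similarity. The function names and sample texts are invented, and nothing here is claimed to match what Mail.app actually computes.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vocabulary = sorted(doc_freq)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[w] * math.log(n_docs / doc_freq[w]) for w in vocabulary])
    return vocabulary, vectors


def cosine(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction, 0.0 means nothing shared."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


DOCS = [
    "buy cheap pills now",                # known junk (the "query")
    "cheap pills buy today",              # incoming mail that resembles it
    "the turkey recipe from aunt emma",   # incoming mail that does not
]
vocabulary, vectors = tfidf_vectors(DOCS)
junk_like = cosine(vectors[0], vectors[1])
aunt_like = cosine(vectors[0], vectors[2])
```

    A small angle corresponds to a cosine near 1. Note that with this variant a term appearing in every document gets weight zero, which is the usual way stop words drop out.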

    There are a few laws that are particularly relevant to Information Retrieval. Heaps' Law (the larger a document gets, the fewer new words are added to it).

    http://planetmath.org/encyclopedia/HeapsLaw.html

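    Zipf's Law: More relevant to document weighting schemes. It states that a word's frequency is roughly inversely proportional to its frequency rank, so a handful of words account for most of the text. Those very frequent words are the least useful for telling documents apart. For example stop words such as "a, the, it, and, is" all carry little meaning and are used frequently.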

    http://planetmath.org/encyclopedia/ZipfsLaw.html

    Less frequently used words in a document are better at describing its content. For example " pixel intensity mathematical concepts".

    -- Agent

  16. Re:But you still get the spam... by rudedog · · Score: 4, Interesting

    The sender would just receive a message from the mail server saying that their mail was marked as spam

    Sadly, if it is spam, then you'll be punishing thousands of innocent people whose email addresses have been forged by the spammers, by sending them the bounce messages. Very little actual spam gets past my Bayesian filters, but I do get a lot of bounces from other people's spam filters for messages and viruses that I never sent.