Slashdot Mirror


How Apple's Mail.app Junk Filter Works

fmorgan writes "O'Reilly has now posted the second part of an article about Mac OS X Mail.app spam filtering, with more details on what this technology is (and isn't): 'Many myths have emerged about Mail's junk mail filter. No, it's not an extremely complex set of rules, no it doesn't look for keywords, and no, it doesn't use white magic ... Interestingly enough, the technology that underlies the Junk Mail filter began its life as an information retrieval system.'"

14 of 273 comments

  1. Magic by Faust7 · · Score: 4, Funny

    and no, it doesn't use white magic...

    Black, then?
    Or is that reserved exclusively for Microsoft?

    1. Re:Magic by Jameth · · Score: 4, Funny
      and no, it doesn't use white magic...

      Black, then? Or is that reserved exclusively for Microsoft?
      It's not reserved, they have a monopoly.
  2. i know how by ShallowThroat · · Score: 5, Funny

    it's simple. it uses its extremely uninspired app name to scare away spam.

    --
    The "Insert Quote Here" line is almost as predictable as inserting an actual quote.
    1. Re:i know how by jjeffries · · Score: 4, Funny

      I hear that the next version will be known as "mail-enhancemant.app"

  3. subspaces? by thedogcow · · Score: 5, Funny

    The article mentions...

    "In mathematical terms, we would say that every document is a vector of n numbers or a point in a space with n dimensions."

    Funny. When I took linear algebra I was wondering if there was a practical approach to this, and I guess there is... to eliminate penis enlargement advertisements.

    --
    Yes! I listen to NYC Speedcore and do math at 3AM. I suggest you try it too.
    1. Re:subspaces? by Capt'n Hector · · Score: 3, Funny
      When I took linear algebra I was wondering if there was a practical approach to this

      If by "this" you mean spam filtering, then cool. But if you're talking about applications in general... Are you kidding? Linear algebra is probably the most useful stuff you'll ever learn, especially if you're into computers. It's the stuff CG is made of. EVERYTHING uses linear algebra.

      So here's a guess on how this works: So you've got your document vector. You also have a vector space, call it S for "spam". Choose your basis for S to be a bunch of words commonly found in spam. Now, orthogonally project your document vector into S, take the Euclidean norm and if it's too long -- zap it! It's spam!
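
      For the curious, the guess above can be sketched in a few lines of Python. This is only an illustration of the projection idea in this comment, not how Mail.app actually works (the summary says the real filter does not look for keywords). The word list, the threshold, and every function name here are made up for the example.

```python
import math
import re
from collections import Counter

# Hypothetical basis for the "spam" subspace S: one axis per word.
# These words are an assumption for illustration only.
SPAM_WORDS = ["viagra", "enlargement", "free", "offer"]


def document_vector(text):
    """Represent a document as word counts: one coordinate per distinct word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def spam_projection_norm(text, basis=SPAM_WORDS):
    """Euclidean norm of the document vector projected onto S.

    Each basis vector is a single word axis, so the basis is orthonormal
    and the orthogonal projection simply keeps those coordinates.
    """
    vec = document_vector(text)
    return math.sqrt(sum(vec[word] ** 2 for word in basis))


def is_spam(text, threshold=2.0):
    """Zap the message if its projection onto S is too long (assumed cutoff)."""
    return spam_projection_norm(text) > threshold
```

      With these assumptions, is_spam("Free offer: free Viagra") comes out true and is_spam("Lunch at noon tomorrow?") comes out false. Raw counts favor long messages, so a real filter would probably normalize the document vector first.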

      --
      Quid festinatio swallonis est aetherfuga inonusti?
      Africus aut Europaeus?
  4. ...moderation ideas.... by j3ll0 · · Score: 5, Funny

    Why wouldn't a similar algorithm work to provide automated moderation? It seems to me that you could certainly identify clusters of words that indicate low-value posts?

    1. Re:...moderation ideas.... by wheresdrew · · Score: 5, Funny
      Yes, but the combination of too many all too common terms could cause the system to implode.

      "In Soviet Russia imagine a beowulf cluster of insensitive clods who don't RTFA because they're using linux to beat the GNAA to the first post."

  5. n-space by Anonymous Coward · · Score: 5, Funny

    Each document is in turn represented by a long string of numbers, one for each word in the corpus. In mathematical terms, we would say that every document is a vector of n numbers or a point in a space with n dimensions. This coordinate is then mapped onto a unique position in the goatse.cx photograph. If it lands in an objectionable region, the message is discarded as spam.

    It's an interesting method, but not having Mail.app myself, what I'm wondering is how well it works on the border regions; that is, when it is just barely objectionable. Say, on his leg.

  6. Re:Kinda like Mozilla Mail? by jcr · · Score: 4, Funny

    I'd be willing to bet that it's just another Bayesian e-mail filter with maybe a few extra bells and whistles.

    Umm, how much would you want to bet? I'll take that action!

    -jcr

    --
    The only title of honor that a tyrant can grant is "Enemy of the State."
  7. Re:how does it compare to Bayesian? by inburito · · Score: 5, Funny

    Wow. If your grandma is suggesting Viagra to you, I think your problems go way deeper than Bayesian misfirings...

  8. Re:Summary Service by Mikey-San · · Score: 4, Funny

    Input:

    Wow, the article just turned me on to the Summary Service. And I just used it to read a short and sweet summary of the article.

    If you haven't played with it select a bunch of text (in a Cocoa app) and select Summary from the Services menu.

    Very cool...

    Output:

    Wow, the article just turned me on to the Summary Service. And I just used it to read a short and sweet summary of the article.

    If you haven't played with it select a bunch of text (in a Cocoa app) and select Summary from the Services menu.

    Wow, look at that! Impressive!

    (I actually love Summary Service, but I couldn't resist that joke.)

    --
    Mikey-San
    Karma: +Eleventy billion (mostly affected by watching Celebrity Jeopardy)
  9. Re:Fast?!? by Alan · · Score: 5, Funny

    Dude, you seriously need to seek help for your mail-archiving condition :)

    Or if nothing else move some of the mail to a backup directory so the poor little imap server doesn't have to deal with YOUR pack-rat habits!

  10. Information Retrieval by ScottGant · · Score: 4, Funny

    This is Information Retrieval not Information Dispersal...Information Transit got the wrong man. I got the right man. The wrong one was delivered to me as the right man, I accepted him on good faith as the right man. Was I wrong?

    My name's Lowry. Sam Lowry. I've been told to report to Mr. Warrenn.
    Thirtieth floor, sir. You're expected.
    Um... don't you want to search me?
    No sir.
    Do you want to see my ID?
    No need, sir.
    But I could be anybody.
    No you couldn't sir. This is Information Retrieval.


    There you are, your own number on your very own door. And behind that door, your very own office! Welcome to the team, D7-105! Welcome to Information Retrieval

    --

    "Music is everybody's possession. It's only publishers who think that people own it." - John Lennon.