Slashdot Mirror


How Apple's Mail.app Junk Filter Works

fmorgan writes "O'Reilly has now posted the second part of an article about Mac OS X Mail.app spam filtering with more details on what this technology is (and isn't): 'Many myths have emerged about Mail's junk mail filter. No, it's not an extremely complex set of rules, no it doesn't look for keywords, and no, it doesn't use white magic ... Interestingly enough, the technology that underlies the Junk Mail filter began its life as an information retrieval system.'"

22 of 273 comments

  1. Magic by Faust7 · · Score: 4, Funny

    and no, it doesn't use white magic...

    Black, then?
    Or is that reserved exclusively for Microsoft?

    1. Re:Magic by Jameth · · Score: 4, Funny
      and no, it doesn't use white magic...

      Black, then? Or is that reserved exclusively for Microsoft?
      It's not reserved, they have a monopoly.
    2. Re:Magic by Inf0phreak · · Score: 2, Funny
      Oh yes. I can just imagine how some of the code looks:

      if (isspam(mailentry)) HADOKEN(mailentry);

      Go here for an explanation (funny webcomic IMO).

      --
      ________
      Entranced by anime since late summer 2001 and loving it ^_^
  2. i know how by ShallowThroat · · Score: 5, Funny

    it's simple. it uses it's extremely uninsipired app name to scare away spam.

    --
    The "Insert Quote Here" line is almost as predictable as inserting an actual quote.
    1. Re:i know how by jjeffries · · Score: 4, Funny

      I hear that the next version will be known as "mail-enhancemant.app"

    2. Re:i know how by Anonymous Coward · · Score: 1, Funny

      And apparently you use your extremely poor use of the English language to scare off replies. Have you ever considered picking up a 7th grade grammar book?

  3. subspaces? by thedogcow · · Score: 5, Funny

    The article mentions...

    "In mathematical terms, we would say that every document is a vector of n numbers or a point in a space with n dimensions."

    Funny. When I took linear algebra I was wondering if there was a practical approach to this, and I guess there is... to eliminate penis enlargement advertisements.

    --
    Yes! I listen to NYC Speedcore and do math at 3AM. I suggest you try it too.
    1. Re:subspaces? by Capt'n+Hector · · Score: 3, Funny
      When I took linear algebra I was wondering if there was a practical approach to this

      If by "this" you mean spam filtering, then cool. But if you're talking about applications in general... Are you kidding? Linear algebra is probably the most useful stuff you'll ever learn, especially if you're into computers. It's the stuff CG is made of. EVERYTHING uses linear algebra.

      So here's a guess on how this works: So you've got your document vector. You also have a vector space, call it S for "spam". Choose your basis for S to be a bunch of words commonly found in spam. Now, orthogonally project your document vector into S, take the Euclidean norm and if it's too long -- zap it! It's spam!
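
      A minimal sketch of that guess, in Python. This only illustrates the projection idea described above; it is not how Mail.app works (the article says the filter does not look for keywords). The word list SPAM_BASIS, the threshold of 2.0, and all function names are made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical spam vocabulary: the basis of the "spam" subspace S.
SPAM_BASIS = ("viagra", "enlargement", "free", "winner")


def document_vector(text):
    """Map text to word counts: one coordinate per distinct word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def project_onto_spam(vec, basis=SPAM_BASIS):
    """Orthogonally project a document vector onto the span of the basis words.

    Each basis word is a standard coordinate axis, so the projection
    keeps those coordinates and drops every other one.
    """
    return {word: vec[word] for word in basis if vec[word]}


def spam_norm(text, basis=SPAM_BASIS):
    """Euclidean norm of the document's projection into S."""
    projected = project_onto_spam(document_vector(text), basis)
    return math.sqrt(sum(count * count for count in projected.values()))


def is_spam(text, threshold=2.0, basis=SPAM_BASIS):
    """Zap it if the projection is 'too long' (threshold is an arbitrary assumption)."""
    return spam_norm(text, basis) > threshold
```

      With these assumed values, is_spam("Free FREE viagra winner") is true and is_spam("Lunch at noon tomorrow?") is false. Because the basis vectors are just word axes, this collapses to counting listed keywords, which is exactly what the article says the real filter does not do.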

      --
      Quid festinatio swallonis est aetherfuga inonusti?
      Africus aut Europaeus?
  4. ...moderation ideas.... by j3ll0 · · Score: 5, Funny

    Why wouldn't a similar algorithm work to provide automated moderation? It seems to me that you could certainly identify clusters of words that indicate low-value posts?

    1. Re:...moderation ideas.... by pvt_medic · · Score: 2, Funny

      and by that token, i could create something that would get me modded up every time so i can get more karma so i can mod...

      oh automated mod... scratch that plan, i will have to figure something else out for world domination.

      --
      30% Troll, 50% Underrated, 10% Interesting
      Score:5, Troll
    2. Re:...moderation ideas.... by wheresdrew · · Score: 5, Funny
      Yes, but the combination of too many all too common terms could cause the system to implode.

      "In Soviet Russia imagine a beowulf cluster of insensitive clods who don't RTFA because they're using linux to beat the GNAA to the first post."

  5. n-space by Anonymous Coward · · Score: 5, Funny

    Each document is in turn represented by a long string of numbers, one for each word in the corpus. In mathematical terms, we would say that every document is a vector of n numbers or a point in a space with n dimensions. This coordinate is then mapped onto a unique position in the goatse.cx photograph. If it lands in an objectionable region, the message is discarded as spam.

    It's an interesting method, but not having Mail.app myself, what I'm wondering is how well it works on the border regions; that is, when it is just barely objectionable. Say, on his leg.

  6. Re:Kinda like Mozilla Mail? by jcr · · Score: 4, Funny

    I'd be willing to bet that it's just another Bayesian e-mail filter with maybe a few extra bells and whistles.

    Umm, how much would you want to bet? I'll take that action!

    -jcr

    --
    The only title of honor that a tyrant can grant is "Enemy of the State."
  7. Re:how does it compare to Bayesian? by inburito · · Score: 5, Funny

    Wow. If your grandma is suggesting you viagra I think your problems go way deeper than Bayesian misfirings..

  8. Re:Kinda like Mozilla Mail? by Anonymous Coward · · Score: 2, Funny

    reading that has clearly shown me for the first time why my friends/family complain when i talk technical about chemistry to them.

    And i thought i spoke english!

  9. Re:Summary Service by Mikey-San · · Score: 4, Funny

    Input:

    Wow, the article just turned me on to the Summary Service. And I just used it to read a short and sweet summary of the article.

    If you haven't played with it select a bunch of text (in a Cocoa app) and select Summary from the Services menu.

    Very cool...

    Output:

    Wow, the article just turned me on to the Summary Service. And I just used it to read a short and sweet summary of the article.

    If you haven't played with it select a bunch of text (in a Cocoa app) and select Summary from the Services menu.

    Wow, look at that! Impressive!

    (I actually love Summary Service, but I couldn't resist that joke.)

    --
    Mikey-San
    Karma: +Eleventy billion (mostly affected by watching Celebrity Jeopardy)
  10. Re:Nitpick on one of their recommendations by m1chael · · Score: 1, Funny

    But now imagine two Apple users using Mail Filter...

    --
    I know you are psychotic, but please make an effort.
  11. Re:Fast?!? by Alan · · Score: 5, Funny

    Dude, you seriously need to seek help for your mail-archiving condition :)

    Or if nothing else move some of the mail to a backup directory so the poor little imap server doesn't have to deal with YOUR pack-rat habits!

  12. Re:Crystal clear ... erm ... by Anonymous Coward · · Score: 1, Funny

    It's Apple. Gotta be good.

    You know Apple INVENTED spamfiltering don't you? ;-)

  13. Information Retrieval by ScottGant · · Score: 4, Funny

    This is Information Retrieval not Information Dispersal...Information Transit got the wrong man. I got the right man. The wrong one was delivered to me as the right man, I accepted him on good faith as the right man. Was I wrong?

    My name's Lowry. Sam Lowry. I've been told to report to Mr. Warrenn.
    Thirtieth floor, sir. You're expected.
    Um... don't you want to search me?
    No sir.
    Do you want to see my ID?
    No need, sir.
    But I could be anybody.
    No you couldn't sir. This is Information Retrieval.


    There you are, your own number on your very own door. And behind that door, your very own office! Welcome to the team, D7-105! Welcome to Information Retrieval

    --

    "Music is everybody's possession. It's only publishers who think that people own it." - John Lennon.
  14. Re:Maybe... by ChaosDiscord · · Score: 2, Funny
    Maybe you just need to be more picky about giving your address to people.

    I tried that, but my boss got angry when I refused to give him my business address.

  15. Re:Maybe... by Golias · · Score: 2, Funny
    I totally agree!!! It seems to me that converting HTML to plain old text should be a perfectly fine choice for those who don't want to read your dumbass, pointless markup.

    Some people really like using HTML, and everybody should respect that.

    Those who read this horseshit from the command line can just suck it up and deal with it.

    --

    Information wants to be anthropomorphized.