Slashdot Mirror


How Apple's Mail.app Junk Filter Works

fmorgan writes "O'Reilly has now posted the second part of an article about Mac OS X Mail.app spam filtering with more details on what this technology is (and isn't): 'Many myths have emerged about Mail's junk mail filter. No, it's not an extremely complex set of rules, no it doesn't look for keywords, and no, it doesn't use white magic ... Interestingly enough, the technology that underlies the Junk Mail filter began its life as an information retrieval system.'"

10 of 273 comments

  1. Nitpick on one of their recommendations by Logic Bomb · · Score: 2, Insightful
    You can also ask that your potential correspondents resend emails if they do not receive answers in a certain timeframe.

    If the Junk Mail filter snagged a message the first time, it'll probably get it on subsequent tries too. If the message is legitimate, it probably can't be changed enough to make it through. It's a much better idea to check Junk Mail for legit messages and only empty it manually (or automatically for messages that are at least a week old).

  2. one thing missing from mail to make it perfect by Raleel · · Score: 1, Insightful

    and it's not really mail. it's more iCal. iCal + exchange. as in, let me talk to exchange with ical. i'd love to get rid of entourage, the slowest mail client ever.

    --
    -- Who is the bigger fool? The fool or the fool who follows him? --
  3. Hmmm. Document visualization by mveloso · · Score: 3, Insightful

    I wonder if that data is accessible by 3rd parties. You could make "mail maps" that let you visualize the clustering of your incoming messages, and you could actually see the spam...by looking at the outliers and noise.

    In fact, you could do this with any large data set. How about the feds looking for anomalous chunks of data in the bitstream? Anomalous stuff would just pop out, literally. This would make the TSA's job much, much easier. How about that?

  4. Crystal clear ... erm ... by Too Much Noise · · Score: 4, Insightful

    Then, we can do the Latent Semantic Analysis. In this new space, each axis is a weighted combination of all the words: documents and words coexist in the same space.


    ok, got it - get a sparse point distribution, scrap the biggest common null subspace you find for the word matrices, then do some rotation to get meaningful combinations of these words ... or something (lexical analysis).

    (further down ...)


    Of course, systems that rely on such keywords are continuously updated and refined. Nevertheless, they are never entirely satisfying, even when using sophisticated Bayesian filters that are essentially weighted keyword systems.


    so, weighted keyword systems (in particular Bayesian filters) are not so cool. Erm ... wait a minute, WTF???

    ok, maybe this vector approach is something entirely new and leaves existing methods in the dust. But this article seems to be doing a relatively poor job at explaining why.
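
    For reference, the LSA step quoted at the top of this comment is usually written as a truncated singular value decomposition of the word-by-document count matrix. The block below is the textbook formulation only; the thread does not say which variant, weighting, or value of k Mail.app uses, so the symbols X, k, and d are illustrative assumptions.

```latex
% X: word-by-document count matrix (m words, n documents).
% Keep only the k largest singular values.
\[
X = U \Sigma V^{\top}
\quad\Longrightarrow\quad
X \approx X_k = U_k \Sigma_k V_k^{\top},
\qquad k \ll \min(m, n)
\]

% Axis j of the reduced space is column j of U_k,
% i.e. a weighted combination of all m words.
% Words and documents get coordinates in the same k-dimensional space:
\[
\text{word } i \;\mapsto\; \text{row } i \text{ of } U_k \Sigma_k,
\qquad
\text{document } j \;\mapsto\; \text{row } j \text{ of } V_k \Sigma_k
\]

% A new message with word-count vector d is projected onto the same axes:
\[
\hat{d} = U_k^{\top} d
\]
```

    The projection is consistent with the document coordinates above: applying U_k^T to column j of X_k gives row j of V_k Σ_k. This is the sense in which "documents and words coexist in the same space".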
    1. Re:Crystal clear ... erm ... by Anonymous Coward · · Score: 1, Insightful

      Actually, this is nothing new at all. It is roughly performing a feature transformation on a data set, something that's been done with multimedia data for the purposes of conducting nearest neighbor searches for years now.

      Personally I favor Bayesian filters as high-dimensional vector calculations eventually become too unwieldy, no matter what kind of beefy system you have.
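
      To make the "weighted keyword system" description of Bayesian filters concrete, here is a minimal sketch of a naive Bayes scorer in which every word carries a log-odds weight and a message's score is the sum of those weights. The function names, the add-one smoothing, and the toy training messages are assumptions made for illustration; this is not the code of any particular mail client or filter.

```python
import math
from collections import Counter


def train(spam_docs, ham_docs):
    """Return per-word log-odds weights plus a prior log-odds bias."""
    spam_counts = Counter(w for d in spam_docs for w in d.lower().split())
    ham_counts = Counter(w for d in ham_docs for w in d.lower().split())
    vocab = set(spam_counts) | set(ham_counts)

    # Add-one smoothing so a word seen in only one class keeps a finite weight.
    spam_total = sum(spam_counts.values()) + len(vocab)
    ham_total = sum(ham_counts.values()) + len(vocab)

    weights = {
        w: math.log((spam_counts[w] + 1) / spam_total)
        - math.log((ham_counts[w] + 1) / ham_total)
        for w in vocab
    }
    # Prior log-odds of spam versus ham, from the training set sizes.
    bias = math.log(len(spam_docs) / len(ham_docs))
    return weights, bias


def spam_score(text, weights, bias):
    """Sum of keyword weights; positive leans spam, negative leans ham.

    Words never seen in training contribute nothing to the score.
    """
    return bias + sum(weights.get(w, 0.0) for w in text.lower().split())


# Toy training data (hypothetical messages).
weights, bias = train(
    spam_docs=["buy cheap pills", "cheap pills online"],
    ham_docs=["meeting agenda monday", "lunch on monday"],
)
```

      With this toy data, `spam_score("cheap pills today", weights, bias)` is positive and `spam_score("monday meeting", weights, bias)` is negative. Each word is scored independently, which is what the article's "weighted keyword" remark is pointing at.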

  5. Re:Vectors..... by RovingSlug · · Score: 4, Insightful
    Ah, it uses vector math. ... Image classification would be a no brainer for folks that spend their time examining images and multispectral datasets.

    Ugh. The magic doesn't come from vectors. Vectors are just how you throw the numbers around. The reason the classification apparently works well is their choice of representation of the document: a word histogram -- the occurrence count for each word. To measure the distance between two histograms, you usually use the chi-squared test. So, forget all about "vectors", the real workhorse is the histogram. And, we can talk about "clustering", but it's just as important to know how you're measuring the distance from one document to another.
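
    The word-histogram idea in the paragraph above can be sketched in a few lines. This is only an illustration of the comment's point, assuming whitespace tokenization and one common symmetric form of the chi-squared distance over normalized counts; the helper names are hypothetical, and nothing in the thread confirms that Mail.app measures distance this way.

```python
from collections import Counter


def word_histogram(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(text.lower().split())


def chi_squared_distance(h1, h2):
    """Symmetric chi-squared distance between two word histograms.

    Counts are normalized to frequencies first, so message length does not
    dominate. 0.0 means identical word distributions; 1.0 means the two
    messages share no words at all.
    """
    n1, n2 = sum(h1.values()), sum(h2.values())
    if n1 == 0 or n2 == 0:
        raise ValueError("cannot compare an empty document")
    distance = 0.0
    for word in set(h1) | set(h2):
        p = h1[word] / n1  # a Counter returns 0 for missing words
        q = h2[word] / n2
        distance += (p - q) ** 2 / (p + q)
    return 0.5 * distance


# Toy messages (hypothetical).
spam_a = word_histogram("cheap pills buy cheap pills now")
spam_b = word_histogram("buy cheap pills online")
ham = word_histogram("meeting agenda for monday")
```

    Here `chi_squared_distance(spam_a, spam_b)` is smaller than `chi_squared_distance(spam_a, ham)`, so the two spam messages sit closer together than spam and ham do. Swapping in a different distance function changes which messages count as neighbors, which is why the choice of measure matters as much as the clustering.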

    Image clustering is hard, and the problem comes from picking a good representation of the image. Of course, a "word histogram" for an image makes no sense. Just considering pixel intensity or pixel color doesn't work either. You usually have to start looking at things like lines, curvatures, intersections, texture patterns, etc. Once you decide which tools you're going to use to describe an image and which algorithms will calculate them, you can start talking about how far away one image is from another, which then naturally leads to clustering techniques. But the hard part about the clustering is getting the images into a space in which they actually, nicely cluster.

    I had to stop reading the article because it was so clearly written by someone who had no comfort with the mathematical concepts or techniques. (Sorry, but seriously, it's the blind leading the blind.)

  6. Re:Missing functionality by repetty · · Score: 1, Insightful

    "I have several macs, and an IMAP server. The simple fact is that Mail.app doesn't share the filtering database."

    No, that's a bad idea. Your case is unique because you are specifying that just one user uses a bunch of computers, but the general principle you are advocating completely ruins the premise of adaptive filtering.

    Suppose we're sitting in an office... You don't want to see penis enlargement ads but I love 'em. How is that big server-level database of yours supposed to work?

    Bad idea.

    --Richard

  7. But you still get the spam... by Yusaku Godai · · Score: 2, Insightful

    I mean, it's great and all that we've gotten pretty good at filtering spam. I use Opera quite a bit, and its spam filters work with 99% accuracy after sufficient training. But there's still a chance something can slip through. I still have to download all the spam, and occasionally go through it, deleting it all, while making sure something legit didn't accidentally get flagged as spam. It's rare, but it happens. The most annoying thing is just that I get it at all. I'd be more impressed to see something like this running on the mail server, turning back spam. I wouldn't even mind if the rare legitimate message got bounced. The sender would just receive a message from the mail server saying that their mail was marked as spam, and that they should try again, or let me know by some other means. Heck, I wouldn't mind missing the occasional e-mail if I never had to download another spam again. That's what would impress me at this point.

  8. Re:This is probably off-topic by the shoez · · Score: 2, Insightful

    Maybe because this type of filtering can only really work for a single user on their own corpus of email. In effect it's an end-user solution not something that could be deployed across the whole spectrum of mail servers (as I understand it). Take for example Scott Richter (or however you spell the name), that scabby-little spammer - he loves the stuff, and wouldn't wish it to be filtered from his inbox.

    Blacklists are there to swallow up this bandwidth-wasting traffic forced down our necks by spammers. Personally, I would rather the crap be denied before it ever has to reach my section of the line. I don't know about you, but I get a chuff-load of spam every day which seriously hacks me off. Getting onto a blacklist for any length of time means a boatload of spam must be coming from that machine - hence it's the fault of your host provider for not cracking down sooner.

    I say all power to them!

    --
    &lawyers($instruction);
  9. Re:Maybe... by orasio · · Score: 2, Insightful

    (I was going to mod you down, but I understood that it's a good comment, I just think you are wrong)

    Nonsense. HTML mail should be rendered as HTML. If you want to see text-only, or something, you can just read mail as text-only, in your client. If I send mail with balloons, it is because I want people to see my beautiful balloons and gothic handwriting. Messing with that is mangling communication, the other person thinks you saw something you didn't.

    No one I know abuses HTML mail to the extent of making it hard to read. If I had friends like that, they wouldn't know my email address.

    Maybe you just need to be more picky about giving your address to people.