Slashdot Mirror


How Apple's Mail.app Junk Filter Works

fmorgan writes "O'Reilly has now posted the second part of an article about Mac OS X Mail.app spam filtering, with more details on what this technology is (and isn't): 'Many myths have emerged about Mail's junk mail filter. No, it's not an extremely complex set of rules, no it doesn't look for keywords, and no, it doesn't use white magic ... Interestingly enough, the technology that underlies the Junk Mail filter began its life as an information retrieval system.'"

46 of 273 comments

  1. Face recognition by dysprosia · · Score: 3, Informative

    I believe I remember reading somewhere that the same sort of vector/clustering calculations are used in face recognition software?

    Just goes to show how solid math/calculations can have some useful applications!

    1. Re:Face recognition by moyix · · Score: 4, Informative

      Yes, for example, the eigenfaces method converts each image into a vector, and constructs a new subspace based on the highest ranked common features between them (using Principal Component Analysis, aka the Karhunen-Loève Transform). Then new images are projected into this space and the shortest distance between the new vector and the previously computed ones is found.

      It was the first thing that popped into my head while reading the article too :)
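
      A minimal sketch of that pipeline, under loud assumptions: the "images" are made-up 4-pixel lists, only one principal component is kept, and it is found by power iteration rather than a full PCA. The names `GALLERY`, `train` and `identify` are hypothetical, not taken from any face recognition package.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def principal_component(centered, iterations=100):
    # Power iteration on the covariance matrix, applied implicitly as
    # sum_i (x_i . w) x_i so the matrix itself is never built.
    w = [1.0] + [0.0] * (len(centered[0]) - 1)
    for _ in range(iterations):
        w_next = [0.0] * len(w)
        for x in centered:
            scale = dot(x, w)
            for i, xi in enumerate(x):
                w_next[i] += scale * xi
        norm = math.sqrt(dot(w_next, w_next))
        w = [v / norm for v in w_next]
    return w

def train(gallery):
    # Flatten each image to a vector, subtract the mean image, and keep
    # each gallery image's coordinate along the principal axis.
    vectors = [pixels for _, pixels in gallery]
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    centered = [[p - m for p, m in zip(v, mean)] for v in vectors]
    axis = principal_component(centered)
    coords = [(label, dot(c, axis)) for (label, _), c in zip(gallery, centered)]
    return mean, axis, coords

def identify(model, pixels):
    # Project the new image into the same space and return the label of
    # the closest previously computed coordinate.
    mean, axis, coords = model
    coord = dot([p - m for p, m in zip(pixels, mean)], axis)
    return min(coords, key=lambda lc: abs(lc[1] - coord))[0]

# Hypothetical toy gallery: two "faces" per person, four pixels each.
GALLERY = [
    ("alice", [9, 9, 1, 1]),
    ("alice", [8, 9, 2, 1]),
    ("bob", [1, 1, 9, 9]),
    ("bob", [2, 1, 8, 9]),
]

model = train(GALLERY)
print(identify(model, [9, 8, 1, 2]))
```

      A real eigenfaces system keeps many components and needs carefully registered images; this only shows the project-then-nearest-neighbour shape of the method.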

  2. Re:Kinda like Mozilla Mail? by BWJones · · Score: 5, Informative

    In fact I'd be willing to bet that it's just another Bayesian e-mail filter with maybe a few extra bells and whistles.

    Actually, data clustering algorithms are completely different beasts from a standard Bayesian analysis. Do a search on k-means clustering or ISODATA clustering methods to see what I mean. However, if you are referring to a Bayesian cluster analysis (like those implemented for genetic analysis of microarrays), then you might be correct, though only for reasons you might not intend.

    --
    Visit Jonesblog and say hello.
  3. GD, RTFA! by Zen+Programmer · · Score: 5, Informative

    If you had read the article, you would know it uses vector representation and latent semantic analysis, not Bayesian filters, which in the words of the author, "are essentially weighted keyword systems."

  4. Re: Bayesian Filtering by Anonymous Coward · · Score: 2, Informative

    The author is awfully dismissive of Bayesian filtering, which works extremely well for me and for lots of other people. See Mozilla, SpamAssassin, and others.

  5. Re:Maybe... by Anonymous Coward · · Score: 5, Informative

    That's why, at our site, all incoming email goes through the Anomy Sanitizer. It removes unknown HTML tags, like <vframe> or <script>, and filters out offsite images to eliminate so-called web bugs.

    Oh, and it's fast, too.
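
    To illustrate the kind of filtering described above (this is not the Anomy Sanitizer's code, just a sketch using Python's standard `html.parser`), the following drops `<script>` blocks and any `<img>` whose `src` points off the message, while keeping embedded `cid:` images. `MailSanitizer`, `sanitize` and `is_remote` are hypothetical names; comments and doctypes are simply dropped.

```python
from html import escape
from html.parser import HTMLParser

REMOTE_PREFIXES = ("http://", "https://", "//")

def is_remote(src):
    # True when the image would be fetched from a server (a potential web bug).
    return bool(src) and src.strip().lower().startswith(REMOTE_PREFIXES)

class MailSanitizer(HTMLParser):
    def __init__(self):
        super().__init__()  # character references are converted by default
        self.out = []
        self.in_script = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script += 1
        elif not self.in_script and not (
            tag == "img" and is_remote(dict(attrs).get("src"))
        ):
            # Re-emit the tag exactly as it appeared in the message.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img ... /> have no separate end tag.
        if tag != "script":
            self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = max(0, self.in_script - 1)
        elif not self.in_script:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text inside <script> is discarded; other text is re-escaped.
        if not self.in_script:
            self.out.append(escape(data, quote=False))

def sanitize(html_text):
    parser = MailSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)

message = (
    '<p>Hi</p><img src="http://mailserver/get/abcde/image.gif">'
    '<img src="cid:logo"><script>alert(1)</script>'
)
print(sanitize(message))
```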

  6. vs bayesian filters ? by Bugmaster · · Score: 2, Informative

    How does this technology compare to Bayesian filters such as PopFile? PopFile was not made by Apple, so clearly it doesn't have the cult appeal, but it has been working flawlessly for me for about a year now. What really irks me about this article is how it implies that Apple invented trainable filters, when in reality this is very far from the truth. Apple does the same thing with pretty much everything it sells... sort of like Soviet Russia, who claimed to have invented flight, radio, transistors, and probably elephants too.

    --
    >|<*:=
  7. Re:how does it compare to Bayesian? by the+quick+brown+fox · · Score: 3, Informative
    That actually tends not to happen. Most Bayesian filtering packages are weighted very conservatively, so that one or two highly non-spam tokens (like your grandma's e-mail address, or the name of the uncle who is on the little blue pill) will more than counterbalance the spam tokens.

    Again, what's intuitive doesn't play out in practice... this seems to be a common theme in the world of statistical spam filtering. For example, you'd think the word "free" would be pretty spammy... in my corpus, it only gets a score of .406 (where 0 is least spammy and 1 is most spammy, and an e-mail must have an aggregate score of .9 to be classified as spam). On the other hand, "sir" gets .945 and "madam" gets .987.
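
    For the curious, here is a sketch of how per-token scores like those can be combined into one aggregate. It assumes the Paul Graham style formula (product of the scores against the product of their complements); the comment above does not say which formula the filter actually uses, and the 0.01 scores for the two "grandma" tokens are invented for illustration.

```python
import math

def combine(scores):
    # Graham-style combination: P = prod(p) / (prod(p) + prod(1 - p)).
    spam = math.prod(scores)
    ham = math.prod(1.0 - p for p in scores)
    return spam / (spam + ham)

THRESHOLD = 0.9  # aggregate score needed to call a message spam

# "sir" and "madam" alone push the message well over the threshold.
spammy = [0.945, 0.987]
print(combine(spammy) >= THRESHOLD)

# Two strongly non-spam tokens (hypothetical scores for a known sender
# address and a family name) pull the same message back under it.
with_family = spammy + [0.01, 0.01]
print(combine(with_family) >= THRESHOLD)
```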

  8. Re:how does it compare to Bayesian? by SimplyCosmic · · Score: 5, Informative

    Bayesian spam filtering doesn't mark an email as spam simply because of the presence of one single word. It uses a mathematical equation based on the likelihood that each of the words in the message is a symptom of spam. What you're talking about is simply a spam filter based on a blacklist of words. Bayesian spam filtering uses mathematics to consider how those words are used in the context of the rest of the message, and does a surprisingly good job of it.

    Therefore, "viagra" in your grandmother's email might have a high indication of spamminess, but all the other words will lower the score below the rather high threshold needed to be considered spam.

    That's why training your Bayesian spam filter on the email you receive is so important, as it learns what you consider spam from the type of email you receive.

  9. Re:Apple spam by k_187 · · Score: 3, Informative

    There is: Apple puts a rule in by default that stops Mail from evaluating any mail from Apple. Well, there is in Panther. I don't know if you caught that or not, but that might fix your problem.

    --
    11 was a racehorse
    12 was 12
    1111 Race
    12112
  10. Re:Apple spam by timgoh0 · · Score: 5, Informative

    This behaviour is due to the rules set up in Apple Mail. To disable it, go to the Mail preferences, select Rules, and remove the entry "news from apple".

  11. Re:Apple spam by .com+b4+.storm · · Score: 4, Informative

    Did you check your "rules" preferences? Mail.app by default includes a rule to "Stop evaluating rules" for mail from a whole host of Apple e-mail addresses. I've never tried deleting it to see if I can get Apple mail to get filed as spam because... well, they e-mail me maybe twice a year and it's always been worth reading. But you might want to check out that rule, it could be what's fouling you up.

    --
    "Wow, you're like some kind of superhero able to ward off happiness and success at every turn."
    -- Ryan Stiles
  12. Re:Kinda like Mozilla Mail? by DrSchlock · · Score: 5, Informative

    This spam filtering feature seems pretty similar to the one found in Mozilla Mail. In fact I'd be willing to bet that it's just another Bayesian e-mail filter with maybe a few extra bells and whistles.

    Not exactly Bayesian, no. It's a different kind of document classification algorithm, which the article calls Latent Semantic Analysis. Basically they represent each message as a point in a high-dimensional space (based on the unordered words in the document), and figure out which parts of the space tend to be occupied by spam e-mails. This involves quite a lot of computation to determine a likely boundary between the parts of the space representing spam and non-spam messages, given only a collection of labeled points.

    To make this train and run reasonably quickly, they have to do dimensionality reduction on the space: they collapse dimensions which tend to be correlated or redundant or useless. (If "teens" and "gushing" generally appear together in messages, they probably don't need two separate dimensions; if "hi" is equally likely to appear in spam and non-spam, it may not need a dimension at all.)

    A naive-Bayes classifier is much simpler: Assuming that the probabilities of words in a document are all independent, it selects the document type (spam or non-spam) that maximizes the total probability of the observed words. There's no training beyond counting how often each word occurs with each document type.

    Naive Bayes typically works nearly as well as more complex methods, and runs much faster. But presumably Apple feels their LSA implementation is fast enough, and sufficiently more accurate than simpler techniques to be worthwhile.
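
    A sketch of the naive-Bayes side of that comparison, to show how little "training" there is: it only counts words per class. The add-one (Laplace) smoothing and the toy training messages are my own assumptions, and `NaiveBayes` is a hypothetical class, not any mail client's implementation. Both classes must have been trained at least once.

```python
import math
from collections import Counter

class NaiveBayes:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, label, text):
        # Training is nothing more than counting words per document type.
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def classify(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of independent per-word log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(counts.values())
            for word in text.lower().split():
                # Add-one smoothing so unseen words do not zero the product.
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayes()
nb.train("spam", "buy cheap pills now")
nb.train("spam", "cheap pills online")
nb.train("ham", "lunch at noon tomorrow")
nb.train("ham", "meeting notes attached")
print(nb.classify("cheap pills"))
```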

  13. Re:Maybe... by karmatic · · Score: 5, Informative

    Macs are vulnerable to the so-called "hole" as well. In fact, _any_ html compliant email client with image support is.

    For example, I wrote some software which takes your email address, and assigns a 5 letter id. The img tag loads an image with the url http://mailserver/get/yourid/image.gif

    From this, it's possible to tell 1) If the email is valid, 2) If you click the image (the url contains your ID) 3) How long before you click 4) If you buy.

    So, if you're dumb enough to buy from spam you get on a sucker list.

    Quit blaming MS - they are unfortunately the ones who introduced HTML mail, but everyone else who follows suit has problems too.

  14. Re:Apple spam by Libraryman · · Score: 2, Informative
    There could be a back door in the spam filter, but I have another [slightly] less sinister possibility.

    Mail.app ships with a preset filtering rule to color-label messages from Apple in blue. The junk filter may be set not to act on messages which are already being filtered (colored, flagged, moved to a specific folder) by one of your rules. Try deleting the rule to colorize the mail from Apple and see if it starts junk filtering it.

    Also worth noting: Apple will remove you from its mailing lists; any email from them includes links/instructions to do this.

  15. Re:Maybe... by bigberk · · Score: 2, Informative
    from which a spammer can clearly see that you have opened their messages and validate your address...
    That's old news, I wrote the solution three years ago. Just use a mail client such as this one that strips HTML.
  16. Re:Maybe... by tkokesh · · Score: 5, Informative
    Actually, Mail.app in Mac OS X 10.3 (Panther) has an option in the "Viewing" Preferences: "Display images and embedded objects in HTML messages".

    When this option is unchecked, the user has to click a specific "Load Images" button in order to see the images in an HTML email, which means that the GIF does not get loaded unless the user lets it. For obvious spam emails, of course, the user can just junk the email, and the spammer gets no confirmation of delivery.

    --

    A pride of lions.
    A gaggle of geese.
    A murder of crows.
    A vista of bugs.
  17. Re:Vectors..... by BWJones · · Score: 5, Informative

    The magic doesn't come from vectors. Vectors are just how you throw the numbers around

    And your point is?

    The reason the classification apparently works well is their choice of representation of the document: a word histogram -- the occurrence count for each word. To measure the distance between two histograms, you usually use the chi-squared test.

    For a univariate space (or perhaps bivariate space) this will work, but now try implementing standard chi-square analysis in multivariate (or hyperspectral) space. It starts to fall short rather quickly, hence the use of measures of distance between clusters.
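
    To make the quoted idea concrete, here is a sketch of a chi-squared distance between two word histograms. It uses raw counts and the common symmetric form, half the sum of (a - b)^2 / (a + b); whether Mail.app does anything like this is not stated anywhere in the thread.

```python
from collections import Counter

def word_histogram(text):
    # Occurrence count for each word, ignoring case and word order.
    return Counter(text.lower().split())

def chi_squared_distance(text_a, text_b):
    a, b = word_histogram(text_a), word_histogram(text_b)
    # Every word present in either histogram has a + b > 0, so the
    # division is always defined.
    return 0.5 * sum(
        (a[w] - b[w]) ** 2 / (a[w] + b[w]) for w in set(a) | set(b)
    )

print(chi_squared_distance("cheap pills cheap", "cheap pills online"))
print(chi_squared_distance("cheap pills cheap", "meeting notes monday"))
```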

    Image clustering is hard, and the problem comes from picking a good representation of the image.

    Yes, I do image clustering almost every day. Well, at least a couple times a week. With proper discriminands one can overcome "good image representation" problems.

    Of course, a "word histogram" for an image makes no sense.

    Actually, it does in a sense when you realize that images are simply matrices of numbers just like sentences or paragraphs can be identified as matrices after assigning lookup values to certain properties.

    Just considering pixel intensity or pixel color doesn't work either.

    Actually, yes it does. This is how many standard measures of image cluster analysis work.

    You usually have to start looking at things like lines, curvatures, intersections, texture patterns, etc.

    Actually, no. For many image classification algorithms that examine pixel value (oil bearing strata, concrete vs granite, types of aluminum in missiles etc...), structure or anatomy play absolutely no role in the identification of classes.

    Once you decide which tools you're going to use to describe an image and which algorithms to calculate them, you can start talking about how far away one image is from another, which then naturally leads to clustering techniques.

    That is a very difficult approach to take for image classification that begins to rely on machine processing and image "interpretation" which is a much higher order problem.

    But, the hard part about the clustering is getting them into a space in which they actually, nicely cluster.

    Simply add more discriminands or filters and don't worry about "describing" the image. Other properties (like structure and anatomy) fall out after image clustering.

    --
    Visit Jonesblog and say hello.
  18. Re:Sounds sufficiently different to me by arcus · · Score: 2, Informative
    A) When you reduce the N dimensional space, you would start by eliminating noise words (i.e. words that only occur in a single email). Spam emails that put fake words in to lower their spam probability in the Bayesian method would not benefit with this method.
    The method described by Paul Graham only looks at a handful of the most 'interesting' words in the mail ('interesting' meaning tending to yield a high probability of being either spam or ham). Adding lots of random words could mean that the spammer lucks out and gets words that happen to appear a lot in your regular mail, but what's rather more likely is that the 'interesting' words will be things like 'viagra', and the random words will simply be ignored. Bayesian sorting isn't necessarily particularly vulnerable to random words.

    What would tend to defeat Graham's filter more would be inconsistent spelling of key words, i.e. v1agra, v|agra, V!agr4 or whatever. Perhaps other Bayesian filters are cleverer.
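
    A sketch of the "most interesting words" selection, assuming Graham's scheme of keeping the 15 tokens whose scores are furthest from a neutral 0.5. The token scores below are invented; the point is only that padding words with neutral scores leave the aggregate unchanged.

```python
import math

def most_interesting(scores, limit=15):
    # Keep the tokens whose spam probability is furthest from neutral.
    return sorted(scores, key=lambda p: abs(p - 0.5), reverse=True)[:limit]

def combine(scores):
    # Same Graham-style combination used for the final verdict.
    spam = math.prod(scores)
    ham = math.prod(1.0 - p for p in scores)
    return spam / (spam + ham)

# Hypothetical message: five strongly spammy tokens plus forty random
# padding words that the filter has no opinion about.
spam_tokens = [0.99] * 5
padding = [0.5] * 40

picked = most_interesting(spam_tokens + padding)
print(combine(picked))
print(combine(spam_tokens))
```

    Padding only helps the spammer if the random words happen to score strongly as ham, which is the "lucks out" case described above.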

  19. Re:Maybe... by rritterson · · Score: 2, Informative

    Or you can just set Outlook 2003 to not parse html and show it as code instead. You can also tell it not to download images by default which prevents another possible 'notifier'

    --
    -Ryan
    AUWYHSTOT (Acronyms are Useless When You Have to Spell Them Out Too)
  20. Not if email is marked as junk... by SuperKendall · · Score: 5, Informative

    If an email is marked as junk, even if you go to look at it to see if it's really junk no images are loaded so this tracker does not work.

    As others have mentioned you can also turn off images for all messages, which is what I would do if it ever started missing spam. So far only one miss in the last six months or so, and no false positives. I'm pretty impressed.

    --
    "There is more worth loving than we have strength to love." - Brian Jay Stanley
    1. Re:Not if email is marked as junk... by soft_guy · · Score: 2, Informative

      I use Mail.app, I have Panther, and I keep everything current. Still, Mail often misses many pieces of spam every day and gives me false positives from time to time. YMMV. Still, I find the junk mail filter useful enough to leave on.

      --
      Avoid Missing Ball for High Score
  21. Re:Mail & IMAP by elbobo · · Score: 3, Informative

    doesn't ... have the ability to move my mail to a Junk folder on my IMAP server.

    Yes it does:

    Preferences -> Accounts -> Special Mailboxes -> Store junk messages on the server.

    My personal IMAP complaint is that you can't create rules to move messages between folders on the server, only folders on the client.

  22. Re:Fast?!? by alannon · · Score: 3, Informative

    One of the reasons that Eudora tends to be fast for some things when Mail.app isn't is that Eudora does not store attachments with the mail. It splits them off at download time into a separate folder. Mail.app keeps the entire mail envelope intact, including attachments. This makes Mail.app often very, very slow when moving large numbers of messages around, simply because it's doing a lot of file manipulation. I will admit, though, that Mail.app often feels very sluggish. Apple needs to work on that.

  23. Combine with JunkMatcher for 99%-100% accuracy by kiddailey · · Score: 2, Informative


    Mail's junk filter may be okay, but it's not nearly as good as the article makes it out to be. You'll get much better results using a combination of the built-in filter along with external filtering/tagging devices.

    For example, I ran across JunkMatcher some time ago and have been enjoying 99-100% accuracy with less than 1% false positives. It's a huge improvement over the 80-90% accuracy I was getting with the built-in filter alone.

  24. Re:Vectors..... by Hays · · Score: 4, Informative

    You're being overly hard on the grandparent. He makes some good points. And naive image vectorization IS a problem. Eigenfaces only works with extremely careful registration of images, because the images are vectorized naively. Basically this means throwing out any notion of spatial coherence. (You could vectorize the image in random order, scanline order, whatever.. as long as you did it consistently across the data set you'd get the same bases out. Shouldn't a system understand that an image shifted one pixel to the right is not arbitrarily far from its original version?).

    See http://www.cs.columbia.edu/~jebara/papers/iccv03.pdf for a good argument about this
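
    A toy demonstration of both halves of that argument, using made-up 8-pixel "images": Euclidean distance between naively vectorized images is unchanged by any consistent pixel shuffle, and a small shift can look exactly as far away as a large one.

```python
import math
import random

def distance(a, b):
    # Plain Euclidean distance between two flattened images.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def permute(image, order):
    return [image[i] for i in order]

original = [0, 0, 9, 9, 0, 0, 0, 0]
shifted_by_two = [0, 0, 0, 0, 9, 9, 0, 0]
shifted_by_four = [0, 0, 0, 0, 0, 0, 9, 9]

# Any fixed pixel ordering gives the same distances, so the vector
# representation carries no notion of which pixels are neighbours.
order = list(range(len(original)))
random.Random(0).shuffle(order)
print(distance(original, shifted_by_two))
print(distance(permute(original, order), permute(shifted_by_two, order)))

# A two-pixel shift is as "far" as a four-pixel shift.
print(distance(original, shifted_by_four))
```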

    And responding to another point of yours, classification algorithms that look only at intensity are at best brittle. In the real world things have to be better. You have to be able to recognize an object under different lighting, etc. The fact that you can design and calibrate a system well enough to work on pixel intensity alone in a few specific cases doesn't convince me that it's robust.

    That's not to say that you can't do some vision tasks with relatively simple metrics like intensity histograms or naively vectorized images, but really data representation is a major bottleneck for a lot of vision work. But you look like you're qualified to know that so I don't know why you're jumping down the grandparent's throat.

  25. Re:how does it compare to Bayesian? by martin-boundary · · Score: 3, Informative
    Bayesian filtering is a subset of what LSM can do.
    I'm sorry, but that's just completely wrong. Whoever is propagating this deserves a slap on the forehead.

    Bayesian theory is the most general possible form of rational decision making. *Any* rational method based on belief structures can be represented in a Bayesian form. This was shown by Richard Cox in about 1944.

    Here's an excerpt from this Wikipedia article, to whet your appetite:

    1. Divisibility and comparability - The plausibility of a statement is a real number and is dependent on information we have related to the statement.

    2. Common sense - Plausibilities should vary sensibly with the assessment of plausibilities in the model.

    3. Consistency - If the plausibility of a statement can be derived in two ways, the two results must be equal.

    Any system of reasoning which satisfies those assumptions has a Bayesian version, and conversely. (Read the whole article if you want to argue edge cases).

    So, if LSA (you wrote LSM?) works, then it's only to the extent that there's an underlying Bayesian model which makes it work.

  26. There's plenty of LSI information online by K-Man · · Score: 4, Informative

    Latent Semantic Indexing has been around for a while, and I've forgotten many of the details. As some have mentioned it's a dimension reduction technique, and the result is a set of eigenvectors, each of which describes a set of terms which correlate well with each other (or anticorrelate, I think components can be negative too).

    In English terms, the technique finds sets of words that occur together in different subject areas, and gives them weights which reflect how often they occur together. For instance, "baseball" and "bat" may emerge as common companions in some documents, so they might get weights of 1.0 for both (in one eigenvector/topic) if they always occur together - meaning a query for "bat" should always return hits for "baseball" too. However if "bat" gets diluted by documents about flying animals, then its weight in the "baseball"-"bat" vector will be reduced, say to 0.5. Then queries for "bat" will not necessarily map to baseball documents, but to both areas, represented by different eigenvectors.

    That's confusing enough, but LSI gives a clean method for managing all of these relative probabilities in a global space of word occurrence vectors. The "latent" part is how it discovers these topic areas automatically, by clustering words which occur together. This process is similar to data mining for common subsets, but with LSI the members of the subsets are actually weighted for significance.

    --
    ---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
    1. Re:There's plenty of LSI information online by kanthaka · · Score: 2, Informative

      There's a good survey of information retrieval techniques & algorithms here -
      http://maya.cs.depaul.edu/~classes/ds575/lecture.html
      It's a course site so the lectures are not accessible, but all the articles & tools are.

  27. Re:Missing functionality by ezthrust · · Score: 3, Informative
    There might be something of use for you in this thread on macosxhints.com

    http://www.macosxhints.com/article.php?story=20030320162436823

    Although there is a warning that once this is done, Mail stops learning.

  28. Re:Missing functionality by n8_f · · Score: 3, Informative

    How is that big server-level database of yours supposed to work?

    Uhh, how do you get any mail that he doesn't? The data would be stored in one of the user's mail folders, just like an attachment. You completely misunderstood the parent poster. He accesses the same IMAP account from multiple different machines, but he has to train each one of his clients FOR THE SAME ACCOUNT. So he gets 10 messages to homer@doh.com and his machine at work filters out message 1 and 2. He gets home, and his client filters out message 7. His laptop filters out message 9. They've each been trained to recognize some of the spam, but their training is incomplete because only one of the 3 clients is trained for each message that comes in. The only way to make it consistent would be to move all of the junk messages back into the Inbox and select them as junk in each mail client. Pretty crappy. And it gets unsalvageable when you mark a message as Not Junk on client 2 that client 1 marked as Junk. I have the same issue. I just leave my home client running most of the time, so it handles all of the filtering as new messages come in, and then I mark the ones it missed when I get home. But the parent is right, Mail should just store it on the IMAP server.

    Which brings up an interesting point. I tend to store all of my notes on my personal IMAP server as drafts, so I can get to it anywhere. Why don't any programs use IMAP to store data? Can you not access them at a byte level, but only as whole messages? I haven't looked at the IMAP protocol. Could it be combined with WebDAV for a unified data store? I would love to have a server that allowed me to keep all of my e-mail, documents, contacts, etc. in one place that I could access from anywhere.

  29. Re:Crystal clear ... erm ... by martin-boundary · · Score: 2, Informative
    so, weighted keyword systems (in particular Bayesian filters) are not so cool. Erm ... wait a minute, WTF???

    ok, maybe this vector approach is something entirely new and leaves existing methods in the dust. But this article seems to be doing a relatively poor job at explaining why.

    Well, the article explains very poorly, but the approach isn't that new. Look up cluster analysis in google.

    Latent Semantic Analysis broadly works as follows:

    First, you plot all documents as points in space, by using each word as an independent coordinate. So if you have 100,000 unique words in all documents, then you plot in 100,000 dimensional space.

    Second, you compute the principal eigenvectors for the matrix of all the documents, viewed as columns. This gives you a partial new coordinate system in the 100,000 dimensional space. You don't compute those eigenvectors whose eigenvalue is too small, it's just a waste of effort.

    Finally, each document is dotted with each eigenvector, obtaining the representation of the document in the eigen-coordinate system. This now tells you, for each document, how much it resembles each eigenvector. The eigenvectors represent "concepts", and the mix of eigenvectors used for each document represents the mix of "concepts" in the document.

    Roughly speaking, that's what LSA does, modulo the devil in the details and speed/memory optimization.
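
    Here is a small sketch of those three steps in plain Python. Assumptions, stated loudly: "eigenvectors for the matrix of all the documents" is read as the eigenvectors of A·Aᵀ for the term-document matrix A, they are found by power iteration with deflation instead of a proper SVD routine, and the four documents are invented. All function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def term_document_columns(docs):
    # Step 1: one coordinate per unique word, one column per document.
    vocab = sorted({w for d in docs for w in d.lower().split()})
    cols = [[d.lower().split().count(w) for w in vocab] for d in docs]
    return vocab, cols

def top_eigenvectors(cols, k, iterations=200):
    # Step 2: principal eigenvectors of A * A^T, applied implicitly.
    dim = len(cols[0])
    found = []
    for n in range(k):
        v = [1.0 / (i + 1 + n) for i in range(dim)]  # arbitrary start
        for _ in range(iterations):
            nxt = [0.0] * dim
            for c in cols:
                s = dot(c, v)
                for i, ci in enumerate(c):
                    nxt[i] += s * ci
            # Deflation: remove directions already found.
            for e in found:
                proj = dot(nxt, e)
                nxt = [x - proj * y for x, y in zip(nxt, e)]
            norm = math.sqrt(dot(nxt, nxt))
            v = [x / norm for x in nxt]
        found.append(v)
    return found

def concept_coordinates(cols, eigvecs):
    # Step 3: dot each document with each eigenvector ("concept").
    return [[dot(c, e) for e in eigvecs] for c in cols]

DOCS = [
    "cheap pills cheap",
    "pills online cheap",
    "meeting notes monday",
    "monday meeting agenda",
]
vocab, cols = term_document_columns(DOCS)
concepts = top_eigenvectors(cols, k=2)
for doc, coord in zip(DOCS, concept_coordinates(cols, concepts)):
    print(doc, [round(c, 3) for c in coord])
```

    With these toy documents the two pill messages load on one concept and the two meeting messages on the other, which is the "mix of concepts" described above.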

  30. Re:how does it compare to Bayesian? by rbright · · Score: 2, Informative

    Furthermore, most Bayesian filters process headers as well, so the mail would be weighted heavily towards ham simply because it was from Aunt Emma and addressed directly to you.

  31. Re:Maybe... by dj245 · · Score: 1, Informative

    Wow, a checkbox buried in the preferences options. Apple is unique and ahead of the curve. But wait! There is a fix for outlook too.

    --
    Even those who arrange and design shrubberies are under considerable economic stress at this period in history.
  32. Re:how does it compare to Bayesian? by ghamerly · · Score: 2, Informative

    Your post is a bit misleading. It's true that the words are all considered together, but it's not true that they are considered "in context" in the sense that phrases are considered. The thing that makes Naive Bayes classifiers viable for most applications is that they are "naive", and do not consider phrases. Instead, each word is considered conditionally independent of every other word (conditioned on the class label, in this case spam or not spam). The "spamminess" of each word has an additive effect, and the phrase "Joe wants to buy viagra" (in a non-spam) is about equally spammy as "You should buy viagra" (in a spam).

    Just wanted to clear that up. It may have been what you meant all along, but that's not what came through.
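
    A tiny sketch of that additive behaviour, with invented per-word log-odds in a hypothetical `WORD_LOG_ODDS` table: the score is a plain sum, so word order and phrasing cannot matter, and "buy viagra" contributes the same amount in both messages.

```python
# Hypothetical per-word log-odds of spam (positive means spammier).
WORD_LOG_ODDS = {"viagra": 3.0, "buy": 1.0, "joe": -2.0, "you": 0.5}

def spam_log_odds(text, default=0.0):
    # Each word adds its own contribution, independent of its neighbours.
    return sum(WORD_LOG_ODDS.get(w, default) for w in text.lower().split())

print(spam_log_odds("Joe wants to buy viagra"))
print(spam_log_odds("viagra buy to wants Joe"))  # same words, shuffled
print(spam_log_odds("You should buy viagra"))
```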

  33. Re:Fast?!? by Rosyna · · Score: 2, Informative

    The limit exists on OS X (at least) because of a limit of the Resource Manager. Each message in the mbox on OS X has its index and other data in the resource fork. One for each message. There is a 16-bit limit on the number of resources in a file (and a 16 MB limit for the entire resource fork). It is also why some OS X developers keep asking Apple to FREAKIN IMPLEMENT NAMED FORKS ALREADY!

  34. Re:how does it compare to Bayesian? by NoOneInParticular · · Score: 4, Informative

    You're absolutely right, but note however that what the grandparent calls 'Bayesian filtering' refers to something that is more commonly known as 'naive Bayes': Bayesian inference with a set of extremely limiting assumptions. This technique is known in information retrieval as both the 'multinomial' and the 'multivariate' model of word frequency manipulation (which is which depends on how you store the evidence: only word occurrences or also word counts). In this sense, 'Bayesian filtering' is a very narrow subset of 'Bayesian inference' and it's completely possible, and even quite likely, that latent semantic analysis subsumes it.

  35. Good god, man by thatguywhoiam · · Score: 5, Informative
    Wow, a checkbox buried in the preferences options. Apple is unique and ahead of the curve. But wait! There is a fix for outlook too [msnwar.com].

    Well, since you brought it up, yes, let's compare:

    Apple method:
    Open Prefs
    Click Viewing Options
    Uncheck 'Display images and embedded objects in HTML messages'

    ... or I can go hunting on the web for this weirdo, non-sanctioned 'patch' for Outlook, and install that. Oh yeah, and ZoneAlarm.

    I'll stick with Apple's method thanks.

    --
    If Jesus wants me it knows where to find me.
    1. Re:Good god, man by fanfriggintastic · · Score: 3, Informative

      Images are off by default in Outlook 2003. You can turn them on for a particular sender or per email, easily, through a link at the top of the message. Piece of cake.

      --
      This is not the greatest sig in the world, no. This is a tribute.
    2. Re:Good god, man by geoffspear · · Score: 2, Informative
      My three mouse buttons all work perfectly well with my Mac. They don't restrict you to anything, they just sell their machines with a one-button mouse.

      I don't even need to go hunting for drivers to install if I want to plug in another mouse, or damn near any other USB device. They just work.

      --
      Don't blame me; I'm never given mod points.
  36. Latent Semantic Analysis by Henry+Stern · · Score: 4, Informative

    After reading through the comments here, it is obvious that there are some misconceptions about what Apple is doing.

    Latent Semantic Indexing (LSI) was invented by Deerwester et al. [1] as a method of reducing the dimensionality of a text corpus by finding a low-rank approximation of the term-document matrix.

    The singular value decomposition (SVD) [2] factors a matrix A into the product of two orthogonal matrices and a diagonal matrix, A = U'SV. To find a rank-k approximation of A using this factorisation, create matrices U^, S^ and V^, where S^ contains the first k rows and columns of S, U^ contains the first k rows of U, and likewise for V^. Then let A^ = U^'S^V^. Among all rank-k matrices, A^ minimises the Frobenius norm [3] of the difference A - A^ (least squares).
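
    Written out in the same notation (with U' as the transpose of U), and assuming the singular values on the diagonal of S are sorted in decreasing order, the least-squares property is the standard Eckart-Young result:

```latex
A = U^{\top} S V, \qquad
\hat{A} = \hat{U}^{\top} \hat{S} \hat{V}, \qquad
\lVert A - \hat{A} \rVert_F
  = \min_{\operatorname{rank}(B) \le k} \lVert A - B \rVert_F
  = \sqrt{\sum_{i > k} \sigma_i^{2}}
```

    Here the σ_i are the singular values, so the error of the truncation is exactly the energy in the discarded ones.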

    Rather than storing the full matrix, A^, in practice it is much more common to save U^ and S^ and project the columns and rows of A into a k-dimensional space. This allows both terms and documents to be clustered together and helps to associate keywords with documents.

    You can do many things with these approximated document vectors: clustering, classification, document retrieval. Apple is probably using a k-nearest neighbour classifier [4] to determine how a message is to be filed.
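
    A sketch of what such a classifier could look like once messages have been projected into the k-dimensional space. The 2-D vectors, the labels and the choice of cosine similarity are all assumptions for illustration; nothing here is known about Apple's actual implementation.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two reduced document vectors.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def knn_classify(labelled, query, k=3):
    # Take the k most similar training messages and let them vote.
    nearest = sorted(
        labelled, key=lambda item: cosine(item[1], query), reverse=True
    )[:k]
    votes = Counter(label for label, _ in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical messages already projected into a 2-dimensional space.
TRAINING = [
    ("junk", [0.9, 0.1]),
    ("junk", [0.8, 0.2]),
    ("junk", [1.0, 0.0]),
    ("inbox", [0.1, 0.9]),
    ("inbox", [0.2, 0.7]),
    ("inbox", [0.0, 1.0]),
]

print(knn_classify(TRAINING, [0.7, 0.3]))
print(knn_classify(TRAINING, [0.1, 0.8]))
```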

    I would be most interested to see Apple's updating strategy. There are several algorithms that allow you to add new rows and columns to a matrix where you know the full SVD, but none that I know of for the truncated SVD.

    For one of my graduate-level courses, I wrote a little search engine that uses LSI to cluster 1000 newspaper articles. You can play with it here. My favourite query is "Rowan Gorilla." The Rowan Gorilla is an oil rig that frequents Halifax harbour. The search engine returns articles on the oil and gas industry that contain neither the word "Rowan" nor "Gorilla" but are still topical.

    [1] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society of Information Science, 1990.

    [2] Singular Value Decomposition -- from MathWorld. http://mathworld.wolfram.com/SingularValueDecomposition.html

    [3] Frobenius Norm -- from MathWorld. http://mathworld.wolfram.com/FrobeniusNorm.html

    [4] Artificial Intelligence Wiki: NearestNeighbour. http://www.ifi.unizh.ch/ailab/aiwiki/aiw.cgi?NearestNeighbor

  37. Re:Maybe... by That's+Unpossible! · · Score: 2, Informative

    I assume web bug images aren't filtered out if they are, for example:

    http://host.com/images/1F59C6EA.jpg


    You assume wrong. The guy you're responding to said they remove offsite image tags. So unless the images are embedded in the email (i.e. not web-bugs), they aren't displayed.

    You cannot filter web-bugs and still leave images pointing offsite, obviously.

    --
    Ironically, the word ironically is often used incorrectly.
  38. Re:Fast?!? by EvilTwinSkippy · · Score: 4, Informative
    Where to start...

    First off, servers take SATA or SCSI, not the cheap IDE drives you find on the net. Second, even if you could find equivalent sizes for equivalent prices for server-grade stuff, I can't speak for everyone, but users don't store anything on my network that isn't on a RAID. 2 drives for a RAID-1, 3 (at least) for RAID-5.

    Assuming that cost isn't an issue, and you have a miraculous RAID controller that is easy to program, you run into the problem of how to hook up the new drives. If you don't have enough bays and connectors you have to drop your old hard drives to tape, plug in your new drives, and restore.

    The last time I did a restore of 160GB it took 48 hours with a DLT autoloader. AIT might cut that down to 12 hours. But that's still a long time to be without data.

    I'll save the issues about premature failure on these uber-mega drives for another discussion.

    Now I insist our users use IMAP for email. Too many bad experiences of desktops croaking and taking all of a user's POP mailboxes with it. Making your system catalogue several gigabytes of email per user is going to slow things to a crawl, unless you are using something enlightened like maildir. Even then, you are going to be hell bent to find a file system that efficiently handles both uber-mega attachments AND a few million tiny text files for individual messages.

    All for what? So some user doesn't have to be bothered to clean out their mailbox?

    No problem, except the next thing El' numbnuts is going to ask for is a tool to actually FIND something in all that mess.

    --
    "Learning is not compulsory... neither is survival."
    --Dr.W.Edwards Deming
  39. Re:how does it compare to Bayesian? by adamengst · · Score: 2, Informative

    You can have Bayesian filtering in Mail, with SpamSieve from Michael Tsai.

    You might also be interested in reading Joe Kissell's just-released ebook Take Control of Spam with Apple Mail, which explains the common accuracy problems with Mail's Junk filter and how to optimize it for better results. Joe also recommends SpamSieve as an alternative to Mail's Junk filter in those instances where Mail proves inadequate.

    cheers... -Adam

  40. Re:Maybe... by myov · · Score: 2, Informative

    Messages flagged as spam do not display images (until you click Load Images). I requested this feature a while ago because of all the web bugs embedded in spam.

    --
    I use Macs to up my productivity, so up yours Microsoft!
  41. Apple's filters need help by cardozo · · Score: 2, Informative
    I found the same thing other people have noticed: Mail.app's filter misses stuff and is hard to train.

    Enter JunkMatcher Central.

    It uses rules-based filtering to complement Mail.app's methods. And, as a bonus, you can have it mark what it finds as junk mail, which trains Mail.app.

    It requires some tweaking, but is great, updated often, and free!