How Apple's Mail.app Junk Filter Works

Magic by Faust7 · 2004-05-18 16:56 · Score: 4, Funny

and no, it doesn't use white magic...

Black, then?
Or is that reserved exclusively for Microsoft?

Re:Magic by Anonymous Coward · 2004-05-18 17:03 · Score: 0

Maybe the lack thereof of magic.
Re:Magic by Jameth · 2004-05-18 17:18 · Score: 4, Funny

and no, it doesn't use white magic...

Black, then? Or is that reserved exclusively for Microsoft?
It's not reserved, they have a monopoly.
Re:Magic by lpangelrob2 · 2004-05-18 17:32 · Score: 1

Black, then?
I would have to imagine it would be a little more like red magic. Pretty versatile, borrows a bit of both, and largely effective, but if you want hardcore effects, you'll have to go all white or all black.

--
-Rob
Marriage doesn't have to suck!
Re:Magic by Inf0phreak · 2004-05-18 17:39 · Score: 2, Funny

Oh yes. I can just imagine how some of the code looks:
if (isspam(mailentry)) HADOKEN(mailentry);

Go here for an explanation (funny webcomic IMO).

--
________
Entranced by anime since late summer 2001 and loving it ^_^
Re:Magic by Anonymous Coward · 2004-05-18 22:13 · Score: 0

if (isspam(mailentry)) HADOKEN(mailentry);

BAD idea, unless you only get one spam a day. ;)

Maybe... by ErichTheWebGuy · 2004-05-18 17:02 · Score: 5, Interesting

Microsoft can learn a lesson here? Especially in the light of this hole, from which a spammer can clearly see that you have opened their messages and validate your address...

--
bash: rtfm: command not found

Re:Maybe... by Anonymous Coward · 2004-05-18 17:24 · Score: 5, Informative

That's why, at our site, all incoming email goes through the Anomy Sanitizer. It removes unknown HTML tags, like <vframe> or <script>, and filters out offsite images to eliminate so-called web bugs.

Oh, and it's fast, too.
Re:Maybe... by ErichTheWebGuy · 2004-05-18 17:41 · Score: 1

Sweet, thanks for the info. I will look into deploying it at our site.

--
bash: rtfm: command not found
Re:Maybe... by karmatic · 2004-05-18 17:43 · Score: 5, Informative

Macs are vulnerable to the so-called "hole" as well. In fact, _any_ html compliant email client with image support is.

For example, I wrote some software which takes your email address, and assigns a 5 letter id. The img tag loads an image with the url http://mailserver/get/yourid/image.gif

From this, it's possible to tell 1) If the email is valid, 2) If you click the image (the url contains your ID) 3) How long before you click 4) If you buy.

So, if you're dumb enough to buy from spam you get on a sucker list.

Quit blaming MS - they are unfortunately the ones who introduced HTML mail, but everyone else who follows suit has problems too.
Re:Maybe... by ErichTheWebGuy · 2004-05-18 17:47 · Score: 1

Quit blaming MS

Drag... I had no idea. Thanks for the info. I just assumed, since past history supports the theory, that the Microsoft software was the mitigating factor.

--
bash: rtfm: command not found
Re:Maybe... by bigberk · 2004-05-18 17:53 · Score: 2, Informative

from which a spammer can clearly see that you have opened their messages and validate your address...
That's old news, I wrote the solution three years ago. Just use a mail client such as this one that strips HTML.
Re:Maybe... by tkokesh · 2004-05-18 17:57 · Score: 5, Informative

Actually, Mail.app in Mac OS X 10.3 (Panther) has an option in the "Viewing" Preferences: "Display images and embedded objects in HTML messages".
When this option is unchecked, the user has to click a specific "Load Images" button in order to see the images in an HTML email, which means that the GIF does not get loaded unless the user lets it. For obvious spam emails, of course, the user can just junk the email, and the spammer gets no confirmation of delivery.

--

A pride of lions.
A gaggle of geese.
A murder of crows.
A vista of bugs.
Re:Maybe... by rritterson · 2004-05-18 18:21 · Score: 2, Informative

Or you can just set Outlook 2003 to not parse html and show it as code instead. You can also tell it not to download images by default which prevents another possible 'notifier'

--
-Ryan
AUWYHSTOT (Acronyms are Useless When You Have to Spell Them Out Too)
Re:Maybe... by nacturation · 2004-05-18 18:29 · Score: 4, Interesting

I assume web bug images aren't filtered out if they are, for example:

http://host.com/images/1F59C6EA.jpg

A spammer could set up their server (mod_rewrite, I think?) so that this gets translated to:

http://host.com/serve_image.php?email_id=1F59C6EA

This would still verify the email address and would generally be transparent to the user. The filter could get smarter and search for numbers, but this is also easily overcome by dictionary words. If you used 5 letter words, you'd have about 10,000 of them to use. You could then represent 100,000,000 (10,000 ^ 2) email addresses using only two five letter words in succession in a URL, such as:

http://host.com/img/abash/zymin/logo.jpg

and rewriting it as before. Each user gets a unique combination of two words that uniquely identifies them. If abash is the 9th word and zymin is the 9914th word, then this is user id (9 * 10,000 + 9914) = 99,914.

Really, the only solution to web bugs is to not load images from unknown senders. Make the user manually load images (mail.app has this feature as do many other clients) if they are not attached as files with the message.
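
The two-word arithmetic above can be sketched in a few lines of Python. This is only an illustration: the word list is a generated placeholder standing in for a real list of about 10,000 five-letter words, and the function names are made up.

```python
# Hypothetical word list: placeholders standing in for ~10,000 real
# five-letter dictionary words, so the sketch stays self-contained.
WORDS = [f"w{i:04d}" for i in range(10_000)]
INDEX = {word: i for i, word in enumerate(WORDS)}


def encode(user_id):
    """Map a user id in [0, 10000**2) to a pair of words."""
    high, low = divmod(user_id, len(WORDS))
    return WORDS[high], WORDS[low]


def decode(first, second):
    """Recover the user id from the two words found in the URL path."""
    return INDEX[first] * len(WORDS) + INDEX[second]


def tracking_url(user_id):
    """Build an innocuous-looking image URL that carries the id."""
    first, second = encode(user_id)
    return f"http://host.com/img/{first}/{second}/logo.jpg"
```

Nothing in such a URL looks like a number or an ID, which is why pattern-matching on the URL is a losing game and not loading remote images at all is the safer default.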

--
Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
Re:Maybe... by Anonymous Coward · 2004-05-18 18:51 · Score: 0

Not true. kmail isn't vulnerable to this.
Re:Maybe... by Anonymous Coward · 2004-05-18 18:59 · Score: 0

I assume web bug images aren't filtered out if they are, for example ... http://host.com/images/1F59C6EA.jpg
No, you assume wrong. Of course they are filtered out in that case, otherwise (as you go on to say) there would be no point. Anomy filters any <img src="..."> tag where "..." is not an image attached to the email in a MIME section.
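
For anyone curious what that kind of filtering looks like, here is a rough sketch in Python. It is not Anomy's code (Anomy's implementation isn't shown in this thread); it only assumes that attached images are referenced with cid: URLs, which is how MIME parts are normally addressed from HTML mail. The class and function names are invented for the example.

```python
from html.parser import HTMLParser


class RemoteImageStripper(HTMLParser):
    """Re-emit HTML, dropping <img> tags whose src is not a cid: reference.

    Comments and declarations are silently dropped, which is acceptable
    for a sketch but not for a production sanitizer.
    """

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def _keep(self, tag, attrs):
        if tag != "img":
            return True
        src = dict(attrs).get("src") or ""
        return src.lower().startswith("cid:")

    def handle_starttag(self, tag, attrs):
        if self._keep(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if self._keep(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def strip_remote_images(html):
    parser = RemoteImageStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The point of filtering on "is it attached?" rather than on what the URL looks like is that no amount of URL obfuscation gets past it.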
Re:Maybe... by nacturation · 2004-05-18 19:23 · Score: 1

Thanks for the clarification. Upon re-reading your original post, I misunderstood what you said about images.

--
Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
Re:Maybe... by mibus · 2004-05-18 19:59 · Score: 1

Evolution is good with images in html mail... it'll show them, but only if they're attached to the email itself. Otherwise, you have to use a menu item to load the images for that message.
Re:Maybe... by dj245 · 2004-05-18 21:17 · Score: 1, Informative

Wow, a checkbox buried in the preferences options. Apple is unique and ahead of the curve. But wait! There is a fix for outlook too.

--
Even those who arrange and design shrubberies are under considerable economic stress at this period in history.
Re:Maybe... by EddWo · 2004-05-18 21:46 · Score: 1

Fixed in Outlook 2003, does not show images in email unless requested or the sender is added to a safe sender list, and in Outlook Express included in XP SP2.

--
"Taligent is still pure vapor. Maybe they'll be the last who jumps up on Openstep... "
Re:Maybe... by the_olo · 2004-05-18 22:19 · Score: 1

Some other clients have an option to modify that behaviour. For example in Mozilla you can switch message display to simple HTML, which doesn't load remote images, or select "Do not load remote images in Mail and Newsgroups messages" in privacy options.
However, that implementation has weak points too, like not using a sandbox, so there's an open bug on improving this:

Bugzilla Bug 28327 No server hits at HTML mailnews reading - privacy (disable remote content/web-bugs)
http://bugzilla.mozilla.org/show_bug.cgi?id=28327

(you have to copy/paste the URL since Mozilla Bugzilla denies referrals from Slashdot).
Re:Maybe... by Short+Circuit · 2004-05-18 23:23 · Score: 1

Thunderbird disables images in a message (and performs other sanitizing operations) if you mark a message as junk. And you don't have to read a message in order to mark it.

So if you're unsure about a message, mark it as junk, read the text, then go from there.

--
tasks(723) drafts(105) languages(484) examples(29106)
Re:Maybe... by tbone1 · 2004-05-18 23:47 · Score: 1

Maybe Microsoft can learn a lesson here?
Do you realize what you just said?

--

The Independent: Reverend Spooner Arrested in Friar Tuck Incident - ISIHAC, Historical Headlines
Re:Maybe... by Anonymous Coward · 2004-05-19 00:21 · Score: 0

Yes, but images are off by default, so if you want it to display HTML mail with images you have to go find this "buried preference".

If you have never used Mail.app then refrain from commenting and embarrassing yourself.
Re:Maybe... by spoot · 2004-05-19 00:22 · Score: 0, Flamebait

You can also tell Outlook not to check mail. That's pretty foolproof. Even better, move Outlook to a safe place, such as the trash bit. That's a pretty good way to avoid any offending code.
Re:Maybe... by cosmo7 · 2004-05-19 00:41 · Score: 1

"Mitigating" means reducing in severity.
Re:Maybe... by Refrag · 2004-05-19 01:06 · Score: 1

Mail won't load images for any mail that had been identified as junk mail unless the user tells it to.

--
I have a website. It's about Macs.
Re:Maybe... by That's+Unpossible! · 2004-05-19 01:09 · Score: 2, Informative

I assume web bug images aren't filtered out if they are, for example:

http://host.com/images/1F59C6EA.jpg

You assume wrong. The guy you're responding to said they remove offsite image tags. So unless the images are embedded in the email (i.e. not web-bugs), they aren't displayed.

You cannot filter web-bugs and still leave images pointing offsite, obviously.

--
Ironically, the word ironically is often used incorrectly.
Re:Maybe... by LoudMusic · 2004-05-19 02:02 · Score: 0, Offtopic

Actually, Mail.app in Mac OS X 10.3 (Panther) has an option in the "Viewing" Preferences: "Display images and embedded objects in HTML messages".

How did this get modded up? Nearly every mail program has the option to disable HTML.

--
No sig for you. YOU GET NO SIG!
Re:Maybe... by JLester · 2004-05-19 02:04 · Score: 1

It is the default in Outlook 2003 now. You have to specifically set what senders you want to see images from.

Jason

--
"FORMAT C:" - Kills bugs dead!
Re:Maybe... by Anonymous Coward · 2004-05-19 02:23 · Score: 0

I take an easier approach. I have a software firewall running on my system that can filter by application and/or port number(s) an application is allowed to "talk" to. I only allow Outlook to talk to ports 25 and 110 on my incoming and outgoing mail servers, and that's it. In addition, the "preview pane" is disabled and I run SpamBayes to filter mail. The firewall I use is called the Tiny Personal Firewall. No affiliation to the company other than a very satisfied user.
Re:Maybe... by Anonymous Coward · 2004-05-19 02:30 · Score: 0

Yeah, I guess we should blame YOU for writing the software!
Re:Maybe... by Merk · 2004-05-19 02:38 · Score: 2, Interesting

Why leave any HTML? Does <blink> make a message more compelling? Do you really need someone to send a message with balloons in the background? If someone really likes the handwriting font, should I be forced to see that in their email?

Sure, sometimes in a complex email it would be nice to be able to use headers or bulleted lists. But nobody should be able to force me to display the message with their ugly-ass markup.

The only thing that makes any sense here is to use strict stylesheet-based markup. Someone can label things as 'headers' and 'bulleted lists'. Then, the receiver can have a stylesheet that properly renders these types of content markers so the information isn't lost. That way, 'chick who likes balloon backgrounds' can display all her incoming emails that way, and 'guy who likes unreadable fonts' can have all his incoming emails displayed in that font... but those of us who like black, 12pt Times New Roman text on white backgrounds can avoid being driven insane.
Tags like <i> and <b> and <blink> and <font> shouldn't ever be part of email.
Re:Maybe... by jskiff · 2004-05-19 02:49 · Score: 1

Whether I like it or not, the company I work for sends out a quarterly email "newsletter" for people who sign up for our mailing list. To be fair, ours is definitely "opt in". When you download our software, you don't have to enter your email address at all if you don't want to, and even then you still have to click a checkbox saying "I would like to receive announcements..." etc. It's not great, but at least we're not overtly deceptive.

Unfortunately, though, we do use image tags that link back to user IDs, so we can see who's opened the emails. Additionally, all of the links we have regarding products have unique IDs as well, so if someone clicks through for more information, that activity gets logged to our CRM system so one of our friendly sales folks can follow up.

I can't say I like it, but unfortunately it seems to be the way a lot of business gets done these days.

--
It's "no one," not "noone." Who the hell is noone anyway?
Re:Maybe... by Myopic · 2004-05-19 03:00 · Score: 1

you seem to have all the answers. please don't go into the spam business, unless you already are. ;-)
Re:Maybe... by orasio · 2004-05-19 03:07 · Score: 2, Insightful

(I was going to mod you down, but I understood that it's a good comment, I just think you are wrong)

Nonsense. HTML mail should be rendered as HTML. If you want to see text-only, or something, you can just read mail as text-only, in your client. If I send mail with balloons, it is because I want people to see my beautiful balloons and gothic handwriting. Messing with that is mangling communication, the other person thinks you saw something you didn't.

No one I know abuses HTML mail to the extent of making it hard to read. If I had friends like that, they wouldn't know my email address.

Maybe you just need to be more picky about giving your address to people.
Re:Maybe... by needacoolnickname · 2004-05-19 03:08 · Score: 1

Actually it is at the top of the email. It says that this email contains images that have not been loaded (because this is the default with Mail.app), would you like to load them. If so one clicks on a big ass button that says "LOAD IMAGES" to see them.
Re:Maybe... by Anonymous Coward · 2004-05-19 03:28 · Score: 0

It doesn't disable HTML, it changes the default behavior to not load images and embedded objects in HTML messages.
Re:Maybe... by azav · 2004-05-19 03:46 · Score: 1

Which is why I don't display images in my HTML email.

The problem with MS is that GENERALLY, when they do introduce something, everyone else HAS TO follow suit.

And having HTML email on by default, is inane.

So I'll blame them for introducing another blight into society.

--
- Zav - Imagine a Beowulf cluster of insensitive clods...
Re:Maybe... by myov · 2004-05-19 04:31 · Score: 2, Informative

Messages flagged as spam do not display images (until you click Load Images). I requested this feature a while ago because of all the web bugs embedded in spam.

--
I use Macs to up my productivity, so up yours Microsoft!
Re:Maybe... by nacturation · 2004-05-19 05:47 · Score: 1

Yeah, I realized this after he pointed it out. Me failed reading comprehension. See other AC thread.

--
Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
Re:Maybe... by ChaosDiscord · 2004-05-19 06:25 · Score: 2, Funny

Maybe you just need to be more picky about giving your address to people.

I tried that, but my boss got angry when I refused to give him my business address.

--
Search 2010 Gen Con events
Re:Maybe... by Golias · 2004-05-19 06:42 · Score: 2, Funny
I totally agree!!! It seems to me that converting HTML to plain old text should be a perfectly fine choice for those who don't want to read your dumbass, pointless markup.
Some people really like using HTML, and everybody should respect that.
Those who read this hoseshit from the command line can just suck it up and deal with it.
--
Information wants to be anthropomorphized.
Re:Maybe... by orasio · 2004-05-19 07:28 · Score: 1

When I said "text-only" I was thinking something more in the line of the mail reader running
lynx -force_html -dump <crappy_html> > <text_output>
on HTML content.

In the case of your post, HTML-hating people would read something like this, according to lynx:
-- I totally agree!!! It seems to me that converting HTML to plain old text should be a perfectly fine choice for those who don't want to read your dumbass, pointless markup. ; Some people really like using HTML, and everybody should respect that. Those who read this hoseshit from the command line can just suck it up and deal with it.
--
Anyway, in this particular case, I would just mark your mail as junk in Mozilla Thunderbird, but that's just me.
Re:Maybe... by RogerWilco · 2004-05-19 10:50 · Score: 1

I agree with you,

If they want to use mark-up, then send me an e-mail with an attachment,
in whatever format they like.

--
RogerWilco the Adventurous Janitor
Re:Maybe... by Calroth · 2004-05-19 11:12 · Score: 1

There's been a solution for this in web browsers for years now:

User-defined custom CSS stylesheets.

With these, you can disable what you want, remove backgrounds, remove blink tags, and turn all text to Times New Roman.

Unfortunately, they haven't made it into e-mail clients yet (with the possible exception of Mozilla; I haven't checked this one).

I think HTML mail is great in principle, and nobody complains much about gaudy web pages with balloon backgrounds and blinking text, even though they're just as easy to make as gaudy e-mail messages. It's a social thing, not a technical thing. You ignore web pages that don't appeal to you; why not the same for e-mail?

Besides, text mail is (to me) a throwback to 1970... monospaced, 80x24 terminals. I know this is Slashdot and people love stuff like that, but it doesn't appeal to me.
Re:Maybe... by Hooded+One · 2004-05-19 16:50 · Score: 1

You'd be right that Mozilla is an exception - userContent.css works just as well in Mail/Thunderbird.

Vectors..... by BWJones · 2004-05-18 17:03 · Score: 4, Interesting

Each document is in turn represented by a long string of numbers, one for each word in the corpus. In mathematical terms, we would say that every document is a vector of n numbers or a point in a space with n dimensions. I know it sounds quite geeky but if you can visualize that, you're halfway there.

Ah, it uses vector math. With Altivec, no wonder Mail is so damned fast.

The other really interesting thing about mail is that it implements clustering algorithms to rank and group which makes me wonder why more GIS software is not running on OS X. Image classification would be a no brainer for folks that spend their time examining images and multispectral datasets.

--
Visit Jonesblog and say hello.

Re:Vectors..... by Anonymous Coward · 2004-05-18 17:15 · Score: 0

Ah, it uses vector math. With Altivec, no wonder Mail is so damned fast.

LOL. Yeah, client-side filtering of 1,000-word emails really makes a dent in a 1 or 2 GHz processor without parallel vector processing. Damn, 12 microseconds instead of 3 microseconds! :-)
Re:Vectors..... by mrpuffypants · 2004-05-18 17:15 · Score: 1

The other really interesting thing about mail is that it implements clustering algorithms to rank and group which makes me wonder why more GIS software is not running on OS X. Image classification would be a no brainer for folks that spend their time examining images and multispectral datasets.

Yes, that is important and all, but the real question is: "How fast does it play PORN?" Truly that is a real multispectral dataset that needs to be examined using floating points. heh.
Re:Vectors..... by KarMax · 2004-05-18 17:22 · Score: 1

So...
If I receive a mail (REALLY SPAM) that says:
"Enlarge your penis and forget all your problems"
This filter can "accidentally" filter a mail from a beautiful lady that says:
"Forget all your problems i will suck your penis"
(YES this girl is fashionable and doesn't say "dick", she says penis)
No, I want a SPAM filter that doesn't think, just like my girlfriend.

Well... seriously, there's a lot of SPAM that says "Here is your file" or other common combinations of words that can be filtered. I don't trust this kind of filter.

--
Rock and Roll
Re:Vectors..... by RovingSlug · 2004-05-18 17:45 · Score: 4, Insightful

Ah, it uses vector math. ... Image classification would be a no brainer for folks that spend their time examining images and multispectral datasets.
Ugh. The magic doesn't come from vectors. Vectors are just how you throw the numbers around. The reason the classification apparently works well is their choice of representation of the document: a word histogram -- the occurrence count for each word. To measure the distance between two histograms, you usually use the chi-squared test. So, forget all about "vectors", the real workhorse is the histogram. And we can talk about "clustering", but it's just as important to know how you're measuring the distance from one document to another.
Image clustering is hard, and the problem comes from picking a good representation of the image. Of course, a "word histogram" for an image makes no sense. Just considering pixel intensity or pixel color doesn't work either. You usually have to start looking at things like lines, curvatures, intersections, texture patterns, etc. Once you decide on the tools you're going to use to describe an image and the algorithms to calculate them, you can start talking about how far away one image is from another, which then naturally leads to clustering techniques. But, the hard part about the clustering is getting them into a space in which they actually, nicely cluster.
I had to stop reading the article because it was so clearly written by someone who had no comfort with the mathematical concepts or techniques. (Sorry, but seriously, it's the blind leading the blind.)
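
For the curious, a word histogram and a chi-squared distance between two of them can be sketched in Python as below. This is the common symmetric chi-squared histogram distance, picked for illustration; nothing in the article or this thread confirms that Mail.app uses it, and the tokenizer (lowercase, split on whitespace) is a deliberate oversimplification.

```python
from collections import Counter


def word_histogram(text):
    """Occurrence count for each word; tokenizing is naive on purpose."""
    return Counter(text.lower().split())


def chi_squared_distance(h1, h2):
    """0.5 * sum((a - b)^2 / (a + b)) over every word seen in either text."""
    total = 0.0
    for word in set(h1) | set(h2):
        a, b = h1[word], h2[word]  # Counter returns 0 for missing words
        total += (a - b) ** 2 / (a + b)
    return 0.5 * total
```

Identical histograms give a distance of 0, and the distance grows as the word counts diverge.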
Re:Vectors..... by BWJones · 2004-05-18 18:01 · Score: 5, Informative

The magic doesn't come from vectors. Vectors are just how you throw the numbers around

And your point is?

The reason the classification apparently works well is their choice of representation of the document: a word histogram -- the occurrence count for each word. To measure the distance between two histograms, you usually use the chi-squared test.

For a univariate space (or perhaps bivariate space) this will work, but now try implementing standard chi-square analysis in multivariate (or hyperspectral) space. It starts to fall short rather quickly, hence the analysis of distances between clusters.

Image clustering is hard, and the problem comes from picking a good representation of the image.

Yes, I do image clustering almost every day. Well, at least a couple times a week. With proper discriminands one can overcome "good image representation" problems.

Of course, a "word histogram" for an image makes no sense.

Actually, it does in a sense when you realize that images are simply matrices of numbers just like sentences or paragraphs can be identified as matrices after assigning lookup values to certain properties.

Just considering pixel intensity or pixel color doesn't work either.

Actually, yes it does. This is how many standard measures of image cluster analysis work.

You usually have to start looking at things like lines, curvatures, intersections, texture patterns, etc.

Actually, no. For many image classification algorithms that examine pixel value (oil bearing strata, concrete vs granite, types of aluminum in missiles etc...), structure or anatomy play absolutely no role in the identification of classes.

Once you decide on the tools you're going to use to describe an image and the algorithms to calculate them, you can start talking about how far away one image is from another, which then naturally leads to clustering techniques.

That is a very difficult approach to take for image classification that begins to rely on machine processing and image "interpretation" which is a much higher order problem.

But, the hard part about the clustering is getting them into a space in which they actually, nicely cluster.

Simply add more discriminands or filters and don't worry about "describing" the image. Other properties (like structure and anatomy) fall out after image clustering.

--
Visit Jonesblog and say hello.
Re:Vectors..... by DrSchlock · 2004-05-18 18:23 · Score: 1

Ugh. The magic doesn't come from vectors. Vectors are just how you throw the numbers around. The reason the classification apparently works well is their choice of representation of the document: a word histogram -- the occurrence count for each word. To measure the distance between two histograms, you usually use the chi-squared test. So, forget all about "vectors", the real workhorse is the histogram.

I don't think this is really true either. They're definitely representing documents by vectors, where dimensions correspond to words. (I would bet they've added extra dimensions for features like message length and number of recipients, too.) There's more than one way to compute the distance between two vectors, but they're all pretty easy.

The hard part is using this collection of labeled vectors to generate a rule that correctly predicts the labels of new vectors, i.e., divides up the vector space in a good way.

Image classification is rather different: as you observe, a lot more effort goes into extracting meaningful features of the image. In document classification, you can do a certain amount of this, perhaps some sort of syntactic analysis; but usually most of your features still end up just being words, which are easy to pick out. In both problems you then have to divide up the vector space somehow.

You can argue that the ease of representing a document makes document classification an easier problem. But it's not a feature of this algorithm; everyone currently doing document classification pretty much ends up using a bag-of-words vector, because it's easy and works very well... even though it seems intuitively very silly to throw out word ordering information.
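
A bag-of-words vector is simple enough to show in full. The Python sketch below uses invented helper names and a naive whitespace tokenizer; real systems use a proper tokenizer and usually add the extra feature dimensions mentioned above.

```python
def build_vocabulary(documents):
    """Assign one dimension (index) to each distinct word in the corpus."""
    words = sorted({w for doc in documents for w in doc.lower().split()})
    return {word: i for i, word in enumerate(words)}


def to_vector(document, vocab):
    """Count occurrences per dimension; words outside the corpus are ignored."""
    vector = [0] * len(vocab)
    for word in document.lower().split():
        if word in vocab:
            vector[vocab[word]] += 1
    return vector
```

Because only counts survive, "pills cheap buy" and "buy cheap pills" map to the same point, which is exactly the word-ordering information being thrown away.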
Re:Vectors..... by Hays · 2004-05-18 19:05 · Score: 4, Informative

You're being overly hard on the grandparent. He makes some good points. And naive image vectorization IS a problem. Eigenfaces only works with extremely careful registration of images, because the images are vectorized naively. Basically this means throwing out any notion of spatial coherence. (You could vectorize the image in random order, scanline order, whatever.. as long as you did it consistently across the data set you'd get the same bases out. Shouldn't a system understand that an image shifted one pixel to the right is not arbitrarily far from its original version?).

See http://www.cs.columbia.edu/~jebara/papers/iccv03.pdf for a good argument about this

And responding to another point of yours, classification algorithms that look only at intensity are at best brittle. In the real world things have to be better. You have to be able to recognize an object under different lighting, etc. The fact that you can design and calibrate a system well enough to work on pixel intensity alone in a few specific cases doesn't convince me that it's robust.

That's not to say that you can't do some vision tasks with relatively simple metrics like intensity histograms or naively vectorized images, but really data representation is a major bottleneck for a lot of vision work. But you look like you're qualified to know that so I don't know why you're jumping down the grandparent's throat.
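
The one-pixel-shift problem is easy to demonstrate with a toy example. The Python sketch below uses a made-up 8x8 striped image (nothing from the linked paper): shifting it one pixel to the right makes its naive scanline vector exactly as far from the original as the fully inverted image is.

```python
import math


def flatten(image):
    """Naive scanline vectorization: rows concatenated into one list."""
    return [pixel for row in image for pixel in row]


def shift_right(image):
    """Shift every row one pixel to the right, wrapping around the edge."""
    return [row[-1:] + row[:-1] for row in image]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Vertical stripes, one pixel wide: 255, 0, 255, 0, ...
stripes = [[255 if x % 2 == 0 else 0 for x in range(8)] for _ in range(8)]
inverted = [[255 - pixel for pixel in row] for row in stripes]
shifted = shift_right(stripes)

distance_to_shifted = euclidean(flatten(stripes), flatten(shifted))
distance_to_inverted = euclidean(flatten(stripes), flatten(inverted))
```

To a person the shifted image is nearly the same picture; to the flattened vector it is the worst case.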
Re:Vectors..... by jdrugo · 2004-05-18 20:18 · Score: 1

Ugh. The magic doesn't come from vectors. Vectors are just how you throw the numbers around. The reason the classification apparently works well is their choice of representation of the document: a word histogram -- the occurrence count for each word.

...and the word histogram is represented using a vector. 'Histogram' is probably not the best analogy as the words don't have a numeric value assigned to them (I guess they use word hash tables in this method).

To measure the distance between two histograms, you usually use the chi-squared test.

AFAIK, normalised cross-correlation between two vectors, each representing a document, is faster and gives better results in this case [1].

So, forget all about "vectors", the real workhorse is the histogram. And we can talk about "clustering", but it's just as important to know how you're measuring the distance from one document to another.

This distance is given by the correlation coefficient.

[...] But, the hard part about the clustering is getting them into a space in which they actually, nicely cluster.

That's what Singular Value Decomposition was used for. I'm sure that this is not the only thing they've used to get rid of the noise and somewhere later in the article he mentions a well-performing tokeniser, maybe also doing dictionary lookups to get rid of closed class words (function words, without semantics but only to provide the grammatical framework of the sentence).

I had to stop reading the article because it was so clearly written by someone who had no comfort with the mathematical concepts or techniques. (Sorry, but seriously, it's the blind leading the blind.)

I don't know if you have noticed, but the article was written for the lay reader. Did you expect a scientific publication embedded in an interview? It does not give away all technical details (they would be stupid to do so anyway) but with a bit of imagination and background knowledge it describes well an outline of the technology used. The described method is pretty standard in the Information Extraction community anyway, but the magic usually lies in the details.

[1] H Schuetze, "Automatic Word Sense Discrimination", Computational Linguistics, 24 (1), pp97-124, 1998.
Re:Vectors..... by RovingSlug · 2004-05-18 21:14 · Score: 2, Interesting

The magic doesn't come from vectors. Vectors are just how you throw the numbers around

And your point is?
Ah, that's the main point. Both the article and your original post focus on the fact that vectors are being used. While true, this doesn't really impact the essence of the algorithm -- effectively addressing the lower-level data structures instead of the higher-level algorithms. Perhaps an analogy might be someone describing Google's search by explaining B-trees instead of getting into what process actually determines that one page is better than another for a given search.
I'm not going to address the finer details of image classification further than that the techniques you describe require a significant amount of preparation, selection, and manipulation up-front by a human before a computer can produce useful results. Rather, I used image classification as a motivation to describe why discussing only the notions of "vectors" and "clusters" misses a huge part of the story of what actually makes these sort of techniques work.
Re:Vectors..... by radish · 2004-05-19 02:46 · Score: 1

OK so you lost me a bit with some of that math stuff. Can anyone help me automatically categorize my *ahem* jpg collection? ;)

--
---- Den ene knappen er powerknapp, den andre er Bender voice knapp "Bite My Shiny Metal Ass"
Re:Vectors..... by Anonymous Coward · 2004-05-19 03:42 · Score: 0
- Of course, a "word histogram" for an image makes no sense.
Actually, it does in a sense when you realize that images are simply matrices of numbers just like sentences or paragraphs can be identified as matrices after assigning lookup values to certain properties.
Rubbish. Text can be considered a one-dimensional vector, whereas images are two-dimensional. There is nothing magical about representing things as vectors or matrices; you can stuff any kinds of data in such structures, and that does not automatically generate or expose correlations that are not there.
You can obviously "upgrade" text to artificially consider paragraphs or sentences another full dimension, but that's finding an analogy where one doesn't really exist. Specifically, for images the dimensions form a cartesian space; distances along both axes (or any direction for that matter) have same significance. For text this would not be true; considering intra-sentence word sequence one dimension and inter-sentence another one, first dimension is rather "short"; all words of a sentence are (usually) closely related, whereas words in adjacent sentences need not be; and especially they are not uniformly "close". Back and forward references are close, others have little or no relation.
To summarize: suggesting identical techniques work easily for both text and images is silly.
Re:Vectors..... by Bellyflop · 2004-05-19 03:56 · Score: 1

Sounds like Apple is using an algorithm called SVM lite. Categorization problems are often solved with that particular algorithm because people feel that it's much better than a Bayesian system. I prefer rule-learner based algorithms. I've found them to be much more accurate and they tend to run in O(n) time where n is the number of rules. I've also found several articles that point to alternatives that are better than SVM. But at least Apple got involved! A quick search on the ACM site comes up with a lot of papers on the topic.
Re:Vectors..... by Anonymous Coward · 2004-05-19 06:16 · Score: 0

In natural language vector work, isn't the predominant distance/similarity measure used the cosine measure? Yes, it can be equivalent to a chi-squared test, or any other similarity measure (information-distance, KL divergence etc), but cosine measures are what are commonly used, I believe.

As for the "word histogram": I think you take exception to the bag of words approach. There is statistical work that attempts to utilize "features" in text that are not on a word basis, but apparently bag of words works well enough for quite a few applications. I think it's that single words can be quite significant in terms of weighting the semantic meaning of a document, whereas in imagery individual pixels do not have the same amount of impact.
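
The cosine measure over a bag of words can be sketched in a few lines. This is only a minimal illustration, assuming whitespace tokenization and raw word counts; the function names and sample sentences are made up for the example, and nothing here is claimed to be what Mail.app does.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count each word (word order is discarded)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample documents.
doc1 = bag_of_words("aunt emma sent cooking tips")
doc2 = bag_of_words("cooking tips sent by aunt emma")
doc3 = bag_of_words("buy cheap pills now")
```

Here doc1 and doc2 share every word except "by", so their cosine is close to 1 even though the word order differs, while doc1 and doc3 share nothing and score 0.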
Re:Vectors..... by notwrong · 2004-05-19 12:51 · Score: 1

Ugh. The magic doesn't come from vectors. Vectors are just how you throw the numbers around. The reason the classification apparently works well is their choice of representation of the document: a word histogram -- the occurrence count for each word.
I disagree. I use similar techniques to the ones described in the article in the research I work on every day. The initial vector representations are indeed histograms of a sort, but it is the Singular Value Decomposition (SVD) that allows these enormous vectors to be cut down to a manageable size, and also in a sense distills the meaning from the word frequencies. As the article says, this is Latent Semantic Analysis (LSA), which is a reasonably well known technique in the literature.
To measure the distance between two histograms, you usually use the chi-squared test. So, forget all about "vectors", the real work horse is the histogram.
A processor cannot understand histograms, and so I find it hard to see how they could be the "real work horse". Vector calculations at a decent speed are crucially important here. Using LSA means you get the vectors down to a more manageable size when compared to the original word counts, but there is still a lot of vector crunching to be done. A large amount of vector-intensive work is also required to perform the SVD/LSA itself.
And, we can discuss "clustering", but it's just as important to know how you're measuring the distance from one document to another.
I think you'll find that it's not generally tractable to compare the "histograms" (or more accurately the reduced-dimensionality vectors) of each document/email individually - the number of pairwise comparisons required grows quadratically with the number of documents! You are right that the distance measure must be valid for the clustering analysis, but if the clustering analysis is to be reasonably robust, it should be relatively insensitive to the distance measure chosen, which really means that the clustering is more important. Chi-squared is one distance measuring option, but so are city-block, Euclidean, cosine, etc. The article doesn't mention what sort of clustering analysis is used, but there are several appropriate methods which will reduce the work needed to classify a new document into a fairly simple geometric-type procedure after LSA has been performed. There is additional work in re-examining the clusters once new documents are incorporated, but this is the "training" the article talks about, and is once more, vector-arithmetic intensive.
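One such "geometric-type procedure" is nearest-centroid classification: summarize each cluster by its mean vector, then compare a new document against those few centroids instead of against every stored document. The sketch below is an assumption about what such a step could look like, not the method the article describes; the 2-D vectors are hypothetical stand-ins for documents after dimensionality reduction.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(vector, centroids):
    """Return the label whose centroid lies nearest to the vector."""
    return min(centroids, key=lambda label: euclidean(vector, centroids[label]))


# Hypothetical reduced-dimensionality document vectors, grouped by cluster.
training = {
    "ham": [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0]],
    "spam": [[0.1, 0.9], [0.2, 0.8], [0.0, 1.0]],
}
centroids = {label: centroid(vs) for label, vs in training.items()}
```

Classifying a new document then costs one distance computation per cluster, e.g. `classify([0.7, 0.3], centroids)`, regardless of how many documents were used in training.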
I had to stop reading the article because it was so clearly written by someone who had no comfort with the mathematical concepts or techniques. (Sorry, but seriously, it's the blind leading the blind.)
I strongly disagree. I thought the article was an excellent plain-language discussion of LSA, although I concede that I was already familiar with the mathematics involved. I was very pleasantly surprised to see that the techniques related to the ones I explore in my work are finding their way into mainstream applications. What you say about clustering analysis with respect to image classification sounds plausible, but just because the techniques in this article might not work for images doesn't automatically mean that the article is wrong.

i know how by ShallowThroat · 2004-05-18 17:04 · Score: 5, Funny

it's simple. it uses it's extremely uninsipired app name to scare away spam.

--
The "Insert Quote Here" line is almost as predictable as inserting an actual quote.

Re:i know how by jjeffries · 2004-05-18 17:14 · Score: 4, Funny

I hear that the next version will be known as "mail-enhancemant.app"
Re:i know how by Anonymous Coward · 2004-05-18 17:19 · Score: 1, Funny

And apparently you use your extremely poor use of the English language to scare off replies. Have you ever considered picking up a 7th grade grammar book?
Re:i know how by Anonymous Coward · 2004-05-18 17:42 · Score: 0

Hi, I'm an AC, I have nothing better to do with my pathetic life then play grammar nazi on /.

I suck at at everything.
Re:i know how by Anonymous Coward · 2004-05-18 18:01 · Score: 0

ZING! That's the funniest thing I've heard all week, thank you.

subspaces? by thedogcow · 2004-05-18 17:04 · Score: 5, Funny

The article mentions...

"In mathematical terms, we would say that every document is a vector of n numbers or a point in a space with n dimensions."

Funny. When I took linear algebra I was wondering if there was a practical approach to this, and I guess there is... to eliminate penis enlargement advertisements.

--
Yes! I listen to NYC Speedcore and do math at 3AM. I suggest you try it too.

Re:subspaces? by DrEasy · 2004-05-18 17:26 · Score: 2

Not only does math help eliminate penis enlargement ads, but it eliminates penis growth altogether.

--
"In our tactical decisions, we are operating contrary to our strategic interest."
Re:subspaces? by Capt'n+Hector · 2004-05-18 18:12 · Score: 3, Funny

When I took linear algebra I was wondering if there was a practical approach to this
If by "this" you mean spam filtering, then cool. But if you're talking about applications in general... Are you kidding? Linear algebra is probably the most useful stuff you'll ever learn, especially if you're into computers. It's the stuff CG is made of. EVERYTHING uses linear algebra.
So here's a guess on how this works: So you've got your document vector. You also have a vector space, call it S for "spam". Choose your basis for S to be a bunch of words commonly found in spam. Now, orthogonally project your document vector into S, take the Euclidean norm and if it's too long -- zap it! It's spam!
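
A rough sketch of that guess, for the curious. When the basis of S is just a set of words, the orthogonal projection amounts to keeping only the counts of those words. The word list, the threshold, and the normalization by the full document length are all assumptions added for this illustration (without normalizing, long legitimate mails would look "too long" as well).

```python
import math
from collections import Counter

# Hypothetical basis for the "spam" subspace S.
SPAM_BASIS = {"viagra", "free", "enlargement", "pills"}


def spam_projection_norm(text, basis=SPAM_BASIS):
    """Euclidean norm of the document's word-count vector projected onto S."""
    counts = Counter(text.lower().split())
    return math.sqrt(sum(c * c for w, c in counts.items() if w in basis))


def looks_like_spam(text, threshold=0.5):
    """Flag the document if most of its length lies inside the spam subspace."""
    counts = Counter(text.lower().split())
    full_norm = math.sqrt(sum(c * c for c in counts.values()))
    if full_norm == 0:
        return False  # empty message: nothing to project
    return spam_projection_norm(text) / full_norm > threshold
```

With these assumptions, `looks_like_spam("free pills free viagra today")` is flagged, while a message with no basis words projects to the zero vector and passes.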

--
Quid festinatio swallonis est aetherfuga inonusti?
Africus aut Europaeus?
Re:subspaces? by Hays · 2004-05-18 19:07 · Score: 1

yes yes yes yes! Pay attention in your linear algebra classes. I know your boring instructor might make eigenvectors sound like the most tedious thing in the universe, but they are a staple of computer science if you want to do lots of interesting things with data.
Re:subspaces? by Anonymous Coward · 2004-05-18 19:43 · Score: 0

I was wondering if there was a practical approach to this

itym "a practical application"

There's not much of an approach to a definition at all, let alone a practical one. Especially if you're dealing with linear algebra.
Re:subspaces? by Anonymous Coward · 2004-05-18 22:53 · Score: 0

"In mathematical terms, we would say that every document is a vector of n numbers or a point in a space with n dimensions."

Guess they don't know too much about linear algebra: Points are not equivalent to vectors. For instance, there is no notion of the angle between two points in R^n. The angle is something which they specifically rely on, so saying that a document (in this context) is a point in R^n is just plain wrong.
Re:subspaces? by Dr.+GeneMachine · 2004-05-21 09:42 · Score: 1

Mods?? Funny?? This could work...
Ever seen the strangeness NMR spectroscopists use linear algebra for? Against this background I can't help but find this idea quite... interesting.

--
This comment does not exist.

Face recognition by dysprosia · 2004-05-18 17:06 · Score: 3, Informative

I believe I remember reading somewhere that the same sort of vector/clustering calculations are used in face recognition software?

Just goes to show how solid math/calculations can have some useful applications!

Re:Face recognition by moyix · 2004-05-18 17:25 · Score: 4, Informative

Yes, for example, the eigenfaces method converts each image into a vector, and constructs a new subspace based on the highest ranked common features between them (using Principal Component Analysis, aka the Karhunen-Loève Transform). Then new images are projected into this space and the shortest distance between the new vector and the previously computed ones is found.

It was the first thing that popped into my head while reading the article too :)

...moderation ideas.... by j3ll0 · 2004-05-18 17:07 · Score: 5, Funny

Why wouldn't a similar algorithm work to provide automated moderation? It seems to me that you could certainly identify clusters of words that indicate low-value posts?

Re:...moderation ideas.... by pvt_medic · 2004-05-18 17:14 · Score: 2, Funny

and by that token, i could create something that would get me modded up every time so i can get more karma so i can mod...

oh automated mod... scratch that plan, i will have to figure something else out for world domination.

--
30% Troll, 50% Underrated, 10% Interesting
Score:5, Troll
Re:...moderation ideas.... by wheresdrew · 2004-05-18 17:16 · Score: 5, Funny

Yes, but the combination of too many all too common terms could cause the system to implode.
"In Soviet Russia imagine a beowulf cluster of insensitive clods who don't RTFA because they're using linux to beat the GNAA to the first post."
Re:...moderation ideas.... by Anonymous Coward · 2004-05-18 17:22 · Score: 0

Ahhh! My filter's head a-splode!
Re:...moderation ideas.... by thomasdelbert · 2004-05-19 03:22 · Score: 1

Why wouldn't a similar algorithm work to provide automated moderation? It seems to me that you could certainly identify clusters of words that indicate low-value posts?

If /. implements that, it will give me a new way to karma-whore.

- Thomas;

--
___ This sig is in boldface to emphasize its importance!
Re:...moderation ideas.... by Anonymous Coward · 2004-05-19 03:41 · Score: 0

Score: -42

Deploying Cowboy Neal to administer smackdown.
Re:...moderation ideas.... by merdark · 2004-05-19 08:51 · Score: 1

Yes, you could even replicate slashdot's moderation bias:

Linux + negative words = low score
Linux + positive words = high score
Mac + zealot = high score
Linux + zealot = low score
M$ = score of +infinity

The hard part would be getting it to score posts that present reality as -infinity.
Re:...moderation ideas.... by Anonymous Coward · 2004-05-19 15:04 · Score: 0

That's GNU/Linux, you insensitive clod!
Re:...moderation ideas.... by Trillian_1138 · 2004-05-19 15:37 · Score: 1

You must be new here...

-Trillian

(Sorry. Couldn't help it.)

Full text search goodness by vikman · 2004-05-18 17:07 · Score: 3, Interesting

Now we understand why Apple is so good at doing full text searches and filesystem wide searches. I wish we had the same type of search functionality in Mozilla that Mail.app boasts of.
That is the one feature that Mozilla's mail client really could use.

--
--

n-space by Anonymous Coward · 2004-05-18 17:09 · Score: 5, Funny

Each document is in turn represented by a long string of numbers, one for each word in the corpus. In mathematical terms, we would say that every document is a vector of n numbers or a point in a space with n dimensions. This coordinate is then mapped onto a unique position in the goatse.cx photograph. If it lands in an objectionable region, the message is discarded as spam.

It's an interesting method, but not having Mail.app myself, what I'm wondering is how well it works on the border regions; that is, when it is just barely objectionable. Say, on his leg.

Re:n-space by Anonymous Coward · 2004-05-18 19:16 · Score: 0

I don't know the answer to your question, but personally, with all the spam I get, I'd be rather afraid to see a graph of all of my email messages plotted as points on the goatse guy.

Re:Kinda like Mozilla Mail? by BWJones · 2004-05-18 17:10 · Score: 5, Informative

In fact, I'd be willing to bet that it's just another Bayesian e-mail filter with maybe a few extra bells and whistles.

Actually, data clustering algorithms are completely different beasts than a standard Bayesian analysis. Do a search on k-means clustering or ISODATA clustering methods to see what I mean. However, if you are referring to a Bayesian cluster analysis (like those implemented for genetic analysis of microarrays) then you might be correct, though only for reasons you might not intend.
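
For anyone who doesn't want to search, here is a bare-bones sketch of k-means (Lloyd's algorithm). It is a generic textbook version, not a claim about what Mail.app runs; seeding the centroids with the first k points is a simplification chosen to keep the example deterministic, and the sample points are made up.

```python
import math


def kmeans(points, k, iterations=10):
    """Lloyd's algorithm; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Hypothetical 2-D points forming two obvious groups.
points = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3), (4.8, 5.2)]
centroids, assignment = kmeans(points, k=2)
```

Note that no labels and no probabilities are involved: the algorithm only alternates between assigning points to the nearest centroid and recomputing the centroids.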

--
Visit Jonesblog and say hello.

GD, RTFA! by Zen+Programmer · 2004-05-18 17:10 · Score: 5, Informative

If you had read the article, you would know it uses vector representation and latent semantic analysis, not Bayesian filters, which in the words of the author, "are essentially weighted keyword systems."

how does it compare to Bayesian? by the+quick+brown+fox · 2004-05-18 17:11 · Score: 5, Interesting

Is there any hard data out there that shows the cluster analysis actually improves on the better Bayesian algos out there? After all, most of the good ones also achieve the 98%+ that this article cites.

According to the FAQ of SpamBayes (I think), they're always getting suggestions of ways to tweak their algos that would "obviously" improve the result, but in almost every case it either makes no difference or hurts accuracy, when actually tested on real data.

Re:how does it compare to Bayesian? by turkmenistani · 2004-05-18 17:15 · Score: 2, Interesting

But, like the article mentions, what happens when your grandma sends you an email mentioning viagra? Traditional Bayesian algorithms would automagically flag it as spam and delete it. The problem with traditional spam filters is that they might block all incoming spam, but they might also block something you might have wanted to read.
Re:how does it compare to Bayesian? by jcr · 2004-05-18 17:19 · Score: 2, Interesting

Bayesian filtering is a subset of what LSM can do. If you get to WWDC this year, find Kim Silverman and ask him to explain it to you.

-jcr

--
The only title of honor that a tyrant can grant is "Enemy of the State."
Re:how does it compare to Bayesian? by lupin_sansei · 2004-05-18 17:21 · Score: 2, Interesting

No they wouldn't. Bayesian filters would see the word "viagra" and give that a high spam score, but all the other words that your Aunty used would probably have a very high ham score (not spam). Thus it would probably score the entire email as ham.

That's the great thing about Bayesian filters, they score the entire email not just look for single keywords.

--
http://www.perthonline.net
Re:how does it compare to Bayesian? by inburito · 2004-05-18 17:23 · Score: 5, Funny

Wow. If your grandma is suggesting viagra to you, I think your problems go way deeper than Bayesian misfirings..
Re:how does it compare to Bayesian? by the+quick+brown+fox · 2004-05-18 17:27 · Score: 3, Informative

That actually tends not to happen. Most Bayesian filtering packages are weighted very conservatively, so that one or two highly non-spam tokens (like your grandma's e-mail address, or the name of the uncle who is on the little blue pill) will more than counterbalance the spam tokens.
Again, what's intuitive doesn't play out in practice... this seems to be a common theme in the world of statistical spam filtering. For example, you'd think the word "free" would be pretty spammy... in my corpus, it only gets a score of .406 (where 0 is least spammy and 1 is most spammy, and an e-mail must have an aggregate score of .9 to be classified as spam). On the other hand, "sir" gets .945 and "madam" gets .987.
Re:how does it compare to Bayesian? by SimplyCosmic · 2004-05-18 17:29 · Score: 5, Informative

Bayesian spam filtering doesn't mark an email as spam simply because of the presence of a single word; it uses a mathematical formula based on the likelihood that each of the words in the message is a symptom of spam. What you're talking about is simply a spam filter based on a blacklist of words. Bayesian spam filtering uses mathematics to consider how those words are used in the context of the rest of the message, and does a surprisingly good job of it.

Therefore, "viagra" in your grandmother's email might have a high indication of spamminess, but all the other words will lower the score below the rather high threshold needed to be considered spam.

That's why training your Bayesian spam filter on the email you receive is so important, as it learns what you consider spam from the type of email you receive.
Re:how does it compare to Bayesian? by the+quick+brown+fox · 2004-05-18 17:39 · Score: 2

Did you mean LSA, for Latent Semantic Analysis?
Anyway, yeah, I understand that. My question is whether, for the specific purpose of spam filtering, it results in improved performance, and if so whether it's been documented anywhere.
The clustering stuff is certainly interesting for other purposes, and I'm glad there are people out there not only writing the software, but integrating it into the OS. The graphic and industrial designers aren't the only smart people at Apple.
Re:how does it compare to Bayesian? by king-manic · 2004-05-18 18:03 · Score: 1

I'm sure if grandma is close enough to discuss viagra with you, she's likely on your safe list.

--
"There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy."
Re:how does it compare to Bayesian? by Ibanez · 2004-05-18 18:22 · Score: 2, Interesting

Actually, I saw this article and figured I could rant a little. I really am not impressed by it. I get 200 or so junk mail every week, and about a quarter of that gets through. And some of these to me seem really obvious. It doesn't really seem to learn anymore either. I've never had a false positive, which is pretty good, but I'd still love to find a way to implement a Bayesian filter in Mail.

Blake
Re:how does it compare to Bayesian? by wirelessbuzzers · 2004-05-18 18:36 · Score: 2, Interesting

It's pretty hard to compare algorithms, at least ones that might work, such as chi squared (SpamBayes) vs Bayesian (Plan for Spam, CRM114, lots more) vs point totals (SpamAssassin) vs cluster analysis (Mail.app).

As for implementations, CRM114 kicks the shit out of Mail.app's filter, at least on my and my roommate's mixes. About the only thing that CRM114 hasn't caught for me is those 1-line virus spams with a .zip attached, and new classes of spam (last week I received my first stock spam). The false positive rate is very low and generally confined to advertisements that I don't want to read, but are from other students over the house lists, or the like. I've been considering retraining those as spam anyway.

The author claims 99.984% filtering rate, which is higher than I get... but then, I don't get as much spam as he does, and I use whitelists, which are said to hurt the accuracy in favor of zero false positives from that segment.

--
I hereby place the above post in the public domain.
Re:how does it compare to Bayesian? by martin-boundary · 2004-05-18 19:06 · Score: 3, Informative

Bayesian filtering is a subset of what LSM can do.
I'm sorry, but that's just completely wrong. Whoever is propagating this deserves a slap on the forehead.
Bayesian theory is the most general possible form of rational decision making. *Any* rational method based on belief structures can be represented in a Bayesian form. This was shown by Richard Cox in 1946.
Here's an excerpt from this wikipedia article, to whet your appetite:
1. Divisibility and comparability - The plausibility of a statement is a real number and is dependent on information we have related to the statement.
2. Common sense - Plausibilities should vary sensibly with the assessment of plausibilities in the model.
3. Consistency - If the plausibility of a statement can be derived in two ways, the two results must be equal.

Any system of reasoning which satisfies those assumptions has a Bayesian version, and conversely. (Read the whole article if you want to argue edge cases).
So, if LSA (you wrote LSM?) works, then it's only to the extent that there's an underlying Bayesian model which makes it work.
Re:how does it compare to Bayesian? by rbright · 2004-05-18 19:42 · Score: 2, Informative

Furthermore, most Bayesian filters process headers as well, so the mail would be weighted heavily towards ham simply because it was from Aunt Emma and addressed directly to you.
Re:how does it compare to Bayesian? by ghamerly · 2004-05-18 21:25 · Score: 2, Informative

Your post is a bit misleading. It's true that the words are all considered together, but it's not true that they are considered "in context" in the sense that phrases are considered. The thing that makes Naive Bayes classifiers viable for most applications is that they are "naive", and do not consider phrases. Instead, each word is considered conditionally independent of every other word (conditioned on the class label, in this case spam or not spam). The "spamminess" of each word has an additive effect, and the phrase "Joe wants to buy viagra" (in a non-spam) is about equally spammy as "You should buy viagra" (in a spam).

Just wanted to clear that up. It may have been what you meant all along, but that's not what came through.
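
To make the "naive" part concrete, here is a small sketch of a Naive Bayes spam score. It assumes whitespace tokenization and add-one (Laplace) smoothing, and the four training messages are invented for the example; real filters differ in tokenization, smoothing, and how they combine scores.

```python
import math
from collections import Counter


def train(messages):
    """messages: list of (text, label) pairs with label 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def log_odds_spam(text, word_counts, label_counts):
    """Log prior ratio plus a sum of per-word log-likelihood ratios.

    Positive means spam is more likely. Each word contributes independently,
    so word order and phrases never enter the score.
    """
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    totals = {label: sum(word_counts[label].values()) for label in word_counts}
    score = math.log(label_counts["spam"] / label_counts["ham"])
    for word in text.lower().split():
        # Add-one smoothing so an unseen word does not zero out a class.
        p_spam = (word_counts["spam"][word] + 1) / (totals["spam"] + len(vocab))
        p_ham = (word_counts["ham"][word] + 1) / (totals["ham"] + len(vocab))
        score += math.log(p_spam / p_ham)
    return score


# Hypothetical training corpus.
training = [
    ("buy viagra now", "spam"),
    ("cheap viagra buy today", "spam"),
    ("joe wants to visit on sunday", "ham"),
    ("cooking tips from aunt emma", "ham"),
]
word_counts, label_counts = train(training)
```

Because the score is a plain sum, "joe wants to buy viagra" and "viagra buy to wants joe" get the same value: the additive effect described above, with no notion of context.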
Re:how does it compare to Bayesian? by NoOneInParticular · 2004-05-19 00:10 · Score: 4, Informative

You're absolutely right, but note however that what the grandparent calls 'Bayesian filtering' is referring to something that is more commonly known as 'naive Bayes': Bayesian inference with a set of extremely limiting assumptions. This technique is known in information retrieval as both the 'multinomial' and the 'multivariate' model of word frequency manipulation (which is which depends on how you store the evidence: only word occurrences or also word counts). In this sense, 'Bayesian filtering' is a very narrow subset of 'Bayesian inference' and it's completely possible, and even quite likely, that latent semantic analysis subsumes it.
Re:how does it compare to Bayesian? by Nuclear+Elephant · 2004-05-19 01:41 · Score: 3, Interesting

98% is pretty pathetic - 1 error in 50. Most good Bayesian filters (SpamProbe, CRM114, DSPAM) can reach at least 99.9% (1 error in 1000) with ease. Others can grow far beyond this and reach as high as 99.985%, as a recent slashdot article covered (and this one). I reset my stats a few weeks ago, and out of 1800 spams so far, 0 have made it through. The only problem with Bayesian filtering is that it's mismarketed by companies who insist they have a better solution (although it's less accurate).

And to answer your question - collaborative filtering, such as message inoculation works quite well at boosting accuracy even beyond the high levels of accuracy Bayesian filters are really capable of, whereas things like shared groups and such hurt it.
Re:how does it compare to Bayesian? by adamengst · 2004-05-19 02:59 · Score: 2, Informative

You can have Bayesian filtering in Mail, with SpamSieve from Michael Tsai.

You might also be interested in reading Joe Kissell's just-released ebook Take Control of Spam with Apple Mail, which explains the common accuracy problems with Mail's Junk filter and how to optimize it for better results. Joe also recommends SpamSieve as an alternative to Mail's Junk filter in those instances where Mail proves inadequate.

cheers... -Adam
Re:how does it compare to Bayesian? by jcr · 2004-05-19 06:38 · Score: 1

There's Latent Semantic Analysis, and Latent Semantic Modeling. Both are far off into Heavy Duty Math that a mere graphics/imaging guy like me has a hard time getting his head around.

-jcr

--
The only title of honor that a tyrant can grant is "Enemy of the State."
Re:how does it compare to Bayesian? by marmoset · 2004-05-19 08:38 · Score: 1

I use SpamAssassin and Apple Mail's filter in series (SA on the server, Apple Mail on the client) and between them they catch, well, everything.

I think the selling point of Mail is that they ship it in a fairly "well-trained" state, so it's catching that 98% out of the box.
Re:how does it compare to Bayesian? by Nuclear+Elephant · 2004-05-19 08:42 · Score: 1

Most good Bayesian spam filters can achieve 99%+ accuracy within a few days of training. Every time I blow my data away, I'm up to 99.x% within about 7-10 days. Seeded dictionaries only hurt accuracy over long periods of time.
Re:how does it compare to Bayesian? by jcr · 2004-05-19 17:12 · Score: 1

I'm sorry, but that's just completely wrong. Whoever is propagating this deserves a slap on the forehead.

Go read what I wrote again, and slap yourself on the forehead.

-jcr

--
The only title of honor that a tyrant can grant is "Enemy of the State."
Re:how does it compare to Bayesian? by turkmenistani · 2004-05-19 18:10 · Score: 1

Well, as the song by Dwight Latham and Moe Jaffe goes,
I'm my own grandpa

Nitpick on one of their recommendations by Logic+Bomb · 2004-05-18 17:11 · Score: 2, Insightful

You can also ask that your potential correspondents resend emails if they do not receive answers in a certain timeframe.

If the Junk Mail filter snagged a message the first time, it'll probably get it on subsequent tries too. If the message is legitimate, it probably can't be changed enough to make it through. It's a much better idea to check Junk Mail for legit messages and only empty it manually (or automatically for messages that are at least a week old).

Re:Nitpick on one of their recommendations by m1chael · 2004-05-18 17:40 · Score: 1, Funny

But now imagine two Apple users using Mail Filter...

--
I know you are psychotic, but please make an effort.
Re:Nitpick on one of their recommendations by displaced80 · 2004-05-18 20:09 · Score: 2

Simply add the person in question to your address book, and the Junk filter will let it pass. Also, there's a previous recipients list, which means Junk Mail won't filter out replies to mails you've sent.

--
What's the frequency, Kenneth?

Re:Kinda like Mozilla Mail? by Yaztromo · 2004-05-18 17:12 · Score: 2, Redundant

This spam filtering feature seems pretty similar to the one found in Mozilla Mail. In fact, I'd be willing to bet that it's just another Bayesian e-mail filter with maybe a few extra bells and whistles.

Actually, if you read the article it specifically states that Mail's spam filtering is not like Mozilla Mail's. You use it in much the same manner, but the underlying technology is completely different.

Yaz.

one thing missing from mail to make it perfect by Raleel · 2004-05-18 17:13 · Score: 1, Insightful

and it's not really mail. it's more iCal. iCal + exchange. as in, let me talk to exchange with ical. i'd love to get rid of entourage, the slowest mail client ever.

--
-- Who is the bigger fool? The fool or the fool who follows him? --

Re:one thing missing from mail to make it perfect by Anonymous Coward · 2004-05-18 23:58 · Score: 0

I haven't tried the software yet, because I'm unemployed and don't need to connect to an Exchange Server, but I intend to the moment I [finally] get a job. Check out this company:

http://www.snerdware.com/addressx/

Summary Service by spankalee · 2004-05-18 17:13 · Score: 4, Interesting

Wow, the article just turned me on to the Summary Service. And I just used it to read a short and sweet summary of the article.

If you haven't played with it select a bunch of text (in a Cocoa app) and select Summary from the Services menu.

Very cool...

Re:Summary Service by Mikey-San · 2004-05-18 17:37 · Score: 4, Funny

Input:
Wow, the article just turned me on to the Summary Service. And I just used it to read a short and sweet summary of the article.
If you haven't played with it select a bunch of text (in a Cocoa app) and select Summary from the Services menu.
Very cool...
Output:
Wow, the article just turned me on to the Summary Service. And I just used it to read a short and sweet summary of the article.
If you haven't played with it select a bunch of text (in a Cocoa app) and select Summary from the Services menu.

Wow, look at that! Impressive!
(I actually love Summary Service, but I couldn't resist that joke.)

--
Mikey-San
Karma: +Eleventy billion (mostly affected by watching Celebrity Jeopardy)
Re:Summary Service by MikeCapone · 2004-05-18 19:04 · Score: 1

Well, I'm actually curious.

Can anyone post the summary of the article here? I just want to see how good that thing is and I don't have a Mac.

--
Treehugger? Treehugger... Treehugger!
Re:Summary Service by krokodil · 2004-05-18 19:20 · Score: 1

for some reason it is always disabled for me. Any way to enable it?
Re:Summary Service by Halo1 · 2004-05-18 20:20 · Score: 1

It's impossible to post "the" summary, as you can select how terse you want it to be (between 1 and 100%). Additionally, you can choose whether it should use sentences or paragraphs as the basic block. Here's a sentence-based summary of the first page of "6%":
Interestingly enough, the technology that underlies the Junk Mail filter began its life as an information retrieval system, developed in the Apple labs to help users who managed thousands or millions of large documents find the one they were looking for easily.
...The main advantage of vector representation is that this technology does not rely on word order to do its work -- you can have a look at our speech article to learn more about why this is important.
...In fact, they do it so well that it is now at the center of many system components as we have seen, requiring them to continuously refine the calculations and develop the formal mathematical representations -- all for your benefit.

The 1% summary only returns the middle sentence above.

--
Donate free food here
Re:Summary Service by Ilgaz · 2004-05-18 20:43 · Score: 1

You must use a browser "happy" with OS X services. Nativity is the key.

e.g. Safari or OmniWeb, or many of the services will be grayed out.
Re:Summary Service by nikster · 2004-05-18 21:38 · Score: 2, Interesting

below is the default output:
In today's article of this three-part series, I'm going to fine-tune this strategy, plus take a closer look at Mail.app, so that you can more fully unleash its potential.

...Interestingly enough, the technology that underlies the Junk Mail filter began its life as an information retrieval system, developed in the Apple labs to help users who managed thousands or millions of large documents find the one they were looking for easily.

...The Apple data kit allows the user to find the single document that best represents each topic.

...The main advantage of vector representation is that this technology does not rely on word order to do its work -- you can have a look at our speech article to learn more about why this is important.

...So, a document that contains "Aunt Emma" and "cooking tips" at the beginning and the end of a page can well be in the same cluster as a text that talks precisely about "the time Aunt Emma sent you cooking tips."

...Imagine this: take the biggest issue you can find of the Mac Developer Journal and put it in your left hand, and put your favorite dictionary in your right hand.

...Let's say, for example, that your Aunt Emma, in her cooking tips, talks about a "hippopotamus" (as in "For the turkey to be tasty, it should be quite large but obviously, you don't want a hippopotamus-sized one.").

...If each document is a point in a X0,000-dimension space or so, we reduce its dimensionality into a small number of dimensions that capture the salient patterns and the majority of the variation in the corpus.

...Like we did before, you can perform a bit of cluster analysis and find clusters of documents that each represent a topic.

...Because words are distributed in the same space as documents, you can find the words that are closer to the center of a document cluster.

...Even though Apple is not the only company working on such technologies, they do seem to be the only ones to have made it so accessible to end users and powerful at the same time. In fact, they do it so well that it is now at the center of many system components as we have seen, requiring them to continuously refine the calculations and develop the formal mathematical representations -- all for your benefit.

...The other traditional approach is to look at the sender and not accept any message from any known junk-mail sender.
Re:Summary Service by Anonymous Coward · 2004-05-19 00:12 · Score: 0

And here's the default summary of this Slashdot thread (copied at +2 threaded; one can guess from this who my fans and friends are because they're all showing up at an additional +1). Obviously it will be far more effective on something written by one person on one subject.

fmorgan writes "O'Reilly has now posted the second part on an article about Mac OS X Mail.app spam filtering with more details on what this technology is (and isn't): 'Many myths have emerged about Mail's junk mail filter. ...It takes Eudora about 10 seconds to rebuild those big mailboxes(deleted messages aren't actually deleted until Eudora gets around to rebuilding the mailbox; you can set the limit based on percentage of the mailbox, raw MB, I think even % remaining disk space), or force it manually with one click in that mailbox's window. ...You usually have to start looking at things like lines, curvatures, intersections, texture patterns, etc. Once you decide tools you're going to use to describe an image and algorithms to calculate them, you can starting talking about how far away one image is from another, which then naturally leads to clustering techniques. ...Once you decide tools you're going to use to describe an image and algorithms to calculate them, you can starting talking about how far away one image is from another, which then naturally leads to clustering techniques. ...They've each been trained to recognize some of the spam, but their training is incomplete because only one of the 3 clients is trained for each message that comes in. The only way to make it consistent would be to move all of the junk message back into the Inbox and select them as junk in each mail client. ...The sender would just receive a message from the mail server saying that their mail was marked as spam, and that they should try again, or let me know by some other means. ...For instance, "baseball" and "bat" may emerge as common companions in some documents, so they might get weights of 1.0 for both (in one eigenvector/topic) if they always occur together - meaning a query for "bat" should always return hits for "baseball" too. 
...If you had read the article, you would know it uses vector representation and latent semantic analysis, not Bayesian filters, which in the words of the author, "are essentially weighted keyword systems." ...So its like you get email clusters in N dimensional space, where each axis is a word, and an emails position on the axis is the number of times that emails uses that word.

Re: Bayesian Filtering by Anonymous Coward · 2004-05-18 17:13 · Score: 2, Informative

The author is awfully dismissive of Bayesian filtering, which works extremely well for me and for lots of other people. See Mozilla, SpamAssassin, and others.

Re:Kinda like Mozilla Mail? by jcr · 2004-05-18 17:17 · Score: 4, Funny

I'd be willing to bet that it's just another Bayesian e-mail filter with maybe a few extra bells and whistles.

Umm, how much would you want to bet? I'll take that action!

-jcr

--
The only title of honor that a tyrant can grant is "Enemy of the State."

os x's mail filter is great by squarefish · 2004-05-18 17:18 · Score: 3, Interesting

but it's a whole lot better with junkmatcher central

--
Creationists are a lot like zombies. Slow, but powerful and numerous. And they all want to eat our brains.

Re:os x's mail filter is great by Anonymous Coward · 2004-05-18 19:43 · Score: 0

I get about 1000 spams daily. SpamAssassin on the server, along with Mail.app's spam filter + JunkMatcher on the client side, lets only about 10 of those 1000 spams into my inbox.

Apple spam by seanadams.com · 2004-05-18 17:22 · Score: 4, Interesting

I have marked every single announcement and special offer I've ever received from Apple as junk, and yet the filter still refuses to classify them as such automatically.

I wonder if there's a loophole here that spammers could take advantage of: masquerade as Apple using the hole they've left in their filter. Spam Mac users to your heart's content. Bundle a Mac virus along with it for extra damage.

Please don't mod this down just because you like Macs. I like Macs too, but it really looks like there is a back door in the spam filter and I'm just reporting it - not mac bashing.

Re:Apple spam by Hungus · 2004-05-18 17:27 · Score: 1

It's not a back door, it's a rule in your profile. Problem is, I don't think you can trivially delete it. It's the same reason they all get labeled blue.

--
Bad Panda! No Bamboo for you! In matters of importance ACs will not be responded to. Want to say something critical,OK
Re:Apple spam by k_187 · 2004-05-18 17:29 · Score: 3, Informative

There is. Apple puts in a rule by default that stops Mail from evaluating any mail from Apple. Well, there is in Panther; I don't know if you caught that or not, but that might fix your problem.

--
11 was a racehorse
12 was 12
1111 Race
12112
Re:Apple spam by timgoh0 · 2004-05-18 17:31 · Score: 5, Informative

This behaviour is due to the rules set up in Apple Mail. To disable it, go to the Mail preferences, select Rules, and remove the entry "News from Apple".
Re:Apple spam by .com+b4+.storm · 2004-05-18 17:35 · Score: 4, Informative

Did you check your "rules" preferences? Mail.app by default includes a rule to "Stop evaluating rules" for mail from a whole host of Apple e-mail addresses. I've never tried deleting it to see if I can get Apple mail to get filed as spam because... well, they e-mail me maybe twice a year and it's always been worth reading. But you might want to check out that rule, it could be what's fouling you up.

--
"Wow, you're like some kind of superhero able to ward off happiness and success at every turn."
-- Ryan Stiles
Re:Apple spam by r3dx0r · 2004-05-18 17:39 · Score: 1

there's a 'news from apple' rule in mail.app by default which sets a different color and prevents other rules from being evaluated. maybe this applies to junk mail too, so you may want to turn this one off.
Re:Apple spam by __aadidx2690 · 2004-05-18 17:41 · Score: 1

Hmmm... I have never seen this problem, but then again I've always just opted out of Apple's announcements and special offers as I don't like people stuffing my email box.
I do seem to recall, however, that in some earlier versions of Mail.app there were rules for highlighting mail from Apple. I don't see those in the latest version of Mail, but maybe that's because I deleted them as soon as I discovered that they were there.
- Nåff
Re:Apple spam by simon_c_heath · 2004-05-18 17:46 · Score: 1

Mail has a default rule for News from Apple which bypasses the Junk mail filter (or at least, seems to, as it has a 'stop processing rules' action). This can, however, easily be turned off. Go to Preferences->Rules and deselect the Rule 'News from Apple'. Cheers, Simon
Re:Apple spam by Libraryman · 2004-05-18 17:46 · Score: 2, Informative

There could be a back door in the spam filter, but I have another [slightly] less sinister possibility.
Mail.app ships with a preset filtering rule to color-label messages from Apple in blue. The junk filter may be set not to act on messages which are already being filtered (colored, flagged, moved to a specific folder) by one of your rules. Try deleting the rule to colorize the mail from Apple and see if it starts junk filtering it.
Also worth noting: Apple will remove you from its mailing lists. Any email from them includes links/instructions to do this.
Re:Apple spam by General+Sherman · 2004-05-18 18:09 · Score: 1

Actually, there's a rule built in that overrides the Junk filter. Go into Preferences -> Rules and you'll see a special Apple rule. Remove it if you wish, but next time look before you yell at Apple for making a "loophole". You can just delete it.

Next time, look before you bash.

--
- Sherman
Re:Apple spam by SilentChris · 2004-05-19 01:36 · Score: 1

Twice a year? I get a receipt from them every time I buy a single iTune, and they continually put me back on the new music emails.

No, don't bother me if I've spent .99 and no, I don't want to hear about your Britney Spears "exclusive". And to those who say "well, just buy a bunch of songs in a given day, so the receipt tallies them all up at once", sorry. The whole point of iTunes is to let me buy a song I just remembered from years ago, not write them down in a list and save them until I have enough for a decent receipt.
Re:Apple spam by .com+b4+.storm · 2004-05-19 17:17 · Score: 1

Twice a year? I get a receipt from them every time I buy a single iTune, and they continually put me back on the new music emails. No, don't bother me if I've spent .99

I really just meant Apple newsletters and the like - things most people might consider spam. Every online store sends e-mail receipts, and a lot of people (myself included) keep them for record of online transactions.

I'm not sure what the big deal is - it's not as if you can't easily make a rule to throw anything with the matching subject or source address into the trash. Receipts are not spam to most people. In the time it took you to type your rant here, you could have added a rule to delete them if they really bothered you.

--
"Wow, you're like some kind of superhero able to ward off happiness and success at every turn."
-- Ryan Stiles

Sounds sufficiently different to me by Anonymous Coward · 2004-05-18 17:25 · Score: 5, Interesting

Actually, from my understanding of it, it's fairly different.

I thought Mozilla used Bayesian filtering (which you've mentioned), where each word in the email gets assigned a probability factor of being spam. These factors are totaled at the end; if the total is greater than some predefined value, the message is flagged as spam.

What this does (in my understanding) is count the number of occurrences of each word in every email, and store that in a huge table. Then it relates messages to each other based on these word counts. So it's like you get email clusters in N-dimensional space, where each axis is a word, and an email's position on an axis is the number of times that email uses that word. Then the clusters that have a lot of spam in them are marked as spam clusters. All the emails in such a cluster are then assumed to be spam.

The advantage of this method, I would suppose, is twofold:

A) When you reduce the N-dimensional space, you would start by eliminating noise words (i.e. words that only occur in a single email). Spam emails that put fake words in to lower their spam probability under the Bayesian method would not benefit with this method.

B) Messages are grouped by content, so it's possible that the client could group email by a common subject, kind of like automatic intelligent sorting. They do mention that this technology can be used to generate email summaries. So (in theory) not only could spam be sorted out, but so could any other key topics, like work, relatives, viagra purchases...

At least that's my understanding of it.
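
To make the "each axis is a word" picture concrete, here is a minimal sketch in Python. It is not Mail.app's code: the function names, the toy messages, and the choice of cosine similarity against a per-cluster sum of counts are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Map a message to its word-count vector; word order is discarded."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Sum the count vectors of a cluster (cosine ignores the overall scale)."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def classify(message, clusters):
    """Return the label of the cluster whose centroid is closest in angle."""
    vec = word_counts(message)
    return max(clusters, key=lambda label: cosine(vec, clusters[label]))

# Hypothetical training data, already labeled by the user.
spam = ["cheap pills buy cheap pills now", "buy pills online cheap"]
ham = ["aunt emma sent cooking tips", "cooking tips for the turkey from aunt emma"]
clusters = {
    "spam": centroid(word_counts(m) for m in spam),
    "ham": centroid(word_counts(m) for m in ham),
}

label_a = classify("tips on cooking from emma", clusters)
label_b = classify("buy cheap pills", clusters)
```

Because only the counts are kept, "emma sent tips" and "tips sent emma" land on exactly the same point, which is the word-order independence the article talks about. A message that shares no words with any cluster scores zero everywhere and falls back to the first label, so a real filter would need an "unknown" case.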

Re:Sounds sufficiently different to me by arcus · 2004-05-18 18:05 · Score: 2, Informative

A) When you reduce the N dimensional space, you would start by eliminating noise words (ie words that only occur in a single email). Spam emails that put fake words in to lower their spam probability in the bayesian method would not benefit with this method.
The method described by Paul Graham only looks at a handful of the most 'interesting' words in the mail ('interesting' meaning tending to yield a high probability of being either spam or ham). Adding lots of random words could mean that the spammer lucks out and gets words that happen to appear a lot in your regular mail, but what's rather more likely is that the 'interesting' words will be things like 'viagra', and the random words will simply be ignored. Bayesian sorting isn't necessarily particularly vulnerable to random words.
What would tend to defeat Graham's filter more would be inconsistent spelling of key words, i.e. v1agra, v|agra, V!agr4 or whatever. Perhaps other Bayesian filters are cleverer.
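
A rough sketch of the "most interesting words" idea, in Python. This is a simplified reading of Graham's published scheme, not his exact code: it skips his doubling of good-mail counts and his minimum-occurrence cutoff, and the function names and toy corpus are made up for illustration. The 0.01/0.99 clamps, the 0.4 score for unseen words, and the top-15 cutoff follow his description.

```python
from collections import Counter

def token_probs(spam_msgs, ham_msgs):
    """Per-word spam probability from counts, clamped to [0.01, 0.99]."""
    spam_counts = Counter(w for m in spam_msgs for w in m.lower().split())
    ham_counts = Counter(w for m in ham_msgs for w in m.lower().split())
    probs = {}
    for w in set(spam_counts) | set(ham_counts):
        b = spam_counts[w] / len(spam_msgs)
        g = ham_counts[w] / len(ham_msgs)
        probs[w] = min(0.99, max(0.01, b / (b + g)))
    return probs

def spam_score(message, probs, top_n=15, unknown=0.4):
    """Combine only the top_n tokens whose probability is farthest from 0.5."""
    tokens = set(message.lower().split())
    ps = sorted((probs.get(t, unknown) for t in tokens),
                key=lambda p: abs(p - 0.5), reverse=True)[:top_n]
    prod_spam, prod_ham = 1.0, 1.0
    for p in ps:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)

# Hypothetical training mail.
probs = token_probs(
    spam_msgs=["buy viagra now", "viagra cheap"],
    ham_msgs=["lunch meeting tomorrow", "see you at lunch"],
)

padded = spam_score("viagra xqzt plorf wibble", probs)  # random-word padding
normal = spam_score("lunch tomorrow", probs)
misspelt = spam_score("v1agra", probs)                   # unseen spelling
```

With this toy data the padded message still scores well above 0.9, because the unseen filler words sit at 0.4 and barely move the result, while the misspelt "v1agra" is just another unseen word and scores 0.4. That matches the two points above: random padding is mostly harmless to the filter, inconsistent spelling is the bigger problem.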
Re:Sounds sufficiently different to me by Anonymous Coward · 2004-05-18 18:39 · Score: 0

I'm not so sure you get so much advantage with this in this context.

A) When using Bayesian, you can also ignore new words instead of weighing them as neutral. This is just an optimization.

B) Bayesian can also have many groups. For example, you can mark emails/documents as spam, ham, job, home, friends, etc. You will have a database of word-counts and weights for each group.

But while Bayesian is lightweight, this is a more complex algorithm. You will use more memory and CPU to do essentially the same thing. Bayesian also seems to scale a lot better with volume.

Bayesian seems like a subset of this clustering algorithm. The clustering approach will be able to do much more complex tasks than Bayesian, but spam filtering is not really that complex. I would prefer a simple algorithm on a server.
Re:Sounds sufficiently different to me by CowboyBob500 · 2004-05-18 23:19 · Score: 1

Sounds very similar to Active Shape Models. If you've never heard of these, they are very cool things that allow you to classify objects in N dimensional space and then apply probability rules to find likely matches. I used them working on a project to convert speech to sign-language (amongst other techniques).

Bob

--
Listen to my latest album here

It's Cyberdog! by Blackbrain · 2004-05-18 17:26 · Score: 2, Interesting

Apple has finally brought Cyberdog back!

Kickin it Apple Old School.

--
Where would we be if Wheel had hid her round rock in a cave instead of showing everyone how it rolls?

Re:It's Cyberdog! by itwerx · 2004-05-19 04:49 · Score: 1

You Slashdotted CyberDog!
You bastard! :)

vs bayesian filters ? by Bugmaster · 2004-05-18 17:27 · Score: 2, Informative

How does this technology compare to Bayesian filters such as POPFile? POPFile was not made by Apple, so clearly it doesn't have the cult appeal, but it has been working flawlessly for me for about a year now. What really irks me about this article is how it implies that Apple invented trainable filters -- where, in reality, this is very far from the truth. Apple does the same thing with pretty much everything it sells... sort of like Soviet Russia, who claimed to have invented flight, radio, transistors, and probably elephants too.

--
>|<*:=

Re:vs bayesian filters ? by diamondsw · 2004-05-18 17:59 · Score: 2

RTF... Oh yeah, this is Slashdot. Nevermind...

--
I don't know what kind of crack I was on, but I suspect it was decaf.
Re:vs bayesian filters ? by Anonymous Coward · 2004-05-18 18:13 · Score: 0

I had a look at the article and it seems Bayesian gives more bang for the CPU time and memory. Bayesian is very simple and fast, with only one big database of word counts, while this is trying to match and index documents against each other to attempt to detect the "spamminess". The article seemed a bit non-technical regarding details though.

What I have found superior is SpamAssassin's rules combined with the built-in Bayesian filter. If you set up a cron job to run the manual learning program (sa-learn) on your mail folders, you're pretty well set up. I have yet to see a mail labelled falsely, and only a few mails per week slip through the filter.

Re:Kinda like Mozilla Mail? by Anonymous Coward · 2004-05-18 17:30 · Score: 2, Funny

Reading that has clearly shown me for the first time why my friends/family complain when I talk technical about chemistry to them.

And I thought I spoke English!

Hmmm. Document visualization by mveloso · 2004-05-18 17:34 · Score: 3, Insightful

I wonder if that data is accessible by 3rd parties. You could make "mail maps" that let you visualize the clustering of your incoming messages, and you could actually see the spam...by looking at the outliers and noise.

In fact, you could do this with any large data set. How about the feds looking for anomalous chunks of data in the bitstream? Anomalous stuff would just pop out, literally. This would make the TSA's job much, much easier. How about that?

Re:Hmmm. Document visualization by Anonymous Coward · 2004-05-18 20:27 · Score: 0

You could make "mail maps" that let you visualize the clustering of your incoming messages, and you could actually see the spam...by looking at the outliers and noise.

You could make "mail maps" that let you visualize the clustering of your incoming messages, and you could actually see the email you wanted in the first place...by looking at the outliers and noise.
Re:Hmmm. Document visualization by Anonymous Coward · 2004-05-19 01:15 · Score: 0

This was actually what the Total Information Awareness project was attempting to do, before its unfortunate Orwellian name got it shut down: take all the data collected by a whole bunch of federal agencies and apply this sort of analysis to locate patterns...
Re:Hmmm. Document visualization by Anonymous Coward · 2004-05-19 03:18 · Score: 0

Why limit it to Mail? It could be any collection of documents. But how would you visualise it? I guess as linked nodes in 3D, with the distance between them reflecting inter-document similarity and colour to highlight relevance. If you could then click on a node to get a summary of the document, I think you would have a very useful application. Or has it already been done?

Re:Dear Apple by Aqua+OS+X · 2004-05-18 17:36 · Score: 0, Offtopic

Ya, this is off topic trollish flame bait... and I am an OS X user...

nevertheless, I still laughed at this ;)

--
"Things are more moderner than before- bigger, and yet smaller- it's computers-- San Dimas High School football RULES!"

It's not 'underpants gnomes?'... by ErnstKompressor · 2004-05-18 17:36 · Score: 1

I thought it was those underpants gnomes all this time...

--
We apologise for the fault in this post. Those responsible have been sacked. -- Signed RICHARD M. NIXON

Re:Kinda like Mozilla Mail? by DrSchlock · 2004-05-18 17:39 · Score: 5, Informative

This spam filtering feature seems pretty similar to the one found in Mozilla Mail. In fact I'd be willing to bet that it's just another Bayesian e-mail filter with maybe a few extra bells and whistles.

Not exactly Bayesian, no. It's a different kind of document classification algorithm, which the article calls Latent Semantic Analysis. Basically they represent each message as a point in a high-dimensional space (based on the unordered words in the document), and figure out which parts of the space tend to be occupied by spam e-mails. This involves quite a lot of computation to determine a likely boundary between the parts of the space representing spam and non-spam messages, given only a collection of labeled points.

To make this train and run reasonably quickly, they have to do dimensionality reduction on the space: they collapse dimensions which tend to be correlated or redundant or useless. (If "teens" and "gushing" generally appear together in messages, they probably don't need two separate dimensions; if "hi" is equally likely to appear in spam and non-spam, it may not need a dimension at all.)

A naive-Bayes classifier is much simpler: Assuming that the probabilities of words in a document are all independent, it selects the document type (spam or non-spam) that maximizes the total probability of the observed words. There's no training beyond counting how often each word occurs with each document type.

Naive Bayes typically works nearly as well as more complex methods, and runs much faster. But presumably Apple feels their LSA implementation is fast enough, and sufficiently more accurate than simpler techniques to be worthwhile.
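
For comparison, the naive-Bayes side really is just counting. The sketch below is a generic textbook version in Python, not any particular mail client's implementation; the class name, the add-one smoothing, and the toy messages are assumptions for illustration, and it assumes both labels have been trained at least once.

```python
import math
from collections import Counter

class NaiveBayes:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        """Training is only counting words per document type."""
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def classify(self, text):
        """Pick the label maximizing log P(label) + sum of log P(word | label)."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(counts.values())
            for word in text.lower().split():
                # Add-one smoothing so unseen words do not zero out the product.
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayes()
nb.train("buy cheap pills now", "spam")
nb.train("cheap pills online", "spam")
nb.train("aunt emma sent cooking tips", "ham")
nb.train("cooking tips for turkey", "ham")

guess_spam = nb.classify("cheap pills")
guess_ham = nb.classify("cooking tips")
```

Working in log space avoids underflow on long messages. There is no geometry and no matrix here, which is where the speed advantage over the vector-space approach comes from.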

Crystal clear ... erm ... by Too+Much+Noise · 2004-05-18 17:40 · Score: 4, Insightful

Then, we can do the Latent Semantic Analysis. In this new space, each axis is a weighted combination of all the words: documents and words coexist in the same space.

ok, got it - get a sparse point distribution, scrap the biggest common null subspace you find for the word matrices, then do some rotation to get meaningful combinations of these words ... or something (lexical analysis).

(further down ...)

Of course, systems that rely on such keywords are continuously updated and refined. Nevertheless, they are never entirely satisfying, even when using sophisticated Bayesian filters that are essentially weighted keyword systems.

so, weighted keyword systems (in particular Bayesian filters) are not so cool. Erm ... wait a minute, WTF???

ok, maybe this vector approach is something entirely new and leaves existing methods in the dust. But this article seems to be doing a relatively poor job at explaining why.

Re:Crystal clear ... erm ... by prockcore · 2004-05-18 18:19 · Score: 1

ok, maybe this vector approach is something entirely new and leaves existing methods in the dust. But this article seems to be doing a relatively poor job at explaining why.

It does a poor job of explaining exactly what is going on. I've read it, and I'm still convinced that they basically just implemented a neural net. The exact same stuff is used in OCR.
Re:Crystal clear ... erm ... by Anonymous Coward · 2004-05-18 18:19 · Score: 1, Funny

It's Apple. Gotta be good.

You know Apple INVENTED spamfiltering don't you? ;-)
Re:Crystal clear ... erm ... by martin-boundary · 2004-05-18 19:28 · Score: 2, Informative

so, weighted keyword systems (in particular Bayesian filters) are not so cool. Erm ... wait a minute, WTF???
ok, maybe this vector approach is something entirely new and leaves existing methods in the dust. But this article seems to be doing a relatively poor job at explaining why.

Well, the article explains it very poorly, but the approach isn't that new. Look up cluster analysis on Google.
Latent Semantic Analysis broadly works as follows:
First, you plot all documents as points in space, by using each word as an independent coordinate. So if you have 100,000 unique words in all documents, then you plot in 100,000 dimensional space.
Second, you compute the principal eigenvectors for the matrix of all the documents, viewed as columns. This gives you a partial new coordinate system in the 100,000 dimensional space. You don't compute those eigenvectors whose eigenvalue is too small, it's just a waste of effort.
Finally, each document is dotted with each eigenvector, obtaining the representation of the document in the eigen-coordinate system. This now tells you, for each document, how much it resembles each eigenvector. The eigenvectors represent "concepts", and the mix of eigenvectors used for each document represents the mix of "concepts" in the document.
Roughly speaking, that's what LSA does, modulo the devil in the details and speed/memory optimization.
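
The three steps above can be sketched in plain Python. Assumptions, loudly: "eigenvectors for the matrix of documents" is read here as the leading eigenvectors of A·Aᵀ, where A has one column per document (equivalently, the left singular vectors of A); they are found with simple power iteration plus deflation rather than a real SVD routine; the six-word vocabulary and the documents are toys; and k must not exceed the rank of the data. None of this is taken from Apple's implementation.

```python
import math

VOCAB = ["baseball", "bat", "glove", "turkey", "oven", "gravy"]

def vectorize(text):
    """Step 1: word-count vector over VOCAB, one coordinate per word."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def top_concepts(docs, k, iters=200):
    """Step 2: k leading eigenvectors of A*A^T via power iteration."""
    dim = len(docs[0])
    concepts = []
    for _ in range(k):
        v = normalize([1.0] * dim)
        for _ in range(iters):
            # Multiply by A*A^T without forming it: sum of (doc . v) * doc.
            w = [0.0] * dim
            for doc in docs:
                c = dot(doc, v)
                for i, x in enumerate(doc):
                    w[i] += c * x
            # Deflation: remove components along concepts already found.
            for u in concepts:
                p = dot(w, u)
                w = [wi - p * ui for wi, ui in zip(w, u)]
            v = normalize(w)
        concepts.append(v)
    return concepts

def project(vec, concepts):
    """Step 3: dot the document with each eigenvector."""
    return [dot(vec, u) for u in concepts]

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

docs = [vectorize(t) for t in [
    "baseball bat", "baseball glove", "baseball bat glove",
    "turkey oven", "turkey gravy",
]]
concepts = top_concepts(docs, k=2)

query = vectorize("bat")
sports_doc = vectorize("baseball glove")
cooking_doc = vectorize("turkey oven")

raw_overlap = dot(query, sports_doc)  # the two share no words at all
same_topic = cosine(project(query, concepts), project(sports_doc, concepts))
other_topic = cosine(project(query, concepts), project(cooking_doc, concepts))
```

In the raw word space "bat" and "baseball glove" have nothing in common, but after projection they point almost the same way, because both load on the "sports" eigenvector; the cooking document ends up nearly orthogonal. Dropping the small-eigenvalue directions is the dimensionality reduction: six word axes become two concept axes.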
Re:Crystal clear ... erm ... by Anonymous Coward · 2004-05-19 01:06 · Score: 1, Insightful

Actually, this is nothing new at all. It is roughly performing a feature transformation on a data set, something that's been done with multimedia data for the purposes of conducting nearest neighbor searches for years now.

Personally I favor Bayesian filters, as high-dimensional vector calculations eventually become too unwieldy, no matter what kind of beefy system you have.
Re:Crystal clear ... erm ... by NaugaHunter · 2004-05-19 02:06 · Score: 1

Mail uses matrices built from word count vectors and does some funky math, which is hard to represent in a sentence. Others do 1-dimensional weighted word counts along the lines of "15 x Viagra; 6 x Amazing; 8 x Deal", assign the message a score, and handle it. They compare word counts against acceptable count thresholds. Mail compares emails directly to each other and says, effectively, "This email is 1.2% different from the main matrix of known Junk, so it is Junk as well."

--
R: That voice. Where have I heard that voice before? B: In about 365 other episodes. But I don't know who it is either.

Missing functionality by nsayer · 2004-05-18 17:40 · Score: 4, Interesting

Here's the problem I have with mail.app's spam filtering:

I have several macs, and an IMAP server. The simple fact is that Mail.app doesn't share the filtering database. So the training winds up being sort of haphazard.

I suppose I should designate a particular machine to be the spam filtering IMAP client and have the rest of them not participate, but then I can't train on those subservient machines.

It'd be much better if multiple Mail.app IMAP clients could store their database on the server and share it.

Re:Missing functionality by repetty · 2004-05-18 17:56 · Score: 1, Insightful

"I have several macs, and an IMAP server. The simple fact is that Mail.app doesn't share the filtering database."

No, that's a bad idea. Your case is unique because you are specifying that just one user uses a bunch of computers, but the general principle you are advocating completely ruins the premise of adaptive filtering.

Suppose we're sitting in an office... You don't want to see penis enlargement ads but I love 'em. How is that big server-level database of yours supposed to work?

Bad idea.

--Richard
Re:Missing functionality by LMariachi · 2004-05-18 18:00 · Score: 1

Have you tried sharing the trained LSMMap2 and replacing other clients' versions with a symlink to the shared one? Don't know if this will work, but it's worth a shot.
Re:Missing functionality by Anonymous Coward · 2004-05-18 18:21 · Score: 0

things can still be stored centrally on a per user basis
Re:Missing functionality by ezthrust · 2004-05-18 19:12 · Score: 3, Informative

There might be something of use for you in this thread on macosxhints.com
http://www.macosxhints.com/article.php?story=20030320162436823
Although there is a warning that once this is done, Mail stops learning.
Re:Missing functionality by n8_f · 2004-05-18 19:12 · Score: 3, Informative

How is that big server-level database of yours supposed to work?

Uhh, how do you get any mail that he doesn't? The data would be stored in one of the user's mail folders, just like an attachment. You completely misunderstood the parent poster. He accesses the same IMAP account from multiple different machines, but he has to train each one of his clients FOR THE SAME ACCOUNT. So he gets 10 messages to homer@doh.com and his machine at work filters out message 1 and 2. He gets home, and his client filters out message 7. His laptop filters out message 9. They've each been trained to recognize some of the spam, but their training is incomplete because only one of the 3 clients is trained for each message that comes in. The only way to make it consistent would be to move all of the junk messages back into the Inbox and select them as junk in each mail client. Pretty crappy. And it gets unsalvageable when you mark a message as Not Junk on client 2 that client 1 marked as Junk. I have the same issue. I just leave my home client running most of the time, so it handles all of the filtering as new messages come in, and then I mark the ones it missed when I get home. But the parent is right, Mail should just store it on the IMAP server.

Which brings up an interesting point. I tend to store all of my notes on my personal IMAP server as drafts, so I can get to them anywhere. Why don't any programs use IMAP to store data? Can you not access messages at a byte level, but only as whole messages? I haven't looked at the IMAP protocol. Could it be combined with WebDAV for a unified data store? I would love to have a server that allowed me to keep all of my e-mail, documents, contacts, etc. in one place that I could access from anywhere.
Re:Missing functionality by kiddygrinder · 2004-05-18 19:15 · Score: 1

POPFile allows all your computers to run their email through the same database; maybe you should try it.

--
This is a joke. I am joking. Joke joke joke.
Re:Missing functionality by StrawberryFrog · 2004-05-18 19:42 · Score: 1

"Mail.app doesn't share the filtering database - No that's a bad idea

Isn't one of the ways in which spam differs from regular email that it's the same message (in the same or similar text) sent to very many people? Can't this fact be exploited by comparing messages sent to multiple users?

Suppose we're sitting in an office... You don't want to see penis enlargement ads but I love 'em

Then my job is much safer than yours.

--
My Karma: ran over your Dogma
StrawberryFrog
Re:Missing functionality by Anonymous Coward · 2004-05-18 21:00 · Score: 0

Have you suggested this to Apple?

Don't wish they would support it, go to and fill in the form!

Every suggestion I have made has been done, and they are always looking for ways to improve their products.
Re:Missing functionality by Anonymous Coward · 2004-05-18 23:57 · Score: 0

You might want to try creating a spam folder on your IMAP server and every once in a while dumping spam into it (autodumping via filters would accumulate messages too quickly). Then go into that folder on each Mac. Mark un-marked messages in this folder as Junk, and bingo, you've got a somewhat unified training database.

Fast?!? by SuperBanana · 2004-05-18 17:44 · Score: 4, Interesting

With Altivec, no wonder Mail is so damned fast.

Sorry, but I couldn't let this one slide. You've obviously got a special interpretation of "fast", because I tried migrating my Eudora mailboxes to Mail, on a 1Ghz Powerbook G4.

Mail CHOKED on them. The early version of Mail chugged for two-something hours and I gave up and killed it. The latest version was slightly better; 1000 messages or so still took well over 10 minutes. It takes Eudora about 10 seconds to rebuild those big mailboxes (deleted messages aren't actually deleted until Eudora gets around to rebuilding the mailbox; you can set the limit based on percentage of the mailbox, raw MB, I think even % remaining disk space), or force it manually with one click in that mailbox's window. My inbox is 820, and several mailing list boxes are well over 5,000 if I forget to clean them out. I have hundreds of MB of mail, and Eudora handles most operations with little performance hit no matter how big the mailbox gets (there is a limit of around 32,000 messages however, which someone I know hit).

But that was just the importing- then it had to thread them or something, and THEN it had to index them all, both of which it did in the background, but still took forever.

Searching? Well, ok, it's "better" than Eudora in that it gives relevancy and Eudora is an on/off sorta deal, but that's fine- and I prefer 1 second for an exact search in a 2,000 message mailbox over 5-10 seconds for a fuzzy search.

Sorry, but Eudora, despite being a lumbering dinosaur technology-wise (MIME support is broken, PGP-MIME just doesn't work right, and no address book integration is another thing that really irritates me), is just plain hands-down the fastest mail client around.

The MBOX-with-index format also works exceedingly well, is portable (although some minor massaging with text-processing tools may be needed in some cases), and hard to corrupt- unlike almost every other mail client's DB (especially outlook). I've used Eudora for ten years, and never lost a single message except for one early beta version which munged a mailbox on me.

--
Please help metamoderate.

Re:Fast?!? by pHDNgell · 2004-05-18 18:03 · Score: 4, Interesting

Sorry, but I couldn't let this one slide. You've obviously got a special interpretation of "fast", because I tried migrating my Eudora mailboxes to Mail, on a 1Ghz Powerbook G4.

Mail CHOKED on them.

Everyone's got a story and a counter-story. I've got over 100,000 messages in IMAP (101,269 as of last night, but it goes up and down), fully synced to Mail.app (bodies and attachments) indexed for searching, and used every day. It's split over 250 mail boxes (one for each month I've sent or received email as long as I've been keeping stuff).

It's amazingly fast. It makes my mail server seem fast (Sun IPX running SunOS 4.1.4 with a custom cyrus IMAPd that supports compressed mail stores and LDAP and some other stuff).

(Sorry for all the parentheticals. :)

--
-- The world is watching America, and America is watching TV.
Re:Fast?!? by Rosyna · 2004-05-18 18:06 · Score: 2, Interesting

Uhm, I've got about 5 mailboxes that have hit this 32760 message limit (dunno why but they recently reduced it to 32000).

My Mail folders contain 2.31gigs of email. Mail cannot handle this and chokes on it horribly. Eudora handles it like a champ. Too bad its junk mail filter sucks.
Re:Fast?!? by Alan · 2004-05-18 18:19 · Score: 5, Funny

Dude, you seriously need to seek help for your mail-archiving condition :)

Or if nothing else move some of the mail to a backup directory so the poor little imap server doesn't have to deal with YOUR pack-rat habits!
Re:Fast?!? by alannon · 2004-05-18 18:50 · Score: 3, Informative

One of the reasons that Eudora tends to be fast for some things when Mail.app isn't is that Eudora does not store attachments with the mail. It splits them off at download-time into a separate folder. Mail.app keeps the entire mail envelope intact, including attachments. This makes Mail.app often very, very slow when moving large numbers of messages around, simply because it's doing a lot of file manipulation. I will admit, though, that Mail.app often feels very sluggish. Apple needs to work on that.
Re:Fast?!? by mandelbaum · 2004-05-18 19:21 · Score: 2, Interesting

Yeah for Eudora, the mail client with option-click to automatically group messages by whatever you click on. It's the best thing...ever.

I've been using it since 1994 and can't imagine switching to anything else.

-aaron
Re:Fast?!? by pHDNgell · 2004-05-18 19:32 · Score: 1

Or if nothing else move some of the mail to a backup directory so the poor little imap server doesn't have to deal with YOUR pack-rat habits!

Heh, actually, I have been considering that. Mail.app does need to look at all of the folders every time it connects to see if anything's changed. It's working pretty well right now, but I am considering making a new mail server. Has anything good come out since SunOS 4.1.4? :)

Seriously, though, keeping my mail has come in way too handy for me to consider throwing it away.

--
-- The world is watching America, and America is watching TV.
Re:Fast?!? by Anonymous Coward · 2004-05-18 19:38 · Score: 0

Eudora handles most operations with little performance hit no matter how big the mailbox gets(there is a limit of around 32,000 messages however, which someone I know hit).

Yeah, I hit that limit. It's 32760 messages....
Re:Fast?!? by ecampbel · 2004-05-18 20:23 · Score: 1

I have 56,442 messages in my Junk folder, so there can't be a 16-bit message limit...

--

Sig goes here
Re:Fast?!? by Ilgaz · 2004-05-18 20:38 · Score: 1

Eudora, in both the Windows and Mac worlds, is famous for handling amazing amounts of mail.

Yet Qualcomm, the giant behind Eudora, still hasn't figured out that in 2004 some degree of spam filtering must be free, and that, yes, there are non-English speakers who want to use their application (no UTF support).

As for the degree of mail handling... A coder friend of mine had to sort 150,000 mails coming to a support department after a major crash there... It did. :)

But... spam filtering must be free. Charging for it does nothing but make people hate the application and uninstall it...

I still don't understand why Qualcomm needs my $50 anyway lol.
Re:Fast?!? by richie2000 · 2004-05-18 21:25 · Score: 2, Interesting

Has anything good come out since SunOS 4.1.4?
I don't think so. Considering the time it took to get 4.1.4 as the proverbial gift from the Gods, I wouldn't hold my breath. ;-)
Damn, I actually miss SunOS, SunView and the 3/80s we had at school...

--
Money for nothing, pix for free
Re:Fast?!? by nikster · 2004-05-18 21:26 · Score: 3, Interesting

Mail CHOKED on them

it helps to check Apple apps _again_ from time to time since they tend to make huge improvements with every release. Mail.app has not been slow for a while now. Apple seems to pretty consistently follow the strategy "make it work first, make it fast later". i am running the latest version on OS X 10.3

I have about 1G of mail and it doesn't really seem slow in any situation, even though it's running on an almost 3 year old 667MHz powerbook (with a sloooow hard disk).

I just did a test of search entire message in all mailboxes (all 1G of them). the first results appeared after 3 seconds, and it stopped after 40 secs, rebuilding some indexes along the way. the second search was done in about 15 seconds.

Every single criticism i had since Mail 1.0 - and there were a lot, including performance - has since been addressed. It is now fast, no annoying modal dialogs, no indexing behind your back, no weird delays. It's just a beautiful mail client.

i recommend you try it again.

On topic: The junk mail filter seems to indeed work pretty well. i just checked my junk mail folder (2000 unread messages, heh): All except for 5 were spam, and those 5 were all mass mailings, too. Even clever(?) subject lines like v$a.g.r.a and such were filtered out.

Oddly, 3 of the 5 false positives were from Apple, sent to my .mac account.
Re:Fast?!? by EvilTwinSkippy · 2004-05-18 21:57 · Score: 2, Interesting

As a network administrator I just have to do a paternalistic scowl at you.
2.3 gig of email. Dear god our server only has a 20 gig hard drive. I'd be camped out at your office (or send a coop to camp in your office.) and make disparaging remarks about "bloat" until you trimmed up a bit.
If everything is important, nothing is important. 32,000 messages means you aren't real picky.

--
"Learning is not compulsory... neither is survival."
--Dr.W.Edwards Deming
Re:Fast?!? by Rosyna · 2004-05-18 22:31 · Score: 2, Informative

The limit exists on OS X (at least) because of a limit of the Resource Manager. Each message in the mbox on OS X has its index and other data in the resource fork, one resource for each message. There is a 16-bit limit on the number of resources in a file (and a 16meg limit for the entire resource fork). It is also why some OS X developers keep asking Apple to FREAKIN IMPLEMENT NAMED FORKS ALREADY!
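To put the numbers from this thread next to the 16-bit ranges, here is a small sketch. Whether the Resource Manager count is signed or unsigned is an assumption the thread doesn't settle, and the gap between 32,760 and 32,767 isn't explained here either; `fits` is just an illustrative helper.

```python
# Compare message counts quoted in the thread with 16-bit integer ranges.
# Assumption: we don't know if the real counter is signed or unsigned.

SIGNED_16_MAX = 2**15 - 1     # largest signed 16-bit value
UNSIGNED_16_MAX = 2**16 - 1   # largest unsigned 16-bit value


def fits(count, limit):
    """Return True if a non-negative count does not exceed the limit."""
    return 0 <= count <= limit


counts = {
    "Eudora limit reported in the thread": 32760,
    "Junk folder size reported in the thread": 56442,
}

for label, count in counts.items():
    print(
        f"{label}: {count}",
        f"fits signed 16-bit: {fits(count, SIGNED_16_MAX)}",
        f"fits unsigned 16-bit: {fits(count, UNSIGNED_16_MAX)}",
        sep=" | ",
    )
```

So the reported Eudora ceiling sits just under the signed 16-bit maximum, while the 56,442-message Junk folder would overflow a signed counter but still fit in an unsigned one.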
Re:Fast?!? by cyb97 · 2004-05-18 23:06 · Score: 1

a decent IMAP implementation (not that I can think of any that is decent and useable from a user perspective) should allow you to choose which boxes you want to subscribe to (ie. which ones it should check on connection).

It still amazes me that there still aren't any really killer IMAP-implementations around yet. They all seem to suffer from different problems making them annoying to use in the long run.
Re:Fast?!? by SlamMan · 2004-05-18 23:41 · Score: 1

So what if he's not picky? (this is actually very similar to a discussion we had at work yesterday) With drive prices being as low as they are, we realize we could easily split the data portion of our mailserver between 3 drives. With 160's going for a song, we could give our 130 users 3 gigs of mail storage with plenty of room for institutional growth. It does become a bit more of a backup issue, but really, there's less load having 160 users with 3 gigs of mail, vs 480 users with 1 gig.

--
Mod point free since 2001
Re: Fast?!? by teridon · 2004-05-18 23:48 · Score: 2, Interesting

I had the same experience with Mail -- I let it chug away *overnight* to import my mail. The next day when I tried actually *using* Mail it was too slow compared to Eudora. What a waste of time :(

FYI, Eudora 6.1 now has address book integration. See here

--
I hold it, that a little rebellion, now and then, is a good thing. -- Thomas Jefferson
Re:Fast?!? by thatguywhoiam · 2004-05-19 00:32 · Score: 1

My Mail folders contain 2.31gigs of email. Mail cannot handle this and chokes on it horribly.
Patient: "It hurts when I do this."
Doctor: "Stop doing that."
While I admire your chutzpah, having 2.31 gigs of Mail around may just be one of those exercises in futility. I mean, I can't prove you wrong if you tell me you need instantaneous access to 2.31GB of mail all the time, but I do kind of doubt it.
Reminds me of those people who complained that you can't put 900 folders in the Dock and make sense of it.

--
If Jesus wants me it knows where to find me.
Re:Fast?!? by geniusj · 2004-05-19 02:25 · Score: 1

I also have well over 100k messages.. Now here's the reason that I don't use Mail.app.. I DO NOT want all of my messages to be stored on my mac as well. That is what IMAP is for. But instead of Mail.app even giving me the option of using server side searching as most complete IMAP clients should, it insists on indexing everything locally in order to provide searching functionality.

Very large mailboxes (30k+ messages?) also take a bit to open in Mail.app as well. In my opinion, if you're using a good IMAP client, the amount of messages you have should be irrelevant to the speed in which you can use your mail client. I currently use Mulberry and am satisfied at the performance which it gives me. It only loads headers for the ~20 messages surrounding where you're scrolled to and therefore provides excellent performance on any size mailbox. The other thing I like is that all searches are performed server side and usually 4 at a time (customizable just like everything in mulberry is). This means that my mail store on my local drive? Zero.. I store nothing locally, including my mail settings. Those are all stored on an IMSP server. This is how I like my mail to be.
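The "only load headers around where you're scrolled to" idea can be sketched like this. It is not Mulberry's code, just a guess at the arithmetic; `header_window` and `fetch_set` are made-up names, and the result is formatted as an IMAP sequence set (first:last) that a client could hand to a FETCH command.

```python
def header_window(position, total, size=20):
    """Return the 1-based inclusive (first, last) message numbers to fetch
    around a scroll position, or None for an empty mailbox."""
    if total <= 0:
        return None
    size = min(size, total)
    # Centre the window on the scroll position, clamped to the mailbox start.
    first = max(1, position - size // 2)
    last = first + size - 1
    # If the window runs past the end, slide it back so it stays full.
    if last > total:
        last = total
        first = last - size + 1
    return first, last


def fetch_set(window):
    """Format a window as an IMAP sequence set string."""
    first, last = window
    return f"{first}:{last}"


window = header_window(position=15000, total=30000)
print(fetch_set(window))
```

The point of the design is that the work per scroll depends on the window size, not on how many messages the mailbox holds.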

-JD-
Re:Fast?!? by EvilTwinSkippy · 2004-05-19 02:53 · Score: 4, Informative

Where to start...
First off, servers take SATA or SCSI, not the cheepy IDE drives you find on the net. Second, even if you could find equivalent sizes for equivalent prices for server-grade stuff, I can't speak for everyone, but users don't store anything on my network that isn't on a RAID. 2 drives for a RAID-1, 3 (at least) for RAID-5.
Assuming that cost isn't an issue, and you have a miraculous RAID controller that is easy to program, you run into the problem of how to hook up the new drives. If you don't have enough bays and connectors you have to drop your old hard drives to tape, plug in your new drives, and restore.
The last time I did a restore of 160GB it took 48 hours with a DLT autoloader. AIT might cut that down to 12 hours. But that's still a long time to be without data.
I'll save the issues about premature failure on these uber-mega drives for another discussion.
Now I insist our users use IMAP for email. Too many bad experiences of desktops croaking and taking all of a user's POP mailboxes with it. Making your system catalogue several gigabytes of email per user is going to slow things to a crawl, unless you are using something enlightened like maildir. Even then, you are going to be hard pressed to find a file system that efficiently handles both uber-mega attachments AND a few million tiny text files for individual messages.
All for what? So some user doesn't have to be bothered to clean out their mailbox?
No problem, except the next thing El' numbnuts is going to ask for is a tool to actually FIND something in all that mess.

--
"Learning is not compulsory... neither is survival."
--Dr.W.Edwards Deming
Re:Fast?!? by Pope · 2004-05-19 03:04 · Score: 1

Sorry, but I couldn't let this one slide. You've obviously got a special interpretation of "fast", because I tried migrating my Eudora mailboxes to Mail, on a 1Ghz Powerbook G4.

Mail CHOKED on them
Err, yeah, importing a ton of mailboxes is quite a different task than simply running day-to-day email tasks.
For the record I use both Mail and Eudora, with Eudora getting my "important" email boxes. I've been using it since I got my first PPP connection, and see no reason to stop using it, considering I don't need all the "features" that most newer clients have.

--
It doesn't mean much now, it's built for the future.
Re:Fast?!? by Pope · 2004-05-19 03:17 · Score: 1

I have 56,442 messages in my Junk folder
You're supposed to delete them, dude.

--
It doesn't mean much now, it's built for the future.
Re:Fast?!? by Zeriel · 2004-05-19 06:46 · Score: 1

First off, servers take SATA or SCSI, not the cheepy IDE drives you find on the net.

Hitachi 160GB SATA drives are ~$105. Pretty damn cheap.

Yeah, it's only 7200rpm. It's a freakin' 1000-user mail server, if you're hitting those drives hard you need some serious help.

--
"America has done some terrible things. But I know that Americans don't cheer when innocents die." -Dave Barry
Re:Fast?!? by sfgoth · 2004-05-19 07:41 · Score: 0, Troll

No problem, except the next thing El' numbnuts is going to ask for is a tool to actually FIND something in all that mess.

I have over 3GB of mail archived using Eudora, and I use the search feature all the time.

Here's a nickel, buy yourself another gigabyte.

Are sysadmins trying to become the new beancounters? Who the hell are you to decide which of my data is valuable?
Re:Fast?!? by Brent+Nordquist · 2004-05-19 09:01 · Score: 1

pHDNgell: Would you mind contacting me at cyrusmac AT nordist DOT net -- I am trying to get Mail.app to work with Cyrus v2.2.3 with no luck (it crashes); if you solved this problem I'd love to hear how. Thanks!

--
Brent J. Nordquist N0BJN
Re:Fast?!? by pHDNgell · 2004-05-19 13:55 · Score: 1

It still amazes me that there still aren't any really killer IMAP-implementations around yet. They all seem to suffer from different problems making them annoying to use in the long run.

I know what you mean. I've tried many things. I always went back to pine. Mail.app has been Good Enough(tm) to keep me away from pine most of the time (not that I've got anything against pine).

IMAP's pretty amazing, though. I'd hate to try to implement it again. :)

--
-- The world is watching America, and America is watching TV.
Re:Fast?!? by sfgoth · 2004-05-20 07:23 · Score: 1

TROLL?! What kind of crack are the moderators smoking and where do I get some?

Wow, the neighborhood is really going to hell around here.

Mail & IMAP by WALoeIII · 2004-05-18 17:48 · Score: 1

I'd much prefer to use Mail over Entourage but I can't because it doesn't support multiple databases for accounts or have the ability to move my mail to a Junk folder on my IMAP server. I run my own server and space is not an issue so I do not delete mail. This makes it easier to train a new application if I need it - and makes sure that i never miss a message. Until mail can handle multiple accounts better I won't be using it.

Re:Mail & IMAP by elbobo · 2004-05-18 18:46 · Score: 3, Informative

doesn't ... have the ability to move my mail to a Junk folder on my IMAP server.

Yes it does:

Preferences -> Accounts -> Special Mailboxes -> Store junk messages on the server.

My personal IMAP complaint is that you can't create rules to move messages between folders on the server, only folders on the client.
Re:Mail & IMAP by hxnwix · 2004-05-18 19:39 · Score: 1

But it doesn't actually seem to move them there... I found that option and checked it, but have been too lazy to figure out why it doesn't work. Oh well. Perhaps I need to create a junk folder on the server myself first...
Re:Mail & IMAP by elbobo · 2004-05-18 19:55 · Score: 1

Perhaps I need to create a junk folder on the server myself first...

You do. Select the new folder, then Mailbox (menubar) -> Use this folder for -> Junk. Same process for when you want to store your Sent or Trash on the remote server.

Although I can't actually confirm that this works for the Junk folder. I only keep my Sent folder on the server. But the option is there for the Junk folder, so I assume it must work.
Re:Mail & IMAP by pwagland · 2004-05-24 03:09 · Score: 1

doesn't ... have the ability to move my mail to a Junk folder on my IMAP server.
Yes it does:
Preferences -> Accounts -> Special Mailboxes -> Store junk messages on the server.
My personal IMAP complaint is that you can't create rules to move messages between folders on the server, only folders on the client.

Not entirely true... it depends a lot on what server you are using, but it is possible to do server side filtering on IMAP servers. In general there are two types of servers, and two types of solutions:

Maildir based (e.g. dovecot): Filters for these can normally be set up with procmail based filters.
Private database (e.g. Cyrus): These normally have a language called "sieve" to do the filtering for you.

What you might have meant to say is that it gripes you that mail.app does not have any way to manipulate these rules... and that is damn annoying. Horde/ingo is a good choice for a web based frontend to this though.
Re:Mail & IMAP by elbobo · 2004-05-24 06:14 · Score: 1

I'm talking about client side rules. ie the rules you find under Preferences -> Rules. These have a specific limitation that they're incapable of moving messages between IMAP folders.

Yes you can do server side rules, and on occasion I do, as I run my own mail server (qmail + courier-imap, using Maildir). But for convenience sake, it's nice to be able to quickly throw together rules on the client side, which is impossible with Mail.app, due to this limitation.

best filter nerdmaker.com by Anonymous Coward · 2004-05-18 18:05 · Score: 0

As far as I know, I am the first to make this up.
1. have several spam email accounts to gather spam

2. use all these emails as a filter(can be partial matching) to eliminate spam at the isp or server or client level.

Re:best filter nerdmaker.com by Anonymous Coward · 2004-05-19 15:21 · Score: 0

that is an excellent idea; but you must understand the three interests of slashdot readers (in this order chronologically):
1. masturbation
2. pron
3. see number 1.

If computers are so smart... by Anonymous Coward · 2004-05-18 18:15 · Score: 0

Why can't they tell that pen1s and penis are essentially the same word?

I use Spam Assassin, and yeah it flags a lot of stuff, but the stuff that does get through is really obvious spam to a human being, yet it fools SA no sweat.

Re:Kinda like Mozilla Mail? by Anonymous Coward · 2004-05-18 18:24 · Score: 0

Actually, if you read the article it specifically states that Mail's spam filtering is not like Mozilla Mail's. You use it in much the same manner, but the underlying technology is completely different.

Wow, the article states it.. It's gotta be true..

It does a poor job of explaining the difference.

Word disguises? by Piquan · 2004-05-18 18:31 · Score: 2, Interesting

The big problem I see in spamland today isn't the classification technology. It's the word recognition problem. Sure, "VIAGRA" may be deeply embedded in a "spam" cluster, but what about "V1_4G ra"? If spammers weren't disguising their words, I think that Bayesian filtering and other techniques would work fine. I'm not really sure that more advanced techniques in word classification are really needed here.
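A rough sketch of what "undisguising" a word before classification might look like. The substitution table is made up and tiny, `normalize` is a hypothetical helper, and no claim is made that any real filter works this way; it only shows that the example above can be folded back onto the plain word.

```python
import re

# Hypothetical look-alike table; real spam uses far more tricks than this.
LOOKALIKES = str.maketrans({
    "1": "i", "!": "i", "0": "o", "3": "e",
    "4": "a", "@": "a", "5": "s", "$": "s",
})


def normalize(token):
    """Undo a few common disguises in a single candidate word."""
    token = token.lower().translate(LOOKALIKES)
    # After substitution, drop anything that is not a plain letter
    # (underscores, dots, spaces inserted to break up the word).
    return re.sub(r"[^a-z]", "", token)


for disguised in ["VIAGRA", "V1_4G ra", "pen1s"]:
    print(repr(disguised), "->", normalize(disguised))
```

The obvious cost is collisions: legitimate tokens containing digits or punctuation get mangled too, so a filter would probably want to keep both the raw and the normalized form as features.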

Re:Word disguises? by Anonymous Coward · 2004-05-19 01:13 · Score: 0

Bayesian filters aren't just word filters, in general. Read Paul Graham's article on this. It tokenizes everything, so common features we're not considering (such as bold, red text) also become devastating indicators.
Re:Word disguises? by PowerMacDaddy · 2004-05-19 07:13 · Score: 1

Dude, you need to download and install JunkMatcher. It works awesome as a complement to Mail.app's built-in filter, and it's free as in beer.

--
MacTacToe - for every problem, an elegant solution

Not if email is marked as junk... by SuperKendall · 2004-05-18 18:40 · Score: 5, Informative

If an email is marked as junk, even if you go to look at it to see if it's really junk no images are loaded so this tracker does not work.

As others have mentioned you can also turn off images for all messages, which is what I would do if it ever started missing spam. So far only one miss in the last six months or so, and no false positives. I'm pretty impressed.

--
"There is more worth loving than we have strength to love." - Brian Jay Stanley

Re:Not if email is marked as junk... by emarkp · 2004-05-18 20:18 · Score: 1

At least in Thunderbird, you can turn off loading of images not contained in the message, irrespective of whether it's junk or not. I've been quite happy with that setting.
Re:Not if email is marked as junk... by soft_guy · 2004-05-19 09:12 · Score: 2, Informative

I use Mail.app, I have Panther, and I keep everything current. Still, Mail often misses many pieces of spam every day and gives me false positives from time to time. YMMV. Still, I find the junk mail filter useful enough to leave on.

--
Avoid Missing Ball for High Score

But you still get the spam... by Yusaku+Godai · 2004-05-18 18:49 · Score: 2, Insightful

I mean, it's great and all that we've gotten pretty good at filtering spam. I use Opera quite a bit, and its spam filters work with 99% accuracy after sufficient training. But there's still a chance something can slip through. I still have to download all the spam, and occasionally go through it, deleting it all, while making sure something legit didn't accidentally get flagged as spam. It's rare, but it happens. The most annoying thing is just that I get it at all. I'd be more impressed to see something like this running on the mail server, turning back spam. I even wouldn't mind if the rare legitimate message got bounced. The sender would just receive a message from the mail server saying that their mail was marked as spam, and that they should try again, or let me know by some other means. Heck, I wouldn't mind missing the occasional e-mail if I never had to download another spam again. That's what would impress me at this point.

Re:But you still get the spam... by nikster · 2004-05-18 21:53 · Score: 1

slight problem: if you bounce spam, you let the spammer know that your account is live, so they will proceed to send you even more spam.

besides, a lot of spam also has fake return addresses (viruses and such). so you are spamming the real owners of the address. and if they have such a bounce-system as well...

=> if everybody automatically bounced all spam, the internet would go down in an avalanche of spam and bounces...

the best way to stay spam-free is to never give out your email address. use a disposable one instead, like this one: slashdot-test.3.nikster@spamgourmet.com

i will receive the first 3 emails to that address. everything else gets eaten (deleted right on the server). works beautifully.
Re:But you still get the spam... by Mwongozi · 2004-05-19 00:47 · Score: 1

I host my e-mail at Fastmail. They use SpamAssassin to filter spam, but do it in a really cool way. For example, any mail scoring greater than 4.0 is dropped into my spam folder, but mail scoring greater than 10.0 is just deleted automatically, server-side. It's pretty cool, and the thresholds are user configurable.
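The two-threshold scheme described above boils down to something like the following. This is only a sketch of the logic, not Fastmail's or SpamAssassin's actual configuration; `route` and its parameter names are invented, and the defaults are just the 4.0 and 10.0 figures mentioned in the comment.

```python
def route(score, junk_threshold=4.0, delete_threshold=10.0):
    """Decide what to do with a message given its spam score.

    Scores strictly above delete_threshold are dropped server-side,
    scores strictly above junk_threshold go to the spam folder,
    everything else is delivered normally.
    """
    if score > delete_threshold:
        return "delete"
    if score > junk_threshold:
        return "spam-folder"
    return "inbox"


for score in (1.2, 4.5, 9.9, 17.0):
    print(score, route(score))
```

Keeping a middle band that is filed rather than deleted is what makes the aggressive top threshold tolerable: only the borderline mail ever needs a human look.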
Re:But you still get the spam... by Yusaku+Godai · 2004-05-19 02:07 · Score: 1

That does sound really cool. Sounds like just the sort of solution I'd like to see =D
Re:But you still get the spam... by Anonymous Coward · 2004-05-19 02:33 · Score: 0

Find an ISP that uses a Barracuda spam appliance. What's great about this solution is that when messages are tagged as spam they are not downloaded to you. You get a quarantine notice once a day which you can review to trash, deliver, or whitelist. You can even set your own spam scores to tweak the effectiveness. Check out http://www.crocker.com, a great ISP in Massachusetts. They KNOW how to handle spam!
Re:But you still get the spam... by rudedog · 2004-05-19 03:28 · Score: 4, Interesting

The sender would just receive a message from the mail server saying that their mail was marked as spam

Sadly, if it is spam, then you'll be punishing thousands of innocent people whose email addresses have been forged by the spammers, by sending them the bounce messages. Very little actual spam gets past my Bayesian filters, but I do get a lot of bounces from other people's spam filters for messages and viruses that I never sent.

Combine with JunkMatcher for 99%-100% accuracy by kiddailey · 2004-05-18 18:50 · Score: 2, Informative

Mail's junk filter may be okay, but it's not nearly as good as the article makes it out to be. You'll get much better results using a combination of the built-in filter along with external filtering/tagging devices.

For example, I ran across JunkMatcher some time ago and have been enjoying 99-100% accuracy with less than 1% false positives. It's a huge improvement over the 80-90% accuracy I was getting with the built-in filter alone.

This is probably off-topic by teamhasnoi · 2004-05-18 18:53 · Score: 4, Interesting

All my emails to a couple of people suddenly started bouncing with a 550 'Administrative Prohibition' error last week - at first I blamed my ISP, then blamed my host, then the receiving host, all for naught. I then found I was on a couple of blacklists (probably because I apparently shared a virtual host with a scummy mortgage guy), but these had no bearing (I learned later)

I had emails out to every link in the chain, but no one knew what was going on.

In Apple Mail, I had my 'reply to' names set to my email addys - I changed it to short descriptive names and now they're not bouncing anymore. (odd error, so I thought I'd post it)

Why this started all of a sudden, and why no host or ISP had heard of this before, I don't know.

I do know that being on a blacklist and attempting to get off of it is nigh impossible, so I'd be all over Apple making spam filtering software so overzealous wizards of blacklists can be kicked to the curb. (Why is this in use anywhere..?)

Re:This is probably off-topic by the+shoez · 2004-05-18 22:28 · Score: 2, Insightful

Maybe because this type of filtering can only really work for a single user on their own corpus of email. In effect it's an end-user solution not something that could be deployed across the whole spectrum of mail servers (as I understand it). Take for example Scott Richter (or however you spell the name), that scabby-little spammer - he loves the stuff, and wouldn't wish it to be filtered from his inbox.

Blacklists are there to swallow-up this bandwidth wasting traffic forced down our necks by spammers. Personally, I would rather the crap be denied before it ever has to reach my section of the line. I don't know about you, but I get a chuff-load of spam every day which seriously hacks me off. Getting onto a blacklist for any length of time means a boat load of spam must be coming from that machine - hence it's the fault of your host provider for not cracking down sooner.

I say all power to them!

--
&lawyers($instruction);

There's plenty of LSI information online by K-Man · 2004-05-18 19:11 · Score: 4, Informative

Latent Semantic Indexing has been around for a while, and I've forgotten many of the details. As some have mentioned it's a dimension reduction technique, and the result is a set of eigenvectors, each of which describes a set of terms which correlate well with each other (or anticorrelate, I think components can be negative too).

In English terms, the technique finds sets of words that occur together in different subject areas, and gives them weights which reflect how often they occur together. For instance, "baseball" and "bat" may emerge as common companions in some documents, so they might get weights of 1.0 for both (in one eigenvector/topic) if they always occur together - meaning a query for "bat" should always return hits for "baseball" too. However if "bat" gets diluted by documents about flying animals, then its weight in the "baseball"-"bat" vector will be reduced, say to 0.5. Then queries for "bat" will not necessarily map to baseball documents, but to both areas, represented by different eigenvectors.

That's confusing enough, but LSI gives a clean method for managing all of these relative probabilities in a global space of word occurrence vectors. The "latent" part is how it discovers these topic areas automatically, by clustering words which occur together. This process is similar to data mining for common subsets, but with LSI the members of the subsets are actually weighted for significance.
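For anyone who wants to poke at this, here is a toy version in plain Python. It is a sketch under heavy assumptions: the four-term, four-document matrix is invented, and instead of a full SVD it uses power iteration on the term co-occurrence matrix A·Aᵀ, whose eigenvectors are the term-side vectors LSI works with. Nothing here is taken from Mail.app.

```python
import math

terms = ["baseball", "bat", "cave", "wings"]
# Toy term-document counts (rows: terms, columns: documents), made up for
# illustration: two sports documents, one mixed, one about flying animals.
A = [
    [1, 1, 0, 0],  # baseball
    [1, 1, 1, 0],  # bat
    [0, 0, 1, 1],  # cave
    [0, 0, 1, 1],  # wings
]


def matvec(M, v):
    """Multiply a square matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def unit(v):
    """Scale a vector to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def top_eigenpair(M, iterations=200):
    """Power iteration: dominant eigenvalue and unit eigenvector of a
    symmetric matrix."""
    v = unit([float(i + 1) for i in range(len(M))])
    for _ in range(iterations):
        v = unit(matvec(M, v))
    value = sum(a * b for a, b in zip(v, matvec(M, v)))
    return value, v


# Term-term co-occurrence matrix C = A * A^T.
C = [[sum(a * b for a, b in zip(r1, r2)) for r2 in A] for r1 in A]

value1, vec1 = top_eigenpair(C)
# Deflate: subtract the first component so the next run finds the second one.
size = len(C)
C2 = [[C[i][j] - value1 * vec1[i] * vec1[j] for j in range(size)]
      for i in range(size)]
value2, vec2 = top_eigenpair(C2)

for term, w1, w2 in zip(terms, vec1, vec2):
    print(f"{term:9s} first: {w1:+.2f}  second: {w2:+.2f}")
```

In this toy data the second vector separates the two topics by sign, with "baseball" and "bat" on one side and "cave" and "wings" on the other, and "bat" ends up with a smaller magnitude than "baseball" because it also shows up in the animal document. That is the dilution effect described above.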

--
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger

Re:There's plenty of LSI information online by kanthaka · 2004-05-19 02:35 · Score: 2, Informative

There's a good survey of information retrieval techniques & algorithms here -
http://maya.cs.depaul.edu/~classes/ds575/lecture.html
It's a course site so the lectures are not accessible, but all the articles & tools are.

Need To Select Text by WiseWeasel · 2004-05-18 19:34 · Score: 1

You need to first select the text to summarize, then you can go to the application menu, Services, and choose the Summarize option. This then launches the SummaryService, which then allows you to set the desired summary length and displays the summary accordingly.

--
"I like systems, their application excepted", George Sand (French)

Wow by Anonymous Coward · 2004-05-18 19:45 · Score: 0

I guess that's what the "load images" button in Mail.app is for. So you only load pictures from sources you trust.

I assume this is the same in Outlook? Anyway. I'll never stop blaming MS. Think of how much bloody extra traffic their introduction of HTML is causing (look at spam numbers).

No need to own a mac! by Phil+John · 2004-05-18 19:53 · Score: 1

look here! ;o)

--
I am NaN

Brazil by Cally · 2004-05-18 20:22 · Score: 1

Interestingly enough, the technology that underlies the Junk Mail filter began its life as an information retrieval system.

Information Retrieval? Information Adjustments, surely.

--
"None are more hopelessly enslaved than those who falsely believe they are free." -- Goethe

Privacy violation by michaeldot · 2004-05-18 21:17 · Score: 2, Interesting

You mean these "Vectors" (sounds foreign) are watching everything in my email?!!

Well, if that isn't a gross invasion of privacy then my name's not Liz Figueroa.

I'm drafting a letter to the Senate immediately... on a typewriter.

Re:Kinda like Mozilla Mail? by Anonymous Coward · 2004-05-18 22:28 · Score: 0

time to get new friends and disown your family

Document Vectors - Term Weights by agentofchange · 2004-05-18 22:30 · Score: 3, Interesting

Forgetting about vectors is silly.

In short: a vector is the result of a calculation based on the number of times a term is used in a document and the terms in the other documents it is being compared with (the document set).

The angle between document (email) vectors is a representation of their likeness. For example if the angle is very small the documents have a lot in common.

This is how the mail app works. It compares known junk emails (ie the query) to the incoming document set (new emails)

There are a number of weighting schemes, for example Term Frequency Weights (TF Weights) or Term Frequency Inverse Document Frequency (TF-IDF Weights).
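A minimal sketch of the vector idea, assuming raw term counts times log(N/df) as the TF-IDF weight (one of several common variants) and three invented messages. The function names are illustrative, and this is not a description of what Mail.app actually computes.

```python
import math
from collections import Counter

# Made-up example messages.
docs = {
    "junk1": "cheap pills buy cheap pills now",
    "junk2": "buy cheap watches now",
    "ham1": "meeting notes for the project review",
}


def tf_idf_vectors(docs):
    """Return {name: {term: weight}} with weight = tf * log(N / df)."""
    counts = {name: Counter(text.split()) for name, text in docs.items()}
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts.values() for term in c)
    return {
        name: {term: tf * math.log(n / df[term]) for term, tf in c.items()}
        for name, c in counts.items()
    }


def angle_degrees(u, v):
    """Angle between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    cos = dot / (norm_u * norm_v)
    # Clamp to guard against tiny floating point overshoot.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


vectors = tf_idf_vectors(docs)
print("junk1 vs junk2:", round(angle_degrees(vectors["junk1"], vectors["junk2"]), 1))
print("junk1 vs ham1: ", round(angle_degrees(vectors["junk1"], vectors["ham1"]), 1))
```

The two junk messages share terms, so the angle between them is smaller than the angle between a junk message and the unrelated one, which have no terms in common and therefore sit at right angles.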

There are a few laws particularly relevant to Information Retrieval. Heaps' Law: the larger a document gets, the fewer new words are added to it.

http://planetmath.org/encyclopedia/HeapsLaw.html

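Zipf's Law: more relevant to document weighting schemes. It states that a word's frequency falls off roughly in inverse proportion to its frequency rank, so a handful of words account for most of the text. Those frequently used words are the less relevant ones: stop words such as "a, the, it, and, is" all carry little meaning and are used frequently.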

http://planetmath.org/encyclopedia/ZipfsLaw.html

Less frequently used words in a document are better at describing its content. For example "pixel intensity mathematical concepts".
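For reference, the usual textbook forms of the two laws and of the inverse document frequency weight look like this. The symbols K, beta, s, N and df are the conventional ones, not something taken from the article, and the constants are empirical and vary by corpus.

```latex
% Heaps' law: vocabulary size V after n tokens, with constants K > 0 and 0 < \beta < 1
V(n) \approx K \, n^{\beta}

% Zipf's law: frequency f of the word with frequency rank r, with exponent s close to 1
f(r) \propto \frac{1}{r^{s}}

% Inverse document frequency of term t: N documents in total, df(t) of them contain t
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
```

The idf factor is what pushes stop words toward zero weight: a term that appears in every document has df(t) = N and therefore idf(t) = 0.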

-- Agent

mozilla by hitmark · 2004-05-18 22:39 · Score: 1

and this is different from mozilla mail's spam filter how?

oh and can we please get a mac software article that doesn't sound like a raving zealot doing the "hail mac/jobs" routine?

even linux reviewers are more critical...

--
comment first, facts later. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm

Re:mozilla by zpok · 2004-05-19 03:55 · Score: 1

"can we please get a mac software article that doesn't sound like a raving zealot doing the "hail mac/jobs" routine?"

No.
If O'Reilly is too zealot for you, I guess everything is.

"even linux reviewers are more critical..."

No, no they're not. Unless they talk about "that other distro/window system/desktop environment/Windows". I've seen quite a few Linux reviews with more than the maximum sustainable amount of exclamation marks...

--
I think, therefore I am...I think.

Or using Microsoft's version... by yarisbandit · 2004-05-18 23:19 · Score: 1

..."Wow, the article just turned me on."

Kinda like Mozilla Mail?-Pigeonhole filesystems. by Anonymous Coward · 2004-05-18 23:20 · Score: 0

"Not exactly Bayesian, no. It's a different kind of document classification algorithm, which the article calls Latent Semantic Analysis. "

Which you'll see more of as DB Filesystems, and Meta this and Meta that, become popular.

Good god, man by thatguywhoiam · 2004-05-19 00:23 · Score: 5, Informative

Wow, a checkbox buried in the preferences options. Apple is unique and ahead of the curve. But wait! There is a fix for outlook too [msnwar.com].

Well, since you brought it up, yes, let's compare:

Apple method:
Open Prefs
Click Viewing Options
Uncheck 'Display images and embedded objects in HTML messages'

... or I can go hunting on the web for this weirdo, non-sanctioned 'patch' for Outlook, and install that. Oh yeah, and ZoneAlarm.

I'll stick with Apple's method thanks.

--
If Jesus wants me it knows where to find me.

Re:Good god, man by fanfriggintastic · 2004-05-19 02:04 · Score: 3, Informative

Images are off by default in Outlook 2003. You can turn them on for a particular sender or per email, easily, through a link at the top of the message. Piece of cake.

--
This is not the greatest sig in the world, no. This is a tribute.
Re:Good god, man by CaptainFrito · 2004-05-19 02:40 · Score: 1

No wonder they restrict you to one mouse button ;-)
Re:Good god, man by geoffspear · 2004-05-19 04:28 · Score: 2, Informative

My three mouse buttons all work perfectly well with my Mac. They don't restrict you to anything, they just sell their machines with a one-button mouse.
I don't even need to go hunting for drivers to install if I want to plug in another mouse, or damn near any other USB device. They just work.

--
Don't blame me; I'm never given mod points.
Re:Good god, man by Anonymous Coward · 2004-05-19 06:49 · Score: 0

My PC came with no mouse buttons, or a mouse, for that matter. Apple includes a mouse with each computer, and an OS simple enough that you don't need extra mouse-buttons to get stuff done.
To break it down:
My PC: Needs 3 buttons, comes with zero
My Mac: Needs 1 button, comes with one.
In both cases, you have the option to go out and buy a mouse with more buttons if you want.
Pretty obvious to me which is better. Why can't more PC bigots see that? Just stupid, I guess.
Re:Good god, man by milkman_matt · 2004-05-19 09:04 · Score: 1

Images are off by default in Outlook 2003. You can turn them on for a particular sender or per email, easily, through a link at the top of the message. Piece of cake.

Really? Does it do this for Outlook Express too? If so then that's very cool. However, what piqued my interest in your post was that you mention you can turn them on for a particular SENDER. I've WANTED that in Mail.app for a while. I used to get my daily sinfest comics emailed to me, but I stopped just because I had to load the images every day. No, I'm not THAT lazy, but on a g3/400 powerbook, even that takes longer than it should. I'd like an 'accept images in mail for sender' option, that'd be great.

-matt
Re:Good god, man by fanfriggintastic · 2004-05-19 09:13 · Score: 1

Nope, just Outlook - it gives you the option to either add the sender's email to a safe list, or the sender's domain. Pretty handy. I'm sure it's saved me from plenty of spam in the past.

But it totally backfired on me once, when I sent a friend a particularly funny piece of spam, she opened it and up popped the images, probably tied to my email... Seriously, I think my daily spam doubled since then.

--
This is not the greatest sig in the world, no. This is a tribute.
Re:Good god, man by CaptainFrito · 2004-05-19 12:19 · Score: 1

I don't use Windows (except for some engineering apps that only run in Windows and the occasional PowerPunt file). Windows doesn't come with any hardware, neither does any other OS. Apple is a bundled, blister-packed hardware/software solution designed with a specific intellect in mind. Thus, it is ultra-simple. And bully for them, I can't say they underestimated their market one bit. I prefer selecting all my own hardware and software and then tailoring it to my needs. Some people just want "internet in a box" or "graphics arts in a box".

Latent Semantic Analysis by Henry+Stern · 2004-05-19 00:41 · Score: 4, Informative

After reading through the comments here, it is obvious that there are some misconceptions about what Apple is doing.

Latent Semantic Indexing (LSI) was invented by Deerwester et al. [1] as a method of reducing the dimensionality of a text corpus by finding a low-rank approximation of the term-document matrix.

The singular value decomposition (SVD) [2] factors a matrix A into the product of two orthogonal matrices and a diagonal matrix, A = U'SV. To find a rank-k approximation of A using this factorisation, create matrices U^, S^ and V^ where S^ contains the first k rows and columns of S, U^ contains the first k rows of U and likewise for V^. Then, let A^ = U^'S^V^. Among all rank-k matrices, A^ minimises the Frobenius norm [3] of the difference A - A^ (least squares).

Rather than storing the full matrix, A^, in practice it is much more common to save U^ and S^ and project the columns and rows of A into a k-dimensional space. This allows both terms and documents to be clustered together and helps to associate keywords with documents.
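Here is a small sketch of the truncation and projection steps in plain Python. It is an illustration under stated assumptions, not Apple's code and not the search engine mentioned further down: the top k singular triplets are found by power iteration with deflation (a stand-in for a real SVD routine), the 5-term by 4-document matrix is invented, and the helper names are hypothetical. Each triplet holds one singular value with its left and right singular vectors, so the left vectors play the role of the rows of U^.

```python
import math

def transpose(A):
    return [list(col) for col in zip(*A)]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def top_singular_triplets(A, k, iters=500):
    """Top-k (sigma, u, v) of A via power iteration on A^T A with deflation."""
    At = transpose(A)
    n_cols = len(At)
    triplets = []
    for i in range(k):
        v = [1.0 / (i + j + 1) for j in range(n_cols)]  # deterministic start
        for _ in range(iters):
            # Remove the directions that were already found
            for _, _, found in triplets:
                c = dot(v, found)
                v = [a - c * b for a, b in zip(v, found)]
            w = matvec(At, matvec(A, v))  # one power step on A^T A
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        Av = matvec(A, v)
        sigma = math.sqrt(dot(Av, Av))
        triplets.append((sigma, [x / sigma for x in Av], v))
    return triplets

def project(triplets, x):
    """Coordinates of a term-space vector x in the k-dimensional latent space."""
    return [dot(u, x) for _, u, _ in triplets]

def cosine(x, y):
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))

# Rows are the terms oil, gas, rig, goal, team; columns are four toy documents.
A = [
    [1, 1, 0, 0],  # oil
    [1, 1, 0, 0],  # gas
    [1, 0, 0, 0],  # rig
    [0, 0, 1, 2],  # goal
    [0, 0, 1, 1],  # team
]
triplets = top_singular_triplets(A, k=2)
columns = transpose(A)
q_hat = project(triplets, [0, 0, 1, 0, 0])            # query: just the word "rig"
print(cosine(q_hat, project(triplets, columns[1])))   # close to 1
print(cosine(q_hat, project(triplets, columns[2])))   # close to 0
```

The second document contains "oil" and "gas" but not "rig", yet in the two-dimensional space it lines up with the "rig" query, while the sports document does not. On a real corpus you would use a library SVD rather than power iteration.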

You can do many things with these approximated document vectors: clustering, classification, document retrieval. Apple is probably using a k-nearest neighbour classifier [4] to determine how a message is to be filed.
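A k-nearest neighbour classifier over such vectors can be sketched in a few lines. This is only a guess at the general shape, in keeping with the "probably" above: the two-dimensional vectors, the labels and the choice of cosine similarity with k = 3 are all assumptions for illustration.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_classify(labelled, query, k=3):
    """Return the majority label among the k labelled vectors most similar to query."""
    ranked = sorted(labelled, key=lambda item: cosine(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical reduced document vectors with the mailbox they were filed in
training = [
    ([0.9, 0.1], "junk"),
    ([0.8, 0.3], "junk"),
    ([0.7, 0.2], "junk"),
    ([0.1, 0.9], "inbox"),
    ([0.2, 0.8], "inbox"),
    ([0.3, 0.7], "inbox"),
]
print(knn_classify(training, [0.85, 0.2]))
print(knn_classify(training, [0.15, 0.9]))
```

Every message the user files by hand simply becomes another labelled vector, which is one reason this kind of classifier is easy to train incrementally.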

I would be most interested to see Apple's updating strategy. There are several algorithms that allow you to add new rows and columns to a matrix where you know the full SVD, but none that I know of for the truncated SVD.

For one of my graduate-level courses, I wrote a little search engine that uses LSI to cluster 1000 newspaper articles. You can play with it here. My favourite query is "Rowan Gorilla." The Rowan Gorilla is an oil rig that frequents Halifax harbour. The search engine returns articles on the oil and gas industry that contain neither the word "Rowan" nor "Gorilla" but are still topical.

[1] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 1990.

[2] Singular Value Decomposition -- from MathWorld. http://mathworld.wolfram.com/SingularValueDecomposition.html

[3] Frobenius Norm -- from MathWorld. http://mathworld.wolfram.com/FrobeniusNorm.html

[4] Artificial Intelligence Wiki: NearestNeighbour. http://www.ifi.unizh.ch/ailab/aiwiki/aiw.cgi?NearestNeighbor

Information Retrieval by ScottGant · 2004-05-19 00:52 · Score: 4, Funny

This is Information Retrieval not Information Dispersal...Information Transit got the wrong man. I got the right man. The wrong one was delivered to me as the right man, I accepted him on good faith as the right man. Was I wrong?

My name's Lowry. Sam Lowry. I've been told to report to Mr. Warrenn.
Thirtieth floor, sir. You're expected.
Um... don't you want to search me?
No sir.
Do you want to see my ID?
No need, sir.
But I could be anybody.
No you couldn't sir. This is Information Retrieval.

There you are, your own number on your very own door. And behind that door, your very own office! Welcome to the team, D7-105! Welcome to Information Retrieval

--

"Music is everybody's possession. It's only publishers who think that people own it." - John Lennon.

Re:Information Retrieval by fanfriggintastic · 2004-05-19 01:57 · Score: 1

Brilliant movie. Great quote.

--
This is not the greatest sig in the world, no. This is a tribute.

Re:no, its.. by ForestGrump · 2004-05-19 01:01 · Score: 1

i mean "them" as in macs...
very sexy laptops. They just cost too much.

So I settle for the "discount girl" you know...dell.

-Grump

--
Is it true that more people vote for the winner of American Idol, than vote for the president? -Ali G.

What Should I Do with Spam Once It's Flagged? by supertsaar · 2004-05-19 02:55 · Score: 0, Redundant

From The Article: "What Should I Do with Spam Once It's Flagged? "
Why, send it on to all your friends of course! :)

--
The Bigger The Headache The Bigger the Pill

Shades of the MCP! by Samurai+Cat! · 2004-05-19 03:01 · Score: 1

Isn't that how the MCP got started in "Tron"... something... small...?

It's time to nip this one in the bud! Before it's too late!!! :)

--

"People" using "unnecessary" quotes should be "shot".

Yes, but... by wfolta · 2004-05-19 03:58 · Score: 1

An underlying Bayesian model. Not necessarily the Bayesian model used by current "Bayesian" spam filters.

Free Software? by Bob+Uhl · 2004-05-19 04:22 · Score: 1

So, is any free software project working on this sort of thing? Given that Unix docs tend to be plain text, this kind of approach should work better here than with all those nasty proprietary binary formats. Reading the description, it doesn't sound that difficult to do, although I've not enough maths background to know for certain.

Re:Free Software? by firebus · 2004-05-19 04:59 · Score: 1

in my experience, thunderbird has a *much* better adaptive junk filter than mail.app. specifically, i see lots of false positives with mail.app.

thunderbird is weak in the other direction, with more missed spam, but tbird misses a lot less than mail.app misclassifies. this is my experience at a site with about 40 macs.

tbird also plays nicer with our (courier) imap server and is generally less of a pain in the butt.

of course, if you really want good spam filtering, you need to bring it to the server. dspam is awesome!

Apple's filters need help by cardozo · 2004-05-19 04:54 · Score: 2, Informative

I found the same as other people have noticed, that Mail.app's filter misses stuff and is hard to train.

Enter JunkMatcher Central.

It uses rule-based filtering to complement Mail.app's methods. And, as a bonus, you can have it mark what it finds as junk mail, which trains Mail.app.

It requires some tweaking, but is great, updated often, and free!

Re:Apple's filters need help by Anonymous Coward · 2004-05-19 07:25 · Score: 0

Excellent recommendation -- it works great, and it's killing all those "debt" messages Safari used to miss. Thanks!

secret by balusiku · 2004-05-19 05:57 · Score: 1

no, it doesn't use white magic

sure, it uses dark voodoo magic! :-)

Agreed - JunkMatcher is awesome by PowerMacDaddy · 2004-05-19 07:07 · Score: 1

I average a total of about 35-50 spams a day across 6 email accounts. Mail.app's built-in junk filtering got it down to maybe 3 a day. When I added in JunkMatcher, it's now down to about 3 a week. And you can't beat the price!

Of course, having half of those accounts on an Xserve G5 running OS X Server 10.3.3 and referencing about a half-dozen blacklists helps, too. :)

--
MacTacToe - for every problem, an elegant solution

You can do that in Mail.app too... by SuperKendall · 2004-05-19 08:25 · Score: 1

It's just nice that you can leave it on and have to explicitly load it for mail marked as junk.

I was using Thunderbird (well, really the earlier mozilla mail) for a few years, but I have to say it got really slow with a lot of messages and Mail.app has just the right set of features for me at the moment. I might move back to Thunderbird someday though.

--
"There is more worth loving than we have strength to love." - Brian Jay Stanley

Slashdot Mirror

273 comments