Mozilla Adding Spam Filters
ksheka writes "Mozilla mail now has Spam Filters, using a Bayesian filtering method, no less. This is a very good thing, because it learns from the spam you receive, and constantly modifies itself, based on new spammer techniques!"
Now the list of 101 Mozilla features that IE doesn't have can be amended to 102 features! :)
Does the name Pavlov ring a bell?
Interesting thought, but they would have to have a large sample of YOUR valid email to train on...
"I'll have a Guinness, no wait, make that a Coors Light" -Grad student I work with, who shall remain anonymous...
But the spammers will develop Bayesian filters of their own to find the best content that will sneak by your filters
No they won't, unless the pattern (if there is one discernible in the S/N ratio) of replies they receive changes. Since most spam, as far as spammers are concerned, disappears into a black hole, they have no way of learning how your filters are working.
And that's good filterin'!
Call me old fashioned, but I like a dump to be as memorable as it is devastating - Bender
Its size is getting bigger and bigger.
Compile Mozilla from scratch, and you'll see that you can custom tailor the build and cut out a lot of cruft.
The source package is far larger than the binaries! Then there's the wait in compiling the damn thing. No (L)User is going to do that. Maybe us geeks (and I do use the source, Luke), but certainly not a "normal" user.
The problem here is that binary distributions package it all together
So download the Net installer and choose only what you want?
I'm not a prophet or a stone-age man,
I'm just a mortal with potential of a super man.
Nonsense. It's impossible. First of all, they don't have access to much of the mail I want to let through-- although my mailing list traffic certainly qualifies, so let's assume that's the only mail I get and that they know I am receiving it.
There will still need to be header information and actual spam content in the spams themselves for those mails simply to not be repeats or dada-esque cutups of posts to the mailing list. That is, there must be content unique to the spam that no normal sender on the list will include.
Because of this, and the fact that so-called Bayesian spam filtering works by scoring all the words in an email and then evaluating the email based only on the extremes, there is little likelihood-- since the spam must still contain spam words to have any point at all-- of those words not being on the extreme word list. After all, if the same words are appearing in both spam and not-spam mails, they will be given a spam-probability that is not extreme. So all those words in common will be ignored and only the spam words will be looked at-- and the spam will still be filtered.
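The scoring described above can be sketched roughly as follows. This is a minimal illustration, not Mozilla's actual implementation: the tokenizer, the 0.01 to 0.99 clamp, the 0.4 score for unknown words, and the cutoff of 15 "extreme" words are all assumptions chosen for the example.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase words, one vote per message.
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))

def train(spam_msgs, ham_msgs):
    # Count how many spam and non-spam messages contain each word.
    spam_counts = Counter(t for m in spam_msgs for t in tokenize(m))
    ham_counts = Counter(t for m in ham_msgs for t in tokenize(m))
    n_spam, n_ham = len(spam_msgs), len(ham_msgs)
    probs = {}
    for tok in set(spam_counts) | set(ham_counts):
        s = spam_counts[tok] / n_spam
        h = ham_counts[tok] / n_ham
        # Words common to both piles land near 0.5; clamp so nothing is certain.
        probs[tok] = min(0.99, max(0.01, s / (s + h)))
    return probs

def spam_probability(text, probs, n_extreme=15, unknown=0.4):
    scores = [probs.get(t, unknown) for t in tokenize(text)]
    # Keep only the words whose scores are farthest from neutral (0.5).
    scores.sort(key=lambda p: abs(p - 0.5), reverse=True)
    spam_like = 1.0
    ham_like = 1.0
    for p in scores[:n_extreme]:
        spam_like *= p
        ham_like *= 1.0 - p
    return spam_like / (spam_like + ham_like)
```

With a training set where the mailing-list words appear in both piles, those words score about 0.5 and contribute nothing, so the one spam-only word decides the outcome.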
I do not have a signature
Really, eh? I mean, I turned on CNN today and they were reporting a story that I'd already heard on ABC News! The nerve! I sent them a letter saying "Um, excuse me, but I already heard that on ABC l053rZ!" They haven't replied yet.
To make matters even worse, when I was on the train I overheard two people talking about the Israeli conflict. I couldn't believe it! I mean, I heard someone talking about that LAST YEAR for crying out loud! That is so 2001! I told them that they're l4m3rZ for being so dated. They just seemed to ignore me though.
$5 / month hosted VPS on linux = awesome!
Since you must first download the content for client-side filtering to work you waste bandwidth. If you are truly bombarded by spam you still lose...your mail spool still gets filled up with stuff you don't want, your data transfers compete for bandwidth with the spam, storage hardware works harder storing data that will only be deleted. It raises everyone's costs, including yours.
We need to block undesired mail at the host, not filter it at the client. That way the spam never gets sent, the spammer gets the message that their attempt was futile, and bandwidth is conserved. Many ISPs already provide this service...we need to improve on it. And we need better tools for identifying and dealing with spammers. The current mail standards are woefully inadequate to this task.
There needs to be a tiered structure with filters. The main one would be at the ISP level. It would only filter out obvious spam (like spam going to 2000 users at that ISP). The second tier would be at the client side and would have a certain level of intelligence in identifying spam. One feature that I'd like (it might already be available) is if it could automatically send an email back to the sender saying the email address doesn't exist. This should be done at the server level and/or client level. This could possibly help in removing your email from such lists. As far as what to do with the spam at the client level, I think that it should be sent to your main inbox but just marked as spam (maybe greyed out or something). Like new mail is always bold and once you read it it goes to a regular font. Well, spam could be just greyed out. That way you would never miss something that the spam filter had a false hit on.
Some of us don't even keep an address book. Then again, nowadays 80% of my mail is spam. I guess that means a spam filter that compares against my address book would not only be 100% effective in eliminating spam, but also only 20% of the mails it wipes would be false positives. Good stuff that filtering software :P
How about a spamcop-like plugin? Or something that can submit my message plus contents to SpamCop?
If using SpamCop, there should be a way to still show the site's banners, because they deserve to get paid for their bandwidth I'm using up.
I'd love to just be able to right-click on a message and report it to the various abuse/postmaster accounts without having to copy my whole message plus headers, and pasting such into their web form. SpamCop seems to be pretty good at tracing the origins of messages, so I'd love to be able to leverage that sort of functionality.
You can accomplish anything you set your mind to. The impossible just takes a little longer.
That's a really cool idea in theory. In reality, you have to trust everybody on the internet to decide what is and isn't your spam.
I mean, you've been on the internet before, right? You've seen the other people here, too? Think about it.
=Brian
There is nothing so good that someone, somewhere, will not hate it.
I would like to understand the choice of Bayesian more. As far as I know Bayesian is good for classifying based on *belief* and can be pretty good when only partial evidence is available to the network. This is great for Marketing activities, e.g. sending out mass emails to a segment of a database :) . However as this is _my_ email and mission critical to me, just a simple belief that something is spam is not enough.
In my experience, 99% of spam can be caught with static rules (am I in the TO or CC line gets a bit under half the spam I receive). Taxonomical analysis of the subject and body can get the rest.
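The static rule mentioned above (am I on the To or Cc line?) might look something like this. It is only a sketch: the function name and the idea of passing in a list of your own addresses are assumptions made for illustration.

```python
from email import message_from_string
from email.utils import getaddresses

def addressed_to_me(raw_message, my_addresses):
    # Parse the headers and collect every address on the To and Cc lines.
    msg = message_from_string(raw_message)
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    mine = {a.lower() for a in my_addresses}
    # True if any listed recipient is one of my addresses.
    return any(addr.lower() in mine for _name, addr in recipients)
```

A message for which this returns False was delivered via Bcc or a list, which is the pattern the rule treats as suspicious.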
Bayesian seems like overkill, or maybe even a bad fit. Let's face it, the other well known use for Bayesian is the famous Microsoft Office Paper Clip!!! And that is about as useful as the proverbial ashtray on a motorbike!!
Gary
I care. I'm busy, and if one of my friends needs a ride tonight I'll read it. If that same friend is just wondering how I'm doing, I won't -- unless I'm not at all busy.
Further, some of us actually have multiple threads of conversation going with our friends, or archive our messages and occasionally go back through them. I may be simultaneously talking with someone about (say) some PHP problems they're having and discussing motorcycle riding. If I want to go back and reread exactly what problem he was having with PHP, I don't want to have to sort through the messages where he's trying to convince me I should be riding a crotch rocket instead of a cruiser.
My friends understand this, and are polite enough to use the subject header in their emails. If they don't do that once, I'll ask politely that they start. If they don't do it again, I may well be rightfully a bit annoyed.
I love mozilla, and use it as my main browser. However my biggest complaint is that all the components (browser, mail, composer, etc) should be separate apps. I don't like the fact that if my browser crashes, so does my email reader, and vice versa.
I tried to find some documentation on how to achieve this, but there was none to be found. Does anyone know how to do this, so that I can use Mozilla's mail rather than the flaky mail app that comes with OSX?
I am completely against all client-based spam filters. This essentially does nothing to address the most serious repercussion of spamming, and that's exploitation of third-party networks & bandwidth. That's aside from the fact that client-based spam filtering is most likely the least effective solution, and more likely to stop legitimate mail than other methods such as known spam relay blocking.
Ultimately, the only way we're going to really curtail spam is by enacting harsh *criminal* penalties for mail relay and server hijacking, which is the standard method by which most spam is distributed. It's true that these activities are already considered illegal but the law enforcement agencies are either unwilling to take action because there's a minimum threshold of monetary damages required, or they're ill-equipped knowledge and technology-wise to aggressively go after these people.
And Puleeze don't even bother with the ineffective, "let the industry regulate itself" argument, which doesn't work. Most spammers are small "cell groups" that move around a lot; most don't have any money in the first place; only criminal penalties are going to work, and client-side and industry regulated efforts don't stop their efforts at all and just drive bandwidth charges up for the rest of us.
It seems too many people distrust spam filters because of the chance of accidentally blocking an important legitimate message as if it were spam.
Many spam filters are strictly binary: a message is either spam, or not spam. This is not ideal, because "gray area" messages - between these two extremes - will likely not be sorted correctly.
I propose adding a new sort option to email clients.
Sort by Spam Probability
This would be an additional field that can be displayed in a message list, similar to "To", "From", "Subject", and the like. As in the article, probabilities would range from 99% (almost certain spam) to 1% (most likely an innocent message). Notice that 100% accuracy either way is not claimed.
This way, the user can see up front the messages that are most likely not spam. The spam messages will be relegated to the bottom of the list, possibly colored to indicate their likelihood of being spam. If there is a message in the "gray area", it will most likely appear in the list between the legitimate messages and the spam, so the user will have a chance to see the message and make a decision, without the message being lost in the shuffle.
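A rough sketch of the proposed sort, assuming each message already carries a probability from some filter. The `Message` class, the `label` helper, and the 0.2 and 0.8 cutoffs for the "gray area" are made up for this example.

```python
from dataclasses import dataclass

@dataclass
class Message:
    subject: str
    spam_probability: float  # assumed range: 0.01 (innocent) to 0.99 (spam)

def label(p, ham_below=0.2, spam_above=0.8):
    # Hypothetical cutoffs; a client could use these to pick a display color.
    if p < ham_below:
        return "ok"
    if p > spam_above:
        return "spam"
    return "gray"

def sort_by_spam_probability(messages):
    # Likely-legitimate mail first, likely spam at the bottom of the list.
    return sorted(messages, key=lambda m: m.spam_probability)
```

Gray-area messages end up between the two groups, where the user still sees them and can decide.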
This would be a great feature. I hope this gets into Mozilla's mail client.
(BTW, another feature that would be great to see in mail clients would be datestamping of the actual time the message was downloaded. Many spammers, and innocent people with misconfigured clocks, send emails with wild dates that are not to be trusted. You can see this in yearly archives of GNU "mailman" mailing lists! Datestamping emails as they are downloaded will also keep mailboxes in order when sorted by date, as newly arrived messages will always be at the bottom, instead of being scattered throughout the inbox. But sorting by spam probability will probably become more popular than sorting by date....)
Dr. Demento On The 'Net!
As a popfile user, I'm quite impressed with the catch rate possible with Bayes' theorem spam filters. However, I suspect this will decrease in effectiveness over the long term.
Spammers are likely to respond to filters like this by encoding text in ways the filters can't read but humans can (e.g. having a .gif file of the text, loaded by an HTML statement in the message).
Statistical filters would need to have some kind of built-in OCR routine before they could be effective against that trick, and some respectable mailing lists are using images as well, so you can't just filter all mails with images attached.
In the long term, therefore, I suspect that filters that use a network database of spam will be more successful.
The big problem with this is spam still gets to the server. :(
Just thought of this now... but it seems like almost all spam these days contains a whole bunch of HTML tags. Maybe someone should write a server plugin to instantly reject all mail containing <table>, instantly adding the sending IP to an iptables DROP rule.
There's little legitimate e-mail with tables, unless you count paypal, datek, and travelocity news and that kind of crap. But we could always add a list of "good" IPs.
I know there are server solutions, but all make me a bit queasy. I just want something that will detect funky activity on the fly and instantly deny all access to that IP.
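Purely as a sketch of the idea above, assuming the check is for an HTML table tag and that the server hands the plugin the raw message plus the sender's IPv4 address. The function names and the `good_ips` allowlist are hypothetical, and the code only builds the iptables command rather than running it.

```python
import ipaddress
import re
from email import message_from_string

TABLE_TAG = re.compile(r"<table\b", re.IGNORECASE)

def contains_html_table(raw_message):
    # Look through every text part of the message for an opening table tag.
    msg = message_from_string(raw_message)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True) or b""
        if TABLE_TAG.search(payload.decode("latin-1")):
            return True
    return False

def drop_rule(sender_ip, good_ips=()):
    # Validate the address (raises ValueError on junk) and honor the allowlist.
    ip = ipaddress.IPv4Address(sender_ip)
    if any(ip == ipaddress.IPv4Address(g) for g in good_ips):
        return None
    # Command to append a DROP rule for this source; not executed here.
    return ["iptables", "-A", "INPUT", "-s", str(ip), "-j", "DROP"]
```

Returning the command as a list leaves it to the caller to decide whether to log it, queue it for review, or actually run it as root.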
After collecting 87 megs worth of spam and a similar amount of non-spam I decided to implement the so-called 'Bayesian' method of spam filtering by way of popfile - it's a pretty slick concept; Perl code that acts as a POP3 server on your own machine - simply drop your collected spam and non-spam in to the appropriate bucket, have popfile go through them and create its indices and set up your mail client to connect to 127.0.0.1 with your username being 'my.pop.server:loginname'.
I know I've got a particularly difficult task for this filtering technique; I get an awful lot of spam that comes in every day (~100 messages per 24 hour period), some of it I actually want (I run an underground music site, and in some cases I subscribe to opt-in lists that result in something that looks like spam), the rest I could care less about.
My results have been decent for the most part; 100% of my spam ends up in my Spam folder, however there is a handful of messages that I wish to keep that end up there as well. For the most part they are the above-mentioned 'borderline' pieces of spam (which I have been careful to put aside and have indexed by popfile anyway); I can only hope that more time and samples will yield better results. I was however surprised to find that some of the e-mails I was getting from friends were falling in to the Spam mailbox anyway; after taking a closer look, I can see why, they use an awful lot of otherwise unmentionable words - but my suspicion is that I haven't gotten enough of these 'good-emails-with-bad-words' to make the filtering truly effective.
Nonetheless, it is nice to have all of my spams seemingly guaranteed to drop in to my "Spam" folder, but my usual task of manually filtering messages that made it past my existing filters in to my Spam folder has been replaced with a different (albeit quicker) task; taking messages out of my spam folder and putting them where they really belong.
Bottom-line: I still have to visually scan through my mail for legitimate messages amongst the thicket of items informing me about the exciting exploits of women at the farm, wonderful business opportunities from Nigeria and suggestions that I should buy Viagra by the boatload... all this despite having collected a well organized and rather large collection of spam/non-spam mails. I'll stick with it for a while as I'd like to try it out and give it a proper chance, but I suspect that if you're in a similar situation then you should be prepared to tough it out.
1) How much time do you spend training your paperclip in Office?
How much time are you going to spend on training your spam filter? If you are unwilling to invest a little time and effort in developing a solid set of values that fit your personal pattern of behavior, then Bayesian filters are indeed a poor match for you.
2) What harm is a false positive?
If you are automatically deleting anything that is marked as a positive for spam, then you are playing roulette with your email. I would generally recommend diverting email classified as spam by your filter to a folder, especially if the filter is relatively new and has had very little experience with your patterns of use. Set an expiry on your spam folder, and check it from time to time to see if something fell through the cracks. Mozilla has a handy feature that allows you to simply conceal spam from view, which works adequately, although I dislike the potential performance hit in a large folder.
Considering how important your email is to you, you should certainly consider applying a little diligence to how you manage it.
Weapons of Mass Analysis
I like the ability to block images from a server, but it'd also be nice to have a similar feature for plugins and Java applets.
A lot of ad companies are now using really annoying flash. Blocking images doesn't stop these.
"You spoony bard!" -Tellah