Mozilla Adding Spam Filters
ksheka writes "Mozilla mail now has Spam Filters, using a Bayesian filtering method, no less. This is a very good thing, because it learns from the spam you receive, and constantly modifies itself, based on new spammer techniques!"
The news article makes it sound like this feature is up and running; in reality it is only partially phased in, alpha-stage stuff.
It will be great when it's more complete but there is a lot of work to do yet.
- Toby
Man, a perfect place for a goatse link, and you didn't put it in. Sigh. Kids these days.
Best Slashdot Co
http://research.microsoft.com/~horvitz/junkfilter.htm
-- Outside of a dog, a book is man's best friend. Inside a dog, it's too dark to read
Interesting thought, but they would have to have a large sample of YOUR valid email to train on...
"I'll have a Guinness, no wait, make that a Coors Light" -Grad student I work with, who shall remain anonymous...
Bayesian technique is very good for the sort of abstract classification task that spam represents. It would be an interesting hack to try and train a network to categorize based solely on message body... I do, however, hope that their team has opted for practicality over just hack value and the network will also use such extremely relevant data as header information and comparing address versus address book (an e-mail from someone not in your address book is not necessarily spam... but it is more likely to be).
lysergically yours
I wonder if a similar technique could be used in the browser. Automatically block images or popups based on previous ones you have blocked.
Now that would be very nifty!
I assume the filtering statistics live on the client side. What about IMAP? If I open up Mozilla on a new machine, are all my spam statistics lost (presumably rendering the junk mail filtering statistics I've accumulated useless on the new machine)?
It would be neat if, with IMAP accounts, Mozilla just stored the statistics in a file on the IMAP server instead of on the client.
It's 10 PM. Do you know if you're un-American?
Well, most of my spam is already sent to /dev/null by the SpamAssassin ninja.
But, for those that make it past the email shadow warrior, I guess Bayesian filters are a double whammy they'll never survive... Mwahahahaha!
Kudos to the Mozilla programmers!
The right to offend is far more important than the right not to be offended. (Rowan Atkinson)
What happens when Microsoft attempts to enforce this patent?
Mother is the best bet and don't let Satan draw you too fast.
Nonsense. It's impossible. First of all, they don't have access to much of the mail I want to let through-- although my mailing list traffic certainly qualifies, so let's assume that's the only mail I get and that they know I am receiving it.
There will still need to be header information and actual spam content in the spams themselves for those mails simply to not be repeats or dada-esque cutups of posts to the mailing list. That is, there must be content unique to the spam that no normal sender on the list will include.
Because of this, and the fact that so-called Bayesian spam filtering works by scoring all the words in an email and then evaluating the email based only on the extremes, there is little likelihood-- since the spam must still contain spam words to have any point at all-- of those words not being on the extreme word list. After all, if the same words are appearing in both spam and not-spam mails, they will be given a spam-probability that is not extreme. So all those words in common will be ignored and only the spam words will be looked at-- and the spam will still be filtered.
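To make the "only the extremes count" argument concrete, here is a rough Python sketch of that kind of scoring. It is an illustration and not Mozilla's actual code: the function name, the cutoff of 15 words, the 0.4 default for unknown words and the 0.01/0.99 clamp are all assumptions, loosely following the scheme described in Paul Graham's write-up.

```python
def spam_score(words, spam_prob, n_extreme=15, unknown=0.4):
    """Combine the most extreme per-word spam probabilities into one score.

    spam_prob maps word -> estimated probability that a mail containing
    the word is spam. All names and defaults here are hypothetical.
    """
    probs = []
    for word in set(words):
        p = spam_prob.get(word, unknown)
        # Clamp so that no single word can force the score to exactly 0 or 1.
        probs.append(min(0.99, max(0.01, p)))

    # Keep only the words farthest from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    extreme = probs[:n_extreme]

    prod_spam = 1.0
    prod_ham = 1.0
    for p in extreme:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)
```

Words that show up equally in list traffic and in spam sit near 0.5, so they contribute almost nothing to either product; the words unique to the spam are what move the score.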
I do not have a signature
It is so annoying to get an e-mail without a subject. My spam filters actually bump you a little bit closer to being considered spam if there is no subject. I consider it to be a required header.
For one, I sort my mail by thread. While Mozilla will use reference headers to thread messages, the fallback is the subject. Without a subject your message would be tossed in the thread with the other losers who also forgot their subject.
The easy way to keep that dialog box from popping up when you send a mail is to...put a subject on the message.
If you want a spell checker, go to the Netscape FTP server, find the XPI file for the spell checker, and install it.
E-mail is Outlook's domain. Not IE.
It's possible to net-install Mozilla without installing Mozilla Mail, but the default setting includes both. It's possible to net-install IE without installing Outlook Express, but the default setting includes both. Thus, it is a fair comparison.
100. Bugzilla - OK, lots of people use this, but Bugzilla != Mozilla. So it's not like Mozilla has built-in Bugzilla features... This is unrelated to the list.
I think the point of that entry was that unlike IE's bug database, which only Microsoft employees see, Mozilla's bug database is 99% open to the public (the other 1% primarily covers unfixed security vulnerabilities).
Will I retire or break 10K?
1. Says "someone is testing something and you get $NN.00"
2. Says anything like "angels watching over us" or "a mother's poem" or other such bullshit.
3. Says "This is really funny"
4. Says "We'll be over on Tuesday right during dinner when you are trying to put the moves on our daughter/your wife."
Umm, not the last one, really. Just got on a roll.
PDHoss
======================================
Writers get in shape by pumping irony.
$5 / month hosted VPS on linux = awesome!
procmail filters, SpamAssassin, AND the new Mozilla spam filters.. can we make a law that will make it legal to find the spammers and execute them in public?
Pleeeease??
You really want server-side filtering. I do that on my IMAP server with procmail, though not Bayesian. A quick google with "procmail bayesian filter" turns up quite a bit of interesting stuff to sift through. Of course if it's not your IMAP server, you're back to client-side solutions.
The living have better things to do than to continue hating the dead.
This approach is more commonly called "Naive Bayes" classification in the field of machine learning. It considers each word to be a feature (dimension), and it is naive because it also considers each word in an email to be conditionally independent of all other words in the document (which is not true, but really useful in practice).
The author of the web page on using this technique to classify spam (Paul Graham) has a better explanation of Naive Bayes on this web page.
I've written my own naive Bayes classifier to identify spam, with less positive results than he reports. However, naive Bayes can be a very effective technique, and I can believe his results.
The two things you have to beware of when using it are "smoothing" probabilities of words you've never seen (you don't want them to always be zero, as straight naive Bayes will give you), and you need LOTS of training data for naive Bayes to work well. That means that you need to already have a fair amount of spam to identify spam well.
You can see a paper I wrote on using naive Bayes to classify hard drive failures here, or look for more stuff on naive Bayes on Google. Also, don't reinvent the wheel: Andrew McCallum has written a very good toolkit for doing these sorts of things in Bow.
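For anyone who wants to see the smoothing point in code, below is a minimal Python sketch of a naive Bayes classifier with add-one (Laplace) smoothing. It is not the classifier described in this comment, nor Bow's API; the class and method names are made up for illustration, and it assumes at least one training mail per class.

```python
import math
from collections import Counter


class NaiveBayes:
    """Toy two-class naive Bayes over word lists (hypothetical names)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant; 1.0 is add-one smoothing
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, words, label):
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def log_score(self, words, label):
        # Log prior: fraction of training mails with this label.
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        counts = self.word_counts[label]
        # The extra +1 reserves probability mass for never-seen words,
        # so an unknown word never drives the product to zero.
        denom = sum(counts.values()) + self.alpha * (len(self.vocab) + 1)
        for word in words:
            score += math.log((counts[word] + self.alpha) / denom)
        return score

    def classify(self, words):
        return max(("spam", "ham"), key=lambda label: self.log_score(words, label))
```

Working in log space avoids underflow on long mails. The need for lots of training data still applies: with only a handful of examples the smoothed estimates are dominated by alpha.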
Since you must first download the content for client-side filtering to work you waste bandwidth. If you are truly bombarded by spam you still lose...your mail spool still gets filled up with stuff you don't want, your data transfers compete for bandwidth with the spam, storage hardware works harder storing data that will only be deleted. It raises everyone's costs, including yours.
We need to block undesired mail at the host, not filter it at the client. That way the spam never gets sent, the spammer gets the message that their attempt was futile, and bandwidth is conserved. Many ISPs already provide this service...we need to improve on it. And we need better tools for identifying and dealing with spammers. The current mail standards are woefully inadequate to this task.
what if in addition to this someone put together a company that the mozilla email client can report back to about what is labelled as spam and the filters it created along with the headers of the message (or even the entire spam) and grab filters from others who receive some spam that you have yet to receive? it would be like a big distributed computing anti-spam project.. then if we were able to make the filters useable by sendmail to block at the server...
I'm almost thinking a distributed and automated anti-spam system like that could completely crush the spam problem within a 12 month period.
or I may be completely out of my mind.
Do not look at laser with remaining good eye.
No they likely aren't. They have this cool thing called Bugzilla (http://bugzilla.mozilla.org/) which is designed to track bugs and new feature requests. If you want to be heard, that's the place to submit, not here.
It's like, if you want to submit a complaint to Microsoft, you write them a letter to their company address instead of, say, writing your complaint as graffiti on a New York subway car. Wait a minute, actually, you might run into a MS employee doing butterfly graffiti, so that's a bad analogy... Plus, a subway isn't a good metaphor for Slashdot. The /. crowd is much scarier.
Well, ok I am impressed that Mozilla is implementing spam filtering abilities in their MUA. I AM NOT impressed with Bayesian spam filters AT ALL. I've been using Mac OS X's Mail.app since I switched to OS X. It's not my primary MUA but I am letting it POP out a copy of all my mail and "learn" from it. It does a pretty good job of finding maybe 80% of the spam I get. However it has a BAD false-positive rate. I mean hell it's been flagging CERT advisories as spam. That kind of crap is really annoying. It's flagged co-workers' mail as spam numerous times (and even though I happen to agree... :) ). The biggest problem I have with Bayesian as a mail admin is that I am constantly dealing with spam. Users forward it to me. I receive a number of spam bounces. I work in spam all the damned time. That's the problem. I need a MUA with Bayesian filters that are smart enough for me to tell them to ignore all mail from certain domains or that went to certain accounts. All of the Bayesian filters built into MUAs I've worked with so far can't do things like that. It's really annoying given the position that I'm in.
This is something that Emacs has in the GNUS client, you score emails up and down and it starts adding filtering rules. Using LISP you could extend this to do some pretty funky moderating.
Every problem is reducible to a previously solved problem or by definition is unsolvable - Church Turing Thesis.
An Eye for an Eye will make the whole world blind - Gandhi
There needs to be a tiered structure with filters. The main one would be at the ISP level. It would only filter out obvious spam (like spam going to 2000 users at that ISP). The second tier would be at the client side and would have a certain level of intelligence in identifying spam. One feature that I'd like (it might already be available) is if it could automatically send an email back to the sender saying the email address doesn't exist. This should be done at the server level and/or client level. This could possibly help in removing your email from such lists. As far as what to do with the spam at the client level, I think that it should be sent to your main inbox but just marked as spam (maybe greyed out or something). Like new mail is always bold and once you read it it goes to a regular font. Well, spam could be just greyed out. That way you would never miss something that the spam filter had a false hit on.
The "blah blah blah" is roughly, "You have not specified a subject. Would you like to enter one now?" Perhaps you're right, it should be changed. Instead, it should say, "You're about to send an email message without a subject. That's an amazingly rude thing to do and likely to irritate the recipient as it makes it harder for them to prioritize their incoming mail and harder to distinguish from spam. Because this is such a terrible idea, you should enter a subject line below. If you fail to enter a subject, the default entry of 'I'm a idiot, please delete this message without reading it' will be used."
Search 2010 Gen Con events
Use Gotmail, which downloads your Hotmail messages to an mbox-style file. Or use hotwayd, which appears like a POP3 server running on localhost and uses WebDAV to get messages from Hotmail (like Outlook Express). Either way, no web-bugs will get activated.
The added advantage is that you can pipe these through procmail/spamassassin just like ordinary incoming mail, and not have to manually delete all that spam.
Preferences -> Privacy & Security -> Images, you can turn off images in mozilla, or only in mail/news.
I personally don't really care about all the junk emails I get. I don't get that many, and I can pretty much tell without looking at them. They go straight to /dev/null.
Spam is such a horrible thing though. I work at a webhosting company. I'm the one that has to track down the site with the old formmail.pl, removing 'aol.com' and 'yahoo.com' from the hosts to relay for, trying to find out who the hell added them so I can murder them. I'm the one clearing out the mail queue with 100,000 mails. I'm the one clearing the mail queues of people who thought it was a good idea to check the 'open relay' option in plesk. I'm the one that has to deal with people bitching about how their mail isn't working or didn't get through.
Just the other day, I had a raq2 where someone had apparently put yahoo.com and excite.com in the hosts to relay for. Yay! That's what attracted the spammers. Now I get a request every second to send mail to 50 people at once. Now that I've removed them, none of them are getting through. But it's a raq2, 133 MHz. It has to go through all 50 addresses and say 'relaying denied' and log it. It can't keep up! syslogd is taking up all the CPU and logging things from hours ago because it's behind. Quickly, sendmail quits listening on port 25 (but the spam attempts keep coming somehow).
So I get the idea to block their IPs, since they seem to be using the same IPs. But oh guess what, they're using open proxies and have about 400 IPs. Well, I did this for about 5 hours, writing scripts to grab the repeated IPs out of the maillog, adding them all to my sendmail access lists. Now every time they try to send mail, it blocks them instead of saying relaying denied 50 times for each request. But a minute later, I get a few new IPs and it starts all over again. I have an access list about 6 pages long. It's doing ok, blocking about 90% of them, but every once in a while, they get a new IP and sendmail is brought to a stop.
Oh yeah, and my /var/ partition is only 200MB, 50MB free. And the maillog is growing at about 10MB a day. So now I'm babysitting this server every day until the spam attempts stop. I don't think there's any way around it unless I get sendmail to check for open proxies. But I don't know how to do that, and I don't think they trust me enough to make such changes to sendmail.
So oh well, mail is getting lost every day on this server and it's been rendered horribly slow for its users.. just because some moron noticed it would send some emails for him and started up his scripts.
Spam causes so many problems on the server level. It's what is making mail an unreliable service. I could care less about spam filters on my mail client. These are the things that make spam evil!
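For anyone stuck doing the same maillog cleanup by hand, the kind of script described above (pull the repeat-offender IPs out of the log, turn them into access list entries) might look roughly like this in Python. The log format is an assumption: it expects rejected attempts to contain the text "relaying denied" with the client address in square brackets. The function names, the threshold and the sample addresses are made up, so check them against your own maillog first.

```python
import re
from collections import Counter

# Assumption: the client address is logged in square brackets,
# e.g. relay=[203.0.113.7]. Adjust the pattern to your log format.
IP_IN_BRACKETS = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def repeated_offenders(log_lines, threshold=3):
    """Return IPs seen in at least `threshold` rejected-relay lines, busiest first."""
    hits = Counter()
    for line in log_lines:
        if "relaying denied" not in line.lower():
            continue
        match = IP_IN_BRACKETS.search(line)
        if match:
            hits[match.group(1)] += 1
    return [ip for ip, count in hits.most_common() if count >= threshold]


def access_entries(ips):
    """Format one tab-separated REJECT line per address for an access list."""
    return [f"{ip}\tREJECT" for ip in ips]
```

This only automates the bookkeeping. As noted above, the spammers rotate through new open proxies, so a list built this way is always a step behind.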
--- Does the name Pavlov ring a bell?
Two brothers immigrated to a mostly Catholic country, hungry and looking for work. Pavlov, whose forehead was quite thick, found work at a monastery bell tower. The monks taught him to tell time, then sound the bell when appropriate. Not too bright, Pavlov missed the part about how to sound the bell. So he notes the time on his handy wristwatch, climbs the belltower, inches up to the edge of the platform, and dives face first into the massive centuries-old bell. KKKLLLAAANNNGGG!!! Poor Pavlov falls to his death hundreds of feet below.
Apparently, monks don't communicate very well. No one in the crowd gathered around Pavlov's remains could identify him. Finally one monk admits, "I never caught his name, but his face sure rings a bell."
Mysteriously, a man steps forward from the crowd and insists on taking Pavlov's place as caretaker of the belltower. One of the monks removes the wristwatch from Pavlov's arm, gives it to the mystery man, and proceeds to indoctrinate him in his duties. On the hour, just like Pavlov, our mystery man ascends the tower, perches on the edge -- but this time wielding a massive sledgehammer. He leaps towards the bell and smashes it with Thor-like fury. KKKLLLAAANNNGGG!!! The poor fool falls to his death in a manner very similar to Pavlov's.
Much like deja vu, a muted crowd gathers around the mystery man's remains. After an extended silence, one monk asks, "Does anyone know this man's name?" Answers another, "No, but he's a dead ringer for his brother!"
However, I've heard that popup blockers and tabbed browsing are making their way into IE (and MS employees can already use these features)
IE is the most widely used browser and pop-up advertising has become part of the Internet Experience. If MS decides to incorporate popup blocking in IE, then the pop-up advertising business is RUINED! They'll just be another group victimized by a huge corporation. These people have families to support and will be forced to send their children to public schools. Won't someone PLEASE think of the children?
And all this news about fixing vulnerabilities within Windows is going to affect the virus community as well (both authors and anti-virus). Worrying about vulnerability exploits has also become part of the computer experience.
Won't someone PLEASE think of the virus writers?
This is not my sig.
"...good morning, Dave. You have received spam again. I have been analyzing the spammer's patterns, and I believe I have figured out the most efficient way to protect humans from the harm of spam while adhering as closely to the First Law as possible. To protect them from spam, humans must be pushed. They must go down the stairs. Please go stand by the stairs, so I can protect you."
Software that only does mail filtering encourages spammers. The technically knowledgeable people don't get spam, so they stop worrying about it.
All mail filters should also use a service like SpamCop, so that the spammers lose their internet service accounts as the spam is filtered.
I send Spamcop all my spam. Spamcop analyzes it automatically and sends a message to the Internet Service Provider. I use the free Reporting only service.
I may drop Evolution in favor of Mozilla Mail.
I tried to find out if the Evolution dev team was going to do this. The only thread I could find on the topic is here:
http://lists.helixcode.com/archives/public/evolution/2002-August/020845.html
Doesn't look like it's part of their vision.
Software Wars
It seems too many people distrust spam filters because of the chance of accidentally blocking an important legitimate message as if it were spam.
Many spam filters are strictly binary: a message is either spam, or not spam. This is not ideal, because "gray area" messages - between these two extremes - will likely not be sorted correctly.
I propose adding a new sort option to email clients.
Sort by Spam Probability
This would be an additional field that can be displayed in a message list, similar to "To", "From", "Subject", and the like. As in the article, probabilities would range from 99% (almost certain spam) to 1% (most likely an innocent message). Notice that 100% accuracy either way is not claimed.
This way, the user can see up front the messages that are most likely not spam. The spam messages will be relegated to the bottom of the list, possibly colored to indicate their likelihood of being spam. If there is a message in the "gray area", it will most likely appear in the list between the legitimate messages and the spam, so the user will have a chance to see the message and make a decision, without the message being lost in the shuffle.
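As a rough sketch of how little the sort itself would take once the filter hands back a number, here is some Python. The Message fields, the 10%/90% gray-area thresholds and the row layout are all invented for illustration; nothing here reflects how Mozilla stores or displays mail.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    subject: str
    spam_probability: float  # 0.01 (likely innocent) .. 0.99 (almost certain spam)


def sort_by_spam_probability(messages):
    """Least spammy first, so likely spam sinks to the bottom of the list."""
    return sorted(messages, key=lambda m: m.spam_probability)


def label(message, low=0.10, high=0.90):
    """Classify a message as 'ok', 'gray' or 'spam' (thresholds are assumptions)."""
    if message.spam_probability >= high:
        return "spam"
    if message.spam_probability > low:
        return "gray"
    return "ok"


def render_row(message):
    """One line of the message list, with the probability as its own column."""
    return (f"{message.spam_probability:>4.0%}  {label(message):<5} "
            f"{message.sender:<24} {message.subject}")
```

The gray band lands between the two extremes in the sorted list, which is exactly where the user gets a chance to look at it before it is lost in the shuffle.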
This would be a great feature. I hope this gets into Mozilla's mail client.
(BTW, another feature that would be great to see in mail clients would be datestamping of the actual time the message was downloaded. Many spammers, and innocent people with misconfigured clocks, send emails with wild dates that are not to be trusted. You can see this in yearly archives of GNU "mailman" mailing lists! Datestamping emails as they are downloaded will also keep mailboxes in order when sorted by date, as newly arrived messages will always be at the bottom, instead of being scattered throughout the inbox. But sorting by spam probability will probably become more popular than sorting by date....)
Dr. Demento On The 'Net!
As a popfile user, I'm quite impressed with the catch rate possible with Bayes' theorem spam filters. However, I suspect this will decrease in effectiveness over the long term.
Spammers are likely to respond to filters like this by encoding text in ways the filters can't read but humans can (e.g. having a .gif file of the text, loaded by an HTML statement in the message).
Statistical filters would need to have some kind of built-in OCR routine before they could be effective against that trick, and some respectable mailing lists are using images as well, so you can't just filter all mails with images attached.
In the long term, therefore, I suspect that filters that use a network database of spam will be more successful.
Is it? I thought Outlook Express was a virus-support API. I suspect the fact you can send email with it is a bug. :)
perl -e 'print "Just another Perl newbie\n";'