Mozilla Adding Spam Filters
ksheka writes "Mozilla mail now has Spam Filters, using Bayesian filtering method, no less. This is a very good thing, because it learns from the spam you receive, and constantly modifies itself, based on new spammer techniques!"
But the spammers will develop Bayesian filters of their own to find the best content that will sneak by your filters.
http://research.microsoft.com/~horvitz/junkfilter.htm
-- Outside of a dog, a book is man's best friend. Inside a dog, it's too dark to read
Bayesian technique is very good for the sort of abstract classification task that spam represents. It would be an interesting hack to try and train a network to categorize based solely on message body... I do, however, hope that their team has opted for practicality over just hack value and the network will also use such extremely relevant data as header information and comparing address versus address book (an e-mail from someone not in your address book is not necessarily spam... but it is more likely to be).
lysergically yours
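One way to picture the suggestion above (feeding header information and address-book membership to the filter alongside the body) is to encode those facts as extra tokens. This is only a minimal sketch, not Mozilla's implementation: the function name `extract_tokens`, the `subject:` and `sender:` prefixes, and the sample addresses are all made up for illustration.

```python
import re

WORD = re.compile(r"[a-z0-9$'-]+")

def extract_tokens(sender, subject, body, address_book):
    """Turn one message into a set of tokens for a Bayesian filter.

    Header-derived facts become pseudo-tokens, so a classifier can weigh
    them exactly like ordinary words from the body.
    """
    tokens = set(WORD.findall(body.lower()))
    # Subject words get their own namespace: "free" in a subject line may
    # carry different weight than "free" in the body.
    tokens.update("subject:" + w for w in WORD.findall(subject.lower()))
    # Address-book membership is evidence, not a verdict: an unknown sender
    # is more likely to be spam, but not necessarily spam.
    known = sender.lower() in {a.lower() for a in address_book}
    tokens.add("sender:known" if known else "sender:unknown")
    return tokens

tokens = extract_tokens(
    sender="stranger@example.com",
    subject="Free offer",
    body="Click here for a free offer",
    address_book=["friend@example.org"],
)
```

The filter then learns from training how much "sender:unknown" should count, instead of a hand-written rule deciding it.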
I wonder if a similar technique could be used in the browser. Automatically block images or popups based on previous ones you have blocked.
Now that would be very nifty!
I just switched to Mozilla. Happy to be free of Microsoft for email. It's skinnable, and there are some cool skins--like one which sort of emulates Evolution. I noticed an annoying 'feature' though, which is still there from Netscrap days--if you send an email without a subject, a dialog pops up and goes blah blah blah. I asked the Mozilla newsgroup if there was a way around this, but all I got was the sort of adolescent yammerings that keep me out of unmoderated newsgroups. Nice to see it has a spamfilter now. The only major improvement remaining is to add a spell-check (the Netscrap one was licensed from a 3rd party, and can't be freely distributed).
This is really great technology.
I had the benefit of working with this technology for a classification problem here at work. I was amazed at how well it worked. We were using it to replace a purely human process.
However, there is one huge problem. Incorrect classification. Blind tests against a known dataset showed 80%+ correctness. The problem is, you don't know which 20% is wrong. Thus, you still need 100% inspection to validate the results.
When applied to mail filters, I wonder how the technology avoids dumping your good mail? Like when your friend sends you a URL to a good pr0n site.
"No matter where you go, there you are." -- Buckaroo Banzai
I assume the filtering statistics live on the client side. What about IMAP? If I open up Mozilla on a new machine, are all my spam statistics lost (presumably rendering the junk mail filtering statistics I've accumulated useless on the new machine)?
It would be neat if, with IMAP accounts, Mozilla just stored the statistics in a file on the IMAP server instead of on the client.
It's 10 PM. Do you know if you're un-American?
Well, most of my spam is already sent to /dev/null by the SpamAssassin ninja.
But, for those that make it past the email shadow warrior, I guess Bayesian filters are a double whammy they'll never survive... Mwahahahaha!
Kudos to the Mozilla programmers!
The right to offend is far more important than the right not to be offended. (Rowan Atkinson)
What happens when Microsoft attempts to enforce this patent?
In Outlook Express, I can set up 100 different email accounts and not have a giant list of mail folders.
In Mozilla (last I checked) for every account you set up it creates a new set of folders.
Since I've got a catchall account, I'd like to tie multiple email addresses to one set.
Anybody out there on the Mozilla team listening?
The man who trades freedom for security does not deserve nor will he ever receive either. - Benjamin Franklin
Spammers don't use relays these days, they use spam tools that directly SMTP the receiving mail server. So the receiver still needs to filter.
sulli
RTFJ.
This new law will force you to leave your radio and TV on even while you aren't paying attention to it. Furthermore junk mail will no longer be able to be discarded without an affidavit that states the recipient has read and understands the offer. Street mimes and homeless people wearing commercial signs must be paid attention to by anyone within a 10 foot radius. You will be required to sample every free offering at the Food Court in the mall and surveyors cannot be ignored. All fliers distributed on your vehicle must be followed up with a phone call or your vehicle will be impounded. You will be required to contact every business that advertises at sporting events if you choose to attend.
Failure to abide by these rules will result in the forfeiture of all assets and the garnishing of all wages earned, which will be deposited into the Federal Marketing Enforcement Fund. Monies from this fund are distributed to companies whose marketing campaigns are not successful.
Remember, You are unique...just like everyone else.
what if in addition to this someone put together a company that the mozilla email client can report back to about what is labelled as spam and the filters it created along with the headers of the message (or even the entire spam) and grab filters from others that receive some spam that you have yet to receive? it would be like a big distributed computing anti-spam project.. then if we were able to make the filters usable by sendmail to block at the server...
I'm almost thinking a distributed and automated anti-spam system like that could completely crush the spam problem within a 12 month period.
or I may be completely out of my mind.
Do not look at laser with remaining good eye.
Well, ok I am impressed that Mozilla is implementing spam filtering abilities in their MUA. I AM NOT impressed with Bayesian spam filters AT ALL. I've been using Mac OS X's Mail.app since I switched to OS X. It's not my primary MUA but I am letting it POP out a copy of all my mail and "learn" from it. It does a pretty good job of finding maybe 80% of the spam I get. However it has a BAD false-positive rate. I mean hell it's been flagging CERT advisories as spam. That kind of crap is really annoying. It's flagged co-workers' mail as spam numerous times (and even though I happen to agree... :) ). The biggest problem I have with Bayesian as a mail admin is that I am constantly dealing with spam. Users forward it to me. I receive a number of spam bounces. I work in spam all the damned time. That's the problem. I need a MUA with Bayesian filters that are smart enough for me to tell them to ignore all mail from certain domains or that went to certain accounts. All of the Bayesian filters built into MUAs I've worked with so far can't do things like that. It's really annoying given the position that I'm in.
This is something that Emacs has in the GNUS client, you score emails up and down and it starts adding filtering rules. Using LISP you could extend this to do some pretty funky moderating.
Every problem is reducible to a previously solved problem or by definition is unsolvable - Church Turing Thesis.
An Eye for an Eye will make the whole world blind - Gandhi
I personally don't really care about all the junk emails I get. I don't get that many, and I can pretty much tell without looking at them. They go straight to /dev/null.
Spam is such a horrible thing though. I work at a webhosting company. I'm the one that has to track down the site with the old formmail.pl, removing 'aol.com' and 'yahoo.com' from the hosts to relay for, trying to find out who the hell added them so I can murder them. I'm the one clearing out the mail queue with 100,000 mails. I'm the one clearing the mail queues of people who thought it was a good idea to check the 'open relay' option in plesk. I'm the one that has to deal with people bitching about how their mail isn't working or didn't get through.
Just the other day, I had a raq2 where someone had apparently put yahoo.com and excite.com in the hosts to relay for. Yay! That's what attracted the spammers. Now I get a request every second to send mail to 50 people at once. Now that I've removed them, none of them are getting through. But it's a raq2, 133 MHz. It has to go through all 50 addresses and say 'relaying denied' and log it. It can't keep up! syslogd is taking up all the cpu and logging things from hours ago because it's behind. Quickly, sendmail quits listening on port 25 (but the spam attempts keep coming somehow).
So I get the idea to block their IPs, they seem to be using the same IPs. But oh guess what, they're using open proxies and have about 400 IPs. Well, I did this for about 5 hours, writing scripts to grab the repeated IPs out of the maillog, adding them all to my sendmail access lists. Now every time they try to send mail, it blocks them instead of saying relaying denied 50 times for each request. But a minute later, I get a few new IPs and it starts all over again. I have an access list about 6 pages long. It's doing ok, blocking about 90% of them, but every once in a while, they get a new IP and sendmail is brought to a stop.
Oh yeah, and my /var/ partition is only 200MB, 50MB free. And the maillog is growing at about 10MB a day. So now I'm babysitting this server every day until the spam attempts stop. I don't think there's any way around it unless I get sendmail to check for open proxies. But I don't know how to do that, and I don't think they trust me enough to make such changes to sendmail.
So oh well, mail is getting lost every day on this server and it's been rendered horribly slow for its users.. just because some moron noticed it would send some emails for him and started up his scripts.
Spam causes so many problems on the server level. It's what is making mail an unreliable service. I could care less about spam filters on my mail client. These are the things that make spam evil!
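The step described in the comment above, pulling repeated IPs out of the mail log and turning them into access-list entries, might look roughly like the sketch below. Treat it as an assumption-laden illustration: the sample log lines are invented (they are not real sendmail output), the bracketed-IP pattern and the `min_hits` threshold are arbitrary choices, and the `address<TAB>REJECT` entry format should be checked against your own sendmail access database setup.

```python
import re
from collections import Counter

# Matches a dotted-quad IPv4 address inside square brackets, e.g. "[192.0.2.10]".
BRACKETED_IP = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")

def repeat_offenders(log_lines, min_hits=3):
    """Return the IPs seen on at least min_hits 'relaying denied' lines, sorted."""
    hits = Counter()
    for line in log_lines:
        if "relaying denied" not in line.lower():
            continue  # only count rejected relay attempts
        for ip in BRACKETED_IP.findall(line):
            hits[ip] += 1
    return sorted(ip for ip, n in hits.items() if n >= min_hits)

def access_entries(ips):
    """Format one access-list line per IP (assumed 'address<TAB>REJECT' format)."""
    return [f"{ip}\tREJECT" for ip in ips]

# Invented sample lines, for illustration only.
sample_log = [
    "relay=[192.0.2.10], reject=550 Relaying denied",
    "relay=[192.0.2.10], reject=550 Relaying denied",
    "relay=[192.0.2.10], reject=550 Relaying denied",
    "relay=[198.51.100.7], reject=550 Relaying denied",
    "relay=[203.0.113.5], stat=Sent",
]
blocked = repeat_offenders(sample_log)
entries = access_entries(blocked)
```

As the comment points out, this only helps until the sender moves to a fresh set of open proxies, so a script like this has to be rerun constantly.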
I personally don't think that systems like this can work that well. Everyone seems to get different types of spam, and your best bet is to create your own filters. About 80% of my spam messages have weird foreign characters in them (like Á), so I've got filters in Eudora to delete anything with one of these characters in the Subject or Body. Then obviously anything with "porn", "sex" etc, although spammers don't seem that stupid anymore. This way I only get 5-10 spam messages in my inbox per day, maximum. And this takes me about 20-30 seconds to deal with; I don't see what all the fuss is about.
Everything sucks except musicandstuff
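The hand-made rules in the comment above (delete on non-ASCII characters, delete on a few keywords) can be sketched in a few lines. This is not how Eudora's filters are defined; the function names are made up, and the keyword set holds only the two words the comment names.

```python
import re

BLOCKED_WORDS = {"porn", "sex"}  # the two keywords named in the comment

def has_non_ascii(text):
    """True if any character falls outside 7-bit ASCII (e.g. 'Á')."""
    return any(ord(ch) > 127 for ch in text)

def is_spam_by_rules(subject, body):
    """Static rules: flag non-ASCII text or any blocked whole word."""
    text = subject + "\n" + body
    if has_non_ascii(text):
        return True
    # Match whole words only, so "Essex" does not trip the "sex" rule.
    words = set(re.findall(r"[a-z]+", text.lower()))
    return not BLOCKED_WORDS.isdisjoint(words)
```

Rules like these are cheap, but they are fixed: the non-ASCII rule also catches any legitimate mail written with accented characters, and nothing adapts when the spam changes.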
I'm running a sendmail server, and I access via webmail accounts, pine, and Mozilla. I would like to add this new type of spam filtering to sendmail directly. Does anyone know if this is something that can be added to sendmail, rather than a specific mail client like Mozilla?
.. should start at the server preventing the offending mail from ever coming into the network in the first place.
Not that localized spam filters are a bad thing (they aren't!) but refusing connections from known spammer IPs and the proper use of blacklists would cut down on a lot of the email traffic. Once the spam is in your inbox, it's just an annoyance to you. The cost to the net has already been incurred.
Trolling is a art,
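One common way a mail server consults a blacklist of the kind mentioned above is a DNS-based list: the connecting IP's octets are reversed, a list zone is appended, and the resulting name is looked up. The sketch below only builds that query name and does no network access; the zone `dnsbl.example.org` is a placeholder, not a real list.

```python
import ipaddress

def dnsbl_query_name(ip, zone="dnsbl.example.org"):
    """Build the DNS name a server would look up to check an IPv4 address.

    The zone is a hypothetical placeholder; substitute whichever list
    you actually trust.
    """
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"

query = dnsbl_query_name("192.0.2.10")
```

By convention, a listed address resolves to something in 127.0.0.0/8 and an unlisted one does not resolve at all, so the server can refuse the connection before any message data is transferred.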
Popup killing and tabbed browsing are the two killer features that have allowed me to spread Mozilla widely through my office. People see me surfing and ask what the tabs are or ask where the popups have gone. I tell them about Mozilla and show them how easy it is to stop popups. Yes I know about crazybrowser which does both of these, but it does popup killing badly (it's an all or nothing thing, not just unsolicited popups).
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Software that only does mail filtering encourages spammers. The technically knowledgeable people don't get spam, so they stop worrying about it.
All mail filters should also use a service like SpamCop, so that the spammers lose their internet service accounts as the spam is filtered.
I send Spamcop all my spam. Spamcop analyzes it automatically and sends a message to the Internet Service Provider. I use the free Reporting only service.
Well, I certainly have a large volume of SPAM that I plan to use for training purposes. I'm not a big user of personal email, but somehow about 70% of all my incoming personal mail is SPAM. My Dad is much worse off.
I'm glad to see that the software industry is taking the SPAM problem seriously. And it's great to hear that more and more states, like Massachusetts, are enacting laws to curb the abuse of email systems.
I've been dependent on some static rules to curb SPAM (about 90% effective), but I think now it's time to implement more serious anti-spam measures.
Based on the last /. article on Bayesian filtering, I installed SpamProbe. I gave it a folder of about 70 spam emails, and a few hundred good emails I had in various folders. In the past few weeks, it's had one false negative, and a few false positives which were 'semi-spam' mailing list emails from Dell, RedHat, and Amazon. When I moved those emails into the 'recheck as good' folders, it learned its lesson.
It may be naive, but I was very surprised at how well it worked. It's better than SpamAssassin IMO, especially at foreign-language spam.
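For anyone curious what "give it a folder of spam and a folder of good mail, then correct its mistakes" amounts to, here is a deliberately tiny naive Bayes sketch. It is not SpamProbe's actual algorithm; the class `TinyBayes`, the add-one smoothing, and the toy messages are all assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter

class TinyBayes:
    """Toy two-class naive Bayes over the set of words in a message."""

    def __init__(self):
        self.counts = {"spam": Counter(), "good": Counter()}
        self.messages = {"spam": 0, "good": 0}

    @staticmethod
    def tokens(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def train(self, text, label):
        """Record one message under 'spam' or 'good'."""
        self.messages[label] += 1
        self.counts[label].update(self.tokens(text))

    def spam_log_odds(self, text):
        """Log of P(spam)/P(good) for the message; above 0 leans spam."""
        n_spam, n_good = self.messages["spam"], self.messages["good"]
        score = math.log((n_spam + 1) / (n_good + 1))  # smoothed prior
        for tok in self.tokens(text):
            # Add-one smoothing keeps unseen words from zeroing a class.
            p_spam = (self.counts["spam"][tok] + 1) / (n_spam + 2)
            p_good = (self.counts["good"][tok] + 1) / (n_good + 2)
            score += math.log(p_spam / p_good)
        return score

    def is_spam(self, text):
        return self.spam_log_odds(text) > 0

f = TinyBayes()
f.train("cheap pills buy now", "spam")
f.train("buy cheap watches now", "spam")
f.train("meeting notes for monday", "good")
f.train("lunch on monday", "good")

# A 'semi-spam' vendor mailing: flagged at first, accepted after correction.
newsletter = "dell laptops on sale buy now"
flagged_before = f.is_spam(newsletter)
f.train(newsletter, "good")  # the 'recheck as good' step
flagged_after = f.is_spam(newsletter)
```

With this toy data the newsletter is flagged before the correction and accepted after it, while "buy cheap pills" is still classified as spam.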
Essentially, it throws the parsing problem right back in the spammers' faces: They must answer a fuzzy logic question in order to get into your inbox once and for all. It is similar to challenge/response routines in network connection code to prevent spoofing. The most interesting part from the intro:
Bayesian filters, to me, seem to work if you are a dull person without many changes in your life. For ex, if you constantly get spams with the word Madam in it and you later on get a sex change, you will need to recalibrate your filters. (Probably not the most pressing thing on your mind, so you'd lose a few authentic mails.)
Just some thoughts.
Since naive Bayes gives probabilities, this is easy to get out of what Mozilla (and Paul Graham, and others) are trying to do. However, it is well-known that the probabilities that naive Bayes classifiers give are typically exaggerated (too close to either 0 or 1). This is partly because of the naive assumption (conditional independence of features).
However, while the probabilities themselves may be exaggerated, they are also usually found to be ranked correctly, which would give you what you want here -- a ranked list of possible spams.
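To make the point about exaggeration and ranking concrete, here is the usual naive Bayes score written as log-odds. The notation is mine, not from any of the filters discussed: w_1 ... w_n are the words in the message, and "good" stands for non-spam.

```latex
\[
\log\frac{P(\mathrm{spam}\mid w_1,\dots,w_n)}{P(\mathrm{good}\mid w_1,\dots,w_n)}
  = \log\frac{P(\mathrm{spam})}{P(\mathrm{good})}
  + \sum_{i=1}^{n}\log\frac{P(w_i\mid \mathrm{spam})}{P(w_i\mid \mathrm{good})}
\]
```

The sum is where the naive assumption enters: words that nearly always appear together each contribute their own term, so the same evidence is counted more than once and the total is pushed far from zero, which is a probability close to 0 or 1. The probability is an increasing function of the log-odds, so sorting messages by this score gives the same order as sorting by the (exaggerated) probability, and a ranked list of likely spam needs only the score.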
> Is it? I thought Outlook Express was a virus-support API.
No, no, Outlook Express is for Internet Explorer what Composer is
for Mozilla or Netscape -- if you don't know HTML, you can use it
to create web pages. They won't be particularly well-designed, and
they won't validate, but the major legacy browsers everyone seems to
still use will display them, so you can put them up on your website.
The reason it sends email is not a bug, but a feature (albeit one
that tends to be abused). It's not for sending general email, but
so that you can easily upload the web pages you create to certain
free website engines that can receive them by email (on the theory
that most people don't know how to use ftp, or else because ftp is
considered insecure). The usenet engine was included so that
multiple people can use it in a peer-to-peer fashion to collaborate
on the creation of a web page. For example, if your mom and grandma
want to create a web page, but they aren't sure how to get the
pictures of the family dog scanned in, you can let them write the
text about the dog, and you can put in the picture. You can pass
it back and forth on your private family news server until it's
ready for the family website.
The reason people started using Outlook Express for regular email
is because the email software that shipped with Windows 95 (called
Microsoft Internet Mail) was _so_ bad that it was more convenient
to use _anything_ else, including telnet, and so when Outlook
Express came out people jumped on that, and the rest is history;
Outlook Express now handles (on one end or the other) nearly 40%
of the internet's email, more than anything else except sendmail.
The virus API, as you suspected, was not a bug but a feature, but
the reasons for its inclusion are complicated and involve both
particle physics and JFK.
And you'll have a real winner. Probably several other techniques could be combined as well, but back when I wrote a program just to check all of the from IPs in an email to see if any of them were open relays, I got around 80% filtering with very few false positives.
Furthermore, you can assign a pretty good probability number based on what sort of open relay it is (i.e. verified, unverified, spam server, merely unsecured server, etc). If it comes from a spam server, the chances are 100% that it's spam. If it comes from a dialup server, the chances are about 99.9999%. If it comes from an automatically verified open relay, that's merely unsecured, the chances are more like 60%.
The open relay thing really intrigued me because it has NOTHING to do with the message body, and it was my belief at the time that there was no good way to filter based on message content.
However, combine this with bayes, and I'll bet you'll have something grand.
Also, a great feature would be a multi-tiered identifier, so that you could have the 99.999% sure spam filtered into one folder, and the 75% sure spam filtered into another. You'd have to sift through the 75%, but probably could just leave the 99% alone.
WWJD? JWRTFA!
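The two ideas in the comment above, combining the open-relay estimate with a Bayesian score and then sorting mail into tiers, could be sketched like this. The combining rule treats the two estimates as independent evidence (the same form Paul Graham uses for combining word probabilities); that independence is an assumption, and the folder names and the clamping constant are made up.

```python
def clamp(p, eps=1e-6):
    """Keep a probability strictly inside (0, 1) so 100% and 0% cannot divide by zero."""
    return min(max(p, eps), 1.0 - eps)

def combine(p_relay, p_bayes):
    """Combine two spam probabilities as if they were independent evidence."""
    a, b = clamp(p_relay), clamp(p_bayes)
    spam = a * b
    good = (1.0 - a) * (1.0 - b)
    return spam / (spam + good)

def route(p_spam, sure=0.999, likely=0.75):
    """Pick a folder by confidence tier (folder names are hypothetical)."""
    if p_spam >= sure:
        return "Junk-Certain"   # safe to leave unread
    if p_spam >= likely:
        return "Junk-Review"    # worth sifting through now and then
    return "Inbox"

# An unsecured open relay (about 60% per the comment) plus a mildly spammy body.
folder = route(combine(0.60, 0.70))
```

Here 0.60 and 0.70 combine to roughly 0.78, which lands in the review folder; a message from a dialup server (0.999999) goes to the certain folder even when the body score is a neutral 0.5.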
I'd argue that the time wasted on filtering spam is more valuable than the bandwidth wasted delivering it. This is why I am glad that Apple was able to bring good client-side spam filtering to the people with Mail and that Mozilla will soon provide this feature as well.
I have a website. It's about Macs.