Gmail Spam Filter Testing
An anonymous reader writes "What can you do with 1000MB of e-mail space on your Gmail account? One guy, by the name of Aaron Pratt ( prattboy@gmail.com ), has decided to test the spam filters of Google's Gmail service by having his Gmail account blasted with every kind of spam imaginable. He is testing to see how well Gmail's spam filters can sort out the spam from legitimate email (yes, he does get personal emails from people). As of May 25th, he was at about 30% of his Gmail account's 1GB capacity. You can track his progress on his website, http://gmail.prattboy.net (Google cache of this site: cache: gmail.prattboy.net). Here is also an article talking about Aaron's efforts from webpronews.com"
What's Google going to do to protect its users from mail bombs?
Now you're complaining that your free, 1GB-limit, access-from-anywhere email service could be mailbombed? Live with it. If Google "decides" anything more about our emails, we put on our tinfoil hats and scream. If we broadcast a bogus email address, obtained from gmail for clearly sinister purposes, and it gets mailbombed, we whine that Google doesn't "protect" us. What's the story, or are we all just schizophrenic?
Don't want that "vulnerability"? Don't use Gmail!
I want to delete my account but Slashdot doesn't allow it.
...how many e-mails has he received in total? I've kept spam for six months before and it totaled less than 100MB...and I get a cubic buttload of crap daily.
Don't be a looter...and yes, I know that it's spelled with an "A" instead of an "E".
isn't gmail still in 'beta' stages? if so, isn't a review of spam filtering techniques a little premature?
He's not counting all the mail that Google rejects outright, which is never even allowed in for further classification.
Don't forget that this is google's first foray into mail software, and it is still in beta. I have so far gotten very little spam in my gmail inbox.
Mozilla Thunderbird or Spamassassin will filter at least as well or even better. Is this just a test to see how quickly we can fill up gmail's disk?
-- Bryan
Anti-Spammers have thought of this, too. Things like the Distributed Checksum Clearinghouses have "fuzzy" matching.
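For anyone wondering what "fuzzy" matching can look like, here is a minimal sketch. It is not DCC's actual algorithm; the normalization rules and the name `fuzzy_checksum` are assumptions made up for the example. The idea is to throw away the parts of a message that spammers tend to randomize before hashing, so near-identical bulk mail collapses to the same checksum:

```python
import hashlib
import re


def fuzzy_checksum(body: str) -> str:
    """Hash a message body after discarding parts that bulk mailers commonly vary."""
    text = body.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs (often carry per-recipient IDs)
    text = re.sub(r"\d+", " ", text)            # drop digit runs (order numbers, prices)
    words = re.findall(r"[a-z]+", text)         # keep only alphabetic words
    normalized = " ".join(words)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Two variants of the same bulk message, plus an unrelated personal note.
a = fuzzy_checksum("Buy CHEAP meds now!!! Order #48213 http://example.com/?id=991")
b = fuzzy_checksum("buy cheap meds now. order #77 http://example.com/?id=12")
c = fuzzy_checksum("Lunch on Friday?")
print(a == b, a == c)
```

A clearinghouse would then count how many different recipients report the same checksum; a real system needs something more robust than this against inserted words.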
Google also has enough computer power to generate some sort of Bayesian filter to catch the most common spam system wide, and even a personalized filter on each account to catch the rest.
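A rough sketch of how a system-wide filter and a per-account filter could be layered. Nothing here is known about Gmail's real design; the `TokenStats` class, the `min_seen` threshold, and the fallback rule are all assumptions for illustration. Each layer keeps spam/ham counts per token, and the personal layer wins once the user has seen a token often enough:

```python
from collections import Counter


class TokenStats:
    """Spam/ham counts per token, from which a per-token spam probability is derived."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        # Count each token at most once per message.
        (self.spam if is_spam else self.ham).update(set(tokens))

    def seen(self, token):
        return self.spam[token] + self.ham[token]

    def spam_prob(self, token):
        # Add-one smoothing, so a never-seen token scores a neutral 0.5.
        return (self.spam[token] + 1) / (self.seen(token) + 2)


def token_prob(token, personal, shared, min_seen=3):
    """Use the user's own statistics once they have enough data for this token."""
    stats = personal if personal.seen(token) >= min_seen else shared
    return stats.spam_prob(token)


shared, personal = TokenStats(), TokenStats()
for _ in range(50):
    shared.train(["viagra", "mortgage"], is_spam=True)       # system-wide spam
for _ in range(5):
    personal.train(["mortgage", "closing"], is_spam=False)   # this user is buying a house

print(token_prob("viagra", personal, shared))    # falls back to the shared model
print(token_prob("mortgage", personal, shared))  # uses the personal model
```

The point of the second layer is that a word that is spammy for most people can still be perfectly normal in one person's mail.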
>> You think they bother?
heh heh...absolutely.
100 known good addresses are worth 10,000 "who the fuck knows" addresses.
>>It's cheaper to just send mail to everyone
no it's not.
let's pretend you are a spammer, and you want to send out spam.
If you target 1 billion questionable addresses, each time a client has a new campaign, then that's 1 billion pieces you have to deliver. every time.
what if you have 1000 clients? that's 1000 billion deliveries.
do you see where this is going? if you don't KNOW WHAT A VALID EMAIL ADDRESS IS, YOU HAVE TO GUESS.
but what if the first time you send out just a "test" to those billion addresses, and then subtract the ones that bounce?
You are left with 50 million known good addresses.
that's gold. You now have 1/20th of the load, and you are serving your clients a helluva lot quicker. You are only using an open relay for 1/20th of the time.
overall a smaller footprint by 1/20th.
you tell me. does it make sense to blindly blast out email?
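The arithmetic above can be run as a back-of-the-envelope check. This sketch takes the billion-address list, the 1000 campaigns and the 1/20th figure at face value; the constant and function names are made up for the example:

```python
def total_deliveries(list_size: int, campaigns: int) -> int:
    """Messages sent when every campaign goes to the whole list."""
    return list_size * campaigns


RAW_ADDRESSES = 1_000_000_000   # unverified list, from the comment
CAMPAIGNS = 1_000               # one campaign per client, from the comment
SURVIVING_DIVISOR = 20          # assumption: 1/20th of the list survives the bounce check

# Blind blasting: every campaign hits the whole raw list.
blind = total_deliveries(RAW_ADDRESSES, CAMPAIGNS)

# One test run over the raw list, then every campaign goes to the cleaned list.
cleaned_size = RAW_ADDRESSES // SURVIVING_DIVISOR
washed = RAW_ADDRESSES + total_deliveries(cleaned_size, CAMPAIGNS)

print(f"blind:  {blind:,}")
print(f"washed: {washed:,}")
print(f"ratio:  {washed / blind:.3f}")
```

The one-off test mailing barely moves the total once there are many campaigns, which is the whole argument for verifying the list first.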
I've found whitelists, combined with treating everything as junk, to be far more useful than blacklists.
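A minimal sketch of that default-deny approach, assuming the whitelist is simply a set of sender addresses (the addresses and the `classify` function are hypothetical):

```python
from email.utils import parseaddr

# Hypothetical whitelist; in practice this might be built from an address book.
WHITELIST = frozenset({"mom@example.com", "boss@example.org"})


def classify(from_header: str, whitelist=WHITELIST) -> str:
    """Return 'inbox' only for whitelisted senders; everything else is junk."""
    _display_name, address = parseaddr(from_header)
    return "inbox" if address.lower() in whitelist else "junk"


print(classify("Mom <Mom@Example.com>"))
print(classify("Cheap Pills <deals@spam.example>"))
```

Since From headers are trivially forged, a whitelist like this works only as long as spammers are not guessing who your friends are.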
Who do you get to be an expert to tell you something's not obvious? The least insightful person you can find? -J Roberts
Except that won't work, as anyone that understands Bayesian filtering will tell you. In the case of every message with "random words" I've checked recently, the random words actually increased the spam score of that message. Why? Because it seems the random words aren't so random and either the same spammer is using the same "random words" over and over or various spammers are using sets of the same words. Over time most of the "random words" they use actually become great indicators of spam since my real email doesn't typically contain the random words they use.
In one recent analysis, 10 random words were inserted by the spammer. He got lucky and 1 of those words actually had a very low score for my Bayesian corpus. Unfortunately (for him), the other 9 words had scores of 99.99%! His use of random words literally nuked any possibility of him getting through my filter.
Anyway, random words will not help spammers get through Bayesian filters. But it seems that many people (both spammers and non-spammers) think it will. But, hey, that's good for me: as long as "random words" is seen by spammers as a viable solution to Bayesian filters, my Bayesian filter will continue to work and will not have to deal with any innovative way to get around the filter (if any exists).
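To see why one lucky word cannot rescue the message, here is a sketch using one common way of combining per-token probabilities in Bayesian spam filters. It is not necessarily what the filter described above does, and the 0.01 score for the "lucky" word is an assumed value:

```python
import math


def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities, naive-Bayes style."""
    spam = math.prod(token_probs)
    ham = math.prod(1.0 - p for p in token_probs)
    return spam / (spam + ham)


# Nine "random" words the filter has learned as spammy, one that looks innocent.
scores = [0.9999] * 9 + [0.01]   # 0.01 is an assumed value for the lucky word
print(combined_spam_probability(scores))

# For comparison: the lucky word on its own.
print(combined_spam_probability([0.01]))
```

The nine spammy tokens overwhelm the single innocent one, so the combined score ends up very close to 1.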
I pay the $20 for extra Yahoo email, and I have to say that their spam filtering is much better than gmail's right now. I have about 10 spams a day to clear out of gmail, where with Yahoo it's more like 1, often 0.
People that don't pay for Yahoo don't seem to get such good spam filtering, though.
Google can definitely do better.
So, in less than a month, he has received in excess of 300 Megabytes of useless junk?
I think somebody needs to recalculate exactly how much bandwidth goes to waste because of this SPAM plague. The cost in global comms traffic must be staggering!
The consequences of blocking a non-spam email (a parent not hearing from their kid, the customer that would have saved your startup) are so much worse than a spam getting in that I wish the spam filter reviews would focus on those.
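A review that took this seriously would report the two error rates separately instead of a single "accuracy" number. A minimal sketch, with made-up sample data and a hypothetical `error_rates` helper:

```python
def error_rates(results):
    """results: list of (is_spam, flagged_as_spam) pairs, one per message."""
    fp = sum(1 for is_spam, flagged in results if not is_spam and flagged)
    fn = sum(1 for is_spam, flagged in results if is_spam and not flagged)
    ham = sum(1 for is_spam, _ in results if not is_spam)
    spam = sum(1 for is_spam, _ in results if is_spam)
    return {
        "false_positive_rate": fp / ham if ham else 0.0,   # good mail wrongly blocked
        "false_negative_rate": fn / spam if spam else 0.0,  # spam that got through
    }


# Made-up results: 100 spams (5 missed) and 50 legitimate mails (1 blocked).
sample = (
    [(True, True)] * 95 + [(True, False)] * 5
    + [(False, False)] * 49 + [(False, True)] * 1
)
rates = error_rates(sample)
print(rates)
```

Two filters with the same overall accuracy can differ a lot on the first number, and that is the one that costs you the customer.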
It may not increase false negatives, but it has a decent chance of increasing false positives, which is a much greater problem. My best guess is that spammers are hoping that once enough random words are classified as spam words, real emails with those words will start being classified as spam. If they can force enough false positives, people will start turning off Bayesian filtering.
Anyone else having trouble with these spams?
;-)
Surely it's the people who aren't having this problem that you want to hear from - they're the ones with good spam filtering