Gmail Spam Filter Testing
An anonymous reader writes "What can you do with 1000MB of e-mail space on your Gmail account? One guy, by the name of Aaron Pratt ( prattboy@gmail.com ), has decided to test the spam filters of Google's Gmail service by having his Gmail account blasted with every kind of spam imaginable. He is testing to see how well Gmail's spam filters can sort out the spam from legitimate email (yes, he does get personal emails from people). As of May 25th, he was at about 30% of his Gmail account's 1GB capacity. You can track his progress on his website, http://gmail.prattboy.net (Google cache of this site: cache: gmail.prattboy.net). Here is also an article talking about Aaron's efforts from webpronews.com"
Seems like Gmail only filters approx. 50% of spam. That is not very impressive, since the top anti-spam software and e-mail clients (such as Outlook 2003 and Mozilla Thunderbird) can easily reach 95% accuracy in spam filtering.
I am starting to second guess whether I should transfer everything to my Gmail account.
Spammers have thought of this already, and they send nearly identical messages. Ever notice the random strings of letters and/or numbers at the bottom/in the subjects of spams?
DJ kRYPT's Free MP3s!
I did some testing of my own. I forwarded a ton of spam from my personal account to my gmail account, just to see what would get through and what would be filtered. For me, gmail was really effective, but strangely, one Nigerian e-mail scam mail didn't get tagged.
:)
It was from " Mr Jubril Udeh Manager of Credit and Accounts Department of North Atlantic Securities Sarls Lome-Togo Republic."
Now, the funny part is not that the mail made it through, but that google also decided to show me contextual ads on that account. Currently, the ads are:
- Payroll Cards a Poor Substitute for Checking Account
- Tips for Tackling Check Fraud
- Sophos hoax description: Ethiopian airline letter
- FAP non-US Investment FAQs
In the past the mail has also shown me ads on how to open an off-shore bank account. I'm glad google is willing to help me with the $10.5 million dollars that I'm about to receive!
- "When you want something with all your heart, the entire universe conspires to give it to you" -Paulo Coelho
Checksums are nearly useless against spam. It only takes one changed byte to change the checksum value, and probably more than 90% of spam contains a personalization code to check which addresses are functional. Different code = different checksum.
This doesn't mean it wouldn't be possible to create a system which would automatically detect individual spam messages based on tagging known spam; you just have to be smarter about the detection than just plain MD5ing the email body.
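To make that concrete, here's a rough Python sketch. The two spam bodies, the helper names and the choice of word shingles with Jaccard similarity are all my own assumptions for illustration; this is one possible "smarter" comparison, not how Gmail or any real filter does it.

```python
import hashlib


def md5_hex(body: str) -> str:
    """Plain checksum of the message body."""
    return hashlib.md5(body.encode("utf-8")).hexdigest()


def shingles(body: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences in the body."""
    words = body.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Share of k-word shingles that two bodies have in common (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


# Made-up examples: the same spam with two different tracking codes, plus one real mail.
spam_a = "Buy cheap pills now at our online store, best prices guaranteed today xk39fj"
spam_b = "Buy cheap pills now at our online store, best prices guaranteed today qz81mw"
ham = "Hey, are we still on for lunch on Friday? Let me know what time works"

print(md5_hex(spam_a) == md5_hex(spam_b))  # False: the tracking code changes the checksum
print(round(jaccard(spam_a, spam_b), 2))   # high: almost all shingles are shared
print(jaccard(spam_a, ham))                # 0.0: nothing in common
```

The checksums differ completely, but the shingle overlap still says the two spams are nearly the same message.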
"Although it is not true that all conservatives are stupid, it is true that most stupid people are conservative."
When I add up the figures from May 13 to 19, I get that 4869 messages were received. 4717 of those were spam, and 1820 were marked, so Gmail's success rate was 38.6%.
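If anyone wants to check the arithmetic, here it is in Python. The variable names are mine, and I'm assuming every marked message really was spam (the posted figures don't break out false positives).

```python
received = 4869  # all messages, May 13 to 19
spam = 4717      # of those, how many were spam
marked = 1820    # how many Gmail marked (assumed here to all be real spam)

catch_rate = marked / spam
print(f"Gmail caught {catch_rate:.1%} of the spam")
```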
It's a good thing you're not using Outlook. :)
:)
I get those in Eudora and they don't seem to do much, my friends with Outlook however... not so lucky.
No, you inverted it. You want MB/message, not message/MB.
3778 messages / 213 MB = 17.74 messages / MB
213 MB / 3778 messages = 0.0564 MB / message
So that's pretty reasonable.
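Same thing in Python so nobody flips the ratio again. The KB conversion uses 1024 KB per MB, which is my assumption about what "MB" means here.

```python
messages = 3778
megabytes = 213

messages_per_mb = messages / megabytes  # how many messages fit in one MB
mb_per_message = megabytes / messages   # average size of one message, in MB
kb_per_message = mb_per_message * 1024  # same average in KB (assuming 1 MB = 1024 KB)

print(f"{messages_per_mb:.2f} messages/MB")
print(f"{mb_per_message:.4f} MB/message")
print(f"{kb_per_message:.1f} KB/message")
```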
Try looking at the source -- when this happens to me, I see that the random words are plaintext, and the intended advertisement is in HTML (which I've blocked).
$ echo "ceci n'est pas une pipe" | sed -Ee 's/(eci n|pas )//g'
Most of the time, these messages contain both a text/plain section with only random words, and then a text/html part with the real payload. If you use mutt or so, you most likely only see the text/plain stuff. Another trick is using just a text/html section with random text, but also with an image that contains the real payload.
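You can see the split for yourself with Python's standard email module. The message below is a made-up example (addresses, boundary and wording are all mine), and parts_by_type is just a helper name I picked.

```python
from email import message_from_string

# Made-up message: random words in text/plain, the real pitch in text/html.
raw = """\
From: sender@example.com
To: you@example.com
Subject: hello
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain; charset="us-ascii"

harvest quietly lantern orbit walnut
--BOUNDARY
Content-Type: text/html; charset="us-ascii"

<html><body><a href="http://example.com/buy">Cheap pills here</a></body></html>
--BOUNDARY--
"""


def parts_by_type(raw_message: str) -> dict:
    """Map each leaf part's content type to its decoded text."""
    msg = message_from_string(raw_message)
    found = {}
    for part in msg.walk():
        if part.is_multipart():
            continue  # skip the container, keep only the leaf parts
        payload = part.get_payload(decode=True)
        found[part.get_content_type()] = payload.decode("ascii", "replace")
    return found


for content_type, text in parts_by_type(raw).items():
    print(content_type, "->", text.strip())
```

A text-only reader shows you just the first part, which is why all you see is word salad.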
Donate free food here
Actually the TOS for Gmail says that doing things to attract spam is a violation, so they could just close the account on that basis. Also, if you don't sign on for a certain period of time (a few months I think) the account gets deleted. I had a Yahoo ID for years before I ever knew there was an e-mail address associated with it. I never read the mail associated with my AIM id and I probably still have free hotmail and a few other things like that floating around. Failure of these companies to delete idle accounts is what causes all the good names to be taken. I think Google is more on top of this than many of the others.
Those emails could also contain embedded image tags (known as web beacons). When you open the email and your client downloads the image, some server on the net knows it was you who retrieved it, and has just verified that your email address is active and spammable.
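Here's a small Python sketch of how a mail client could spot candidate beacons before loading anything. The class name, function name and sample HTML are mine, and it only flags remote img tags; it can't tell a tracking pixel from an ordinary remote picture.

```python
from html.parser import HTMLParser


class RemoteImageFinder(HTMLParser):
    """Collects the src of every <img> that points at a remote server."""

    def __init__(self):
        super().__init__()
        self.remote_images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        if src.lower().startswith(("http://", "https://")):
            self.remote_images.append(src)


def find_remote_images(html: str) -> list:
    finder = RemoteImageFinder()
    finder.feed(html)
    finder.close()
    return finder.remote_images


# Made-up body: one remote 1x1 image with a per-recipient id, one inline attachment.
html_body = (
    '<html><body><p>Hi</p>'
    '<img src="http://tracker.example.com/pixel.gif?id=abc123" width="1" height="1">'
    '<img src="cid:logo">'
    '</body></html>'
)

print(find_remote_images(html_body))
```

The id in the query string is the part that ties the download back to your address, so blocking remote images by default takes care of it.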
Eh, I only got 180MB worth of email and spam out of the deal though, before I decided to delete the account. The Gmail Spam filter was rather horrible at the time; catching only the most tried and true SPAM, letting tons of other SPAM through, and then randomly flagging legitimate messages from people whom it had not flagged before. I think it has improved some since then.
http://www.sampletheweb.com
>The whole Gmail "get a gigabyte of memory free" business model is predicated on most people using only a small fraction of that gigabyte
Why?
Google uses commodity IDE drives. Those retail for about fifty cents a gigabyte. Google's not paying retail.
I read a quote from a Googleperson that by the time the drive is installed in a system, powered, cooled, backed up and administered Google is paying two dollars for a gigabyte.
Good point about the problem of abandoned accounts, which won't bring Google any ad revenue. Wouldn't be surprised if they start euthanizing inactive accounts.
>>It's cheaper to just send mail to everyone
>no it's not.
It doesn't matter how cheap it is when 80% of spam supposedly comes from infected zombie computers. (I'm too lazy to actually LINK to the recent story on this.)
The World Wide Web is dying. Soon, we shall have only the Internet.
A false positive is not spam getting past the filter; it's non-spam getting blocked.
I.e. the filter says it's spam, and it isn't - in the same way that a false-positive medical test says you have a virus even when you don't.
Any evidence that they reject mail for various reasons? I'm sure there is. You can go ahead and see which RFCs they're in compliance with and which they aren't.
Try sending mail to them from a host with no PTR record, or with a malformed EHLO, or something else along those lines.
You don't need to be "really sure" mail is spam. I'm talking about doing things like standards compliance checking, which will result in mail being rejected at delivery time.
Is this just random theorizing, or does GMail really fail to deliver some emails it thinks are spam?
There's no reason to get insulting. RFC 2821 has a number of requirements for delivery of mail that many services ignore.
False negative = condition you are testing for comes up negative, when it should be positive.
Put in the context of a spam filter, it depends on whether you are testing for spam or for legitimate emails. If you are testing for spam (if spam then...), a false positive would be an email that is not spam getting sent to the spam folder or deleted. A false negative would be spam that lands in your inbox.
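Spelled out in Python, testing for spam (so "positive" means the filter flagged it). The function name and the sample list are made up for illustration.

```python
from collections import Counter


def classify_outcome(is_spam: bool, flagged: bool) -> str:
    """Name the outcome when the filter is testing for spam."""
    if flagged:
        return "true positive" if is_spam else "false positive"
    return "false negative" if is_spam else "true negative"


# (is_spam, flagged) pairs for five made-up messages
mail = [(True, True), (True, False), (False, False), (False, True), (True, True)]

counts = Counter(classify_outcome(is_spam, flagged) for is_spam, flagged in mail)
for outcome, n in sorted(counts.items()):
    print(outcome, n)
```

The (True, False) message is the spam that lands in your inbox, and the (False, True) one is the real mail that ends up in the spam folder.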
-------------------------------------------------
The cache link is pointing to the cache of his website, not of Google's.