Gmail Spam Filter Testing

Posted by Hemos on Monday June 14, 2004 @02:35AM from the send-the-mail-in dept.

An anonymous reader writes "What can you do with 1000MB of e-mail space on your Gmail account? One guy, by the name of Aaron Pratt (prattboy@gmail.com), has decided to test the spam filters of Google's Gmail service by having his Gmail account blasted with every kind of spam imaginable. He is testing to see how well Gmail's spam filters can sort out the spam from legitimate email (yes, he does get personal emails from people). As of May 25th, he was at about 30% of his Gmail account's 1GB capacity. You can track his progress on his website, http://gmail.prattboy.net (Google cache of this site: cache: gmail.prattboy.net). Here is also an article talking about Aaron's efforts from webpronews.com"

Not that impressive by chrisgeleven · 2004-06-14 02:42 · Score: 4, Informative

Seems like Gmail only filters approx. 50% of spam. That is not very impressive, since the top anti-spam software and e-mail clients (such as Outlook 2003 and Mozilla Thunderbird) can easily reach 95% accuracy in spam filtering.

I am starting to second-guess whether I should transfer everything to my Gmail account.
1. Re:Not that impressive by XO · 2004-06-14 02:47 · Score: 4, Informative
  
  Sure, but those will also mark virtually every legitimate email as spam, as WELL. Yeah, you can have 95% accuracy... but then you have to go through your hundreds of messages marked spam just to find your real email!
  
  (For example, after two weeks of using SpamAssassin, it decided that every e-mail sent to me was spam. I no longer received anything in my Inbox; everything was transferred to the Spambox. It took me another two weeks of tweaking SpamAssassin's kill rate down to about 50% accuracy, and now I actually receive all my emails.)
  
  --
  "Champagne for my real friends - and real pain for my sham friends!" http://ericblade.postalboard.com/
2. Re:Not that impressive by Kredal · 2004-06-14 02:56 · Score: 2, Informative
  
  tikora@gmail.com, I think.
  
  Mine is kredal@gmail.com, if you're interested. (:
  
  --
  Whoever stated that signature sizes should be limited to one hundred and twenty characters can just go ahead and kiss my
3. Re:Not that impressive by ravydavygravy · 2004-06-14 04:01 · Score: 3, Informative
  
  >Sure, but those will also mark virtually every legitimate email as spam, as WELL. Yeah, you can have 95% accuracy... but then you have to go through your hundreds of messages marked spam just to find your real email!
  
  Rubbish - I've used Thunderbird for many months now, with an account that gets quite a bit of spam. I have yet to see Thunderbird make a wrong guess at what's spam and what's not. If anything, Thunderbird is more likely to go the other way - allowing spam through - than deleting real email.
Re:One of the best things Google/GMail could do by kryptkpr · 2004-06-14 02:45 · Score: 4, Informative

Spammers have thought of this already, and they send nearly-identical messages.. Ever notice the random strings of letters and/or numbers at the bottom/in the subjects of spams?

--
DJ kRYPT's Free MP3s!
My own gmail testing by Twid · 2004-06-14 02:48 · Score: 5, Informative

I did some testing of my own. I forwarded a ton of spam from my personal account to my Gmail account, just to see what would get through and what would be filtered. For me, Gmail was really effective, but strangely, one Nigerian scam e-mail didn't get tagged.

It was from "Mr Jubril Udeh Manager of Credit and Accounts Department of North Atlantic Securities Sarls Lome-Togo Republic."

Now, the funny part is not that the mail made it through, but that Google also decided to show me contextual ads on that account. Currently, the ads are:
- Payroll Cards a Poor Substitute for Checking Account
- Tips for Tackling Check Fraud
- Sophos hoax description: Ethiopian airline letter
- FAP non-US Investment FAQs

In the past the mail has also shown me ads on how to open an off-shore bank account. I'm glad Google is willing to help me with the $10.5 million that I'm about to receive! :)

--
- "When you want something with all your heart, the entire universe conspires to give it to you" -Paulo Coelho
Spam is always personalized by Sulka · 2004-06-14 02:49 · Score: 4, Informative

Checksums are nearly useless against spam. It only takes one byte to change the checksum value, and probably more than 90% of spam contains a personalization code to check which addresses are functional. Different code = different checksum.

This doesn't mean it wouldn't be possible to create a system which would automatically detect individual spam messages based on tagging known spam; you just have to be smarter about the detection than plain MD5ing the email body.
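
To make the point concrete, here is a minimal Python sketch. It first shows that two copies of a message differing only in a trailing tracking code get different MD5 digests, then shows one possible "smarter" comparison: Jaccard similarity over three-word shingles. The shingle approach, the sample message, and the helper names (`md5_hex`, `shingles`, `jaccard`) are illustrative assumptions, not something the comment proposes or anything Gmail is known to use.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 digest of the text as a hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of the shingle sets of two texts (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Hypothetical spam body; only the trailing "personalization code" differs.
BODY = ("Cheap pills shipped overnight with no prescription needed, "
        "click here to order today")
msg_a = BODY + " xk29fq"
msg_b = BODY + " p7zt41"

if __name__ == "__main__":
    print(md5_hex(msg_a))
    print(md5_hex(msg_b))
    print(jaccard(msg_a, msg_b))
```

The digests share nothing useful, while the shingle sets overlap heavily, so a threshold on the similarity score could match both copies against one tagged sample. Choosing that threshold without catching legitimate mail is the hard part and is not addressed here.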

--
"Although it is not true that all conservatives are stupid, it is true that most stupid people are conservative."
1. Re:Spam is always personalized by Thuktun · 2004-06-14 04:04 · Score: 4, Informative
  
  >gzip it and compare the files. a short tracking code will make a negligible difference.
  
  Not necessarily.
  
  Lempel-Ziv based algorithms, like the one used by gzip, build a compression dictionary on the fly. Any "personalization" added to the message will affect the dictionary to varying degrees from then onward. If it's near the beginning, the personalization would greatly skew the selected dictionary identifiers. Though probably this would have little effect on the actual compression of the data, it would radically change the representation of the compressed image. The farther this personalization is from the start of the data to be compressed, the less effect it will have.
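
One way to check this claim empirically is the small Python sketch below. It uses the standard `zlib` module (the same DEFLATE algorithm gzip uses) to compress a message with a tracking code placed at the start or at the end, then measures how many leading compressed bytes each variant shares with the untagged original. The message body, the tag, and the helper names are made up for illustration. DEFLATE also applies Huffman coding over the whole block, so the divergence you observe may not follow the pure Lempel-Ziv reasoning exactly; treat the output as an experiment, not a proof.

```python
import zlib


def common_prefix_len(a, b):
    """Number of leading bytes that two byte strings have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


# Hypothetical spam body and personalization tag.
BODY = (b"Refinance your mortgage today at the lowest rates available. "
        b"Our lenders compete for your business so you always win. "
        b"Reply now and a specialist will contact you within one day. ") * 3
TAG = b"q8Zr2LpX"

compressed_base = zlib.compress(BODY)
compressed_start = zlib.compress(TAG + BODY)  # tag before the body
compressed_end = zlib.compress(BODY + TAG)    # tag after the body

if __name__ == "__main__":
    print("sizes:", len(compressed_base), len(compressed_start),
          len(compressed_end))
    print("shared prefix, tag at start:",
          common_prefix_len(compressed_base, compressed_start))
    print("shared prefix, tag at end:",
          common_prefix_len(compressed_base, compressed_end))
```

Comparing compressed sizes rather than compressed bytes sidesteps the problem the comment describes, since a short tag changes the size only slightly even when it scrambles the byte stream.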
Re:The Filter is great! by aismail3 · 2004-06-14 03:02 · Score: 5, Informative

When I add up the figures from May 13 to 19, I get that 4869 messages were received. 4717 of those were spam, and 1820 were marked, so Gmail's success rate was 38.6%.
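
The arithmetic can be reproduced with a few lines of Python. The three figures come from the comment above; the function and variable names are only illustrative. Note that this measures the share of spam that was caught and says nothing about legitimate mail wrongly flagged, since the comment gives no numbers for that.

```python
def catch_rate(spam_received, spam_flagged):
    """Fraction of spam messages that the filter marked as spam."""
    if spam_received <= 0:
        raise ValueError("spam_received must be positive")
    return spam_flagged / spam_received


# Figures quoted in the comment for May 13 to 19.
TOTAL_RECEIVED = 4869
SPAM_RECEIVED = 4717
SPAM_FLAGGED = 1820

rate = catch_rate(SPAM_RECEIVED, SPAM_FLAGGED)
spam_missed = SPAM_RECEIVED - SPAM_FLAGGED      # spam that reached the inbox
legit_received = TOTAL_RECEIVED - SPAM_RECEIVED  # non-spam messages

if __name__ == "__main__":
    print(f"catch rate: {rate:.1%}")
    print("spam missed:", spam_missed)
    print("legitimate messages:", legit_received)
```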
Re:One of the best things Google/GMail could do by wo1verin3 · 2004-06-14 03:07 · Score: 2, Informative

It's a good thing you're not using Outlook. :)

I get those in Eudora and they don't seem to do much, my friends with Outlook however... not so lucky. :)
Re:Hmmm.. weird stats... by Satai · 2004-06-14 03:11 · Score: 4, Informative

No, you inverted it. You want MB/message, not messages/MB.

3778 messages / 213 MB = 17.74 messages / MB
213 MB / 3778 messages = 0.0564 MB / message

So that's pretty reasonable.
Re:One of the best things Google/GMail could do by xandroid · 2004-06-14 03:21 · Score: 2, Informative

Try looking at the source -- when this happens to me, I see that the random words are plaintext, and the intended advertisement is in HTML (which I've blocked).

--
$ echo "ceci n'est pas une pipe" | sed -Ee 's/(eci n|pas )//g'
Re:One of the best things Google/GMail could do by Halo1 · 2004-06-14 03:21 · Score: 5, Informative

Most of the time, these messages contain both a text/plain section with only random words, and then a text/html part with the real payload. If you use mutt or a similar text-mode client, you most likely only see the text/plain stuff. Another trick is using just a text/html section with random text, but also with an image that contains the real payload.
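
The structure described here can be reproduced with Python's standard `email` package. The sketch below builds a made-up multipart/alternative message with a decoy text/plain part and a text/html part, then lists the parts the way a filter or mail client would encounter them. The subject, body text, and function names are assumptions for illustration only.

```python
from email.message import EmailMessage


def build_decoy_message():
    """Build a message whose plain part is filler and whose HTML part is the ad."""
    msg = EmailMessage()
    msg["Subject"] = "hello"
    # text/plain part: random words only, which is all a text-only client shows.
    msg.set_content("orchid ladder quantum biscuit")
    # text/html alternative: the real payload.
    msg.add_alternative(
        "<html><body><p>BUY NOW</p></body></html>", subtype="html")
    return msg


def part_types(msg):
    """Content types of the leaf (non-container) parts, in message order."""
    return [part.get_content_type()
            for part in msg.walk() if not part.is_multipart()]


def plain_text(msg):
    """Text of the text/plain alternative, or an empty string if none exists."""
    body = msg.get_body(preferencelist=("plain",))
    return body.get_content().strip() if body is not None else ""


if __name__ == "__main__":
    message = build_decoy_message()
    print(part_types(message))
    print(plain_text(message))
```

A filter that scores only the first text part would see nothing but the filler words, which is presumably why the trick is used.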

--
Donate free food here
Re:whining? by cmacb · 2004-06-14 03:27 · Score: 3, Informative

Actually the TOS for Gmail says that doing things to attract spam is a violation, so they could just close the account on that basis. Also, if you don't sign on for a certain period of time (a few months I think) the account gets deleted. I had a Yahoo ID for years before I ever knew there was an e-mail address associated with it. I never read the mail associated with my AIM ID and I probably still have free Hotmail and a few other things like that floating around. Failure of these companies to delete idle accounts is what causes all the good names to be taken. I think Google is more on top of this than many of the others.
Re:One of the best things Google/GMail could do by ryen · 2004-06-14 03:37 · Score: 3, Informative

Those emails could also contain embedded image tags (known as web beacons). When you open the email and your client downloads the image, the server hosting it learns that you retrieved it and has just verified that your email address is active and spammable.
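
As a rough illustration of how a mail client might spot such beacons before loading them, the Python sketch below scans an HTML body for `<img>` tags whose `src` points at a remote HTTP(S) URL. The sample HTML, the tracker hostname, and the class and function names are hypothetical; real clients typically just block all remote images by default.

```python
from html.parser import HTMLParser


class RemoteImageFinder(HTMLParser):
    """Collect the src of every <img> tag that points at a remote URL."""

    def __init__(self):
        super().__init__()
        self.remote_images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        # Remote fetches reveal to the server that the message was opened.
        if src.lower().startswith(("http://", "https://")):
            self.remote_images.append(src)


def find_remote_images(html):
    """Return the remote image URLs referenced by an HTML message body."""
    finder = RemoteImageFinder()
    finder.feed(html)
    finder.close()
    return finder.remote_images


# Hypothetical message body: one tracking pixel and one inline attachment.
SAMPLE_HTML = (
    '<p>Hi there</p>'
    '<img src="http://tracker.example/pixel.gif?id=abc123" width="1" height="1">'
    '<img src="cid:logo">'
)

if __name__ == "__main__":
    print(find_remote_images(SAMPLE_HTML))
```

The per-recipient query string (`id=abc123` in the sample) is what ties the image request back to a specific address.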
didn't somebody already sort of attempt this? by cks3 · 2004-06-14 03:39 · Score: 2, Informative

Oh, wait, it was me! http://slashdot.org/comments.pl?sid=105335&cid=8965252
Eh, I only got 180MB worth of email and spam out of the deal though, before I decided to delete the account. The Gmail spam filter was rather horrible at the time, catching only the most tried and true SPAM, letting tons of other SPAM through, and then randomly flagging legitimate messages from people whom it had not flagged before. I think it has improved some since then.

--
http://www.sampletheweb.com
Re:whining? by Beryllium+Sphere(tm) · 2004-06-14 03:43 · Score: 5, Informative

>The whole Gmail "get a gigabyte of memory free" business model is predicated on most people using only a small fraction of that gigabyte

Why?

Google uses commodity IDE drives. Those retail for about fifty cents a gigabyte. Google's not paying retail.

I read a quote from a Googleperson that by the time the drive is installed in a system, powered, cooled, backed up and administered Google is paying two dollars for a gigabyte.

Good point about the problem of abandoned accounts, which won't bring Google any ad revenue. Wouldn't be surprised if they start euthanizing inactive accounts.
Re:One of the best things Google/GMail could do by FooAtWFU · 2004-06-14 04:00 · Score: 2, Informative

>>It's cheaper to just send mail to everyone
>no it's not.
It doesn't matter how cheap it is when 80% of spam supposedly comes from infected zombie computers. (I'm too lazy to actually LINK to the recent story on this.)

--
The World Wide Web is dying. Soon, we shall have only the Internet.
Re:More focus on false positives. by Anonymous Coward · 2004-06-14 05:49 · Score: 4, Informative

>false positive : spam getting past the filter ratio...
A false positive is not spam getting past the filter; it is non-spam getting blocked.
I.e., the filter says it's spam and it isn't, in the same way that a false-positive medical test says you have a virus when you don't.
Re:Not a fair test by SWroclawski · 2004-06-14 06:01 · Score: 3, Informative

Any evidence that they reject mail for various reasons? I'm sure there is. You can go ahead and see which RFCs they're in compliance with and which they aren't.

If you don't have a PTR record associated with your host, try to send mail to them, or malform your EHLO or something else.

You don't need to be "really sure" mail is spam. I'm talking about doing things like standards compliance checking, which will result in mail being rejected at delivery time.

>Is this just random theorizing, or does GMail really fail to deliver some emails it thinks is spam?

There's no reason to get insulting. RFC 2821 has a number of requirements for delivery of mail that many services ignore.
Re:More focus on false positives. by einTier · 2004-06-14 08:09 · Score: 2, Informative

False positive = condition you are testing for comes up positive, when it should be negative.
False negative = condition you are testing for comes up negative, when it should be positive.
Put in the context of a spam filter, it depends on whether you are testing for spam or for legitimate emails. If you are testing for spam (if spam then...), a false positive would be an email that is not spam getting sent to the spam folder or deleted. A false negative would be spam that lands in your inbox.
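
These definitions can be written down as a short Python sketch. It assumes the filter is testing for spam, as in the comment, and tallies each message into one of the four outcomes from a pair of booleans: whether the message really is spam, and whether the filter flagged it. The sample data and the function name are made up for illustration.

```python
from collections import Counter


def classify_outcomes(messages):
    """Tally filter outcomes from (is_spam, flagged_as_spam) pairs."""
    counts = Counter()
    for is_spam, flagged in messages:
        if is_spam and flagged:
            counts["true_positive"] += 1    # spam sent to the spam folder
        elif not is_spam and flagged:
            counts["false_positive"] += 1   # legitimate mail sent to spam
        elif is_spam and not flagged:
            counts["false_negative"] += 1   # spam that lands in the inbox
        else:
            counts["true_negative"] += 1    # legitimate mail in the inbox
    return counts


# Hypothetical sample: (is_spam, flagged_as_spam)
SAMPLE = [
    (True, True),
    (True, False),
    (False, True),
    (False, False),
    (True, True),
]

if __name__ == "__main__":
    for outcome, count in sorted(classify_outcomes(SAMPLE).items()):
        print(outcome, count)
```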

--
-------------------------------------------------- $665.95 -- retail price of the beast.
Re:Cache? by Calamity+Jane · 2004-06-14 15:34 · Score: 2, Informative

The cache link is pointing to the cache of his website, not of Google's.