Slashdot Mirror


Gmail Spam Filter Testing

An anonymous reader writes "What can you do with 1000MB of e-mail space on your Gmail account? One guy, by the name of Aaron Pratt ( prattboy@gmail.com ), has decided to test the spam filters of Google's Gmail service by having his Gmail account blasted with every kind of spam imaginable. He is testing to see how well Gmail's spam filters can sort out the spam from legitamate email (yes, he does get personal emails from people). As of May 25th, he was at about 30% of his Gmail account's 1GB capacity. You can track his progress on his website, http://gmail.prattboy.net (Google cache of this site: cache: gmail.prattboy.net). Here is also an article talking about Aaron's efforts from webpronews.com"

52 of 285 comments

  1. first spam? by miketang16 · · Score: 5, Funny

    psh.. i've done this to my friends before.. they didn't need to make a website to ask for it...

    --
    -------
    "In times of universal deceit, telling the truth becomes a revolutionary act."
    -- George Orwell
    1. Re:first spam? by Anonymous Coward · · Score: 4, Funny

      Oh, I didn't know that was you who passed my address along so I could b uy che.ap v1agra! Thanks! Those pi1ls made my p.e..ni.s gr0w 3-5 lnches! It was really very thoughtful of you, Mike.

  2. The Filter is great! by umrgregg · · Score: 5, Funny
    Apparently, Google's spam filter even filters messages that aren't there. From the website:
    3778 messages were received, totaling 213 MB.
    3917 were spam, and Gmail correctly identified 41.9% of these messages.
    Fantastic
    --
    NMG
    1. Re:The Filter is great! by Anonymous Coward · · Score: 5, Funny

      No, that's just a classic threaded code bug:

      They just forgot the mutex surrounding the two snprintfs... so this user probably got 139 messages in the time it takes to execute snprintf, all spam.

      Which is.... about right.

    2. Re:The Filter is great! by aismail3 · · Score: 5, Informative

      When I add up the figures from May 13 to 19, I get that 4869 messages were received. 4717 of those were spam, and 1820 were marked, so Gmail's success rate was 38.6%.
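
The percentage above can be checked with a few lines of Python. This is only a sketch of the arithmetic using the three totals quoted in the comment; the function name `detection_rate` is made up for illustration.

```python
def detection_rate(spam_total, spam_caught):
    """Percentage of spam messages that the filter flagged as spam."""
    return 100.0 * spam_caught / spam_total

# Totals quoted in the comment for May 13 to 19.
received = 4869
spam = 4717
marked = 1820

rate = detection_rate(spam, marked)
print(f"{rate:.1f}% of spam was marked")
print(f"{100.0 * spam / received:.1f}% of received mail was spam")
```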

  3. One of the best things Google/GMail could do by Anonymous Coward · · Score: 5, Interesting

    Is use the GMail data to operate a checksum blacklist. Obviously, if thousands (or millions) of their users are getting the exact same email, it's probably spam.

    1. Re:One of the best things Google/GMail could do by kryptkpr · · Score: 4, Informative

      Spammers have thought of this already, and they send nearly-identical messages.. Ever notice the random strings of letters and/or numbers at the bottom/in the subjects of spams?

      --
      DJ kRYPT's Free MP3s!
    2. Re:One of the best things Google/GMail could do by lockefire · · Score: 5, Funny

      Actually, I get a whole lot of emails with the random words and nothing else. I haven't quite caught on to the advertising strategy in that.

    3. Re:One of the best things Google/GMail could do by Cruciform · · Score: 5, Interesting

      I've been getting them as well.
      The only reason I could think of someone sending those around is to bog up Bayesian filters with random crap, possibly lowering their effectiveness.

      Any spammers/spam-experts feel like enlightening us? :)

    4. Re:One of the best things Google/GMail could do by Halo1 · · Score: 5, Informative

      Most of the time, these messages contain both a text/plain section with only random words, and then a text/html part with the real payload. If you use mutt or so, you most likely only see the text/plain stuff. Another trick is using just a text/html section with random text, but also with an image that contains the real payload.

      --
      Donate free food here
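
A rough sketch of the two-part trick described above, using Python's standard `email` package. The sample message, its subject, and the helper names `build_sample` and `parts_by_type` are invented for illustration; real spam would arrive as raw bytes from a mail server rather than being built locally.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

def build_sample():
    """Build a multipart/alternative message: decoy text plus HTML payload."""
    msg = EmailMessage()
    msg["Subject"] = "hello"
    # text/plain part: only random words, which is all a text-mode reader shows
    msg.set_content("walrus copper lantern seventeen")
    # text/html part: the actual pitch
    msg.add_alternative("<html><body><p>Buy now</p></body></html>", subtype="html")
    return msg

def parts_by_type(raw_bytes):
    """Map each leaf part's content type to its decoded text."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    found = {}
    for part in msg.walk():
        if part.is_multipart():
            continue  # skip the container, keep only the leaf parts
        found[part.get_content_type()] = part.get_content().strip()
    return found

parts = parts_by_type(build_sample().as_bytes())
for content_type, text in parts.items():
    print(content_type, "->", text)
```

A filter that only scores the text/plain part would see the decoy words and miss the HTML payload, which is presumably why the parts are split this way.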
    5. Re:One of the best things Google/GMail could do by jefe7777 · · Score: 5, Insightful

      >> You think they bother?

      heh heh...absolutely.

      100 known good addresses are worth 10,000 "who the fuck knows" addresses.

      >>It's cheaper to just send mail to everyone

      no it's not.

      let's pretend you are a spammer, and you want to send out spam.

      If you target 1 billion questionable addresses, each time a client has a new campaign, then that's 1 billion pieces you have to deliver. every time.

      what if you have 1000 clients? that's 1000 billion deliveries.

      do you see where this is going? if you don't KNOW WHAT A VALID EMAIL ADDRESS IS, YOU HAVE TO GUESS.

      but what if the first time you send out just a "test" to those billion addresses, and then subtract the ones that bounce?

      You are left with 50,000 known good addresses.

      that's gold. You now have 1/20th of the load, and you are now serving your clients quicker, with a helluva lot less load. you are only using an open relay for 1/20th of the time.

      overall a smaller footprint by 1/20th.

      you tell me. does it make sense to blindly blast out email?

    6. Re:One of the best things Google/GMail could do by ckd · · Score: 4, Funny

      They have a much higher ratio of PhDs than Microsoft, or just about anyone short of a hospital.

      Remind me not to go to your hospital. I want MDs treating me, not people who can give me a dissertation on ancient Sumeria or something. (MDs who also know about ancient Sumeria excepted.)

    7. Re:One of the best things Google/GMail could do by dragonman97 · · Score: 4, Interesting

      Indeed - while I was doing a lot of spam fighting at work, I reviewed a honeypot I'd set up, and was amazed. I used mutt to review the messages, and found a couple of messages where the text part was a page or two from "The Wizard of Oz" and the nasty offer for some kind of auto insurance or other crap was in the HTML section, replete with hidden hash busters behind color backgrounds. These guys are sharp - they must be paying some smart programmers a lot of money, and it's only sad that they've sunk to such levels.

    8. Re:One of the best things Google/GMail could do by letxa2000 · · Score: 5, Insightful
      Spammer is trying to do two things: 1. break any Bayesian filter used on that mail server/inbox. Adding noise to the filter will allow more mail through as "questionable". This might still be tagged as spam, but not as readily as it would be without the added noise

      Except that won't work, as anyone that understands Bayesian filtering will tell you. In the case of every message with "random words" I've checked recently, the random words actually increased the spam score of that message. Why? Because it seems the random words aren't so random and either the same spammer is using the same "random words" over and over or various spammers are using sets of the same words. Over time most of the "random words" they use actually become great indicators of spam since my real email doesn't typically contain the random words they use.

      In one recent analysis, 10 random words were inserted by the spammer. He got lucky and 1 of those words actually had a very low score for my Bayesian corpus. Unfortunately (for him), the other 9 words had scores of 99.99%! His use of random words literally nuked any possibility of him getting through my filter.

      Anyway, random words will not help spammers get through Bayesian filters. But it seems that many people (both spammers and non-spammers) think it will. But, hey, that's good for me: as long as "random words" is seen by spammers as a viable solution to Bayesian filters, my Bayesian filter will continue to work and will not have to deal with any innovative way to get around the filter (if any exists).
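
To see why one lucky word does not rescue the message, here is a minimal sketch of the naive Bayes combining rule popularized for spam filtering (product of per-word probabilities against product of their complements). It is an assumption that the poster's filter combines scores this way, and the 0.01 score for the lucky word is invented; only the 99.99% figure comes from the comment.

```python
from math import prod

def combined_spam_probability(word_probs):
    """Combine per-word spam probabilities, treating words as independent."""
    spam = prod(word_probs)
    ham = prod(1.0 - p for p in word_probs)
    return spam / (spam + ham)

# One "lucky" word with a low spam score (0.01 is a made-up value)
# and nine words the corpus has learned to score at 99.99%.
random_words = [0.01] + [0.9999] * 9

score = combined_spam_probability(random_words)
print(f"combined spam probability: {score:.6f}")
```

Under this rule the nine high-scoring words overwhelm the single low-scoring one, which matches the poster's observation that the padding hurt the spammer.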

  4. He gave out his e-mail address... by Anonymous Coward · · Score: 5, Funny

    ... to the entire Slashdot community! Now he's going to be flooded with all sorts of spam and shit. LOL!

    Oh... right. :)

    1. Re:He gave out his e-mail address... by umrgregg · · Score: 4, Funny

      Notice the reader who submitted the story was anonymous... Gotta love friends who sign you up for spam.

      --
      NMG
    2. Re:He gave out his e-mail address... by Algan · · Score: 4, Interesting

      It's not as bad as you think. I posted a dedicated email address to Slashdot twice already, just to see what volume of spam I get. Surprisingly, it's only 2-3 messages every other day or so.

      Well, I guess I need a booster shot, so here it is: slashdot@hates.ms. Spam away...

      --
      If con is the opposite of pro, is Congress the opposite of progress?
  5. whining? by Gothmolly · · Score: 5, Insightful

    What's Google going to do to protect its users from mail bombs?

    Now you're complaining that your free, 1GB-limit, access-from-anywhere email service could be mailbombed? Live with it. If Google "decides" anything more about our emails, we put on our tinfoil hats and scream. If we broadcast a bogus email address, obtained from gmail for clearly sinister purposes, and it gets mailbombed, we whine that Google doesn't "protect" us. What's the story, or are we all just schizophrenic?

    Don't want that "vulnerability"? Don't use Gmail!

    --
    I want to delete my account but Slashdot doesn't allow it.
    1. Re:whining? by supersnail · · Score: 5, Insightful

      I don't think it's about protection, just practicality. Google offers a spam filter; the little pratt tested it and found it wanting.

      I think it's more of a problem for Google than the end users. The whole Gmail "get a gigabyte of memory free" business model is predicated on most people using only a small fraction of that gigabyte but feeling good about the capacity being there. If I open up a gmail account, get p*ss*d off with the spam and go elsewhere without closing the account, the 1GB will fill up with spam in a couple of months, and Google will end up storing terabytes of spam for customers who no longer use the service.

      --
      Old COBOL programmers never die. They just code in C.
    2. Re:whining? by Pharmboy · · Score: 5, Insightful

      Now you're complaining...

      That is his JOB, to point out shortcomings of the system. He is a tester, and he is doing it for FREE. Google doesn't want testers who get 3 emails a day, they want people to test the living shit out of the service and point out what is wrong with it. Everyone knows Google will try to fix all the bugs, so all the press, good or bad, is still good press.

      If Google barfs when handling 999 messages in 4 minutes during testing, imagine when several million people have gmail accounts. Fortunately, now Google has an event to look at to see what the problem is. When you are trying to harden a system, YOU MUST BREAK IT OVER AND OVER AGAIN, to see where it is weak. This is what is happening.

      My impression is that the techs at Google are spending a significant amount of time saying "oh shit, never thought of that, cool." which is the ENTIRE REASON FOR TESTING. They can't think of every situation by themselves. This is also the entire concept behind "open software is more secure". Google's gmail is going to have bugs at this stage and lots of them, period. Google knows this, hell, everyone knows this (this is why it's in testing, and not open to the public yet, duh)

      It's not whining, it's stating the facts, which Google obviously WANTS him to gather, as a TESTER. Seems to me that he is going beyond the call of duty to test their servers, since he is spending a fair amount of his own time.

      --
      Tequila: It's not just for breakfast anymore!
    3. Re:whining? by Beryllium+Sphere(tm) · · Score: 5, Informative

      >The whole Gmail "get a gigabyte of memory free" business model is predicated on most people using only a small fraction of that gigabyte

      Why?

      Google uses commodity IDE drives. Those retail for about fifty cents a gigabyte. Google's not paying retail.

      I read a quote from a Googleperson that by the time the drive is installed in a system, powered, cooled, backed up and administered Google is paying two dollars for a gigabyte.

      Good point about the problem of abandoned accounts, which won't bring Google any ad revenue. Wouldn't be surprised if they start euthanizing inactive accounts.

  6. Not a fair test by SWroclawski · · Score: 5, Insightful

    He's not counting all the mail that Google is rejecting and not even being allowed in for further classification.

  7. Should be interesting, what filters? by Clinoti · · Score: 4, Interesting

    Can anyone provide a link or source to the kind of filters google has working on gmail?

    --

    Let's keep in mind that patents are in place to keep lawyers employed and keep them litigating. -CatGrep

  8. I'll help by L.+VeGas · · Score: 5, Funny

    Let's all send him an email and ask him how it's working out.

  9. News... by somethinghollow · · Score: 4, Funny

    "Here is also an article talking about Aaron's efforts from webpronews.com""

    Since we are talking about spam and obtaining more spam, I don't know if I should read the site the article is on as "web pro news dot com" or "web pron ews dot com"...

    I guess I'll figure it out sometime.

  10. Not that impressive by chrisgeleven · · Score: 4, Informative

    Seems like Gmail only filters approx. 50% of spam. That is not very impressive, since the top anti-spam software and e-mail clients (such as Outlook 2003 and Mozilla Thunderbird) can easily reach 95% accuracy in spam filtering.

    I am starting to second guess whether I should transfer everything to my Gmail account.

    1. Re:Not that impressive by Apiakun · · Score: 5, Insightful

      Don't forget that this is google's first foray into mail software, and it is still in beta. I have so far gotten very little spam in my gmail inbox.

    2. Re:Not that impressive by XO · · Score: 4, Informative

      Sure, but those will also mark virtually every legitimate email as spam, as WELL. Yeah, you can have 95% accuracy... but then you have to go through your hundreds of messages marked spam just to find your real email!

      (example, after two weeks of using spam-assassin, it decided that every e-mail sent to me was spam.. i no longer received anything in my Inbox, everything was transferred to the Spambox. It took me another two weeks tweaking spam-assassin's kill rate down to about a 50% accuracy, and now i actually receive all my emails.)

      --
      "Champagne for my real friends - and real pain for my sham friends!" http://ericblade.postalboard.com/
    3. Re:Not that impressive by peeping_Thomist · · Score: 4, Funny

      I have so far gotten very little spam in my gmail inbox.

      What was that address again?

      --
      Anything worth doing is worth doing badly -- G.K. Chesterton
    4. Re:Not that impressive by furball · · Score: 4, Funny

      Mine's gdnguyen@gmail.com.

      Please only email me if you're barely legal and running a webcam. Thank you.

  11. Re:gmail still beta by waddgodd · · Score: 5, Funny

    >isn't gmail still in 'beta' stages? if so, isn't a review of
    >spam filtering techniques a little premature?

    What part of Beta TEST escapes you here?

    --
    Just because you're paranoid doesn't mean they aren't out to get you
  12. Is this the AventureMail guy? by magefile · · Score: 5, Interesting

    The guy who got booted off AventureMail (2GB free) for trying to test their spam filters? The story is on Kuro5hin, if anyone wants to see it.

  13. My own gmail testing by Twid · · Score: 5, Informative

    I did some testing of my own. I forwarded a ton of spam from my personal account to my gmail account, just to see what would get through and what would be filtered. For me, gmail was really effective, but strangely, one Nigerian e-mail scam mail didn't get tagged.

    It was from " Mr Jubril Udeh Manager of Credit and Accounts Department of North Atlantic Securities Sarls Lome-Togo Republic."

    Now, the funny part is not that the mail made it through, but that google also decided to show me contextual ads on that account. Currently, the ads are:
    - Payroll Cards a Poor Substitute for Checking Account
    - Tips for Tackling Check Fraud
    - Sophos hoax description: Ethiopian airline letter
    - FAP non-US Investment FAQs

    In the past the mail has also shown me ads on how to open an off-shore bank account. I'm glad google is willing to help me with the $10.5 million dollars that I'm about to receive! :)

    --
    - "When you want something with all your heart, the entire universe conspires to give it to you" -Paulo Coelho
  14. Spam is always personalized by Sulka · · Score: 4, Informative

    Checksums are nearly useless against spam. It only takes one byte to change the checksum value, and probably more than 90% of spam contains a personalization code to check which addresses are functional. Different code = different checksum.

    This doesn't mean it wouldn't be possible to create a system which would automatically detect individual spam messages based on tagging known spam, you just have to be smarter about the detection than just plain MD5ing the email body.
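
A small sketch of the point above: two messages that differ only in a tracking code get unrelated MD5 digests, while a fuzzier comparison still sees them as near-duplicates. Word shingles with Jaccard similarity are just one possible "smarter" technique, swapped in here for illustration; the sample text, the tracking codes, and the function names are all made up.

```python
import hashlib

def md5_hex(text):
    """Plain checksum of the message body."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def shingles(text, k=3):
    """Set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Share of shingles the two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

base = ("Cheap pills shipped overnight no prescription needed "
        "order today and save big on every refill")
msg_a = base + " ref a81xq"   # hypothetical personalization code
msg_b = base + " ref z07pk"   # same pitch, different code

print("md5 equal:", md5_hex(msg_a) == md5_hex(msg_b))
print("shingle similarity:", jaccard(msg_a, msg_b))
```

A real system would need a similarity threshold tuned against legitimate bulk mail such as mailing lists, which also arrives in many near-identical copies.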

    --
    "Although it is not true that all conservatives are stupid, it is true that most stupid people are conservative."
    1. Re:Spam is always personalized by Thuktun · · Score: 4, Informative

      gzip it and compare the files. a short tracking code will make a negligible difference.

      Not necessarily.

      Lempel-Ziv based algorithms, like the one used by gzip, build a compression dictionary on the fly. Any "personalization" added to the message will affect the dictionary to varying degrees from then onward. If it's near the beginning, the personalization would greatly skew the selected dictionary identifiers. Though probably this would have little effect on the actual compression of the data, it would radically change the representation of the compressed image. The farther this personalization is from the start of the data to be compressed, the less effect it will have.
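
The effect can be measured rather than argued. The sketch below uses Python's `zlib` module (the same DEFLATE algorithm gzip uses, with a different wrapper) to compress a message with a tracking tag at the start or at the end, then counts how many leading compressed bytes two variants share. The message body and tags are invented. No particular result is claimed: DEFLATE also Huffman-codes each block, so even a late change can alter bytes earlier in the output.

```python
import zlib

def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that are identical in a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Hypothetical spam body and two personalization tags of equal length.
body = ("Limited time offer on replica watches. " * 20).encode("ascii")
tag_a, tag_b = b"[id:a81xq] ", b"[id:z07pk] "

early_a = zlib.compress(tag_a + body)   # tag at the start
early_b = zlib.compress(tag_b + body)
late_a = zlib.compress(body + tag_a)    # tag at the end
late_b = zlib.compress(body + tag_b)

print("tag at start: shared prefix",
      common_prefix_len(early_a, early_b), "of", len(early_a), "bytes")
print("tag at end:   shared prefix",
      common_prefix_len(late_a, late_b), "of", len(late_a), "bytes")
```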

  15. About spam and blocking by AviLazar · · Score: 4, Interesting

    While we cannot block every domain name (i.e. if you get spam from $#(*$#sexphreak@yahoo.com) because it will alienate your legitimate contacts, there are many domain names that we can block (i.e. @spam-your-gmail.com). Yahoo provides email/domain name blocking, but limits this to 100 (unless you are paying). Do we know if gmail will have this limitation?
    -A
    *just for those who didn't know, the above domain names and email accounts are random, any resemblance to an actual domain or email account is purely coincidental, and if you choose to do so, you should sue /., not me :)

    --

    I mod down so you can mod up. Your welcome.
  16. 1gb Relieves Spam Concerns by osewa77 · · Score: 4, Interesting

    I have subjected my e-mail address, afriguru@gmail.com, to the same abuse, by redirecting all e-mail addresses that receive lots of junk mail to this one and posting the address unprotected to lots of websites and newsgroups. At the initial stage, a lot of 419 scam mails got through, but now I hardly get any spam. No false positives for me so far.
    _____________________
    Seun Osewa, Abeokuta Nigeria

  17. Re:pre-emptive strike theory by umrgregg · · Score: 4, Funny

    Right! My only idea is that Google's technology is so advanced, it filters messages before they are even sent. It's gotta be a result of faster-than-light calculations. Boy, I'm gonna buy me some stock.

    --
    NMG
  18. gmail spelling by Anonymous Coward · · Score: 5, Funny

    >legitamate

    How about having Slashdot editors/Hemos test the gmail spell checker too?

  19. 0% Spam by yuri · · Score: 5, Interesting

    Spam is unsolicited, so google should filter none of his mail.

    This guy solicited it.

  20. Lack of updates? by Xiadix · · Score: 5, Interesting

    Did anybody else notice that his site hasn't been updated in almost a month (May 25)? Seems his project is no longer working. I wonder if Google booted him.

    KevG

  21. It's going to get a lot better... by waytoomuchcoffee · · Score: 4, Interesting

    For those of you that don't have Gmail yet, there is a little "Report Spam" button you can use to, well, report spam. When Gmail gets a few million users, and even 1% use this little button, you are going to see the spam detect rate skyrocket.

  22. Re:Hmmm.. weird stats... by Satai · · Score: 4, Informative

    no, you inverted it. You want MB/message, not message/MB.

    3778 messages / 213 MB = 17.74 messages / MB
    213 MB / 3778 messages = 0.0564 MB / message

    So that's pretty reasonable.
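
The two ratios are easy to mix up, so here is the same calculation as a short Python sketch. It uses only the two totals quoted from the site; the kilobyte conversion assumes 1 MB = 1024 KB.

```python
messages = 3778
total_mb = 213

msgs_per_mb = messages / total_mb   # how many messages fit in one MB
mb_per_msg = total_mb / messages    # average size of one message in MB
kb_per_msg = mb_per_msg * 1024      # same average, in KB

print(f"{msgs_per_mb:.2f} messages / MB")
print(f"{mb_per_msg:.4f} MB / message")
print(f"{kb_per_msg:.1f} KB / message")
```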

  23. Cache? by Freon115 · · Score: 5, Funny

    Do you really expect the Google servers to go down because of /.? ;)

    1. Re:Cache? by leo_llew · · Score: 5, Funny

      Obviously not, they provided a link to the GOOGLE Cache ;)

  24. Viola by doodlelogic · · Score: 4, Funny

    If I could stop all the spam I get...I'd feel like a whole string quartet!

  25. Re:How to never get spam by mumblestheclown · · Score: 5, Funny
    Hi! And welcome to the Internet! We're glad to have you aboard.

    Just to get you started, I'll give you a quick hint: virtually every internet discussion on spam includes some high and mighty moron that claims that by not giving out his email address, he never gets spam.

    The problem is that for every one of those, there are plenty more who follow the same precautions and yet get plenty of spam to those accounts for a variety of reasons. Clearly, your solution is not the answer to "how to never get spam."

    A good rule for using the internet is to read a few discussions before you post. This way, you will be less likely to post something that makes you look naive. So sit back, relax, and enjoy a steaming hot cup of STFU while you read and learn!

  26. Wow by EaterOfDog · · Score: 5, Funny

    His wang is going to be huge!

    --

    Crushing my karma one post at a time.
  27. More focus on false positives. by ron_ivi · · Score: 5, Insightful
    Reviews of spam filters always seem to focus on how much stuff they block.

    The consequences of blocking a non-spam email are so much worse (parent not hearing from kid, the customer that would have saved your startup) than a spam getting in, so I wish the spam filter reviews would focus on those.

    1. Re:More focus on false positives. by Anonymous Coward · · Score: 4, Informative
      false positive : spam getting past the filter ratio...

      A false positive is not one of spam getting past the filter, it's one of non-spam getting blocked.

      I.e. the filter says it's spam, and it isn't - in the same way that a false-positive medical test says you have a virus even when you don't.
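
The distinction can be made concrete with a tiny scoring sketch. Each result is a pair of (actually spam, flagged by filter); the ten sample results are invented purely to illustrate the definitions, and `score_filter` is a hypothetical helper.

```python
def score_filter(results):
    """Count filter errors from a list of (is_spam, flagged) pairs."""
    false_pos = sum(1 for is_spam, flagged in results if flagged and not is_spam)
    false_neg = sum(1 for is_spam, flagged in results if is_spam and not flagged)
    ham_total = sum(1 for is_spam, _ in results if not is_spam)
    spam_total = sum(1 for is_spam, _ in results if is_spam)
    return {
        "false_positives": false_pos,   # real mail wrongly blocked
        "false_negatives": false_neg,   # spam that got past the filter
        "false_positive_rate": false_pos / ham_total,
        "false_negative_rate": false_neg / spam_total,
    }

# Made-up sample: 5 spam (3 caught, 2 missed), 5 real mails (1 wrongly flagged).
sample = ([(True, True)] * 3 + [(True, False)] * 2 +
          [(False, False)] * 4 + [(False, True)] * 1)

stats = score_filter(sample)
print(stats)
```

A review that reports only the share of spam caught is describing the false negative side; the false positive rate is the number the parent comment is asking for.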

  28. Re:If this guy has used 30% of his capacity... by Zeebs · · Score: 4, Funny

    and I get a cubic buttload of crap daily

    God damned metric system.

    --

    Happy Noodle Boy says "F###ing doughnut! Mock me? You fried cyclops!!"
  29. New spin on the "word salad" strategy by Scott+Richter · · Score: 5, Interesting
    Except that won't work, as anyone that understands Bayesian filtering will tell you. In the case of every message with "random words" I've checked recently, the random words actually increased the spam score of that message. Why? Because it seems the random words aren't so random and either the same spammer is using the same "random words" over and over or various spammers are using sets of the same words. Over time most of the "random words" they use actually become great indicators of spam since my real email doesn't typically contain the random words they use.

    Right, and my Thunderbird Bayesian filter catches all of those word salad approaches. But they've come up with a new one - what I call the "encyclopedia attack."

    What they do is copy an encyclopedia entry and put it at the bottom of their spam. The thing is usually a few paragraphs long, so that textually it dominates the message. The subjects are fairly random, and are occasionally educational ;)

    The problem is that the text of this doesn't trip the "too many strange words" flag that's used for word salads. My Thunderbird filter is really having trouble with these. Anyone else having trouble with these spams?