Slashdot Mirror


Gmail Spam Filter Testing

An anonymous reader writes "What can you do with 1000MB of e-mail space on your Gmail account? One guy, by the name of Aaron Pratt ( prattboy@gmail.com ), has decided to test the spam filters of Google's Gmail service by having his Gmail account blasted with every kind of spam imaginable. He is testing to see how well Gmail's spam filters can sort out the spam from legitimate email (yes, he does get personal emails from people). As of May 25th, he was at about 30% of his Gmail account's 1GB capacity. You can track his progress on his website, http://gmail.prattboy.net (Google cache of this site: cache: gmail.prattboy.net). Here is also an article talking about Aaron's efforts from webpronews.com"

11 of 285 comments

  1. Not that impressive by chrisgeleven · · Score: 4, Informative

    Seems like Gmail only filters approx. 50% of spam. That is not very impressive, since the top anti-spam software and e-mail clients (such as Outlook 2003 and Mozilla Thunderbird) can easily reach 95% accuracy in spam filtering.

    I am starting to second guess whether I should transfer everything to my Gmail account.

    1. Re:Not that impressive by XO · · Score: 4, Informative

      Sure, but those will also mark virtually every legitimate email as spam, as WELL. Yeah, you can have 95% accuracy... but then you have to go through your hundreds of messages marked spam just to find your real email!

      (For example, after two weeks of using SpamAssassin, it decided that every e-mail sent to me was spam. I no longer received anything in my Inbox; everything was transferred to the Spambox. It took me another two weeks of tweaking SpamAssassin's kill rate down to about 50% accuracy, and now I actually receive all my emails.)

      --
      "Champagne for my real friends - and real pain for my sham friends!" http://ericblade.postalboard.com/
  2. Re:One of the best things Google/GMail could do by kryptkpr · · Score: 4, Informative

    Spammers have thought of this already, and they send nearly-identical messages.. Ever notice the random strings of letters and/or numbers at the bottom/in the subjects of spams?

    --
    DJ kRYPT's Free MP3s!
  3. My own gmail testing by Twid · · Score: 5, Informative

    I did some testing of my own. I forwarded a ton of spam from my personal account to my gmail account, just to see what would get through and what would be filtered. For me, gmail was really effective, but strangely, one Nigerian e-mail scam mail didn't get tagged.

    It was from " Mr Jubril Udeh Manager of Credit and Accounts Department of North Atlantic Securities Sarls Lome-Togo Republic."

    Now, the funny part is not that the mail made it through, but that Google also decided to show me contextual ads on that account. Currently, the ads are:
    - Payroll Cards a Poor Substitute for Checking Account
    - Tips for Tackling Check Fraud
    - Sophos hoax description: Ethiopian airline letter
    - FAP non-US Investment FAQs

    In the past the mail has also shown me ads on how to open an off-shore bank account. I'm glad Google is willing to help me with the $10.5 million that I'm about to receive! :)

    --
    - "When you want something with all your heart, the entire universe conspires to give it to you" -Paulo Coelho
  4. Spam is always personalized by Sulka · · Score: 4, Informative

    Checksums are nearly useless against spam. It only takes one byte to change the checksum value, and probably more than 90% of spam contains a personalization code to check which addresses are functional. Different code = different checksum.

    This doesn't mean it wouldn't be possible to create a system which would automatically detect individual spam messages based on tagging known spam, you just have to be smarter about the detection than just plain MD5ing the email body.
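To illustrate the point above, here is a minimal sketch, not anything Gmail is known to do. It assumes two made-up spam messages that differ only in a trailing tracking code, and it uses word-shingle Jaccard similarity as one possible "smarter" comparison. The message texts and the helper names `md5_hex`, `shingles` and `jaccard` are hypothetical.

```python
import hashlib


def md5_hex(text):
    """Plain MD5 of the message body; any one-byte change alters it."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Set of overlapping k-word windows from the lowercased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a, b):
    """Shared shingles divided by all shingles seen in either message."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


# Made-up example messages (assumptions, not taken from the thread).
BASE = ("Dear friend, I have an urgent business proposal that will be of "
        "great benefit to both of us. Please reply with your bank details")
spam_a = BASE + " ref:x7Qp2"   # personalization code for recipient A
spam_b = BASE + " ref:k9Zt4"   # personalization code for recipient B
legit = "Hi Mom, dinner on Sunday at six works for me"

print("MD5 equal:", md5_hex(spam_a) == md5_hex(spam_b))
print("spam vs spam similarity:", jaccard(spam_a, spam_b))
print("spam vs legit similarity:", jaccard(spam_a, legit))
```

The checksums disagree even though the two spam bodies are almost identical, while the shingle overlap stays high. A real filter would also have to cope with random filler words, which this sketch does not address.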

    --
    "Although it is not true that all conservatives are stupid, it is true that most stupid people are conservative."
    1. Re:Spam is always personalized by Thuktun · · Score: 4, Informative

      >gzip it and compare the files. a short tracking code will make a negligible difference.

      Not necessarily.

      Lempel-Ziv based algorithms, like the one used by gzip, build a compression dictionary on the fly. Any "personalization" added to the message will affect the dictionary to varying degrees from then onward. If it's near the beginning, the personalization would greatly skew the selected dictionary identifiers. Though probably this would have little effect on the actual compression of the data, it would radically change the representation of the compressed image. The farther this personalization is from the start of the data to be compressed, the less effect it will have.
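A rough way to try this out, as a sketch under stated assumptions: Python's `zlib` module implements DEFLATE (LZ77 plus Huffman coding), the same family gzip uses. The message body and tracking codes below are made up, and `common_prefix_len` and `compare` are hypothetical helpers. Because DEFLATE also Huffman-codes its output, the position effect described above may not show up cleanly on short inputs, so no particular result is claimed here.

```python
import zlib

# Made-up message body and tracking codes (assumptions for illustration).
BODY = ("Buy cheap watches now, limited offer for valued customers. " * 20).encode("ascii")
TAG_A = b"id=a81f03c2"
TAG_B = b"id=7be940d5"


def common_prefix_len(a, b):
    """Number of leading bytes that two byte strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def compare(at_start):
    """Compress two messages differing only in the tag; report sizes and shared prefix."""
    if at_start:
        m1, m2 = TAG_A + b" " + BODY, TAG_B + b" " + BODY
    else:
        m1, m2 = BODY + b" " + TAG_A, BODY + b" " + TAG_B
    c1, c2 = zlib.compress(m1, 9), zlib.compress(m2, 9)
    return len(c1), len(c2), common_prefix_len(c1, c2)


for where, at_start in (("start", True), ("end", False)):
    size1, size2, shared = compare(at_start)
    print(f"tag at {where}: sizes {size1} and {size2} bytes, "
          f"first {shared} compressed bytes identical")
```

The compressed sizes can be compared to see how little the tag costs, and the shared-prefix count shows how early the two compressed streams diverge. Comparing compressed files byte for byte therefore says little about how similar the originals were.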

  5. Re:The Filter is great! by aismail3 · · Score: 5, Informative

    When I add up the figures from May 13 to 19, I get that 4869 messages were received. 4717 of those were spam, and 1820 were marked, so Gmail's success rate was 38.6%.
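The arithmetic above can be checked with a few lines. This sketch assumes every one of the 1820 marked messages really was spam, which the figures quoted here do not confirm; the variable names are illustrative.

```python
# Figures quoted in the comment above (May 13 to 19).
received = 4869   # all messages
spam = 4717       # messages that were spam
marked = 1820     # messages the filter marked as spam

# Assumption: every marked message was actually spam.
legit = received - spam      # non-spam messages
missed = spam - marked       # spam that reached the inbox
catch_rate = marked / spam   # share of spam that was caught

print(f"legitimate messages: {legit}")
print(f"spam that got through: {missed}")
print(f"catch rate: {catch_rate:.1%}")
```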

  6. Re:Hmmm.. weird stats... by Satai · · Score: 4, Informative

    No, you inverted it. You want MB/message, not messages/MB.

    3778 messages / 213 MB = 17.74 messages / MB
    213 MB / 3778 messages = 0.0564 MB / message

    So that's pretty reasonable.
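For anyone who wants to redo the division, a minimal sketch using only the two figures quoted above. The conversion to KB assumes 1 MB = 1024 KB.

```python
messages = 3778
used_mb = 213

messages_per_mb = messages / used_mb   # how many messages fit in one MB
mb_per_message = used_mb / messages    # average size of one message
kb_per_message = mb_per_message * 1024 # assumes 1 MB = 1024 KB

print(f"{messages_per_mb:.2f} messages / MB")
print(f"{mb_per_message:.4f} MB / message")
print(f"{kb_per_message:.1f} KB / message")
```

The two ratios are reciprocals of each other, so multiplying them should give 1.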

  7. Re:One of the best things Google/GMail could do by Halo1 · · Score: 5, Informative

    Most of the time, these messages contain both a text/plain section with only random words, and then a text/html part with the real payload. If you use mutt or so, you most likely only see the text/plain stuff. Another trick is using just a text/html section with random text, but also with an image that contains the real payload.
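The first trick described above can be reproduced with Python's standard `email` package. This is only a sketch: the subject, the filler words and the HTML text are made up, and `leaf_types` is a hypothetical helper.

```python
from email.message import EmailMessage

# Build a made-up multipart/alternative message (all content is an assumption).
msg = EmailMessage()
msg["Subject"] = "hello"
# text/plain part: only random filler words.
msg.set_content("garden velvet orbit lantern quiet")
# text/html part: the real payload a graphical client would render.
msg.add_alternative(
    "<html><body><p>Real payload goes here</p></body></html>",
    subtype="html",
)


def leaf_types(message):
    """Content types of the non-container parts, in order."""
    return [p.get_content_type() for p in message.walk() if not p.is_multipart()]


plain = msg.get_body(preferencelist=("plain",)).get_content()
html = msg.get_body(preferencelist=("html",)).get_content()

print(msg.get_content_type())
print(leaf_types(msg))
print("plain part:", plain.strip())
print("html part:", html.strip())
```

A filter or reader that looks only at the text/plain part sees the filler words, which is why a text-mode client shows nothing resembling the advertisement.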

    --
    Donate free food here
  8. Re:whining? by Beryllium+Sphere(tm) · · Score: 5, Informative

    >The whole Gmail "get a gigabyte of memory free" business model is predicated on most people using only a small fraction of that gigabyte

    Why?

    Google uses commodity IDE drives. Those retail for about fifty cents a gigabyte. Google's not paying retail.

    I read a quote from a Googleperson that by the time the drive is installed in a system, powered, cooled, backed up and administered Google is paying two dollars for a gigabyte.

    Good point about the problem of abandoned accounts, which won't bring Google any ad revenue. Wouldn't be surprised if they start euthanizing inactive accounts.

  9. Re:More focus on false positives. by Anonymous Coward · · Score: 4, Informative

    >false positive : spam getting past the filter ratio...

    A false positive is not one of spam getting past the filter, it's one of non-spam getting blocked.

    I.e. the filter says it's spam, and it isn't - in the same way that a false-positive medical test says you have a virus even when you don't.
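The terminology above can be written down as a small sketch. The function name `outcome` and the sample calls are illustrative only; "positive" here means "the filter said spam".

```python
def outcome(is_spam, flagged):
    """Name the result of one filtering decision.

    is_spam: whether the message really is spam.
    flagged: whether the filter marked it as spam.
    """
    if flagged and not is_spam:
        return "false positive"   # real mail wrongly blocked
    if not flagged and is_spam:
        return "false negative"   # spam that got past the filter
    return "true positive" if is_spam else "true negative"


print(outcome(is_spam=False, flagged=True))   # legitimate mail sent to the spam folder
print(outcome(is_spam=True, flagged=False))   # spam delivered to the inbox
```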