Slashdot Mirror


Zebras Get Less Spam Than Aardvarks

MojoKid writes "A recent study (PDF) by Richard Clayton at Cambridge University determined that the first letter of someone's email address directly affects how much spam they receive. As shown in the graph at either link above, email addresses with numbers as their first characters receive even fewer spam emails. The corpus used in the study was 8 weeks' worth of email from the UK ISP Demon Internet, just over half a billion messages, of which 56% was deemed to be spam."

9 of 115 comments

  1. You know what this means by Shajenko42 · · Score: 5, Insightful

    Spammers will now alter their programs to start with "z" and numbers, so they can get the people who aren't as desensitized by spam.

    1. Re:You know what this means by cypherwise · · Score: 4, Insightful

      I'm (incorrectly?) assuming this comment was facetious. 100/35,214 (that's 99.71%) is a pretty damn good ratio when it comes to this type of thing.

    2. Re:You know what this means by lysergic.acid · · Score: 2, Insightful

      but, like the article says, there are fewer people whose e-mail addresses start with z or numbers. so they'd be getting fewer hits by targeting those starting characters. there are already more spam messages being targeted at "zebras" per legitimate target than there are spam messages being targeted at aardvark addresses.

      so the smart thing for spammers to do is to stop wasting time with zebra addresses, since they'd have a higher chance of actually reaching a real mailbox by targeting more popular character ranges.

  2. Re:What? by Oidhche · · Score: 5, Insightful

    Indeed. The conclusion that I'd draw from the presented data is that there are more e-mail addresses beginning with 'a' than with 'z' (and that very few addresses begin with a number). Even the percentage of spam is nearly meaningless. To find out anything about which addresses receive more spam, you should look at the average amount of spam per e-mail address in a given group, not the total number of messages.
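A minimal sketch of the per-address average described above, assuming the mail log is available as (recipient, is_spam) pairs. The function name, the sample addresses, and the counts are all made up for illustration and are not taken from the paper. Note that addresses which receive no mail at all never appear in such a log, so they are not counted here.

```python
from collections import defaultdict


def spam_per_address(messages):
    """Return {first_character: average spam messages per distinct address}.

    `messages` is an iterable of (recipient, is_spam) pairs. This input
    format is an assumption for the sketch, not the study's data format.
    """
    spam_counts = defaultdict(int)   # spam messages per first character
    addresses = defaultdict(set)     # distinct recipients per first character
    for recipient, is_spam in messages:
        key = recipient[0].lower()
        addresses[key].add(recipient.lower())
        if is_spam:
            spam_counts[key] += 1
    return {key: spam_counts[key] / len(addrs) for key, addrs in addresses.items()}


# Hypothetical log: the 'a' group gets more spam in total (3 vs 2),
# but the single 'z' address gets more spam per address.
sample = [
    ("alice@example.com", True),
    ("alice@example.com", True),
    ("adam@example.com", True),
    ("adam@example.com", False),
    ("zed@example.com", True),
    ("zed@example.com", True),
]
rates = spam_per_address(sample)
print(rates)
```

In this made-up sample the totals and the per-address averages point in opposite directions, which is the distinction the comment is drawing.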

  3. Re:What? by Oidhche · · Score: 3, Insightful

    No. Look at the data. It shows the total amount of messages received by Alberts and Zeds. It's painfully obvious that Alberts receive far more of both spam and genuine messages than Zeds. Not because the average Albert gets more messages than the average Zed, but because there are more Alberts than Zeds.

  4. My domains start with a by flyingfsck · · Score: 2, Insightful

    and yes they get tons of spam, about 99.999% of connection attempts are spam, but a couple of RBLs and SpamAssassin take care of it. If I turn the protection off, then I get about 10,000 spams per hour, which seems to be a limitation of the server. If the server were faster, then it would probably get more spam. With the filters on, I get about 1 message per hour, which is more acceptable. I don't like the idea of RBLs, but I see no other way to handle the problem - if you are a spammer, then I don't want to talk to you - ever. Stupid idiots. It is also interesting that all brute force attacks that I have observed start at 'a'. So the best passwords will start with 'z'.

    --
    Excuse me, but please get off my Pennisetum Clandestinum, eh!
  5. Re:Filters by xenocide2 · · Score: 2, Insightful

    Indeed, the PDF paper says this is measuring the rate of filtering AFTER using Spamhaus black holes, and the measured rate comes from their custom "Cloudmark" spam detection tool. Importantly, if their tool sucks enough that people opt out of it entirely, all email is considered "not-spam". But as long as these effects are not influenced by the first letter, that's okay.

    Unfortunately, the paper tries very hard to present a very silly notion about 'a' versus 'z'. The important concept here isn't order, it's letter frequency, and they should have sorted the letters by that to plot their regression.

    Effectively spam is a combination of email harvesting and email guessing. Harvesting email addresses contributes to spam, but probably builds lists closely resembling the distribution of valid inboxes. Guessing attacks generally do not reflect the distribution of letters used in the English language (the language of the ISP's host nation, and presumably most of the users and domains hosted). The assumption isn't that these attacks stop before they make it to Z, but that they overweight z*@example.com. So more spam is sent to those addresses per valid inbox than more common letters. And the paper goes on to say a lot of those land in nonexistent mailboxes relative to more populated leading inbox letters.

    They go on to try to quantify the difference but seem to fail for various reasons, including the aforementioned Spamhaus.
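A toy illustration of the guessing argument above, under a deliberately crude assumption: a guessing attack sends the same number of attempts at every first letter, regardless of how many real inboxes start with that letter. The function name and all numbers are hypothetical; nothing here comes from the paper.

```python
def attempts_per_valid_inbox(valid_inboxes, attempts_per_letter):
    """Return {letter: guessed messages aimed at that letter per valid inbox}.

    `valid_inboxes` maps a first letter to the number of real mailboxes.
    `attempts_per_letter` is the (assumed uniform) number of guesses a
    spammer aims at each first letter.
    """
    return {
        letter: attempts_per_letter / count
        for letter, count in valid_inboxes.items()
        if count > 0
    }


# Hypothetical population: many 'a' inboxes, few 'z' inboxes.
inboxes = {"a": 1000, "z": 50}
pressure = attempts_per_valid_inbox(inboxes, attempts_per_letter=10_000)
print(pressure)
```

With these invented numbers the rare letter sees twenty times as many guessed messages per real inbox, even though each letter is targeted equally in total.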

    --
    I Browse at +4 Flamebait

    Open Source Sysadmin

  6. Signal to Noise ratio by aembleton · · Score: 2, Insightful

    From looking at that graph, it would be more interesting to see the signal to noise ratio for each of the letters and numbers. Those names beginning with an 'A' do indeed receive more spam, but also far more non-spam. In fact it looks to be more like 50:51 (non-spam : spam), whereas at first glance those email addresses beginning with a 'P' receive 40:60.
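The ratio suggested above can be computed per letter as a spam share, sketched here. The counts are placeholders that simply reuse the proportions the commenter eyeballed from the graph (50:51 for 'a', 40:60 for 'p'); they are not real totals from the study, and the function name is made up.

```python
def spam_share(counts):
    """Return {letter: spam / (non_spam + spam)} for letters with any mail.

    `counts` maps a first character to a (non_spam, spam) pair.
    """
    return {
        letter: spam / (non_spam + spam)
        for letter, (non_spam, spam) in counts.items()
        if non_spam + spam > 0
    }


# Placeholder proportions read off the graph by eye, not measured data.
eyeballed = {"a": (50, 51), "p": (40, 60)}
shares = spam_share(eyeballed)
for letter, share in sorted(shares.items()):
    print(f"{letter}: {share:.1%} spam")
```

Sorting letters by this share rather than by raw volume would show whether 'a' is really worse off than other letters or merely busier.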

  7. Re:Unexpected by nabsltd · · Score: 2, Insightful

    I think most of the spam targeted at a message ID comes from crawling USENET.

    On my server, I see lots of e-mail with a "rcpt to:" that matches the regex "(mpg\.)?[a-f0-9]+\@news\.domain\.com". This is the format that inn uses to create message IDs.
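A small sketch of applying that pattern to recipient addresses in Python. The regex is copied from the comment; news.domain.com is the commenter's placeholder domain, and the function name and sample recipients are invented for illustration.

```python
import re

# Pattern quoted in the comment above; "news.domain.com" is a placeholder.
MESSAGE_ID_RCPT = re.compile(r"(mpg\.)?[a-f0-9]+\@news\.domain\.com")


def looks_like_message_id(rcpt):
    """True if the whole recipient address has the message-ID shape."""
    return MESSAGE_ID_RCPT.fullmatch(rcpt) is not None


# Invented recipients: the first two have the hex-style local part,
# the third is an ordinary mailbox name.
for rcpt in ("mpg.1a2b3c@news.domain.com",
             "deadbeef@news.domain.com",
             "alice@news.domain.com"):
    print(rcpt, looks_like_message_id(rcpt))
```

Using fullmatch rather than search is a choice made here so that an address merely containing a hex run is not flagged. A short real local part made only of the characters a-f and 0-9 would still match, so this is a heuristic rather than a reliable test.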